
PMSSC: Parallelizable multi-subset based self-expressive model for subspace clustering

Katsuya Hotta, Iwate University, Iwate 020-8551, Japan. Email: hotta@iwate-u.ac.jp
Takuya Akashi, Iwate University, Iwate 020-8551, Japan. Email: akashi@iwate-u.ac.jp
Shogo Tokai, University of Fukui, Fukui 910-8507, Japan. Email: tokai@u-fukui.ac.jp
Chao Zhang, University of Fukui, Fukui 910-8507, Japan. Email: zhang@u-fukui.ac.jp
Abstract

Subspace clustering methods which embrace a self-expressive model that represents each data point as a linear combination of other data points in the dataset provide powerful unsupervised learning techniques. However, when dealing with large datasets, representation of each data point by referring to all data points via a dictionary suffers from high computational complexity. To alleviate this issue, we introduce a parallelizable multi-subset based self-expressive model (PMS) which represents each data point by combining multiple subsets, with each consisting of only a small proportion of the samples. The adoption of PMS in subspace clustering (PMSSC) leads to computational advantages because the optimization problems decomposed over each subset are small, and can be solved efficiently in parallel. Furthermore, PMSSC is able to combine multiple self-expressive coefficient vectors obtained from subsets, which contributes to an improvement in self-expressiveness. Extensive experiments on synthetic and real-world datasets show the efficiency and effectiveness of our approach in comparison to other methods.

I Introduction

In many real-world cases, approximating high-dimensional data as a union of low-dimensional subspaces is a beneficial technique for reducing computational complexity and the effects of noise. The task of subspace clustering [1, 2], which is the segmentation of a set of data points into those lying on certain subspaces, has been studied in many practical applications such as face clustering [3], image segmentation [4], motion segmentation [5], scene segmentation [6], and homography detection [7]. Recently, self-expressive models [8, 9] have been explored, which embrace the self-expressive property of subspaces to compute an affinity matrix. The self-expressive property states that each data point from a union of subspaces can be represented as a linear combination of other points. Specifically, given a data matrix $X\in\mathbb{R}^{D\times N}$ in which each data point is a column, the self-expressive model of data point $\boldsymbol{x}_{i}\in\mathbb{R}^{D}$ can be described as

$$\boldsymbol{x}_{i}=X\boldsymbol{c}_{i},\quad c_{ii}=0, \tag{1}$$

where $\boldsymbol{c}_{i}\in\mathbb{R}^{N}$ is a coefficient vector, and the constraint $c_{ii}=0$ avoids the trivial solution of representing a point as a linear combination of itself. The feasible solutions of Eq. (1) are generally not unique because the number of data points lying on a subspace is larger than its dimensionality. However, at least one $\boldsymbol{c}_{i}$ exists where $c_{ij}$ is nonzero only if data points $\boldsymbol{x}_{i},\boldsymbol{x}_{j}$ are in the same subspace, and such a solution is called subspace-preserving [10]. Previous works have tried to compute subspace-preserving representations by imposing a regularization term on the coefficients $\boldsymbol{c}_{i}$. In particular, one algorithm for obtaining a sparse solution to Eq. (1), sparse subspace clustering (SSC) [8, 9], can recover subspaces under mild conditions by regularizing the coefficient matrix $C:=[\boldsymbol{c}_{1},\ldots,\boldsymbol{c}_{N}]\in\mathbb{R}^{N\times N}$ whose columns are the coefficient vectors of the data points $\boldsymbol{x}_{i}$. SSC not only achieves high clustering accuracy for datasets with outliers and missing entries, but also has the useful properties of giving theoretical guarantees and providing modeling flexibility, which have influenced many other approaches such as [11, 12]. However, SSC suffers from high computational and memory costs when dealing with a large-scale dataset because of the need to determine the $\mathcal{O}(N^{2})$ coefficients of $C$. In light of these problems, there has been much interest in recent years in developing scalable subspace clustering algorithms that can be applied to large-scale datasets, taking advantage of the ease of analyzing computational complexity due to the simplicity of the model.

Figure 1: Overview of our self-expressive model. Given a data matrix $X$ in which each data point $\boldsymbol{x}_{i}$ is a column, our approach represents a self-expressive model over the entire data by combining multiple subsets generated by sampling (Algorithm 1). Specifically, our method computes the self-expressive data point $\boldsymbol{y}_{i}^{(t)}$ by solving for the self-expressive coefficient vector $\boldsymbol{c}_{i}^{\ast(t)}$ for each point $\boldsymbol{x}_{i}$ in $T$ subsets (Algorithm 2). Then, the self-expressive properties of the entire data are obtained by solving for $\boldsymbol{b}^{\ast}$ using $Y$, whose columns are the data points $\boldsymbol{y}_{i}^{(t)}$ computed from each subset (Algorithm 3).

Several works have attempted to address the problem of computational cost for large-scale datasets using a sampling strategy, motivated by the sparsity assumption that each data point can be represented as a linear combination of a few basis vectors. A strategy that applies the self-expressive property to a few sampled data points and then classifies the remaining data points was proposed in [13]. While this strategy can produce clustering results more efficiently for a large-scale dataset than directly applying SSC to all data, it leads to poor clustering performance when the sampled data are not representative of the original dataset. Although a learning-based sampling method has also been proposed for generating a coefficient matrix that is representative of the original dataset [14], the accuracy and computational complexity still depend largely on the size of the subset, as these methods attempt to solve a self-expressive model over a single subset. Also, no effort has been made in these methods to explicitly improve the self-expressiveness of the self-expressive coefficient vectors.

To further improve self-expressiveness without increasing the computational burden, in this paper we propose a self-expressive model adopting multiple subsets, which is computable in parallel. Specifically, our model obtains a self-expressive coefficient matrix by combining multiple subsets, each consisting of only a small proportion of the samples. This strategy not only enjoys the low computational cost of other single-subset-based methods, but also yields representations that are more effectively subspace-preserving, because each original data point is represented as a linear combination of multiple self-expressive coefficient vectors.

Our contributions are highlighted as follows:

  • a novel clustering approach that exploits a self-expressive model based on multiple subsets,

  • a concisely formulated model,

  • independent, parallel computation of each subset without additional computational overhead,

  • extensive experiments on both synthetic data and real-world datasets showing that our proposed method can achieve better results without increasing processing time.

II Related Work

II-A Background

In the past few years, there has been a surge of algorithms that segment a set of data points by performing spectral clustering. Classical methods, such as $k$-subspaces [15] and median $k$-flats [16], assume that the dimensionalities of the underlying subspaces are given in advance. This latent knowledge is generally hard to access in many real-world applications. In addition, these methods are usually non-convex and thus sensitive to initialization [17, 18]. Aiming to relax the limitations of the $k$-subspaces algorithm, the majority of modern subspace clustering methods have turned to spectral clustering [19, 20], which segments data using an affinity matrix that captures whether a certain pair of data points lies on the same subspace. While many early methods [21, 22, 12, 23] achieve better segmentation than classical clustering algorithms even without this latent knowledge, they produce erroneous segmentation results for data points near the intersection of two subspaces due to the dense sampling of points lying on the subspaces [24]. We now introduce previous subspace clustering approaches based on spectral clustering, then describe various techniques used by scalable subspace clustering methods for dealing with large-scale datasets, which are closer to our proposed method.

II-B Subspace Clustering Using Spectral Clustering

Most subspace clustering approaches based on spectral clustering consist of two phases: (i) computing an affinity matrix based on the nonzero coefficients that appear in the representation of each data point as a combination of other points, and (ii) segmenting data points from the computed affinity matrix by applying spectral clustering. The key to the success of segmentation is the phase of computing the affinity matrix, and many methods have been proposed for it. For example, local subspace affinity [24] and spectral curvature clustering (SCC) [25] find neighborhoods based on the observation that a point and its $k$-nearest neighbors often lie on the same subspace. However, the computational complexity of finding multi-way similarity in these methods grows exponentially with the number of subspace dimensions, motivating the use of a sampling strategy to lower the computational complexity [9]. Recently, the self-expressive model, which employs the self-expressive property in Eq. (1), has become the most popular. In particular, SSC takes advantage of sparsity [26] by adopting $\ell_{1}$ norm regularization of the coefficient vector to achieve high clustering performance. This idea has motivated many methods, using the $\ell_{2}$ norm in least squares regression [27], the nuclear norm in low rank representation (LRR) [28], the $\ell_{1}$ plus $\ell_{2}$ norm in elastic net subspace clustering (EnSC) [29], and the Frobenius norm in efficient dense subspace clustering [30]. In practice, however, solving the $\ell_{p}$ norm minimization problem for large-scale data may be prohibitive. Also, the memory required becomes larger as the amount of data increases.

II-C Scalable Subspace Clustering

When constructing the affinity matrix, several methods based on spectral clustering suffer from high computational complexity. To reduce the computational complexity of this phase, a sparse self-expressive model adopting a greedy algorithm was proposed in [31, 10]. However, these approaches lead to unsatisfactory clustering results if the nonzero elements do not contain sufficient connections within each optimized coefficient vector [32]. Other popular approaches to alleviating the computational and memory loads were inspired by a sampling strategy. Scalable sparse subspace clustering (SSSC) [13] is computationally efficient because it uses a subset generated by random sampling. However, because this method relies on a single randomly sampled subset, data points from the same subspace will not be represented by the self-expressive model if they are not appropriately sampled. To address this problem, exemplar-based subspace clustering [33, 12] is an efficient sampling technique that iteratively adds the least well-represented point to the subset. Selective sampling-based scalable sparse subspace clustering (S5C) [14], which generates a subset by selective sampling, provides approximation guarantees of the subspace-preserving property. In [34], subspace-preserving representations are found by solving a consensus problem over multiple subsets to improve the connectivity of the affinity matrix. In [35], a divide-and-conquer framework using multiple subsets obtained by partitioning the entire dataset is proposed. While this approach can deal with large-scale data, the final segmentation results depend on the self-expressive properties of the optimized coefficient vectors of each subset. Our method differs significantly from [35] and [34] in that our proposed self-expressive model is designed to minimize the difference from the original data points by combining the self-expressive properties of multiple subsets. Lastly, in this paper, we limit our discussion to non-deep-learning approaches, which are more mathematically straightforward to explain and rely less on parameter tuning.

III Parallelizable Multi-Subset Based Sparse Subspace Clustering

III-A Problem and Approach

As a problem definition, our final goal is to find the self-expressive coefficient vector $\boldsymbol{c}_{i}$ that satisfies the subspace-preserving representation in Eq. (1). That is, such a coefficient vector can be obtained by minimizing the self-expressive residual in the following optimization problem,

$$\min_{\boldsymbol{c}_{i}}\|\boldsymbol{x}_{i}-X\boldsymbol{c}_{i}\|_{2}^{2}\quad\mathrm{such\ that}\quad\|\boldsymbol{c}_{i}\|_{0}\leq s,\ c_{ii}=0, \tag{2}$$

where $\|\cdot\|_{0}$ is the $\ell_{0}$ pseudo-norm that returns the number of nonzero entries in the vector. This optimization problem has been shown [36, 37] to recover provably subspace-preserving solutions using the orthogonal matching pursuit (OMP) algorithm [38]. Here, $s$ is a tuning parameter for the OMP algorithm, which controls the sparsity of the solution by selecting up to $s$ entries in the coefficient vector $\boldsymbol{c}_{i}$. Although the OMP algorithm is computationally efficient and is guaranteed to give subspace-preserving solutions under mild conditions, it is unable to produce a subspace-preserving solution with a number of nonzero entries exceeding the dimensionality of the subspace [10]. This leads to poor clustering performance due to overly sparse affinities between data points, especially when the density of data points lying on the subspace is low.
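As a concrete illustration (not the authors' implementation), the sketch below solves the problem in Eq. (2) for a single point with scikit-learn's OMP solver; the toy data matrix, the choice of $s$, and the helper name `ssc_omp_column` are placeholders introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def ssc_omp_column(X, i, s=6):
    """Solve Eq. (2) for point x_i: at most s nonzero coefficients, with c_ii = 0."""
    x_i = X[:, i]
    dictionary = np.delete(X, i, axis=1)            # drop x_i itself to enforce c_ii = 0
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False)
    omp.fit(dictionary, x_i)
    return np.insert(omp.coef_, i, 0.0)             # re-insert the zero at index i

X = np.random.randn(9, 500)                         # toy data: 500 points in R^9
c_0 = ssc_omp_column(X, 0, s=6)
print(np.count_nonzero(c_0))                        # at most 6 nonzero entries
```

Stacking such vectors column by column gives the coefficient matrix $C$ used to build the affinity matrix; the cost of this step grows with $N$, which is what motivates the multi-subset model introduced next.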

We propose a novel subspace clustering algorithm with a parallelizable multi-subset based self-expressive model, as illustrated in Fig. 1. Sec. III-B introduces our proposed self-expressive model that extends the model in Eq. (2) to multiple subsets via a sampling technique. Sec. III-C then explains the solution of our self-expressive model by the OMP algorithm. Finally, we summarize the proposed subspace clustering algorithm in Algorithm 4.

III-B Parallelizable Multi-Subset based Self-Expressive Model

To deal with large-scale data, we first generate $T$ index subsets from the whole dataset by weighted random sampling [39] as follows:

$$\mathcal{I}^{(t)}\subset[N]\quad\mathrm{s.t.}\quad n(\mathcal{I}^{(t)})=\lceil\delta N\rceil,\quad t=1,\ldots,T, \tag{3}$$

where $\mathcal{I}^{(t)}$ is the index set of the $t$-th subset, sampled with probability proportional to the elements of the weight vector $\boldsymbol{w}^{(t)}\in\mathbb{R}^{N}$, $[N]$ denotes the $N$ indices $\{1,\ldots,N\}$, $0<\delta\leq 1$ is the sampling rate, and $n(\cdot)$ is the cardinality function, which measures the number of elements. The weight of each element selected in the $t$-th subset is updated as $w_{i}^{(t+1)}=0.1w_{i}^{(t)}$. Then, in each sampled $t$-th subset, the optimization problem in Eq. (2) can be expressed as follows:

$$\begin{split}\boldsymbol{c}_{i}^{\ast(t)}=&\operatorname*{arg\,min}_{\boldsymbol{c}_{i}^{(t)}}\|\boldsymbol{x}_{i}^{(t)}-X^{(t)}\boldsymbol{c}_{i}^{(t)}\|_{2}^{2}\\ &\mathrm{s.t.}\quad\|\boldsymbol{c}_{i}^{(t)}\|_{0}\leq s,\quad c_{ii}^{(t)}=0,\end{split} \tag{4}$$

where $X^{(t)}\in\mathbb{R}^{D\times N}$ is the data matrix of the randomly sampled $t$-th subset, and $\boldsymbol{c}_{i}^{(t)}\in\mathbb{R}^{N}$ is the self-expressive coefficient vector for each data point $\boldsymbol{x}_{i}^{(t)}$ in the $t$-th subset. Note that to keep the dimensionality of $\boldsymbol{c}_{i}^{(t)}$ equal to $N$, the columns of each data matrix $X^{(t)}$ corresponding to the non-sampled indices are replaced by zero vectors: $\boldsymbol{x}_{i}^{(t)}=\boldsymbol{0},\ \forall i\notin\mathcal{I}^{(t)}$. From each optimized coefficient vector $\boldsymbol{c}_{i}^{\ast(t)}$, each data point $\boldsymbol{x}_{i}^{(t)}$ can be represented by a self-expressive model, given by:

$$\boldsymbol{y}_{i}^{(t)}=X^{(t)}\boldsymbol{c}_{i}^{\ast(t)}\quad\mathrm{s.t.}\quad c_{ii}^{\ast(t)}=0, \tag{5}$$
Algorithm 1 Optimization for the parallelizable multi-subset based self-expressive model (PMS)
Input: Data matrix $X\in\mathbb{R}^{D\times N}$, number of subsets $T$, sampling rate $\delta$, maximum number of OMP iterations $s$, error tolerance $\epsilon$
1: Generate $T$ index subsets $\{\mathcal{I}^{(t)}\}_{t=1}^{T}$ via Eq. (3);
2: Generate $T$ subset data matrices $\{X^{(t)}\}_{t=1}^{T}$ based on $\{\mathcal{I}^{(t)}\}_{t=1}^{T}$;
3: for $i=1,\ldots,N$ do
4:     for $t=1,\ldots,T$ do
5:         Given $X^{(t)}$ and $\boldsymbol{x}_{i}^{(t)}$, solve for $\boldsymbol{c}_{i}^{\ast(t)}$ via Algorithm 2;
6:         Given $\boldsymbol{c}_{i}^{\ast(t)}$, compute $\boldsymbol{y}_{i}^{(t)}$ via Eq. (5);
7:     end for
8:     Set $Y=[\boldsymbol{y}_{i}^{(1)},\ldots,\boldsymbol{y}_{i}^{(T)}]\in\mathbb{R}^{D\times T}$;
9:     Given $Y$ and $\boldsymbol{x}_{i}$, solve for $\boldsymbol{b}^{\ast(i)}$ via Algorithm 3;
10:    Given $\boldsymbol{c}_{i}^{\ast(t)}$ and $\boldsymbol{b}^{\ast(i)}$, compute $\boldsymbol{c}_{i}^{\ast}$ via Eq. (10);
11: end for
12: Set $C^{\ast}=[\boldsymbol{c}_{1}^{\ast},\ldots,\boldsymbol{c}_{N}^{\ast}]\in\mathbb{R}^{N\times N}$;
Output: Coefficient matrix $C^{\ast}$

where $\boldsymbol{y}_{i}^{(t)}$ is the data point computed by the self-expressive model from the $t$-th subset. In practice, however, the data point $\boldsymbol{y}_{i}^{(t)}$ in Eq. (5) generally has an error term $\boldsymbol{z}_{i}$, i.e., $\boldsymbol{y}_{i}^{(t)}=\boldsymbol{x}_{i}+\boldsymbol{z}_{i}$, because of the limitations of using $X^{(t)}$ as a dictionary for reconstruction. To minimize $\boldsymbol{z}_{i}$, we first represent $\boldsymbol{x}_{i}$ as a linear combination of the $\boldsymbol{y}_{i}^{(t)}$, as follows:

$$\begin{split}\boldsymbol{x}_{i}&\approx\sum_{t=1}^{T}b_{t}^{(i)}(X^{(t)}\boldsymbol{c}_{i}^{\ast(t)})\\ &\approx\sum_{t=1}^{T}b_{t}^{(i)}\boldsymbol{y}_{i}^{(t)},\end{split} \tag{6}$$

where $\boldsymbol{b}^{(i)}\in\mathbb{R}^{T}$ is the weight coefficient vector used to represent $\boldsymbol{x}_{i}$, and $b_{t}^{(i)}\in\mathbb{R}$ is the $t$-th entry of $\boldsymbol{b}^{(i)}$. The coefficient vector $\boldsymbol{b}^{(i)}$ of the linear combination in Eq. (6) can be obtained by solving the following optimization problem,

$$\boldsymbol{b}^{\ast(i)}=\operatorname*{arg\,min}_{\boldsymbol{b}^{(i)}}\Big\|\boldsymbol{x}_{i}-\sum_{t=1}^{T}b_{t}^{(i)}\boldsymbol{y}_{i}^{(t)}\Big\|_{2}^{2}. \tag{7}$$

For simplicity, we introduce a data matrix $Y=[\boldsymbol{y}_{i}^{(1)},\ldots,\boldsymbol{y}_{i}^{(T)}]\in\mathbb{R}^{D\times T}$ whose columns are the data points $\boldsymbol{y}_{i}^{(t)}$ from Eq. (5), and rewrite Eq. (7) as

$$\boldsymbol{b}^{\ast(i)}=\operatorname*{arg\,min}_{\boldsymbol{b}^{(i)}}\|\boldsymbol{x}_{i}-Y\boldsymbol{b}^{(i)}\|_{2}^{2}. \tag{8}$$

This has the same form as the subspace clustering problem in Eq. (2), and its solution can be further described as:

$$\boldsymbol{x}_{i}\approx Y\boldsymbol{b}^{\ast(i)}. \tag{9}$$

Unlike in Eq. (1), here $Y$ is the data matrix computed from each subset to represent $\boldsymbol{x}_{i}$. Thus, no constraint is required to avoid the trivial solution of representing a point as a linear combination of itself. To express the representation explicitly in terms of $X$ as in Eq. (2), the self-expressive coefficient vector $\boldsymbol{c}_{i}^{\ast}$ corresponding to $X$ is obtained by

$$\boldsymbol{c}_{i}^{\ast}=\sum_{t=1}^{T}b_{t}^{\ast(i)}\boldsymbol{c}_{i}^{\ast(t)}. \tag{10}$$

It is worth noting that each $\boldsymbol{c}_{i}^{\ast(t)}$ can be determined independently from its subset, so the subsets can be processed in parallel for speed.
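As a minimal sketch of the combination step in Eqs. (6)-(10), suppose the zero-padded per-subset vectors $\boldsymbol{c}_{i}^{\ast(t)}$ are already available; for brevity, the weights $\boldsymbol{b}^{(i)}$ are found here by unconstrained least squares rather than the OMP procedure of Algorithm 3, and the function name is a placeholder.

```python
import numpy as np

def combine_subsets(X, i, c_subsets):
    """Combine per-subset self-expressive vectors for point x_i (Eqs. (6)-(10)).

    c_subsets: list of T zero-padded coefficient vectors c_i^(t), each of length N.
    Returns the combined coefficient vector c_i^* of length N.
    """
    x_i = X[:, i]
    # Eq. (5): column t of Y is y_i^(t) = X^(t) c_i^(t); since c_i^(t) is supported
    # only on the sampled indices, X @ c_i^(t) gives the same result.
    Y = np.column_stack([X @ c_t for c_t in c_subsets])        # D x T
    # Eq. (8): weights minimizing ||x_i - Y b||_2 (least-squares stand-in for Algorithm 3)
    b, *_ = np.linalg.lstsq(Y, x_i, rcond=None)
    # Eq. (10): weighted combination of the per-subset coefficient vectors
    return sum(b_t * c_t for b_t, c_t in zip(b, c_subsets))
```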

Algorithm 2 OMP algorithm for finding $\boldsymbol{c}_{i}^{\ast(t)}$
Input: Data matrix $X^{(t)}\in\mathbb{R}^{D\times N}$, reference data point $\boldsymbol{x}_{i}^{(t)}$, maximum number of iterations $s$, error tolerance $\epsilon$
1: Initialize $k=0$, residual $\boldsymbol{r}=\boldsymbol{x}_{i}^{(t)}$, support set $\mathcal{S}=\emptyset$;
2: while $k<s$ and $\|\boldsymbol{r}\|_{2}>\epsilon$ do
3:     Find $j^{\ast}$ via Eq. (11);
4:     $\mathcal{S}\leftarrow\mathcal{S}\cup\{j^{\ast}\}$;
5:     Estimate $\boldsymbol{c}_{i}^{\ast(t)}$ via Eq. (12);
6:     Update the residual $\boldsymbol{r}$ via Eq. (13);
7:     $k=k+1$;
8: end while
Output: Self-expressive coefficient vector $\boldsymbol{c}_{i}^{\ast(t)}$

Algorithm 3 OMP algorithm for finding $\boldsymbol{b}^{\ast(i)}$
Input: Data matrix $Y\in\mathbb{R}^{D\times T}$, reference data point $\boldsymbol{x}_{i}$, error tolerance $\epsilon$
1: Initialize $k=0$, residual $\boldsymbol{r}=\boldsymbol{x}_{i}$, support set $\mathcal{S}=\emptyset$;
2: while $k<T$ and $\|\boldsymbol{r}\|_{2}>\epsilon$ do
3:     Find $j^{\ast}$ via Eq. (14);
4:     Update $\mathcal{S}\leftarrow\mathcal{S}\cup\{j^{\ast}\}$;
5:     Estimate $\boldsymbol{b}^{\ast(i)}$ via Eq. (15);
6:     Update the residual $\boldsymbol{r}$ via Eq. (16);
7:     $k=k+1$;
8: end while
Output: Weight coefficient vector $\boldsymbol{b}^{\ast(i)}$

III-C Optimization with Orthogonal Matching Pursuit

In this section, we show that the parameters of the proposed PMS model can be determined by dividing the optimization problem into two small optimization problems, as summarized in Algorithm 1. Overall, Eq. (10) can be determined by solving the minimization problems in Eqs. (4) and (8). We introduce both of the minimization procedures below. Initially, $T$ subset data matrices $\{X^{(t)}\}_{t=1}^{T}$ are generated based on the sampled index sets $\mathcal{I}^{(t)}$ in Eq. (3).
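The subset generation of Eq. (3) could be implemented as in the numpy sketch below; the $0.1$ down-weighting follows the text, while the initially uniform weights and the function name are assumptions made for illustration.

```python
import numpy as np

def generate_subsets(N, T, delta, rng=None):
    """Draw T index subsets of size ceil(delta * N) by weighted random sampling (Eq. (3))."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.ones(N)                                  # initial weights (assumed uniform)
    m = int(np.ceil(delta * N))
    subsets = []
    for _ in range(T):
        p = w / w.sum()
        idx = rng.choice(N, size=m, replace=False, p=p)   # sample without replacement
        subsets.append(np.sort(idx))
        w[idx] *= 0.1                               # down-weight the indices just sampled
    return subsets

subsets = generate_subsets(N=10000, T=8, delta=0.3)
```

Down-weighting previously selected indices encourages later subsets to cover different parts of the dataset, which relates to the connectivity gains reported in Sec. IV-D3.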

To efficiently solve for $\boldsymbol{c}_{i}^{\ast(t)}$ in Eq. (4), we introduce Algorithm 2, which is based on the OMP algorithm. The support set and the residual are initialized to $\mathcal{S}=\emptyset$ and $\boldsymbol{r}=\boldsymbol{x}_{i}^{(t)}$, respectively. $\mathcal{S}$ denotes the index set, which is updated on each iteration by adding one index $j^{\ast}$, computed as

$$j^{\ast}=\operatorname*{arg\,max}_{j\in\mathcal{I}^{(t)}\setminus\mathcal{S}}\boldsymbol{x}_{j}^{(t)\top}\boldsymbol{r}. \tag{11}$$

Then, using the updated $\mathcal{S}$, the self-expressive coefficient vector $\boldsymbol{c}_{i}^{\ast(t)}$ is found by solving the following problem:

$$\boldsymbol{c}_{i}^{\ast(t)}=\begin{cases}\operatorname*{arg\,min}_{\boldsymbol{c}_{i}^{(t)}}\|\boldsymbol{x}_{i}^{(t)}-X^{(t)}\boldsymbol{c}_{i}^{(t)}\|_{2}^{2},&\mathrm{if}\ i\in\mathcal{I}^{(t)},\\ \boldsymbol{0},&\mathrm{otherwise},\end{cases}\quad\mathrm{such\ that}\quad\mathrm{supp}(\boldsymbol{c}_{i}^{(t)})\subseteq\mathcal{S}, \tag{12}$$

where $\mathrm{supp}(\boldsymbol{c}_{i}^{(t)})$ denotes the support of $\boldsymbol{c}_{i}^{(t)}$, i.e., the set of indices of its nonzero entries. The residual $\boldsymbol{r}$ is updated using:

$$\boldsymbol{r}\leftarrow\boldsymbol{x}_{i}^{(t)}-X^{(t)}\boldsymbol{c}_{i}^{\ast(t)}. \tag{13}$$

This process is repeated until the number of iterations $k$ reaches its limit $s$ or $\|\boldsymbol{r}\|_{2}$ falls below the error tolerance $\epsilon$.
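For concreteness, a compact numpy version of Algorithm 2 (Eqs. (11)-(13)) could look as follows; it assumes the columns of `X_t` outside the sampled index set have been zeroed as described in Sec. III-B, and it uses the absolute correlation in the selection step, as is standard for OMP.

```python
import numpy as np

def omp_subset(X_t, idx, i, s, eps=1e-6):
    """OMP for the i-th point in the t-th subset (Algorithm 2).

    X_t : D x N matrix whose columns outside the index set `idx` are zero.
    Returns the zero-padded coefficient vector c_i^*(t) of length N.
    """
    D, N = X_t.shape
    x_i = X_t[:, i]
    r = x_i.copy()                                  # residual, initialized to x_i
    candidates = [j for j in idx if j != i]         # enforce c_ii = 0
    support, c = [], np.zeros(N)
    for _ in range(s):                              # at most s iterations
        if np.linalg.norm(r) <= eps or not candidates:
            break
        # Eq. (11): column most correlated with the current residual
        j_star = max(candidates, key=lambda j: abs(X_t[:, j] @ r))
        support.append(j_star)
        candidates.remove(j_star)
        # Eq. (12): least squares restricted to the current support
        coef, *_ = np.linalg.lstsq(X_t[:, support], x_i, rcond=None)
        # Eq. (13): residual update
        r = x_i - X_t[:, support] @ coef
        c[:] = 0.0
        c[support] = coef
    return c
```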

Algorithm 4 Parallelizable multi-subset based sparse subspace clustering (PMSSC)
Input: Data matrix $X=[\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N}]\in\mathbb{R}^{D\times N}$, parameters $T$, $\delta$, $s$, $\epsilon$
1: Compute coefficient matrix $C^{\ast}$ via Algorithm 1;
2: Define affinity matrix $A=|C^{\ast}|+|C^{\ast\top}|$;
3: Apply spectral clustering to $A$;
Output: Clustering results of $X$

To find $\boldsymbol{b}^{\ast(i)}$, Eq. (8) can also be solved via the OMP algorithm, as shown in Algorithm 3. The input data matrix $Y\in\mathbb{R}^{D\times T}$ is generated by Eq. (5) from the $\boldsymbol{c}_{i}^{\ast(t)}$. Note that the size of $Y$ depends on the number of subsets $T$, which is much smaller than the number of data points. The maximum number of iterations is $T$, and $\mathcal{S}$ is updated by finding the index $j^{\ast}$ satisfying

$$j^{\ast}=\operatorname*{arg\,max}_{j\in[T]\setminus\mathcal{S}}\boldsymbol{y}_{i}^{(j)\top}\boldsymbol{r}. \tag{14}$$

In addition, the weight coefficient vector $\boldsymbol{b}^{\ast(i)}$ and the update of $\boldsymbol{r}$ are determined by solving

$$\boldsymbol{b}^{\ast(i)}=\operatorname*{arg\,min}_{\boldsymbol{b}^{(i)}}\|\boldsymbol{x}_{i}-Y\boldsymbol{b}^{(i)}\|_{2}^{2},\quad\mathrm{s.t.}\quad\mathrm{supp}(\boldsymbol{b}^{(i)})\subseteq\mathcal{S}, \tag{15}$$
$$\boldsymbol{r}\leftarrow\boldsymbol{x}_{i}-Y\boldsymbol{b}^{\ast(i)}. \tag{16}$$

For clarity, we summarize the entire framework of our proposed subspace clustering approach in Algorithm 4, and call it the parallelizable multi-subset based sparse subspace clustering (PMSSC) method. Given $X$ and the parameters $T$, $\delta$, $s$, and $\epsilon$, the optimal solution $C^{\ast}$ can be found using Algorithm 1. We then define the affinity matrix as $A=|C^{\ast}|+|C^{\ast\top}|$ using the computed $C^{\ast}$; the final clustering results can be obtained by applying spectral clustering to $A$ via the normalized cut [19].
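Putting the pieces together, Algorithm 4 amounts to the short pipeline sketched below; `pms_coefficients` is a placeholder for Algorithm 1, and scikit-learn's spectral clustering with a precomputed affinity is used here in place of the normalized-cut implementation referenced in the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def pmssc(X, n_clusters, pms_coefficients, **pms_kwargs):
    """PMSSC (Algorithm 4): coefficients -> affinity matrix -> spectral clustering.

    `pms_coefficients(X, ...)` is assumed to implement Algorithm 1 and return
    the N x N coefficient matrix C*.
    """
    C = pms_coefficients(X, **pms_kwargs)           # Algorithm 1 (PMS)
    A = np.abs(C) + np.abs(C.T)                     # affinity matrix A = |C*| + |C*^T|
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            assign_labels="discretize")
    return sc.fit_predict(A)                        # cluster labels for the N points
```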

IV Experiments and Results

We have evaluated our approach using both synthetic data and real-world benchmark datasets.

IV-A Baselines and Evaluation Metrics

We compare our approach to the following eight methods: SCC [25], LRR [28], thresholding-based subspace clustering (TSC) [40], low rank subspace clustering (LRSC) [41], SSSC [13], EnSC [29], SSC-OMP [10], and S5C [14]. All comparative methods were tested using the source code provided by their authors, and each parameter was carefully tuned to give the best clustering accuracy. For spectral clustering, except for SCC, S5C, and SSSC, we applied the normalized cut [19] to the affinity matrix $A=|C|+|C^{\top}|$. (SCC and S5C have their own spectral clustering phase, while SSSC obtains clustering results from the data split into two parts.) Unlike SSC-OMP, our method, which involves independent calculation for each subset, can be implemented in parallel with multi-core processing. All algorithms ran on an AMD Ryzen 7 3700X processor with 32 GB RAM. Following [10], as quantitative evaluation metrics, we evaluated each algorithm using clustering accuracy (acc: $a\%$), subspace-preserving representation error (sre: $e\%$), connectivity (conn: $c$), and runtime ($t$ seconds). Clustering accuracy represents the percentage of correctly labeled data points:

$$a=\max_{\pi}\frac{100}{N}\sum_{ij}Q_{\pi(i)j}^{\mathrm{est}}Q_{ij}^{\mathrm{true}}, \tag{17}$$

where $\pi$ is a permutation of the $L$ cluster groups. $Q^{\mathrm{est}}$ and $Q^{\mathrm{true}}$ are the estimated labeling result and the ground truth, respectively, with a one in the $(i,j)$-th entry if data point $j$ belongs to the $i$-th cluster and a zero otherwise. The subspace-preserving representation error indicates the average fraction of affinities coming from other subspaces in each $\boldsymbol{c}_{j}$,

$$e=\frac{100}{N}\sum_{j}\left(1-\sum_{i}(\omega_{ij}|c_{ij}|)/\|\boldsymbol{c}_{j}\|_{1}\right), \tag{18}$$

where $\omega_{ij}\in\{0,1\}$ is the true affinity, and $\|\cdot\|_{1}$ returns the $\ell_{1}$ norm. The connectivity measures the average connectedness of the affinity matrix over the $L$ cluster groups, as follows:

$$c=\frac{1}{L}\sum_{i=1}^{L}\lambda_{2}^{(i)}, \tag{19}$$

where $\lambda_{2}^{(i)}$ is the second smallest eigenvalue of the normalized Laplacian of the $i$-th cluster's subgraph, i.e., its algebraic connectivity. If $c=0$, then at least one subgraph is not connected [42].
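For reference, the clustering accuracy of Eq. (17) and the connectivity of Eq. (19) could be computed as in the sketch below (a scipy-based illustration, not the evaluation code used in the paper); cluster labels are assumed to be integers starting from zero.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse.csgraph import laplacian

def clustering_accuracy(labels_true, labels_pred):
    """Eq. (17): accuracy under the best label permutation (Hungarian matching)."""
    L = int(max(labels_true.max(), labels_pred.max())) + 1
    cost = np.zeros((L, L))
    for t, p in zip(labels_true, labels_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)       # maximize the matched count
    return 100.0 * cost[rows, cols].sum() / len(labels_true)

def connectivity(A, labels_true):
    """Eq. (19): mean second-smallest eigenvalue of the normalized Laplacian
    of the subgraph induced by each ground-truth cluster."""
    vals = []
    for l in np.unique(labels_true):
        mask = labels_true == l
        sub = A[np.ix_(mask, mask)]
        if sub.shape[0] < 2:                        # a singleton cluster has no lambda_2
            vals.append(0.0)
            continue
        Lap = laplacian(sub, normed=True)
        vals.append(np.sort(np.linalg.eigvalsh(Lap))[1])
    return float(np.mean(vals))
```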

Figure 2: Comparison of PMSSC, SSC-OMP, and SSSC on synthetic data in terms of (a) clustering accuracy, (b) subspace-preserving representation error, (c) connectivity, and (d) runtime. For SSSC, only clustering accuracy and runtime are shown, as SSSC does not generate the self-expressive coefficient matrix or affinity matrix.
Figure 3: Visual examples from the datasets: (a) Extended Yale B, (b) ORL, (c) GTSRB, (d) MNIST, (e) EMNIST-Letters, and (f) CIFAR-10.
TABLE I: Parameters ($s$, $\delta$, and $T$) used in PMSSC for the benchmark datasets.
Datasets | Ex. Yale B | ORL | GTSRB | BBCSport | MNIST4000 | MNIST10000 | MNIST | EMNIST | CIFAR-10
$s$      | 5 | 5 | 3 | 3 | 10 | 10 | 10 | 10 | 3
$\delta$ | 0.6 | 0.6 | 0.2 | 0.4 | 0.3 | 0.2 | 0.1 | 0.2 | 0.2
$T$      | 6 | 11 | 8 | 15 | 7 | 10 | 19 | 12 | 18

IV-B Experiments on Synthetic Data

IV-B1 Setup

We first report experimental results on data synthesized by randomly generating five linear subspaces of dimension 6 as ground truth in an ambient space of $\mathbb{R}^{9}$. Each subspace contains $n$ randomly sampled data points. To confirm the statistical results, we conducted the experiments by varying $n$ from 100 to 4,000, so the total number $N$ of data points varies from 500 to 20,000. We set the parameters $s=6$, $\delta=0.3$, and $T=16$. The percentage of in-sample data in SSSC is set to $10\%$ of the total number of data points. All experimental results recorded on synthetic data were averaged over 50 trials.
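The synthetic data can be reproduced (up to randomness) with a sketch like the one below, which draws five random 6-dimensional linear subspaces of $\mathbb{R}^{9}$ and samples $n$ points from each; normalizing the points to unit length is an assumption, not a detail stated in the text.

```python
import numpy as np

def synthetic_union_of_subspaces(n, n_subspaces=5, dim=6, ambient=9, rng=None):
    """Sample n points from each of several random linear subspaces (Sec. IV-B setup)."""
    rng = np.random.default_rng() if rng is None else rng
    points, labels = [], []
    for l in range(n_subspaces):
        # random orthonormal basis of a dim-dimensional subspace of R^ambient
        basis, _ = np.linalg.qr(rng.standard_normal((ambient, dim)))
        pts = basis @ rng.standard_normal((dim, n))
        pts /= np.linalg.norm(pts, axis=0)          # unit-norm points (assumed)
        points.append(pts)
        labels.append(np.full(n, l))
    return np.hstack(points), np.concatenate(labels)

X, labels = synthetic_union_of_subspaces(n=100)     # N = 500 points in R^9
```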

IV-B2 Results

The curve for each metric is shown as a function of $n$ in Fig. 2. We can observe from Fig. 2 that PMSSC outperforms SSC-OMP in terms of clustering accuracy. The difference is especially large when the density of data points on the underlying subspaces is lower. This could be partly due to the fact that PMSSC succeeds in generating better connectivity than SSC-OMP (see Fig. 2), and achieves lower subspace-preserving representation error (see Fig. 2). On the other hand, as Fig. 2 shows, PMSSC is much faster with a parallel implementation, which is advantageous for solving problems involving large-scale data. In addition, compared to SSSC, which adopts a sampling approach similar to ours, PMSSC performs better in both clustering accuracy and runtime (using a parallel implementation).

IV-C Experiments on Benchmark Datasets for Real-world Applications

IV-C1 Setup

We conducted experiments on seven benchmark datasets: Extended Yale B [43] and ORL [44] for face clustering, BBCSport [45] for text document clustering, the German Traffic Sign Recognition Benchmark (GTSRB) [46] for street sign clustering, the Modified National Institute of Standards and Technology database (MNIST) [47] and Extended MNIST (EMNIST) [48] for handwritten character clustering, and the Canadian Institute For Advanced Research dataset (CIFAR-10) [49] for object clustering. Parameter settings used for our method in these experiments are shown in Table I. Since the sparsity $s$ in PMSSC and SSC-OMP is related to the intrinsic dimensionality of the subspace, it is manually set close to the dimensionality of the subspaces. For the sampling rate $\delta$, we picked a smaller $\delta$ for a larger dataset. For the number of subsets $T$, we adopted a value that takes into account the trade-off between runtime and clustering accuracy. All experimental results on the benchmark datasets are averaged over 10 trials. Details of each benchmark dataset and the corresponding clustering results are now given.

TABLE II: Comparative results on Extended Yale B; ‘-’ indicates the metric cannot be computed.
Method Extended Yale B
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC 9.54 - - 293.01
LRR 37.58 97.44 0.8175 219.91
TSC 52.40 - 0.0014 10.46
LRSC 56.71 91.64 0.4360 4.37
SSSC 49.77 - - 18.22
EnSC 55.76 18.90 0.0395 4.57
SSC-OMP 73.82 20.07 0.0364 1.27
S5C 62.99 58.74 0.2238 952.26
PMSSC 80.24 22.35 0.0858 2.66
TABLE III: Comparative results on ORL.
Method ORL
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC 16.40 - - 89.45
LRR 32.98 97.70 0.8394 3.57
TSC 68.03 - 0.0992 0.72
LRSC 43.12 93.72 0.5248 0.24
SSSC 65.12 - - 1.78
EnSC 70.03 32.46 0.1825 0.65
SSC-OMP 60.12 34.14 0.0770 0.12
S5C 69.48 63.26 0.3868 54.28
PMSSC 74.45 40.97 0.1708 0.51
TABLE IV: Comparative results on GTSRB.
Method GTSRB
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC 59.68 - - 84.26
LRR 27.87 86.64 0.4255 725.18
TSC 56.36 - 0.0016 242.62
LRSC 83.97 80.14 0.6056 12.89
SSSC 88.03 - - 16.86
EnSC 85.92 0.59 0.0065 10.77
SSC-OMP 81.28 5.38 0.0211 3.72
S5C 61.60 80.99 0.5941 422.35
PMSSC 91.57 7.69 0.0434 3.40
TABLE V: Comparative results on BBCSport.
Method BBCSport
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC 23.12 - - 3.60
LRR 71.37 76.26 0.7744 7.59
TSC 73.95 - 0.0053 0.36
LRSC 89.53 66.38 0.5997 0.18
SSSC 50.24 - - 0.26
EnSC 59.48 11.43 0.0243 0.61
SSC-OMP 69.85 15.96 0.0393 0.10
S5C 55.90 65.78 0.5434 17.99
PMSSC 81.71 14.36 0.0509 0.47
TABLE VI: Comparative results on MNIST4000 and MNIST10000.
Method MNIST4000 MNIST10000
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.) | acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC 67.45 - - 5.93 70.43 - - 11.67
LRR 78.49 90.21 0.8979 43.03 77.53 90.60 0.8818 396.45
TSC 79.57 - 0.0009 11.76 80.62 - 0.0005 132.08
LRSC 81.23 75.67 0.5984 1.61 80.86 77.30 0.5983 7.58
SSSC 70.73 - - 3.11 84.32 - - 13.20
EnSC 89.08 21.14 0.1174 12.71 88.24 17.34 0.0975 35.63
SSC-OMP 91.49 34.69 0.1329 1.61 91.40 32.23 0.1169 6.44
S5C 81.52 66.28 0.4476 277.93 79.30 66.23 0.4466 683.65
PMSSC 92.85 38.27 0.1944 1.42 93.57 36.43 0.1817 4.55
TABLE VII: Comparative results on MNIST; ‘M’ indicates that 32 GB memory was exhausted.
Method MNIST70000
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC 69.08 - - 388.00
LRR M - - -
TSC M - - -
LRSC M - - -
SSSC 81.57 - - 303.28
EnSC 93.79 11.26 0.0596 408.62
SSC-OMP 82.83 28.57 0.0830 248.50
S5C 72.99 66.87 0.4437 4953.28
PMSSC 84.45 32.63 0.1148 65.08

IV-C2 Extended Yale B

Extended Yale B contains 2,432 facial images in 38 classes; see Fig. 3. In this experiment, following [9], we concatenated the pixels of each image resized to $48\times 42$, and used the resulting 2016-dimensional vector as input data.

The results on Extended Yale B are shown in Table II. In each column, the best result is shown in bold, and the second-best result is underlined. They confirm that PMSSC yields the best clustering accuracy, improving over SSC-OMP by $6.42\%$. Although the subspace-preserving error and runtime are slightly worse than those of SSC-OMP, the connectivity is greatly improved, leading to better clustering accuracy. LRR, LRSC, and S5C have good connectivity, but their poor subspace-preserving errors result in low clustering accuracy.

IV-C3 ORL

ORL contains 400 facial images in 40 classes, as shown in Fig. 3. In this experiment, following [50], we concatenate the pixels of each image resized to $32\times 32$, and use the resulting 1024-dimensional vector as input data. Compared to Extended Yale B, ORL is a more difficult setting for subspace clustering because the density of data lying near the same subspace is lower (10 vs. 64 images per subject), and the subspaces have more non-linearity due to changes in facial expressions and details.

The results for ORL are listed in Table III. We can again observe that PMSSC achieves the best clustering accuracy, and improves the connectivity compared to SSC-OMP. However, since PMSSC does not incorporate nonlinear constraints, the subspace-preserving error does not improve along with the improvement of the connectivity.

IV-C4 GTSRB

GTSRB contains over 50,000 street sign images in 43 classes; see Fig. 3. Following [33], we preprocess the dataset, originally represented by 1568-dimensional HOG features, to obtain an imbalanced dataset of 500-dimensional vectors with 12,390 samples in 14 classes.

The results on GTSRB are reported in Table IV. Again PMSSC yields the best clustering accuracy and runtime, improving the clustering accuracy by roughly $10\%$ compared to SSC-OMP. In particular, PMSSC achieves both good subspace-preserving error and good connectivity. While EnSC and SSSC also achieve competitive clustering accuracy, their computational costs are much higher.

TABLE VIII: Comparative results on EMNIST-Letters; ‘M’ indicates that 32 GB memory was exhausted.
Method EMNIST-Letters
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC M - - -
LRR M - - -
TSC M - - -
LRSC M - - -
SSSC 60.62 - - 1538.46
EnSC 64.15 26.20 0.0086 1575.46
SSC-OMP 58.71 43.93 0.0000 1214.31
S5C 60.01 83.37 0.3517 15698.90
PMSSC 66.52 46.76 0.0019 638.03

IV-C5 BBCSport

BBCSport contains 737 documents in five classes. The data provided by the database has been preprocessed by stemming, stop-word removal, and low term frequency filtering. In this experiment, we reduced the dimensionality of feature vectors to 500 by PCA.

The results on BBCSport are summarized in Table V. We can observe that PMSSC yields the second-best clustering accuracy and subspace-preserving error. LRSC yields the best clustering accuracy due to its good connectivity. For small-scale datasets such as BBCSport and ORL, PMSSC is slightly slower than SSC-OMP because the advantage of reducing the data size by sampling multiple subsets is diminished.

IV-C6 MNIST and EMNIST-Letters

MNIST contains 70,000 images of handwritten digits (0–9), while EMNIST-Letters contains 145,600 images of handwritten characters in 26 classes, as shown in Fig. 3. In our experiments, following [34], we generate MNIST4000 and MNIST10000 by randomly sampling 400 and 1,000 images per digit class, respectively. Each image is represented as a 3472-dimensional feature vector using the scattering convolution network [51], and its dimensionality is reduced to 500 by PCA.

The results on MNIST and EMNIST-Letters are summarized in Tables VI–VIII. We can observe that PMSSC yields the best clustering accuracy on MNIST4000, MNIST10000, and EMNIST-Letters. In particular, PMSSC is remarkably faster than the comparative methods on MNIST70000 and EMNIST-Letters. In the case of MNIST70000, EnSC yields the best clustering accuracy and subspace-preserving error, but its computational cost is high. Similarly, S5C can achieve good connectivity, but is very slow.

IV-C7 CIFAR-10

CIFAR-10 includes 60,000 images of general objects in 10 classes, as illustrated in Fig. 3. Following [52], we employ the feature representations extracted by MCR2 [53], which learns a union-of-low-dimensional-subspaces representation via self-supervised learning. Each image is represented by a 128-dimensional feature vector, further normalized to have unit $\ell_{2}$ norm.

The comparative results on CIFAR-10 are summarized in Table IX. It can be observed that our method outperforms the others in terms of runtime, while its clustering accuracy is competitive. However, as with SSC-OMP, the connectivity is lower than that of S5C, which uses $\ell_{1}$ norm regularization.

IV-C8 Summary

Overall, our proposed method becomes significantly faster as the amount of input data increases. In addition, it achieves good clustering accuracy and connectivity, and provides subspace-preserving errors comparable to those of the comparative algorithms.

TABLE IX: Comparative results on CIFAR-10; ‘M’ indicates that 32 GB memory was exhausted.
Method CIFAR-10
acc ($a\%$) sre ($e\%$) conn ($c$) $t$ (sec.)
SCC 37.10 - - 196.40
LRR M - - -
TSC M - - -
LRSC M - - -
SSSC 63.80 - - 74.36
EnSC 61.79 22.60 0.0000 178.22
SSC-OMP 40.86 24.92 0.0000 63.58
S5C 64.52 46.35 0.2314 2338.55
PMSSC 63.52 26.41 0.0000 29.60

IV-D Analysis

IV-D1 Multi-subset Based Self-Expressive Model

Since our approach aims to minimize the self-expressive residual via the weight coefficient vector $\boldsymbol{b}^{\ast}$ solved for in Algorithm 3, we show the mean self-expressive residual of data points represented by the coefficient vectors in Fig. 4. This experiment was performed on synthetic data with $T=10$ and $\delta=0.3$ fixed. Each blue bar indicates the mean self-expressive residual of the data points represented by Eq. (5), computed as

$$\|\boldsymbol{z}^{(t)}\|_{2}=\frac{1}{\lceil\delta N\rceil}\sum_{i=1}^{N}\|\boldsymbol{x}_{i}^{(t)}-X^{(t)}\boldsymbol{c}_{i}^{\ast(t)}\|_{2}. \tag{20}$$

The red bar indicates the mean self-expressive residual of the data points represented by Eq. (9), computed as:

$$\|\boldsymbol{z}\|_{2}=\frac{1}{N}\sum_{i=1}^{N}\|\boldsymbol{x}_{i}-X\boldsymbol{c}_{i}^{\ast}\|_{2}. \tag{21}$$

We can clearly observe that the mean self-expressive residual of PMS is lower than that of every single subset. To highlight the benefit of $\boldsymbol{b}^{\ast}$, we compare against a variant of our approach, named PMSSC(avg), which replaces $\boldsymbol{b}^{\ast}$ by a simple average: in PMSSC(avg), Eq. (10) is replaced by

$$\boldsymbol{c}_{i}^{\ast}=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{c}_{i}^{\ast(t)}. \tag{22}$$

We performed experiments on synthetic data using the same setup as for Fig. 2 and present the results in Fig. 5. As can be seen, incorporating $\boldsymbol{b}^{\ast}$ improves clustering performance; in particular, the subspace-preserving representation error is significantly reduced. These experiments indicate that the weight coefficient vector $\boldsymbol{b}^{\ast}$ contributes to improving self-expressiveness.

Figure 4: Comparative results in terms of the mean residuals over data points represented by the self-expressive models with different subsets. Blue bars represent each single $t$-th subset, while the red bar is computed using multiple subsets.
Figure 5: Benefit of using $\boldsymbol{b}^{\ast}$ in PMSSC in terms of (a) clustering accuracy and (b) subspace-preserving representation error, on synthetic data. Red: using $\boldsymbol{b}^{\ast}$. Green: using simple averaging.
Figure 6: Effects of varying parameters $\delta$ and $T$ (GTSRB dataset): (a) clustering accuracy, (b) subspace-preserving representation error, (c) connectivity, and (d) runtime.
Figure 7: Effect of the sampling method in our approach, on synthetic data: (a) clustering accuracy and (b) connectivity. Blue: weighted random sampling. Black: uniform random sampling.

IV-D2 Selection of Parameters

We performed multiple experiments on the GTSRB dataset with various choices of the hyperparameters $(T,\delta)$ to evaluate the sensitivity of our approach to parameter choice. Changes in clustering accuracy, subspace-preserving representation error, connectivity, and runtime when varying each parameter are illustrated in Fig. 6. We can confirm that high clustering accuracy and low subspace-preserving representation error are maintained in most cases, except when both $T$ and $\delta$ are extremely small. This implies that the affinity matrix constructed by PMSSC provides subspace-preserving representations at most data points. We can also see that the connectivity improves as the number of subsets $T$ increases, because the affinity matrix contains at most $sTN$ nonzero entries in OMP optimization. Considering runtime, a practical choice of parameters is to increase $T$ for small values of $\delta$, and decrease $T$ for large values of $\delta$. In addition, the time taken can be kept low by picking a small value of $\delta$ for large-scale datasets.

IV-D3 Sampling Technique

Our approach adopts weighted random sampling to generate the subset data matrices $X^{(t)}$. To analyze the effect of the sampling method on our approach, we compared weighted random sampling to random sampling with uniform weights. The experimental settings for the synthetic data follow those in Fig. 2. Fig. 7 shows the clustering accuracy and connectivity as functions of $n$. Clearly, weighted random sampling outperforms uniform random sampling in terms of both clustering accuracy and connectivity. In particular, as the density of data points increases, the connectivity of the method with uniform random sampling becomes zero, because imbalanced sampling leads to a disconnected subgraph in the affinity graph.

IV-D4 Computational Complexity

Algorithms 2 and 3, which construct the affinity matrix, consume most of the processing time. In Algorithm 2, finding the self-expressive coefficient vector $\boldsymbol{c}_{i}^{\ast(t)}$ requires time $\mathcal{O}(Ds\lceil\delta N\rceil)$. In Algorithm 3, finding the weight coefficient vector $\boldsymbol{b}^{\ast(i)}$ requires $\mathcal{O}(DT^{2})$. Because these two algorithms are performed for all $N$ data points, the computational complexity of PMS is at least $\mathcal{O}(N(TDs\lceil\delta N\rceil+DT^{2}))$. However, processing the $T$ subsets (the part taking $\mathcal{O}(TDs\lceil\delta N\rceil)$) can be performed in parallel, which reduces the computation time compared to methods that directly deal with the whole dataset. Fig. 2 supports this analysis.
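The subset-level parallelism discussed above maps directly onto standard multiprocessing tools; the sketch below uses joblib as one possible choice, with `solve_subset` standing in for the per-subset step of Algorithm 2 applied to all points of a subset.

```python
from joblib import Parallel, delayed

def pms_parallel(X, subsets, solve_subset, n_jobs=-1):
    """Solve the T per-subset problems of Eq. (4) in parallel.

    `solve_subset(X, idx)` is assumed to return the matrix whose columns are the
    zero-padded coefficient vectors c_i^*(t) for the subset indexed by `idx`.
    """
    # The T subsets are independent, so they can be dispatched to separate cores.
    return Parallel(n_jobs=n_jobs)(
        delayed(solve_subset)(X, idx) for idx in subsets)
```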

V Conclusions

We have proposed a parallelizable multi-subset based self-expressive model for subspace clustering. A representation of the input data is formulated by combining the solutions of small optimization problems over multiple subsets generated by data sampling. We have shown that this strategy can significantly improve speed with an easily implemented multi-core approach, especially for large-scale data. Moreover, it has been verified that combining multiple subsets can reduce the self-expressive residuals of the data compared to a single subset. Extensive experiments on synthetic data and real-world datasets have demonstrated the efficiency and effectiveness of our approach. As a limitation, our method is still unable to handle nonlinear subspaces due to the problem setting. In future work, we would like to design a self-expressive model that can handle nonlinear subspaces, with the help of the modeling capabilities of neural network architectures.

References

  • [1] R. Vidal, “Subspace clustering,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.
  • [2] K. Hotta, H. Xie, and C. Zhang, “Affine subspace clustering with nearest subspace neighbor,” in International Workshop on Advanced Imaging Technology (IWAIT) 2021, vol. 11766, 2021, p. 117661C.
  • [3] C. Zhang, “Energy minimization over m-branched enumeration for generalized linear subspace clustering,” IEICE Transactions on Information and Systems, vol. 102, no. 12, pp. 2485–2492, 2019.
  • [4] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, “Unsupervised segmentation of natural images via lossy data compression,” Computer Vision and Image Understanding, vol. 110, no. 2, pp. 212–225, 2008.
  • [5] R. Vidal, R. Tron, and R. Hartley, “Multiframe motion segmentation with missing data using powerfactorization and gpca,” International Journal of Computer Vision, vol. 79, no. 1, pp. 85–105, 2008.
  • [6] S. Tierney, J. Gao, and Y. Guo, “Subspace clustering for sequential data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1019–1026.
  • [7] C. Zhang, X. Lu, K. Hotta, and X. Yang, “G2mf-wa: Geometric multi-model fitting with weakly annotated data,” Computational Visual Media, vol. 6, pp. 135–145, 2020.
  • [8] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 6, 2009, pp. 2790–2797.
  • [9] ——, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
  • [10] C. You, D. Robinson, and R. Vidal, “Scalable sparse subspace clustering by orthogonal matching pursuit,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3918–3927.
  • [11] Y. Guo, S. Tierney, and J. Gao, “Efficient sparse subspace clustering by nearest neighbour filtering,” Signal Processing, vol. 185, p. 108082, 2021.
  • [12] C. You, C. Li, D. Robinson, and R. Vidal, “Self-representation based unsupervised exemplar selection in a union of subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [13] X. Peng, L. Zhang, and Z. Yi, “Scalable sparse subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 430–437.
  • [14] S. Matsushima and M. Brbic, “Selective sampling-based scalable sparse subspace clustering,” Advances in Neural Information Processing Systems, vol. 32, pp. 12 416–12 425, 2019.
  • [15] P. Tseng, “Nearest q-flat to m points,” Journal of Optimization Theory and Applications, vol. 105, no. 1, pp. 249–252, 2000.
  • [16] T. Zhang, A. Szlam, and G. Lerman, “Median k-flats for hybrid linear modeling with many outliers,” in Conference on Computer Vision Workshops, 2009, pp. 234–241.
  • [17] J. Lipor, D. Hong, Y. S. Tan, and L. Balzano, “Subspace clustering using ensembles of k-subspaces,” Information and Inference: A Journal of the IMA, vol. 10, no. 1, pp. 73–107, 2021.
  • [18] C. Lane, B. Haeffele, and R. Vidal, “Adaptive online k-subspaces with cooperative re-initialization,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 678–688.
  • [19] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
  • [20] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
  • [21] C. Lu, J. Feng, Z. Lin, T. Mei, and S. Yan, “Subspace clustering by block diagonal representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 487–501, 2018.
  • [22] W. Dong, X.-J. Wu, J. Kittler, and H.-F. Yin, “Sparse subspace clustering via nonconvex approximation,” Pattern Analysis and Applications, vol. 22, no. 1, pp. 165–176, 2019.
  • [23] K. Hotta, H. Xie, and C. Zhang, “Candidate subspace screening for linear subspace clustering with energy minimization,” in Irish Machine Vision and Image Processing Conference, 2020, pp. 125–128.
  • [24] J. Yan and M. Pollefeys, “A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate,” in European Conference on Computer Vision, 2006, pp. 94–106.
  • [25] G. Chen and G. Lerman, “Spectral curvature clustering (scc),” International Journal of Computer Vision, vol. 81, no. 3, pp. 317–330, 2009.
  • [26] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal $\ell_{1}$-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, vol. 59, no. 6, pp. 797–829, 2006.
  • [27] C.-Y. Lu, H. Min, Z.-Q. Zhao, L. Zhu, D.-S. Huang, and S. Yan, “Robust and efficient subspace segmentation via least squares regression,” in European Conference on Computer Vision, 2012, pp. 347–360.
  • [28] G. Liu, Z. Lin, Y. Yu et al., “Robust subspace segmentation by low-rank representation,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010, p. 663–670.
  • [29] C. You, C.-G. Li, D. P. Robinson, and R. Vidal, “Oracle based active set algorithm for scalable elastic net subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3928–3937.
  • [30] P. Ji, M. Salzmann, and H. Li, “Efficient dense subspace clustering,” in IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 461–468.
  • [31] E. L. Dyer, A. C. Sankaranarayanan, and R. G. Baraniuk, “Greedy feature selection for subspace clustering,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 2487–2517, 2013.
  • [32] B. Nasihatkon and R. Hartley, “Graph connectivity in sparse subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2137–2144.
  • [33] C. You, C. Li, D. P. Robinson, and R. Vidal, “Scalable exemplar-based subspace clustering on class-imbalanced data,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 67–83.
  • [34] Y. Chen, C.-G. Li, and C. You, “Stochastic sparse subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 4155–4164.
  • [35] C. You, C. Donnat, D. P. Robinson, and R. Vidal, “A divide-and-conquer framework for large-scale subspace clustering,” in Proceedings of 50th Asilomar Conference on Signals, Systems and Computers, 2016, pp. 1014–1018.
  • [36] M. A. Davenport and M. B. Wakin, “Analysis of orthogonal matching pursuit using the restricted isometry property,” IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4395–4401, 2010.
  • [37] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
  • [38] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Proceedings of Asilomar Conference on Signals, Systems and Computers, 1993, pp. 40–44.
  • [39] C.-K. Wong and M. C. Easton, “An efficient method for weighted sampling without replacement,” SIAM Journal on Computing, vol. 9, no. 1, pp. 111–113, 1980.
  • [40] R. Heckel and H. Bölcskei, “Subspace clustering via thresholding and spectral clustering,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 3263–3267.
  • [41] R. Vidal and P. Favaro, “Low rank subspace clustering (lrsc),” Pattern Recognition Letters, vol. 43, pp. 47–61, 2014.
  • [42] F. R. Chung, “Spectral graph theory,” 1997.
  • [43] K.-C. Lee, J. Ho, and D. J. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
  • [44] F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in Proceedings of the IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.
  • [45] D. Greene and P. Cunningham, “Practical solutions to the problem of diagonal dominance in kernel document clustering,” in Proc. 23rd International Conference on Machine learning, 2006, pp. 377–384.
  • [46] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition,” Neural Networks, vol. 32, pp. 323–332, 2012.
  • [47] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [48] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik, “Emnist: Extending mnist to handwritten letters,” in International Joint Conference on Neural Networks, 2017, pp. 2921–2926.
  • [49] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
  • [50] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, “Learning a spatially smooth subspace for face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–7.
  • [51] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
  • [52] S. Zhang, C. You, R. Vidal, and C.-G. Li, “Learning a self-expressive network for subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 393–12 403.
  • [53] Y. Yu, K. H. R. Chan, C. You, C. Song, and Y. Ma, “Learning diverse and discriminative representations via the principle of maximal coding rate reduction,” Advances in Neural Information Processing Systems, vol. 33, pp. 9422–9434, 2020.