
$\ell_{1}$-Regularized ICA: A Novel Method for Analysis of Task-related fMRI Data

Yusuke Endo Department of Mechanical Systems Engineering, Graduate School of Science and Engineering, Ibaraki University    Koujin Takeda Department of Mechanical Systems Engineering, Graduate School of Science and Engineering, Ibaraki University, e-mail: koujin.takeda.kt@vc.ibaraki.ac.jp
(17th October, 2024)
Abstract

We propose a new method of independent component analysis (ICA) for extracting appropriate features from high-dimensional data. In general, matrix factorization methods including ICA suffer from poor interpretability of the extracted features. A sparse constraint on a factorized matrix is considered helpful for improving interpretability. With this background, we construct a new ICA method with sparsity. In our method, an $\ell_{1}$-regularization term is added to the cost function of ICA, and the cost function is minimized by the difference of convex functions algorithm. To validate the proposed method, we apply it to synthetic data and real functional magnetic resonance imaging data.

Keywords: independent component analysis, sparse matrix factorization, sparse coding, difference of convex functions algorithm, functional magnetic resonance imaging

1 Introduction

In machine learning, matrix factorization (MF) is a significant unsupervised learning method for feature extraction from data, and various MF methods have been proposed. In MF, two factorized matrices $\bm{A}\in\mathbb{R}^{p\times K}$ and $\bm{S}\in\mathbb{R}^{K\times N}$ are computed from an observed data matrix $\bm{X}\in\mathbb{R}^{p\times N}$, which are related by $\bm{X}\approx\bm{AS}$. Independent component analysis (ICA) [1], principal component analysis (PCA), and non-negative MF [2] are widely known and frequently used MF methods. However, these MF methods share a common problem in the application to high-dimensional data: it is generally difficult to interpret what the extracted features mean. To resolve this problem, sparse MF methods [3], [4] are considered useful, where a sparse constraint is imposed on either factorized matrix $\bm{A}$ or $\bm{S}$. By introducing a sparse constraint into PCA [5], [6] or non-negative MF [7], the extracted features can be interpreted relatively easily due to sparsity. However, as far as the authors know, a widely-accepted sparse method for ICA has not been established yet.

Due to its significance, there are many applications of MF to real-world data. Analysis of biological data such as neuronal activity or electrocardiograms is one such application. Among the various MF methods and applications, we especially focus on the application of ICA to functional magnetic resonance imaging (fMRI) data. Nowadays, fMRI is a significant experimental method for understanding the function of the whole human brain. In fMRI, time series data of cerebral blood flow in a test subject are measured by magnetic resonance imaging. There are two types of fMRI: resting-state fMRI, which is measured while the subject performs no task and receives no stimulus, and task-related fMRI, which is measured under some tasks or stimuli. In the application of MF to fMRI [8], [9], [10], [11], the dimensions $p$ and $N$ are the numbers of voxels and scanning time steps, respectively. In this setting, the $K$ vectors in the spatial feature matrix $\bm{A}$ and the temporal feature matrix $\bm{S}$ are interpreted as $K$ features of human brain activity.

For feature extraction from task-related fMRI by MF, statistical independence among extracted feature signals is generally considered significant, in particular for the temporal feature matrix $\bm{S}$. However, it is better to impose an additional constraint such as sparsity on the spatial feature matrix $\bm{A}$ for localization of its elements, because information representation for external stimuli in the human brain is widely believed to be sparse. Therefore, assuming statistical independence only for the temporal feature $\bm{S}$ is not sufficient for feature extraction. In practice, when ICA is applied to fMRI data, statistical independence is often imposed on $\bm{A}$ instead of $\bm{S}$ for localization of the elements in $\bm{A}$; such ICA is specifically called spatial ICA [12]. In contrast, the method assuming statistical independence of the temporal feature $\bm{S}$ is called temporal ICA. In this connection, it was recently shown that sparse MF or spatial ICA can extract more appropriate features from fMRI data than other MF methods without localization of the elements in $\bm{A}$ [13]. From these facts, we expect that temporal ICA can be improved by adding a sparse constraint on the matrix $\bm{A}$.

With this background, we propose a novel ICA method that may extract features appropriately from task-related fMRI. In our method, we minimize the cost function of ICA with an additional regularization term; namely, we impose statistical independence on $\bm{S}$ and sparsity on $\bm{A}$. For minimizing the proposed cost function, we apply the difference of convex functions (DC) algorithm [14, 15, 16], which is a powerful tool for nonconvex optimization problems.

The organization of this article is as follows. In section 2, the framework of ICA is overviewed and several related works are presented. In section 3, we describe how to construct our proposed method and discuss the convergence of our algorithm theoretically. In section 4, we apply our method both to synthetic data and real-world data, then discuss its validity. Finally, section 5 is devoted to conclusions and future work.

2 Overview of ICA

Throughout this article, matrices are denoted by bold uppercase letters such as $\bm{A}$ and vectors by bold lowercase letters such as $\bm{a}$. Scalars are denoted by lowercase italic letters, and elements of a vector/matrix are written as italic lowercase letters such as $[a_{1},\ldots,a_{n}]$.

2.1 Framework of ICA and FastICA

In the introduction, we mainly focused on the application of ICA to fMRI. In practice, ICA is applied not only to biological data analysis but also to various other problems such as image processing or finance. Therefore, we formulate our proposed ICA in a general setting for application to various problems.

ICA was originally proposed as a blind source separation method. The objective of all ICA methods is to estimate the latent independent source $\bm{S}=[\bm{s}_{1},\ldots,\bm{s}_{K}]^{T}$ and the mixing matrix $\bm{A}$ simultaneously without any prior knowledge of them. In many ICA algorithms, non-gaussianity is the quantity to be maximized.

Among various ICA methods, FastICA [17] is a widely used algorithm. In FastICA, the factorized matrices $\bm{A},\bm{S}$ are estimated as follows.

  1. The observed matrix $\bm{X}$ is prewhitened by $\bm{Q}\in\mathbb{R}^{K\times p}$ as $\bm{Z}=\bm{QX}$, where $\bm{Q}$ denotes the prewhitening matrix and $\bm{Z}$ denotes the prewhitened observed matrix.

  2. Next, the projection matrix $\bm{W}\in\mathbb{R}^{K\times K}$ is evaluated, which maximizes non-gaussianity among the rows of the signal source matrix $\bm{S}=\bm{W}^{T}\bm{Z}$. The projection matrix $\bm{W}$ is frequently called the unmixing matrix. The cost function for non-gaussianity is defined by

     J_{f}(\bm{w}_{i})=\frac{1}{N^{2}}\left\{\sum_{j=1}^{N}G(\bm{w}_{i}^{T}\bm{z}_{j})-\sum_{j=1}^{N}G(\nu_{j})\right\}^{2}. (1)

     In the above equation, $G(\cdot)$ is a nonlinear function defined by $G(y)=-e^{-y^{2}/2}$. The variable $\nu_{j}$ ($j\in\{1,\ldots,N\}$) is a normal random variable with zero mean and unit variance. In maximizing $J_{f}(\bm{w}_{i})$, the fixed point algorithm in equation (2) is used,

     \bm{w}_{i}\leftarrow\frac{1}{N}\sum_{j=1}^{N}\left\{\bm{z}_{j}G^{\prime}(\bm{w}_{i}^{T}\bm{z}_{j})-G^{\prime\prime}(\bm{w}_{i}^{T}\bm{z}_{j})\bm{w}_{i}\right\}. (2)

     In addition, the orthogonality constraint $\bm{W}^{T}\bm{W}=\bm{I}_{K}$ is imposed on the projection matrix $\bm{W}$, where $\bm{I}_{K}$ is the $K$-dimensional identity matrix. In practice, the column vectors of $\bm{W}$ are mutually orthogonalized by the Gram-Schmidt method, where the column vectors are updated as follows,

     \bm{w}_{i}\leftarrow\bm{w}_{i}-\sum_{j=1}^{i-1}(\bm{w}_{i}^{T}\bm{w}_{j})\bm{w}_{j}. (3)

  3. The mixing matrix $\bm{A}$ is calculated as $\bm{A}=\bm{Q}^{\dagger}\bm{W}$, where $\bm{Q}^{\dagger}$ is the Moore-Penrose pseudo-inverse matrix of $\bm{Q}$. Accordingly, the observed matrix $\bm{X}$, the mixing matrix $\bm{A}$, and the signal source matrix $\bm{S}$ are related by $\bm{X}\approx\bm{A}\bm{S}$, which follows from the orthogonality of $\bm{W}$.
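As a concrete illustration of these three steps, the following minimal NumPy sketch (our own rendering, not the implementation used in this paper) performs prewhitening, the one-unit fixed-point update of equation (2) with $G(y)=-e^{-y^{2}/2}$, and the Gram-Schmidt orthogonalization of equation (3); all function and variable names are ours.

```python
import numpy as np

def prewhiten(X, K):
    """Center X (p x N) and return Z = Q X (K x N) with identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    idx = np.argsort(eigval)[::-1][:K]                    # keep the K leading components
    Q = np.diag(eigval[idx] ** -0.5) @ eigvec[:, idx].T   # K x p whitening matrix
    return Q @ Xc, Q

def fastica_unit(Z, w, iters=100):
    """One-unit fixed-point iteration of equation (2) for G(y) = -exp(-y^2/2)."""
    for _ in range(iters):
        y = w @ Z                                                      # projections w^T z_j
        g = y * np.exp(-y**2 / 2)                                      # G'
        gp = (1.0 - y**2) * np.exp(-y**2 / 2)                          # G''
        w_new = (Z * g).mean(axis=1) - gp.mean() * w
        w_new /= np.linalg.norm(w_new)
        converged = np.abs(np.abs(w_new @ w) - 1.0) < 1e-8
        w = w_new
        if converged:
            break
    return w

def gram_schmidt(w, W_prev):
    """Orthogonalize w against previously found vectors, equation (3), then normalize."""
    for wj in W_prev:
        w = w - (w @ wj) * wj
    return w / np.linalg.norm(w)
```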

2.2 Related works on ICA with sparse constraint

Before going into our proposed method, we should mention several related works on ICA with a sparse constraint. First, there is a method called SparseICA [18], [19], which was introduced for blind source separation in real-world image processing applications [20]. In this method, the independent source matrix $\bm{S}$ is computed as the product of a sparse coefficient matrix $\bm{C}$ and a dictionary matrix $\bm{\Phi}$ by $\bm{S}=\bm{C}\bm{\Phi}$.

Next, ICA with regularization terms was proposed for the linear non-gaussian acyclic model [21]. In this method, a regularization term for the separation matrix, defined by $\bm{W}^{T}\bm{Q}$, is introduced. It should be noted that the construction of a sparse matrix in ICA is similar to our proposed method explained in the following section. However, the sparse constraint is imposed on a different matrix: in that method the separation matrix $\bm{W}^{T}\bm{Q}$ is sparsified, while in our method the estimated mixing matrix $\bm{A}$ is sparse. Similar studies with a sparse separation matrix $\bm{W}^{T}\bm{Q}$ can also be found in other references [22], [23].

There is also another related work [24], where the sparse constraint is imposed on the mixing matrix $\bm{A}$ as in our method, and whose objective is likewise to identify sparse neuronal networks. However, the methodology differs: that method is based on a Bayesian framework and its algorithm is stochastic. More precisely, the mixing matrix $\bm{A}$ and the model parameters are assumed to follow some probability distributions and are optimized by a Markov chain Monte Carlo method. In contrast, our method is deterministic, where the cost function is minimized by the DC algorithm.

3 Our method

In this section we elucidate the details of our proposed method.

3.1 Formulation

In our method, the $i$-th vector $\bm{w}_{i}$ of the projection matrix, which minimizes the cost function in equation (4), must be evaluated.

J_{o}(\bm{w}_{i})=-J_{f}(\bm{w}_{i})+\alpha\|\bm{Q}^{\#}\bm{w}_{i}\|_{1}=\widetilde{J_{f}}(\bm{w}_{i})+\alpha\|\bm{Q}^{\#}\bm{w}_{i}\|_{1}. (4)

Here $\|\cdot\|_{p}$ is the $\ell_{p}$-norm, $\widetilde{J_{f}}(\bm{w}_{i})=-J_{f}(\bm{w}_{i})$, and $\bm{Q}^{\#}\in\mathbb{R}^{p\times K}$ is given by the minimizer in equation (5),

\bm{Q}^{\#}=\mathop{\rm arg~min}\limits_{\bm{Q}^{\prime}}\|\bm{Q}^{\dagger}-\bm{Q}^{\prime}\|_{\rm F}\quad\mathrm{s.t.}\quad\|\bm{Q}^{\prime}\|_{0}\leq k_{0}, (5)

with $\|\cdot\|_{\rm F}$ being the Frobenius norm and $\|\cdot\|_{0}$ being the matrix $\ell_{0}$-norm, i.e. the number of non-zero elements in the matrix. By definition, the minimizer matrix $\bm{Q}^{\#}$ is close to the original $\bm{Q}^{\dagger}$ and sparse, with only $k_{0}$ nonzero elements. We need a sparse estimated mixing matrix $\bm{A}=\bm{Q}^{\#}\bm{W}$, whose sparsity depends on the sparsity of $\bm{Q}^{\#}$ [25]. Hence, the approximate sparse inverse whitening matrix $\bm{Q}^{\#}$ is preferable to the original inverse whitening matrix $\bm{Q}^{\dagger}$ in our proposed method. In practical computation, the orthogonal matching pursuit algorithm [26], [27] is applied to obtain each row of $\bm{Q}^{\#}$.
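For reference, a generic orthogonal matching pursuit routine is sketched below. How the dictionary and targets are chosen when building the rows of $\bm{Q}^{\#}$ follows the cited references [26], [27] and is not fully specified here, so this is only an illustrative sketch with hypothetical variable names.

```python
import numpy as np

def omp(D, y, k0):
    """Greedy OMP: approximate y by at most k0 columns of the dictionary D."""
    p = D.shape[1]
    support, residual = [], y.copy()
    x = np.zeros(p)
    for _ in range(k0):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - D @ x
        if np.linalg.norm(residual) < 1e-10:
            break
    return x
```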

We apply the widely-used DC algorithm for minimization of the cost function. The DC algorithm is applicable to minimization problems where the cost function $J(\bm{w})$ is represented by the difference of two convex functions $g(\bm{w}),h(\bm{w})$ as

J(\bm{w})=g(\bm{w})-h(\bm{w}). (6)

Note that the Hessians of the two convex functions must be bounded above. In the DC algorithm, the following two processes are repeated until convergence or until the maximum iteration number $M$ is reached.

  (i) Update ${}_{\nabla}\bm{h}^{(t)}$ by choosing one element from the subgradient set of $h(\bm{w})$, namely ${}_{\nabla}\bm{h}^{(t)}\in\partial h(\bm{w}^{(t)})$, with $\bm{w}^{(t)}$ being the $t$-th update of $\bm{w}$.

  (ii) Update $\bm{w}^{(t+1)}$ by $\bm{w}^{(t+1)}=\mathop{\rm arg~min}\limits_{\bm{w}}\{g(\bm{w})-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}\}$.

It is considered that the solution to which the DC algorithm converges is often a globally optimal one. Therefore, the DC algorithm is one of the effective approaches for nonconvex optimization.
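Schematically, the two processes can be written as the following loop (a minimal sketch in our own notation), where `grad_h` returns an element of the subgradient $\partial h$ for step (i) and `solve_convex` stands for any solver of the convex subproblem in step (ii):

```python
import numpy as np

def dc_algorithm(w0, grad_h, solve_convex, max_iter=500, tol=1e-6):
    """Generic DC algorithm for J(w) = g(w) - h(w) with g, h convex.

    grad_h(w)       -> a subgradient of h at w          (step (i))
    solve_convex(v) -> argmin_w { g(w) - v^T w }        (step (ii))
    """
    w = w0
    for _ in range(max_iter):
        v = grad_h(w)               # step (i): linearize h at the current point
        w_new = solve_convex(v)     # step (ii): minimize the convex surrogate
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```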

For application of the DC algorithm to the minimization in our method, the cost function $J_{o}(\bm{w}_{i})$ must be expressed as the difference of two convex functions. First, the Lipschitz constant $L$ of the gradient of $\widetilde{J_{f}}(\bm{w}_{i})$ is introduced, which is well defined because $\widetilde{J_{f}}(\bm{w}_{i})$ is twice differentiable. Then, the function $(L/2)\|\bm{w}_{i}\|_{2}^{2}-\widetilde{J_{f}}(\bm{w}_{i})$ is easily proved to be convex by the definition of the Lipschitz constant $L$. Moreover, the regularization term $\|\bm{Q}^{\#}\bm{w}_{i}\|_{1}$ is clearly convex, and the sum of two convex functions $(L/2)\|\bm{w}_{i}\|_{2}^{2}+\alpha\|\bm{Q}^{\#}\bm{w}_{i}\|_{1}$ is also convex. Therefore, $J_{o}(\bm{w}_{i})$ can be represented by the difference of two convex functions $g(\bm{w}_{i}),h(\bm{w}_{i})$ as $J_{o}(\bm{w}_{i})=g(\bm{w}_{i})-h(\bm{w}_{i})$, where the two convex functions are defined in equations (7) and (8).

g(\bm{w}_{i})=\frac{L}{2}\|\bm{w}_{i}\|_{2}^{2}+\alpha\|\bm{Q}^{\#}\bm{w}_{i}\|_{1}, (7)
h(\bm{w}_{i})=\frac{L}{2}\|\bm{w}_{i}\|_{2}^{2}-\widetilde{J_{f}}(\bm{w}_{i}). (8)

Note that $g(\bm{w}_{i})$ and $h(\bm{w}_{i})$ are continuous functions in our $\ell_{1}$-regularized ICA.

In the DC algorithm, computation of the Lipschitz constant $L$ requires the maximum eigenvalue of the Hessian $\nabla^{2}\widetilde{J_{f}}(\bm{w}_{i})$, whose computational complexity is $O(K^{3})$. In practice, $L$ is evaluated by a backtracking line search algorithm [28]. In this algorithm, we prepare a tentative Lipschitz constant $l^{(t)}$ in the $t$-th update; $l^{(t)}$ is accepted as the final Lipschitz constant if the criterion of inequality (9) is satisfied for $\zeta\in(0,1)$. Otherwise, $l^{(t)}$ is rescaled as $l^{(t)}\leftarrow\eta l^{(t)}$ with a constant $\eta>1$, and inequality (9) is checked again.

J_{o}(\bm{w}_{i}^{(t+1)})\leq J_{o}(\bm{w}_{i}^{(t)})-\frac{\zeta}{2}\|\bm{w}_{i}^{(t+1)}-\bm{w}_{i}^{(t)}\|_{2}^{2}. (9)
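A minimal sketch of this backtracking rule, in our own notation with hypothetical function names, is given below; `dc_step` stands for one DC update computed with the tentative constant $l$.

```python
import numpy as np

def backtracking_lipschitz(J_o, dc_step, w, l0, zeta=1e-5, eta=2.0, max_tries=50):
    """Increase the tentative Lipschitz constant l until inequality (9) holds.

    J_o(w)        -> value of the regularized cost function
    dc_step(w, l) -> candidate update w^{(t+1)} computed with tentative constant l
    """
    l = l0
    for _ in range(max_tries):
        w_next = dc_step(w, l)
        # sufficient-decrease criterion of inequality (9)
        if J_o(w_next) <= J_o(w) - 0.5 * zeta * np.linalg.norm(w_next - w) ** 2:
            break
        l *= eta              # rescale with eta > 1 and try again
    return l, w_next
```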

After evaluation of the Lipschitz constant, we move on to the formulation of the DC algorithm in our proposed method. The subderivative $\partial h(\bm{w}_{i})$ is necessary for the calculation of ${}_{\nabla}\bm{h}^{(t)}$ in process (i), which is given by equation (10),

{}_{\nabla}\bm{h}^{(t)}=L\bm{w}_{i}^{(t)}-\nabla\widetilde{J_{f}}(\bm{w}_{i}^{(t)}). (10)

Then, process (ii) for the update of $\bm{w}_{i}^{(t)}$ is represented as follows.

\bm{w}_{i}^{(t+1)}=\mathop{\rm arg~min}\limits_{\bm{w}}\left\{g(\bm{w})-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}\right\}=\mathop{\rm arg~min}\limits_{\bm{w}}\left\{\frac{L}{2}\|\bm{w}\|_{2}^{2}+\alpha\|\bm{Q}^{\#}\bm{w}\|_{1}-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}\right\}. (11)

The minimization problem in equation (11) can be regarded as a generalized lasso problem [29], whose solution cannot be obtained analytically. Hence, we find the solution numerically by the alternating direction method of multipliers (ADMM) [30]. In the application of ADMM, $\bm{w}_{i}^{(t+1)}$ is computed by minimizing the augmented Lagrangian $\mathcal{L}_{\rho}(\bm{w},\bm{\gamma},\bm{\tau})$,

\mathcal{L}_{\rho}(\bm{w},\bm{\gamma},\bm{\tau})=\frac{L}{2}\|\bm{w}\|_{2}^{2}-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}+\alpha\|\bm{\gamma}\|_{1}+\bm{\tau}^{T}(\bm{Q}^{\#}\bm{w}-\bm{\gamma})+\frac{\rho}{2}\|\bm{Q}^{\#}\bm{w}-\bm{\gamma}\|_{2}^{2}. (12)

The variables $\bm{w},\bm{\gamma},\bm{\tau}$ minimizing $\mathcal{L}_{\rho}(\bm{w},\bm{\gamma},\bm{\tau})$ are alternately updated by equations (13), (14), and (15) until convergence.

\bm{w}^{[\theta+1]}=\mathop{\rm arg~min}\limits_{\bm{w}}\mathcal{L}_{\rho}(\bm{w},\bm{\gamma}^{[\theta]},\bm{\tau}^{[\theta]}), (13)
\bm{\gamma}^{[\theta+1]}=\mathop{\rm arg~min}\limits_{\bm{\gamma}}\mathcal{L}_{\rho}(\bm{w}^{[\theta+1]},\bm{\gamma},\bm{\tau}^{[\theta]}), (14)
\bm{\tau}^{[\theta+1]}=\bm{\tau}^{[\theta]}+\rho(\bm{Q}^{\#}\bm{w}^{[\theta+1]}-\bm{\gamma}^{[\theta+1]}), (15)

where the superscript $[\theta]$ denotes the $\theta$-th update in the ADMM iteration. The update of $\bm{w}^{[\theta+1]}$ can be rewritten by solving $\partial\mathcal{L}_{\rho}(\bm{w},\bm{\gamma}^{[\theta]},\bm{\tau}^{[\theta]})/\partial\bm{w}=\bm{0}$, which leads to

\bm{w}^{[\theta+1]}=(L\bm{I}_{K}+\rho{\bm{Q}^{\#}}^{T}\bm{Q}^{\#})^{-1}\left\{{}_{\nabla}\bm{h}+{\bm{Q}^{\#}}^{T}(\rho\bm{\gamma}^{[\theta]}-\bm{\tau}^{[\theta]})\right\}. (16)

Similarly, by solving $\partial\mathcal{L}_{\rho}(\bm{w}^{[\theta+1]},\bm{\gamma},\bm{\tau}^{[\theta]})/\partial\bm{\gamma}=\bm{0}$ analytically, the update of the $j$-th element of $\bm{\gamma}^{[\theta+1]}$ is represented as

\gamma_{j}^{[\theta+1]}=\mathrm{S}_{\frac{\alpha}{\rho}}\left\{(\bm{Q}^{\#})_{j,}\bm{w}^{[\theta+1]}+\frac{\tau_{j}^{[\theta]}}{\rho}\right\}, (17)

where $(\bm{Q}^{\#})_{j,}$ denotes the $j$-th row vector of $\bm{Q}^{\#}$ and $\mathrm{S}_{a}(x)$ is the soft-thresholding function defined in equation (18),

\mathrm{S}_{a}(x)=\begin{cases}x-a&\mathrm{for}\ \ x>a,\\ 0&\mathrm{for}\ \ -a\leq x\leq a,\\ x+a&\mathrm{for}\ \ x<-a.\end{cases} (18)
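To make the inner loop concrete, the following is a minimal NumPy sketch (our own rendering) of the updates (16), (17), and (15); `grad_h` denotes ${}_{\nabla}\bm{h}^{(t)}$, and the stopping rule is a simple residual check rather than the criterion used in the actual implementation.

```python
import numpy as np

def soft_threshold(x, a):
    """Element-wise soft-thresholding S_a(x) of equation (18)."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def admm_generalized_lasso(Q_sharp, grad_h, L, alpha, rho=1.0, max_iter=1000, tol=1e-8):
    """Solve equation (11): min_w  L/2 ||w||^2 + alpha ||Q# w||_1 - grad_h^T w."""
    p, K = Q_sharp.shape
    w, gamma, tau = np.zeros(K), np.zeros(p), np.zeros(p)
    # the matrix in the w-update (L I + rho Q#^T Q#) is fixed over the iterations
    M = L * np.eye(K) + rho * Q_sharp.T @ Q_sharp
    for _ in range(max_iter):
        w = np.linalg.solve(M, grad_h + Q_sharp.T @ (rho * gamma - tau))   # eq. (16)
        gamma_new = soft_threshold(Q_sharp @ w + tau / rho, alpha / rho)   # eq. (17)
        tau = tau + rho * (Q_sharp @ w - gamma_new)                        # eq. (15)
        if np.linalg.norm(gamma_new - gamma) < tol:
            gamma = gamma_new
            break
        gamma = gamma_new
    return w
```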

Finally, by combining these results, our proposed algorithm is obtained, which is summarized in Algorithm 1.

Algorithm 1 The algorithm of $\ell_{1}$-regularized ICA
Output: factorized matrices $\bm{A}\in\mathbb{R}^{p\times K}$, $\bm{S}\in\mathbb{R}^{K\times N}$
  compute $\bm{Q},\bm{Q}^{\#}$ and prepare the pre-whitened $\bm{X}$
  prepare an initial matrix $\bm{W}$ and normal random variables $\bm{\nu}$
  normalize each column of $\bm{W}$
  for $i=1$ to $K$ do
     while $t<M$ do
        update ${}_{\nabla}\bm{h}^{(t)}$ by equation (10)
        while not converged do
           update $\bm{w}^{[\theta+1]}$, $\bm{\gamma}^{[\theta+1]}$, $\bm{\tau}^{[\theta+1]}$ by equations (16), (17), (15)
           $\theta\leftarrow\theta+1$
        end while
        $\bm{w}_{i}^{(t+1)}\leftarrow\bm{w}_{i}^{[\theta+1]}$
        orthogonalize and normalize $\bm{w}_{i}^{(t+1)}$
        if $l^{(t)}$ does not satisfy the criterion in inequality (9) then
           $l^{(t)}\leftarrow\eta l^{(t)}$
           update ${}_{\nabla}\bm{h}^{(t)}$ and $\bm{w}_{i}^{(t)}$ by equations (10), (11)
        end if
     end while
  end for
  compute $\bm{A}=\bm{Q}^{\#}\bm{W}$, $\bm{S}=\bm{W}^{T}\bm{Q}\bm{X}$
  output $\bm{A}$ and $\bm{S}$

3.2 Convergence analysis

We can derive a convergence condition for our proposed algorithm for $\ell_{1}$-regularized ICA. The theorem and the proof of the convergence condition are given in the following.

For the general DC algorithm, the non-increasing condition $J_{o}(\bm{w}_{i}^{(t+1)})\leq J_{o}(\bm{w}_{i}^{(t)})$ must be satisfied for the updated variable $\bm{w}_{i}^{(t+1)}$ [31]. When this condition is satisfied, it is guaranteed that the sequence $\{\bm{w}^{(t)}\}_{t\in\mathbb{Z}_{\geq 0}}$ converges to a point $\bm{w}^{*}$ satisfying the subgradient condition

\varnothing\neq\partial g(\bm{w}^{*})\cap\partial h(\bm{w}^{*}). (19)

The significant difference of the DC algorithm in our $\ell_{1}$-regularized ICA is the orthogonalization and normalization after each variable update, which requires a slight change in the proof of the convergence condition of the DC algorithm. For this purpose, we should distinguish the variable after orthogonalization and normalization, which is denoted by $\widehat{\bm{w}}_{i}^{(t+1)}$. With this variable, we can show that our proposed algorithm converges to a point satisfying equation (19) if the inequality $J_{o}(\widehat{\bm{w}}_{i}^{(t+1)})\leq J_{o}(\bm{w}_{i}^{(t)})$ is satisfied.

The convergence theorem for our proposed algorithm is stated as follows. It should be commented that the convergence condition in the theorem is a sufficient condition, which can be relaxed.

Theorem 1.

If the following condition is satisfied at each step $(t)$, the sequence $\{\bm{w}^{(t)}\}_{t\in\mathbb{Z}_{\geq 0}}$ converges to a certain point by our proposed algorithm:

\left\{\begin{split}C\leq 1,&\quad\mathrm{if}\ \min_{\bm{w}}\{g(\bm{w})-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}\}>0,\\ C\geq 1,&\quad\mathrm{if}\ \min_{\bm{w}}\{g(\bm{w})-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}\}<0,\end{split}\right. (20)

where $C$ is a constant defined by

C=\max\left\{\sqrt{\lambda_{\mathrm{max}}\left((\bm{R}_{i}^{(t+1)})^{T}\bm{R}_{i}^{(t+1)}\right)},\ \sqrt{\frac{N\lambda_{\mathrm{max}}\left({\bm{Q}^{\#}}^{T}\bm{Q}^{\#}\right)}{\lambda_{\mathrm{min}}\left({\bm{Q}^{\#}}^{T}\bm{Q}^{\#}\right)}}\frac{1}{\|\bm{w}_{i}^{(t+1)}\|_{2}},\ \left(\frac{({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t+1)}}{\|{}_{\nabla}\bm{h}^{(t)}\|_{2}\|\bm{w}_{i}^{(t+1)}\|_{2}}\right)^{-1}\frac{1}{\|\bm{w}_{i}^{(t+1)}\|_{2}}\right\}. (21)

In this definition, the operation of orthogonalization is expressed by multiplication with a square matrix $\bm{R}_{i}^{(t+1)}\in\mathbb{R}^{K\times K}$, namely $\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}$, and $\lambda_{\mathrm{max}}(\cdot),\lambda_{\mathrm{min}}(\cdot)$ are the maximum and minimum eigenvalues of the square matrix in the parentheses, respectively.

Proof.

From the convexity of the function $h$,

h(\widehat{\bm{w}}_{i}^{(t+1)})\geq h(\bm{w}_{i}^{(t)})+({}_{\nabla}\bm{h}^{(t)})^{T}(\widehat{\bm{w}}_{i}^{(t+1)}-\bm{w}_{i}^{(t)}). (22)

The $i$-th vector $\bm{w}_{i}^{(t+1)}$ is first orthogonalized to the set of vectors $\{\bm{w}_{j}^{(t+1)}\}_{j\in\{1,\ldots,i-1\}}$, then normalized. After orthogonalization, the normalized vector is calculated as $\widehat{\bm{w}}_{i}^{(t+1)}=\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}/\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}$. For this vector $\widehat{\bm{w}}_{i}^{(t+1)}$, the upper bound of $J_{o}(\widehat{\bm{w}}_{i}^{(t+1)})$ is estimated as

J_{o}(\widehat{\bm{w}}_{i}^{(t+1)})=g(\widehat{\bm{w}}_{i}^{(t+1)})-h(\widehat{\bm{w}}_{i}^{(t+1)})
\overset{(22)}{\leq}g(\widehat{\bm{w}}_{i}^{(t+1)})-\left\{h(\bm{w}_{i}^{(t)})+({}_{\nabla}\bm{h}^{(t)})^{T}(\widehat{\bm{w}}_{i}^{(t+1)}-\bm{w}_{i}^{(t)})\right\}
=g(\widehat{\bm{w}}_{i}^{(t+1)})-({}_{\nabla}\bm{h}^{(t)})^{T}\widehat{\bm{w}}_{i}^{(t+1)}-h(\bm{w}_{i}^{(t)})+({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t)}
\overset{(7)}{=}\frac{L}{2}\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}+\alpha\frac{\|\bm{Q}^{\#}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{1}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}}-\frac{({}_{\nabla}\bm{h}^{(t)})^{T}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}}-h(\bm{w}_{i}^{(t)})+({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t)}
\leq C\min_{\bm{w}}\left\{g(\bm{w})-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}\right\}-h(\bm{w}_{i}^{(t)})+({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t)}, (23)

where the constant $C$ is determined by the upper bounds of the following three terms: $\frac{L}{2}\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}$, $\alpha\frac{\|\bm{Q}^{\#}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{1}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}}$, and $-\frac{({}_{\nabla}\bm{h}^{(t)})^{T}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}}$. These three terms can be bounded as follows.

\frac{L}{2}\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}\leq\left(\sqrt{\lambda_{\mathrm{max}}\left((\bm{R}_{i}^{(t+1)})^{T}\bm{R}_{i}^{(t+1)}\right)}\right)\frac{L}{2}\|\bm{w}_{i}^{(t+1)}\|_{2}, (24)

\alpha\frac{\|\bm{Q}^{\#}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{1}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}}=\alpha\frac{\|\bm{Q}^{\#}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{1}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}\|\bm{Q}^{\#}\bm{w}_{i}^{(t+1)}\|_{1}}\|\bm{Q}^{\#}\bm{w}_{i}^{(t+1)}\|_{1}\leq\alpha\frac{\sqrt{N}\|\bm{Q}^{\#}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}\|\bm{Q}^{\#}\bm{w}_{i}^{(t+1)}\|_{2}}\|\bm{Q}^{\#}\bm{w}_{i}^{(t+1)}\|_{1}\leq\left(\sqrt{\frac{N\lambda_{\mathrm{max}}\left({\bm{Q}^{\#}}^{T}\bm{Q}^{\#}\right)}{\lambda_{\mathrm{min}}\left({\bm{Q}^{\#}}^{T}\bm{Q}^{\#}\right)}}\frac{1}{\|\bm{w}_{i}^{(t+1)}\|_{2}}\right)\alpha\|\bm{Q}^{\#}\bm{w}_{i}^{(t+1)}\|_{1}, (25)

-\frac{({}_{\nabla}\bm{h}^{(t)})^{T}\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}}{\|\bm{R}_{i}^{(t+1)}\bm{w}_{i}^{(t+1)}\|_{2}}\leq\|{-{}_{\nabla}\bm{h}^{(t)}}\|_{2}=\left(\left(\frac{(-{}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t+1)}}{\|{-{}_{\nabla}\bm{h}^{(t)}}\|_{2}\|\bm{w}_{i}^{(t+1)}\|_{2}}\right)^{-1}\frac{1}{\|\bm{w}_{i}^{(t+1)}\|_{2}}\right)(-{}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t+1)}. (26)

Note that the factor $\frac{(-{}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t+1)}}{\|{-{}_{\nabla}\bm{h}^{(t)}}\|_{2}\|\bm{w}_{i}^{(t+1)}\|_{2}}$ on the right hand side of inequality (26) is a cosine similarity. The inequality in (24) follows from the upper bound of the general quadratic form. The first inequality in (25) is derived from the relation between the $\ell_{1}$-norm and the $\ell_{2}$-norm, namely $\|\bm{x}\|_{2}\leq\|\bm{x}\|_{1}\leq\sqrt{N}\|\bm{x}\|_{2}$ for an $N$-dimensional vector $\bm{x}$, and the second inequality follows from the upper and lower bounds of the general quadratic form. The inequality in (26) is the Cauchy-Schwarz inequality. By these results and from the definition of $\bm{w}_{i}^{(t+1)}$ in (11), the last inequality in (23) is shown using the constant $C$ in (21). Then, if the condition (20) is satisfied, one has

C\min_{\bm{w}}\left\{g(\bm{w})-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}\right\}-h(\bm{w}_{i}^{(t)})+({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t)}\leq g(\bm{w}_{i}^{(t)})-({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t)}-h(\bm{w}_{i}^{(t)})+({}_{\nabla}\bm{h}^{(t)})^{T}\bm{w}_{i}^{(t)}=g(\bm{w}_{i}^{(t)})-h(\bm{w}_{i}^{(t)})=J_{o}(\bm{w}_{i}^{(t)}). (27)

This proves the property $J_{o}(\widehat{\bm{w}}_{i}^{(t+1)})\leq J_{o}(\bm{w}_{i}^{(t)})$, which guarantees the convergence. ∎

4 Numerical experiment

4.1 Application to synthetic data

To validate our proposed method, we apply it to synthetic data and evaluate its performance. In addition, we compare the performance of our method with the widely-used FastICA algorithm. In our numerical experiments, we use the scikit-learn library (version 1.2.2) for Python (version 3.10.11).

Experimental conditions and performance measure

In this experiment, we apply our proposed method and FastICA to 4 synthetic datasets $\bm{X}^{(1)},\bm{X}^{(2)},\bm{X}^{(3)}$, and $\bm{X}^{(4)}$, which are generated as follows; a code sketch of the generating procedure follows the list. In particular, in constructing $\bm{X}^{(3)}$ and $\bm{X}^{(4)}$, the parameter values are chosen to model large-size real fMRI data. (See also Table 1.)

Table 1: Summary of synthetic data
data | dist. of $\bm{S}^{*}$ | $p$ | $\chi$ | dist. of $\bm{E}$
$\bm{X}^{(1)}$ | 80% uniform, 20% Laplace | 10 | 0.8 | $\mathcal{N}(0,1)$
$\bm{X}^{(2)}$ | 80% uniform, 20% log-normal | 10 | 0.8 | $\mathcal{N}(0,1)$
$\bm{X}^{(3)}$ | 80% uniform, 20% log-normal | $10^{4}$ | 0.999 | $\mathcal{N}(0,1)$
$\bm{X}^{(4)}$ | 80% uniform, 20% log-normal | $10^{4}$ | 0.999 | $\mathcal{N}(0,2^{2})$
  1. First, the ground-truth signal source matrix $\bm{S}^{*}\in\mathbb{R}^{K\times N}$ is generated, whose elements are drawn from the Laplace distribution, the log-normal distribution, or the uniform distribution defined by

     \mathcal{L}(s_{ij}^{*};\mu,\sigma)=\frac{1}{2\sigma}\exp\left(-\frac{|s_{ij}^{*}-\mu|}{\sigma}\right), (28)
     \mathrm{Lognormal}(s_{ij}^{*};\mu,\sigma)=\frac{1}{s_{ij}^{*}\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(\log s_{ij}^{*}-\mu)^{2}}{2\sigma^{2}}\right), (29)
     \mathcal{U}(s_{ij}^{*};c_{1},c_{2})=\begin{cases}\frac{1}{c_{2}-c_{1}}&\mathrm{for}\ c_{1}\leq s^{*}_{ij}\leq c_{2},\\ 0&\mathrm{otherwise.}\end{cases} (30)

     Note that these are non-gaussian distributions. For constructing $\bm{X}^{(1)}$, 20% of the vectors in $\bm{S}^{*}$ have elements following the Laplace distribution and 80% have elements following the uniform distribution. Similarly, for $\bm{X}^{(2)},\bm{X}^{(3)}$, and $\bm{X}^{(4)}$, 20% of the vectors in $\bm{S}^{*}$ have elements following the log-normal distribution and 80% have elements following the uniform distribution. For all synthetic data, the dimensions of $\bm{S}^{*}$ are set as $K=10,N=500$.

  2. Next, the ground-truth mixing matrix $\bm{A}^{*}\in\mathbb{R}^{p\times K}$ is generated, whose elements are drawn from the Bernoulli-gaussian distribution

     \mathrm{BG}(a_{ij}^{*})=\chi\delta(a_{ij}^{*})+(1-\chi)\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{{a_{ij}^{*}}^{2}}{2}\right). (31)

     For $\bm{X}^{(1)}$ and $\bm{X}^{(2)}$, we set $\chi=0.8,p=10$, namely the data size is small. In contrast, we set $p=10^{4}$ for $\bm{X}^{(3)}$ and $\bm{X}^{(4)}$ to model real fMRI data, because the number of voxels in fMRI data is often of the order of $10^{4}$ to $10^{5}$. Moreover, in real fMRI data analysis as mentioned in section 4.2, elements of brain activity above 3 standard deviations from the mean can be interpreted as clear characteristic activations; namely, the number of elements representing strong activations is typically of the order of $10^{1}$. From this observation, we set $\chi=0.999$ for $\bm{X}^{(3)}$ and $\bm{X}^{(4)}$.

  3. The observed matrix $\bm{X}\in\mathbb{R}^{p\times N}$ is constructed from $\bm{S}^{*}$ and $\bm{A}^{*}$ by the matrix product $\bm{X}=\bm{A}^{*}\bm{S}^{*}+\bm{E}$, where $\bm{E}\in\mathbb{R}^{p\times N}$ is a noise matrix whose elements follow the standard gaussian distribution $\mathcal{N}(0,1)$ for $\bm{X}^{(1)},\bm{X}^{(2)}$, and $\bm{X}^{(3)}$. For $\bm{X}^{(4)}$, the elements of the noise matrix $\bm{E}$ follow $\mathcal{N}(0,2^{2})$, namely the noisier case.
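A minimal sketch of this generating procedure (steps 1-3 above, with illustrative parameter values and our own function names) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_synthetic(p=10, K=10, N=500, chi=0.8, noise_std=1.0, heavy="laplace"):
    """Generate X = A* S* + E as in steps 1-3 (80% uniform / 20% heavy-tailed rows)."""
    n_heavy = K // 5                         # 20% of the K source rows
    S = rng.uniform(0.0, 1.0, size=(K, N))   # uniform rows with c1=0, c2=1
    if heavy == "laplace":
        S[:n_heavy] = rng.laplace(loc=0.0, scale=1.0, size=(n_heavy, N))
    else:                                    # log-normal rows
        S[:n_heavy] = rng.lognormal(mean=0.0, sigma=1.0, size=(n_heavy, N))
    # Bernoulli-gaussian mixing matrix: zero with probability chi, N(0,1) otherwise
    mask = rng.random(size=(p, K)) >= chi
    A = mask * rng.normal(size=(p, K))
    E = rng.normal(scale=noise_std, size=(p, N))
    return A @ S + E, A, S

X1, A_true, S_true = generate_synthetic(heavy="laplace")                              # like X^(1)
X4, _, _ = generate_synthetic(p=10**4, chi=0.999, noise_std=2.0, heavy="lognormal")   # like X^(4)
```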

After application of our proposed method to the synthetic matrix $\bm{X}$, the result is evaluated by the performance measures listed below; a compact code sketch of these measures follows the list.

  • sparsity
    The sparsity of the estimated mixing matrix $\bm{A}$ is defined in equation (32),

    \mathrm{Sparsity}(\bm{A})=\frac{\sum_{i=1}^{p}\sum_{j=1}^{K}\mathbb{I}(\widehat{a}_{ij}<\epsilon)}{pK}, (32)

    where $\widehat{a}_{ij}$ is the $(i,j)$-element of the normalized $\bm{A}$ and $\mathbb{I}(\cdot)$ is the indicator function. In our experiment, we set the small threshold $\epsilon=10^{-2}$. Under the experimental conditions mentioned above, the sparsity of the ground-truth mixing matrix $\bm{A}^{*}$ is 0.8 after truncation of the elements.

  • kurtosis
    For non-gaussianity, the mean of the absolute values of kurtosis (denoted by MAK) of the estimated signal matrix $\bm{S}$ is used,

    \mathrm{MAK}(\bm{S})=\frac{1}{K}\sum_{i=1}^{K}\left|\frac{\frac{1}{N}\sum_{j=1}^{N}s_{ij}^{4}}{(\frac{1}{N}\sum_{j=1}^{N}s_{ij}^{2})^{2}}-3\right|, (33)

    which is zero for a matrix whose elements follow a gaussian distribution.

  • RMSE as the reconstruction error
    For validity as an MF solution, the root mean squared error (RMSE) between the original observed matrix $\bm{X}$ and the product of the estimated matrices $\bm{A}\bm{S}$ is used to describe the reconstruction error,

    \mathrm{RMSE}_{\bm{X}}=\sqrt{\frac{1}{pN}\sum_{i=1}^{p}\sum_{j=1}^{N}\left(x_{ij}-\sum_{l=1}^{K}a_{il}s_{lj}\right)^{2}}. (34)

  • reconstruction success rate
    In general MF problems including ICA, the matrices obtained by permuting the vectors in a factorized matrix solution are also a solution. Therefore, it is not appropriate to measure the RMSE between the ground-truth mixing matrix $\bm{A}^{*}$ and the estimated mixing matrix $\bm{A}$ directly. To evaluate the difference between $\bm{A}$ and $\bm{A}^{*}$ while accounting for the permutation symmetry of the MF solution, the Amari distance [32] is defined to measure how close the product $\bm{P}=\bm{W}^{T}\bm{Q}\bm{A}^{*}$ is to a permutation matrix,

    \mathrm{AD}(\bm{P})=\sum_{i=1}^{K}\left(\sum_{j=1}^{K}\frac{|p_{ij}|}{\underset{l}{\max}|p_{il}|}-1\right)+\sum_{j=1}^{K}\left(\sum_{i=1}^{K}\frac{|p_{ij}|}{\underset{l}{\max}|p_{lj}|}-1\right). (35)

    The Amari distance is regarded as a significant performance measure in ICA, because it is known that $\bm{P}$ is equivalent to a permutation matrix if the signal sources are separated perfectly. When an ICA algorithm is executed in $n$ trials with a given observed matrix $\bm{X}$ and different initial matrices $\bm{W}$, the reconstruction success rate is defined using the Amari distance,

    \mathrm{SR}=\frac{\sum_{j=1}^{n}\mathbb{I}[\mathrm{AD}(\bm{P}^{(j)})<\psi]}{n}, (36)

    where the superscript $(j)$ denotes the $j$-th trial.
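A compact sketch of these performance measures (equations (32)-(36); our own code, with the normalization of $\bm{A}$ and the construction of $\bm{P}$ left to the caller) is given below.

```python
import numpy as np

def sparsity(A, eps=1e-2):
    """Fraction of (normalized) entries of A below the threshold, equation (32)."""
    return np.mean(np.abs(A) < eps)

def mak(S):
    """Mean absolute kurtosis of the rows of S, equation (33)."""
    m2 = np.mean(S**2, axis=1)
    m4 = np.mean(S**4, axis=1)
    return np.mean(np.abs(m4 / m2**2 - 3.0))

def rmse(X, A, S):
    """Reconstruction error between X and A S, equation (34)."""
    return np.sqrt(np.mean((X - A @ S) ** 2))

def amari_distance(P):
    """Amari distance of P = W^T Q A*, equation (35)."""
    absP = np.abs(P)
    rows = np.sum(absP / absP.max(axis=1, keepdims=True), axis=1) - 1.0
    cols = np.sum(absP / absP.max(axis=0, keepdims=True), axis=0) - 1.0
    return rows.sum() + cols.sum()

def success_rate(P_list, psi):
    """Fraction of trials with Amari distance below the threshold psi, equation (36)."""
    return float(np.mean([amari_distance(P) < psi for P in P_list]))
```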

In our experiment, these quantities are evaluated under various $\alpha$ and various sparsity levels of $\bm{Q}^{\#}$. For simplicity, the sparsity of $\bm{Q}^{\#}$ is denoted by $\kappa=1-k_{0}/K$ in the following. The parameters other than $\alpha$ and $\kappa$ are fixed as follows: $\mu=0,\sigma=1,c_{1}=0,c_{2}=1,M=500,\rho=1,\zeta=10^{-5},\eta=10^{-3}$. Sparsity, kurtosis, and RMSE are evaluated by our algorithm in 10 trials under the same observed matrix $\bm{X}$ and different initial matrices $\bm{W}$. The statistical error over multiple trials is shown by the error bars in the figures.

Result 1: General property of factorized matrices by the proposed method

We depict the behaviours of the sparsity of the estimated mixing matrix $\bm{A}$, the kurtosis of the estimated signal matrix $\bm{S}$, and the reconstruction error obtained by our proposed method and FastICA in Figure 1. The results for $\bm{X}^{(1)}$ (top, under the mixture of Laplace/uniform distributions) and $\bm{X}^{(2)}$ (bottom, log-normal/uniform distributions) are compared. From this figure, there is no significant difference between the mixtures of Laplace/uniform distributions and log-normal/uniform distributions. This implies that the result does not depend on the choice of non-gaussian distribution.

Figure 1: The behaviors of $\mathrm{Sparsity}(\bm{A})$, $\mathrm{MAK}(\bm{S})$, and $\mathrm{RMSE}_{\bm{X}}$ versus $\alpha$ (left column) or $\kappa$ (right column) for $\bm{X}^{(1)}$ (top, mixture of Laplace/uniform) and $\bm{X}^{(2)}$ (bottom, mixture of log-normal/uniform): In both cases of $\bm{X}^{(1)}$ and $\bm{X}^{(2)}$, the red ($\kappa=0$) and black (FastICA) lines overlap on the horizontal axis in the bottom left panel, and all lines by our method ($\alpha=10^{-5},10^{-3},10^{-1}$) overlap in the bottom right panel.

We move on to the details of the results in Figure 1. First, we discuss the sparsity of the estimated mixing matrix $\bm{A}$. The sparsity of $\bm{A}$ by our proposed method is larger than that by FastICA regardless of the values of $\alpha,\kappa$. We also observe that the sparsity by our method is very large and close to 0.9 under larger $\alpha$ and $\kappa=0.9$. When the value of $\alpha$ is fixed, large sparsity is obtained for large $\kappa$ under $\alpha=10^{-2}$ or $10^{1}$. In general, $\kappa$ must be larger for larger sparsity of $\bm{A}$, while $\alpha$ should be tuned carefully.

Next, we observe the behavior of $\mathrm{MAK}$. $\mathrm{MAK}$ by our method tends to decrease for larger $\alpha$, while $\mathrm{MAK}$ by FastICA does not. This may be because our ICA solution under larger $\alpha$ differs from the solution of conventional FastICA. On the other hand, there is no significant correlation between $\kappa$ and $\mathrm{MAK}$. Therefore, for larger non-gaussianity, $\kappa$ can take an arbitrary value, while $\alpha$ should be appropriately adjusted.

We also mention the result of the reconstruction error. In Figure 1, no change in $\mathrm{RMSE}_{\bm{X}}$ is observed even when $\alpha$ is varied, and $\mathrm{RMSE}_{\bm{X}}$ depends only on $\kappa$. It is easy to prove analytically, using the orthogonality of the matrix $\bm{W}$, that $\mathrm{RMSE}_{\bm{X}}$ depends only on the approximation accuracy of $\bm{Q}^{\#}$, which is also indicated by the numerical result. In addition, for fixed $\alpha$, $\mathrm{RMSE}_{\bm{X}}$ tends to increase for larger $\kappa$. To summarize, $\alpha$ can be chosen arbitrarily for better reconstruction error, while $\kappa$ must be carefully adjusted. Additionally, since larger sparsity of $\bm{A}$ is obtained under larger $\kappa$, large sparsity and small reconstruction error are in a trade-off relation.

Lastly, we give the result of the reconstruction success rate. Before evaluating SR of our method, we evaluate the Amari distance by FastICA in 20 trials with different initial matrices, whose average is $43.04\pm(8.87\times 10^{-3})$. Based on this result, the threshold of success is set as $\psi=35$ in our experiment, which is an appropriate value because it is below the average Amari distance by FastICA. For the reconstruction success rate, we evaluate our method in 20 trials with different initial matrices for $\alpha=1.0\times 10^{-4},2.0\times 10^{-4},\ldots,1.0\times 10^{-3}$. The parameter $\kappa$ is set to 0.9 in all cases. In Figure 2, we show the frequency of samples and SR under various sparsity levels of the estimated matrix $\bm{A}$. The largest SR is observed in the range $0.75\leq\mathrm{Sparsity}(\bm{A})<0.85$. As stated in the experimental conditions, the sparsity of the ground-truth mixing matrix $\bm{A}^{*}$ is 0.8. Therefore, this result indicates that the ground-truth mixing matrix $\bm{A}^{*}$ can be obtained by our method with high probability, especially when the sparsity of $\bm{A}$ is close to that of $\bm{A}^{*}$.

Figure 2: Frequency and SR versus $\mathrm{Sparsity}(\bm{A})$: The blue bars and red line indicate the frequency of $\mathrm{Sparsity}(\bm{A})$ and the SR in each bin, respectively. 10 samples with $\mathrm{Sparsity}(\bm{A})<0.05$ are not displayed in this figure.

Result 2: Application to large synthetic data modelling fMRI

We also apply our method and FastICA to the synthetic data $\bm{X}^{(3)}$ and $\bm{X}^{(4)}$, which model real fMRI data with larger dimension. Similarly to the applications to $\bm{X}^{(1)}$ and $\bm{X}^{(2)}$, the behaviors of the sparsity of $\bm{A}$, the kurtosis of $\bm{S}$, and the reconstruction error are depicted in Figure 3.

Figure 3: The behaviors of $\mathrm{Sparsity}(\bm{A})$, $\mathrm{MAK}(\bm{S})$, and $\mathrm{RMSE}_{\bm{X}}$ versus $\alpha$ (left column) or $\kappa$ (right column) for the large-size data $\bm{X}^{(3)}$ (top) and $\bm{X}^{(4)}$ (bottom, noisy): In both cases of $\bm{X}^{(3)}$ and $\bm{X}^{(4)}$, the red ($\kappa=0$) and black (FastICA) lines overlap on the horizontal axis in the bottom left panel, and all lines by our method ($\alpha=10^{-5},10^{-3},10^{-1}$) overlap in the bottom right panel.

The result is summarized as follows. First, the sparsity of $\bm{A}$ is almost unchanged even when $\alpha$ is varied, especially in the application to $\bm{X}^{(3)}$. As for the dependence on $\kappa$, the sparsity takes a large value at $\kappa=0$, suddenly decreases, and then gradually tends to increase as $\kappa$ is increased from 0. In the application of $\ell_{1}$-regularized MF, a similar phenomenon of little change in sparsity when varying $\alpha$ has already been observed in a previous study [33]: if the synthetic data $\bm{X}$ is generated by a very sparse ground-truth factorized matrix $\bm{A}^{*}$, application of SparsePCA does not significantly change the sparsity of the resulting factorized matrix even when the sparsity parameter in SparsePCA is varied. Since the evaluation of sparsity here is performed similarly to SparsePCA in that study, we conjecture that a similar behavior of sparsity is obtained in this experiment. In the application to $\bm{X}^{(4)}$, the change of sparsity depends on $\kappa$ rather than $\alpha$, which is similar to $\bm{X}^{(3)}$. Note that the sparsity by FastICA is not so large in this case, while our method can sparsify the factorized matrix. This indicates the validity of our method for obtaining a sparse factorized matrix from noisy large-size data.

Next, for $\bm{X}^{(3)}$, there is no clear trend between the kurtosis of $\bm{S}$ and either of the sparsity parameters $\alpha,\kappa$, while the statistical error of the kurtosis is very large under $\alpha=10^{-1}$ except at $\kappa=0.9$. This result suggests that the parameters should be tuned carefully for a stable solution with large kurtosis if the data is less noisy. In contrast, for $\bm{X}^{(4)}$, the kurtosis tends to increase from $\kappa=0.6$ to 0.9. From this result, $\kappa$ should be larger for larger kurtosis when our method is applied to noisy large-size data.

The behavior of the reconstruction error is similar to Figure 1, while its value is much larger than those in Figure 1. This is caused by the fact that the Moore-Penrose pseudo-inverse matrix cannot approximate the true inverse matrix appropriately under $p>N$, which is a problem inherent in the original ICA.

Lastly, we compare the result of the application to $\bm{X}^{(4)}$ with FastICA. Before the comparison, we evaluate the sparsity and AD by FastICA in 20 trials with different initial matrices. The average sparsity is $0.765\pm(2.85\times 10^{-4})$ and the average AD is $38.82\pm 1.17$. Then, in the application of our method here, the parameter $\kappa$ is set to $0.7,0.8,0.9$, and the experiment is conducted in 20 trials with different initial matrices for each value of $\kappa$ (60 trials in total). The parameter $\alpha$ is fixed at $10^{-1}$. Table 2 shows the result by our method. The threshold in the definition of SR is set at $\psi=38.82$ for comparison with FastICA. The result in the table suggests that a factorized matrix with lower AD than FastICA can be obtained with approximately 50% probability when the sparsity is larger than that of FastICA. Namely, our proposed method can obtain a factorized matrix closer to the ground-truth mixing matrix $\bm{A}^{*}$ than FastICA with a certain probability. This suggests that our method is competitive with the original FastICA for estimating sparse and independent factorized matrices from large-size noisy data. The validity of our method for real fMRI data is discussed in section 4.2.

Table 2: Result of application to $\bm{X}^{(4)}$ (noisy data)
sparsity of $\bm{A}$ | frequency | SR
smaller than FastICA | 19 | 0.211
larger than FastICA | 41 | 0.488

4.2 Application to real-world data

Next, to verify the practical utility of our method, we conduct an experiment on feature extraction from real fMRI data.

Dataset and performance measure

We apply our method to the Haxby dataset [34], a task-related fMRI dataset recording human responses to 8 images. The 8 images are as follows: shoe, house, scissors, scrambledpix, face, bottle, cat, and chair. In the fMRI data acquisition experiment, one image is shown to a test subject for a while, then the next image is shown after an interval, until all 8 images have been shown sequentially. The whole Haxby dataset includes fMRI data of 12 trials from 6 test subjects. A previous study [13] showed no significant difference among subjects in this dataset. Therefore, in our experiment, we apply our method and FastICA to one trial of data from test subject No. 2, which was acquired from 39912 voxels over 121 scanning time steps. Namely, the data matrix size is $39912\times 121$, or $\bm{X}\in\mathbb{R}^{39912\times 121}$.
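For reference, the Haxby data can be loaded and arranged into a voxels-by-time matrix with Nilearn roughly as follows. The field names, masking choice, and subject selection in this sketch are our assumptions and not a description of the authors' preprocessing.

```python
import numpy as np
from nilearn.datasets import fetch_haxby
from nilearn.maskers import NiftiMasker

# download one subject of the Haxby dataset (field names are assumptions)
haxby = fetch_haxby(subjects=[2])

# mask and standardize the 4D fMRI run, yielding a (time x voxels) array
masker = NiftiMasker(mask_img=haxby.mask, standardize=True)
time_by_voxel = masker.fit_transform(haxby.func[0])

# arrange as X in R^{p x N}: voxels x scanning time steps; extracting the single
# 121-scan trial used in the paper is a further selection step not shown here
X = time_by_voxel.T
print(X.shape)
```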

As performance measures, we use sparsity in equation (32) and correlation defined by

\mathrm{Correlation}(\bm{s}_{i},\bm{d}^{\mathrm{REST}})=\frac{\left|(\bm{s}_{i}-\bar{\bm{s}_{i}})^{T}(\bm{d}^{\mathrm{REST}}-\overline{\bm{d}^{\mathrm{REST}}})\right|}{\|\bm{s}_{i}-\bar{\bm{s}_{i}}\|_{2}\|\bm{d}^{\mathrm{REST}}-\overline{\bm{d}^{\mathrm{REST}}}\|_{2}}, (37)

where the overline denotes the arithmetic average of all elements in the vector. The vector $\bm{s}_{i}$ is the $i$-th temporal feature vector in the matrix $\bm{S}$. The column vector $\bm{d}^{\mathrm{REST}}$ represents the ground-truth timing of the resting state: the $j$-th element of $\bm{d}^{\mathrm{REST}}$ is 1 if no image is shown to the test subject at the $j$-th time step, and 0 if one of the images is shown. Therefore, the quantity in equation (37) measures the correlation between the ground-truth timing vector of the resting state $\bm{d}^{\mathrm{REST}}$ and the temporal feature vector $\bm{s}_{i}$.
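A direct rendering of equation (37) in code (our own sketch; building $\bm{d}^{\mathrm{REST}}$ from the stimulus labels is assumed to be done elsewhere) is:

```python
import numpy as np

def correlation_with_rest(s_i, d_rest):
    """Absolute Pearson-type correlation of equation (37)."""
    s_c = s_i - s_i.mean()
    d_c = d_rest - d_rest.mean()
    return np.abs(s_c @ d_c) / (np.linalg.norm(s_c) * np.linalg.norm(d_c))

# example usage: d_rest has 1 at resting time steps and 0 while an image is shown
# corr = correlation_with_rest(S[17], d_rest)   # 18th temporal feature (0-based index 17)
```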

Result

Figure 4: Correlation by our method: The values of Correlation between each temporal feature vector in $\bm{S}$ and the ground-truth timing vector $\bm{d}^{\mathrm{REST}}$ under various $\alpha,\kappa$ are shown.

The result of the correlation is depicted in Figure 4 under various $\alpha,\kappa$: $\alpha=10^{-6},10^{-2},10^{1}$ and $\kappa=0,0.5,0.9$. The other parameters are set as follows: $K=20,M=500,\rho=1,\zeta=10^{-5},\eta=10^{-3}$. From this figure, the value of the correlation is at most 0.6 for $\kappa=0,0.5$, whereas it sometimes exceeds 0.6 under $\alpha=10^{-2},10^{1}$ for $\kappa=0.9$. In addition, to confirm the validity of our method, we apply FastICA and our proposed method under $\kappa=0.9,\alpha=10^{1}$ to the same data 50 times with different initial matrices, then conduct Student's two-sample $t$-test for the first, second, and third largest values of Correlation. The result is shown in Table 3, where a negative value means that the Correlation by our method is larger. From this result, there is a significant difference between the results by our method and FastICA, and the first and second largest values of Correlation by our method are clearly larger than those by FastICA. Although the third largest value by FastICA is larger, we consider this sufficient to claim the advantage of our method over FastICA.

Table 3: Result of Student's two-sample $t$-test
value of Correlation | 1st | 2nd | 3rd
$t$-statistic | -25.7 | -3.33 | 5.43
$p$-value | $4.31\times 10^{-30}$ | $1.52\times 10^{-3}$ | $1.14\times 10^{-6}$

Next, we visualize the feature vectors extracted by our method. The 18th row vector of $\bm{S}$ by our method under $\alpha=10^{1},\kappa=0.9$ is depicted in Figure 5A. Note that the 18th row vector has the largest Correlation in our result under $\alpha=10^{1},\kappa=0.9$, as shown in Figure 4. The timing of showing an image to the test subject, which reflects the information in the vector $\bm{d}^{\mathrm{REST}}$, is also shown in the figure. From this result, the temporal changes between resting and non-resting states can be tracked easily in the feature vector obtained by our method. Additionally, the spatial map of the 18th spatial feature vector in $\bm{A}$ by our method is shown on cross sections of the brain in Figure 5B. Note that this spatial feature vector is the counterpart of the 18th temporal feature vector in Figure 5A. For the spatial map, the column vectors of $\bm{A}$ are normalized to have zero mean and unit variance, and elements within 3 standard deviations of the mean are truncated to zero. For visualization of the spatial map, we use the Nilearn library (version 0.10.1) for Python. In this figure, several active regions under the resting state or the response to some visual stimuli can be observed. In particular, strong activations are observed in the cerebellum, which is known to be activated by visual stimuli. This result indicates that our method with a high-sparsity setting can identify brain regions involved in the processing of visual stimuli with high accuracy.
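Spatial maps of this kind can be reproduced in outline with Nilearn as sketched below (our own code, not the authors' script); `masker` is assumed to be the fitted NiftiMasker from the loading sketch above, and the thresholding mirrors the 3-standard-deviation rule described in the text.

```python
import numpy as np
from nilearn import plotting

def show_spatial_map(masker, a_col):
    """Normalize a spatial feature vector, truncate within 3 SD, and plot it."""
    a = (a_col - a_col.mean()) / a_col.std()
    a[np.abs(a) < 3.0] = 0.0                    # keep only strong activations
    img = masker.inverse_transform(a)           # voxel vector -> 3D brain image
    plotting.plot_stat_map(img, display_mode="z", cut_coords=5)
    plotting.show()

# example usage: show_spatial_map(masker, A[:, 17])   # 18th spatial feature (0-based)
```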

To compare our proposed method and FastICA, we show the behaviors of two feature vectors given by FastICA, namely the 3rd and 17th vectors. For clarity, extracted feature vectors are denoted with a superscript indicating the method. For example, the 3rd vector in $\bm{S}$ by FastICA is written as $\bm{s}_{3}^{(\mathrm{FastICA})}$. First, the vector $\bm{s}_{3}^{(\mathrm{FastICA})}$ and the spatial map of $\bm{a}_{3}^{(\mathrm{FastICA})}$ are depicted in Figures 6A and 6B, respectively. Note that the vector $\bm{s}_{3}^{(\mathrm{FastICA})}$ is the one most strongly correlated with the timing vector of visual stimuli, as in Figure 4. However, from Figure 6A, resting and non-resting regions in the fMRI data cannot be discriminated from the shape of the feature vector by FastICA. In addition, we cannot clearly identify which parts of the brain are strongly activated from Figure 6B. We also depict the vector $\bm{s}_{17}^{(\mathrm{FastICA})}$ and the spatial map of $\bm{a}_{17}^{(\mathrm{FastICA})}$ in Figures 7A and 7B, respectively. Note that the vector $\bm{a}_{17}^{(\mathrm{FastICA})}$ is the one most strongly correlated with the vector $\bm{a}_{18}^{(\mathrm{ours})}$, whose counterpart $\bm{s}_{18}^{(\mathrm{ours})}$ has the largest value of Correlation with $\bm{d}^{\mathrm{REST}}$ under appropriate $\alpha,\kappa$, as in Figure 4. For the Correlation between these two column vectors, see also Figure 8A below. Similarly to the vector $\bm{s}_{3}^{(\mathrm{FastICA})}$, we cannot find significant synchronization with the timing of visual stimuli in Figure 7A, and it is difficult to understand what the spatial map of $\bm{a}_{17}^{(\mathrm{FastICA})}$ means because the area of activation in the brain is not clear from Figure 7B.

Regarding the relation between the spatial maps by the two methods, we evaluate the absolute value of the Correlation between the vector $\bm{a}_{18}^{(\mathrm{ours})}$ and each column vector of $\bm{A}$ by FastICA in Figure 8A. From this result, we find that the vector $\bm{a}_{17}^{(\mathrm{FastICA})}$ is the one most strongly correlated with the vector $\bm{a}_{18}^{(\mathrm{ours})}$. In Figure 8B, we depict the scatter plot of the elements of the vector $\bm{a}_{18}^{(\mathrm{ours})}$ versus the corresponding elements of the vector $\bm{a}_{3}^{(\mathrm{FastICA})}$ (top) or $\bm{a}_{17}^{(\mathrm{FastICA})}$ (bottom), where the values of the elements in the spatial vectors are normalized to the range $[-1,1]$ by a linear transformation. From this figure, the vector $\bm{a}_{17}^{(\mathrm{FastICA})}$ is more strongly correlated with the vector $\bm{a}_{18}^{(\mathrm{ours})}$. Hence, these two spatial feature vectors presumably represent similar spatial networks. As mentioned in the result of Figure 7, synchronization with the timing of visual stimuli is not observed in the vector $\bm{s}_{17}^{(\mathrm{FastICA})}$, which is the counterpart of the spatial feature vector $\bm{a}_{17}^{(\mathrm{FastICA})}$. In contrast, the temporal feature vector $\bm{s}_{18}^{(\mathrm{ours})}$ is clearly synchronized with the timing of visual stimuli, and the corresponding spatial feature $\bm{a}_{18}^{(\mathrm{ours})}$ also shows activation in the region related to visual stimuli. From these facts, we conclude that our method outperforms FastICA in feature extraction from fMRI data.

Refer to caption
Refer to caption
Figure 5: (A) The 18th row vector in 𝑺\bm{S} by our proposed method, 𝒔18(ours)\bm{s}_{18}^{\rm(ours)}: The orange background indicates the timing of showing image to a test subject. (B) Spatial map depicted on the cross-section of the brain: Each point has one-to-one correspondence with the element in the 18th spatial feature vector in 𝑨\bm{A}, 𝒂18(ours)\bm{a}_{18}^{\rm(ours)}. The points are displayed according to the mapping between voxel position and each element in the spatial feature vector. The color represents the value of the element.
Refer to caption
Refer to caption
Figure 6: (A) The vector 𝒔3(FastICA)\bm{s}_{3}^{\rm(FastICA)}: The orange background indicates the timing of showing image to a test subject. (B) Spatial map depicted on the cross-section of the brain: Each point has one-to-one correspondence with the element in the vector 𝒂3(FastICA)\bm{a}_{3}^{\rm(FastICA)}.
Figure 7: (A) The vector 𝒔17(FastICA)\bm{s}_{17}^{\rm(FastICA)}: The orange background indicates the timing at which the image is shown to the test subject. (B) Spatial map depicted on the cross-section of the brain: Each point corresponds one-to-one to an element of the vector 𝒂17(FastICA)\bm{a}_{17}^{\rm(FastICA)}.
Figure 8: (A) Correlation between the vector 𝒂18(ours)\bm{a}_{18}^{\rm(ours)} and each column vector in 𝑨\bm{A} by FastICA. (B) Scatter plot of the elements of the vector 𝒂3(FastICA)\bm{a}_{3}^{\rm(FastICA)} vs. the corresponding elements of the vector 𝒂18(ours)\bm{a}_{18}^{\rm(ours)} (top), and of the vector 𝒂17(FastICA)\bm{a}_{17}^{\rm(FastICA)} vs. the vector 𝒂18(ours)\bm{a}_{18}^{\rm(ours)} (bottom).

We attribute the advantage of our method to the sparsity of the estimated mixture matrix 𝑨\bm{A}. To support this, the sparsity of 𝑨\bm{A} in this experiment is summarized in Table 4. From this table, it is found that the parameters giving the feature vector with the largest correlation (α=10^{1}, κ=0.9) also lead to the sparsest matrix 𝑨\bm{A}. On the other hand, Sparsity(𝑨){\rm Sparsity}(\bm{A}) by FastICA is evaluated as 0.104, which is much smaller than that obtained by our method. In previous studies [35, 36, 13], it is claimed that MF methods giving a sparse 𝑨\bm{A} can extract appropriate temporal feature vectors characterizing the external stimuli. Therefore, the result indicating the advantage of our method is consistent with these previous studies.
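For reference, a simple way to compute such a sparsity measure is sketched below. We assume here that Sparsity(𝑨) is the fraction of entries of 𝑨 whose magnitude is negligibly small; the precise definition used in this paper is the one given in the experimental section, and the threshold below is only illustrative.

```python
import numpy as np

def sparsity(A, tol=1e-6):
    """Fraction of entries of the mixture matrix A that are (numerically) zero,
    i.e. whose absolute value is below tol."""
    return float(np.mean(np.abs(A) < tol))

# toy usage: a matrix with 4 of its 6 entries equal to zero
A = np.array([[0.0, 0.0, 1.2],
              [0.0, -0.3, 0.0]])
print(sparsity(A))  # -> 0.666...
```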

Table 4: Sparsity(𝑨), MAK(𝑺), and RMSE_𝑿 under various α, κ

Sparsity(𝑨)     κ=0      κ=0.5    κ=0.9
α=10^{-6}       0.095    0.099    0.099
α=10^{-2}       0.098    0.101    0.101
α=10^{1}        0.156    0.505    0.770

MAK(𝑺)          κ=0      κ=0.5    κ=0.9
α=10^{-6}       0.91     0.91     2.08
α=10^{-2}       1.27     1.41     1.36
α=10^{1}        1.27     1.41     1.50

RMSE_𝑿          κ=0      κ=0.5    κ=0.9
α=10^{-6}       0.904    0.908    0.922
α=10^{-2}       0.904    0.908    0.922
α=10^{1}        0.904    0.908    0.922
Figure 9: Correlation obtained by another sparse ICA method in the previous work [21]: The values of Correlation between each temporal feature vector in 𝑺\bm{S} and the ground-truth timing vector 𝒅REST\bm{d}^{\mathrm{REST}} are shown for the settings of λ and α given in the text.

To verify the significance of sparsity, the kurtosis of the estimated temporal feature matrix 𝑺\bm{S} and the reconstruction error by our method are also summarized in Table 4. For comparison, MAK(𝑺){\rm MAK}(\bm{S}) and RMSE𝑿{\rm RMSE}_{\bm{X}} obtained by FastICA are 12.37 and 0.904, respectively. Recall that the objective of this experiment is to identify the temporal features responding to external stimuli. Hence, this result suggests that, for feature extraction from neuronal activity data, the sparsity of the spatial features is more important than the kurtosis of the temporal features or the reconstruction error. This is reinforced by the result of a previous study [13], in which the sparse MF methods SparsePCA and the method of optimal directions (MOD) were applied to the same dataset and the performance of feature extraction and the reconstruction error were investigated; there, appropriate features were obtained whenever the spatial features in the brain were sufficiently sparse, even in cases with large reconstruction error. Our experimental result shows that the synchronization between the extracted temporal features and the timing of showing images can be improved by sparsifying the spatial features in the brain, even if the kurtosis of the temporal features decreases. Therefore, it can be concluded that the sparsity of the spatial features in the brain is the most important factor for obtaining features corresponding to visual stimuli.
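For completeness, the two remaining quantities in Table 4 can be computed along the following lines. We assume here that MAK(𝑺) denotes the mean absolute (excess) kurtosis over the rows of 𝑺 and that RMSE_𝑿 is the root-mean-square error of the reconstruction 𝑿 ≈ 𝑨𝑺; if the definitions in the experimental section differ (for example, non-excess kurtosis), the corresponding lines should be adjusted accordingly.

```python
import numpy as np
from scipy.stats import kurtosis

def mean_abs_kurtosis(S):
    """Mean of |excess kurtosis| over the K temporal feature vectors (rows of S)."""
    return float(np.mean([abs(kurtosis(s, fisher=True)) for s in S]))

def rmse_reconstruction(X, A, S):
    """Root-mean-square error of the reconstruction X ~ A S."""
    return float(np.sqrt(np.mean((X - A @ S) ** 2)))

# toy usage with hypothetical shapes (p voxels, K components, N scans)
rng = np.random.default_rng(2)
p, K, N = 500, 20, 121
A = rng.standard_normal((p, K))
S = rng.standard_normal((K, N))
X = A @ S + 0.1 * rng.standard_normal((p, N))
print(mean_abs_kurtosis(S), rmse_reconstruction(X, A, S))
```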

For comparison, we also apply another sparse ICA method from the previous work [21] to the same data. In their method, the sparsity of the matrix 𝑾T𝑸{\bm{W}}^{T}{\bm{Q}} is controlled by two parameters λ and α, which we set to λ=0.1, 0.5, 1.0 and α=10^{-4}, 10^{-3}, 10^{-2}. The result in Figure 9 shows that a value of Correlation above 0.5 cannot be obtained; in other words, appropriate features cannot be extracted sufficiently by the ICA method that sparsifies the matrix 𝑾T𝑸{\bm{W}}^{T}{\bm{Q}}. In contrast, our method, which sparsifies the matrix 𝑸#𝑾\bm{Q}^{\#}\bm{W}, yields a much larger value of Correlation and works more appropriately for feature extraction from fMRI data.

It should be mentioned that there is also an analysis of the same data by MOD [13]. In that study, it is claimed that MOD with a high-sparsity setting can extract activations in the cerebellum. Comparing the results of the proposed method and MOD, we observe that their spatial maps are similar. However, the maximum correlation between the temporal feature vector and the visual stimuli (or resting states) obtained by MOD is 0.716 under K=20. Therefore, our method has an advantage over MOD under the same value of K. In addition, the previous study also showed that spatial ICA and SparsePCA can extract significant features. In the spatial features extracted by these methods, strong activations are observed in the region of the early visual cortex, which differs from the region obtained by MOD or by our proposed method. Such differences should be interpreted as indicating that all of these methods (MOD, SparsePCA, spatial ICA, and our method) are effective for extracting features from neural activity data, rather than as grounds for ranking them. This observation also supports the validity of our method for feature extraction from fMRI data.

5 Conclusion

In this study, we propose a novel ICA method that yields a sparse factorized matrix by adding an 1\ell_{1} regularization term, and we evaluate its performance by application to both synthetic and real-world data. From the results of the numerical experiments, we expect that the proposed method improves the interpretability of the result in comparison with conventional ICA, because our method can give a sparse factorized matrix under appropriate tuning of the parameters. Furthermore, in the application to task-related fMRI data, our method can discriminate resting and non-resting states, and it is competitive with MOD and other MF methods. This indicates the utility of our proposed method for practical analysis of biological data.

As future work, we will compare the performance of our proposed method with other ICA methods, in particular stochastic ICA [24]. Due to its stochastic nature, it may be difficult for such a method to obtain a sparse factorized matrix stably. However, the stochastic method may have advantages over our method, for example in scalability, under appropriate parameter tuning or data preprocessing. It is also necessary to apply our method to other real-world data, such as genetic or financial data, and to investigate its utility there. In particular, for the analysis of gene expression, it is reported that ICA with the assumption of independence on temporal features is suitable for gene clustering [37, 38]. In addition, it is known that the interpretation of gene clusters becomes easier when sparsity is imposed on gene features [6]. Hence, our method may be helpful for finding novel features of gene expression. Finally, we hope that our method will prove useful in various fields and contribute to feature extraction in many practical problems.

Acknowledgments

The authors are grateful to Kazuharu Harada for sharing his related work [21] and for offering the program code of the sparse ICA method in that work. This work was supported by KAKENHI Nos. 18K11175, 19K12178, 20H05774, 20H05776, and 23K10978.

References

  • [1] Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
  • [2] Daniel Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in neural information processing systems, 13, 2000.
  • [3] K. Engan, S.O. Aase, and J. Hakon Husoy. Method of optimal directions for frame design. IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99, 5:2443–2446, 1999.
  • [4] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing, 54(11):4311–4322, 2006.
  • [5] Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.
  • [6] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.
  • [7] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457–1469, 2004.
  • [8] Surya Ganguli and Haim Sompolinsky. Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis. Annual review of neuroscience, 35:485–508, 2012.
  • [9] Xiaoyu Ding, Jong-Hwan Lee, and Seong-Whan Lee. Performance evaluation of nonnegative matrix factorization algorithms to estimate task-related neuronal activities from fMRI data. Magnetic resonance imaging, 31(3):466–476, 2013.
  • [10] Jinglei Lv, Binbin Lin, Qingyang Li, Wei Zhang, Yu Zhao, Xi Jiang, Lei Guo, Junwei Han, Xintao Hu, Christine Guo, et al. Task fMRI data analysis based on supervised stochastic coordinate coding. Medical image analysis, 38:1–16, 2017.
  • [11] Michael Beyeler, Emily L Rounds, Kristofor D Carlson, Nikil Dutt, and Jeffrey L Krichmar. Neural correlates of sparse coding and dimensionality reduction. PLoS computational biology, 15(6):e1006908, 2019.
  • [12] Christian F Beckmann, Marilena DeLuca, Joseph T Devlin, and Stephen M Smith. Investigations into resting-state connectivity using independent component analysis. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1457):1001–1013, 2005.
  • [13] Yusuke Endo and Koujin Takeda. Performance evaluation of matrix factorization for fmri data. Neural Computation, 36(1):128–150, 2024.
  • [14] Pham Dinh Tao and El Bernoussi Souad. Algorithms for solving a class of nonconvex optimization problems. Methods of subgradients. In J.-B. Hiriart-Urruty, editor, Fermat Days 85: Mathematics for Optimization, volume 129 of North-Holland Mathematics Studies, pages 249–271. North-Holland, 1986.
  • [15] Pham Dinh Tao and El Bernoussi Souad. Duality in DC (difference of convex functions) optimization. Subgradient methods. In Trends in Mathematical Optimization: 4th French-German Conference on Optimization, pages 277–293. Springer, 1988.
  • [16] Le Thi Hoai An and Pham Dinh Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of operations research, 133:23–46, 2005.
  • [17] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • [18] Michael Zibulevsky, Pavel Kisilev, Yehoshua Zeevi, and Barak Pearlmutter. Blind source separation via multinode sparse representation. Advances in neural information processing systems, 14, 2001.
  • [19] Michael Zibulevsky and Barak A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4):863–882, 2001.
  • [20] Alexander M Bronstein, Michael M Bronstein, Michael Zibulevsky, and Yehoshua Y Zeevi. Sparse ica for blind separation of transmitted and reflected images. International Journal of Imaging Systems and Technology, 15(1):84–91, 2005.
  • [21] Kazuharu Harada and Hironori Fujisawa. Sparse estimation of linear non-Gaussian acyclic model for causal discovery. Neurocomputing, 459:223–233, 2021.
  • [22] Ying Chen, Linlin Niu, Ray-Bing Chen, and Qiang He. Sparse-group independent component analysis with application to yield curves prediction. Computational Statistics & Data Analysis, 133:76–89, 2019.
  • [23] Kun Zhang and Lai-Wan Chan. ICA with sparse connections. Intelligent Data Engineering and Automated Learning, pages 530–537, 2006.
  • [24] Claire Donnat, Leonardo Tozzi, and Susan Holmes. Constrained Bayesian ICA for brain connectome inference. arXiv preprint arXiv:1911.05770, 2019.
  • [25] Simon Foucart. The sparsity of lasso-type minimizers. Applied and Computational Harmonic Analysis, 62:441–452, 2023.
  • [26] Y.C. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, 1:40–44, 1993.
  • [27] Geoffrey M. Davis, Stephane G. Mallat, and Zhifeng Zhang. Adaptive time-frequency decompositions. Optical Engineering, 33:2183–2191, 1994.
  • [28] Katsuya Tono, Akiko Takeda, and Jun-ya Gotoh. Efficient DC algorithm for constrained sparse optimization. arXiv preprint arXiv:1701.08498, 2017.
  • [29] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335 – 1371, 2011.
  • [30] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning, 3(1):1–122, 2011.
  • [31] Pham Dinh Tao and LT Hoai An. Convex analysis approach to DC programming: theory, algorithms and applications. Acta mathematica vietnamica, 22(1):289–355, 1997.
  • [32] Shun-ichi Amari, Andrzej Cichocki, and Howard Yang. A new learning algorithm for blind signal separation. Advances in neural information processing systems, 8, 1995.
  • [33] Ryota Kawasumi and Koujin Takeda. Automatic hyperparameter tuning in sparse matrix factorization. Neural Computation, 35(6):1086–1099, 2023.
  • [34] J V Haxby, M I Gobbini, M L Furey, A Ishai, J L Schouten, and P Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293:2425–2430, 2001.
  • [35] Wei Zhang, Jinglei Lv, Xiang Li, Dajiang Zhu, Xi Jiang, Shu Zhang, Yu Zhao, Lei Guo, Jieping Ye, Dewen Hu, et al. Experimental comparisons of sparse dictionary learning and independent component analysis for brain network inference from fMRI data. IEEE transactions on biomedical engineering, 66(1):289–299, 2018.
  • [36] Jianwen Xie, Pamela K Douglas, Ying Nian Wu, Arthur L Brody, and Ariana E Anderson. Decoding the encoding of functional brain networks: An fMRI classification comparison of non-negative matrix factorization (NMF), independent component analysis (ICA), and sparse coding algorithms. Journal of neuroscience methods, 282:81–94, 2017.
  • [37] Moyses Nascimento, Fabyano Fonseca e Silva, Thelma Safadi, Ana Carolina Campana Nascimento, Talles Eduardo Maciel Ferreira, Laís Mayara Azevedo Barroso, Camila Ferreira Azevedo, Simone Eliza Faccione Guimarães, and Nick Vergara Lopes Serão. Independent component analysis (ICA)-based clustering of temporal RNA-seq data. PloS one, 12(7):e0181195, 2017.
  • [38] Sookjeong Kim, Jong Kyoung Kim, and Seungjin Choi. Independent arrays or independent time courses for gene expression time series data analysis. Neurocomputing, 71(10-12):2377–2387, 2008.