
Robust Manifold Nonnegative Tucker Factorization for Tensor Data Representation

Jianyu Wang Linruize Tang Jie Chen Jingdong Chen Center for Intelligent Acoustics and Immersive Communications, School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China.
Abstract

Nonnegative Tucker Factorization (NTF) minimizes the Euclidean distance or Kullback-Leibler divergence between the original data and its low-rank approximation; this formulation often suffers from gross corruptions or outliers and neglects the manifold structure of the data. In particular, NTF suffers from rotational ambiguity: solutions with and without rotation transformations are equally good in the sense of yielding the maximum likelihood. In this paper, we propose three Robust Manifold NTF algorithms that handle outliers by incorporating structural knowledge about them. They first apply a half-quadratic optimization algorithm to transform the problem into a general weighted NTF in which the weights are influenced by the outliers. Then, we introduce the correntropy induced metric, the Huber function and the Cauchy function, respectively, to define the weights and handle the outliers. Finally, we introduce a manifold regularization to overcome the rotational ambiguity of NTF. We compare the proposed methods with a number of representative baselines covering the major branches of NTF on a variety of real-world image databases. Experimental results illustrate the effectiveness of the proposed methods under two evaluation metrics (accuracy and NMI).

keywords:
Nonnegative Tucker Factorization, Manifold learning, Low-rank Representation.

1 Introduction

Nonnegative Tucker factorization (NTF), also known as nonnegative multilinear singular value decomposition (SVD), is a multiway extension of nonnegative matrix factorization [1]. It exploits the nonnegativity of data, which enhances the ability of part-based representation, and it has received considerable attention in many fields, e.g., text mining [2], hyperspectral imaging [3], blind source separation [4] and data clustering [5].

Finding and exploiting the low-rank approximation and the low-dimensional manifold of high-dimensional data is a fundamental problem in machine learning. In particular, the components extracted by principal component analysis (PCA) [6] or vector quantization (VQ) [7] may lose their physical meaning if nonnegativity is not preserved for high-dimensional real-world data. Hence, nonnegative matrix factorization (NMF) [8, 9] has been used to explore low-rank representations of given data. However, it is intractable for NMF to deal with high-order tensors. The order of a tensor is the number of dimensions of the array, and a mode is one of its dimensions [10]. For example, an RGB image can be represented by a third-order tensor with the dimensions of height ×\times width ×\times channel. When applying NMF to tensorial data, the first step is to reshape tensors into matrices, which often destroys the meaningful tensor structure, and the large number of parameters leads to higher memory demands [11]. Tensorial data naturally characterize data from multiple aspects and preserve the structure information in each mode. However, they are typically high dimensional and difficult to handle in their original space. To address this problem, nonnegative tensor decomposition (NTD) methods have been proposed to directly exploit the multidimensional structure of tensors. NTD can be viewed as a special case of NMF that not only inherits the advantages of NMF but also provides a physically meaningful representation of the multiway structure.

For nonnegative tensor data analysis, many nonnegative tensor decomposition methods are based on CANDECOMP/PARAFAC (CP) decompositions [12], Tucker models [13] and low-tubal-rank models [14], respectively. In this paper, we focus on the Tucker model, since it enables a more flexible and interpretable decomposition by utilizing the interactions of latent factors. Existing NTF methods usually have the following major problems. First, many NTF methods decompose a high-dimensional tensor into a product of low-rank nonnegative projection matrices and a low-dimensional nonnegative core tensor by minimizing the Euclidean distance between their product and the original tensor data. They are optimal when the data are contaminated by additive Gaussian noise. However, they may fail on grossly corrupted data, since the corruptions or outliers seriously violate the Gaussian noise assumption. Second, the Tucker-based model suffers from rotational ambiguity [15], i.e., solutions with and without rotation transformations are equally good in the sense of yielding the maximum likelihood [11]. This implies that NTF can only find arbitrary bases of the latent subspace.

1.1 Contributions

In this paper, we aim to explore robust manifold NTF methods to address the above challenges. Our contributions are threefold.

  • 1.

    Three robust manifold NTF algorithms are proposed. They first apply a half-quadratic optimization algorithm to transform the intractable problem into a weighted NTF model in which the weights suppress the outliers. The algorithms can be implemented by incorporating different prior distributions for the discrepancy between the original data and the reconstructed data. Unlike the Gaussian loss, these more flexible metrics handle outliers more effectively, and the weights are adjusted adaptively with respect to the error. To our knowledge, this is the first work to bring robust weighted learning into NTF to handle outliers.

  • 2.

    Three RMNTF algorithms that reach state-of-the-art performance are proposed. They fall into one major subclass of NTF techniques, which we denote as weighted NTF. Different loss distributions between the original data and the reconstructed data influence the ability to handle outliers. Specifically, RMNTF with the correntropy induced metric (RMNTF-CIM), RMNTF with the Huber function (RMNTF-Huber), and RMNTF with the Cauchy function (RMNTF-Cauchy) are studied in this paper.

  • 3.

    We demonstrate the convergence, robustness and uniqueness of RMNTF.

In this work, we first introduce some preliminaries and related work in the following two subsections, then present three RMNTF algorithms in Section 2. Section 3 presents the experimental results. Finally, Section 4 concludes our findings.

1.1.1 Notations

We denote vectors, matrices and tensors by bold lowercase 𝐱\mathbf{x}, bold uppercase 𝐗\mathbf{X} and calligraphic letters 𝒳\mathcal{X} respectively. 0\mathbb{R}_{\geq 0} denotes the fields of nonnegative real numbers. \langle\cdot\rangle denotes the expectation of a certain random variable. vec()\mathrm{vec}(\cdot) is the vectorization operator that turns a tensor into a column vector. The transpose of a vector or matrix is denoted by ()T(\cdot)^{T}. Symbols \circ, \otimes, \circledast and \odot denote the outer, Kronecker, Hadamard and Khatri-Rao products respectively. DiagN(𝐱)\mathrm{Diag}^{N}(\mathbf{x}) denotes the NNth order diagonal tensor formed by 𝐱\mathbf{x}. 𝐱n\mathbf{x}_{n}, 𝐗n\mathbf{X}_{n} and 𝒳n\mathcal{X}_{n} denote the nnth vector, nnth matrix and nnth tensor respectively. 𝐗n(m)\mathbf{X}_{n(m)} denotes the mode-mm unfolding of tensor 𝒳n\mathcal{X}_{n}. 𝐗(m)\mathbf{X}^{(m)} denotes the mode-mm factor matrix. ×n\times_{n} denotes the mode-nn tensor product.

Definition 1

(Mode-nn Product [16]): A mode-nn product of a tensor \mathcal{S}\in\mathbb{R}^{r_{1}\times r_{2}\times\dots\times r_{N}} with a matrix \mathbf{A}^{(n)}\in\mathbb{R}^{I_{n}\times r_{n}} is denoted by \mathcal{X}=\mathcal{S}\times_{n}\mathbf{A}^{(n)}\in\mathbb{R}^{r_{1}\times\dots\times r_{n-1}\times I_{n}\times r_{n+1}\times\dots\times r_{N}}. Each element can be represented as \mathcal{X}_{r_{1}\dots r_{n-1}i_{n}r_{n+1}\dots r_{N}}=\sum_{j_{n}}\mathcal{S}_{r_{1}\dots r_{n-1}j_{n}r_{n+1}\dots r_{N}}\mathbf{A}^{(n)}_{i_{n}j_{n}}.
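The following is a minimal NumPy sketch of the mode-n product in Definition 1 (the function and variable names are ours, not from the paper); the unfolding inside the function uses NumPy's default ordering, which is self-consistent with the folding step that follows it.

```python
import numpy as np

def mode_n_product(S, A, n):
    """Compute X = S x_n A, where S has size r_n on mode n and A has shape (I_n, r_n)."""
    # Unfold S along mode n, multiply by A, and fold back with mode n replaced by I_n.
    S_unf = np.moveaxis(S, n, 0).reshape(S.shape[n], -1)
    new_shape = (A.shape[0],) + tuple(np.delete(S.shape, n))
    return np.moveaxis((A @ S_unf).reshape(new_shape), 0, n)

# Example: a 2x3x4 core multiplied on mode 1 by a 5x3 matrix yields a 2x5x4 tensor.
S = np.random.rand(2, 3, 4)
A = np.random.rand(5, 3)
print(mode_n_product(S, A, n=1).shape)  # (2, 5, 4)
```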

Definition 2

(Mode-nn Unfolding [17]): It is also known as matricization or flattening, which is the process of reordering the elements of an NN-way array into a matrix along each mode. A mode-nn unfolding matrix of a tensor 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\dots\times I_{N}} is denoted as 𝐗(n)\mathbf{X}_{(n)} and arranges the mode-nn fibers to be the columns of the resulting matrix.

Definition 3

(Folding Operator [10]): Given a tensor 𝒳I1××IN\mathcal{X}\in\mathbb{R}^{I_{1}\times\dots\times I_{N}}, the mode-n folding operator of a matrix 𝐌=𝒳(n)\mathbf{M}=\mathcal{X}_{(n)} is denoted as foldn(𝐌)\mathrm{fold}_{n}(\mathbf{M}), which is the inverse of the unfolding operator.
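As an illustration of Definitions 2 and 3, the sketch below unfolds a tensor along mode n and folds the matrix back; the column ordering follows NumPy's C order, which may differ from other matricization conventions but is consistent between the two operators.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: the mode-n fibers become the columns of the resulting matrix."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from its mode-n unfolding."""
    full_shape = (shape[n],) + tuple(np.delete(shape, n))
    return np.moveaxis(M.reshape(full_shape), 0, n)

X = np.arange(24).reshape(2, 3, 4)
assert np.array_equal(fold(unfold(X, 1), 1, X.shape), X)  # folding undoes unfolding
```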

1.2 Related Work

NMF: Given a nonnegative data matrix 𝐗=[x1,,xN]0N×M\mathbf{X}=[x_{1},\dots,x_{N}]\in\mathbb{R}_{\geq 0}^{N\times M}, where NN refers to the number of data points and MM indicates the feature dimension, the objective of NMF is to find two nonnegative low-rank factor matrices 𝐔0N×r\mathbf{U}\in\mathbb{R}_{\geq 0}^{N\times r} and 𝐕0M×r\mathbf{V}\in\mathbb{R}_{\geq 0}^{M\times r} whose product approximates the original data matrix 𝐗\mathbf{X}, i.e., 𝐗𝐔𝐕T\mathbf{X}\approx\mathbf{U}\mathbf{V}^{T}, where generally rmin{M,N}r\ll\min\{M,N\}. NMF incorporates the nonnegativity constraint and obtains a part-based representation as well as enhanced interpretability [8], and these methods have a close relation with K-means [18]. However, matrix factorization is generally nonunique, and many regularizers [19] or constraints [20] have been developed to alleviate the nonuniqueness of the decomposition. In particular, NMF methods have been proposed that utilize priors to obtain a better representation [21] and achieve robust clustering [22, 23].

However, the multiway structure information of high-dimensional data such as RGB images or videos cannot be represented by NMF. Vectorizing such high-dimensional data may lead to an excessive number of parameters and destroy the structure. Nonnegative tensor decomposition based methods have therefore been proposed for high-dimensional data.


Figure 1: Canonical Polyadic Decomposition.

Nonnegative CANDECOMP/PARAFAC (NCP) methods: The CP-based tensor model (Fig. 1) [24] decomposes 𝒳\mathcal{X} into a linear combination of RR rank-one tensors as follows:

𝒳=r=1Rλr𝐚r(1)𝐚r(2)𝐚r(N)=DiagN(𝝀)×n=1N𝐀(n)T,\begin{split}\mathcal{X}&=\sum_{r=1}^{R}\lambda_{r}\mathbf{a}^{(1)}_{r}\circ\mathbf{a}^{(2)}_{r}\circ\dots\circ\mathbf{a}^{(N)}_{r}\\ &=\mathrm{Diag}^{N}(\boldsymbol{\lambda})\times_{{n}={1}}^{N}\mathbf{A}^{(n)T},\end{split} (1)

where DiagN(𝝀)\mathrm{Diag}^{N}(\boldsymbol{\lambda}) is the NNth-order diagonal tensor. The CP rank of 𝒳\mathcal{X}, given by Rankcp(𝒳)=R\mathrm{Rank}_{cp}(\mathcal{X})=R, is the smallest number of rank-one tensors in such a decomposition [11]. The CP-based tensor model assumes that each tensor element can be calculated as a summation of RR products, and it is restrictive since it only considers RR possible interactions between latent factors.
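For illustration, the short sketch below reconstructs a tensor from a CP model as in Eq. (1), i.e., as a weighted sum of R outer products of the factor columns (the names and shapes are our own choices).

```python
import numpy as np

def cp_reconstruct(lmbda, factors):
    """factors[n] has shape (I_n, R); returns sum_r lambda_r a_r^(1) o ... o a_r^(N)."""
    shape = tuple(A.shape[0] for A in factors)
    X = np.zeros(shape)
    for r, lam in enumerate(lmbda):
        outer = factors[0][:, r]
        for A in factors[1:]:
            outer = np.multiply.outer(outer, A[:, r])  # successive outer products
        X += lam * outer
    return X

factors = [np.random.rand(4, 3), np.random.rand(5, 3), np.random.rand(6, 3)]
print(cp_reconstruct(np.ones(3), factors).shape)  # (4, 5, 6)
```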

Existing CP-based tensor models have a flexible subspace representation that may not consider the intrinsic manifold information of high-dimensional data, which limits the performance of downstream tasks. A few regularization and prior-knowledge strategies have been studied in these models [11]. For example, Zhao et al. [25] formulated CP factorization using a hierarchical probabilistic model and employed a fully Bayesian treatment by incorporating a sparsity-inducing prior over multiple latent factors and appropriate hyperpriors over all hyperparameters to determine the rank of the model. Zhao et al. [26] proposed a probabilistic model to recover the underlying low-rank tensor, which is modeled by multiplicative interactions among multiple groups of latent factors, with an additive sparse tensor modeling outliers. Zhou et al. [11] introduced concurrent regularizations that regularize the entire subspace in a concurrent and coherent way to avoid the strong scale restrictions of L2L_{2} regularization. Chen et al. [27] proposed a generalized weighted low-rank tensor factorization which represents the sparse component as a mixture of Gaussians, and unified the Tucker and CP factorizations in a joint framework to handle complex noise and outliers. However, these methods neglect nonnegativity constraints and may not learn part-based, physically meaningful representations.


Figure 2: Tucker Decomposition.

NTF methods: The Tucker model (Fig. 2) assumes that an original tensor 𝒳\mathcal{X} can be well approximated as

𝒳=𝒮×1𝐀(1)×2×N𝐀(N),\begin{split}\mathcal{X}=\mathcal{S}\times_{{1}}\mathbf{A}^{({1})}\times_{{2}}\dots\times_{N}\mathbf{A}^{(N)},\end{split} (2)

where the Tucker rank of 𝒳\mathcal{X} is denoted as Ranktc(𝒳)=(R1,,Rn,,RN)\mathrm{Rank}_{\mbox{tc}}(\mathcal{X})=(R_{1},\dots,R_{n},\dots,R_{N}) with Rn=Rank(𝐗(n))R_{n}=\mathrm{Rank}(\mathbf{X}_{(n)}).
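A minimal sketch of the Tucker reconstruction in Eq. (2) is given below: the core tensor is multiplied by one factor matrix per mode via successive mode-n products (a plain NumPy implementation under our own naming).

```python
import numpy as np

def mode_n_product(S, A, n):
    S_unf = np.moveaxis(S, n, 0).reshape(S.shape[n], -1)
    new_shape = (A.shape[0],) + tuple(np.delete(S.shape, n))
    return np.moveaxis((A @ S_unf).reshape(new_shape), 0, n)

def tucker_reconstruct(core, factors):
    """core: (r_1, ..., r_N); factors[n]: (I_n, r_n). Returns the (I_1, ..., I_N) tensor."""
    X = core
    for n, A in enumerate(factors):
        X = mode_n_product(X, A, n)
    return X

core = np.random.rand(2, 3, 4)
factors = [np.random.rand(10, 2), np.random.rand(11, 3), np.random.rand(12, 4)]
print(tucker_reconstruct(core, factors).shape)  # (10, 11, 12)
```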

We note that the CP decomposition is a special case of the Tucker decomposition. Although the Tucker decomposition is invariant to rotations of the factor matrices, it shares parameters across the latent factor matrices through a core tensor. In contrast, CP decomposition methods force each factor vector to capture potentially redundant information [28], and they are more prone to overfitting than Tucker decompositions. The main interest in the Tucker model is to find subspaces for tensor approximation.

Li et al. [29] introduced a manifold regularization into the core tensors of NTF, which preserves the tensor geometry information. However, the representation space of the core tensor increases exponentially as the tensor order increases, which results in high computational complexity. Jiang et al. [30] added a graph Laplacian regularization on a low-dimensional factor matrix to improve the robustness of tensor decomposition. Sun et al. [5] proposed a heterogeneous tensor decomposition for clustering by performing dimensionality reduction on the first N1N-1 modes of the tensor and incorporating useful constraints on the last-mode factor matrix for clustering. However, these methods neglect nonnegativity constraints and may lose the physical meaning of the low-rank representations. Yin et al. [31] incorporated Laplacian Eigenmaps and Locally Linear Embedding as manifold regularization terms into the least-squares form of the NTF model. Pan et al. [32] introduced an orthogonality constraint on the group of factor matrices of NTF, which not only helps to keep the inherent tensor structure but also performs well in data compression. Yin et al. [33] proposed Hypergraph Regularized NTF, which preserves nonnegativity in tensor factorization and uncovers the higher-order relationships among nearest neighborhoods. However, these methods decompose tensor data by minimizing the Euclidean distance, which fails on data that are not clean.

Manifold learning: Manifold learning aims to encode the geometric information of the data space. Its goal is to find a representation in which two objects remain close to each other after dimension reduction if they are close in the intrinsic geometry of the data manifold. Based on this idea, many types of manifold learning algorithms have been proposed, such as ISOMAP [34], LLE [35], Laplacian eigenmaps [36] and locality preserving projections [37]. The advantage of introducing manifold structures is that they preserve the intrinsic geometric information of the data points, which has been shown to be useful in a wide range of applications, such as face recognition [38], text mining [19], and multimedia interaction [39].

Recently, the idea of manifold learning has been applied to matrix and tensor analysis. For instance, Cai et al. [19] proposed graph regularized nonnegative matrix factorization (GNMF), which utilizes the intrinsic geometric information. However, GNMF may not have optimal solutions due to noise or outliers in the data; consequently, several extensions based on GNMF have been proposed. Moreover, NMF-based manifold learning methods may break the structure of multiway data points. Hence, tensor-based manifold learning methods have been proposed, such as Graph regularized Nonnegative Tucker Decomposition [40] and LLE-based nonnegative tensor decomposition [31]. These manifold nonnegative tensor decomposition methods assume that the noise distribution is Gaussian and may fail on grossly corrupted datasets.


Figure 3: Comparison of the loss functions.

2 Robust Manifold NTF methods

Because our methods rely on the half-quadratic theory, we first introduce the Half-Quadratic [41] minimization technique for the generic robust NTF framework.

2.1 Half-Quadratic Programming for Nonnegative Tucker Factorization

Half-quadratic minimization was pioneered by Geman and Reynolds [42], and it was used to alleviate the computational burden in the context of image reconstruction with nonconvex regularization. Let 𝒳\mathcal{X} denote an original tensor and 𝒳^\hat{\mathcal{X}} denote a reconstruction of 𝒳\mathcal{X}. Replacing the squared residual of the data-fidelity term in [43] on each entry with a generic function yields:

𝒥(𝒳,𝒳^)=g()+λΦ({𝐀(n)}n=1N)s.t.𝒳^=𝒮×n=1N𝐀(n),=𝒳𝒳^𝒮0,𝐀(n)0,\begin{split}\mathcal{J}(\mathcal{X},\hat{\mathcal{X}})=g(\mathcal{E})+\lambda\Phi(\{\mathbf{A}^{(n)}\}_{n=1}^{N})\\ \mathrm{s.t.}\quad\hat{\mathcal{X}}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)},\quad\mathcal{E}=\mathcal{X}-\hat{\mathcal{X}}\\ \mathcal{S}\geq 0,\quad\mathbf{A}^{(n)}\geq 0,\end{split} (3)

where \mathcal{E} represents the residual error between the original tensor 𝒳\mathcal{X} and the reconstructed tensor 𝒳^\hat{\mathcal{X}}, g()g(\cdot) is chosen to be robust to outliers or gross errors, and Φ()\Phi(\cdot) denotes the regularization terms with respect to {𝐀(n)}n=1N\{\mathbf{A}^{(n)}\}_{n=1}^{N}. The cost function 𝒥(𝒳,𝒳^)\mathcal{J}(\mathcal{X},\hat{\mathcal{X}}) involves the reconstruction error, which is nonlinear with respect to the factor matrices and the core tensor, and the regularization term Φ({𝐀(n)}n=1N)\Phi(\{\mathbf{A}^{(n)}\}_{n=1}^{N}). When the factor matrices 𝐀(n),n=1,,N1\mathbf{A}^{(n)},n=1,\dots,N-1 and the core tensor 𝒮\mathcal{S} have many nonzero entries or are ill-conditioned, the computation of the factorization is costly. Specifically, the loss function is possibly non-quadratic and non-convex, and it is difficult to optimize directly. Fortunately, half-quadratic minimization [44] has been developed to solve such intractable optimization problems. According to the conjugate function and half-quadratic theory [44], the reconstruction error term g()g(\mathcal{E}) can be expressed as

g()=min𝒲I1××INQ(,𝒲)+ϕ(𝒲),\begin{split}g(\mathcal{E})=\min_{\mathcal{W}\in\mathbb{R}^{I_{1}\times\dots\times I_{N}}}Q(\mathcal{E},\mathcal{W})+\phi(\mathcal{W}),\end{split} (4)

where ϕ(𝒲)\phi(\mathcal{W}) is the conjugate function of g()g(\mathcal{E}), 𝒲\mathcal{W} is the corresponding auxiliary variable, and Q(,)Q(\cdot,\cdot) is a quadratic term in \mathcal{E} and 𝒲\mathcal{W}. In this paper, we only consider the quadratic term of the multiplicative form [42]:

Q(,𝒲)=12𝒲2.\begin{split}Q(\mathcal{E},\mathcal{W})=\frac{1}{2}\mathcal{W}\circledast\mathcal{E}^{2}.\end{split} (5)

Substituting (4) and (5) into (3), we have the augmented cost function 𝒥^(𝒳,𝒳^;𝒲)\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}};\mathcal{W}):

𝒥^(𝒳,𝒳^;𝒲)=12𝒲2+ϕ(𝒲)+λΦ(𝐀(N))s.t.𝒳^=𝒮×n=1N𝐀(n),=𝒳𝒳^𝒮0,𝐀(n)0,n{1,,N}.\begin{split}&\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}};\mathcal{W})=\frac{1}{2}\mathcal{W}\circledast\mathcal{E}^{2}+\phi(\mathcal{W})+\lambda\Phi(\mathbf{A}^{(N)})\\ &\mathrm{s.t.}\quad\hat{\mathcal{X}}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)},\quad\mathcal{E}=\mathcal{X}-\hat{\mathcal{X}}\\ &\quad\quad\mathcal{S}\geq 0,\quad\mathbf{A}^{(n)}\geq 0,\quad\forall n\in\{1,\dots,N\}.\end{split} (6)

The reconstruction error term involved in 𝒥^(𝒳,𝒳^;𝒲)\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}};\mathcal{W}) is half-quadratic. Hence, the minimizer (𝒳^,𝒲)(\hat{\mathcal{X}},\mathcal{W}) of min𝒳^,𝒲{𝒥^(𝒳,𝒳^;𝒲)}\min_{\hat{\mathcal{X}},\mathcal{W}}\left\{\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}};\mathcal{W})\right\} is calculated by alternating minimization. At iteration kk we compute

𝒲k:𝒥^(𝒳,𝒳^k1;𝒲k)𝒥^(𝒳,𝒳^k1;𝒲k1)𝒳^k:𝒥^(𝒳,𝒳^k;𝒲k)𝒥^(𝒳,𝒳^k1;𝒲k).\begin{split}&\mathcal{W}^{k}:\quad\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}}^{k-1};\mathcal{W}^{k})\leq\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}}^{k-1};\mathcal{W}^{k-1})\\ &\hat{\mathcal{X}}^{k}:\quad\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}}^{k};\mathcal{W}^{k})\leq\hat{\mathcal{J}}(\mathcal{X},\hat{\mathcal{X}}^{k-1};\mathcal{W}^{k}).\end{split} (7)

When 𝒳^\hat{\mathcal{X}} is fixed, the minimization of the reconstruction error term g()g(\cdot) is convex with respect to 𝒲\mathcal{W}. The explicit optimum solution of 𝒲\mathcal{W} [45] can be determined as

𝒲={2g(0)2if=0[g()][]if0,\displaystyle\mathcal{W}=\left\{\begin{array}[]{ll}\frac{\partial^{2}g(0)}{\partial\mathcal{E}^{2}}\quad\mbox{if}\quad\mathcal{E}=0\\ \frac{\left[\frac{\partial g(\mathcal{E})}{\partial\mathcal{E}}\right]}{\left[\mathcal{E}\right]}\quad\mbox{if}\quad\mathcal{E}\neq 0,\end{array}\right. (10)

where [][]\frac{[\cdot]}{[\cdot]} denotes element-wise division.

It can be seen that the auxiliary variable 𝒲\mathcal{W} only depends on the loss function g()g(\cdot). Since outliers often cause large fitting errors, 𝒲\mathcal{W} is important for the objective function to constrain overfitting. For large outliers, the weights 𝒲\mathcal{W} should be small; on the contrary, for small errors, the weights 𝒲\mathcal{W} should be large. Therefore, the weight variable 𝒲\mathcal{W} can be seen as an outlier mask. The frequently used loss functions are shown in Fig. 3.
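To make Eq. (10) concrete, the hedged sketch below computes the half-quadratic weights as the elementwise ratio g'(E)/E, with the limit g''(0) at zero residual; the Welsch (correntropy-type) loss plugged in at the end is only an illustrative choice, not the paper's prescription.

```python
import numpy as np

def hq_weights(E, g_prime, g_second_at_zero):
    """Elementwise W = g'(E)/E, with W = g''(0) wherever E == 0 (cf. Eq. (10))."""
    W = np.empty_like(E)
    zero = (E == 0)
    W[zero] = g_second_at_zero
    W[~zero] = g_prime(E[~zero]) / E[~zero]
    return W

sigma = 1.0
welsch_prime = lambda e: (e / sigma**2) * np.exp(-e**2 / (2 * sigma**2))  # g'(e) of an example loss
E = np.random.randn(4, 4)
E[0, 0] = 25.0                                   # a gross error
W = hq_weights(E, welsch_prime, g_second_at_zero=1.0 / sigma**2)
# Entries with large residuals receive weights close to zero, acting as an outlier mask.
```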

2.2 Robust NTF with Manifold Regularization

By using nonnegative constraints and a robust loss function for outliers, robust NTF can learn a part-based representation. Many robust NTF methods perform well in Euclidean space, but they fail to discover the intrinsic geometrical and discriminative structure of the data space. Here, we introduce a geometrically based regularization for our robust NTF framework.

First, suppose that the data points 𝒳i\mathcal{X}_{i} lie on a low-dimensional manifold \mathcal{M} and that 𝐀i(N)\mathbf{A}^{(N)}_{i} is the representation of 𝒳i\mathcal{X}_{i} in the subspace. We assume that if 𝒳i\mathcal{X}_{i} and 𝒳j\mathcal{X}_{j} are neighbors in the data space, then their low-rank representations 𝐀i(N)\mathbf{A}^{(N)}_{i} and 𝐀j(N)\mathbf{A}^{(N)}_{j} are close to each other in \mathcal{M}. We build a regularization term as follows

()=pfp2,\begin{split}\mathcal{R}(\mathcal{M})=\sum_{p}\|f_{p}\|_{\mathcal{M}}^{2},\end{split} (11)

where fpf_{p} is the mapping function which projects the data point 𝒳i\mathcal{X}_{i} to the low-rank representation 𝐀i(N)=fp(𝒳i)\mathbf{A}^{(N)}_{i}=f_{p}(\mathcal{X}_{i}). ()\mathcal{R}(\mathcal{M}) measures the smoothness of fpf_{p} along the geodesics in the intrinsic geometry of the data.

Based on [19], we use the similarity graph of 𝒳\mathcal{X}. Suppose that 𝐕IN×IN\mathbf{V}\in\mathbb{R}^{I_{N}\times I_{N}} defines the affinity of data points, then we use the Heat Kernel to describe the similarity between each pair of data points if nodes ii and jj are connected:

𝐕ij=exp(𝒳i𝒳j22τ),\begin{split}\mathbf{V}_{ij}=\exp\left(-\frac{\|\mathcal{X}_{i}-\mathcal{X}_{j}\|^{2}_{2}}{\tau}\right),\end{split} (12)

where τ\tau is the width of the kernel used to control the similarity. Then, we calculate the diagonal matrix 𝐃IN×IN\mathbf{D}\in\mathbb{R}^{I_{N}\times I_{N}}, where [𝐃]ii=j[𝐕]ij[\mathbf{D}]_{ii}=\sum_{j}[\mathbf{V}]_{ij} and the Laplacian matrix is 𝐋=𝐃𝐕\mathbf{L}=\mathbf{D}-\mathbf{V}. The graph regularization can be estimated as follows:

()=12p=1Pi=1Nj=1N[𝐕]ij[fp(𝒳i)fp(𝒳j)]2=Tr(𝐀(N)T𝐋𝐀(N)).\begin{split}\mathcal{R}(\mathcal{M})=&\frac{1}{2}\sum_{p=1}^{P}\sum_{i=1}^{N}\sum_{j=1}^{N}[\mathbf{V}]_{ij}\left[f_{p}(\mathcal{X}_{i})-f_{p}(\mathcal{X}_{j})\right]^{2}\\ =&\mathrm{Tr}(\mathbf{A}^{(N)T}\mathbf{L}\mathbf{A}^{(N)}).\end{split} (13)
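The sketch below builds the affinity matrix V of Eq. (12) on a p-nearest-neighbor graph with a heat kernel, together with the degree matrix D and the Laplacian L = D - V used in Eq. (13); the samples are assumed to be vectorized into rows, and the p-NN construction is our assumption, consistent with the hyperparameter p used later in the experiments.

```python
import numpy as np

def build_graph_laplacian(X_mat, p=3, tau=1.0):
    """X_mat: (num_samples, num_features). Returns (V, D, L) with L = D - V."""
    n = X_mat.shape[0]
    dist2 = np.sum((X_mat[:, None, :] - X_mat[None, :, :]) ** 2, axis=-1)
    V = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[1:p + 1]         # p nearest neighbors, skipping the point itself
        V[i, nbrs] = np.exp(-dist2[i, nbrs] / tau)   # heat-kernel affinity of Eq. (12)
    V = np.maximum(V, V.T)                           # symmetrize the graph
    D = np.diag(V.sum(axis=1))
    return V, D, D - V

X_mat = np.random.rand(20, 64)                       # 20 samples with 64 features each
V, D, L = build_graph_laplacian(X_mat, p=3, tau=1.0)
```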

2.3 RMNTF-CIM

Liu et al. [46] proposed the concept of cross-correntropy, which is a generalized similarity measure between two arbitrary scalar random variables 𝐗\mathbf{X} and 𝐘\mathbf{Y} defined by

Vσ(𝐗,𝐘)=κσ(𝐗𝐘),\begin{split}V_{\sigma}(\mathbf{X},\mathbf{Y})=\langle\kappa_{\sigma}(\mathbf{X}-\mathbf{Y})\rangle,\end{split} (14)

where κσ()\kappa_{\sigma}(\cdot) is the kernel function. In practice, the joint probabilistic density function is unknown and only a finite number of data points {(xi,yi)}i=1N\{(x_{i},y_{i})\}_{i=1}^{N} are available. The sample estimator of correntropy can be represented as

V^N,σ(𝐗,𝐘)=1Ni=1Nκσ(xiyi).\begin{split}\hat{V}_{N,\sigma}(\mathbf{X},\mathbf{Y})=\frac{1}{N}\sum_{i=1}^{N}\kappa_{\sigma}(x_{i}-y_{i}).\end{split} (15)

Based on the above definition of correntropy, Liu et al. [46] proposed the correntropy induced metric (CIM) in the sample space, which is defined as

CIM(𝐗,𝐘)=[κ(0)1ni=1nκσ(ei)]12,\begin{split}\mathrm{CIM}(\mathbf{X},\mathbf{Y})=\left[\kappa(0)-\frac{1}{n}\sum_{i=1}^{n}\kappa_{\sigma}(e_{i})\right]^{\frac{1}{2}},\end{split} (16)

where we use the Gaussian kernel in this paper, i.e., κσ(e)=12πσexp(ei2/2σ2)\kappa_{\sigma}(e)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-e_{i}^{2}/2\sigma^{2}\right), and eie_{i} is denoted as ei=xiyie_{i}=x_{i}-y_{i}.

Substituting the error on each entry of the Tucker model with the CIM, we obtain the objective function of RMNTF-CIM:

𝒥(𝒳,𝒳^)=11I1INi1=1I1iN=1INκσ(𝒳i1iN𝒳^i1iN),\begin{split}\mathcal{J}(\mathcal{X},\hat{\mathcal{X}})=1-\frac{1}{I_{1}\dots I_{N}}\sum_{i_{1}=1}^{I_{1}}\dots\sum_{i_{N}=1}^{I_{N}}\kappa_{\sigma}\left(\mathcal{X}_{i_{1}\dots i_{N}}-\hat{\mathcal{X}}_{i_{1}\dots i_{N}}\right),\end{split} (17)

which is equivalent to solving the following optimization problem

min𝐀(1),,𝐀(n),𝒮𝒥(𝒳,𝒳^)+λ2i=1INj=1IN𝐚i(N)𝐚j(N)22vijs.t.𝒳^=𝒮×n=1N𝐀(n),𝒮0,𝐀(n)0.\begin{split}\min_{\mathbf{A}^{(1)},\dots,\mathbf{A}^{(n)},\mathcal{S}}\mathcal{J}(\mathcal{X},\hat{\mathcal{X}})+\frac{\lambda}{2}\sum_{i=1}^{I_{N}}\sum_{j=1}^{I_{N}}\|\mathbf{a}_{i}^{(N)}-\mathbf{a}_{j}^{(N)}\|_{2}^{2}v_{ij}\\ \mathrm{s.t.}\quad\hat{\mathcal{X}}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)},\quad\mathcal{S}\geq 0,\quad\mathbf{A}^{(n)}\geq 0.\end{split} (18)

Then, we introduce the half-quadratic minimization, \mathcal{J}(\mathcal{X},\hat{\mathcal{X}})=\min_{\mathcal{W}}\frac{1}{2}\mathcal{W}\circledast\mathcal{E}^{2}+\phi(\mathcal{W}), where ϕ(𝒲)\phi(\mathcal{W}) is the conjugate function of 𝒥(𝒳,𝒳^)\mathcal{J}(\mathcal{X},\hat{\mathcal{X}}).

2.3.1 Optimization of weighted tensor 𝒲\mathcal{W}

When the factor matrices 𝐀(n)\mathbf{A}^{(n)} and the core tensor 𝒮\mathcal{S} are fixed, the optimization problem with respect to 𝒲\mathcal{W} can be solved separately:

𝒲(n)=argminfw(𝒲(n))withfw(𝒲(n))=i=1Nj=1M[𝒲(n)(𝒳(n)𝐀(n)𝐁(n)T)2]ij+ϕ(𝒲(n)),\begin{split}&\mathcal{W}_{(n)}^{\star}=\arg\min f_{w}(\mathcal{W}_{(n)})\\ &\mbox{with}\quad f_{w}(\mathcal{W}_{(n)})=\sum_{i=1}^{N}\sum_{j=1}^{M}\left[\mathcal{W}_{(n)}\circledast\left(\mathcal{X}_{(n)}-\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\right)^{2}\right]_{ij}\\ &+\phi(\mathcal{W}_{(n)}),\end{split} (19)

where 𝐁(n)T=𝒮(n)(in𝐀(i)T)\mathbf{B}^{(n)T}=\mathcal{S}_{(n)}\left(\otimes_{i\neq n}\mathbf{A}^{(i)T}\right), [𝐗]ij[\mathbf{X}]_{ij} denotes the iith row and jjth column element of matrix 𝐗\mathbf{X}.

Let (n)=𝒳(n)𝐀(n)𝐁(n)T\mathcal{E}_{(n)}=\mathcal{X}_{(n)}-\mathbf{A}^{(n)}\mathbf{B}^{(n)T}, then

𝒲(n)=[(n)fw(𝒲(n))][(n)]=[(n)(1κσ((n)))][(n)]exp((𝒳(n)𝐀(n)𝐁(n)T)22σ2),\begin{split}\mathcal{W}_{(n)}^{\star}&=\frac{[\nabla_{\mathcal{E}_{(n)}}f_{w}(\mathcal{W}_{(n)})]}{[\mathcal{E}_{(n)}]}=\frac{[\nabla_{\mathcal{E}_{(n)}}\left(1-\kappa_{\sigma}(\mathcal{E}_{(n)})\right)]}{[\mathcal{E}_{(n)}]}\\ &\varpropto\exp\left(-\frac{\left(\mathcal{X}_{(n)}-\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\right)^{2}}{2\sigma^{2}}\right),\end{split} (20)

where [][]\frac{[\cdot]}{[\cdot]} denotes the element-wise division operation.
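A small sketch of the weight update in Eq. (20) is given below: with a Gaussian kernel, each entry's weight is proportional to exp(-E^2/(2σ^2)). Setting the kernel width from the residuals themselves is our own heuristic, not the paper's rule.

```python
import numpy as np

def cim_weights(X, X_hat, sigma=None):
    """Entrywise weights proportional to exp(-E^2 / (2 sigma^2)), cf. Eq. (20)."""
    E = X - X_hat
    if sigma is None:
        sigma = np.sqrt(np.mean(E ** 2)) + 1e-12     # heuristic kernel width (assumption)
    return np.exp(-E ** 2 / (2 * sigma ** 2))

X = np.random.rand(8, 8, 3)
X_hat = X + 0.05 * np.random.randn(8, 8, 3)
X_hat[0, 0, 0] += 10.0                               # simulate one grossly corrupted entry
W = cim_weights(X, X_hat)
print(W[0, 0, 0] < W[1, 1, 1])                       # True: the outlier is down-weighted
```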

2.3.2 Optimization of factor matrices 𝐀(n)\mathbf{A}^{(n)}

We use the Lagrange multiplier method and consider the mode-nn unfolding form, then:

𝐀(n)=argminfA(𝐀(n)),withfA(𝐀(n))=i=1Ij=1J[𝒲(n)(𝒳(n)𝐀(n)𝐁(n)T)2]ij+λ2Tr[𝐀(N)T𝐋𝐀(N)]+Tr[𝛀n𝐀(n)]s.t.𝛀0.\begin{split}&\mathbf{A}^{(n)\star}=\arg\min f_{A}(\mathbf{A}^{(n)}),\\ &\mbox{with}\quad f_{A}(\mathbf{A}^{(n)})=\sum_{i=1}^{I}\sum_{j=1}^{J}\left[\mathcal{W}_{(n)}\circledast\left(\mathcal{X}_{(n)}-\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\right)^{2}\right]_{ij}\\ &+\frac{\lambda}{2}\mathrm{Tr}\left[\mathbf{A}^{(N)T}\mathbf{L}\mathbf{A}^{(N)}\right]+\mathrm{Tr}\left[\boldsymbol{\Omega}_{n}\mathbf{A}^{(n)}\right]\\ &\rm{s.t.}\quad\boldsymbol{\Omega}\geq 0.\end{split} (21)

If nNn\neq N, the objective function with respect to 𝐀(n)\mathbf{A}^{(n)} can be transformed as:

fA(𝐀(n))=i=1I([𝒳(n)]i[𝐀(n)]i𝐁(n)T)𝐓i([𝒳(n)]i[𝐀(n)]i𝐁(n)T)T+Tr(𝛀n𝐀(n)),\begin{split}&f_{A}(\mathbf{A}^{(n)})=\sum_{i=1}^{I}\left([\mathcal{X}_{(n)}]_{i\cdot}-[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\right)\mathbf{T}_{i}\\ &\left([\mathcal{X}_{(n)}]_{i\cdot}-[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\right)^{T}+\mathrm{Tr}(\boldsymbol{\Omega}_{n}\circledast\mathbf{A}^{(n)}),\end{split} (22)

where 𝐓i=Diag([𝒲(n)]i)0M×M\mathbf{T}_{i}=\mathrm{Diag}\left([\mathcal{W}_{(n)}]_{i\cdot}\right)\in\mathbb{R}^{M\times M}_{\geq 0}, M=I1××In1×In+1××INM=I_{1}\times\dots\times I_{n-1}\times I_{n+1}\times\dots\times I_{N}, and 𝛀n\boldsymbol{\Omega}_{n} is the nonnegative Lagrange multiplier for the nonnegativity constraint. The partial derivative of fA(𝐀(n))f_{A}(\mathbf{A}^{(n)}) with respect to 𝐀(n)\mathbf{A}^{(n)} is:

fA([𝐀(n)]i)[𝐀(n)]i=2[[𝒳(n)]i𝐓i𝐁(n)]+2[[𝐀(n)]i𝐁(n)T𝐓i𝐁(n)]+[𝛀n]i.\begin{split}\frac{\partial f_{A}([\mathbf{A}^{(n)}]_{i\cdot})}{\partial{[\mathbf{A}^{(n)}}]_{i\cdot}}=&-2\left[[\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}\mathbf{B}^{(n)}\right]\\ &+2\left[[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}\right]+[\boldsymbol{\Omega}_{n}]_{i\cdot}.\end{split} (23)

Using the KKT conditions [𝛀n]ik[𝐀(n)]ik=0[\boldsymbol{\Omega}_{n}]_{ik}[\mathbf{A}^{(n)}]_{ik}=0, we obtain the following equations

([𝒳(n)]i𝐓i𝐁(n))+([𝐀(n)]i𝐁(n)T𝐓i𝐁(n))=𝟎.\begin{split}-\left([\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}\mathbf{B}^{(n)}\right)+\left([\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}\right)=\mathbf{0}.\end{split} (24)

Then, we obtain the update rules of 𝐀(n)\mathbf{A}^{(n)}:

[𝐀(n)]i=[𝐀(n)]i[([𝒳(n)]i𝐓i𝐁(n))][([𝐀(n)]i𝐁(n)T𝐓i𝐁(n))]𝐀(n)=𝐀(n)[(𝒲(n)𝒳(n)𝐁(n))][(𝒲(n)(𝐀(n)𝐁(n)T)𝐁(n))].\begin{split}[\mathbf{A}^{(n)}]_{i\cdot}&=[\mathbf{A}^{(n)}]_{i\cdot}\circledast\frac{\left[\left([\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}\mathbf{B}^{(n)}\right)\right]}{\left[\left([\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}\right)\right]}\\ \mathbf{A}^{(n)}&=\mathbf{A}^{(n)}\circledast\frac{\left[\left(\mathcal{W}_{(n)}\circledast\mathcal{X}_{(n)}\mathbf{B}^{(n)}\right)\right]}{\left[\left(\mathcal{W}_{(n)}\circledast\left(\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\right)\mathbf{B}^{(n)}\right)\right]}.\end{split} (25)
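The sketch below applies the weighted multiplicative rule of Eq. (25) to a mode-n factor matrix; W_n and X_n stand for the mode-n unfoldings of the weight and data tensors and B_n for B^(n) (all names are ours, and a small constant is added to the denominator for numerical safety).

```python
import numpy as np

def update_factor(A_n, X_n, W_n, B_n, eps=1e-12):
    """Multiplicative update of Eq. (25). A_n: (I_n, r_n); X_n, W_n: (I_n, M); B_n: (M, r_n)."""
    numer = (W_n * X_n) @ B_n
    denom = (W_n * (A_n @ B_n.T)) @ B_n + eps        # eps avoids division by zero
    return A_n * numer / denom                       # entrywise, so nonnegativity is preserved

I_n, M, r_n = 6, 20, 3
A_n = np.random.rand(I_n, r_n)
X_n = np.random.rand(I_n, M)
W_n = np.random.rand(I_n, M)
B_n = np.random.rand(M, r_n)
A_n = update_factor(A_n, X_n, W_n, B_n)
```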

If n=Nn=N, the objective function with respect to 𝐀(N)\mathbf{A}^{(N)} can be represented as:

fA(𝐀(N))=i=1I([𝒳(N)]i[𝐀(N)]i𝐁(N)T)𝐓i([𝒳(N)]i[𝐀(N)]i𝐁(N)T)T+Tr(𝛀N𝐀(N))+λ2Tr(𝐀(N)T𝐋𝐀(N)).\begin{split}&f_{A}(\mathbf{A}^{(N)})=\sum_{i=1}^{I}\left([\mathcal{X}_{(N)}]_{i\cdot}-[\mathbf{A}^{(N)}]_{i\cdot}\mathbf{B}^{(N)T}\right)\mathbf{T}_{i}\\ &\left([\mathcal{X}_{(N)}]_{i\cdot}-[\mathbf{A}^{(N)}]_{i\cdot}\mathbf{B}^{(N)T}\right)^{T}+\mathrm{Tr}(\boldsymbol{\Omega}_{N}\circledast\mathbf{A}^{(N)})\\ &+\frac{\lambda}{2}\mathrm{Tr}\left(\mathbf{A}^{(N)T}\mathbf{L}\mathbf{A}^{(N)}\right).\end{split} (26)

The partial derivative of fA(𝐀(N))f_{A}(\mathbf{A}^{(N)}) is:

fA([𝐀(N)]i)[𝐀(N)]i=2([𝒳(n)]i𝐓i𝐁(n))+2([𝐀(n)]i𝐁(n)T𝐓i𝐁(n))+λ𝐋𝐀(N)+𝛀N.\begin{split}&\frac{\partial f_{A}([\mathbf{A}^{(N)}]_{i\cdot})}{\partial[{\mathbf{A}^{(N)}}]_{i\cdot}}=-2\left([\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}\mathbf{B}^{(n)}\right)\\ &+2\left([\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}\right)+\lambda\mathbf{L}\mathbf{A}^{(N)}+\boldsymbol{\Omega}_{N}.\end{split} (27)

Using the KKT conditions [𝛀N]ik[𝐀(N)]ik=0[\boldsymbol{\Omega}_{N}]_{ik}[\mathbf{A}^{(N)}]_{ik}=0, we obtain the following equations:

2(𝒲(N)(𝐀(N)𝐁(N)T)𝐁(N))2(𝒲(N)𝒳(N)𝐁(N))+λ𝐋𝐀(N)=𝟎.\begin{split}&2\left(\mathcal{W}_{(N)}\circledast(\mathbf{A}^{(N)}\mathbf{B}^{(N)T})\mathbf{B}^{(N)}\right)\\ &-2\left(\mathcal{W}_{(N)}\circledast\mathcal{X}_{(N)}\mathbf{B}^{(N)}\right)+\lambda\mathbf{L}\mathbf{A}^{(N)}=\mathbf{0}.\end{split} (28)

Then, we obtain the update rules of 𝐀(N)\mathbf{A}^{(N)}:

\begin{split}\mathbf{A}^{(N)}&=\mathbf{A}^{(N)}\circledast\frac{\left[\left(\mathcal{W}_{(N)}\circledast\mathcal{X}_{(N)}\mathbf{B}^{(N)}\right)+\lambda\mathbf{V}\mathbf{A}^{(N)}\right]}{\left[\left(\mathcal{W}_{(N)}\circledast\left(\mathbf{A}^{(N)}\mathbf{B}^{(N)T}\right)\mathbf{B}^{(N)}\right)+\lambda\mathbf{D}\mathbf{A}^{(N)}\right]}.\end{split} (29)
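For the last-mode factor matrix, the graph-regularized rule of Eq. (29) adds λVA^(N) to the numerator and λDA^(N) to the denominator of the previous update; a minimal sketch under the same naming assumptions follows. Since every term in the ratio is nonnegative, the update keeps A^(N) nonnegative.

```python
import numpy as np

def update_last_factor(A_N, X_N, W_N, B_N, V, D, lam, eps=1e-12):
    """Graph-regularized multiplicative update of Eq. (29); V, D come from the affinity graph."""
    numer = (W_N * X_N) @ B_N + lam * (V @ A_N)
    denom = (W_N * (A_N @ B_N.T)) @ B_N + lam * (D @ A_N) + eps
    return A_N * numer / denom

I_N, M, r_N = 5, 12, 2
A_N = np.random.rand(I_N, r_N); X_N = np.random.rand(I_N, M)
W_N = np.random.rand(I_N, M);   B_N = np.random.rand(M, r_N)
V = np.random.rand(I_N, I_N);   V = (V + V.T) / 2
D = np.diag(V.sum(axis=1))
A_N = update_last_factor(A_N, X_N, W_N, B_N, V, D, lam=1e-2)
```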

2.3.3 Optimization of the core tensor 𝒮\mathcal{S}

For the subproblem of core tensor 𝒮\mathcal{S}, we consider the vectorization form of (17):

𝒮=argminf𝒮(𝒮),withf𝒮(𝒮)=vec(𝟏)T[vec(𝒲)(vec(𝒳)𝐅vec(𝒮))2]+vec(𝒮)Tvec(𝛀𝒮),\begin{split}&\mathcal{S}^{\star}=\arg\min f_{\mathcal{S}}(\mathcal{S}),\\ &\mbox{with}\quad f_{\mathcal{S}}(\mathcal{S})=\mathrm{vec}(\mathbf{1})^{T}\left[\mathrm{vec}(\mathcal{W})\circledast\left(\mathrm{vec}(\mathcal{X})-\mathbf{F}\mathrm{vec}(\mathcal{S})\right)^{2}\right]\\ &+\mathrm{vec}(\mathcal{S})^{T}\mathrm{vec}(\boldsymbol{\Omega}_{\mathcal{S}}),\end{split} (30)

where \mathbf{1}\in\mathbb{R}^{I_{1}\times I_{2}\times\dots\times I_{N}} denotes the all-one tensor, so that \mathrm{vec}(\mathbf{1}) is the all-one vector of length \prod_{i=1}^{N}I_{i}, \mathbf{F}=\mathbf{A}^{(1)}\otimes\mathbf{A}^{(2)}\otimes\dots\otimes\mathbf{A}^{(N)}\in\mathbb{R}^{I_{1}I_{2}\dots I_{N}\times r_{1}r_{2}\dots r_{N}}, and \mathrm{vec}(\boldsymbol{\Omega}_{\mathcal{S}}) denotes the Lagrange multipliers of \mathrm{vec}(\mathcal{S}).

f𝒮(𝒮)={[vec(𝒳)𝐅vec(𝒮)]T𝐓𝒮[vec(𝒳)𝐅vec(𝒮)]}+vec(𝒮)Tvec(𝛀𝒮),\begin{split}f_{\mathcal{S}}(\mathcal{S})=&\Bigg{\{}\left[\mathrm{vec}(\mathcal{X})-\mathbf{F}\mathrm{vec}(\mathcal{S})\right]^{T}\mathbf{T}_{\mathcal{S}}\left[\mathrm{vec}(\mathcal{X})-\mathbf{F}\mathrm{vec}(\mathcal{S})\right]\Bigg{\}}\\ &+\mathrm{vec}(\mathcal{S})^{T}\mathrm{vec}(\boldsymbol{\Omega}_{\mathcal{S}}),\end{split} (31)

where 𝐓𝒮=Diag(vec(𝒲))I1I2IN×I1I2IN\mathbf{T}_{\mathcal{S}}=\mathrm{Diag}(\mathrm{vec}(\mathcal{W}))\in\mathbb{R}^{I_{1}I_{2}\dots I_{N}\times I_{1}I_{2}\dots I_{N}}.

f𝒮(𝒮)vec(𝒮)=2𝐅T𝐓𝒮𝐅vec(𝒮)2𝐅T𝐓𝒮vec(𝒳)+vec(𝛀𝒮).\begin{split}\frac{\partial f_{\mathcal{S}}(\mathcal{S})}{\partial\mathrm{vec}(\mathcal{S})}=2\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}\mathbf{F}\mathrm{vec}(\mathcal{S})-2\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}\mathrm{vec}(\mathcal{X})+\mathrm{vec}(\boldsymbol{\Omega}_{\mathcal{S}}).\end{split} (32)

Using the KKT conditions vec(𝒮)vec(𝛀𝒮)=𝟎\mathrm{vec}(\mathcal{S})\circledast\mathrm{vec}(\boldsymbol{\Omega}_{\mathcal{S}})=\mathbf{0}, where 𝟎\mathbf{0} denotes the all-zero vector, we obtain the following equation

[𝐅T𝐓𝒮𝐅vec(𝒮)𝐅T𝐓𝒮vec(𝒳)]vec(𝒮)=𝟎.\begin{split}\left[\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}\mathbf{F}\mathrm{vec}(\mathcal{S})-\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}\mathrm{vec}(\mathcal{X})\right]\circledast\mathrm{vec}(\mathcal{S})=\mathbf{0}.\end{split} (33)

Then, we obtain the update rules of 𝒮\mathcal{S}:

vec(𝒮)=vec(𝒮)[𝐅T𝐓𝒮vec(𝒳)][𝐅T𝐓𝒮𝐅vec(𝒮)]𝒮=𝒮[𝐅T𝐓𝒮(𝒳)][𝐅T𝐓𝒮𝐅(𝒮)].\begin{split}\mathrm{vec}(\mathcal{S})&=\mathrm{vec}(\mathcal{S})\circledast\frac{[\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}\mathrm{vec}(\mathcal{X})]}{[\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}\mathbf{F}\mathrm{vec}(\mathcal{S})]}\\ \mathcal{S}&=\mathcal{S}\circledast\frac{[\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}(\mathcal{X})]}{[\mathbf{F}^{T}\mathbf{T}_{\mathcal{S}}\mathbf{F}(\mathcal{S})]}.\end{split} (34)
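The sketch below is a direct, memory-hungry realization of the core update in Eq. (34): it forms F as the Kronecker product of the factor matrices and applies the multiplicative rule on the vectorized tensors. Practical implementations avoid building F explicitly; this version is only illustrative and relies on NumPy's default row-major vectorization, which matches the Kronecker ordering A^(1) ⊗ ... ⊗ A^(N).

```python
import numpy as np
from functools import reduce

def update_core(S, X, W, factors, eps=1e-12):
    """Multiplicative core update of Eq. (34). S: (r_1,...,r_N); X, W: (I_1,...,I_N)."""
    F = reduce(np.kron, factors)                     # (prod I_n) x (prod r_n), built explicitly
    x, w, s = X.ravel(), W.ravel(), S.ravel()
    numer = F.T @ (w * x)                            # F^T T_S vec(X)
    denom = F.T @ (w * (F @ s)) + eps                # F^T T_S F vec(S)
    return (s * numer / denom).reshape(S.shape)

factors = [np.random.rand(4, 2), np.random.rand(5, 2), np.random.rand(3, 2)]
S = np.random.rand(2, 2, 2)
X = np.random.rand(4, 5, 3)
W = np.ones_like(X)
S = update_core(S, X, W, factors)
```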

2.4 RMNTF-Huber

Robust statistics perform well in model reconstruction when the observations are contaminated by noise or outliers. Popular M-estimators [47], such as the Huber and Cauchy functions, have been proposed to deal with noisy data.

In this section, we take the Huber function as the reconstruction error term g()g(\cdot) to measure the quality of the approximation; it exploits the connection between the L1L_{1} norm and the L2L_{2} norm:

ghuber()={2 if||c2c||c2if||c,g_{\rm{huber}}(\mathcal{E})=\left\{\begin{array}[]{ll}\mathcal{E}^{2}\quad\quad\quad\quad\mbox{ if}\quad|\mathcal{E}|\leq c\\ 2c|\mathcal{E}|-c^{2}\quad\mbox{if}\quad|\mathcal{E}|\geq c,\end{array}\right. (35)

where cc is the cutoff parameter that trades off between the L1L_{1}-norm and the L2L_{2}-norm.

Substituting the Huber function on each entry in (3), we have the RMNTF-Huber by minimizing the following objective function:

min𝐀(1),,𝐀(n),𝒮ghuber()+λ2i=1INj=1IN𝐚i(N)𝐚j(N)22vijs.t.𝒳^=𝒮×n=1N𝐀(n),𝒮0,𝐀(n)0,n{1,,N},\begin{split}\min_{\mathbf{A}^{(1)},\dots,\mathbf{A}^{(n)},\mathcal{S}}g_{\rm{huber}}(\mathcal{E})+\frac{\lambda}{2}\sum_{i=1}^{I_{N}}\sum_{j=1}^{I_{N}}\|\mathbf{a}_{i}^{(N)}-\mathbf{a}_{j}^{(N)}\|_{2}^{2}v_{ij}\\ \mathrm{s.t.}\hat{\mathcal{X}}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)},\mathcal{S}\geq 0,\mathbf{A}^{(n)}\geq 0,\forall n\in\{1,\dots,N\},\end{split} (36)

where =𝒳𝒳^=𝒳𝒮×n=1N𝐀(n)\mathcal{E}=\mathcal{X}-\hat{\mathcal{X}}=\mathcal{X}-\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}.

Following the equation (10), we obtain the optimization of weighted tensor 𝒲\mathcal{W}:

𝒲(n)={1if||cc|| if||c.\mathcal{W}_{(n)}^{\star}=\left\{\begin{array}[]{ll}1\quad\quad\mbox{if}\quad|\mathcal{E}|\leq c\\ \frac{c}{|\mathcal{E}|}\quad\mbox{ if}\quad|\mathcal{E}|\geq c.\end{array}\right. (37)

The optimization of factor matrices 𝐀(n)\mathbf{A}^{(n)} and the core tensor 𝒮\mathcal{S} are the same as RMNTF-CIM. Here, we note that the cutoff parameter cc is set to the median of reconstruction errors, i.e., c=median(||)c=\mathrm{median}(|\mathcal{E}|).
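A short sketch of the Huber weights in Eq. (37), with the cutoff c set to the median absolute residual as noted above:

```python
import numpy as np

def huber_weights(E):
    """Entrywise Huber weights of Eq. (37): 1 for small residuals, c/|E| for large ones."""
    c = np.median(np.abs(E))
    absE = np.maximum(np.abs(E), 1e-12)              # guard against division by zero
    return np.where(absE <= c, 1.0, c / absE)

W = huber_weights(np.random.randn(4, 4))
```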

2.5 RMNTF-Cauchy

For any tensor 𝒳\mathcal{X}, we define the reconstruction error of RMNTF-Cauchy as follows:

min𝐀(1),,𝐀(n),𝒮gCauchy()+λ2i=1INj=1IN𝐚i(N)𝐚j(N)22vijs.t.𝒳^=𝒮×n=1N𝐀(n),𝒮0,𝐀(n)0,\begin{split}\min_{\mathbf{A}^{(1)},\dots,\mathbf{A}^{(n)},\mathcal{S}}g_{\rm{Cauchy}}(\mathcal{E})+\frac{\lambda}{2}\sum_{i=1}^{I_{N}}\sum_{j=1}^{I_{N}}\|\mathbf{a}_{i}^{(N)}-\mathbf{a}_{j}^{(N)}\|_{2}^{2}v_{ij}\\ \mathrm{s.t.}\quad\hat{\mathcal{X}}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)},\quad\mathcal{S}\geq 0,\quad\mathbf{A}^{(n)}\geq 0,\end{split} (38)

where gCauchy(x)=ln(1+x),x0g_{\rm{Cauchy}}(x)=\ln(1+x),x\geq 0.

Following equation (10), the optimal weight tensor 𝒲\mathcal{W} can be represented as:

𝒲(n)=11+(γ)2.\begin{split}\mathcal{W}_{(n)}^{\star}=\frac{1}{1+\left(\frac{\mathcal{E}}{\gamma}\right)^{2}}.\end{split} (39)

The optimization of factor matrices 𝐀(n)\mathbf{A}^{(n)} and the core tensor 𝒮\mathcal{S} are the same as RMNTF-CIM.
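Similarly, a sketch of the Cauchy weights in Eq. (39); choosing the scale γ from a robust estimate of the residual spread is our own assumption, since the paper treats γ as a parameter.

```python
import numpy as np

def cauchy_weights(E, gamma=None):
    """Entrywise Cauchy weights of Eq. (39): 1 / (1 + (E/gamma)^2)."""
    if gamma is None:
        gamma = 1.4826 * np.median(np.abs(E - np.median(E))) + 1e-12  # MAD-based scale (assumption)
    return 1.0 / (1.0 + (E / gamma) ** 2)

W = cauchy_weights(np.random.randn(4, 4))
```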

2.6 Discussion

The convergence of the algorithms is guaranteed by the following theorem:

Theorem 1

Updating the projection matrices {𝐀(n)}n=1N\{\mathbf{A}^{(n)}\}_{n=1}^{N}, the core tensor 𝒮\mathcal{S} and the weight tensor 𝒲\mathcal{W} iteratively according to (25), (29), (34) and (20) leads to a nonincreasing objective function in (17) and converges to a stationary point.

Proof 1

See Appendix A.1.1 for the proof of Theorem 1.

The robustness of RMNTF is guaranteed by the following Theorem:

Theorem 2

Suppose there are training images (𝒳1,𝒳2,,𝒳t)(\mathcal{X}^{1},\mathcal{X}^{2},\dots,\mathcal{X}^{t}) and a test image 𝒳t\mathcal{X}^{t\prime}. If the optimal parameters, i.e., the core tensor 𝒮\mathcal{S}, the factor matrices {𝐀(n)}n=1N\{\mathbf{A}^{(n)}\}_{n=1}^{N} and the weight tensor 𝒲\mathcal{W}, are learned from the training images, then the low-rank representation 𝐚t\mathbf{a}^{\prime}_{t} of the test image learned by RMNTF has the closed-form solution:

𝐚t=[𝐁𝐁T+(j=1tvj)𝐈]1(𝒳t(N)𝐁T+λj=1t[𝐀0(N)]jvj),\begin{split}\!\mathbf{a}^{\prime}_{t}\!=\!\left[\!\mathbf{B}^{\prime}\mathbf{B}^{\prime T}\!+\!\left(\!\sum_{j=1}^{t}v_{j}\!\right)\!\mathbf{I}\right]^{-1}\!\left(\!{\mathcal{X}^{t\prime}}_{(N)}\mathbf{B}^{\prime T}\!+\!\lambda\!\sum_{j=1}^{t}[\mathbf{A}_{0}^{(N)}]_{j\cdot}v_{j}\!\right)\!,\end{split} (40)

where 𝐁=𝒲(N)𝐁(N)T\mathbf{B}^{\prime}=\sqrt{\mathcal{W}_{(N)}}\circledast\mathbf{B}^{(N)T}, 𝐁(N)T=𝒮(N)(i=1N1𝐀(i)T)0rN×I1××IN1\mathbf{B}^{(N)T}=\mathcal{S}_{(N)}\left(\otimes_{i=1}^{N-1}\mathbf{A}^{(i)T}\right)\in\mathbb{R}_{\geq 0}^{r_{N}\times I_{1}\times\dots\times I_{N-1}}, 𝐈rN×rN\mathbf{I}\in\mathbb{R}^{r_{N}\times r_{N}} denotes an identity matrix, and 𝐕\mathbf{V} denotes the affinity of the data points. If the test image is an outlier, then the weight tensor 𝒲\mathcal{W} is constrained to small values, and the low-rank representation 𝐚t\mathbf{a}_{t}^{\prime} of this outlier will be repaired through the manifold structure of the training data.

Proof 2

See Appendix A.1.2 for the proof of Theorem 2.

The uniqueness of RMNTF is guaranteed by the following Theorem:

Theorem 3

Since the solution of 𝐀(N)\mathbf{A}^{(N)} in equation (A.39) is unique and 𝐀(n),n=1,,N1\mathbf{A}^{(n)},n=1,\dots,N-1, can be uniquely estimated from 𝒵^\hat{\mathcal{Z}} by Lemma 2, the mode-N unfolding of the RMNTF model 𝒳N=𝐀(N)𝒮(N)𝐙(N)T\mathcal{X}_{N}=\mathbf{A}^{(N)}\mathcal{S}_{(N)}\mathbf{Z}^{(N)T} has an essentially unique solution. By Lemma 1, the RMNTF model 𝒳=𝒮×n=1N𝐀(n)\mathcal{X}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)} is therefore essentially unique.

Proof 3

See Appendix A.1.3 for the proof of Theorem 3.

Figure 4: Basis images learned by NMF, GNMF, NTD, GNTD and RMNTF from 25 randomly chosen subjects of the AT&T ORL dataset. (a) Original images. (b) NMF. (c) GNMF. (d) NTD. (e) GNTD. (f) RMNTF-CIM. Each image denotes a basis vector learned by the above-mentioned methods.

3 Experiments

In this section, we compare the proposed RMNTF-CIM, RMNTF-Huber and RMNTF-Cauchy with ten nonnegative matrix and tensor factorization methods on five image datasets.

3.1 Datasets

We conducted experiments on the COIL100, USPS, FEI, ORL and FERET image datasets.

COIL100 (https://www.kaggle.com/jessicali9530/coil100) is an object categorization image database including 100 object classes, each of which contains 72 images of size 128×128 captured from different viewing angles. For preprocessing, we resize each RGB image to 64×64 using nearest-neighbor interpolation. As a result, each object image is represented as a 64×64×3 tensor. In total, we have 7200 tensor objects.

USPS Dataset (https://www.kaggle.com/bistaumanga/usps-dataset) consists of 11000 images of handwritten digits 0–9 of size 16×16. Each digit class contains 1100 images.

FEI Part 1 Dataset (https://fei.edu.br/~cet/facedatabase.html) is a subset of the FEI database, which consists of 700 color images of size 480×640 collected from 50 individuals. Each individual has 14 different images under different poses and facial expressions. In our experiment, the images are resized to 48×64 pixels. These images finally form a fourth-order 48×64×3×700 tensor.

AT&T ORL Dataset (https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html): This database contains 400 grayscale face images collected from 40 individuals, with 10 images per individual taken under varying lighting conditions, facial expressions and facial details.

FERET Dataset (https://www.nist.gov/itl/products-and-services/color-feret-database): This dataset [48] collects 14125 grayscale face images acquired from 1199 subjects. It is widely used for evaluating face recognition and clustering problems. In our experiment, we use a subset of 1400 images of size 80×80 from 200 subjects, where each subject contributes 7 images with varying postures, lighting conditions and shooting directions, covering different genders and races. In total, we stack the images into a third-order tensor of size 80×80×1400.

For COIL100, we randomly selected 5, 10, 20, 30 and 50 categories from the whole COIL100 for evaluation. For USPS, we randomly selected 3, 5 and 7 categories as well as the whole 10 categories for evaluation. For the FEI, ORL and FERET datasets, we randomly selected 5, 10, 15 and 20 categories for evaluation. For each comparison, we report the average results over 50 Monte-Carlo runs.

Figure 5: The convergence of the proposed RMNTF. (a)\sim(e) are the convergence curves of RMNTF-CIM on five datasets: COIL100, FEI, FERET, ORL and USPS, respectively. (f)\sim(j) are the convergence curves of RMNTF-Huber on five datasets, respectively. (k)\sim(o) are the convergence curves of RMNTF-Cauchy on five datasets, respectively.
Figure 6: Evaluation of the proposed methods on the COIL100 database contaminated by Laplace noise and salt & pepper noise, respectively. (a) Average accuracy and NMI on the subset of COIL100 which contains 5 categories; the first two figures show the results under Laplace noise and the remaining two show the results under salt & pepper noise. (b) Average evaluation on the subset of COIL100 which contains 10 categories. (c) Average evaluation on the subset of COIL100 which contains 20 categories. (d) Average evaluation on the subset of COIL100 which contains 30 categories. (e) Average evaluation on the subset of COIL100 which contains 50 categories.

3.2 Compared algorithms

The hyperparameters of RMNTF-CIM, RMNTF-Huber, and RMNTF-Cauchy in all experiments were set as follows: λ=105,p=3\lambda=10^{5},p=3, unless otherwise stated. We compared RMNTF with three NMF methods and seven NTD-based methods, listed as follows:

  • 1.

    Nonnegative Matrix Factorization (NMF) [8]: We unfold the original tensorial data into a matrix, apply the NMF algorithm to obtain a low-rank activation matrix 𝐕\mathbf{V}, and use 𝐕\mathbf{V} for the downstream tasks.

  • 2.

    Graph Regularized Nonnegative Matrix Factorization (GNMF) [19]: We unfold the tensorial data into a matrix 𝐗\mathbf{X} and reduce its dimensionality to KK by GNMF, where the dictionary matrix 𝐔\mathbf{U} is used in the graph regularization term. The regularization coefficient is λ=10\lambda=10, and the hyperparameter is p=3p=3.

  • 3.

    Correntropy Induced Metric Nonnegative Matrix Factorization (CIMNMF) [49]: Similar to the above, we unfold the tensorial data into a matrix and apply CIMNMF. The loss function between the unfolded data and the reconstructed data is based on the correntropy induced metric, which handles outlier rows by incorporating structural knowledge about the outliers.

  • 4.

    Nonnegative Tucker Decomposition (NTD) [1]: NTD works directly on the tensorial data. It is constructed by the tucker model with nonnegativity constraints and updated by multiplicative rules. This method can be considered as a special case of RMNTF when all the entries of weight tensor 𝒲\mathcal{W} are 11, and λ=0\lambda=0.

  • 5.

    Nonnegative Tucker Decomposition with alpha-Divergence (α\alphaNTD) [50]: We apply α\alphaNTD directly to the tensorial data. It considers the α\alpha-divergence as a discrepancy measure and derives multiplicative update rules for NTD. The hyperparameter is α=1\alpha=1.

  • 6.

    Sparse Nonnegative Tucker Decomposition (SparseNTD) [51]: The tensorial data is directly utilized by SparseNTD. It imposes sparsity and nonnegativity constraints on the core tensor and the factor matrices. The hyperparameter is λ=0.5\lambda=0.5.

  • 7.

    Graph Regularized Nonnegative Tucker Decomposition (GNTD) [40]: GNTD uses the tensorial data directly; it constructs a nearest-neighbor graph to maintain the intrinsic manifold structure of the tensor and applies this constraint on the NNth factor matrix. The hyperparameters are λ=105\lambda=10^{5} and p=3p=3.

  • 8.

    Locally Linear Embedding Regularized Nonnegative Tucker Decomposition (LLENTD) [31]: We directly use tensor in LLENTD. It incorporates Laplacian Eigenmaps as manifold regularization terms into the least square form of nonnegative tucker model. The hyperparameters λ=105\lambda=10^{5}, and p=3p=3.

  • 9.

    Manifold Regularization Nonnegative Tucker Decomposition (MRNTD) [29]: MRNTD utilizes tensor data directly. It employs the manifold regularization terms for the core tensors constructed in the NTD. The hyperparameters λ=105\lambda=10^{5}, and p=3p=3.

  • 10.

    Hypergraph Regularized Nonnegative Tensor Factorization (hyperNTD) [33]: We feed the tensorial data directly into hyperNTD. It incorporates the higher-order relationships among nearest neighborhoods into the nonnegative Tucker decomposition. The hyperparameters are λ=105\lambda=10^{5} and p=3p=3.

Figure 7: Evaluation of proposed methods on FEI database contaminated by Laplace noise and salt & pepper noise, respectively, the first two rows show the results contaminated by Laplace noise and the last two rows show the results contaminated by salt & pepper noise. (a) Average accuracy and NMI on the subset of FEI which contains 55 categories. (b) Average evaluation on the subset of FEI which contains 1010 categories. (c) Average evaluation on the subset of FEI which contains 1515 categories. (d) Average evaluation on the subset of FEI which contains 2020 categories.
Figure 8: Sensitivity of the hyperparameters λ\lambda and pp in terms of ACC and NMI on five datasets. The first three figures in each row show the ACC of RMNTF-Cauchy, RMNTF-CIM and RMNTF-Huber, respectively, under Laplace noise with the deviation set to 120, and the last three show the corresponding NMI of RMNTF-Cauchy, RMNTF-CIM and RMNTF-Huber. (a) Average accuracy and NMI on the subset of COIL100 which contains 10 categories. (b) Average accuracy and NMI on the subset of FEI which contains 10 categories. (c) Average accuracy and NMI on the subset of FERET which contains 10 categories. (d) Average accuracy and NMI on the subset of ORL which contains 10 categories. (e) Average accuracy and NMI on the subset of USPS which contains 10 categories.

3.3 Evaluation Metrics

To evaluate the clustering performance of these algorithms, we adopt two commonly used metrics: 1) accuracy (ACC) and 2) normalized mutual information (NMI). The ACC is defined by

ACC(𝐥,𝐥^)=1ni=1nδ[li,map(l^i)],\begin{split}\operatorname{ACC}({{\mathbf{l}}},{{\mathbf{\hat{l}}}})=\frac{1}{n}\sum\limits_{i=1}^{n}{\delta[{l_{i}},\operatorname{map}({{\hat{l}}_{i}})]},\end{split} (41)

where nn is the total number of samples in a dataset. We have the cluster labels 𝐥^={l^1,l^2,,l^n}{\mathbf{\hat{l}}}=\{{\hat{l}_{1}},{\hat{l}_{2}},\cdots,{\hat{l}_{n}}\} and the ground-truth labels 𝐥={l1,l2,,ln}{\mathbf{l}}=\{{l_{1}},{l_{2}},\cdots,{l_{n}}\}. δ(x,y)\delta(x,y) equals 11 if and only if x=yx=y, and 0 otherwise. map()\operatorname{map}(\cdot) is a permutation mapping function that maps each cluster label l^i\hat{l}_{i} to the best matching ground-truth label.

Another metric NMI is defined by

NMI(𝐥,𝐥^)=MI(𝐥,𝐥^)max[H(𝐥),H(𝐥^)],\begin{split}\operatorname{NMI}({\mathbf{l}},{\mathbf{\hat{l}}})=\frac{{\operatorname{MI}({\mathbf{l}},{\mathbf{\hat{l}}})}}{{\max[\operatorname{H}({\mathbf{l}}),\operatorname{H}({\mathbf{\hat{l}}})]}},\end{split} (42)

where H(𝐥)\operatorname{H}(\mathbf{l}) and H(𝐥^)\operatorname{H}(\mathbf{\hat{l}}) denote the entropy of 𝐥\mathbf{l} and 𝐥^\mathbf{\hat{l}}, respectively, and

\begin{split}\operatorname{MI}({\mathbf{l}},{\mathbf{\hat{l}}})=\sum\limits_{{l_{i}}\in{\mathbf{l}}}{\sum\limits_{{\hat{l}_{i}}\in{\mathbf{\hat{l}}}}{p({l_{i}},{{\hat{l}}_{i}})}}\log_{2}\left[{\frac{{p({l_{i}},{{\hat{l}}_{i}})}}{{p({l_{i}})p({{\hat{l}}_{i}})}}}\right],\end{split} (43)

p(li){p({l_{i}})} and p(l^i){p({{\hat{l}}_{i}})} represent the marginal probability distribution functions of 𝐥{\mathbf{l}} and 𝐥^{\mathbf{\hat{l}}}, respectively, and p(li,l^i){p({l_{i}},{{\hat{l}}_{i}})} is their joint probability distribution function. \operatorname{NMI}({\mathbf{l}},{\mathbf{\hat{l}}}) ranges from 0 to 1; it equals 11 if the two sets of clusters are identical and 0 if they are independent.
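As a practical note, the two metrics can be computed as sketched below: the map(·) step of Eq. (41) is realized with the Hungarian algorithm on the cluster/class contingency table, and NMI with max-entropy normalization matching Eq. (42) is available in scikit-learn (using SciPy and scikit-learn here is our own choice).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels_true, labels_pred):
    """ACC of Eq. (41): best one-to-one matching between cluster labels and ground truth."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    count = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            count[i, j] = np.sum((labels_pred == c) & (labels_true == k))
    row, col = linear_sum_assignment(-count)         # maximize the number of matched samples
    return count[row, col].sum() / labels_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))                                  # 1.0
print(normalized_mutual_info_score(y_true, y_pred, average_method='max'))   # 1.0
```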

3.4 Main Results

3.4.1 Basis Visualization and Convergence

To compare the abilities of NMF, GNMF, NTD, GNTD and RMNTF to extract parts-based features of tensor objects, we visualize the basis images extracted by each algorithm on 25 subjects randomly chosen from the AT&T ORL dataset in Fig. 4. From the experimental results, we note that the proposed RMNTF extracts more localized parts of the face images than the other algorithms, since the images reconstructed from these bases are more homogeneous. This means that RMNTF can provide a sparser representation.

In addition, we investigate the convergence speed of RMNTF. We show the convergence curves of the three implementations of RMNTF on five image datasets in Fig. 5. It is shown that the proposed three algorithms exhibit fast convergence rates, usually taking less than 100 iterations.

3.4.2 Simulated Corruption and Clustering Results

In order to evaluate the robustness of RMNTF, we compare our algorithms with state-of-the-art clustering algorithms on five image datasets contaminated by Laplace noise and salt & pepper noise. The experimental results on COIL100 and FEI are presented in Fig. 6 and Fig. 7. Due to space limitations, the results on FERET, ORL and USPS are presented in the supplementary file.

Laplace noise and salt & pepper noise are common forms of image corruption. However, the cost functions of some traditional methods, such as NMF, usually adopt the Euclidean distance; they cannot deal with such data well, since the actual noise distribution is not consistent with the Gaussian noise assumption.

The simulated Laplace noise obeys a Laplace distribution La(0,δ)La(0,\delta). We set the deviation δ\delta from 40 to 280 and add the noise to each pixel of the images randomly. The first two rows in Fig. 6 and Fig. 7 show the means and standard deviations of the average accuracy and NMI of RMNTF's three implementations and the other representative algorithms. The experimental results confirm that the RMNTF-based methods perform better than the other methods when the deviation of the Laplace noise is within 200. However, when the deviation is excessive and many outliers come into being, the performance of all methods degrades dramatically.

In terms of salt & pepper noise, we set the percentage of contaminated pixels from 5 to 60 percent for each image. The results are shown in the last two rows of Fig. 6 and Fig. 7. As the fraction of corrupted pixels increases, only GNMF remains competitive with the RMNTF-based methods, and both outperform the other methods. When more than 30 percent of the pixels are corrupted, the performance of all algorithms degrades dramatically and converges to similar levels, because it becomes difficult to separate outliers from inliers.
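A minimal sketch of how such corruptions can be simulated is given below; it assumes pixel values in [0, 255], and the function names and the random generator seed are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_laplace_noise(img, delta):
    """Add i.i.d. Laplace(0, delta) noise to every pixel and clip to [0, 255]."""
    noisy = img.astype(float) + rng.laplace(loc=0.0, scale=delta, size=img.shape)
    return np.clip(noisy, 0, 255)

def add_salt_pepper(img, ratio):
    """Corrupt a fraction `ratio` of randomly chosen pixels with 0 or 255."""
    noisy = img.astype(float).copy()
    mask = rng.random(img.shape) < ratio
    noisy[mask] = rng.choice([0.0, 255.0], size=mask.sum())
    return noisy

# toy usage on a random 32x32 "image"
img = rng.integers(0, 256, size=(32, 32)).astype(float)
print(add_laplace_noise(img, delta=120).shape, add_salt_pepper(img, ratio=0.3).shape)
```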

We have also conducted experiments on FERET, ORL, and USPS; the detailed results are shown in Figs. 9, 10, and 11 in Appendix A.

3.5 Effects of the hyperparameters

This subsection investigates the effects of the hyperparameters of the RMNTF-based algorithms on clustering performance. Experiments are conducted on five image datasets contaminated by Laplace noise with the deviation set to 120. Two hyperparameters, \lambda and p, need to be predefined. We report the average accuracy and NMI over 10 categories. For the three implementations of RMNTF, the parameters are set as follows: p is chosen from \{2,3,4,5,6,7,8,9,10\} and \lambda from \{10^{0},10^{1},10^{2},10^{3},10^{4},10^{5},10^{6}\}.

Fig. 8 shows the effect of the hyperparameters. The clustering performance oscillates for \lambda<10^{4} and then becomes stable as \lambda increases over the range [10^{4},10^{6}]; therefore, \lambda can be selected around 10^{4}. The clustering performance is quite stable when the integer p lies in the range [5,8], which means that p hardly influences RMNTF. In summary, our methods are robust to p within a reasonable range, whereas \lambda has a stronger influence and is best chosen around 10^{4}.
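The grid search described above can be organized as in the following sketch, where run_rmntf_clustering is a hypothetical wrapper (not part of our released code) that trains one RMNTF variant for a given (\lambda, p) pair and returns the average accuracy and NMI.

```python
import itertools
import numpy as np

lambdas = [10.0 ** k for k in range(0, 7)]   # {10^0, ..., 10^6}
ps = list(range(2, 11))                      # {2, ..., 10}

def run_rmntf_clustering(lam, p):
    """Hypothetical wrapper: train one RMNTF variant with (lambda, p)
    and return (average accuracy, average NMI). Placeholder values here."""
    return np.random.rand(), np.random.rand()

results = {(lam, p): run_rmntf_clustering(lam, p)
           for lam, p in itertools.product(lambdas, ps)}
best = max(results, key=lambda key: results[key][0])   # select by accuracy
print("best (lambda, p):", best)
```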

4 Conclusion

In this paper, we explored three robust cost functions for nonnegative Tucker factorization with manifold structure. To deal with the minimization of the non-convex cost functions, we derived an iterative half-quadratic optimization, which reduces the problem to a weighted Euclidean-distance NTF. The proposed methods further exploit manifold structure information to enhance clustering accuracy and avoid the rotational ambiguity. Owing to the connection between robust loss functions and robust M-estimators, we adopt the CIM, Huber, and Cauchy functions to replace the traditional Euclidean loss. The proposed methods combine manifold structure with robust loss functions to improve the clustering accuracy under noisy data and outliers. We investigated the effects of Laplacian noise and salt & pepper noise on the performance of the models. Under a small degree of noise, the proposed algorithms improve greatly over the compared algorithms; for example, on the FEI database, when the deviation of the Laplacian noise is 50, the accuracy of our methods improves by 10% in absolute terms compared with GNTD. In general, the experimental results show that the proposed methods outperform the comparison methods in terms of clustering accuracy and normalized mutual information under noisy data or outliers.

References

  • [1] Y.-D. Kim, S. Choi, Nonnegative tucker decomposition, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Minneapolis, MN, USA, 2007, pp. 1–8.
  • [2] A. Schein, M. Zhou, D. Blei, H. Wallach, Bayesian poisson tucker decomposition for learning the structure of international relations, in: Proceedings of the 33rd International Conference on Machine Learning (ICML-16), PMLR, 2016, pp. 2810–2819.
  • [3] A. Karami, M. Yazdi, G. Mercier, Compression of hyperspectral images using discrete wavelet transform and tucker decomposition, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 5 (2) (2012) 444–450.
  • [4] A. Cichocki, R. Zdunek, A. H. Phan, S. Amari, Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation, John Wiley & Sons, 2009.
  • [5] Y. Sun, J. Gao, X. Hong, B. Mishra, B. Yin, Heterogeneous tensor decomposition for clustering via manifold optimization, IEEE Trans. Pattern Anal. Mach. Intell. 38 (3) (2016) 476–489.
  • [6] H. Abdi, L. J. Williams, Principal component analysis, Wiley Interdiscipl. Rev. Comput. Stat. 2 (4) (2010) 433–459.
  • [7] R. M. Gray, Vector quantization, IEEE Assp Mag. 1 (2) (1984) 4–29.
  • [8] D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (Oct. 1999) 788–791.
  • [9] J. Wang, X. Zhang, Deep nmf topic modeling, Neurocomputing 515 (2023) 157–173.
  • [10] T. G. Kolda, B. W. Bader, Tensor decompositions and applications, SIAM Rev. 51 (3) (2009) 455–500.
  • [11] Y. Zhou, H. Lu, Y.-M. Cheung, Probabilistic rank-one tensor analysis with concurrent regularizations, IEEE Trans. Cybern. 51 (7) (2021) 3496–3509.
  • [12] Q. Shi, H. Lu, Y.-M. Cheung, Tensor rank estimation and completion via cp-based nuclear norm, in: Proc. ACM Conf. Inf. Knowl. Manage., Nov. 2017, pp. 949–958.
  • [13] Q. Shi, Y.-M. Cheung, Q. Zhao, H. Lu, Feature extraction for incomplete data via low-rank tensor decomposition with feature regularization, IEEE Trans. Neural Netw. Learn. Syst. 30 (6) (Jun. 2018) 1803–1817.
  • [14] Y. Zhou, Y.-M. Cheung, Bayesian low-tubal-rank robust tensor factorization with multi-rank determination, IEEE Trans. Pattern Anal. Mach. Intell. 43 (1) (May. 2019) 62–76.
  • [15] M. E. Tipping, C. M. Bishop, Probabilistic principal component analysis, J. Roy. Stat. Soc. B (Stat. Methodol.) 61 (3) (1999) 611–622.
  • [16] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika 31 (3) (Sep. 1966) 279–311.
  • [17] L. De Lathauwer, B. De Moor, J. Vandewalle, A multilinear singular value decomposition, SIAM J. Matrix Anal. Appl. 21 (4) (2000) 1253–1278.
  • [18] C. H. Ding, T. Li, M. I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Trans. Pattern Anal. Mach. Intell. 32 (1) (Jan. 2010) 45–55.
  • [19] D. Cai, X. He, J. Han, T. Huang, Graph regularized nonnegative matrix factorization for data representation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (Aug. 2010) 1548–1560.
  • [20] H. Liu, Z. Wu, X. Li, D. Cai, T. Huang, Constrained nonnegative matrix factorization for image representation, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (Jul. 2012) 1299–1311.
  • [21] J. Pan, N. Gillis, Generalized separable nonnegative matrix factorization, IEEE Trans. Pattern Anal. Mach. Intell. 43 (5) (May. 2021) 1546–1561.
  • [22] N. Guan, T. Liu, Y. Zhang, D. Tao, L. S. Davis, Truncated cauchy non-negative matrix factorization, IEEE Trans. Pattern Anal. Mach. Intell. 41 (1) (Jan. 2019) 246–259.
  • [23] B. D. Haeffele, R. Vidal, Structured low-rank matrix factorization: Global optimality, algorithms, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 42 (6) (Jun. 2020) 1468–1482.
  • [24] R. A. Harshman, Foundations of the parafac procedure: Models and conditions for an "explanatory" multimodal factor analysis, UCLA Work. Pap. Phonetics 16 (1) (1970) 1–84.
  • [25] Q. Zhao, L. Zhang, A. Cichocki, Bayesian cp factorization of incomplete tensors with automatic rank determination, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (Sep. 2015) 1751–1763.
  • [26] Q. Zhao, G. Zhou, L. Zhang, A. Cichocki, S.-I. Amari, Bayesian robust tensor factorization for incomplete multiway data, IEEE Trans. Neural Netw. Learn. Syst. 27 (4) (2015) 736–748.
  • [27] X. Chen, Z. Han, Y. Wang, Q. Zhao, D. Meng, L. Lin, Y. Tang, A generalized model for robust tensor factorization with noise modeling by mixture of gaussians, IEEE Trans. Neural Netw. Learn. Syst. 29 (11) (Nov. 2018) 5380–5393.
  • [28] S. Fang, R. M. Kirby, S. Zhe, Bayesian streaming sparse tucker decomposition, in: Proc. 35th Conference on Uncertainty in Artificial Intelligence (UAI), 2021.
  • [29] X. Li, M. K. Ng, G. Cong, Y. Ye, Q. Wu, Mr-ntd: Manifold regularization nonnegative tucker decomposition for tensor data dimension reduction and representation, IEEE Trans. Neural Netw. Learn. Syst. 28 (8) (Aug. 2017) 1787–1800.
  • [30] B. Jiang, C. Ding, J. Tang, B. Luo, Image representation and learning with graph-laplacian tucker tensor decomposition, IEEE Trans. Cybern. 49 (4) (Apr. 2019) 1417–1426.
  • [31] W. Yin, Z. Ma, Le & lle regularized nonnegative tucker decomposition for clustering of high dimensional datasets, Neurocomputing 364 (2019) 77–94.
  • [32] J. Pan, M. K. Ng, Y. Liu, X. Zhang, H. Yan, Orthogonal nonnegative tucker decomposition, SIAM J. Sci. Comput. 43 (1) (2021) B55–B81.
  • [33] W. Yin, Z. Ma, Q. Liu, Hyperntf: A hypergraph regularized nonnegative tensor factorization for dimensionality reduction, arXiv preprint arXiv:2101.06827 (2021).
  • [34] J. Tenenbaum, V. De Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, science 290 (5500) (2000) 2319–2323.
  • [35] S. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, science 290 (5500) (2000) 2323–2326.
  • [36] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering., in: Proc. Conf. Neural Inf. Process. Syst. (NIPS), 2002, pp. 585–591.
  • [37] X. He, P. Niyogi, Locality preserving projections, Proc. Conf. Neural Inf. Process. Syst. (NIPS) (2004) 153–160.
  • [38] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (Mar. 2005) 328–340.
  • [39] Y. Yang, Y. Zhuang, F. Wu, Y. Pan, Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE Trans. Multimedia 10 (3) (Apr. 2008) 437–446.
  • [40] Y. Qiu, G. Zhou, Y. Wang, Y. Zhang, S. Xie, A generalized graph regularized non-negative tucker decomposition framework for tensor data representation, IEEE Trans. Cybern. 52 (1) (2022) 594–607.
  • [41] S. Boyd, L. Vandenberghe, Convex optimization, Cambridge, U.K.: Cambridge Univ. Press, 2004.
  • [42] D. Geman, G. Reynolds, Constrained restoration and the recovery of discontinuities, IEEE Trans. Pattern Anal. Mach. Intell. 14 (3) (Mar. 1992) 367–383.
  • [43] M. Nikolova, R. H. Chan, The equivalence of half-quadratic minimization and the gradient linearization iteration, IEEE Trans. Image Process. 16 (6) (Jun. 2007) 1623–1627.
  • [44] M. Nikolova, M. K. Ng, Analysis of half-quadratic minimization methods for signal and image recovery, SIAM J. Sci. Comput. 27 (3) (2005) 937–966.
  • [45] P. Charbonnier, L. Blanc-Féraud, G. Aubert, M. Barlaud, Deterministic edge-preserving regularization in computed imaging, IEEE Trans. Image Process. 6 (2) (Feb. 1997) 298–311.
  • [46] W. Liu, P. P. Pokharel, J. C. Principe, Correntropy: Properties and applications in non-gaussian signal processing, IEEE Trans. Signal Processing 55 (11) (Nov. 2007) 5286–5298.
  • [47] Z. Zhang, Parameter estimation techniques: A tutorial with application to conic fitting, Image Vis. Comput. 15 (1) (1997) 59–76.
  • [48] P. J. Phillips, H. Moon, S. A. Rizvi, P. J. Rauss, The feret evaluation methodology for face-recognition algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 22 (10) (Oct. 2000) 1090–1104.
  • [49] L. Du, X. Li, Y.-D. Shen, Robust nonnegative matrix factorization via half-quadratic minimization, in: Proc. IEEE 12th Int. Conf. Data Mining (ICDM), Dec. 2012, pp. 201–210.
  • [50] Y.-D. Kim, A. Cichocki, S. Choi, Nonnegative tucker decomposition with alpha-divergence, in: Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2008, pp. 1829–1832.
  • [51] Y. Xu, Alternating proximal gradient method for sparse nonnegative tucker decomposition, Math. Programm. Comput. 7 (1) (2015) 39–70.
  • [52] B. Jiang, C. Ding, J. Tang, Graph-laplacian pca: Closed-form solution and robustness, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3492–3498.
  • [53] C. Van Loan, N. Pitsianis, Approximation with kronecker products, in: Linear Algebra for Large Scale and Real Time Applications, Springer, 1993, pp. 293–314.
  • [54] K. Huang, N. D. Sidiropoulos, A. Swami, Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition, IEEE Trans. Signal Process. 62 (1) (Jan. 2014) 211–224.

Appendix A

Figure 9: Evaluation of the proposed methods on the FERET database contaminated by Laplace noise and salt & pepper noise, respectively. (a) Average accuracy and NMI on the subset of FERET containing 5 categories; the first two panels show the results under Laplace noise and the remaining two under salt & pepper noise. (b) Average evaluation on the subset of FERET containing 10 categories. (c) Average evaluation on the subset containing 15 categories. (d) Average evaluation on the subset containing 20 categories.
Figure 10: Evaluation of the proposed methods on the ORL database contaminated by Laplace noise and salt & pepper noise, respectively. (a) Average accuracy and NMI on the subset of ORL containing 5 categories; the first two panels show the results under Laplace noise and the remaining two under salt & pepper noise. (b) Average evaluation on the subset of ORL containing 10 categories. (c) Average evaluation on the subset containing 15 categories. (d) Average evaluation on the subset containing 20 categories.
Figure 11: Evaluation of the proposed methods on the USPS database contaminated by Laplace noise and salt & pepper noise, respectively. (a) Average accuracy and NMI on the subset of USPS containing 3 categories; the first two panels show the results under Laplace noise and the remaining two under salt & pepper noise. (b) Average evaluation on the subset of USPS containing 5 categories. (c) Average evaluation on the subset containing 7 categories. (d) Average evaluation on the subset containing 10 categories.

A.1 Discussion

A.1.1 Generalization error bound and convergence analysis

Before we prove Theorem 1, we first give the definition of an upper bound auxiliary function.

Definition 4

(x,x)\mathcal{F}(x,x^{\prime}) is an upper bound auxiliary function for f(x)f(x) if the following conditions are satisfied:

(x,x)f(x),(x,x)=f(x).\begin{split}\mathcal{F}(x,x^{\prime})\geq f(x),\mathcal{F}(x,x)=f(x).\end{split} (44)
Corollary 1

If (,)\mathcal{F}(\cdot,\cdot) is an upper bound auxiliary function for f(x)f(x), then f(x)f(x) is non-increasing under the update rule

xt+1=argminx(x,xt),\begin{split}x^{t+1}=\arg\min_{x}\mathcal{F}(x,x^{t}),\end{split} (45)
Proof 4
f(xt+1)(xt+1,xt)(xt,xt)=f(xt).\begin{split}f(x^{t+1})\leq\mathcal{F}(x^{t+1},x^{t})\leq\mathcal{F}(x^{t},x^{t})=f(x^{t}).\end{split} (46)
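Corollary 1 is the standard majorization-minimization (MM) argument. The following toy sketch illustrates the resulting iteration on a simple univariate objective; the objective and its majorizer are chosen for illustration only and are not the RMNTF cost functions.

```python
import numpy as np

def f(x):
    # toy non-smooth objective: (x - 3)^2 + |x|, minimized at x = 2.5
    return (x - 3.0) ** 2 + abs(x)

def mm_step(x_t):
    # majorizer F(x, x_t) = (x - 3)^2 + x^2 / (2|x_t|) + |x_t| / 2 satisfies
    # F(x, x_t) >= f(x) and F(x_t, x_t) = f(x_t); minimizing it over x gives
    # the closed-form update below.
    return 6.0 * abs(x_t) / (2.0 * abs(x_t) + 1.0)

x = 10.0
for _ in range(50):
    x_new = mm_step(x)
    assert f(x_new) <= f(x) + 1e-12   # monotone decrease guaranteed by (46)
    x = x_new
print(x)   # converges to 2.5
```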
Definition 5

A sufficiently smooth function can be represented as an infinite sum of terms calculated from the values of its derivatives at a single point (its Taylor series), which can be formulated as follows:

f(x)=i=0f(i)(a)i!(xa)i,\begin{split}f(x)=\sum_{i=0}^{\infty}\frac{f^{(i)}(a)}{i!}(x-a)^{i},\end{split} (47)

where a is the expansion point and i is the order of the derivative.

Given the above definitions, the objective functions of RMNTF-CIM with respect to the three blocks of variables are given in (19), (24) and (28). Then, we have the following lemmas.

Lemma 1

The auxiliary function for (19) is as follows:

([𝐀(n)]ij,[𝐀(n)]ij)=fA([𝐀(n)]ij)+[𝐀(n)]i𝐁(n)T𝐓i[𝐁(n)]j([𝐀(n)]ij[𝐀(n)]ij)+[𝐀(n)𝐁(n)T𝐓i𝐁(n)]ij[𝐀(n)]ij([𝐀(n)]ij[𝐀(n)]ij)2,\begin{split}&\mathcal{F}([\mathbf{A}^{(n)}]_{ij},[\mathbf{A}^{(n)\star}]_{ij})=f_{A}([\mathbf{A}^{(n)}]_{ij})\\ &+[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)\\ &+\frac{[\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{ij}}{[\mathbf{A}^{(n)}]_{ij}}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)^{2},\end{split} (48)

where 𝐁(n)T=𝒮(n)(in𝐀(i)T)0rn×I1In1In+1IN\mathbf{B}^{(n)T}=\mathcal{S}_{(n)}\left(\otimes_{i\neq n}\mathbf{A}^{(i)T}\right)\in\mathbb{R}_{\geq 0}^{r_{n}\times I_{1}\dots I_{n-1}I_{n+1}\dots I_{N}}.

Proof 5

For RMNTF-CIM, the objective function can be rewritten as

𝒥(𝒳,𝒮,{𝐀(n)}n=1N)=11I1INi1=1INin=1INexp((𝒳i1iN𝒮×n=1N𝐀(n))22σ2).\begin{split}&\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N})=1-\\ &\frac{1}{I_{1}\dots I_{N}}\sum_{i_{1}=1}^{I_{N}}\dots\sum_{i_{n}=1}^{I_{N}}\exp\left(-\frac{\left(\mathcal{X}_{i_{1}\dots i_{N}}-\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right)^{2}}{2\sigma^{2}}\right).\end{split} (49)

It is obvious that \mathcal{F}([\mathcal{W}_{(n)}]_{ij},[\mathcal{W}_{(n)}]_{ij})=[f_{w}(\mathcal{W}_{(n)})]_{ij}, so we only need to prove that \mathcal{F}([\mathcal{W}_{(n)}]_{ij},[\mathcal{W}_{(n)}^{\star}]_{ij})\geq[f_{w}(\mathcal{W}_{(n)})]_{ij}. The element-wise first-order partial derivative of (17) is

[𝒥(𝒳,𝒮,{𝐀(n)}n=1N)𝐀(n)]i1iN=1σ212πσ1I1IN[exp((𝒳𝒮×n=1N𝐀(n))22σ2)]i1iN[𝒳𝒮×n=1N𝐀(n)]i1iN[(𝒮×n=1N𝐀(n))𝐀(n)]i1iN.\begin{split}&\left[\frac{\partial\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N})}{\partial\mathbf{A}^{(n)}}\right]_{i_{1}\dots i_{N}}=\frac{1}{\sigma^{2}}\frac{1}{\sqrt{2\pi}\sigma}\frac{1}{I_{1}\dots I_{N}}\cdot\\ &\left[\exp\left(-\frac{\left(\mathcal{X}-\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right)^{2}}{2\sigma^{2}}\right)\right]_{i_{1}\dots i_{N}}\cdot\\ &\left[\mathcal{X}-\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right]_{i_{1}\dots i_{N}}\cdot\left[\frac{\partial\left(\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right)}{\partial\mathbf{A}^{(n)}}\right]_{i_{1}\dots i_{N}}.\end{split} (50)

By the equivalent relationship 𝒮×n=1N𝐀(n)=foldn(𝐀(n)𝐁(n)T)\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}=\rm{fold}_{n}(\mathbf{A}^{(n)}\mathbf{B}^{(n)T}), we have

[(𝒮×n=1N𝐀(n))𝐀(n)]i1iN=[𝐀(n)𝐁(n)T𝐀(n)]rnm=[𝐁(n)]rnm,\begin{split}\left[\frac{\partial\left(\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right)}{\partial\mathbf{A}^{(n)}}\right]_{i_{1}\dots i_{N}}=\left[\frac{\partial\mathbf{A}^{(n)}\mathbf{B}^{(n)T}}{\partial\mathbf{A}^{(n)}}\right]_{r_{n}m}=\left[\mathbf{B}^{(n)}\right]_{r_{n}m},\end{split} (51)

where m=1,,I1In1In+1INm=1,\dots,I_{1}\dots I_{n-1}I_{n+1}\dots I_{N}.
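As a sanity check, this unfolding identity can be verified numerically. The sketch below assumes a C-order mode-n unfolding (moveaxis followed by reshape), for which the Kronecker factors in \otimes_{i\neq n}\mathbf{A}^{(i)T} are taken in increasing mode order; this ordering convention is our assumption for the check.

```python
import numpy as np
from functools import reduce

def unfold(T, n):
    """Mode-n unfolding (C-order): move mode n to the front and flatten the rest."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def multi_mode_product(S, factors):
    """Compute S x_1 A^(1) x_2 ... x_N A^(N)."""
    X = S
    for n, An in enumerate(factors):
        X = np.moveaxis(np.tensordot(An, X, axes=(1, n)), 0, n)
    return X

rng = np.random.default_rng(0)
dims, ranks = (4, 5, 6), (2, 3, 2)
S = rng.random(ranks)
A = [rng.random((I, r)) for I, r in zip(dims, ranks)]

X = multi_mode_product(S, A)
for n in range(3):
    others = [A[i].T for i in range(3) if i != n]        # increasing mode order
    B_nT = unfold(S, n) @ reduce(np.kron, others)        # B^(n)T = S_(n) (kron of A^(i)T)
    assert np.allclose(unfold(X, n), A[n] @ B_nT)
    assert np.allclose(fold(A[n] @ B_nT, n, dims), X)    # fold_n(A^(n) B^(n)T) = X
print("mode-n unfolding identity verified")
```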

Hence, the first-order partial derivative of RMNTF-CIM is

𝒥(𝒳,𝒮,{𝐀(n)}n=1N)𝐀(n)[exp((𝒳𝒮×n=1N𝐀(n))22σ2)(𝒮×n=1N𝐀(n)𝒳)](n)𝐁(n)=[𝒲(𝒮×n=1N𝐀(n)𝒳)](n)𝐁(n).\begin{split}&\frac{\partial\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N})}{\partial\mathbf{A}^{(n)}}\propto\Bigg{[}\exp\left(-\frac{\left(\mathcal{X}-\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right)^{2}}{2\sigma^{2}}\right)\\ &\circledast\left(\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}-\mathcal{X}\right)\Bigg{]}_{(n)}\mathbf{B}^{(n)}\\ &=\left[\mathcal{W}\circledast\left(\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}-\mathcal{X}\right)\right]_{(n)}\mathbf{B}^{(n)}.\end{split} (52)

We assume that 𝒲\mathcal{W} is a new variable independent of 𝐀(n)\mathbf{A}^{(n)}. Let 𝐓i=Diag([𝒲(n)]i)\mathbf{T}_{i}=\mathrm{Diag}([\mathcal{W}_{(n)}]_{i\cdot}), then (52) can be represented by

𝒥(𝒳,𝒮,{𝐀(n)}n=1N)𝐀(n){[𝐀(n)]i𝐁(n)T𝐓i}i=1In𝐁(n){𝒳(n)𝐓i}i=1In𝐁(n),\begin{split}\frac{\partial\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N})}{\partial\mathbf{A}^{(n)}}\propto&\{[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}\}_{i=1}^{I_{n}}\mathbf{B}^{(n)}\\ &-\{\mathcal{X}_{(n)}\mathbf{T}_{i}\}_{i=1}^{I_{n}}\mathbf{B}^{(n)},\end{split} (53)

where {[𝐀(n)]i𝐁(n)T𝐓i}i=1In=[[𝐀(n)]1𝐁(n)T𝐓1;;\{[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}\}_{i=1}^{I_{n}}=\bigg{[}[\mathbf{A}^{(n)}]_{1\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{1};\dots; [𝐀(n)]In𝐁(n)T𝐓In][\mathbf{A}^{(n)}]_{I_{n}\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{I_{n}}\bigg{]} 0In×I1IpIN(pn)\in\mathbb{R}_{\geq 0}^{I_{n}\times I_{1}\dots I_{p}\dots I_{N}(p\neq n)}. The second-order derivative of RMNTF-CIM can be represented as:

2𝒥(𝒳,𝒮,{𝐀(n)}n=1N)𝐀(n)𝐀(n)={𝐁(n)T𝐓i𝐁(n)}i=1In.\begin{split}\frac{\partial^{2}\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N})}{\partial\mathbf{A}^{(n)}\partial\mathbf{A}^{(n)}}=\{\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}\}_{i=1}^{I_{n}}.\end{split} (54)

According to the Taylor expansion in Definition 5, we can rewrite 𝒥(𝒳,𝒮,{𝐀(n)}n=1N)\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N}) to its Taylor expansion form with respect to 𝐀(n)\mathbf{A}^{(n)}:

f([𝐀(n)]ij)=fA([𝐀(n)]ij)+[𝐀(n)]i𝐁(n)T𝐓i[𝐁(n)]j([𝐀(n)]ij[𝐀(n)]ij)[𝒳(n)]i𝐓i[𝐁(n)]j([𝐀(n)]ij[𝐀(n)]ij)+12[𝐁(n)]jT𝐓i[𝐁(n)]j([𝐀(n)]ij[𝐀(n)]ij)2.\begin{split}&f([\mathbf{A}^{(n)}]_{ij})=f_{A}([\mathbf{A}^{(n)}]_{ij})\\ &+[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)\\ &-[\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)\\ &+\frac{1}{2}[\mathbf{B}^{(n)}]^{T}_{\cdot j}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)^{2}.\end{split} (55)

The upper bound auxiliary function for (19) is defined as:

([𝐀(n)]ij,[𝐀(n)]ij)=f([𝐀(n)]ij)+[𝐀(n)]i𝐁(n)T𝐓i[𝐁(n)]j([𝐀(n)]ij[𝐀(n)]ij)[𝒳(n)]i𝐓i[𝐁(n)]j([𝐀(n)]ij[𝐀(n)]ij)+[𝐀(n)𝐁(n)T𝐓i𝐁(n)]ij[𝐀(n)]ij([𝐀(n)]ij[𝐀(n)]ij)2.\begin{split}&\mathcal{F}([\mathbf{A}^{(n)}]_{ij},[\mathbf{A}^{(n)\star}]_{ij})=f([\mathbf{A}^{(n)\star}]_{ij})\\ &+[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)\\ &-[\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)\\ &+\frac{[\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{ij}}{[\mathbf{A}^{(n)}]_{ij}}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)^{2}.\end{split} (56)

Substituting (55) into (56), we find that \mathcal{F}([\mathbf{A}^{(n)}]_{ij},[\mathbf{A}^{(n)\star}]_{ij})\geq f_{A}([\mathbf{A}^{(n)}]_{ij}) is equivalent to

12[𝐀(n)𝐁(n)T𝐓i𝐁(n)]ij[𝐀(n)]ij([𝐀(n)]ij[𝐀(n)]ij)212[𝐁(n)]jT𝐓i[𝐁(n)]j([𝐀(n)]ij[𝐀(n)]ij)2.\begin{split}&\frac{1}{2}\frac{[\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{ij}}{[\mathbf{A}^{(n)}]_{ij}}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)^{2}\\ &\geq\frac{1}{2}[\mathbf{B}^{(n)}]^{T}_{\cdot j}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}\left([\mathbf{A}^{(n)}]_{ij}-[\mathbf{A}^{(n)\star}]_{ij}\right)^{2}.\end{split} (57)

Because we have

[𝐀(n)𝐁(n)T𝐓i𝐁(n)]ij[𝐀(n)]ij=j([𝐀(n)]ij×[𝐁(n)T𝐓i𝐁(n)]jj)[𝐀(n)]ij[𝐀(n)]ij×[𝐁(n)T𝐓i𝐁(n)]jj[𝐀(n)]ij=[𝐁(n)T𝐓i𝐁(n)]jj.\begin{split}&\frac{[\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{ij}}{[\mathbf{A}^{(n)}]_{ij}}=\frac{\sum_{j}\left([\mathbf{A}^{(n)}]_{ij}\times[\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{jj}\right)}{[\mathbf{A}^{(n)}]_{ij}}\\ &\geq\frac{[\mathbf{A}^{(n)}]_{ij}\times[\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{jj}}{[\mathbf{A}^{(n)}]_{ij}}=[\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{jj}.\end{split} (58)

Now we have shown that (57) holds, so (56) is an upper bound auxiliary function for (19), and the update of \mathbf{A}^{(n)} makes the objective function f_{A}(\mathbf{A}^{(n)}) non-increasing. Since the elements of the factor matrices \mathbf{A}^{(n)} are nonnegative and (56) is convex, its minimum is achieved at

[𝐀(n)]ij=[𝐀(n)]ij[𝐀(n)]i𝐁(n)T𝐓i[𝐁(n)]j[𝒳(n)]i𝐓i[𝐁(n)]j[𝐀(n)𝐁(n)T𝐓i𝐁(n)]ij[𝐀(n)]ij=[𝐀(n)]ij×[𝒳(n)]i𝐓i[𝐁(n)]j[𝐀(n)𝐁(n)T𝐓i𝐁(n)]ij.\begin{split}&[\mathbf{A}^{(n)\star}]_{ij}\\ &=[\mathbf{A}^{(n)}]_{ij}-\frac{[\mathbf{A}^{(n)}]_{i\cdot}\mathbf{B}^{(n)T}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}-[\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}}{\frac{[\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{ij}}{[\mathbf{A}^{(n)}]_{ij}}}\\ &=[\mathbf{A}^{(n)}]_{ij}\times\frac{[\mathcal{X}_{(n)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(n)}]_{\cdot j}}{[\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{ij}}.\end{split} (59)

Lemma 1 is proved.
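In matrix form, with \mathcal{W}_{(n)} the mode-n unfolding of the weight tensor (so that \mathbf{T}_{i}=\mathrm{Diag}([\mathcal{W}_{(n)}]_{i\cdot})), the update (59) reduces to the row-weighted multiplicative rule sketched below; the small constant eps that guards against division by zero is our addition.

```python
import numpy as np

def update_factor(A, B, Xn, Wn, eps=1e-12):
    """One multiplicative update of A^(n) following Eq. (59).

    A  : (I_n, r_n) factor matrix
    B  : (prod of other dims, r_n) matrix B^(n), so B.T = S_(n) (kron of A^(i)T)
    Xn : (I_n, prod of other dims) mode-n unfolding of the data tensor
    Wn : (I_n, prod of other dims) mode-n unfolding of the weight tensor
    """
    numer = (Wn * Xn) @ B                 # [X_(n)]_{i.} T_i [B^(n)]_{.j}
    denom = (Wn * (A @ B.T)) @ B + eps    # [A^(n) B^(n)T T_i B^(n)]_{ij}
    return A * numer / denom
```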

Lemma 2

The auxiliary function for (24) is as follows:

([𝐀(N)]ij,[𝐀(N)]ij)=f([𝐀(N)]ij)+([𝐀(N)]ij[𝐀(N)]ij)([𝐀(N)]i𝐁(N)T𝐓i[𝐁(N)]j[𝒳(N)]i𝐓i[𝐁(N)]j+λ[𝐋𝐀(N)]ij)+[𝐀(N)𝐁(N)T𝐓i𝐁(N)]ij+[λ𝐃𝐀(N)]ij[𝐀(N)]ij([𝐀(N)]ij[𝐀(N)]ij)2.\begin{split}\!&\mathcal{F}([\mathbf{A}^{(N)}]_{ij},[\mathbf{A}^{(N)\star}]_{ij})\!=\!f([\mathbf{A}^{(N)}]_{ij})\!+\!\left([\mathbf{A}^{(N)}]_{ij}\!-\![\mathbf{A}^{(N)\star}]_{ij}\right)\cdot\\ \!&\!\left(\![\mathbf{A}^{(N)}]_{i\cdot}\mathbf{B}^{(N)T}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\!-\![\mathcal{X}_{(N)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\!+\!\lambda\left[\mathbf{L}\mathbf{A}^{(N)}\right]_{ij}\!\right)\\ &\!+\!\frac{[\mathbf{A}^{(N)}\mathbf{B}^{(N)T}\mathbf{T}_{i}\mathbf{B}^{(N)}]_{ij}\!+\!\left[\lambda\mathbf{D}\mathbf{A}^{(N)}\right]_{ij}}{[\mathbf{A}^{(N)}]_{ij}}\left([\mathbf{A}^{(N)}]_{ij}\!-\![\mathbf{A}^{(N)\star}]_{ij}\right)^{2}.\end{split} (60)
Proof 6

For RMNTF-CIM, the objective function with respect to 𝐀(N)\mathbf{A}^{(N)} can be represented as

𝒥^(𝒳,𝒮,{𝒜(n)}n=1N)=𝒥(𝒳,𝒮,{𝐀(n)}n=1N)+λ2Tr(𝐀(N)T𝐋𝐀(N))+Tr(𝛀N𝐀(N)).\begin{split}&\hat{\mathcal{J}}(\mathcal{X},\mathcal{S},\{\mathcal{A}^{(n)}\}_{n=1}^{N})=\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N})\\ &+\frac{\lambda}{2}\mathrm{Tr}\left(\mathbf{A}^{(N)T}\mathbf{L}\mathbf{A}^{(N)}\right)+\mathrm{Tr}\left(\boldsymbol{\Omega}_{N}\mathbf{A}^{(N)}\right).\end{split} (61)

The element-wise first-order partial derivative of (61) is

𝒥^(𝒳,𝒮,{𝒜(n)}n=1N)[𝐀(N)]ij=1σ212πσ1I1IN[exp((𝒳𝒮×n=1N𝐀(n))22σ2)]i1iN[𝒳𝒮×n=1N𝐀(n)]i1iN[(𝒮×n=1N𝐀(n))𝐀(N)]i1iN+λ2[(𝐋+𝐋T)𝐀(N)]ij+[𝛀N]ij.\begin{split}&\frac{\partial\hat{\mathcal{J}}(\mathcal{X},\mathcal{S},\{\mathcal{A}^{(n)}\}_{n=1}^{N})}{\partial\left[\mathbf{A}^{(N)}\right]_{ij}}=\frac{1}{\sigma^{2}}\frac{1}{\sqrt{2\pi}\sigma}\frac{1}{I_{1}\dots I_{N}}\cdot\\ &\left[\exp\left(-\frac{\left(\mathcal{X}-\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right)^{2}}{2\sigma^{2}}\right)\right]_{i_{1}\dots i_{N}}\cdot\\ &\left[\mathcal{X}-\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right]_{i_{1}\dots i_{N}}\cdot\left[\frac{\partial\left(\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}\right)}{\partial\mathbf{A}^{(N)}}\right]_{i_{1}\dots i_{N}}\\ &+\frac{\lambda}{2}\left[\left(\mathbf{L}+\mathbf{L}^{T}\right)\mathbf{A}^{(N)}\right]_{ij}+\left[\boldsymbol{\Omega}_{N}\right]_{ij}.\end{split} (62)

Because of (51), we rewrite the first-order derivative of RMNTF-CIM:

𝒥^(𝒳,𝒮,{𝒜(n)}n=1N)𝐀(N)λ2(𝐋+𝐋T)𝐀(N)+[𝒲(𝒮×n=1N𝐀(n)𝒳)](N)𝐁(N)+𝛀N={[𝐀(N)]i𝐁(N)T𝐓i}i=1IN𝐁(N){𝒳(N)𝐓i}i=1IN𝐁(N)+λ2(𝐋+𝐋T)𝐀(N)+𝛀N.\begin{split}&\frac{\partial\hat{\mathcal{J}}(\mathcal{X},\mathcal{S},\{\mathcal{A}^{(n)}\}_{n=1}^{N})}{\partial\mathbf{A}^{(N)}}\propto\frac{\lambda}{2}\left(\mathbf{L}+\mathbf{L}^{T}\right)\mathbf{A}^{(N)}\\ &+\left[\mathcal{W}\circledast\left(\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)}-\mathcal{X}\right)\right]_{(N)}\mathbf{B}^{(N)}+\boldsymbol{\Omega}_{N}\\ =&\{[\mathbf{A}^{(N)}]_{i\cdot}\mathbf{B}^{(N)T}\mathbf{T}_{i}\}_{i=1}^{I_{N}}\mathbf{B}^{(N)}-\{\mathcal{X}_{(N)}\mathbf{T}_{i}\}_{i=1}^{I_{N}}\mathbf{B}^{(N)}\\ &+\frac{\lambda}{2}\left(\mathbf{L}+\mathbf{L}^{T}\right)\mathbf{A}^{(N)}+\boldsymbol{\Omega}_{N}.\end{split} (63)

The tensor variable 𝒲\mathcal{W} is independent of 𝐀(N)\mathbf{A}^{(N)}. The second-order derivative of RMNTF-CIM with respect to 𝐀(N)\mathbf{A}^{(N)} is represented as

2𝒥^(𝒳,𝒮,{𝒜(n)}n=1N)𝐀(N)𝐀(N)={𝐁(N)T𝐓i𝐁(N)}i=1IN+λ𝐋.\begin{split}&\frac{\partial^{2}\hat{\mathcal{J}}(\mathcal{X},\mathcal{S},\{\mathcal{A}^{(n)}\}_{n=1}^{N})}{\partial\mathbf{A}^{(N)}\partial\mathbf{A}^{(N)}}=\{\mathbf{B}^{(N)T}\mathbf{T}_{i}\mathbf{B}^{(N)}\}_{i=1}^{I_{N}}+\lambda\mathbf{L}.\end{split} (64)

According to the Taylor expansion, we rewrite 𝒥^\hat{\mathcal{J}} with respect to 𝐀(N)\mathbf{A}^{(N)}:

f([𝐀(N)]ij)=fA([𝐀(N)]ij)+([𝐀(n)]ij[𝐀(n)]ij)([𝐀(N)]i𝐁(N)T𝐓i[𝐁(N)]j[𝒳(N)]i𝐓i[𝐁(N)]j+λ[𝐋𝐀(n)]ij)+12([𝐁(N)]jT𝐓i[𝐁(N)]j+λ[𝐋]ii)([𝐀(N)]ij[𝐀(N)]ij)2.\begin{split}&f\!(\left[\mathbf{A}^{(N)}\right]_{ij}\!)\!=\!f_{A}\!(\left[\mathbf{A}^{(N)}\right]_{ij})\!+\!\left(\![\mathbf{A}^{(n)}]_{ij}\!-\![\!\mathbf{A}^{(n)\star}\!]_{ij}\right)\!\cdot\\ &\Bigg{(}[\mathbf{A}^{(N)}]_{i\cdot}\mathbf{B}^{(N)T}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\!-\![\mathcal{X}_{(N)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\\ &\!+\!\lambda\!\left[\!\mathbf{L}\mathbf{A}^{(n)}\!\right]_{ij}\!\Bigg{)}\!+\!\frac{1}{2}\left([\mathbf{B}^{(N)}]^{T}_{\cdot j}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\!+\!\lambda\!\left[\mathbf{L}\right]_{ii}\!\right)\\ &\cdot\left([\mathbf{A}^{(N)}]_{ij}-[\mathbf{A}^{(N)\star}]_{ij}\right)^{2}.\end{split} (65)

The upper bound auxiliary function for (24) is denoted as:

([𝐀(N)]ij,[𝐀(N)]ij)=f([𝐀(N)]ij)+([𝐀(N)]ij[𝐀(N)]ij)([𝐀(N)]i𝐁(N)T𝐓i[𝐁(N)]j[𝒳(N)]i𝐓i[𝐁(N)]j+λ[𝐋𝐀(N)]ij)+[𝐀(N)𝐁(N)T𝐓i𝐁(N)]ij+[λ𝐃𝐀(N)]ij[𝐀(N)]ij([𝐀(N)]ij[𝐀(N)]ij)2.\begin{split}&\mathcal{F}(\![\mathbf{A}\!^{(N)}]_{ij}\!,\![\mathbf{A}\!^{(N)\star}]_{ij}\!)\!=\!f(\![\mathbf{A}\!^{(N)\star}\!]_{ij}\!)\!+\!\left(\![\mathbf{A}\!^{(N)}\!]_{ij}\!-\![\mathbf{A}\!^{(N)\star}\!]_{ij}\!\right)\!\cdot\!\\ &\!\Bigg{(}\![\mathbf{A}^{(N)}]_{i\cdot}\mathbf{B}^{(N)T}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\!-\![\mathcal{X}_{(N)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\\ &\!+\!\lambda\!\left[\mathbf{L}\mathbf{A}^{(N)}\right]_{ij}\!\Bigg{)}\!+\!\frac{[\mathbf{A}^{(N)}\mathbf{B}^{(N)T}\mathbf{T}_{i}\mathbf{B}^{(N)}]_{ij}\!+\!\left[\lambda\mathbf{D}\mathbf{A}^{(N)}\right]_{ij}}{[\mathbf{A}^{(N)}]_{ij}}\\ &\cdot\left([\mathbf{A}^{(N)}]_{ij}-[\mathbf{A}^{(N)\star}]_{ij}\right)^{2}.\end{split} (66)

Substituting (65) into (66), we find that \mathcal{F}([\mathbf{A}^{(N)}]_{ij},[\mathbf{A}^{(N)\star}]_{ij})\geq f_{A}([\mathbf{A}^{(N)}]_{ij}) is equivalent to

[𝐀(N)𝐁(N)T𝐓i𝐁(N)]ij+[λ𝐃𝐀(N)]ij[𝐀(N)]ij12([𝐁(N)]jT𝐓i[𝐁(N)]j+λ[𝐋]ii).\begin{split}&\frac{[\mathbf{A}^{(N)}\mathbf{B}^{(N)T}\mathbf{T}_{i}\mathbf{B}^{(N)}]_{ij}+\left[\lambda\mathbf{D}\mathbf{A}^{(N)}\right]_{ij}}{[\mathbf{A}^{(N)}]_{ij}}\\ &\geq\frac{1}{2}\left([\mathbf{B}^{(N)}]^{T}_{\cdot j}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}+\lambda\left[\mathbf{L}\right]_{ii}\right).\end{split} (67)

Because we have

[𝐀(n)𝐁(n)T𝐓i𝐁(n)]ij[𝐀(n)]ij=j([𝐀(n)]ij×[𝐁(n)T𝐓i𝐁(n)]jj)[𝐀(n)]ij[𝐀(n)]ij×[𝐁(n)T𝐓i𝐁(n)]jj[𝐀(n)]ij=[𝐁(n)T𝐓i𝐁(n)]jj,\begin{split}&\frac{[\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{ij}}{[\mathbf{A}^{(n)}]_{ij}}=\frac{\sum_{j}\left([\mathbf{A}^{(n)}]_{ij}\times[\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{jj}\right)}{[\mathbf{A}^{(n)}]_{ij}}\\ &\geq\frac{[\mathbf{A}^{(n)}]_{ij}\times[\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{jj}}{[\mathbf{A}^{(n)}]_{ij}}=[\mathbf{B}^{(n)T}\mathbf{T}_{i}\mathbf{B}^{(n)}]_{jj},\end{split} (68)

and

[λ𝐃𝐀(N)]ij=λk[𝐃]ik[𝐀(N)]kjλ[𝐃]ii[𝐀(N)]ijλ[𝐃𝐖]ii[𝐀(N)]ij=λ[𝐋]ii[𝐀(N)]ij.\begin{split}&\left[\lambda\mathbf{D}\mathbf{A}^{(N)}\right]_{ij}=\lambda\sum_{k}[\mathbf{D}]_{ik}[\mathbf{A}^{(N)}]_{kj}\geq\lambda[\mathbf{D}]_{ii}[\mathbf{A}^{(N)}]_{ij}\\ &\geq\lambda\left[\mathbf{D}-\mathbf{W}\right]_{ii}[\mathbf{A}^{(N)}]_{ij}=\lambda[\mathbf{L}]_{ii}[\mathbf{A}^{(N)}]_{ij}.\end{split} (69)

Now we have shown that (67) holds, so (66) is the upper bound auxiliary function for f_{A}(\mathbf{A}^{(N)}). Since the elements of \mathbf{A}^{(N)} are nonnegative and (66) is convex, its minimum is achieved at

[𝐀(N)]ij=[𝐀(N)]ij[𝐀(N)]i𝐁(N)T𝐓i[𝐁(N)]j[𝒳(N)]i𝐓i[𝐁(N)]j+λ[𝐋𝐀(N)]ij[𝐀(N)𝐁(N)T𝐓i𝐁(N)]ij[𝐀(N)]ij+[λ𝐃𝐀(N)]ij=[𝐀(N)]ij×[𝒳(N)]i𝐓i[𝐁(N)]j+λ[𝐖𝐀(N)]ij[𝐀(N)𝐁(N)T𝐓i𝐁(N)]ij+[λ𝐃𝐀(N)]ij.\begin{split}\!&[\mathbf{A}^{(N)\star}]_{ij}=[\mathbf{A}^{(N)}]_{ij}-\\ &\frac{[\mathbf{A}^{(N)}]_{i\cdot}\mathbf{B}^{(N)T}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\!-\![\mathcal{X}_{(N)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}\!+\!\lambda\left[\mathbf{L}\mathbf{A}^{(N)}\right]_{ij}}{\frac{[\mathbf{A}^{(N)}\mathbf{B}^{(N)T}\mathbf{T}_{i}\mathbf{B}^{(N)}]_{ij}}{[\mathbf{A}^{(N)}]_{ij}}+\left[\lambda\mathbf{D}\mathbf{A}^{(N)}\right]_{ij}}\\ &=[\mathbf{A}^{(N)}]_{ij}\times\frac{[\mathcal{X}_{(N)}]_{i\cdot}\mathbf{T}_{i}[\mathbf{B}^{(N)}]_{\cdot j}+\lambda\left[\mathbf{W}\mathbf{A}^{(N)}\right]_{ij}}{[\mathbf{A}^{(N)}\mathbf{B}^{(N)T}\mathbf{T}_{i}\mathbf{B}^{(N)}]_{ij}+\left[\lambda\mathbf{D}\mathbf{A}^{(N)}\right]_{ij}}.\end{split} (70)

Lemma 2 is proved.
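For the last mode, the update (70) additionally involves the graph affinity matrix \mathbf{W} and the degree matrix \mathbf{D} (with \mathbf{L}=\mathbf{D}-\mathbf{W}); a sketch in the same style as the previous one is given below, where W_graph denotes the graph affinity, Wn the unfolded weight tensor, and eps is again our numerical guard.

```python
import numpy as np

def update_last_factor(A, B, Xn, Wn, W_graph, D, lam, eps=1e-12):
    """One multiplicative update of A^(N) following Eq. (70).

    A       : (I_N, r_N) factor matrix for the last mode
    B       : (prod of other dims, r_N) matrix B^(N)
    Xn, Wn  : mode-N unfoldings of the data and weight tensors
    W_graph : (I_N, I_N) affinity matrix of the data graph
    D       : (I_N, I_N) diagonal degree matrix, so L = D - W_graph
    lam     : graph regularization weight lambda
    """
    numer = (Wn * Xn) @ B + lam * (W_graph @ A)
    denom = (Wn * (A @ B.T)) @ B + lam * (D @ A) + eps
    return A * numer / denom
```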

Lemma 3

The auxiliary function for (28) is as follows:

([vec(𝒮)]i,[vec(𝒮)]i)=f([vec(𝒮)]i)+([𝐅T𝐅vec(𝒮)]i[𝐅Tvec(𝒳)]i)[vec(𝒲)]i([vec(𝒮)]i[vec(𝒮)]i)+12[𝐅T𝐅vec(𝒮)]i[vec(𝒲)]i[vec(𝒮)]i([vec(𝒮)]i[vec(𝒮)]i)2.\begin{split}&\mathcal{F}([\mathrm{vec}(\mathcal{S})]_{i},[\mathrm{vec}(\mathcal{S}^{\star})]_{i})=f(\left[\mathrm{vec}(\mathcal{S}^{\star})\right]_{i})\\ &+\left([\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}-[\mathbf{F}^{T}\mathrm{vec}(\mathcal{X})]_{i}\right)\cdot[\mathrm{vec}(\mathcal{W})]_{i}\cdot\\ &\left([\mathrm{vec}(\mathcal{S})]_{i}-[\mathrm{vec}(\mathcal{S}^{\star})]_{i}\right)\\ &+\frac{1}{2}\frac{[\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}\cdot[\mathrm{vec}(\mathcal{W})]_{i}}{[\mathrm{vec}(\mathcal{S})]_{i}}\cdot\left([\mathrm{vec}(\mathcal{S})]_{i}-[\mathrm{vec}(\mathcal{S}^{\star})]_{i}\right)^{2}.\end{split} (71)
Proof 7

First, we vectorize the objective function 𝒥(𝒳,𝒮,{𝐀(n)}n=1N)\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}^{(n)}\}_{n=1}^{N}) with respect to the core tensor 𝒮\mathcal{S}:

vec(𝒥(𝒳,𝒮,{𝐀}n=1N))=11I1INi=1I1INexp(([vec(𝒳)]i[𝐅vec(𝒮)]i)22σ2).\begin{split}&\mathrm{vec}(\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}\}_{n=1}^{N}))=1-\\ &\frac{1}{I_{1}\dots I_{N}}\sum_{i=1}^{I_{1}\dots I_{N}}\exp\left(-\frac{\left([\mathrm{vec}(\mathcal{X})]_{i}-[\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}\right)^{2}}{2\sigma^{2}}\right).\end{split} (72)

And then, we obtain the first-order partial derivative of vec(𝒥(𝒳,𝒮,{𝐀}n=1N))\mathrm{vec}(\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}\}_{n=1}^{N})):

vec(𝒥(𝒳,𝒮,{𝐀}n=1N))[vec(𝒮)]i=([𝐅T𝐅vec(𝒮)]i[𝐅Tvec(𝒳)]i)[vec(𝒲)]i.\begin{split}&\frac{\partial\mathrm{vec}(\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}\}_{n=1}^{N}))}{\partial[\mathrm{vec}(\mathcal{S})]_{i}}=\\ &\left([\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}-[\mathbf{F}^{T}\mathrm{vec}(\mathcal{X})]_{i}\right)\cdot[\mathrm{vec}(\mathcal{W})]_{i}.\end{split} (73)

The second-order partial derivative of vec(𝒥(𝒳,𝒮,{𝐀}n=1N))\mathrm{vec}(\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}\}_{n=1}^{N})):

2vec(𝒥(𝒳,𝒮,{𝐀}n=1N))[vec(𝒮)]i[vec(𝒮)]i=[𝐅T𝐅]ii[vec(𝒲)]i.\begin{split}&\frac{\partial^{2}\mathrm{vec}(\mathcal{J}(\mathcal{X},\mathcal{S},\{\mathbf{A}\}_{n=1}^{N}))}{\partial[\mathrm{vec}(\mathcal{S})]_{i}\partial[\mathrm{vec}(\mathcal{S})]_{i}}=[\mathbf{F}^{T}\mathbf{F}]_{ii}\cdot[\mathrm{vec}(\mathcal{W})]_{i}.\end{split} (74)

According to the Taylor expansion, we rewrite 𝒥\mathcal{J} with respect to 𝒮\mathcal{S}:

f([vec(𝒮)]i)=f𝒮([vec(𝒮)]i)+([vec(𝒮)]i[vec(𝒮)]i)([𝐅T𝐅vec(𝒮)]i[𝐅Tvec(𝒳)]i)[vec(𝒲)]i+12([vec(𝒮)]i[vec(𝒮)]i)2[𝐅T𝐅]ii[vec(𝒲)]i.\begin{split}&f([\mathrm{vec}(\mathcal{S})]_{i})=f_{\mathcal{S}}(\left[\mathrm{vec}(\mathcal{S})\right]_{i})+\left([\mathrm{vec}(\mathcal{S})]_{i}-[\mathrm{vec}(\mathcal{S}^{\star})]_{i}\right)\cdot\\ &\left([\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}-[\mathbf{F}^{T}\mathrm{vec}(\mathcal{X})]_{i}\right)\cdot[\mathrm{vec}(\mathcal{W})]_{i}\\ &+\frac{1}{2}\left([\mathrm{vec}(\mathcal{S})]_{i}-[\mathrm{vec}(\mathcal{S}^{\star})]_{i}\right)^{2}\cdot[\mathbf{F}^{T}\mathbf{F}]_{ii}\cdot[\mathrm{vec}(\mathcal{W})]_{i}.\end{split} (75)

The upper bound auxiliary function for (28) can be represented as:

([vec(𝒮)]i,[vec(𝒮)]i)=f([vec(𝒮)]i)+([𝐅T𝐅vec(𝒮)]i[𝐅Tvec(𝒳)]i)[vec(𝒲)]i([vec(𝒮)]i[vec(𝒮)]i)+12[𝐅T𝐅vec(𝒮)]i[vec(𝒲)]i[vec(𝒮)]i([vec(𝒮)]i[vec(𝒮)]i)2.\begin{split}&\mathcal{F}([\mathrm{vec}(\mathcal{S})]_{i},[\mathrm{vec}(\mathcal{S}^{\star})]_{i})=f(\left[\mathrm{vec}(\mathcal{S}^{\star})\right]_{i})\\ &+\left([\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}-[\mathbf{F}^{T}\mathrm{vec}(\mathcal{X})]_{i}\right)\cdot[\mathrm{vec}(\mathcal{W})]_{i}\cdot\\ &\left([\mathrm{vec}(\mathcal{S})]_{i}-[\mathrm{vec}(\mathcal{S}^{\star})]_{i}\right)\\ &+\frac{1}{2}\frac{[\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}\cdot[\mathrm{vec}(\mathcal{W})]_{i}}{[\mathrm{vec}(\mathcal{S})]_{i}}\cdot\left([\mathrm{vec}(\mathcal{S})]_{i}-[\mathrm{vec}(\mathcal{S}^{\star})]_{i}\right)^{2}.\end{split} (76)

Because we have:

[𝐅T𝐅vec(𝒮)]i[vec(𝒲)]i[vec(𝒮)]i=j[𝐅T𝐅]ij[vec𝒮]j[vec(𝒲)]i[vec(𝒮)]i[𝐅T𝐅]ii[vec(𝒲)]i.\begin{split}&\frac{[\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}\cdot[\mathrm{vec}(\mathcal{W})]_{i}}{[\mathrm{vec}(\mathcal{S})]_{i}}=\frac{\sum_{j}[\mathbf{F}^{T}\mathbf{F}]_{ij}[\mathrm{vec}\mathcal{S}]_{j}\cdot[\mathrm{vec}(\mathcal{W})]_{i}}{[\mathrm{vec}(\mathcal{S})]_{i}}\\ &\geq[\mathbf{F}^{T}\mathbf{F}]_{ii}\cdot[\mathrm{vec}(\mathcal{W})]_{i}.\end{split} (77)

Hence, ([vec(𝒮)]i,[vec(𝒮)]i)f([vec(𝒮)]i)\mathcal{F}([\mathrm{vec}(\mathcal{S})]_{i},[\mathrm{vec}(\mathcal{S}^{\star})]_{i})\geq f([\mathrm{vec}(\mathcal{S})]_{i}).

Now we have shown that (77) holds, so (76) is the upper bound auxiliary function for f_{\mathcal{S}}(\mathcal{S}). Since the elements of \mathcal{S} are nonnegative and (76) is convex, its minimum is achieved at

[vec(𝒮)]i=[vec(𝒮)]i([𝐅T𝐅vec(𝒮)]i[𝐅Tvec(𝒳)]i)[vec(𝒲)]i[𝐅T𝐅vec(𝒮)]i[vec(𝒲)]i[vec(𝒮)]i=[vec(𝒮)]i×[𝐅Tvec(𝒳)]i[vec(𝒲)]i[𝐅T𝐅vec(𝒮)]i[vec(𝒲)]i.\begin{split}\!&[\mathrm{vec}(\mathcal{S}^{\star})]_{i}=[\mathrm{vec}(\mathcal{S})]_{i}-\\ &\frac{\left([\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}-[\mathbf{F}^{T}\mathrm{vec}(\mathcal{X})]_{i}\right)\cdot[\mathrm{vec}(\mathcal{W})]_{i}}{\frac{[\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}\cdot[\mathrm{vec}(\mathcal{W})]_{i}}{[\mathrm{vec}(\mathcal{S})]_{i}}}\\ &=[\mathrm{vec}(\mathcal{S})]_{i}\times\frac{[\mathbf{F}^{T}\mathrm{vec}(\mathcal{X})]_{i}\cdot[\mathrm{vec}(\mathcal{W})]_{i}}{[\mathbf{F}^{T}\mathbf{F}\mathrm{vec}(\mathcal{S})]_{i}\cdot[\mathrm{vec}(\mathcal{W})]_{i}}.\end{split} (78)

Lemma 3 is proved.
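A vectorized sketch of this update is given below, with \mathbf{F}=\mathbf{A}^{(1)}\otimes\dots\otimes\mathbf{A}^{(N)} under C-order vectorization and the weights applied as a diagonal weighting inside both products, which is the weighted least-squares reading of (73); this placement of the weights, and forming \mathbf{F} explicitly, are assumptions made only for illustration.

```python
import numpy as np
from functools import reduce

def update_core(vec_S, factors, vec_X, vec_W, eps=1e-12):
    """One multiplicative update of the vectorized core tensor (cf. Eq. (78)).

    vec_S   : vectorized core tensor (C-order)
    factors : list of factor matrices A^(1), ..., A^(N)
    vec_X   : vectorized data tensor (C-order)
    vec_W   : vectorized weight tensor (C-order)
    """
    # vec(X) ~ F vec(S) under C-order vectorization; forming F explicitly is
    # only for illustration and is impractical for large tensors.
    F = reduce(np.kron, factors)
    numer = F.T @ (vec_W * vec_X)
    denom = F.T @ (vec_W * (F @ vec_S)) + eps
    return vec_S * numer / denom
```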

A.1.2 Robustness analysis of RMNTF

One important property of RMNTF is that its reconstruction is robust against outliers, as shown in the experiments. Here, we show that the robustness of RMNTF is mainly due to the weighted tensor and the regularization. The previous works gLPCA [52] and GLTD [30] have demonstrated the robustness of Laplacian regularization. In this section, we further show that this property also holds for RMNTF.

Suppose we have learned the optimal parameters \mathcal{S}, \{\mathbf{A}^{(n)}\}_{n=1}^{N} and \mathcal{W} from the input training images ({\mathcal{X}}^{1},{\mathcal{X}}^{2},\dots,{\mathcal{X}}^{t}). Now we are given a test image \mathcal{X}^{t\prime} and aim to learn its low-rank representation [\mathbf{A}^{(N)}]_{t\cdot} and reconstruction \hat{\mathcal{X}}^{t\prime}, while the parameters \mathcal{S}_{0}, \{\mathbf{A}_{0}^{(n)}\}_{n=1}^{N} and \mathcal{W}_{0} learned from the training images are kept fixed. This can be formulated as solving the problem:

min[𝐀(N)]t𝒥0=i1I1in1=1IN1𝒲0(𝒳t𝒳^t)2+λΦ([𝐀N]t)𝒳^t=𝒮0×n=1N1𝐀0(n)×[𝐀N]tΦ([𝐀N]t)=j=1t[𝐀(N)]t[𝐀0(N)]j2vj,\begin{split}\min_{[\mathbf{A}^{(N)}]_{t\cdot}}\mathcal{J}_{0}&=\sum_{i_{1}}^{I_{1}}\dots\sum_{i_{n-1}=1}^{I_{N-1}}\mathcal{W}_{0}\circledast\left(\mathcal{X}^{t\prime}-\hat{\mathcal{X}}^{t\prime}\right)^{2}+\lambda\Phi([\mathbf{A}^{N}]_{t\cdot})\\ \hat{\mathcal{X}}^{t\prime}&=\mathcal{S}_{0}\times_{n=1}^{N-1}\mathbf{A}_{0}^{(n)}\times[\mathbf{A}^{N}]_{t\cdot}\\ \Phi([\mathbf{A}^{N}]_{t\cdot})&=\sum_{j=1}^{t}\|[\mathbf{A}^{(N)}]_{t\cdot}-[\mathbf{A}_{0}^{(N)}]_{j\cdot}\|^{2}v_{j},\end{split} (79)

where [\mathbf{A}_{0}^{(N)}]_{j\cdot} is the j-th row of \mathbf{A}_{0}^{(N)}. Let [\mathbf{A}^{(N)}]_{t\cdot}=({a}_{1}^{\prime},\dots,{a}_{r_{N}}^{\prime})=\mathbf{a}_{t}^{\prime}\in\mathbb{R}_{\geq 0}^{1\times r_{N}}. Using the mode-N unfolding of the tensor, problem (79) can be reformulated as:

min[𝐀(N)]t𝒥0=𝒲0(N)(𝒳t(N)𝐚t𝐁(N)T)2+λj=1t𝐚t[𝐀0(N)]j2vj,\begin{split}\min_{[\mathbf{A}^{(N)}]_{t\cdot}}\mathcal{J}_{0}=&\left\|\sqrt{{\mathcal{W}_{0}}_{(N)}}\circledast\left({\mathcal{X}^{t\prime}}_{(N)}-\mathbf{a}_{t}^{\prime}\mathbf{B}^{(N)T}\right)\right\|^{2}\\ &+\lambda\sum_{j=1}^{t}\left\|\mathbf{a}_{t}^{\prime}-[\mathbf{A}_{0}^{(N)}]_{j\cdot}\right\|^{2}v_{j},\end{split} (80)

where \mathbf{B}^{(N)T}=\mathcal{S}_{(N)}\left(\otimes_{i=1}^{N-1}\mathbf{A}^{(i)T}\right)\in\mathbb{R}_{\geq 0}^{r_{N}\times I_{1}\dots I_{N-1}}. Let \mathbf{B}^{\prime}=\sqrt{{\mathcal{W}_{0}}_{(N)}}\circledast\mathbf{B}^{(N)T} and \tilde{\mathcal{X}}^{t}_{(N)}=\sqrt{{\mathcal{W}_{0}}_{(N)}}\circledast{\mathcal{X}^{t\prime}}_{(N)}. Setting the first-order partial derivative of (80) to zero, we obtain:

\begin{split}\frac{\partial\mathcal{J}_{0}}{\partial\mathbf{a}_{t}^{\prime}}=&2\left(\mathbf{a}_{t}^{\prime}\mathbf{B}^{\prime}\mathbf{B}^{\prime T}-\tilde{\mathcal{X}}^{t}_{(N)}\mathbf{B}^{\prime T}\right)\\ &+2\lambda\left(\sum_{j=1}^{t}v_{j}\mathbf{a}^{\prime}_{t}-\sum_{j=1}^{t}v_{j}[\mathbf{A}_{0}^{(N)}]_{j\cdot}\right)=\mathbf{0}.\end{split} (81)

We obtain the optimal 𝐚t\mathbf{a}^{\prime}_{t} as

\begin{split}\mathbf{a}^{\prime}_{t}=\left(\tilde{\mathcal{X}}^{t}_{(N)}\mathbf{B}^{\prime T}+\lambda\sum_{j=1}^{t}v_{j}[\mathbf{A}_{0}^{(N)}]_{j\cdot}\right)\left[\mathbf{B}^{\prime}\mathbf{B}^{\prime T}+\lambda\left(\sum_{j=1}^{t}v_{j}\right)\mathbf{I}\right]^{-1},\end{split} (82)

where 𝐈rN×rN\mathbf{I}\in\mathbb{R}^{r_{N}\times r_{N}} is an identity matrix.

We note that problem (79) has a closed-form solution. In the standard NTF model, \lambda=0 and the second term of (79) is removed; hence, the matrix to be inverted in (82) may become singular and the solution \mathbf{a}^{\prime}_{t} ill-conditioned. In other words, if some elements of the test image \mathcal{X}^{t\prime} are corrupted, the regularization \Phi(\cdot) and the weighted tensor \mathcal{W} help to restore these corruptions properly.
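A direct implementation of this closed-form solution is sketched below; the argument names follow (80)-(82), and solving the linear system instead of forming the matrix inverse is our choice.

```python
import numpy as np

def test_representation(X_test_N, B_N_T, W0_N, A0_N, v, lam):
    """Closed-form row a_t' of A^(N) for a test sample, following Eq. (82).

    X_test_N : (1, M) mode-N unfolding of the test tensor
    B_N_T    : (r_N, M) matrix B^(N)T = S_(N) (kron of A^(i)T), i < N
    W0_N     : (1, M) mode-N unfolding of the learned weight tensor
    A0_N     : (t, r_N) last-mode factor matrix learned on the training data
    v        : (t,) nonnegative weights v_j of the regularizer
    lam      : regularization weight lambda
    """
    sqrt_w = np.sqrt(W0_N)
    B_prime = sqrt_w * B_N_T               # B' = sqrt(W0_(N)) elementwise-times B^(N)T
    X_tilde = sqrt_w * X_test_N            # weighted test sample
    lhs = B_prime @ B_prime.T + lam * v.sum() * np.eye(B_prime.shape[0])
    rhs = B_prime @ X_tilde.T + lam * (A0_N.T @ v)[:, None]
    return np.linalg.solve(lhs, rhs).T     # (1, r_N) row vector a_t'
```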

A.1.3 Invariance of RMNTF

RMNTF has the following invariance property where the reconstruction 𝒳^=𝒮×n=1N𝐀(n)\hat{\mathcal{X}}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)} is invariant under the transformation by permutation matrices and nonnegative diagonal matrices. Let 𝐏(n)\mathbf{P}^{(n)} and 𝐐(n)\mathbf{Q}^{(n)} be a permutation matrix and a nonnegative diagonal matrix, respectively. Factor matrices {𝐀(n)}n=1N\{\mathbf{A}^{(n)}\}_{n=1}^{N} and core tensor 𝒮\mathcal{S} are transformed as:

{𝐀^(1)=𝐀(1)𝐏(1)𝐐(1)𝐀^(2)=𝐀(2)𝐏(2)𝐐(2)𝐀^(N)=𝐀(N)𝐏(N)𝐐(N)𝒮^=(𝐏(1)𝐐(1))T×1×N1(𝐏(N)𝐐(N))T×N𝒮.\displaystyle\left\{\begin{array}[]{ll}\hat{\mathbf{A}}^{(1)}=\mathbf{A}^{(1)}\mathbf{P}^{(1)}\mathbf{Q}^{(1)}\\ \hat{\mathbf{A}}^{(2)}=\mathbf{A}^{(2)}\mathbf{P}^{(2)}\mathbf{Q}^{(2)}\\ \dots\\ \hat{\mathbf{A}}^{(N)}=\mathbf{A}^{(N)}\mathbf{P}^{(N)}\mathbf{Q}^{(N)}\\ \hat{\mathcal{S}}=(\mathbf{P}^{(1)}\mathbf{Q}^{(1)})^{T}\times_{1}\dots\times_{N-1}(\mathbf{P}^{(N)}\mathbf{Q}^{(N)})^{T}\times_{N}\mathcal{S}.\end{array}\right. (88)
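A small numerical check of this invariance is sketched below. For the reconstruction to be preserved exactly, the core tensor is transformed here with the inverse (\mathbf{P}^{(n)}\mathbf{Q}^{(n)})^{-1} along each mode, which coincides with the transpose in (88) up to the diagonal scaling; this choice is our assumption for the check.

```python
import numpy as np

def mode_product(T, M, n):
    """Mode-n product T x_n M."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def reconstruct(S, factors):
    X = S
    for n, An in enumerate(factors):
        X = mode_product(X, An, n)
    return X

rng = np.random.default_rng(0)
ranks, dims = (2, 3, 2), (4, 5, 6)
S = rng.random(ranks)
A = [rng.random((I, r)) for I, r in zip(dims, ranks)]

# random permutation and positive diagonal scaling for every mode
P = [np.eye(r)[rng.permutation(r)] for r in ranks]
Q = [np.diag(rng.uniform(0.5, 2.0, r)) for r in ranks]

A_hat = [A[n] @ P[n] @ Q[n] for n in range(3)]
S_hat = S
for n in range(3):
    # inverse transform of the core tensor along mode n
    S_hat = mode_product(S_hat, np.linalg.inv(P[n] @ Q[n]), n)

assert np.allclose(reconstruct(S, A), reconstruct(S_hat, A_hat))
print("reconstruction invariant under permutation and scaling of the factors")
```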
Definition 6

(Uniqueness of NTF) The NTF model \mathcal{X}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)} is essentially unique if \mathbf{A}^{(n)}=\hat{\mathbf{A}}^{(n)}\mathbf{P}^{(n)}\mathbf{D}^{(n)},\forall n, holds for any other NTF model \mathcal{X}=\hat{\mathcal{S}}\times_{n=1}^{N}\hat{\mathbf{A}}^{(n)} of the same tensor, where \mathbf{P}^{(n)} and \mathbf{D}^{(n)} are a permutation matrix and a nonnegative diagonal matrix, respectively.

To establish the uniqueness of the RMNTF model, we study the relation between the uniqueness of the RMNTF model and that of its mode-n unfoldings.

Lemma 1

If the RMNTF model \mathcal{X}=\mathcal{S}\times_{n=1}^{N}\mathbf{A}^{(n)} is essentially unique, then \mathcal{X}_{(n)}=\mathbf{A}^{(n)}\mathbf{B}^{(n)T} is the unique nonnegative matrix factorization of the matrix \mathcal{X}_{(n)}, \forall n.

Proof 8

Suppose there exists a non-trivial matrix \mathbf{C} such that \hat{\mathcal{X}}_{(n)}=(\mathbf{A}^{(n)}\mathbf{C})(\mathbf{C}^{-1}\mathbf{B}^{(n)T}) is another solution of the mode-n unfolding of RMNTF. Let \hat{\mathbf{A}}^{(n)}=\mathbf{A}^{(n)}\mathbf{C} and \hat{\mathbf{B}}^{(n)T}=\mathbf{C}^{-1}\mathbf{B}^{(n)T}; then \mathrm{fold}_{n}(\hat{\mathbf{A}}^{(n)}\hat{\mathbf{B}}^{(n)T})=\mathcal{X}. However, for p\neq n, \hat{\mathcal{X}}_{(p)}\neq\mathcal{X}_{(p)} in general, which contradicts the assumption that the RMNTF model is essentially unique.

Lemma 2

If the mode-N unfolding of the RMNTF model, \mathcal{X}_{(N)}=\mathbf{A}^{(N)}\mathbf{B}^{(N)T}, admits an essentially unique nonnegative matrix factorization, then the RMNTF model of \mathcal{X} is essentially unique.

Proof 9

Suppose that \mathcal{X}_{(N)} admits an essentially unique NMF and \mathcal{X}_{(N)}=\mathbf{A}^{(N)}\mathcal{S}_{(N)}{\mathbf{Z}}^{(N)T}, where {\mathbf{Z}}^{(N)}=\otimes_{n\neq N}\mathbf{A}^{(n)}, so that both \mathcal{S}_{(N)} and {\mathbf{Z}}^{(N)} can be uniquely estimated. We introduce the permutation matrix \mathbf{P}=\otimes_{n\neq N}\mathbf{P}^{(n)} and the nonnegative diagonal matrix \mathbf{Q}=\otimes_{n\neq N}\mathbf{Q}^{(n)}, and suppose that

𝐙^(N)=𝐙(N)𝐏𝐐=(nN𝐀(n))𝐏𝐐=nN𝐀(n)𝐏(n)𝐐(n)=nN𝐀^(n),\begin{split}\hat{\mathbf{Z}}^{(N)}&=\mathbf{Z}^{(N)}\mathbf{P}\mathbf{Q}=\left(\otimes_{n\neq N}\mathbf{A}^{(n)}\right)\mathbf{P}\mathbf{Q}\\ &=\otimes_{n\neq N}\mathbf{A}^{(n)}\mathbf{P}^{(n)}\mathbf{Q}^{(n)}=\otimes_{n\neq N}\hat{\mathbf{A}}^{(n)},\end{split} (89)

where 𝐀^(n)=𝐀(n)𝐏(n)𝐐(n)\hat{\mathbf{A}}^{(n)}=\mathbf{A}^{(n)}\mathbf{P}^{(n)}\mathbf{Q}^{(n)}. Here, we only need to demonstrate that 𝐀^(n)\hat{\mathbf{A}}^{(n)} can be uniquely estimated from 𝐙^(N)\hat{\mathbf{Z}}^{(N)}.

Motivated by [53], we appropriately rearrange the elements of 𝐙^(N)\hat{\mathbf{Z}}^{(N)} and reshape it to a tensor 𝒵^\hat{\mathcal{Z}} such that

𝒵^=vec(𝐀^(1))vec(𝐀^(N1)),\begin{split}\hat{\mathcal{Z}}=\mathrm{vec}(\hat{\mathbf{A}}^{(1)})\circ\dots\circ\mathrm{vec}(\hat{\mathbf{A}}^{(N-1)}),\end{split} (90)

where the tensor \hat{\mathcal{Z}} is a rank-one tensor [54] and \hat{\mathbf{A}}^{(n)} can be uniquely estimated from \hat{\mathcal{Z}}. Lemma 2 is proved.