

Sampling-Based Decomposition Algorithms for Arbitrary Tensor Networks

Osman Asif Malik Lawrence Berkeley National Laboratory, oamalik@lbl.gov    Vivek Bharadwaj UC Berkeley, vivek_bharadwaj@berkeley.edu    Riley Murray UC Berkeley, rjmurray@berkeley.edu
Abstract

We show how to develop sampling-based alternating least squares (ALS) algorithms for decomposition of tensors into any tensor network (TN) format. Provided the TN format satisfies certain mild assumptions, the resulting algorithms have input sublinear per-iteration cost. Unlike most previous works on sampling-based ALS methods for tensor decomposition, the sampling in our framework is done according to the exact leverage score distribution of the design matrices in the ALS subproblems. We implement and test two tensor decomposition algorithms that use our sampling framework in a feature extraction experiment where we compare them against a number of other decomposition algorithms.

1 Introduction

Tensor decomposition has emerged as an important tool in data mining and machine learning [24, 4, 5, 11]. Applications in data mining include network analysis, web mining, topic modeling and recommendation systems. Tensor decomposition is widely used in machine learning for tasks such as parameter reduction in neural networks, understanding the expressiveness of deep neural networks, supervised learning, and feature extraction.

Due to the multidimensional nature of tensors, they are inherently plagued by the curse of dimensionality. For example, representing a tensor 𝓧I××I\bm{\mathscr{X}}\in\mathbb{R}^{I\times\cdots\times I} with NN modes requires INI^{N} numbers. This exponential dependence on NN makes its way into algorithms for computing tensor decompositions. Alternating least squares (ALS) is arguably the most popular and successful approach for computing a wide range of tensor decompositions. When decomposing a tensor 𝓧\bm{\mathscr{X}}, each iteration of ALS involves solving a sequence of least squares problems for which the entries of 𝓧\bm{\mathscr{X}} feature as the dependent variables. The per-iteration cost of ALS therefore naturally inherits the exponential dependence on NN.

A large number of papers have sought to reduce the cost of tensor decomposition by leveraging techniques from randomized numerical linear algebra. One particularly interesting line of work [3, 14, 9, 19, 18] seeks to construct ALS algorithms with a per-iteration cost which is sublinear in the number of tensor entries, i.e., o(IN)o(I^{N}). Since any algorithm considering all entries of 𝓧\bm{\mathscr{X}} immediately incurs a cost of Ω(IN)\Omega(I^{N}), these works have all resorted to sampling-based techniques. More precisely, they all sample the ALS subproblems according to the leverage score distribution (or an approximation thereof). This is done efficiently by taking advantage of the special structure of the design matrices in the ALS subproblems for tensor decomposition.

The previous works discussed above develop methods for specific tensor decompositions: The CP decomposition in [3, 14, 18], Tucker decomposition in [9], and the tensor ring decomposition in [19, 18]. In this paper, we consider all decompositions that can be expressed in tensor network (TN) format. The TN format allows for a very wide range of decompositions, including the CP, Tucker, tensor train and tensor ring decompositions. The following summarizes our contributions:

  • We first show how to efficiently sample rows of any tall-and-skinny matrix which is in TN format according to the exact leverage score distribution.

  • We then show how this sampling technique can be used to yield ALS algorithms with a per-iteration cost which is input sublinear for all TN decompositions that satisfy certain mild assumptions.

The decomposition framework we present builds on the work in [18]. That paper is notable since it provided the first sampling-based ALS methods for CP and tensor ring decomposition with a per-iteration cost depending polynomially on NN; the dependence in earlier works was exponential [3, 14, 19]. We are able to substantially simplify the scheme in [18] by entirely avoiding the complicated recursive sketching procedure that it relies on. This makes implementation easier and is also what ensures that the sampling is done according to the exact leverage score distribution rather than an approximation of it. The simplification is also what paves the way for generalization to arbitrary TN formats.

2 Related Work

Cheng et al. [3] develop the first ALS method for CP decomposition with an input sublinear per-iteration cost. Their method uses a mixture of leverage score and row-norm sampling applied to the matricized-tensor-times-Khatri–Rao product (MTTKRP) which arises as a key computational kernel in CP decomposition. Larsen and Kolda [14] use leverage score sampling to reduce the size of each ALS subproblem. Their method has improved theoretical guarantees compared to those in [3] as well as improved scalability due to various practical improvements. Malik and Becker [19] propose a sampling-based ALS method for tensor ring decomposition which also uses leverage score sampling to reduce the size of the ALS subproblems.

All three papers [3, 14, 19] require a number of samples which scales exponentially with the number of input tensor modes, NN, for performance guarantees to hold. This translates into a per-iteration cost which is Ω(RN+1)\Omega(R^{N+1}) for the CP decomposition [3, 14] and Ω(R2N+2)\Omega(R^{2N+2}) for the tensor ring decomposition [19], where RR is the relevant notion of rank. For low-rank decompositions (RIR\ll I), the cost will be input sublinear despite this exponential dependence, but for higher ranks it might not be (recall that, unlike in matrix decomposition, we can have R>IR>I in tensor decomposition). Malik [18] develops methods for both CP and tensor ring decomposition which avoid this exponential dependence on NN, instead improving it to a polynomial dependence. This is achieved by sampling from a distribution much closer to the exact leverage score distribution than the previous works [3, 14, 19] do. Since our work builds on and improves the scheme in [18], it also has a polynomial dependence on NN when used for CP and tensor ring decomposition (see Tables 1 and 2).

Fahrbach et al. [9] develop a method for efficient sampling of ridge regression problems involving Kronecker product design matrices. They use these techniques to achieve an efficient sampling-based ALS method for regularized Tucker decomposition.

Efficient sketching of structured matrices is an active research area with applications beyond tensor decomposition. The recent work by Ma and Solomonik [16] is particularly interesting since it proposes structured sketching operators for general TN matrices. These operators take the form of TNs made up of Gaussian tensors and are therefore dense. When used in ALS for tensor decomposition, these sketching operators need to access all entries of the data tensor in each least squares solve, which leads to a per-iteration cost that is at least linear in the input tensor size (and therefore exponential in NN).

Due to space constraints, we have focused here on previous works that develop sampling-based ALS methods. There are many other works that develop randomized tensor decomposition methods using different techniques. For a comprehensive list of such works, see [18, Sec. 2]. For an overview of work on structured sketching, see [20, Sec. 1.2].

3 Preliminaries

3.1 General Notation

By tensor, we mean a multidimensional array containing real numbers. We refer to a tensor with NN indices as an NN-way tensor and we say that it is of order NN or that it has NN modes. We use bold uppercase Euler script letters (e.g., 𝓧\bm{\mathscr{X}}) to denote tensors of order 3 or greater, bold uppercase letters (e.g., 𝑿\bm{X}) to denote matrices, bold lowercase letters (e.g., 𝒙\bm{x}) to denote vectors, and regular lowercase letters to denote scalars (e.g., xx). We denote entries of an object either in parentheses or with subscripts. For example, 𝓧(i,j,k)=𝓧ijk\bm{\mathscr{X}}(i,j,k)=\bm{\mathscr{X}}_{ijk} is the entry on position (i,j,k)(i,j,k) in the 3-way tensor 𝓧\bm{\mathscr{X}}, and 𝒙(i)=𝒙i\bm{x}(i)=\bm{x}_{i} is the iith entry in the vector 𝒙\bm{x}. We use a colon to denote all entries along a particular mode. For example, 𝑿(i,:)\bm{X}(i,:) denotes the iith row of the matrix 𝑿\bm{X}. Superscripts in parentheses will be used to denote a sequence of objects. For example, 𝓐(1),,𝓐(M)\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(M)} is a sequence of tensors and 𝒆(i)\bm{e}^{(i)} is the iith canonical basis vector. The values that a tensor’s indices can take are referred to as dimensions. For example, if 𝓧=(𝓧ijk)i=1,j=1,k=1I,J,K\bm{\mathscr{X}}=(\bm{\mathscr{X}}_{ijk})_{i=1,j=1,k=1}^{I,J,K} then the dimensions of modes 1, 2 and 3 are II, JJ and KK respectively. Note that dimension does not refer to the number of modes, which is 3 for 𝓧\bm{\mathscr{X}}. For a positive integer II, we use the notation [I]=def{1,,I}[I]\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\{1,\ldots,I\}. For indices i1[I1],,iN[IN]i_{1}\in[I_{1}],\ldots,i_{N}\in[I_{N}], the notation i1iN¯=def1+n=1N(in1)j=1n1Ij\overline{i_{1}\cdots i_{N}}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}1+\sum_{n=1}^{N}(i_{n}-1)\prod_{j=1}^{n-1}I_{j} will be used to denote the linear index corresponding to the multi-index (i1,,iN)(i_{1},\ldots,i_{N}). The pseudoinverse of a matrix 𝑨\bm{A} is denoted by 𝑨+\bm{A}^{+}. We use \otimes, \odot and \circledast to denote the Kronecker, Khatri–Rao and Hadamard matrix products which are defined in Section A.
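To make the linear index convention concrete, the following small Python/NumPy sketch (with made-up dimensions) checks that the formula for the linear index agrees with NumPy's column-major (Fortran-order) flattening; it is an illustration only and not part of the paper's code.

import numpy as np

# Hypothetical mode dimensions and a multi-index (1-based, as in the paper).
dims = (3, 4, 5)          # I1, I2, I3
multi_idx = (2, 3, 4)     # (i1, i2, i3)

# Linear index from the formula: 1 + sum_n (i_n - 1) * prod_{j<n} I_j.
lin = 1
stride = 1
for i_n, I_n in zip(multi_idx, dims):
    lin += (i_n - 1) * stride
    stride *= I_n

# NumPy's column-major flattening gives the same (0-based) index.
lin_np = np.ravel_multi_index(tuple(i - 1 for i in multi_idx), dims, order="F") + 1
assert lin == lin_np
print(lin)  # 44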

3.2 Tensor Networks

A tensor network (TN) consists of tensors and connections between them that indicate how they should be contracted with each other. Since the mathematical notation easily gets unwieldy when working with TNs, it is common practice to use graphical representations when discussing such networks. Figure 1 shows how scalars, vectors, matrices and tensors can be represented graphically. A circle or node is used to represent a tensor, and dangling edges are used to indicate modes of the tensor.

Figure 1: Graphical TN representation of a (a) scalar, (b) vector, (c) matrix and (d) 4-way tensor.

Contraction between two tensors or a tensor with itself can be represented by connecting the appropriate dangling edges. This is illustrated in Figure 2. In mathematical notation, these contractions can be written elementwise as

  1. (a)

    i𝑨ii=trace(𝑨)=c\sum_{i}\bm{A}_{ii}=\operatorname{trace}(\bm{A})=c,

  2. (b)

    j𝑨ij𝒙j=𝒚i\sum_{j}\bm{A}_{ij}\,\bm{x}_{j}=\bm{y}_{i},

  3. (c)

    j𝑨ij𝑩jk=𝑪ik\sum_{j}\bm{A}_{ij}\,\bm{B}_{jk}=\bm{C}_{ik},

  4. (d)

    mn𝓧mn𝑨i𝑩jm𝑪kn=𝓨ijk\sum_{\ell mn}\bm{\mathscr{X}}_{\ell mn}\bm{A}_{i\ell}\bm{B}_{jm}\bm{C}_{kn}=\bm{\mathscr{Y}}_{ijk}.

Figure 2: TN representation of (a) matrix trace, (b) matrix-vector multiplication, (c) matrix-matrix multiplication and (d) a 3-way Tucker decomposition.
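As an illustration, the four contractions above can be written directly with numpy.einsum. This is a minimal sketch with made-up sizes, not code from the paper.

import numpy as np

A = np.random.rand(4, 4)
x = np.random.rand(4)
B = np.random.rand(4, 5)
X = np.random.rand(3, 4, 5)          # Tucker core tensor
U, V, W = np.random.rand(6, 3), np.random.rand(7, 4), np.random.rand(8, 5)

c = np.einsum("ii->", A)             # (a) trace(A)
y = np.einsum("ij,j->i", A, x)       # (b) matrix-vector multiplication
C = np.einsum("ij,jk->ik", A, B)     # (c) matrix-matrix multiplication
Y = np.einsum("lmn,il,jm,kn->ijk", X, U, V, W)  # (d) 3-way Tucker contraction

assert np.isclose(c, np.trace(A))
assert np.allclose(C, A @ B)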

To reduce computational cost when contracting a TN, it is optimal to contract two tensors at a time [26]. Determining the optimal contraction order is NP-hard, and developing heuristics for this problem is an active research area [26, 21]. There are several software packages that compute exact or approximate optimal contraction orders, e.g., NCON [25] for Matlab and the opt_einsum package for Python (available at https://github.com/dgasmith/opt_einsum).
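For example, opt_einsum can search for a good pairwise contraction order before evaluating a network. The sketch below (hypothetical sizes, assuming the opt_einsum package is installed) prints the chosen contraction path for the Tucker-style contraction from Figure 2 (d) and then evaluates it.

import numpy as np
import opt_einsum as oe

X = np.random.rand(30, 40, 50)
U, V, W = np.random.rand(100, 30), np.random.rand(100, 40), np.random.rand(100, 50)

# Find a pairwise contraction order (path) and an estimate of its cost.
path, info = oe.contract_path("lmn,il,jm,kn->ijk", X, U, V, W)
print(info)  # shows the chosen pairwise contractions and a FLOP estimate

# Evaluate the network using that order.
Y = oe.contract("lmn,il,jm,kn->ijk", X, U, V, W, optimize=path)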

A TN matrix is a matrix which is represented by a TN. The dangling edges in such a network represent either rows or columns of the matrix. Figure 3 shows an example of a TN matrix with dangling edges pointing to the left representing rows and dangling edges pointing right representing columns. Suppose there are NrN_{r} and NcN_{c} dangling edges representing rows and columns, respectively, and let N=Nr+NcN=N_{r}+N_{c}. Let 𝓐I1××IN\bm{\mathscr{A}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}} denote the NN-way TN representing the matrix, and let 𝒋\bm{j} be a length-NN vector with the first NrN_{r} entries enumerating the modes of 𝓐\bm{\mathscr{A}} corresponding to rows and the remaining entries enumerating those modes that correspond to columns. The matrix 𝑨\bm{A} represented by 𝓐\bm{\mathscr{A}} is then given elementwise by

𝑨(i𝒋(1)i𝒋(Nr)¯,i𝒋(Nr+1)i𝒋(N)¯)=𝓐i1iN\bm{A}(\overline{i_{\bm{j}(1)}\cdots i_{\bm{j}(N_{r})}},\overline{i_{\bm{j}(N_{r}+1)}\cdots i_{\bm{j}(N)}})=\bm{\mathscr{A}}_{i_{1}\cdots i_{N}} (1)

and is of size n=1NrI𝒋(n)×m=Nr+1NI𝒋(m)\prod_{n=1}^{N_{r}}I_{\bm{j}(n)}\times\prod_{m=N_{r}+1}^{N}I_{\bm{j}(m)}. For example, if the mode dimensions corresponding to the dangling edges in Figure 3 are all 100, then the matrix represented by the TN is of size 106×10010^{6}\times 100.
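In code, the matricization in (1) is a mode permutation followed by a column-major reshape. The sketch below uses hypothetical dimensions and forms the dense tensor only for illustration; in our setting the point is precisely to avoid forming such dense objects.

import numpy as np

# Hypothetical 4-way tensor representing a TN matrix with
# row modes (1, 2, 3) and column mode (4), using 1-based mode labels.
I1, I2, I3, I4 = 5, 6, 7, 8
A_tensor = np.random.rand(I1, I2, I3, I4)

row_modes, col_modes = (0, 1, 2), (3,)        # 0-based
A_mat = np.transpose(A_tensor, row_modes + col_modes).reshape(
    I1 * I2 * I3, I4, order="F"               # column-major linear indices, as in (1)
)

# Spot check against (1): entry (i1,i2,i3,i4) lands at row i1 + I1*i2 + I1*I2*i3 (0-based).
i1, i2, i3, i4 = 2, 3, 4, 5
assert A_mat[i1 + I1 * i2 + I1 * I2 * i3, i4] == A_tensor[i1, i2, i3, i4]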

Figure 3: Example of a TN matrix.

Graphical TN notation does not specify the order in which the modes corresponding to dangling edges should appear when the tensor is represented as a multidimensional array. A similar ambiguity exists for TN matrices, where it is not clear how the row and column modes should be ordered. In both cases, any ordering can be used as long as the choice is consistent in all computations.

3.3 Leverage Score Sampling

In the context of least squares, leverage scores indicate how important the rows of the design matrix are. By sampling according to this importance metric, it is possible to compute a good approximation (with high probability) to the least squares solution using a random subset of the equations of the full least squares problem. We provide the necessary preliminaries on leverage score sampling in this section and refer the reader to [17, 29] for further details.

Definition 3.1.

The iith leverage score of a matrix 𝑨I×R\bm{A}\in\mathbb{R}^{I\times R} is defined as i(𝑨)=def𝒆(i)𝑨𝑨+𝒆(i)\ell_{i}(\bm{A})\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\bm{e}^{(i)\top}\bm{A}\bm{A}^{+}\bm{e}^{(i)} for i[I]i\in[I], where 𝒆(i)\bm{e}^{(i)} is the iith canonical basis vector of length II.

Definition 3.2.

Let 𝑨I×R\bm{A}\in\mathbb{R}^{I\times R} and let 𝒑I\bm{p}\in\mathbb{R}^{I} be a vector with entries 𝒑(i)=i(𝑨)/rank(𝑨)\bm{p}(i)=\ell_{i}(\bm{A})/\operatorname{rank}(\bm{A}). Then 𝒑\bm{p} is the leverage score distribution on row indices of 𝑨\bm{A}. Let f:[J][I]f:[J]\rightarrow[I] be a random map such that each f(j)f(j) is independent and distributed according to 𝒑\bm{p}. The random matrix 𝑺J×I\bm{S}\in\mathbb{R}^{J\times I} defined elementwise via

𝑺(j,i)=defInd{f(j)=i}/J𝒑(f(j))\bm{S}(j,i)\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\operatorname*{\operatorname{Ind}}\{f(j)=i\}/\sqrt{J\bm{p}(f(j))} (2)

is the leverage score sampling matrix for 𝑨\bm{A}. In (2), Ind{A}\operatorname*{\operatorname{Ind}}\{A\} is the indicator function which is 1 if the random event AA occurs and 0 otherwise.

The following result is well-known and variants have appeared throughout the literature [6, 7, 8, 14].

Theorem 3.3.

Let 𝐀I×R\bm{A}\in\mathbb{R}^{I\times R} be a matrix and suppose 𝐒J×I\bm{S}\in\mathbb{R}^{J\times I} is the leverage score sampling matrix for 𝐀\bm{A}. Moreover, let ε,δ(0,1)\varepsilon,\delta\in(0,1), and define OPT=defmin𝐗𝐀𝐗𝐘F\textup{OPT}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\min_{\bm{X}}\|\bm{A}\bm{X}-\bm{Y}\|_{\textup{F}} and

𝑿~=defargmin𝑿𝑺𝑨𝑿𝑺𝒀F.\tilde{\bm{X}}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\operatorname*{arg\,min}_{\bm{X}}\|\bm{S}\bm{A}\bm{X}-\bm{S}\bm{Y}\|_{\textup{F}}. (3)

If JRmax(log(R/δ),1/(εδ))J\gtrsim R\max(\log(R/\delta),1/(\varepsilon\delta)), then with probability at least 1δ1-\delta it holds that 𝐀𝐗~𝐘F(1+ε)OPT\|\bm{A}\tilde{\bm{X}}-\bm{Y}\|_{\textup{F}}\leq(1+\varepsilon)\textup{OPT}.
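The following NumPy sketch (hypothetical sizes) illustrates Definitions 3.1 and 3.2 and Theorem 3.3: it computes exact leverage scores from a thin QR factorization, samples and rescales rows as in (2), and compares the residual of the sampled solution with the optimal residual.

import numpy as np

rng = np.random.default_rng(0)
I, R, J = 10000, 20, 500                      # tall-and-skinny A, J sampled rows
A = rng.standard_normal((I, R))
Y = rng.standard_normal((I, 3))

# Leverage scores: squared row norms of an orthonormal basis for range(A).
Q, _ = np.linalg.qr(A)
lev = np.sum(Q**2, axis=1)
p = lev / np.linalg.matrix_rank(A)            # leverage score distribution

# Sample J row indices i.i.d. from p and rescale rows by 1/sqrt(J * p_i), as in (2).
idx = rng.choice(I, size=J, p=p)
w = 1.0 / np.sqrt(J * p[idx])
X_tilde = np.linalg.lstsq(w[:, None] * A[idx], w[:, None] * Y[idx], rcond=None)[0]
X_opt = np.linalg.lstsq(A, Y, rcond=None)[0]

print(np.linalg.norm(A @ X_tilde - Y) / np.linalg.norm(A @ X_opt - Y))  # close to 1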

4 Proposed Sampling Method

In this section, we propose an efficient method for sampling rows of a tall-and-skinny TN matrix 𝑨I×R\bm{A}\in\mathbb{R}^{I\times R} according to its leverage scores. The method is useful when II is so large that costs and storage of order Ω(I)\Omega(I) are too expensive, and RR is small enough that costs and storage of order O(R3)O(R^{3}) are affordable. In this setting it is infeasible to sample the rows of 𝑨\bm{A} directly via the probability distribution in Definition 3.2, since doing so would require computing the pseudoinverse of 𝑨\bm{A}, which costs O(IR2)O(IR^{2}). Moreover, once that pseudoinverse is computed, we would be required to compute, store and sample according to a length-II vector of probabilities.

In view of Definition 3.1, note that

i(𝑨)=𝒆(i)𝑨𝑨+𝒆(i)=𝒆(i)𝑨𝚽𝑨𝒆(i),\ell_{i}(\bm{A})=\bm{e}^{(i)\top}\bm{A}\bm{A}^{+}\bm{e}^{(i)}=\bm{e}^{(i)\top}\bm{A}\bm{\Phi}\bm{A}^{\top}\bm{e}^{(i)}, (4)

where 𝚽=def(𝑨𝑨)+\bm{\Phi}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}(\bm{A}^{\top}\bm{A})^{+}. This formulation is the basis for our sampling method. As we show in Section 4.1, the matrix 𝚽\bm{\Phi} can be computed efficiently when 𝑨\bm{A} is a TN matrix. Once 𝚽\bm{\Phi} is computed, we can again leverage the TN structure of 𝑨\bm{A} to draw samples from its leverage score distribution via the formula in (4) without forming a length-II probability vector. We describe how this is done in Section 4.2.
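As a quick sanity check of (4), the short sketch below (a random dense matrix used purely for illustration) verifies that the diagonal of A Φ A^T with Φ = (A^T A)^+ reproduces the leverage scores from Definition 3.1.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10))

Phi = np.linalg.pinv(A.T @ A)                       # (A^T A)^+
lev_via_gram = np.einsum("ir,rs,is->i", A, Phi, A)  # diag(A Phi A^T), row by row

Q, _ = np.linalg.qr(A)
lev_direct = np.sum(Q**2, axis=1)                   # e_i^T A A^+ e_i

assert np.allclose(lev_via_gram, lev_direct)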

4.1 Computing 𝚽\bm{\Phi}

When 𝑨\bm{A} is a TN matrix, it is straightforward to represent the Gram matrix 𝑨𝑨\bm{A}^{\top}\bm{A} in such a format as well. This can be done by linking up the dangling column edges of 𝑨\bm{A}^{\top} with the corresponding dangling row edges of 𝑨\bm{A}. For example, suppose 𝑨\bm{A} is the TN matrix in Figure 3. The graphical representation of 𝑨\bm{A}^{\top} is then a horizontal mirror image of Figure 3. The resulting Gram matrix is illustrated in Figure 4.

Figure 4: Example of how the Gram matrix 𝑨𝑨\bm{A}^{\top}\bm{A} can be computed as a contraction of its TN representation.

A dense representation of the Gram matrix can be computed by contracting its TN representation. For a wide range of TN matrices, this will be much more efficient than computing 𝑨𝑨\bm{A}^{\top}\bm{A} naïvely by first forming 𝑨\bm{A} as a dense matrix and then carrying out a dense matrix-matrix multiplication between 𝑨\bm{A}^{\top} and 𝑨\bm{A}. Once 𝑨𝑨\bm{A}^{\top}\bm{A} is computed, its pseudoinverse 𝚽=(𝑨𝑨)+\bm{\Phi}=(\bm{A}^{\top}\bm{A})^{+} is affordable to compute via standard methods.
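As a minimal illustration of this idea, consider a small chain-shaped TN matrix that we make up for this purpose (two cores G1 and G2 sharing a bond). The sketch below contracts the Gram matrix directly from the cores, without ever forming the matrix itself, and then pseudo-inverts it to obtain Φ; the dense check at the end is for verification only.

import numpy as np

rng = np.random.default_rng(2)
I1, I2, S, R = 8, 9, 4, 5
G1 = rng.standard_normal((I1, S))        # core 1: (row mode i1, bond s)
G2 = rng.standard_normal((S, I2, R))     # core 2: (bond s, row mode i2, column mode r)

# Gram of A without forming A: contract the network for A^T A over i1, i2 and the bonds.
gram = np.einsum("is,sjr,it,tjq->rq", G1, G2, G1, G2)
Phi = np.linalg.pinv(gram)

# Check against the dense matrix A(i1 i2, r) = sum_s G1[i1,s] G2[s,i2,r].
A = np.einsum("is,sjr->ijr", G1, G2).reshape(I1 * I2, R, order="F")
assert np.allclose(gram, A.T @ A)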

4.2 Drawing Samples Efficiently

We now describe how to sample rows of 𝑨\bm{A} according to its leverage score distribution without needing to form 𝑨\bm{A}. In order to keep the notation simple, we will assume without loss of generality that the modes of the TN 𝓐\bm{\mathscr{A}} describing the matrix 𝑨\bm{A} are arranged so that

𝑨(i1iNr¯,iNr+1iN¯)=𝓐i1iN.\bm{A}(\overline{i_{1}\cdots i_{N_{r}}},\overline{i_{N_{r}+1}\cdots i_{N}})=\bm{\mathscr{A}}_{i_{1}\cdots i_{N}}.

Sampling a row of 𝑨\bm{A} is therefore equivalent to sampling a multi-index i1iNr¯[n=1NrIn]\overline{i_{1}\cdots i_{N_{r}}}\in[\prod_{n=1}^{N_{r}}I_{n}] with each in[In]i_{n}\in[I_{n}].

In a nutshell, our sampling scheme samples the indices i1,,iNri_{1},\ldots,i_{N_{r}} sequentially, one after another, conditioning each draw on the previously drawn indices. This allows us to sample from the leverage score distribution while avoiding the formation of the full probability vector of length n=1NrIn\prod_{n=1}^{N_{r}}I_{n}.

In the following we use (i1)\mathbb{P}(i_{1}) to denote the probability that the first index is i1i_{1}. We use (ini1,,in1)\mathbb{P}(i_{n}\mid i_{1},\ldots,i_{n-1}) to denote the conditional probability that the nnth index is ini_{n} given that the previous n1n-1 indices are i1,,in1i_{1},\ldots,i_{n-1}. Let ρ=defrank(𝑨)\rho\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\operatorname{rank}(\bm{A}). From Definition 3.2 and the formula in (4) we would like to sample row i1iNr¯\overline{i_{1}\cdots i_{N_{r}}} with probability

1ρ𝑨(i1iNr¯,:)𝚽𝑨(i1iNr¯,:).\frac{1}{\rho}\,\bm{A}(\overline{i_{1}\cdots i_{N_{r}}},:)\,\bm{\Phi}\,\bm{A}(\overline{i_{1}\cdots i_{N_{r}}},:)^{\top}.

Therefore, we want the first index to be i1i_{1} with probability

(i1)=1ρi2iNr𝑨(i1iNr¯,:)𝚽𝑨(i1iNr¯,:).\mathbb{P}(i_{1})=\frac{1}{\rho}\sum_{i_{2}\cdots i_{N_{r}}}\bm{A}(\overline{i_{1}\cdots i_{N_{r}}},:)\,\bm{\Phi}\,\bm{A}(\overline{i_{1}\cdots i_{N_{r}}},:)^{\top}. (5)

The matrix 𝑨𝚽𝑨\bm{A}\bm{\Phi}\bm{A}^{\top} can easily be formulated as a TN matrix by linking up the TNs for 𝑨\bm{A} and 𝑨\bm{A}^{\top} with 𝚽\bm{\Phi}. Moreover, the summation in (5) amounts to adding Nr1N_{r}-1 additional contractions to the TN representation of 𝑨𝚽𝑨\bm{A}\bm{\Phi}\bm{A}^{\top}. All in all, this results in a TN matrix 𝑴(1)I1×I1\bm{M}^{(1)}\in\mathbb{R}^{I_{1}\times I_{1}} with only two dangling edges. The desired probabilities then lie along the diagonal of this matrix: (i1)=𝑴i1i1(1)\mathbb{P}(i_{1})=\bm{M}^{(1)}_{i_{1}i_{1}}. The dense representation of 𝑴(1)\bm{M}^{(1)} can be computed efficiently by contracting the underlying network (it is sufficient to compute only the diagonal entries, which further reduces the cost). An example of what this looks like when 𝑨\bm{A} is the TN matrix in Figure 3 is illustrated in Figure 5. The first index i1i_{1} is then drawn according to the probability distribution ((i1))i1=1I1(\mathbb{P}(i_{1}))_{i_{1}=1}^{I_{1}}.

Figure 5: Example of TN matrix for computing the distribution of the first index i1i_{1}.

The next step is to draw subsequent indices conditionally on the previously drawn indices. Suppose we have drawn i1,,in1i_{1},\ldots,i_{n-1}. When sampling the nnth index we compute the probabilities

(ini1,,in1)=(i1,,in)(i1,,in1)\mathbb{P}(i_{n}\mid i_{1},\ldots,i_{n-1})=\frac{\mathbb{P}(i_{1},\ldots,i_{n})}{\mathbb{P}(i_{1},\ldots,i_{n-1})} (6)

for each in[In]i_{n}\in[I_{n}]. Since (i1,,in1)=in(i1,,in)\mathbb{P}(i_{1},\ldots,i_{n-1})=\sum_{i_{n}}\mathbb{P}(i_{1},\ldots,i_{n}) it is sufficient to compute the numerator in (6) for each in[In]i_{n}\in[I_{n}] and then normalize them so that they add up to 1. Note that

(i1,,in)=1ρin+1iNr𝑨(i1iNr¯,:)𝚽𝑨(i1iNr¯,:).\mathbb{P}(i_{1},\ldots,i_{n})=\frac{1}{\rho}\sum_{i_{n+1}\cdots i_{N_{r}}}\bm{A}(\overline{i_{1}\cdots i_{N_{r}}},:)\,\bm{\Phi}\,\bm{A}(\overline{i_{1}\cdots i_{N_{r}}},:)^{\top}. (7)

As earlier, this can be represented as a TN by adding NrnN_{r}-n additional contractions over the indices in+1,,iNri_{n+1},\ldots,i_{N_{r}} to the TN representation of 𝑨𝚽𝑨\bm{A}\bm{\Phi}\bm{A}^{\top}. Moreover, since the indices i1,,in1i_{1},\ldots,i_{n-1} are fixed, this effectively eliminates the corresponding modes of the underlying tensor representation by reducing their dimensionality to 1. This results in a TN matrix 𝑴(n)In×In\bm{M}^{(n)}\in\mathbb{R}^{I_{n}\times I_{n}} with only two dangling edges, with the desired joint probabilities lying along the diagonal: (i1,,in)=𝑴inin(n)\mathbb{P}(i_{1},\ldots,i_{n})=\bm{M}^{(n)}_{i_{n}i_{n}} (again, the cost can be reduced by computing only the diagonal elements of 𝑴(n)\bm{M}^{(n)}). Figure 6 illustrates what this looks like when we are computing the probability (i1,i2)\mathbb{P}(i_{1},i_{2}) for the example from Figure 5 with i1i_{1} fixed.

Figure 6: Example of TN matrix for computing the distribution of the second index i2i_{2} conditionally on the first index i1i_{1}.

This sampling procedure is carried out until all indices i1,,iNri_{1},\ldots,i_{N_{r}} have been drawn, which corresponds to drawing a single row index i1iNr¯\overline{i_{1}\cdots i_{N_{r}}}. The procedure is then repeated until the desired number of samples has been drawn. Note that the distribution ((i1))i1=1I1(\mathbb{P}(i_{1}))_{i_{1}=1}^{I_{1}} remains the same for all samples, so in order to speed up the overall sampling procedure it is a good idea to draw i1i_{1} for all samples immediately.
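The sketch below walks through this procedure for the same kind of made-up chain-shaped TN matrix as in Section 4.1 (two row modes i1 and i2): it computes the marginal distribution of i1 as in (5), the conditional distribution of i2 given the drawn i1 as in (6)-(7), and checks that the resulting joint probability matches the exact leverage score distribution.

import numpy as np

rng = np.random.default_rng(3)
I1, I2, S, R = 6, 7, 3, 4
G1 = rng.standard_normal((I1, S))        # core 1: (i1, bond s)
G2 = rng.standard_normal((S, I2, R))     # core 2: (bond s, i2, column r)

gram = np.einsum("is,sjr,it,tjq->rq", G1, G2, G1, G2)   # A^T A via contraction
Phi = np.linalg.pinv(gram)
rho = np.linalg.matrix_rank(gram)                       # rank(A)

# Marginal of i1, eq. (5): contract A Phi A^T over i2 and keep the diagonal in i1.
p_i1 = np.einsum("is,sjr,rq,it,tjq->i", G1, G2, Phi, G1, G2) / rho

# Draw i1, then the conditional of i2 given i1, eqs. (6)-(7): fix the i1 slice of G1.
i1 = rng.choice(I1, p=p_i1)
joint_i2 = np.einsum("s,sjr,rq,t,tjq->j", G1[i1], G2, Phi, G1[i1], G2) / rho
p_i2_given_i1 = joint_i2 / joint_i2.sum()
i2 = rng.choice(I2, p=p_i2_given_i1)

# Sanity check: P(i1) * P(i2 | i1) equals the leverage score of row (i1, i2) over rank(A).
A = np.einsum("is,sjr->ijr", G1, G2).reshape(I1 * I2, R, order="F")
lev = np.sum(np.linalg.qr(A)[0] ** 2, axis=1)
assert np.isclose(p_i1[i1] * p_i2_given_i1[i2], lev[i1 + I1 * i2] / rho)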

The efficiency of the proposed sampling scheme relies on being able to efficiently contract the TNs corresponding to the Gram matrix computation 𝑨𝑨\bm{A}^{\top}\bm{A} and the probability distributions in (5) and (6). For a large class of tensor networks, this will be possible. We discuss the computational complexity of these contractions for the TN matrices that arise for CP and tensor ring decompositions in Section B.

5 Application to Tensor Decomposition

TNs can be used to express a very wide range of tensor decompositions, including the CP [10], Tucker [27], tensor train [23] and tensor ring decompositions [31].

Suppose 𝓧I1××IN\bm{\mathscr{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}} is a data tensor that we want to decompose into some TN format consisting of tensors 𝓐(1),,𝓐(M)\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(M)}. We use TN(𝓐(1),,𝓐(M))I1××IN\operatorname{TN}(\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(M)})\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}} to denote this TN. This decomposition problem can be formulated as the optimization problem

argmin𝓐(1),,𝓐(M)𝓧TN(𝓐(1),,𝓐(M))F,\operatorname*{arg\,min}_{\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(M)}}\|\bm{\mathscr{X}}-\operatorname{TN}(\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(M)})\|_{\textup{F}}, (8)

where F\|\cdot\|_{\textup{F}} is a straightforward generalization of the matrix Frobenius norm to tensors (see Section A). In general, (8) is a non-convex optimization problem which is hard to solve. A popular approach to finding an approximate solution to (8) is to iteratively update one tensor 𝓐(m)\bm{\mathscr{A}}^{(m)} at a time via alternating least squares (ALS). Minimizing the objective in (8) with respect to a single tensor at a time turns it into a linear least squares problem:

argmin𝑨(m)𝑿(m)𝑨m𝑨(m)F,\operatorname*{arg\,min}_{\bm{A}^{(m)}}\|\bm{X}^{(m)}-\bm{A}^{\neq m}\bm{A}^{(m)}\|_{\textup{F}}, (9)

where 𝑿(m)\bm{X}^{(m)} and 𝑨(m)\bm{A}^{(m)} are appropriate matricizations of 𝓧\bm{\mathscr{X}} and 𝓐(m)\bm{\mathscr{A}}^{(m)}, respectively, and 𝑨m\bm{A}^{\neq m} is a TN matrix which depends on the tensors 𝓐(1),,𝓐(m1),𝓐(m+1),,𝓐(M)\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(m-1)},\bm{\mathscr{A}}^{(m+1)},\ldots,\bm{\mathscr{A}}^{(M)}. For example, for the CP and Tucker decompositions, 𝑨m\bm{A}^{\neq m} takes the form of a Khatri–Rao and Kronecker product matrix, respectively [12]. Figure 7 shows examples of three tensor decompositions (top plots) and examples of what a TN matrix 𝑨m\bm{A}^{\neq m} corresponding to each decomposition looks like (bottom plots). Algorithm 1 provides a high-level ALS algorithm for computing any TN decomposition.

Figure 7: The top plots show examples of tensor decompositions: (a) CP decomposition, (b) tensor ring decomposition, and (c) a TN decomposition discovered in [15] via an evolutionary search procedure. The bottom plots (d)–(f) show examples of TN matrices that arise when updating the tensor marked in gray in the corresponding top plot. A tensor 𝓐\bm{\mathscr{A}} is said to be diagonal if only the elements 𝓐iii\bm{\mathscr{A}}_{ii\cdots i} are nonzero.
Input: 𝓧I1××IN\bm{\mathscr{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}
Output: Factorization tensors 𝓐(1),,𝓐(M)\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(M)}
1: Initialize tensors 𝓐(2),,𝓐(M)\bm{\mathscr{A}}^{(2)},\ldots,\bm{\mathscr{A}}^{(M)}
2: while termination criteria not met do
3:   for m=1,,Mm=1,\ldots,M do
4:     𝑨(m)=argmin𝑨𝑿(m)𝑨m𝑨F\bm{A}^{(m)}=\operatorname*{arg\,min}_{\bm{A}}\|\bm{X}^{(m)}-\bm{A}^{\neq m}\bm{A}\|_{\textup{F}}
5: return 𝓐(1),,𝓐(M)\bm{\mathscr{A}}^{(1)},\ldots,\bm{\mathscr{A}}^{(M)}
Algorithm 1 ALS for TN decomposition

ALS is the “workhorse approach” to tensor decomposition [12]. ALS-based methods for various decompositions are implemented in many leading tensor software packages, including Tensor Toolbox [2] and Tensorlab [28] for Matlab and TensorLy [13] for Python. ALS typically works well, especially when the data tensor 𝓧\bm{\mathscr{X}} is not too large. However, as previously discussed, each iteration of ALS depends on all entries of 𝓧\bm{\mathscr{X}}. If 𝓧\bm{\mathscr{X}} has NN modes each of dimension II, then 𝓧\bm{\mathscr{X}} contains INI^{N} entries. Algorithms for tensor decomposition therefore inherit this exponential dependence on NN. In the case of ALS, each iteration computed via (9) requires access to all entries of 𝓧\bm{\mathscr{X}}. Moreover, for every iteration, it requires forming the matrix 𝑨m\bm{A}^{\neq m} and then solving a linear system involving this matrix.

As discussed in Section 2, several previous works have sought to address the curse of dimensionality in ALS algorithms by sampling the least squares problem in (9). These works develop methods for specific tensor decompositions. By contrast, our sampling framework described in Section 4 can be used to develop a sampling-based decomposition scheme for any TN decomposition. This is done by independently sampling each least squares problem on line 4 of Algorithm 1 according to the leverage scores of the TN matrix 𝑨m\bm{A}^{\neq m}. The sampling is done by first determining which row indices to sample using the techniques in Section 4. Letting 𝑺\bm{S} represent the sampling operation, the next step is to compute 𝑺𝑿(m)\bm{S}\bm{X}^{(m)} and 𝑺𝑨m\bm{S}\bm{A}^{\neq m}. The final step is to solve the sampled least squares problem

𝑨(m)=argmin𝑨𝑺𝑿(m)𝑺𝑨m𝑨F.\bm{A}^{(m)}=\operatorname*{arg\,min}_{\bm{A}}\|\bm{S}\bm{X}^{(m)}-\bm{S}\bm{A}^{\neq m}\bm{A}\|_{\textup{F}}. (10)
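Schematically, the sampled update looks as follows in NumPy. The helper names sample_rows, A_neq_m_rows and X_unf_rows are hypothetical placeholders for the sampling procedure of Section 4 and for routines that read only the sampled rows of the design matrix and the unfolding; they are not part of the paper's code.

import numpy as np

def sampled_als_update(sample_rows, A_neq_m_rows, X_unf_rows, J):
    """One sampled least squares update, cf. (10).

    sample_rows(J)    -> (idx, p): J sampled row indices and their probabilities
    A_neq_m_rows(idx) -> the corresponding rows of the TN matrix A^{!=m}  (J x R)
    X_unf_rows(idx)   -> the corresponding rows of the unfolding X^{(m)}  (J x I_m)
    """
    idx, p = sample_rows(J)
    w = 1.0 / np.sqrt(J * p)                 # rescaling from Definition 3.2
    SA = w[:, None] * A_neq_m_rows(idx)      # S A^{!=m}
    SX = w[:, None] * X_unf_rows(idx)        # S X^{(m)}
    A_m, *_ = np.linalg.lstsq(SA, SX, rcond=None)
    return A_m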

In order to make the discussion in Sections 4 and 5 a bit more concrete, we now apply these ideas to the CP decomposition.

Example 5.1.

Suppose 𝓧I1××IN\bm{\mathscr{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}} is an NN-way tensor and that we want to compute a rank-RR CP decomposition of it. Figure 7 (a) shows what the CP decomposition looks like when N=4N=4. Mathematically, for an arbitrary NN it takes the form

𝓧i1iNr=1Rλr𝑨i1r(1)𝑨iNr(N),\bm{\mathscr{X}}_{i_{1}\cdots i_{N}}\approx\sum_{r=1}^{R}\lambda_{r}\bm{A}^{(1)}_{i_{1}r}\cdots\bm{A}^{(N)}_{i_{N}r}, (11)

where λ1,,λR\lambda_{1},\ldots,\lambda_{R} are the elements in the diagonal tensor in the middle of Figure 7 (a), and the other tensors now have two modes and are therefore denoted by 𝑨(m)Im×R\bm{A}^{(m)}\in\mathbb{R}^{I_{m}\times R} and referred to as factor matrices. The values of the diagonal elements can be incorporated into the factor matrices, and we therefore assume that they are all equal to 1 without loss of generality. The design matrices now take the form

\bm{A}^{\neq m}=\bm{A}^{(N)}\odot\cdots\odot\bm{A}^{(m+1)}\odot\bm{A}^{(m-1)}\odot\cdots\odot\bm{A}^{(1)}.

The corresponding unfoldings 𝑿(m)\bm{X}^{(m)} are defined elementwise via

𝑿(m)(i1im1im+1iN¯,im)=𝓧i1iN.\bm{X}^{(m)}(\overline{i_{1}\cdots i_{m-1}i_{m+1}\cdots i_{N}},i_{m})=\bm{\mathscr{X}}_{i_{1}\cdots i_{N}}.

The TN corresponding to the Gram matrix 𝑨m𝑨m\bm{A}^{\neq m\top}\bm{A}^{\neq m} can be efficiently contracted via the well-known formula

\bm{A}^{\neq m\top}\bm{A}^{\neq m}=\mathop{\circledast}_{j\neq m}\bm{A}^{(j)\top}\bm{A}^{(j)}, (12)

where \circledast denotes elementwise product. Drawing a row index i[jmIj]i\in[\prod_{j\neq m}I_{j}] for 𝑨m\bm{A}^{\neq m} corresponds to drawing a multi-index i1im1im+1iN¯\overline{i_{1}\cdots i_{m-1}i_{m+1}\cdots i_{N}} with each ij[Ij]i_{j}\in[I_{j}]. It is straightforward to show that the appropriate sampling probabilities in (5) and (7) are now given by

((ij)jn,jm)=1Rrk𝚽rk(jnjm𝑨ijr(j)𝑨ijk(j))(j>njm(𝑨(j)𝑨(j))rk).\mathbb{P}((i_{j})_{j\leq n,j\neq m})=\frac{1}{R}\sum_{rk}\bm{\Phi}_{rk}\cdot\Big{(}\prod_{\begin{subarray}{c}j\leq n\\ j\neq m\end{subarray}}\bm{A}^{(j)}_{i_{j}r}\bm{A}^{(j)}_{i_{j}k}\Big{)}\Big{(}\prod_{\begin{subarray}{c}j>n\\ j\neq m\end{subarray}}(\bm{A}^{(j)\top}\bm{A}^{(j)})_{rk}\Big{)}.

This formula appears in Lemma 10 of [18], but in that formula 𝚽\bm{\Phi} is an approximation of (𝑨m𝑨m)+(\bm{A}^{\neq m\top}\bm{A}^{\neq m})^{+}. This also explains why a different normalization constant is used in [18]. The proof, however, remains the same.
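To make Example 5.1 concrete, the sketch below (random factor matrices with made-up sizes) computes Φ via (12) and evaluates the formula above for the marginal probability of the first sampled index (the case n = 1), checking it against a brute-force computation over the Khatri–Rao design matrix.

import numpy as np

rng = np.random.default_rng(4)
I1, I2, I3, R = 8, 9, 10, 5
factors = [rng.standard_normal((I, R)) for I in (I1, I2, I3)]
m = 2                                        # update factor 3 (0-based index 2)

# Phi via (12): pseudoinverse of the Hadamard product of the factor Grams.
gram = np.ones((R, R))
for j, A_j in enumerate(factors):
    if j != m:
        gram *= A_j.T @ A_j
Phi = np.linalg.pinv(gram)

# Marginal sampling probability of the first index i1 (n = 1 in the formula above).
A1, A2 = factors[0], factors[1]
p_i1 = np.einsum("rk,ir,ik,rk->i", Phi, A1, A1, A2.T @ A2) / R

# Brute force: sum A[(i1,i2),:] Phi A[(i1,i2),:]^T over i2 for the Khatri-Rao design matrix.
p_bf = np.einsum("ir,jr,rk,ik,jk->i", A1, A2, Phi, A1, A2) / R
assert np.allclose(p_i1, p_bf)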

5.1 Computational Complexity Examples

In this section we give the computational complexity of our proposed approach when used for CP and tensor ring decomposition. This allows us to compare our work with [18], which also considers these two decompositions. Tables 1 and 2 present the computational complexity of various methods for CP and tensor ring decomposition, respectively, when computing a rank-RR decomposition of an NN-way cubic tensor 𝓧I××I\bm{\mathscr{X}}\in\mathbb{R}^{I\times\cdots\times I}. For a precise definition of the rank RR for each decomposition, see (11) and (14). Our approaches for CP and tensor ring decomposition are called TNS-CP and TNS-TR, respectively, where TNS stands for Tensor Network Sampling. The complexities for our methods are computed in Section B. The complexities for the competing methods are taken from Tables 1 and 2 in [18]. For the randomized methods, the complexities reflect the parameter choices required for theoretical performance guarantees; see the respective cited works for further details.

As mentioned in Section 2, the primary contribution of [18] was to reduce the dependence on NN in the per-iteration cost from exponential to polynomial in ALS for CP and tensor ring decomposition. Our methods further improve on that cost by reducing the dependence on RR. The improvement is due to the avoidance of the complicated recursive sketching step used in [18]. Note that all methods other than ours and those by [18] have an exponential dependence on NN.

Table 1: Leading order computational cost (ignoring log\log factors) for various CP decomposition methods. #it is the number of ALS iterations. RR denotes the CP rank. Note that ε~\tilde{\varepsilon} is used as the accuracy parameter instead of ε\varepsilon for SPALS. The reason is that SPALS has additive-error guarantees which are weaker than the relative-error guarantees of the other randomized methods; see Section B.3 for a discussion.
Method Complexity
CP-ALS [12] #itN(N+I)IN1R\textup{\#it}\cdot N(N+I)I^{N-1}R
SPALS [3] IN+#itN(N+I)RN+1/ε~2I^{N}+\textup{\#it}\cdot N(N+I)R^{N+1}/\tilde{\varepsilon}^{2}
CP-ARLS-LEV [14] #itN(R+I)RN/(εδ)\textup{\#it}\cdot N(R+I)R^{N}/(\varepsilon\delta)
CP-ALS-ES [18] #itN2R3(R+NI/ε)/δ\textup{\#it}\cdot N^{2}R^{3}(R+NI/\varepsilon)/\delta
TNS-CP (our) #itN3IR3/(εδ)\textup{\#it}\cdot N^{3}IR^{3}/(\varepsilon\delta)
Table 2: Leading order computational cost (ignoring log\log factors) for various tensor ring decomposition methods. #it is the number of ALS iterations. The tensor ring rank is (R,,R)(R,\ldots,R).
Method Complexity
TR-ALS [31] #itNINR2\textup{\#it}\cdot NI^{N}R^{2}
rTR-ALS [30] NINK+#itNKNR2NI^{N}K+\textup{\#it}\cdot NK^{N}R^{2}
TR-SVD [31] IN+1+INR3I^{N+1}+I^{N}R^{3}
TR-SVD-Rand [1] INR2I^{N}R^{2}
TR-ALS-Sampled [19] #itNIR2N+2/(εδ)\textup{\#it}\cdot NIR^{2N+2}/(\varepsilon\delta)
TR-ALS-ES [18] #itN3R8(R+I/ε)/δ\textup{\#it}\cdot N^{3}R^{8}(R+I/\varepsilon)/\delta
TNS-TR (our) #itN3IR8/(εδ)\textup{\#it}\cdot N^{3}IR^{8}/(\varepsilon\delta)

6 Experiments

We run the same feature extraction and classification experiment as in [18]. The experiment considers 7200 images from the COIL-100 dataset [22]. These images depict 100 different objects (e.g., a toy car, an onion, a tomato), each photographed from 72 different angles. Each image is 128 by 128 pixels and has three color channels. We arrange these images into a tensor 𝓧\bm{\mathscr{X}} of size 7200×128×128×37200\times 128\times 128\times 3 and then decompose it using either a rank-25 CP decomposition or a rank-(5,5,5,5)(5,5,5,5) tensor ring decomposition. For the CP decomposition, we treat the rows of the factor matrix 𝑨(1)7200×25\bm{A}^{(1)}\in\mathbb{R}^{7200\times 25} as feature vectors for each image. We treat 90% of the images as labeled and use them to classify the images in the remaining 10% using a kk-nearest neighbor algorithm with k=1k=1 and 10-fold cross validation. The tensor ring decomposition is used in the same way, but instead of the CP factor matrix we now use the core tensor 𝓐(1)5×7200×5\bm{\mathscr{A}}^{(1)}\in\mathbb{R}^{5\times 7200\times 5} corresponding to the first mode and reshape the slices (𝓐:i:(1))i(\bm{\mathscr{A}}^{(1)}_{:i:})_{i} into length-25 feature vectors. Further details on the experiment setup are given in Section C.
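As a sketch of the downstream classification step only (not the paper's experiment code), the rows of the CP factor matrix can be fed to a 1-nearest-neighbor classifier with 10-fold cross validation, e.g., with scikit-learn; random stand-ins replace the actual features and labels below.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# In the real experiment, `features` would be the rows of the CP factor matrix A^(1)
# (7200 x 25) or the reshaped tensor ring slices, and `labels` the object identity
# (one of 100 classes) of each image; random stand-ins are used here.
rng = np.random.default_rng(0)
features = rng.standard_normal((7200, 25))
labels = rng.integers(0, 100, size=7200)

clf = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(clf, features, labels, cv=10)
print("mean 10-fold accuracy: %.3f" % scores.mean())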

Table 3 shows the run time, relative decomposition error (Err.) and classification accuracy (Acc.) for a number of different CP and tensor ring decomposition methods. The algorithms for our TNS-CP and TNS-TR were implemented by modifying the codes for CP-ALS-ES and TR-ALS-ES by [18] appropriately; our code is available at https://github.com/OsmanMalik/TNS.

Table 3: Run time, decomposition error and classification accuracy in the feature extraction experiment.
Method Time (s) Err. Acc. (%)
CP-ALS (Ten. Toolbox) 48.0 0.31 99.2
CPD-ALS (Tensorlab) 71.9 0.31 99.0
CPD-MINF (Tensorlab) 108.6 0.33 99.7
CPD-NLS (Tensorlab) 112.1 0.31 92.6
CP-ARLS-LEV 28.6 0.32 97.7
CP-ALS-ES 29.0 0.32 98.3
TNS-CP (our) 23.5 0.32 98.3
TR-ALS 10278.4 0.31 99.5
TR-ALS-Sampled 5.3 0.33 98.0
TR-ALS-ES 29.7 0.33 97.2
TNS-TR (our) 14.5 0.33 97.3

All methods achieve roughly the same decomposition error, with the randomized methods typically having a slightly higher error. All methods except CPD-NLS yield similar classification accuracy. It is clear that our methods are faster than CP-ALS-ES and TR-ALS-ES from [18], corroborating the improved complexity reported in Tables 1 and 2.

7 Conclusion

We have presented a sampling-based ALS approach for decomposing a tensor into any TN format. The generality of the framework is notable—all previous works on randomized tensor decomposition we are aware of develop methods for specific decompositions. Additionally, unlike most previous sampling-based ALS algorithms in the literature, our framework makes it possible to do the sampling according to the exact leverage scores. Our approach simplifies and generalizes the scheme developed in [18]. Both complexity analyses and numerical experiments confirm the improved run time we are able to achieve compared to [18] when we use our framework for CP and tensor ring decomposition.

There are many exciting avenues for future research. One direction is to adapt and implement our method for use in a high-performance distributed memory setting. Another interesting direction is designing an exact leverage score sampling scheme for general TN matrices 𝑨I×R\bm{A}\in\mathbb{R}^{I\times R} that avoids the Ω(R3)\Omega(R^{3}) cost we incur when computing the pseudoinverse 𝚽=(𝑨𝑨)+\bm{\Phi}=(\bm{A}^{\top}\bm{A})^{+}. Such methods are known for certain matrices with Kronecker product structure [20], but no such results are known for general TN matrices.

Acknowledgments

OAM was supported by the LDRD Program of Lawrence Berkeley National Laboratory under U.S. DOE Contract No. DE-AC02-05CH11231. VB was supported in part by the U.S. DOE, Office of Science, Office of Advanced Scientific Computing Research, DOE Computational Science Graduate Fellowship under Award No. DE-SC0022158. RM was supported by NSF Grant No. 2004235.

References

  • Ahmadi-Asl et al. [2020] S. Ahmadi-Asl, A. Cichocki, A. H. Phan, M. G. Asante-Mensah, F. Mousavi, I. Oseledets, and T. Tanaka. Randomized algorithms for fast computation of low-rank tensor ring model. Machine Learning: Science and Technology, 2020.
  • Bader et al. [2021] B. W. Bader, T. G. Kolda, and others. MATLAB Tensor Toolbox, Version 3.2.1. Available online at https://www.tensortoolbox.org, 2021.
  • Cheng et al. [2016] D. Cheng, R. Peng, Y. Liu, and I. Perros. SPALS: Fast alternating least squares via implicit leverage scores sampling. In Advances In Neural Information Processing Systems, pages 721–729, 2016.
  • Cichocki et al. [2016] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. P. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249–429, 2016.
  • Cichocki et al. [2017] A. Cichocki, A.-H. Phan, Q. Zhao, N. Lee, I. Oseledets, M. Sugiyama, and D. P. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Foundations and Trends in Machine Learning, 9(6):431–673, 2017.
  • Drineas et al. [2006] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for 2\ell_{2} regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pages 1127–1136, 2006.
  • Drineas et al. [2008] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
  • Drineas et al. [2011] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
  • Fahrbach et al. [2022] M. Fahrbach, T. Fu, and M. Ghadiri. Subquadratic Kronecker regression with applications to tensor decomposition. Preprint arXiv:2209.04876, 2022.
  • Hitchcock [1927] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927.
  • Ji et al. [2019] Y. Ji, Q. Wang, X. Li, and J. Liu. A survey on tensor techniques and applications in machine learning. IEEE Access, 7:162950–162990, 2019.
  • Kolda and Bader [2009] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, Aug. 2009.
  • Kossaifi et al. [2019] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic. TensorLy: Tensor learning in Python. Journal of Machine Learning Research, 20(26):1–6, 2019.
  • Larsen and Kolda [2022] B. W. Larsen and T. G. Kolda. Practical leverage-based sampling for low-rank tensor decomposition. SIAM Journal on Matrix Analysis and Applications, 43(3):1488–1517, 2022.
  • Li and Sun [2020] C. Li and Z. Sun. Evolutionary topology search for tensor network decomposition. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 5947–5957. PMLR, 2020.
  • Ma and Solomonik [2022] L. Ma and E. Solomonik. Cost-efficient Gaussian tensor network embeddings for tensor-structured inputs. Preprint arXiv:2205.13163, 2022.
  • Mahoney [2011] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
  • Malik [2022] O. A. Malik. More efficient sampling for tensor decomposition with worst-case guarantees. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 14887–14917. PMLR, 2022.
  • Malik and Becker [2021] O. A. Malik and S. Becker. A sampling-based method for tensor ring decomposition. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 7400–7411. PMLR, 2021.
  • Malik et al. [2022] O. A. Malik, Y. Xu, N. Cheng, S. Becker, A. Doostan, and A. Narayan. Fast algorithms for monotone lower subsets of Kronecker least squares problems. Preprint arXiv:2209.05662, 2022.
  • Meirom et al. [2022] E. Meirom, H. Maron, S. Mannor, and G. Chechik. Optimizing tensor network contraction using reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 15278–15292. PMLR, 2022.
  • Nene et al. [1996] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, Columbia University, 1996.
  • Oseledets [2011] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
  • Papalexakis et al. [2016] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos. Tensors for data mining and data fusion: Models, applications, and scalable algorithms. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):1–44, 2016.
  • Pfeifer et al. [2014a] R. N. Pfeifer, G. Evenbly, S. Singh, and G. Vidal. NCON: A tensor network contractor for MATLAB. Preprint arXiv:1402.0939, 2014a.
  • Pfeifer et al. [2014b] R. N. C. Pfeifer, J. Haegeman, and F. Verstraete. Faster identification of optimal contraction sequences for tensor networks. Phys. Rev. E, 90:033315, 2014b.
  • Tucker [1963] L. R. Tucker. Implications of factor analysis of three-way matrices for measurement of change. Problems in measuring change, 15(122-137):3, 1963.
  • Vervliet et al. [2016] N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer. Tensorlab 3.0. Available online at https://www.tensorlab.net, Mar. 2016.
  • Woodruff [2014] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.
  • Yuan et al. [2019] L. Yuan, C. Li, J. Cao, and Q. Zhao. Randomized tensor ring decomposition and its application to large-scale data reconstruction. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2127–2131. IEEE, 2019.
  • Zhao et al. [2016] Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki. Tensor ring decomposition. Preprint arXiv:1606.05535, 2016.

Appendix A Additional Notation

Let 𝑨I×J\bm{A}\in\mathbb{R}^{I\times J} and 𝑩K×L\bm{B}\in\mathbb{R}^{K\times L} be two matrices. Their Kronecker product is denoted by 𝑨𝑩IK×JL\bm{A}\otimes\bm{B}\in\mathbb{R}^{IK\times JL} and defined as

𝑨𝑩=def[𝑨11𝑩𝑨12𝑩𝑨1J𝑩𝑨21𝑩𝑨22𝑩𝑨2J𝑩𝑨I1𝑩𝑨I2𝑩𝑨IJ𝑩].\bm{A}\otimes\bm{B}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\begin{bmatrix}\bm{A}_{11}\bm{B}&\bm{A}_{12}\bm{B}&\cdots&\bm{A}_{1J}\bm{B}\\ \bm{A}_{21}\bm{B}&\bm{A}_{22}\bm{B}&\cdots&\bm{A}_{2J}\bm{B}\\ \vdots&\vdots&&\vdots\\ \bm{A}_{I1}\bm{B}&\bm{A}_{I2}\bm{B}&\cdots&\bm{A}_{IJ}\bm{B}\\ \end{bmatrix}.

Suppose now that 𝑨I×J\bm{A}\in\mathbb{R}^{I\times J} and 𝑩K×J\bm{B}\in\mathbb{R}^{K\times J}, i.e., 𝑨\bm{A} and 𝑩\bm{B} have the same number of columns. The Khatri–Rao product of 𝑨\bm{A} and 𝑩\bm{B} is denoted by 𝑨𝑩IK×J\bm{A}\odot\bm{B}\in\mathbb{R}^{IK\times J} and defined as

𝑨𝑩=def[𝑨:1𝑩:1𝑨:2𝑩:2𝑨:J𝑩:J].\bm{A}\odot\bm{B}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\begin{bmatrix}\bm{A}_{:1}\otimes\bm{B}_{:1}&\bm{A}_{:2}\otimes\bm{B}_{:2}&\cdots&\bm{A}_{:J}\otimes\bm{B}_{:J}\end{bmatrix}.

The Khatri–Rao product is sometimes called the columnwise Kronecker product for obvious reasons.

Now, suppose that 𝑨I×J\bm{A}\in\mathbb{R}^{I\times J} and 𝑩I×J\bm{B}\in\mathbb{R}^{I\times J} are of the same size. The Hadamard product, or elementwise product, of 𝑨\bm{A} and 𝑩\bm{B} is denoted by 𝑨𝑩I×J\bm{A}\circledast\bm{B}\in\mathbb{R}^{I\times J} and defined elementwise in the obvious way:

(𝑨𝑩)ij=𝑨ij𝑩ij.(\bm{A}\circledast\bm{B})_{ij}=\bm{A}_{ij}\bm{B}_{ij}.
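In NumPy, the three products above can be formed as follows (a minimal sketch with made-up sizes):

import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(5, 4)
C = np.random.rand(3, 4)

kron = np.kron(A, C)                                  # Kronecker product, (3*3) x (4*4)
khatri_rao = np.einsum("ij,kj->ikj", A, B).reshape(   # columnwise Kronecker product
    A.shape[0] * B.shape[0], A.shape[1]
)
hadamard = A * C                                      # elementwise (Hadamard) product

# Column j of the Khatri-Rao product is the Kronecker product of the j-th columns.
assert np.allclose(khatri_rao[:, 0], np.kron(A[:, 0], B[:, 0]))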

The Frobenius norm is a standard matrix norm that can be easily extended to tensors. For a tensor 𝓧I1××IN\bm{\mathscr{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}} it is defined as

𝓧F=defi1=1I1iN=1IN𝓧i1iN2.\|\bm{\mathscr{X}}\|_{\textup{F}}\stackrel{{\scriptstyle\text{\tiny{def}}}}{{=}}\sqrt{\sum_{i_{1}=1}^{I_{1}}\cdots\sum_{i_{N}=1}^{I_{N}}\bm{\mathscr{X}}_{i_{1}\cdots i_{N}}^{2}}.

Appendix B Detailed Complexity Analysis

Here we provide further details on the complexity analysis in Section 5.1.

B.1 CP Decomposition

Suppose we are decomposing an NN-way cubic tensor of size 𝓧I××I\bm{\mathscr{X}}\in\mathbb{R}^{I\times\cdots\times I} using a rank-RR CP decomposition. As mentioned in Example 5.1, the entries of the diagonal tensor (see Figure 7 (a)) can be treated as ones without loss of generality. The update in (10) therefore involves updating one of the CP factor matrices 𝑨(m)\bm{A}^{(m)}, m[N]m\in[N].

Computing 𝚽\bm{\Phi}

Computing the Gram matrix 𝑨m𝑨m\bm{A}^{\neq m\top}\bm{A}^{\neq m} via (12) costs O(NIR2)O(NIR^{2}). When doing this computation, we store all the Gram matrices 𝑨(j)𝑨(j)\bm{A}^{(j)\top}\bm{A}^{(j)} for no additional computational cost for later use. Computing the pseudoinverse 𝚽=(𝑨m𝑨m)+\bm{\Phi}=(\bm{A}^{\neq m\top}\bm{A}^{\neq m})^{+} costs an additional O(R3)O(R^{3}). The total cost of computing 𝚽\bm{\Phi} is therefore O(NIR2+R3)O(NIR^{2}+R^{3}).

Drawing JJ Samples

The cost is the same as the sampling cost for CP-ALS-ES in [18], but without the one-time costs since 𝚽\bm{\Phi} and the Gram matrices 𝑨(j)𝑨(j)\bm{A}^{(j)\top}\bm{A}^{(j)}, j[N]{m}j\in[N]\setminus\{m\}, have already been computed. The cost is therefore O(JR2N2I)O(JR^{2}N^{2}I); see [18, Sec. C.1] for details.

Solving Sampled Least Squares Problem

The cost of forming the sampled matrices 𝑺𝑨m\bm{S}\bm{A}^{\neq m} and 𝑺𝑿(m)\bm{S}\bm{X}^{(m)}, and solving the associated linear system (10) via direct methods, is the same as for CP-ALS-ES in [18], namely O(JR(N+R+I))O(JR(N+R+I)); see [18, Sec. C.1] for details.

Adding It All Up

The costs above are for solving one least squares problem of the form (10). Each ALS iteration requires solving NN such systems. Consequently, the per-iteration cost for our proposed method is

O(JR2N3I),O(JR^{2}N^{3}I), (13)

where we have used the fact that the sample size JJ must satisfy J>RJ>R (see Theorem 3.3) to simplify the complexity expression. Since 𝑨m\bm{A}^{\neq m} has RR columns, the bound on JJ from Theorem 3.3 is JRmax(log(R/δ),1/(εδ))J\gtrsim R\max(\log(R/\delta),1/(\varepsilon\delta)). Inserting this into (13) gives the per-iteration cost

O(R3N3Ilog(R/δ)/(εδ))=O~(R3N3I/(εδ))O(R^{3}N^{3}I\log(R/\delta)/(\varepsilon\delta))=\tilde{O}(R^{3}N^{3}I/(\varepsilon\delta))

which is what we report in Table 1.

B.2 Tensor Ring Decomposition

Suppose that we are decomposing an NN-way cubic tensor of size 𝓧I××I\bm{\mathscr{X}}\in\mathbb{R}^{I\times\cdots\times I} using a tensor ring decomposition with all ranks set to RR. The decomposition takes the form

𝓧i1iNr1rN𝓐rNi1r1(1)𝓐r1i2r2(2)𝓐rN1iNrN(N),\bm{\mathscr{X}}_{i_{1}\cdots i_{N}}\approx\sum_{r_{1}\cdots r_{N}}\bm{\mathscr{A}}^{(1)}_{r_{N}i_{1}r_{1}}\bm{\mathscr{A}}^{(2)}_{r_{1}i_{2}r_{2}}\cdots\bm{\mathscr{A}}^{(N)}_{r_{N-1}i_{N}r_{N}}, (14)

where each 𝓐(m)\bm{\mathscr{A}}^{(m)} is now a 3-way tensor of size R×I×RR\times I\times R, and where each summation index rnr_{n} goes from 1 to RR. See [31] for further details on the tensor ring decomposition.

Computing 𝚽\bm{\Phi}

For the tensor ring decomposition, the TN representation of 𝑨m\bm{A}^{\neq m} is shown in Figure 7 for the case N=4N=4. Figure 8 illustrates how we contract the TN corresponding to the Gram matrix 𝑨m𝑨m\bm{A}^{\neq m\top}\bm{A}^{\neq m} for an arbitrary NN. Each edge we eliminate in subplot (a) costs O(IR4)O(IR^{4}). Since there are N1N-1 pairs to contract, this brings the total cost of the step in (a) to O(NIR4)O(NIR^{4}). Each step in the sequence of contractions in (b) costs O(R6)O(R^{6}). Since N2N-2 such contractions are required, the total cost of step (b) is O(NR6)O(NR^{6}). Once the matrix 𝑨m𝑨m\bm{A}^{\neq m\top}\bm{A}^{\neq m} in (c) is computed, it costs O(R6)O(R^{6}) to compute its pseudoinverse, i.e., less than the cost of the step in (b). Adding up the costs for the steps in (a) and (b) and the cost of the pseudoinverse brings the total cost for computing 𝚽\bm{\Phi} to O(N(R6+IR4))O(N(R^{6}+IR^{4})).

Figure 8: Illustration of contraction order of TN for 𝑨m𝑨m\bm{A}^{\neq m\top}\bm{A}^{\neq m} for tensor ring decomposition. The variables II and RR indicate the dimension of each edge. First we eliminate the large dimensions II by contracting over the red edges in (a). Then we contract the resulting chain of tensors in sequence as shown in (b). The final output is the R2×R2R^{2}\times R^{2} matrix shown in (c).
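The sketch below (made-up small sizes, N = 4 and m = 1) mirrors the contraction order in Figure 8: step (a) contracts each remaining core with itself over its large dimension I, step (b) chains the resulting R^2 x R^2 blocks, and the result is checked against a dense construction of the design matrix.

import numpy as np

rng = np.random.default_rng(5)
I, R = 10, 3
G2, G3, G4 = (rng.standard_normal((R, I, R)) for _ in range(3))    # cores 2, 3, 4 (core 1 is being updated)

# Step (a): eliminate the large dimension I of each core, cost O(I R^4) per core.
M2, M3, M4 = (np.einsum("ria,pib->rpab", G, G) for G in (G2, G3, G4))

# Step (b): contract the chain of R^2 x R^2 blocks, cost O(R^6) per contraction.
res = np.einsum("abcd,cdef,efgh->abgh", M2, M3, M4)                 # indices (r1, p1, r4, p4)
gram = res.transpose(0, 2, 1, 3).reshape(R * R, R * R, order="F")  # rows (r1, r4), cols (p1, p4)
Phi = np.linalg.pinv(gram)                                          # final O(R^6) pseudoinverse

# Dense check: build the subchain tensor and matricize it into the I^3 x R^2 design matrix.
T = np.einsum("aib,bjc,ckd->aijkd", G2, G3, G4)                     # (r1, i2, i3, i4, r4)
A = np.transpose(T, (1, 2, 3, 0, 4)).reshape(I**3, R * R, order="F")
assert np.allclose(gram, A.T @ A)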

Drawing JJ Samples

The cost is the same as the sampling cost for TR-ALS-ES in [18], but without the one-time costs in that paper. (The Gram matrices 𝑮[2](j)𝑮[2](j)\bm{G}_{[2]}^{(j)\top}\bm{G}_{[2]}^{(j)} in [18] are precisely what we compute when we do the contractions in Figure 8 (a). These can be saved during that step, which is why the one-time cost of computing those Gram matrices is already accounted for in the computation of 𝚽\bm{\Phi}.) The cost is therefore O(JN2IR6)O(JN^{2}IR^{6}); see [18, Sec. C.2] for details.

Solving Sampled Least Squares Problem

The cost of forming the sampled matrices 𝑺𝑨m\bm{S}\bm{A}^{\neq m} and 𝑺𝑿(m)\bm{S}\bm{X}^{(m)}, and solving the associated linear system (10) via direct methods, is the same as for TR-ALS-ES in [18], namely O(J(NR3+R4+IR2))O(J(NR^{3}+R^{4}+IR^{2})).

Adding It All Up

The costs above are for solving one least squares problem of the form (10). Each ALS iteration requires solving NN such systems. Consequently, the per-iteration cost for our proposed method is

O(JN3IR6).O(JN^{3}IR^{6}). (15)

Since 𝑨m\bm{A}^{\neq m} has R2R^{2} columns, the bound on JJ in Theorem 3.3 is JR2max(log(R2/δ),1/(εδ))J\gtrsim R^{2}\max(\log(R^{2}/\delta),1/(\varepsilon\delta)). Inserting this into (15) gives the per-iteration cost

O(N3IR8log(R2/δ)/(εδ))=O~(N3IR8/(εδ)),O(N^{3}IR^{8}\log(R^{2}/\delta)/(\varepsilon\delta))=\tilde{O}(N^{3}IR^{8}/(\varepsilon\delta)),

which is what we report in Table 2.

B.3 Some Remarks on Tables 1 and 2

All methods in Table 1 except CP-ALS are randomized. The theoretical guarantees for CP-ARLS-LEV, CP-ALS-ES and TNS-CP are all of the same type: They provide relative-error guarantees of the form in Theorem 3.3 for each least squares solve on line 4 of Algorithm 1 when it is adapted to CP decomposition. For these three methods, the variables εε\varepsilon and δδ\delta therefore have identical meanings. SPALS, by contrast, has weaker additive-error guarantees. Expressed using the same notation as in Theorem 3.3, their guarantees take the form

𝑨𝑿~𝒀F2OPT2+ε~𝒀F2,\|\bm{A}\tilde{\bm{X}}-\bm{Y}\|_{\textup{F}}^{2}\leq\textup{OPT}^{2}+\tilde{\varepsilon}\|\bm{Y}\|_{\textup{F}}^{2}, (16)

where the statement in (16) holds with probability at least 1δ1-\delta when the number of samples is large enough. The constant δ\delta has the same meaning as for the other three randomized methods, but ε~\tilde{\varepsilon} has a different meaning than ε\varepsilon: The latter is a relative-error accuracy parameter while the former is an additive-error accuracy parameter.

For the three methods TR-ALS-Sampled, TR-ALS-ES and TNS-TR in Table 2, the parameters δ\delta and ε\varepsilon are all used in the sense of Theorem 3.3.

The number of iterations, denoted by #it, required to reach a certain accuracy or fulfill certain termination criteria may differ between the various methods. For example, the deterministic ALS methods—by virtue of being exact and non-random—may require fewer iterations to reach a certain accuracy. Similar differences may exist between the various randomized ALS methods, and between the different decomposition types.

Appendix C Further Details on Experiment

C.1 Competing Methods

We provide some further details on the methods we compare against below.

C.2 Parameter Choices

We sample J=2000J=2000 rows in all ALS subproblems for the sampling-based CP decomposition methods (CP-ARLS-LEV, CP-ALS-ES and TNS-CP). For the sampling-based tensor ring methods (TR-ALS-Sampled, TR-ALS-ES and TNS-TR) we use J=1000J=1000 samples in the ALS subproblems. Both CP-ALS-ES and TR-ALS-ES require an intermediate embedding via a recursive sketching procedure. For both methods, we use an embedding dimension of 10000 for these intermediate steps, as suggested in [18].

C.3 Hardware/Software and Dataset

The experiments are run in Matlab R2021b on a laptop computer with an Intel Core i7-1185G7 CPU and 32 GB of RAM.

The COIL-100 dataset [22] is available for download at https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php.