
Accelerating jackknife resampling for the Canonical Polyadic Decomposition

Christos Psarras (ORCID 0000-0001-6057-7491), RWTH Aachen University, Schinkelstr. 2, 52062 Aachen, North Rhine-Westphalia, Germany, christos.psarras@rwth-aachen.de
Lars Karlsson, Umeå Universitet, Plan 4, MIT-huset, Umeå, Sweden, larsk@cs.umu.se
Rasmus Bro (ORCID 0000-0002-7641-4854), University of Copenhagen, Rolighedsvej 26, 1958 Frederiksberg C, Copenhagen, Denmark, rb@food.ku.dk
and Paolo Bientinesi, Umeå Universitet, Plan 4, MIT-huset, Umeå, Sweden
Abstract.

The Canonical Polyadic (CP) tensor decomposition is frequently used as a model in applications in a variety of different fields. Using jackknife resampling to estimate parameter uncertainties is often desirable but results in an increase of the already high computational cost. Upon observation that the resampled tensors, though different, are nearly identical, we show that it is possible to extend the recently proposed Concurrent ALS (CALS) technique to a jackknife resampling scenario. This extension gives access to the computational efficiency advantage of CALS for the price of a modest increase (typically a few percent) in the number of floating point operations. Numerical experiments on both synthetic and real-world datasets demonstrate that the new workflow based on a CALS extension can be several times faster than a straightforward workflow where the jackknife submodels are processed individually.

jackknife, tensors, decomposition, CP, ALS
CCS Concepts: Mathematics of computing → Mathematical software performance; Software and its engineering → Software performance

1. Introduction

The CP model is used increasingly across a large diversity of fields. One of the fields in which CP is commonly applied is chemistry (Murphy et al., 2013; Wiberg and Jacobsson, 2004), where there is often a need for estimating not only the parameters of the model, but also the associated uncertainty of those parameters (Farrance and Frenkel, 2012). In fact, in some areas it is a dogma that an estimate without an uncertainty is not a result. A common approach for estimating uncertainties of the parameters of CP models is through resampling, such as bootstrapping or jackknifing (Riu and Bro, 2003; Kiers, 2004). The latter has added benefits, e.g., for variable selection (Martens and Martens, 2000) and outlier detection (Riu and Bro, 2003). Here we consider a new technique, JK-CALS, that increases the performance of jackknife resampling applied to CP by more efficiently utilizing the computer's memory hierarchy.

The basic concept of jackknife is somewhat related to cross-validation. Let $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$ be a tensor, and $\mathbf{U}_{1},\ldots,\mathbf{U}_{N}$ the factor matrices of a CP model. Let us also make the assumption (typical in many applications) that the first mode corresponds to independent samples, and all the other modes correspond to variables. For the most basic type of jackknifing, namely Leave-One-Out (LOO), one sample (out of $I_{1}$) is left out at a time (resulting in a tensor with only $I_{1}-1$ samples) and a model is fitted to the remaining data; we refer to that model as a submodel. (Henceforth, when we mention jackknifing, we imply LOO jackknifing unless otherwise stated.) All samples are left out exactly once, resulting in $I_{1}$ distinct submodels. Each submodel provides an estimate of the parameters of the overall model. For example, each submodel provides an estimate of the factor (or loading) matrix $\mathbf{U}_{2}$. From these $I_{1}$ estimates it is possible to calculate the variance (or bias) of the overall loading matrix (the one obtained from all samples). One complication comes from some indeterminacies of CP that need to be taken into account. For example, when one (or more) samples are removed from the initial tensor, the order of the components in the submodel may change; this phenomenon is explained and a solution is proposed in (Riu and Bro, 2003).

Recently, the authors proposed a technique, Concurrent ALS (CALS) (Psarras et al., 2020), that can fit multiple CP models to the same underlying tensor more rapidly than regular ALS. CALS achieves better performance not by altering the numerics but by utilizing the computer's memory hierarchy more efficiently than regular ALS. However, the CALS technique cannot be directly applied to jackknife resampling, since the $I_{1}$ submodels are fitted to different tensors. In this paper, we extend the idea that underpins CALS to jackknife resampling. The new technique takes advantage of the fact that the $I_{1}$ resampled tensors are nearly identical. At the price of a modest increase in arithmetic operations, the technique allows for more efficient fitting of the CP submodels and thus improved overall performance of a jackknife workflow. In applications in which the number of components in the CP model is relatively low, the technique can significantly reduce the overall time to solution.

Contributions

  • An efficient technique, JK-CALS, for performing jackknife resampling of CP models. The technique is based on an extension of CALS to nearly identical tensors. To the best of our knowledge, this is the first attempt at accelerating jackknife resampling of CP models.

  • Numerical experiments demonstrate that JK-CALS can lead to performance gains in a jackknife resampling workflow.

  • Theoretical analysis shows that the technique generalizes from leave-one-out to delete-$d$ jackknife with a modest (less than a factor of two) increase in arithmetic.

  • A C++ library with support for GPU acceleration and a Matlab interface.

Organization

The rest of the paper is organized as follows. In Section 2, we provide an overview of related research. In Section 3, we review the standard CP-ALS and CALS algorithms, as well as jackknife applied to CP. In Section 4, we describe the technique that enables us to use CALS to compute jackknife more efficiently. In Section 5, we demonstrate the efficiency of the proposed technique by using it to perform jackknife resampling on CP models fitted to both artificial and real tensors. Finally, in Section 6, we conclude the paper and provide insights for further research.

2. Related Work

Two popular techniques for uncertainty estimation for CP models are bootstrap and jackknife (Westad and Marini, 2015; Kiers, 2004; Riu and Bro, 2003). The main difference is that jackknife resamples without replacement whereas bootstrap resamples with replacement. Bootstrap frequently involves more submodels than jackknife and is therefore more expensive. The term jackknife typically refers to leave-one-out jackknife, where only one observation is removed when resampling. More than one observation can be removed at a time (Peddada, 1993); this variation is commonly called delete-$d$ jackknife. When applied to CP, jackknife has certain benefits over bootstrap, e.g., for variable selection (Martens and Martens, 2000) and outlier detection (Riu and Bro, 2003).

Jackknife requires fitting multiple submodels. A straightforward way of accelerating jackknife is to separately accelerate the fitting of each submodel, e.g., using a faster implementation. The simplest and most extensively used numerical method for fitting CP models is the Alternating Least Squares (CP-ALS) method. Alternative methods for fitting CP models include eigendecomposition-based methods (Sanchez and Kowalski, 1986) and gradient-based (all-at-once) optimization methods (Acar et al., 2011).

Several techniques have been proposed to accelerate CP-ALS. Line search (Rajih et al., 2008) and extrapolation (Ang et al., 2019) aim to reduce the number of iterations until convergence. Randomization-based techniques have also been proposed. These target very large tensors, and either randomly sample the tensor (Vervliet and De Lathauwer, 2016) or the Khatri-Rao product (Battaglino et al., 2018), to reduce their size and, by extension, the overall amount of computation. Similarly, compression-based techniques replace the target tensor with a compressed version, thus also reducing the amount of computation during fitting (Bro and Andersson, 1998). The CP model of the reduced tensor is inflated to correspond to a model of the original tensor.

Several projects offer high-performance implementations of CP-ALS, for example, Cyclops (Solomonik et al., 2013), PLANC (Kannan et al., 2016), Partensor (Lourakis and Liavas, 2018), SPLATT (Smith et al., 2015), and Genten (Phipps and Kolda, 2019). For a more comprehensive list of software implementing some variant of CP-ALS, refer to (Psarras et al., 2021).

Similar to the present work, there have been attempts at accelerating jackknife, although (to the best of our knowledge) not in the context of CP. In (Buzas, 1997), the high computational cost of jackknife is tackled by using a numerical approximation that requires fewer operations at the price of lower accuracy. In (Belotti and Peracchi, 2020), a general-purpose routine for fast jackknife estimation is presented. Some estimators (often linear ones) have leave-one-out formulas that allow for fast computation of the estimator after leaving one sample out. Jackknife is thus accelerated by computing the estimator on the full set and then systematically applying the leave-one-out formula. In (Hinkle and Stromberg, 1996), a similar technique is studied. Jackknife computes an estimator on $s$ distinct subsets of the $s$ samples. Any two of these subsets differ by only one sample, i.e., any one subset can be obtained from any other by replacing one and only one element. Some estimators have a fast updating formula, which can rapidly transform an estimator for one subset into an estimator for another subset. Jackknife is thus accelerated by computing the estimator from scratch on the first subset and then repeatedly updating the estimator using this fast updating formula.

3. CP-ALS, CALS and jackknife

In this section, we first specify the notation to be used throughout the paper, we then review the CP-ALS and CALS techniques, and finally we describe jackknife resampling applied to CP.

3.1. Notation

For vectors and matrices, we use bold lowercase and uppercase roman letters, respectively, e.g., $\mathbf{v}$ and $\mathbf{U}$. For tensors, we follow the notation in (Kolda and Bader, 2009); specifically, we use bold calligraphic fonts, e.g., $\boldsymbol{\mathcal{T}}$. The order (number of indices or modes) of a tensor is denoted by uppercase roman letters, e.g., $N$. For each mode $n\in\{1,2,\ldots,N\}$, a tensor $\boldsymbol{\mathcal{T}}$ can be unfolded (matricized) into a matrix, denoted by $\mathbf{T}_{(n)}$, whose columns are the mode-$n$ fibers of $\boldsymbol{\mathcal{T}}$, i.e., the vectors obtained by fixing all indices except the one of mode $n$. Sets are denoted by calligraphic fonts, e.g., $\mathcal{S}$. Given two matrices $\mathbf{A}$ and $\mathbf{B}$ with the same number of columns, the Khatri-Rao product, denoted by $\mathbf{A}\odot\mathbf{B}$, is the column-wise Kronecker product of $\mathbf{A}$ and $\mathbf{B}$. Finally, the unary operator $\oplus$, when applied to a matrix, denotes the scalar obtained by summing all matrix elements.
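To make the notation concrete, here is a minimal sketch in Python with NumPy (our own illustration, not part of the paper's software) of the mode-$n$ unfolding and the Khatri-Rao product; the helper names unfold and khatri_rao are ours.

    import numpy as np

    def unfold(T, n):
        # Mode-n unfolding: rows are indexed by mode n; columns are ordered with
        # the lowest remaining mode varying fastest, as in (Kolda and Bader, 2009).
        return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

    def khatri_rao(A, B):
        # Column-wise Kronecker product of A (I x R) and B (J x R) -> (I*J) x R.
        I, R = A.shape
        J, _ = B.shape
        return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

    rng = np.random.default_rng(0)
    T = rng.standard_normal((4, 5, 6))
    A, B = rng.standard_normal((4, 3)), rng.standard_normal((5, 3))

    print(unfold(T, 1).shape)   # (5, 24): the mode-1 fibers become columns
    # Each column of A (Khatri-Rao) B is the Kronecker product of matching columns.
    print(np.allclose(khatri_rao(A, B)[:, 2], np.kron(A[:, 2], B[:, 2])))   # True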

3.2. CP-ALS

The standard alternating least-squares method for CP is shown in Algorithm 1 (CP-ALS). The input consists of a target tensor $\boldsymbol{\mathcal{T}}$. The output consists of a CP model represented by a sequence of factor matrices $\mathbf{U}_{1},\ldots,\mathbf{U}_{N}$. The algorithm repeatedly updates the factor matrices one by one in sequence until either of the following criteria is met: a) the fit of the model to the target tensor falls below a certain tolerance threshold, or b) a maximum number of iterations has been reached. To update a specific factor matrix $\mathbf{U}_{n}$, the gradient of the least-squares objective function with respect to that factor matrix is set to zero and the resulting linear least-squares problem is solved directly via the normal equations. This entails computing the Matricized Tensor Times Khatri-Rao Product (MTTKRP) (line 4), which is the product of the mode-$n$ unfolding $\mathbf{T}_{(n)}$ and the Khatri-Rao Product (KRP) of all factor matrices except $\mathbf{U}_{n}$. The MTTKRP is followed by the Hadamard product of the Gramians $\mathbf{U}_{i}^{T}\mathbf{U}_{i}$ of the other factor matrices (line 5). Factor matrix $\mathbf{U}_{n}$ is then updated by solving the linear system in line 6. At the completion of an iteration, i.e., a full pass over all $N$ modes, the error between the model and the target tensor is computed (line 8) using the efficient formula derived in (Phan et al., 2013).

Input: $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$, the target tensor
Output: $\mathbf{U}_{1},\ldots,\mathbf{U}_{N}$, the fitted factor matrices
1  Initialize the factor matrices $\mathbf{U}_{1},\ldots,\mathbf{U}_{N}$
2  repeat
3    for $n=1,2,\ldots,N$ do
4      $\mathbf{M}_{n}\leftarrow\mathbf{T}_{(n)}(\odot_{i\neq n}\mathbf{U}_{i})$    // MTTKRP
5      $\mathbf{H}_{n}\leftarrow\ast_{i\neq n}(\mathbf{U}_{i}^{T}\mathbf{U}_{i})$    // Hadamard product of Gramians
6      $\mathbf{U}_{n}\leftarrow\mathbf{M}_{n}\mathbf{H}_{n}^{\dagger}$    // $\mathbf{H}_{n}^{\dagger}$: pseudoinverse of $\mathbf{H}_{n}$
7    end for
8    $e\leftarrow\|\boldsymbol{\mathcal{T}}\|^{2}-(\oplus(\mathbf{H}_{N}\ast(\mathbf{U}_{N}^{T}\mathbf{U}_{N})))-2(\oplus(\mathbf{U}_{N}\ast\mathbf{M}_{N}))$    // Error calculation
9  until convergence detected or maximum number of iterations reached
Algorithm 1 CP-ALS: Alternating least squares method for CP decomposition.
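For concreteness, the following NumPy sketch (ours, not the paper's C++ implementation) carries out the updates of lines 4 to 6 of Algorithm 1 on a small random tensor; variable and helper names are hypothetical.

    import numpy as np
    from functools import reduce

    def unfold(T, n):
        return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

    def khatri_rao(A, B):
        I, R = A.shape
        J, _ = B.shape
        return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

    def update_mode(T, U, n):
        # Khatri-Rao product of all factors except U_n, in the order U_N, ..., U_1,
        # so that it matches the column ordering of unfold(T, n).
        K = reduce(khatri_rao, [U[i] for i in reversed(range(len(U))) if i != n])
        M = unfold(T, n) @ K                                           # MTTKRP (line 4)
        H = reduce(np.multiply,
                   [U[i].T @ U[i] for i in range(len(U)) if i != n])   # Gramians (line 5)
        U[n] = M @ np.linalg.pinv(H)                                   # update (line 6)

    rng = np.random.default_rng(1)
    shape, R = (10, 11, 12), 4
    T = rng.standard_normal(shape)
    U = [rng.standard_normal((I, R)) for I in shape]
    for sweep in range(20):          # a fixed number of sweeps, for illustration only
        for n in range(len(shape)):
            update_mode(T, U, n)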

Assuming a small number of components $R$, the most expensive step is the MTTKRP (line 4 of Algorithm 1). This step involves $2R\prod_{i}I_{i}$ FLOPs (ignoring, for the sake of simplicity, the lower-order number of FLOPs required for the computation of the KRP). The operation touches slightly more than $\prod_{i}I_{i}$ memory locations, resulting in an arithmetic intensity of less than $2R$ FLOPs per memory reference. Thus, unless $R$ is sufficiently large, the speed of the computation will be limited by the memory bandwidth rather than the speed of the processor. The CP-ALS algorithm is inherently memory-bound for small $R$, regardless of how it is implemented.
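As a rough, back-of-the-envelope illustration (our own estimate, assuming the tensor is read once and ignoring the much smaller factor matrices and result), the arithmetic intensity for the tensor of Fig. 1 can be computed as follows:

    # Rough arithmetic intensity of MTTKRP for a 50 x 200 x 200 tensor.
    dims = (50, 200, 200)
    elements = 1
    for d in dims:
        elements *= d                                # prod(I_i) = 2,000,000

    for R in (5, 20, 60, 300):
        flops = 2 * R * elements                     # 2 R prod(I_i)
        bytes_moved = 8 * elements                   # double precision, tensor read once
        print(R, flops / bytes_moved, "FLOPs/byte")  # = R/4, grows linearly with R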

[Plot: efficiency of MTTKRP versus number of components (1 to 1000), with one curve for 1 thread and one for 24 threads.]
Figure 1. Efficiency of MTTKRP on a $50\times 200\times 200$ tensor for an increasing number of components. Note that in the multi-threaded execution, the theoretical peak performance increases while the total number of operations to be performed stays the same (as in the single-threaded case); this explains the drop in efficiency per thread.

The impact on performance of the memory-bound nature of MTTKRP is demonstrated in Fig. 1, which shows the computational efficiency of a particular implementation of MTTKRP as a function of the number of components (for a tensor of size $50\times 200\times 200$). Efficiency is defined as the ratio of the performance achieved by MTTKRP (in FLOPs/sec) to the Theoretical Peak Performance (TPP, see below) of the machine, i.e.,

$\textsc{efficiency}=\frac{\textsc{performance}}{\textsc{tpp}}=\frac{\#\textsc{flops}/\textsc{time}}{\textsc{tpp}}.$

The TPP of a machine is defined as the maximum number of (double precision) floating point operations the machine can perform in one second. Table 1 shows the TPP for our particular machine (see Sec. 5 for details).

System | TPP (GFLOPs/sec) | Threads | Frequency per core (GHz)
CPU    | 112              | 1       | 3.5
CPU    | 1536             | 24      | 2.0

Table 1. Theoretical Peak Performance (TPP) for a particular machine. Due to the decrease in the peak frequency per core when all 24 cores are used, the TPP for 24 cores is less than $24\times$ the TPP for 1 core.

In Fig. 1, we see that the efficiency of MTTKRP tends to increase with the number of components $R$, until eventually reaching a plateau. On this machine, the plateau starts at $R\geq 60$, at $\approx 70\%$ efficiency, for one thread, and at $R\geq 300$, at $\approx 35\%$ efficiency, for 24 threads. For $R\leq 20$, which is common in applications, the achieved performance is well below the TPP.

3.3. Concurrent ALS (CALS)

When fitting multiple CP models to the same underlying tensor, the Concurrent ALS (CALS) technique can improve the efficiency if the number of components is not large enough for CP-ALS to reach its performance plateau (Psarras et al., 2020). A need to fit multiple models to the same tensor arises, for example, when trying different initial guesses or when trying different numbers of components.

The gist of CALS can be summarized as follows (see (Psarras et al., 2020) for details). Suppose $K$ independent instances of CP-ALS have to be executed on the same underlying tensor. Rather than running them sequentially or in parallel, run them in a lock-step fashion: advance every CP-ALS process by one iteration before proceeding to the next iteration. One CALS iteration thus entails $K$ CP-ALS iterations (one iteration per model). Each CP-ALS iteration in turn contains one MTTKRP operation, so one CALS iteration also entails $K$ MTTKRP operations. But these MTTKRPs all involve the same tensor and can therefore be fused into one bigger MTTKRP operation (see Eq. 3 of (Psarras et al., 2020)). The performance of the fused MTTKRP depends on the total number of components, i.e., $\sum_{i=1}^{K}R_{i}$, where $R_{i}$ is the number of components in model $i$. Due to the performance profile of MTTKRP (see Fig. 1), the fused MTTKRP is expected to be more efficient than each of the $K$ smaller operations it replaces.

The following example illustrates the impact of MTTKRP fusion on efficiency. Given a target tensor of size $50\times 200\times 200$, $K=50$ models to fit, and $R_{i}=5$ components in each model, the efficiency of each of the $K$ MTTKRPs in CP-ALS is about $15\%$ ($3\%$) for 1 (24) threads (see Fig. 1). The efficiency of the fused MTTKRP in CALS will be that observed for $R=\sum_{i=1}^{K}R_{i}=250$, i.e., $60\%$ ($30\%$) for 1 (24) threads. Since the MTTKRP operation dominates the cost, CALS is expected to be $\approx 4\times$ ($\approx 10\times$) faster than CP-ALS for 1 (24) threads.
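The fusion itself can be sketched in a few lines of NumPy (our own illustration, with hypothetical helper names; not the library's implementation): concatenating the factor matrices of the $K$ models column-wise and performing a single MTTKRP yields, in the corresponding blocks of columns, exactly the $K$ individual MTTKRPs.

    import numpy as np

    def unfold(T, n):
        return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

    def khatri_rao(A, B):
        I, R = A.shape
        J, _ = B.shape
        return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

    rng = np.random.default_rng(2)
    dims, n, ranks = (6, 7, 8), 0, [2, 3, 5]                 # K = 3 models
    T = rng.standard_normal(dims)
    models = [[rng.standard_normal((d, R)) for d in dims] for R in ranks]

    # K separate MTTKRPs for mode n = 0, one per model (as in plain CP-ALS).
    separate = [unfold(T, n) @ khatri_rao(U[2], U[1]) for U in models]

    # One fused MTTKRP on the column-wise concatenated ("multi") factor matrices.
    multi = [np.concatenate([U[m] for U in models], axis=1) for m in range(len(dims))]
    fused = unfold(T, n) @ khatri_rao(multi[2], multi[1])

    # Each model's MTTKRP sits in its own block of columns of the fused result.
    offsets = np.cumsum([0] + ranks)
    print(all(np.allclose(fused[:, offsets[k]:offsets[k + 1]], separate[k])
              for k in range(len(ranks))))                   # True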

3.4. Jackknife

Algorithm 2 shows a baseline (inefficient) application of leave-one-out jackknife resampling to a CP model. For details, see (Riu and Bro, 2003). The inputs are a target tensor $\boldsymbol{\mathcal{T}}$, an overall CP model $P$ fitted to all of $\boldsymbol{\mathcal{T}}$, and a sampled mode $\hat{n}\in\{1,2,\ldots,N\}$. For each sample $p\in\{1,2,\ldots,I_{\hat{n}}\}$, the algorithm removes the slice corresponding to the sample from tensor $\boldsymbol{\mathcal{T}}$ (line 3) and from model $P$ (line 4) and fits a reduced submodel $\hat{P}_{-p}$ (lines 5 and 6) to the reduced tensor $\boldsymbol{\mathcal{T}}_{-p}$ using regular CP-ALS. After fitting all submodels, the standard deviation of every factor matrix except $\mathbf{U}_{\hat{n}}$ is computed from the $I_{\hat{n}}$ submodels in $\mathcal{Q}$ (line 10). The only costly part of Algorithm 2 is the repeated calls to CP-ALS in line 5.

Input: $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$, the target tensor
       $P=\mathbf{U}_{1},\ldots,\mathbf{U}_{N}$, a CP model fitted to $\boldsymbol{\mathcal{T}}$
       $\hat{n}$, the sampled mode
Output: $\mathbf{S}_{1},\ldots,\mathbf{S}_{N}$, the uncertainty of each element of each factor matrix of $P$
1   $\mathcal{Q}\leftarrow\emptyset$    // Set containing the fitted jackknife models
2   for $p\in\{1,2,\ldots,I_{\hat{n}}\}$ do    // For every index $p$ in mode $\hat{n}$
3     $\boldsymbol{\mathcal{T}}_{-p}\leftarrow$ remove the slice with index $p$ in mode $\hat{n}$ from tensor $\boldsymbol{\mathcal{T}}$
4     $P_{-p}\leftarrow$ remove row $p$ from factor matrix $\mathbf{U}_{\hat{n}}$ of $P$
5     $\hat{P}_{-p}\leftarrow$ cp_als($\boldsymbol{\mathcal{T}}_{-p}$, $P_{-p}$)
6     $\hat{P}_{-p}\leftarrow$ permutation and scale adjustment of $\hat{P}_{-p}$
7     $\mathcal{Q}\leftarrow\mathcal{Q}\cup\{\hat{P}_{-p}\}$
8   end for
9   for $n\in\{1,2,\ldots,N\}\setminus\{\hat{n}\}$ do    // For every mode $n$ except $\hat{n}$
10    $\mathbf{S}_{n}\leftarrow$ standard deviation of factor matrix $\mathbf{U}_{n}$ over the models in $\mathcal{Q}$
11  end for
Algorithm 2 JK-ALS: An algorithm that performs (LOO) jackknife resampling on a CP model.

4. Accelerating jackknife by using CALS

The straightforward application of jackknife to CP in Algorithm 2 involves $I_{\hat{n}}$ independent calls to CP-ALS on nearly the same tensor. Since the tensors are not exactly the same, CALS (Psarras et al., 2020) cannot be directly applied. In this section, we show how one can rewrite Algorithm 2 in such a way that CALS can be applied. There is an associated overhead due to extra computation, but we will show that the overhead is modest (less than a 100% increase in the worst case, and typically only a few percent).

4.1. JK-CALS: Jackknife extension of CALS

Let $\boldsymbol{\mathcal{T}}$ be an $N$-mode tensor with a corresponding CP model $\mathbf{A}_{1},\ldots,\mathbf{A}_{N}$. Let $\hat{\boldsymbol{\mathcal{T}}}$ be identical to $\boldsymbol{\mathcal{T}}$ except for one sample (with index $p$) removed from the sampled mode $\hat{n}\in\{1,2,\ldots,N\}$. Let $\hat{\mathbf{A}}_{1},\ldots,\hat{\mathbf{A}}_{N}$ be the CP submodel corresponding to the resampled tensor $\hat{\boldsymbol{\mathcal{T}}}$.

When fitting a CP model to $\boldsymbol{\mathcal{T}}$ using CP-ALS, the MTTKRP for mode $n$ is given by

(1)  $\mathbf{M}_{n}\leftarrow\mathbf{T}_{(n)}(\mathbf{A}_{N}\odot\dots\odot\mathbf{A}_{n+1}\odot\mathbf{A}_{n-1}\odot\dots\odot\mathbf{A}_{1}).$

Similarly, when fitting a model to $\hat{\boldsymbol{\mathcal{T}}}$, the MTTKRP for mode $n$ is given by

(2)  $\hat{\mathbf{M}}_{n}\leftarrow\hat{\mathbf{T}}_{(n)}(\hat{\mathbf{A}}_{N}\odot\dots\odot\hat{\mathbf{A}}_{n+1}\odot\hat{\mathbf{A}}_{n-1}\odot\dots\odot\hat{\mathbf{A}}_{1}).$

Can $\hat{\mathbf{M}}_{n}$ be computed from $\mathbf{T}_{(n)}$ instead of $\hat{\mathbf{T}}_{(n)}$? As we will see, the answer is yes. We distinguish two cases: $n=\hat{n}$ and $n\neq\hat{n}$.

Case I: $n=\hat{n}$. The slice of $\boldsymbol{\mathcal{T}}$ removed when resampling corresponds to a row of the unfolding $\mathbf{T}_{(n)}=\mathbf{T}_{(\hat{n})}$. To see this, note that element $\boldsymbol{\mathcal{T}}(i_{1},i_{2},\dots,i_{N})$ corresponds to element $\mathbf{T}_{(n)}(i_{n},j)$ of its mode-$n$ unfolding (Kolda and Bader, 2009), where

(3)  $j=1+\sum_{k=1,\,k\neq n}^{N}(i_{k}-1)\prod_{m=1,\,m\neq n}^{k-1}I_{m}.$

When we remove sample $p$, $\hat{\mathbf{T}}_{(n)}$ is identical to $\mathbf{T}_{(n)}$ except that row $p$ of the latter is missing from the former. In other words, $\hat{\mathbf{T}}_{(n)}=\mathbf{E}_{p}\mathbf{T}_{(n)}$, where $\mathbf{E}_{p}$ is the matrix that removes row $p$. We can therefore compute $\hat{\mathbf{M}}_{n}$ by replacing $\hat{\mathbf{T}}_{(n)}$ with $\mathbf{T}_{(n)}$ in (2) and then discarding row $p$ from the result:

$\hat{\mathbf{M}}_{n}\leftarrow\mathbf{E}_{p}(\mathbf{T}_{(n)}(\hat{\mathbf{A}}_{N}\odot\dots\odot\hat{\mathbf{A}}_{n+1}\odot\hat{\mathbf{A}}_{n-1}\odot\dots\odot\hat{\mathbf{A}}_{1})).$

Case II: $n\neq\hat{n}$. The slice of $\boldsymbol{\mathcal{T}}$ removed when resampling corresponds to a set of columns of the unfolding $\mathbf{T}_{(n)}$. One could in principle remove these columns to obtain $\hat{\mathbf{T}}_{(n)}$. But instead of explicitly removing sample $p$ from $\boldsymbol{\mathcal{T}}$, we can simply zero out the corresponding slice of $\boldsymbol{\mathcal{T}}$. To give the CP model matching dimensions, we need only insert a row of zeros at index $p$ in factor matrix $\hat{n}$. Crucially, even the zeroing out of slice $p$ is superfluous: in the MTTKRP, the elements that should have been zeroed out are multiplied with the zeros of the Khatri-Rao product generated by the row of zeros inserted in factor matrix $\hat{n}$. Thus, to compute $\hat{\mathbf{M}}_{n}$ in (2) we (a) replace $\hat{\mathbf{T}}_{(n)}$ with $\mathbf{T}_{(n)}$ and (b) insert a row of zeros at index $p$ in factor matrix $\hat{\mathbf{A}}_{\hat{n}}$.

In summary, we have shown that it is possible to compute $\hat{\mathbf{M}}_{n}$ in (2) without referencing the reduced tensor. There is an overhead associated with the extra arithmetic: for the case $n=\hat{n}$, we compute numbers that are later discarded; for the case $n\neq\hat{n}$, we perform some arithmetic with zeros.
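The two cases can also be checked numerically on a small example. The following NumPy sketch (our own illustration; the helper names unfold, khatri_rao, and mttkrp are not from the paper's library) compares the MTTKRP of the reduced tensor with the padded computation on the full tensor described above.

    import numpy as np
    from functools import reduce

    def unfold(T, n):
        return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

    def khatri_rao(A, B):
        I, R = A.shape
        J, _ = B.shape
        return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

    def mttkrp(T, U, n):
        others = [U[i] for i in reversed(range(len(U))) if i != n]
        return unfold(T, n) @ reduce(khatri_rao, others)

    rng = np.random.default_rng(3)
    dims, R, nhat, p = (5, 6, 7), 3, 0, 2            # remove sample p = 2 in mode 0
    T = rng.standard_normal(dims)
    T_red = np.delete(T, p, axis=nhat)               # the resampled tensor
    U_red = [rng.standard_normal((d, R)) for d in T_red.shape]   # a submodel

    # Zero-padded submodel matching the dimensions of the full tensor.
    U_pad = [u.copy() for u in U_red]
    U_pad[nhat] = np.insert(U_red[nhat], p, 0.0, axis=0)

    # Case I (n = nhat): compute with the full tensor, then discard row p.
    M_hat = np.delete(mttkrp(T, U_pad, nhat), p, axis=0)
    print(np.allclose(M_hat, mttkrp(T_red, U_red, nhat)))        # True

    # Case II (n != nhat): the zero row annihilates the removed slice automatically.
    print(np.allclose(mttkrp(T, U_pad, 1), mttkrp(T_red, U_red, 1)))   # True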

Based on the observations above, the CALS algorithm (Psarras et al., 2020) can be modified to facilitate the concurrent fitting of all jackknife submodels. Algorithm 3 incorporates the necessary changes. In the end, extending CALS to support jackknife comes down to the following localized changes (cf. Algorithm 3):

  • Insert a row of zeros in one of the factor matrices,

  • periodically zero out the padded row to keep it zero, and

  • adjust the error formula to compute the submodel error.

We remark that JK-CALS can be straightforwardly extended to delete-$d$ jackknife: instead of padding and periodically zeroing out one row, we pad and periodically zero out $d$ rows.

Input: $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{N}}$, the target tensor
       $\hat{n}$, the sampled mode
Output: $\mathbf{U}_{1}^{(p)},\ldots,\mathbf{U}_{N}^{(p)}$ for $p=1,2,\ldots,I_{\hat{n}}$, the fitted submodels
1   Initialize the submodels
2   for $n=1,2,\ldots,N$ do    // Initialize one factor multi-matrix for each mode
3     for $p=1,2,\ldots,I_{\hat{n}}$ do
4       if $n=\hat{n}$ then
5         $\overline{\mathbf{U}}_{|p}^{(n)}\leftarrow\mathbf{U}_{n}^{(p)}$ with a row of zeros inserted at index $p$
6       else
7         $\overline{\mathbf{U}}_{|p}^{(n)}\leftarrow\mathbf{U}_{n}^{(p)}$
8       end if
9     end for
10  end for
11  repeat    // Concurrently run $I_{\hat{n}}$ instances of CP-ALS
12    for $n=1,2,\ldots,N$ do
13      $\overline{\mathbf{M}}^{(n)}\leftarrow\mathbf{T}_{(n)}(\odot_{i\neq n}\overline{\mathbf{U}}^{(i)})$
14      for $p=1,2,\ldots,I_{\hat{n}}$ do
15        $\mathbf{H}_{p}^{(n)}\leftarrow\ast_{i\neq n}(\overline{\mathbf{U}}_{|p}^{(i)T}\overline{\mathbf{U}}_{|p}^{(i)})$
16        $\overline{\mathbf{U}}_{|p}^{(n)}\leftarrow\overline{\mathbf{M}}_{|p}^{(n)}\mathbf{H}_{p}^{(n)\dagger}$
17        if $n=\hat{n}$ then
18          zero out row $p$ of $\overline{\mathbf{U}}_{|p}^{(n)}$
19        end if
20      end for
21    end for
22    for $p=1,2,\ldots,I_{\hat{n}}$ do
23      $e\leftarrow\|\boldsymbol{\mathcal{T}}_{-p}\|^{2}-(\oplus(\mathbf{H}_{p}^{(N)}\ast(\mathbf{U}_{|p}^{(N)T}\mathbf{U}_{|p}^{(N)})))-2(\oplus(\mathbf{U}_{|p}^{(N)}\ast\mathbf{M}_{|p}^{(N)}))$    // Error calculation
24    end for
25  until convergence detected for all instances or maximum number of iterations reached
Algorithm 3 JK-CALS: Concurrent alternating least squares method for jackknife estimation.

4.2. Performance considerations

While Algorithm 3 benefits from improved MTTKRP efficiency, the padding results in extra arithmetic operations. Let $d$ denote the number of removed samples ($d=1$ corresponds to leave-one-out). For the sake of simplicity, assume that the integer $d$ divides $I_{\hat{n}}$. There are $I_{\hat{n}}/d$ submodels, each with $R$ components. The only costly part is the MTTKRP.

The MTTKRPs in JK-ALS (for all submodels combined) require

$\left(\frac{I_{\hat{n}}}{d}\right)\left(2R(I_{\hat{n}}-d)\prod_{i=1,\,i\neq\hat{n}}^{N}I_{i}\right)$

FLOPs. Meanwhile, the fused MTTKRP in JK-CALS requires

$2\left(\frac{I_{\hat{n}}}{d}R\right)\prod_{i=1}^{N}I_{i}$

FLOPs. The ratio of the latter to the former comes down to

$\frac{I_{\hat{n}}}{I_{\hat{n}}-d}\leq 2,$

since $d\leq I_{\hat{n}}/2$ in delete-$d$ jackknife. Thus, in the worst case, JK-CALS requires less than twice the FLOPs of JK-ALS; more typically, the overhead is negligible.
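As a quick numeric check of this bound (our own script, using the two FLOP expressions above; the choice $R=5$ is arbitrary):

    # FLOP counts of the two approaches, for the sampled mode nhat = 0.
    def mttkrp_flops_jk_als(dims, nhat, R, d):
        rest = 1
        for i, size in enumerate(dims):
            if i != nhat:
                rest *= size
        return (dims[nhat] // d) * 2 * R * (dims[nhat] - d) * rest

    def mttkrp_flops_jk_cals(dims, nhat, R, d):
        full = 1
        for size in dims:
            full *= size
        return (dims[nhat] // d) * 2 * R * full

    for dims, d in [((50, 200, 200), 1), ((50, 200, 200), 10), ((89, 97, 549), 1)]:
        ratio = mttkrp_flops_jk_cals(dims, 0, 5, d) / mttkrp_flops_jk_als(dims, 0, 5, d)
        print(dims, "d =", d, "overhead ratio =", round(ratio, 3))
        # 1.02 for d = 1 with 50 samples, 1.25 for d = 10, 1.011 for 89 samples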

5. Experiments

We investigate the performance benefits of using the JK-CALS algorithm for jackknife resampling of CP models through two sets of experiments. In the first set of experiments, we focus on the scalability of the algorithm with respect to both problem size and number of processor cores. For this purpose, we use synthetic datasets of increasing volume, mimicking the shape of real datasets. In the second set of experiments, we illustrate JK-CALS's practical impact by using it to perform jackknife resampling on two tensors arising in actual applications.

All experiments were conducted on a Linux-based system with an Intel® Xeon® Platinum 8160 processor (Turbo Boost enabled, Hyper-Threading disabled), which contains 24 physical cores split into 2 NUMA regions of 12 cores each. The system also contains an Nvidia Tesla V100 GPU (driver version 470.57.02, CUDA version 11.2). The experiments were conducted in double precision arithmetic, and we report results for 1 thread, 24 threads (two NUMA regions), and the GPU (with 24 CPU threads). The source code (available online at https://github.com/HPAC/CP-CALS/tree/jackknife) was compiled with GCC (version 9) and linked to the Intel® Math Kernel Library (MKL version 19.0).

5.1. Scalability analysis

In this first experiment, we use three synthetic tensors of size $50\times m\times m$ with $m\in\{100,200,400\}$, referred to as the "small", "medium" and "large" tensors, respectively. The samples are in the first mode; the other modes contain variables. The number of samples is kept low, since leave-one-out jackknife is usually performed on a small number of samples (usually $<100$), while there can be arbitrarily many variables.

For each tensor, we perform jackknife on four models with varying numbers of components ($R\in\{3,5,7,9\}$). This range of component counts is typical in applications. In practice, it is often the case that multiple models are fitted to the target tensor, and many of those models are then further analyzed using jackknife. For this reason, we perform jackknife on each model individually, as well as on all models simultaneously (denoted by "All" in the figures), to better simulate multiple real-world application scenarios. In this experiment, the termination criteria based on the maximum number of iterations and the tolerance are ignored; instead, all models are forced to go through exactly 100 iterations, which is a small number of iterations for small values of the tolerance (i.e., most models would require more than 100 iterations). The reason for this choice is that we aim to isolate the performance difference between the methods tested; therefore, we maintain a consistent workload throughout the experiment. (The tolerance and the maximum number of iterations are instead used later, in the application experiments.)

For comparison, we perform jackknife using three methods: JK-ALS, JK-OALS and JK-CALS. JK-OALS uses OpenMP to exploit the inherent parallelism of fitting multiple submodels independently. This method is only used in the multi-threaded and GPU experiments, and we focus only on its performance, ignoring its associated memory overhead.

[Bar charts: execution time in seconds of JK-ALS and JK-CALS on the 50×100×100, 50×200×200, and 50×400×400 tensors, single-threaded execution.]

Figure 2. Execution time for single-threaded jackknife resampling applied to three different tensors (small, medium, and large in the top, middle, and bottom panels, respectively), and different numbers of components (from left to right: $R\in\{3,5,7,9\}$, and "All", which represents performing jackknife on all four models simultaneously).

Fig. 2 shows results for single-threaded execution; in this case, JK-OALS is absent. JK-CALS consistently outperforms JK-ALS for all tensor sizes and workloads. Specifically, for any fixed workload, i.e., a model with a specific number of components, JK-CALS exhibits increasing speedups over JK-ALS as the tensor size increases. For example, for a model with 5 components, JK-CALS is $2.9$, $3$, and $5.2$ times faster than JK-ALS for the small, medium, and large tensors, respectively.

[Bar charts: execution time in seconds of JK-ALS, JK-OALS, and JK-CALS on the 50×100×100, 50×200×200, and 50×400×400 tensors, multi-threaded execution (24 threads).]

Figure 3. Execution time for multi-threaded (24 threads) jackknife resampling applied to three different tensors (small, medium, and large in the top, middle, and bottom panels, respectively), and different numbers of components (from left to right: $R\in\{3,5,7,9\}$, and "All", which represents performing jackknife on all four models simultaneously).

Fig. 3 shows results for multi-threaded execution using 24 threads. In this case, JK-CALS outperforms the other two implementations (JK-ALS and JK-OALS) for the medium and large tensors and all workloads (numbers of components), exhibiting speedups of up to $35\times$ and $8\times$ over JK-ALS and JK-OALS, respectively. For the small tensor ($50\times 100\times 100$) and small workloads ($R\leq 7$), JK-CALS is outperformed by JK-OALS; for $R=3$, it is also outperformed by JK-ALS. Investigating this further, for the small tensor and $R=3$ and $5$, the parallel speedup (the ratio between single-threaded and multi-threaded execution time) of JK-CALS is $0.3\times$ and $0.7\times$ for 24 threads. However, for $12$ threads, the corresponding timings are $0.28$ and $0.27$ seconds, resulting in speedups of $2.7\times$ and $3.7\times$, respectively. This points to two main reasons for the observed performance of JK-CALS in these cases: a) the amount of available computational resources (24 threads) is disproportionately high compared to the volume of computation to be performed, and b) because of the small amount of overall computation, the small overhead associated with the CALS methodology becomes more significant.

That being said, even for the small tensor, as the workload increases (already for a single model with 9 components), JK-CALS again becomes the fastest method. Finally, similarly to the single-threaded case, as the size of the tensor increases, so do the speedups achieved by JK-CALS over the other two methods.

[Bar charts: execution time in seconds of JK-ALS, JK-OALS, and JK-CALS on the 50×100×100, 50×200×200, and 50×400×400 tensors, with the MTTKRP offloaded to the GPU.]

Figure 4. Execution time for GPU + multi-threaded (GPU + 24 threads) jackknife resampling applied to three different tensors (small, medium, and large in the top, middle, and bottom panels, respectively), and different numbers of components (from left to right: $R\in\{3,5,7,9\}$, and "All", which represents performing jackknife on all four models simultaneously).

Fig. 4 shows results when the GPU is used to perform the MTTKRP for all three methods; in this case, all 24 threads are also used on the CPU. For the small tensor and small workloads ($R\leq 5$), there is not enough computation to warrant shipping data to and from the GPU, resulting in higher execution times compared to multi-threaded execution; in all other cases, all methods have lower execution times when using the GPU than when executing on 24 threads. Furthermore, in those cases, JK-CALS is consistently faster than its counterparts, exhibiting the largest speedups when the workload is at its highest ("All"), with values of $10\times$, $7\times$, and $7\times$ compared to JK-OALS, and $12\times$, $10\times$, and $9\times$ compared to JK-ALS, for the small, medium, and large tensors, respectively.

5.2. Real-world applications

[Bar charts: execution time in seconds of JK-ALS, JK-OALS, and JK-CALS on the 89×97×549 and 44×2700×200 tensors, for 1 thread, 24 threads, and GPU system configurations.]

Figure 5. Execution time for jackknife resampling applied to two application tensors. For the tensor of size $89\times 97\times 549$, whose expected rank is 5, three models with $R\in\{4,5,6\}$ were fitted and then jackknife was applied to them (i.e., the "All" group from the previous section). Similarly, for the tensor of size $44\times 2700\times 200$, whose expected rank is 20, three models with $R\in\{19,20,21\}$ were fitted and then jackknifed. In both cases, the tolerance and the maximum number of iterations were set to $10^{-6}$ and $1000$, respectively.

In this second experiment, we selected two tensors, of size $89\times 97\times 549$ and $44\times 2700\times 200$, from the field of chemometrics (Acar et al., 2008; Skov et al., 2008). In this field it is common to fit multiple, randomly initialized models for a range of low component counts (e.g., $R\in\{1,2,\ldots,20\}$, with 10 to 20 models for each $R$), and to then analyze (e.g., using jackknife) those models that might be of particular interest (often those with a number of components close to the expected rank of the target tensor); for the tensors we consider, the expected ranks are $5$ and $20$, respectively. To mimic the typical workflow of practitioners, we fitted three models to each tensor, with $R\in\{4,5,6\}$ and $R\in\{19,20,21\}$ components, respectively, and used the three methods (JK-ALS, JK-OALS and JK-CALS) to apply jackknife resampling to the fitted models. The tolerance and the maximum number of iterations were set to values typical for the field, namely $10^{-6}$ and $1000$, respectively.

In Fig. 5 we report the execution time for 1 thread, 24 threads, and GPU + 24 threads. For both datasets and all configurations, JK-CALS is faster than the other two methods. Specifically, when compared to JK-ALS over the two tensors, JK-CALS achieves speedups of $5.4\times$ and $2\times$ for single-threaded execution, and $10\times$ and $2.8\times$ for 24-threaded execution. Similarly, when compared to JK-OALS, JK-CALS achieves speedups of $2.7\times$ and $4.8\times$ for 24-threaded execution. Finally, JK-CALS takes the most advantage of the GPU, exhibiting speedups of $17.5\times$ and $3.7\times$ over JK-ALS, and $9\times$ and $2\times$ over JK-OALS, for GPU execution.

6. Conclusion

Jackknife resampling of CP models is useful for estimating uncertainties, but the computation requires fitting multiple submodels and is therefore computationally expensive. We presented a new technique for implementing jackknife that better utilizes the computer's memory hierarchy. The technique is based on a novel extension of the Concurrent ALS (CALS) algorithm for fitting multiple CP models to the same underlying tensor, first introduced in (Psarras et al., 2020). The new technique has a modest arithmetic overhead that is bounded above by a factor of two in the worst case. Numerical experiments on both synthetic and real-world datasets using a multicore processor paired with a GPU demonstrated that the proposed algorithm can be several times faster than a straightforward implementation of jackknife resampling based on multiple calls to a regular CP-ALS implementation.

Future work includes extending the software to support delete-$d$ jackknife.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions

CP drafted the main manuscript text, developed the source code, performed the experiments and prepared all figures. LK and PB revised the main manuscript text. CP, LK, and PB discussed and formulated the jackknife extension of CALS. CP, LK, RB, and PB discussed and formulated the experiments. LK, RB, and CP discussed the related work section. PB oversaw the entire process. All authors reviewed and approved the final version of the manuscript.

Data Availability Statement

The source code and datasets used in this study are available online: https://github.com/HPAC/CP-CALS/tree/jackknife.

Acknowledgements.
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 333849990/GRK2379 (IRTG Modern Inverse Problems).

References

  • Acar et al. (2008) Evrim Acar, Rasmus Bro, and Bonnie Schmidt. 2008. New exploratory clustering tool. Journal of Chemometrics 22, 1 (2008), 91–100. https://doi.org/10.1002/cem.1106 arXiv:https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/pdf/10.1002/cem.1106
  • Acar et al. (2011) E. Acar, D.M. Dunlavy, and T.G. Kolda. 2011. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics 25, 2 (2011), 67–86. https://doi.org/10.1002/cem.1335
  • Ang et al. (2019) A.M.S. Ang, J.E. Cohen, L.T.K. Hien, and N. Gillis. 2019. Extrapolated alternating algorithms for approximate canonical polyadic decomposition. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3147–3151. https://doi.org/10.1109/ICASSP40776.2020.9053849
  • Battaglino et al. (2018) C. Battaglino, G. Ballard, and T.G. Kolda. 2018. A practical randomized CP tensor decomposition. SIAM J. Matrix Anal. Appl. 39, 2 (2018), 876–901. https://doi.org/10.1137/17M1112303
  • Belotti and Peracchi (2020) Federico Belotti and Franco Peracchi. 2020. Fast leave-one-out methods for inference, model selection, and diagnostic checking. The Stata Journal 20, 4 (2020), 785–804. https://doi.org/10.1177/1536867X20976312 arXiv:https://doi.org/10.1177/1536867X20976312
  • Bro and Andersson (1998) R. Bro and C.A. Andersson. 1998. Improving the speed of multiway algorithms: Part II: Compression. Chemometrics and Intelligent Laboratory Systems 42, 1–2 (1998), 105–113. https://doi.org/10.1016/S0169-7439(98)00011-2
  • Buzas (1997) J. S. Buzas. 1997. Fast Estimators of the Jackknife. The American Statistician 51, 3 (1997), 235–240. https://doi.org/10.1080/00031305.1997.10473969 arXiv:https://www.tandfonline.com/doi/pdf/10.1080/00031305.1997.10473969
  • Farrance and Frenkel (2012) Ian Farrance and Robert Frenkel. 2012. Uncertainty of measurement: a review of the rules for calculating uncertainty components through functional relationships. The Clinical Biochemist Reviews 33, 2 (2012), 49.
  • Hinkle and Stromberg (1996) John E. Hinkle and Arnold J. Stromberg. 1996. Efficient computation of statistical procedures based on all subsets of a specified size. Communications in Statistics - Theory and Methods 25, 3 (1996), 489–500. https://doi.org/10.1080/03610929608831709 arXiv:https://doi.org/10.1080/03610929608831709
  • Kannan et al. (2016) Ramakrishnan Kannan, Grey Ballard, and Haesun Park. 2016. A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization. SIGPLAN Not. 51, 8, Article 9 (Feb. 2016), 11 pages. https://doi.org/10.1145/3016078.2851152
  • Kiers (2004) Henk AL Kiers. 2004. Bootstrap confidence intervals for three-way methods. Journal of Chemometrics: A Journal of the Chemometrics Society 18, 1 (2004), 22–36.
  • Kolda and Bader (2009) T.G. Kolda and B.W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (September 2009), 455–500. https://doi.org/10.1137/07070111X
  • Lourakis and Liavas (2018) Georgios Lourakis and Athanasios P. Liavas. 2018. Nesterov-Based Alternating Optimization for Nonnegative Tensor Completion: Algorithm and Parallel Implementation. In 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). 1–5. https://doi.org/10.1109/SPAWC.2018.8445941
  • Martens and Martens (2000) Harald Martens and Magni Martens. 2000. Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Quality and Preference 11, 1 (2000), 5–16. https://doi.org/10.1016/S0950-3293(99)00039-7
  • Murphy et al. (2013) K.R. Murphy, C.A. Stedmon, D. Graeber, and R. Bro. 2013. Fluorescence spectroscopy and multi-way techniques. PARAFAC. Anal. Methods 5 (2013), 6557–6566. Issue 23. https://doi.org/10.1039/C3AY41160E
  • Peddada (1993) S. D. Peddada. 1993. Jackknife variance estimation and bias reduction. In Handbook of Statistics, Vol. 9, C. R. Rao (Ed.). 723–744.
  • Phan et al. (2013) Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. 2013. Fast Alternating LS Algorithms for High Order CANDECOMP/PARAFAC Tensor Factorizations. IEEE Transactions on Signal Processing 61, 19 (2013), 4834–4846. https://doi.org/10.1109/TSP.2013.2269903
  • Phipps and Kolda (2019) Eric T. Phipps and Tamara G. Kolda. 2019. Software for Sparse Tensor Decomposition on Emerging Computing Architectures. SIAM Journal on Scientific Computing 41, 3 (2019), C269–C290. https://doi.org/10.1137/18M1210691 arXiv:https://doi.org/10.1137/18M1210691
  • Psarras et al. (2020) Christos Psarras, Lars Karlsson, and Paolo Bientinesi. 2020. Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions. arXiv:cs.MS/2010.04678
  • Psarras et al. (2021) Christos Psarras, Lars Karlsson, Jiajia Li, and Paolo Bientinesi. 2021. The landscape of software for tensor computations. arXiv:cs.MS/2103.13756
  • Rajih et al. (2008) M. Rajih, P. Comon, and R.A. Harshman. 2008. Enhanced line search: a novel method to accelerate PARAFAC. SIAM J. Matrix Anal. Appl. 30, 3 (2008), 1128–1147. https://doi.org/10.1137/06065577
  • Riu and Bro (2003) Jordi Riu and Rasmus Bro. 2003. Jack-knife technique for outlier detection and estimation of standard errors in PARAFAC models. Chemometrics and Intelligent Laboratory Systems 65, 1 (2003), 35–49. https://doi.org/10.1016/S0169-7439(02)00090-4
  • Sanchez and Kowalski (1986) Eugenio. Sanchez and Bruce R. Kowalski. 1986. Generalized rank annihilation factor analysis. Analytical Chemistry 58, 2 (1986), 496–499. https://doi.org/10.1021/ac00293a054 arXiv:https://doi.org/10.1021/ac00293a054
  • Skov et al. (2008) Thomas Skov, Davide Ballabio, and Rasmus Bro. 2008. Multiblock variance partitioning: A new approach for comparing variation in multiple data blocks. Analytica Chimica Acta 615, 1 (2008), 18–29. https://doi.org/10.1016/j.aca.2008.03.045
  • Smith et al. (2015) S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In 2015 IEEE International Parallel and Distributed Processing Symposium. 61–70. https://doi.org/10.1109/IPDPS.2015.27
  • Solomonik et al. (2013) E. Solomonik, D. Matthews, J. Hammond, and J. Demmel. 2013. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. 813–824. https://doi.org/10.1109/IPDPS.2013.112
  • Vervliet and De Lathauwer (2016) N. Vervliet and L. De Lathauwer. 2016. A randomized block sampling approach to canonical polyadic decomposition of large-scale tensors. IEEE Journal of Selected Topics in Signal Processing 10, 2 (2016), 284–295. https://doi.org/10.1109/JSTSP.2015.2503260
  • Westad and Marini (2015) Frank Westad and Federico Marini. 2015. Validation of chemometric models – A tutorial. Analytica Chimica Acta 893 (2015), 14–24. https://doi.org/10.1016/j.aca.2015.06.056
  • Wiberg and Jacobsson (2004) Kent Wiberg and Sven P Jacobsson. 2004. Parallel factor analysis of HPLC-DAD data for binary mixtures of lidocaine and prilocaine with different levels of chromatographic separation. Analytica Chimica Acta 514, 2 (2004), 203–209. https://doi.org/10.1016/j.aca.2004.03.062