Accelerating jackknife resampling for the Canonical Polyadic Decomposition
Abstract.
The Canonical Polyadic (CP) tensor decomposition is frequently used as a model in applications in a variety of different fields. Using jackknife resampling to estimate parameter uncertainties is often desirable but further increases the already high computational cost. Observing that the resampled tensors, though different, are nearly identical, we show that it is possible to extend the recently proposed Concurrent ALS (CALS) technique to a jackknife resampling scenario. This extension gives access to the computational efficiency advantage of CALS for the price of a modest increase (typically a few percent) in the number of floating point operations. Numerical experiments on both synthetic and real-world datasets demonstrate that the new workflow based on a CALS extension can be several times faster than a straightforward workflow where the jackknife submodels are processed individually.
1. Introduction
The CP model is used increasingly across a large diversity of fields. One of the fields in which CP is commonly applied is chemistry (Murphy et al., 2013; Wiberg and Jacobsson, 2004), where there is often a need for estimating not only the parameters of the model, but also the associated uncertainty of those parameters (Farrance and Frenkel, 2012). In fact, in some areas it is a dogma that an estimate without an uncertainty is not a result. A common approach for estimating uncertainties of the parameters of CP models is through resampling, such as bootstrapping or jackknifing (Riu and Bro, 2003; Kiers, 2004). The latter has added benefits, e.g., for variable selection (Martens and Martens, 2000) and outlier detection (Riu and Bro, 2003). Here we consider a new technique, JK-CALS, that increases the performance of jackknife resampling applied to CP by more efficiently utilizing the computer’s memory hierarchy.
The basic concept of jackknife is somewhat related to cross-validation. Let $\mathcal{T}$ be a tensor, and $\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}$ the factor matrices of a CP model fitted to it. Let us also make the assumption (typical in many applications) that the first mode corresponds to independent samples, and all the other modes correspond to variables. For the most basic type of jackknifing, namely Leave-One-Out (LOO), one sample (out of $I_1$) is left out at a time (resulting in a tensor with only $I_1 - 1$ samples) and a model is fitted to the remaining data; we refer to that model as a submodel. (Henceforth, when we mention jackknifing we imply LOO jackknifing, unless otherwise stated.) All samples are left out exactly once, resulting in $I_1$ distinct submodels. Each submodel provides an estimate of the parameters of the overall model. For example, each submodel provides an estimate of the factor (or loading) matrices of the variable modes. From these estimates it is possible to calculate the variance (or bias) of the overall loading matrices (the ones obtained from all samples). One complication comes from some indeterminacies of CP that need to be taken into account. For example, when one (or more) samples are removed from the initial tensor, the order of components in the submodel may change; this phenomenon is explained and a solution is proposed in (Riu and Bro, 2003).
Recently, the authors proposed a technique, Concurrent ALS (CALS) (Psarras
et al., 2020), that can fit multiple CP
models to the same underlying tensor more rapidly than regular ALS.
CALS achieves better performance not by altering the numerics but by utilizing the computer’s memory hierarchy more efficiently than regular ALS.
However, the CALS technique cannot be directly applied to jackknife resampling, since the
submodels are fitted to different tensors.
In this paper, we extend the idea that underpins CALS to jackknife resampling.
The new technique takes advantage of the fact that the resampled tensors are nearly identical.
At the price of a modest increase in arithmetic operations, the technique allows for more efficient fitting of the CP submodels and thus improved overall performance of a jackknife workflow.
In applications in which the number of components in the CP model is relatively low, the technique can significantly
reduce the overall time to solution.
Contributions
- An efficient technique, JK-CALS, for performing jackknife resampling of CP models. The technique is based on an extension of CALS to nearly identical tensors. To the best of our knowledge, this is the first attempt at accelerating jackknife resampling of CP models.
- Numerical experiments demonstrate that JK-CALS can lead to performance gains in a jackknife resampling workflow.
- Theoretical analysis shows that the technique generalizes from leave-one-out to delete-$d$ jackknife with a modest (less than a factor of two) increase in arithmetic.
- A C++ library with support for GPU acceleration and a Matlab interface.
Organization
The rest of the paper is organized as follows. In Section 2, we provide an overview of related research. In Section 3, we review the standard CP-ALS and CALS algorithms, as well as jackknife applied to CP. We describe the technique which enables us to use CALS to compute jackknife more efficiently in Section 4. In Section 5 we demonstrate the efficiency of our proposed technique, by applying it to perform jackknife resampling to CP models that have been fitted to artificial and real tensors. Finally, in Section 6, we conclude the paper and provide insights for further research.
2. Related Work
Two popular techniques for uncertainty estimation for CP models are bootstrap and jackknife (Westad and Marini, 2015; Kiers, 2004; Riu and Bro, 2003). The main difference is that jackknife resamples without replacement whereas bootstrap resamples with replacement. Bootstrap frequently involves more submodels than jackknife and is therefore more expensive. The term jackknife typically refers to leave-one-out jackknife, where only one observation is removed when resampling. More than one observation can be removed at a time (Peddada, 1993), a variation commonly called delete-$d$ jackknife. When applied to CP, jackknife has certain benefits over bootstrap, e.g., for variable selection (Martens and Martens, 2000) and outlier detection (Riu and Bro, 2003).
Jackknife requires fitting multiple submodels. A straightforward way of accelerating jackknife is to separately accelerate the fitting of each submodel, e.g., using a faster implementation. The simplest and most extensively used numerical method for fitting CP models is the Alternating Least Squares (CP-ALS) method. Alternative methods for fitting CP models include eigendecomposition-based methods (Sanchez and Kowalski, 1986) and gradient-based (all-at-once) optimization methods (Acar et al., 2011).
Several techniques have been proposed to accelerate CP-ALS. Line search (Rajih et al., 2008) and extrapolation (Ang et al., 2019) aim to reduce the number of iterations until convergence. Randomization-based techniques have also been proposed. These target very large tensors, and either randomly sample the tensor (Vervliet and De Lathauwer, 2016) or the Khatri-Rao product (Battaglino et al., 2018), to reduce their size and, by extension, the overall amount of computation. Similarly, compression-based techniques replace the target tensor with a compressed version, thus also reducing the amount of computation during fitting (Bro and Andersson, 1998). The CP model of the reduced tensor is inflated to correspond to a model of the original tensor.
Several projects offer high-performance implementations of CP-ALS, for example, Cyclops (Solomonik et al., 2013), PLANC (Kannan et al., 2016), Partensor (Lourakis and Liavas, 2018), SPLATT (Smith et al., 2015), and Genten (Phipps and Kolda, 2019). For a more comprehensive list of software implementing some variant of CP-ALS, refer to (Psarras et al., 2021).
Similar to the present work, there have been attempts at accelerating jackknife although (to the best of our knowledge) not in the context of CP. In (Buzas, 1997), the high computational cost of jackknife is tackled by using a numerical approximation that requires fewer operations at the price of lower accuracy. In (Belotti and Peracchi, 2020), a general-purpose routine for fast jackknife estimation is presented. Some estimators (often linear ones) have leave-one-out formulas that allow for fast computation of the estimator after leaving one sample out. Jackknife is thus accelerated by computing the estimator on the full set and then systematically applying the leave-one-out formula. In (Hinkle and Stromberg, 1996), a similar technique is studied. Jackknife computes an estimator on distinct subsets of the samples. Any two of these subsets differ by only one sample, i.e., any one subset can be obtained from any other by replacing one and only one element. Some estimators have a fast updating formula, which can rapidly transform an estimator for one subset to an estimator for another subset. Jackknife is thus accelerated by computing the estimator from scratch on the first subset and then repeatedly updating the estimator using this fast updating formula.
3. CP-ALS, CALS and jackknife
In this section, we first specify the notation to be used throughout the paper,
we then review the CP-ALS and CALS techniques,
and finally we describe jackknife resampling applied to CP.
3.1. Notation
For vectors and matrices, we use bold lowercase and uppercase roman letters, respectively, e.g., $\mathbf{a}$ and $\mathbf{A}$.
For tensors, we follow the notation in (Kolda and Bader, 2009);
specifically, we use bold calligraphic fonts, e.g., $\mathcal{T}$.
The order (number of indices or modes) of a tensor is denoted by uppercase roman letters, e.g., $N$.
For each mode $n$, a tensor $\mathcal{T}$ of size $I_1 \times I_2 \times \cdots \times I_N$ can be unfolded (matricized) into a matrix, denoted by $\mathbf{T}_{(n)}$, whose columns are the mode-$n$ fibers of $\mathcal{T}$, i.e., the vectors obtained by fixing all indices except the one in mode $n$.
Sets are denoted by calligraphic fonts, e.g., $\mathcal{S}$.
Given two matrices $\mathbf{A}$ and $\mathbf{B}$ with the same number of columns, the Khatri-Rao product, denoted by $\mathbf{A} \odot \mathbf{B}$, is the column-wise Kronecker product of $\mathbf{A}$ and $\mathbf{B}$. Finally, we use a unary summation operator that, when applied to a matrix, yields the scalar sum of all matrix elements.
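As a concrete illustration of this notation (our own NumPy sketch, not part of the paper's C++ library), the following snippet implements the mode-$n$ unfolding and the Khatri-Rao product for small examples:

```python
import numpy as np

def unfold(T, n):
    # Mode-n unfolding: rows indexed by mode n; the remaining modes are taken
    # in increasing order with the lowest mode varying fastest (Kolda & Bader, 2009).
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

def khatri_rao(A, B):
    # Column-wise Kronecker product of A (I x R) and B (J x R), giving an (I*J x R) matrix.
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

T = np.arange(24, dtype=float).reshape(2, 3, 4)   # a small 2 x 3 x 4 tensor
print(unfold(T, 1).shape)                          # (3, 8): one row per mode-1 index
A = np.random.rand(5, 2)
B = np.random.rand(4, 2)
print(khatri_rao(A, B).shape)                      # (20, 2)
```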
3.2. CP-ALS
The standard alternating least-squares method for CP is shown in Algorithm 1 (CP-ALS). The input consists of a target tensor $\mathcal{T}$. The output consists of a CP model represented by a sequence of factor matrices $\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}$, each with $R$ columns (components). The algorithm repeatedly updates the factor matrices one by one in sequence until either of the following criteria is met: a) the improvement in the fit of the model to the target tensor falls below a certain tolerance threshold, or b) a maximum number of iterations has been reached. To update a specific factor matrix $\mathbf{A}^{(n)}$, the gradient of the least-squares objective function with respect to that factor matrix is set to zero and the resulting linear least-squares problem is solved directly via the normal equations. This entails computing the Matricized Tensor Times Khatri-Rao Product (MTTKRP), which is the product between the mode-$n$ unfolding $\mathbf{T}_{(n)}$ and the Khatri-Rao Product (KRP) of all factor matrices except $\mathbf{A}^{(n)}$. The MTTKRP is followed by the Hadamard product of the Gramians $\mathbf{A}^{(k)\mathsf{T}}\mathbf{A}^{(k)}$ of all factor matrices except $\mathbf{A}^{(n)}$. Factor matrix $\mathbf{A}^{(n)}$ is then updated by solving a linear system involving these two quantities. At the completion of an iteration, i.e., a full pass over all modes, the error between the model and the target tensor is computed using the efficient formula derived in (Phan et al., 2013).
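For concreteness, here is a minimal NumPy sketch (our illustration, not the authors' implementation) of one CP-ALS sweep for a third-order tensor, following the steps described above: MTTKRP, Hadamard product of the Gramians, linear solve, and a naive error evaluation.

```python
import numpy as np

def unfold(T, n):
    # Mode-n unfolding (same convention as in the Section 3.1 sketch).
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

def khatri_rao(A, B):
    # Column-wise Kronecker product.
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

def cp_als_sweep(T, factors):
    """One CP-ALS iteration (one pass over all modes) for a 3rd-order tensor.
    factors = [A0, A1, A2], where A_n has shape (T.shape[n], R)."""
    for n in range(3):
        others = [factors[k] for k in range(3) if k != n]
        krp = khatri_rao(others[1], others[0])       # KRP of the other factors, highest mode first
        M = unfold(T, n) @ krp                       # MTTKRP
        H = (others[0].T @ others[0]) * (others[1].T @ others[1])  # Hadamard product of Gramians
        factors[n] = np.linalg.solve(H.T, M.T).T     # normal equations: A_n = M H^{-1}
    approx = np.einsum("ir,jr,kr->ijk", *factors)    # reconstructed tensor
    error = np.linalg.norm(T - approx)               # naive error; (Phan et al., 2013) give a cheaper formula
    return factors, error

# Example usage with random data and R = 5 components.
T = np.random.rand(50, 40, 30)
factors = [np.random.rand(s, 5) for s in T.shape]
for _ in range(20):                                  # fixed number of sweeps for simplicity
    factors, err = cp_als_sweep(T, factors)
```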
Assuming a small number of components $R$, the most expensive step is the MTTKRP. This step involves roughly $2R \prod_{n=1}^{N} I_n$ FLOPs (ignoring, for the sake of simplicity, the lower-order number of FLOPs required for the computation of the KRP). The operation touches slightly more than $\prod_{n=1}^{N} I_n$ memory locations, resulting in an arithmetic intensity of less than $2R$ FLOPs per memory reference. Thus, unless $R$ is sufficiently large, the speed of the computation will be limited by the memory bandwidth rather than the speed of the processor. The CP-ALS algorithm is inherently memory-bound for small $R$, regardless of how it is implemented.
The impact on performance of the memory-bound nature of MTTKRP is demonstrated in Fig. 1, which shows the computational efficiency of a particular implementation of MTTKRP as a function of the number of components $R$ (for a tensor of fixed size). Efficiency is defined as the ratio of the performance achieved by MTTKRP (in FLOPs/sec) to the Theoretical Peak Performance (TPP, see below) of the machine, i.e.,

$$\text{efficiency} = \frac{\text{MTTKRP performance (FLOPs/sec)}}{\text{TPP (FLOPs/sec)}}.$$
The TPP of a machine is defined as the maximum number of (double precision) floating point operations the machine can perform in one second. Table 1 shows the TPP for our particular machine (see Sec. 5 for details).
| System | TPP (GFLOPs/sec) | Threads | Frequency per core (GHz) |
| --- | --- | --- | --- |
| CPU | 112 | 1 | 3.5 |
| CPU | 1536 | 24 | 2.0 |
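For reference, these values correspond to cores $\times$ frequency $\times$ FLOPs per cycle, assuming 32 double-precision FLOPs per core per cycle (two AVX-512 FMA units per core, a property of this processor family rather than a measured quantity): $1 \times 3.5 \times 32 = 112$ and $24 \times 2 \times 32 = 1536$ GFLOPs/sec.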
In Fig. 1,
we see that the efficiency of MTTKRP tends to increase with the number of components $R$, until eventually reaching a plateau.
On this machine, the plateau is at efficiency for one thread and at efficiency for 24 threads.
For small values of $R$, which are common in applications, the achieved performance is well below the TPP.
3.3. Concurrent ALS (CALS)
When fitting multiple CP models to the same underlying tensor, the Concurrent ALS (CALS) technique can improve the efficiency if the number of components is not large enough for CP-ALS to reach its performance plateau (Psarras et al., 2020). A need to fit multiple models to the same tensor arises, for example, when trying different initial guesses or when trying different numbers of components.
The gist of CALS can be summarized as follows (see (Psarras et al., 2020) for details). Suppose $M$ independent instances of CP-ALS have to be executed on the same underlying tensor. Rather than running them sequentially or in parallel, run them in lock-step fashion as follows. Advance every CP-ALS process one iteration before proceeding to the next iteration. One CALS iteration entails $M$ CP-ALS iterations (one iteration per model). Each CP-ALS iteration in turn contains one MTTKRP operation, so one CALS iteration also entails $M$ MTTKRP operations. But these MTTKRPs all involve the same tensor and can therefore be fused into one bigger MTTKRP operation (see Eq. 3 of (Psarras et al., 2020)). The performance of the fused MTTKRP depends on the sum total of components, i.e., $\sum_{i=1}^{M} R_i$, where $R_i$ is the number of components in model $i$. Due to the performance profile of MTTKRP (see Fig. 1), the fused MTTKRP is expected to be more efficient than each of the $M$ smaller operations it replaces.
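The following NumPy sketch (our illustration, with made-up sizes) shows the core of the fusion: the per-model Khatri-Rao products are concatenated column-wise so that one large matrix product against the shared unfolding replaces many small ones.

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product.
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

# Shared 3rd-order tensor, unfolded once for mode 0 (random data for illustration).
I, J, K = 100, 80, 60
T0 = np.random.rand(I, J * K)                       # mode-0 unfolding of the shared tensor

# Three independent models with R_i components each (hypothetical sizes).
models = [(np.random.rand(J, R), np.random.rand(K, R)) for R in (3, 5, 8)]

# Separate MTTKRPs: one small matrix product per model.
separate = [T0 @ khatri_rao(C, B) for (B, C) in models]

# Fused MTTKRP: concatenate the per-model Khatri-Rao products column-wise and
# perform one large matrix product (sum of R_i = 16 columns in total).
fused = T0 @ np.hstack([khatri_rao(C, B) for (B, C) in models])

# The fused result contains the separate results side by side.
assert np.allclose(fused, np.hstack(separate))
```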
The following example illustrates the impact on efficiency of MTTKRP fusion.
Given a target tensor, $M$ models to fit, and $R$ components in each model, each of the $M$ separate MTTKRPs in CP-ALS runs at the efficiency observed for $R$ components in Fig. 1, whereas the fused MTTKRP in CALS runs at the (higher) efficiency observed for $MR$ components.
Since the MTTKRP operation dominates the cost, CALS is expected to be faster than CP-ALS by roughly the ratio of these two efficiencies.
3.4. Jackknife
Algorithm 2 shows a baseline (inefficient) application of leave-one-out jackknife resampling to a CP model. For details, see (Riu and Bro, 2003). The inputs are a target tensor $\mathcal{T}$, an overall CP model fitted to all of $\mathcal{T}$, and a sampled mode $m$. For each sample $p$, the algorithm removes the slice corresponding to that sample from the tensor $\mathcal{T}$ and the corresponding row from factor matrix $\mathbf{A}^{(m)}$ of the model, and then fits a reduced submodel to the reduced tensor using regular CP-ALS. After fitting all submodels, the standard deviation of every factor matrix except $\mathbf{A}^{(m)}$ is computed from the collected submodels. The only costly part of Algorithm 2 is the repeated calls to CP-ALS.
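A minimal NumPy sketch of this baseline workflow is given below; it assumes a generic `cp_als(T, R)` fitting routine (a hypothetical stand-in for any CP-ALS implementation, such as the sweep sketched in Section 3.2) and omits the component matching and sign correction of (Riu and Bro, 2003).

```python
import numpy as np

def jackknife_cp_baseline(T, cp_als, R, sampled_mode=0):
    """Baseline leave-one-out jackknife: one CP-ALS fit per submodel.
    `cp_als(T, R)` is assumed to return a list of factor matrices (one per mode);
    the component matching of (Riu & Bro, 2003) is omitted for brevity."""
    n_samples = T.shape[sampled_mode]
    submodels = []
    for p in range(n_samples):
        T_p = np.delete(T, p, axis=sampled_mode)     # drop sample p from the sampled mode
        submodels.append(cp_als(T_p, R))             # fit a submodel from scratch (the costly part)
    # Jackknife standard deviation of each factor matrix except the sampled-mode one.
    stds = {}
    for n in range(T.ndim):
        if n == sampled_mode:
            continue
        estimates = np.stack([sub[n] for sub in submodels])             # (n_samples, I_n, R)
        # Standard LOO jackknife variance; the exact formula of (Riu & Bro, 2003) may differ.
        stds[n] = np.sqrt((n_samples - 1) * np.var(estimates, axis=0))
    return submodels, stds
```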
4. Accelerating jackknife by using CALS
The straightforward application of jackknife to CP in Algorithm 2 involves one independent call to CP-ALS per submodel, each on nearly the same tensor.
Since the tensors are not exactly the same, CALS (Psarras
et al., 2020) cannot be directly applied.
In this section, we show how one can rewrite Algorithm 2 in such a way that CALS can be applied.
There is an associated overhead due to extra computation, but we will show that this overhead is modest (less than a 100% increase in the worst case, and typically only a few percent).
4.1. JK-CALS: Jackknife extension of CALS
Let $\mathcal{T}$ be an $N$-mode tensor with a corresponding CP model $\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}$. Let $\hat{\mathcal{T}}$ be identical to $\mathcal{T}$ except for one sample (with index $p$) removed from the sampled mode $m$. Let $\hat{\mathbf{A}}^{(1)}, \ldots, \hat{\mathbf{A}}^{(N)}$ be the CP submodel corresponding to the resampled tensor $\hat{\mathcal{T}}$.
When fitting a CP model to $\mathcal{T}$ using CP-ALS, the MTTKRP for mode $n$ is given by

$$\mathbf{M}^{(n)} = \mathbf{T}_{(n)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(n+1)} \odot \mathbf{A}^{(n-1)} \odot \cdots \odot \mathbf{A}^{(1)} \right). \qquad (1)$$

Similarly, when fitting a model to $\hat{\mathcal{T}}$, the MTTKRP for mode $n$ is given by

$$\hat{\mathbf{M}}^{(n)} = \hat{\mathbf{T}}_{(n)} \left( \hat{\mathbf{A}}^{(N)} \odot \cdots \odot \hat{\mathbf{A}}^{(n+1)} \odot \hat{\mathbf{A}}^{(n-1)} \odot \cdots \odot \hat{\mathbf{A}}^{(1)} \right). \qquad (2)$$
Can $\hat{\mathbf{M}}^{(n)}$ be computed from $\mathbf{T}_{(n)}$ instead of $\hat{\mathbf{T}}_{(n)}$? As we will see, the answer is yes. We separate two cases: $n = m$ and $n \neq m$.
Case I: $n = m$. The slice of $\mathcal{T}$ removed when resampling corresponds to a row of the unfolding $\mathbf{T}_{(m)}$. To see this, note that element $(i_1, i_2, \ldots, i_N)$ of $\mathcal{T}$ corresponds to element $(i_n, j)$ of its mode-$n$ unfolding (Kolda and Bader, 2009), where

$$j = 1 + \sum_{\substack{k=1 \\ k \neq n}}^{N} (i_k - 1) J_k \quad \text{with} \quad J_k = \prod_{\substack{l=1 \\ l \neq n}}^{k-1} I_l. \qquad (3)$$

When we remove sample $p$, then $\hat{\mathbf{T}}_{(m)}$ will be identical to $\mathbf{T}_{(m)}$ except that row $p$ of the latter is missing in the former. In other words, $\hat{\mathbf{T}}_{(m)} = \mathbf{S}_p \mathbf{T}_{(m)}$, where $\mathbf{S}_p$ is the selection matrix that removes row $p$. We can therefore compute $\hat{\mathbf{M}}^{(m)}$ by replacing $\hat{\mathbf{T}}_{(m)}$ with $\mathbf{T}_{(m)}$ in (2) and then discarding row $p$ from the result:

$$\hat{\mathbf{M}}^{(m)} = \mathbf{S}_p \left( \mathbf{T}_{(m)} \left( \hat{\mathbf{A}}^{(N)} \odot \cdots \odot \hat{\mathbf{A}}^{(m+1)} \odot \hat{\mathbf{A}}^{(m-1)} \odot \cdots \odot \hat{\mathbf{A}}^{(1)} \right) \right).$$
Case II: $n \neq m$. The slice of $\mathcal{T}$ removed when resampling corresponds to a set of columns in the unfolding $\mathbf{T}_{(n)}$. One could in principle remove these columns to obtain $\hat{\mathbf{T}}_{(n)}$. But instead of explicitly removing sample $p$ from $\mathcal{T}$, we can simply zero out the corresponding slice of $\mathcal{T}$. To give the CP model matching dimensions, we need only insert a row of zeros at index $p$ in factor matrix $\hat{\mathbf{A}}^{(m)}$. Crucially, the zeroing out of slice $p$ is superfluous: in the MTTKRP, the elements that should have been zeroed out are multiplied with the zeros in the Khatri-Rao product generated by the row of zeros inserted in factor matrix $\hat{\mathbf{A}}^{(m)}$. Thus, to compute $\hat{\mathbf{M}}^{(n)}$ in (2) we (a) replace $\hat{\mathbf{T}}_{(n)}$ with $\mathbf{T}_{(n)}$ and (b) insert a row of zeros at index $p$ in factor matrix $\hat{\mathbf{A}}^{(m)}$.
In summary, we have shown that it is possible to compute $\hat{\mathbf{M}}^{(n)}$ in (2) without referencing the reduced tensor. There is an overhead associated with extra arithmetic: for the case $n = m$, we compute numbers that are later discarded, and for the case $n \neq m$, we do some arithmetic with zeros.
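The following NumPy sketch numerically verifies both cases for a small third-order tensor with sampled mode $m = 0$ (our illustration; all names and sizes are made up):

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

def khatri_rao(A, B):
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

I0, I1, I2, R, p = 6, 5, 4, 3, 2                   # sampled mode m = 0, sample p removed
T = np.random.rand(I0, I1, I2)
T_hat = np.delete(T, p, axis=0)                     # resampled (reduced) tensor
A0h = np.random.rand(I0 - 1, R)                     # submodel factors fitted to T_hat
A1h = np.random.rand(I1, R)
A2h = np.random.rand(I2, R)

# Case I (n = m): MTTKRP for the sampled mode.
M0_ref  = unfold(T_hat, 0) @ khatri_rao(A2h, A1h)                      # uses the reduced tensor
M0_full = np.delete(unfold(T, 0) @ khatri_rao(A2h, A1h), p, axis=0)    # full tensor, row p discarded
assert np.allclose(M0_ref, M0_full)

# Case II (n != m): MTTKRP for a non-sampled mode.
A0_pad  = np.insert(A0h, p, 0.0, axis=0)                               # row of zeros at index p
M1_ref  = unfold(T_hat, 1) @ khatri_rao(A2h, A0h)                      # uses the reduced tensor
M1_full = unfold(T, 1) @ khatri_rao(A2h, A0_pad)                       # full tensor, padded factor
assert np.allclose(M1_ref, M1_full)
```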
Based on the observations above, the CALS algorithm (Psarras et al., 2020) can be modified to facilitate the concurrent fitting of all jackknife submodels. Algorithm 3 incorporates the necessary changes. In the end, extending CALS to support jackknife comes down to these localized changes (colored red in Algorithm 3):
- insert a row of zeros in one of the factor matrices,
- periodically zero out the padded row to keep it zero, and
- adjust the error formula to compute the submodel error.
We remark that JK-CALS can be straightforwardly extended to delete-$d$ jackknife.
Instead of padding and periodically zeroing out one row, we pad and periodically zero out $d$ rows.
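The same padding trick carries over to delete-$d$. The sketch below (ours, with illustrative sizes) checks the non-sampled-mode case with $d = 2$ rows of zeros padded at the removed positions:

```python
import numpy as np

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order="F")

def khatri_rao(A, B):
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

I0, I1, I2, R = 8, 5, 4, 3
left_out = [1, 6]                                   # indices of the d = 2 removed samples
T = np.random.rand(I0, I1, I2)
T_hat = np.delete(T, left_out, axis=0)              # reduced tensor for this submodel

A0h = np.random.rand(I0 - len(left_out), R)         # submodel factor for the sampled mode
A2h = np.random.rand(I2, R)

# Pad A0h with d rows of zeros at the removed positions; in JK-CALS these rows
# must be re-zeroed after every update of the sampled-mode factor.
A0_pad = A0h
for idx in sorted(left_out):
    A0_pad = np.insert(A0_pad, idx, 0.0, axis=0)

M1_ref  = unfold(T_hat, 1) @ khatri_rao(A2h, A0h)   # MTTKRP from the reduced tensor
M1_full = unfold(T, 1) @ khatri_rao(A2h, A0_pad)    # MTTKRP from the full tensor
assert np.allclose(M1_ref, M1_full)
```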
4.2. Performance considerations
While Algorithm 3 benefits from improved MTTKRP efficiency, the padding results in extra arithmetic operations. Let $d$ denote the number of removed samples ($d = 1$ corresponds to leave-one-out), and let $I_m$ denote the number of samples in the sampled mode $m$. For the sake of simplicity, assume that the integer $d$ divides $I_m$. There are $I_m/d$ submodels, each with $R$ components. The only costly part is the MTTKRP.
For each mode and iteration, the MTTKRPs in JK-ALS (for all $I_m/d$ submodels combined) require approximately

$$\frac{I_m}{d} \cdot 2R \, \frac{I_m - d}{I_m} \prod_{n=1}^{N} I_n \;=\; \frac{2R(I_m - d)}{d} \prod_{n=1}^{N} I_n$$

FLOPs, since each submodel operates on a reduced tensor in which the sampled mode has $I_m - d$ instead of $I_m$ entries. Meanwhile, the fused MTTKRP in JK-CALS requires approximately

$$\frac{I_m}{d} \cdot 2R \prod_{n=1}^{N} I_n$$

FLOPs, since every submodel is processed against the full tensor. The ratio of the latter to the former comes down to

$$\frac{I_m}{I_m - d} \leq 2,$$

since $d \leq I_m/2$ in delete-$d$ jackknife. Thus, in the worst case, JK-CALS requires less than twice the FLOPs of JK-ALS. More typically, the overhead is negligible.
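As an illustrative example (the numbers here are ours): with $I_m = 20$ samples and leave-one-out resampling ($d = 1$), the ratio is $20/19 \approx 1.05$, i.e., roughly a 5% increase in FLOPs, whereas the extreme case $d = I_m/2 = 10$ yields the worst-case factor of two.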
5. Experiments
We investigate the performance benefits of using the JK-CALS algorithm for jackknife resampling of CP models through two sets of experiments. In the first set of experiments, we focus on the scalability of the algorithm, with respect to both problem size and number of processor cores. For this purpose, we use synthetic datasets of increasing volume, mimicking the shape of real datasets. In the second set of experiments, we illustrate JK-CALS’s practical impact by using it to perform jackknife resampling on two tensors arising in actual applications.
All experiments were conducted using a Linux-based system with an Intel® Xeon® Platinum 8160 Processor (Turbo Boost enabled, Hyper-Threading disabled), which contains 24 physical cores split in 2 NUMA regions of 12 cores each.
The system also contains an Nvidia Tesla V100 GPU (driver version 470.57.02, CUDA version 11.2).
The experiments were conducted with double precision arithmetic and we report results for 1 thread, 24 threads (two NUMA
regions), and the GPU (with 24 CPU threads).
The source code (available online at https://github.com/HPAC/CP-CALS/tree/jackknife) was compiled with GCC (version 9) and linked to the Intel® Math Kernel Library (MKL, version 19.0).
5.1. Scalability analysis
In this first experiment, we use three synthetic tensors of size with , referred to as “small”, “medium” and “large” tensors, respectively. The samples are in the first mode. The other modes contain variables. The number of samples is kept low, since leave-one-out jackknife is usually performed on a small number of samples (usually ), while there can be arbitrarily many variables.
For each tensor, we perform jackknife on four models with varying number of components (). This range of component counts is typical in applications. In practice, it is often the case that multiple models are fitted to the target tensor, and many of those models are then further analyzed using jackknife. For this reason, we perform jackknife on each model individually, as well as on all models simultaneously (denoted by “All” in the figures), to better simulate multiple real-world application scenarios. In this experiment, the termination criteria based on maximum number of iterations and tolerance are ignored; instead, all models are forced to go through exactly 100 iterations, which is typically fewer iterations than small tolerance values would require (i.e., most models need more than 100 iterations to converge). The reason for this choice is that we aim to isolate the performance difference of the methods tested; therefore, we maintain a consistent amount of workload throughout the experiment. (Tolerance and maximum number of iterations are instead used later on in the application experiments.)
For comparison, we perform jackknife using three methods: JK-ALS, JK-OALS and JK-CALS. JK-OALS uses OpenMP to take advantage of the inherent parallelism when fitting multiple submodels. This method is only used for multi-threaded and GPU experiments, and we are only going to focus on its performance, ignoring the memory overhead associated with it.
Fig. 2 shows results for single threaded execution; in this case, JK-OALS is absent. JK-CALS consistently outperforms JK-ALS for all tensor sizes and workloads. Specifically, for any fixed amount of workload—i.e., a model of a specific number of components—JK-CALS exhibits increasing speedups compared to JK-ALS, as the tensor size increases. For example, for a model with 5 components, JK-CALS is , , times faster than JK-ALS, for the small, medium and large tensor sizes, respectively.
Fig. 3 shows results for multi-threaded execution, using 24 threads. In this case, JK-CALS outperforms the other two implementations (JK-ALS and JK-OALS) for the medium and large tensors, for all workloads (number of components), exhibiting speedups up to and compared to JK-ALS and JK-OALS, respectively. For the small tensor () and small workloads (), JK-CALS is outperformed by JK-OALS; for , it is also outperformed by JK-ALS. Investigating this further, for the small tensor and and , the parallel speedup (the ratio between single threaded and multi-threaded execution time) of JK-CALS is and for 24 threads. However, for threads, the corresponding timings are and seconds, resulting in speedups of and respectively. This points to two main reasons for the observed performance of JK-CALS in these cases: a) the amount of available computational resources (24 threads) is disproportionately high compared to the volume of computation to be performed and b) because of the small amount of overall computation, the small overhead associated with the CALS methodology becomes more significant.
That being said, even for the small tensor, as the amount of workload increases—already for a single model with 9 components—JK-CALS again becomes the fastest method. Finally, similarly to the single threaded case, as the size of the tensor increases, so do the speedups achieved by JK-CALS over the other two methods.
Fig. 4 shows results when the GPU is used to perform MTTKRP for all three methods; in this case, all 24 threads are used on the CPU.
For the small tensor and small workloads (), there is not enough computation to warrant the shipping of data to and from the GPU, resulting in higher execution times compared to multi-threaded execution;
for all other cases, all methods have reduced execution time when using the GPU compared to the execution on 24 threads.
Furthermore, in those cases, JK-CALS is consistently faster than its counterparts, exhibiting the largest speedups when
the workload is at its highest (“All”), with values of , , compared to JK-OALS, and , , compared to JK-ALS, for the small, medium and large tensors, respectively.
5.2. Real-world applications
In this second experiment, we selected two tensors of size and from the field of Chemometrics (Acar et al., 2008; Skov et al., 2008). In this field it is common to fit multiple, randomly initialized models in a range of low components (e.g. , – models for each , and then analyze (e.g., using jackknife) those models that might be of particular interest (often those with components close to the expected rank of the target tensor); in the tensors we consider, the expected rank is and , respectively. To mimic the typical workflow of practitioners, we fitted three models to each tensor, of components and , respectively, and used the three methods (JK-ALS, JK-OALS and JK-CALS) to apply jackknife resampling to the fitted models. The values for tolerance and maximum number of iterations were set according to typical values for the particular field, namely and , respectively.
In Fig. 5 we report the execution time for 1 thread, 24 threads, and GPU + 24 threads. For both datasets and for all configurations, JK-CALS is faster than the other two methods. Specifically, when compared to JK-ALS over the two tensors, JK-CALS achieves speedups of and for single threaded execution, and for 24-threaded execution. Similarly, when compared to JK-OALS, JK-CALS achieves speedups of and for 24-threaded execution. Finally, JK-CALS takes advantage of the GPU the most, exhibiting speedups of and over JK-ALS, and and over JK-OALS, for GPU execution.
6. Conclusion
Jackknife resampling of CP models is useful for estimating uncertainties, but the computation requires fitting multiple submodels and is therefore computationally expensive. We presented a new technique for implementing jackknife that better utilizes the computer’s memory hierarchy. The technique is based on a novel extension of the Concurrent ALS (CALS) algorithm for fitting multiple CP models to the same underlying tensor, first introduced in (Psarras et al., 2020). The new technique has a modest arithmetic overhead that is bounded above by a factor of two in the worst case. Numerical experiments on both synthetic and real-world datasets using a multicore processor paired with a GPU demonstrated that the proposed algorithm can be several times faster than a straightforward implementation of jackknife resampling based on multiple calls to a regular CP-ALS implementation.
Future work includes extending the software to support delete-$d$ jackknife.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Author Contributions
CP drafted the main manuscript text, developed the source code, performed the experiments and prepared all figures. LK and PB revised the main manuscript text. CP, LK, and PB discussed and formulated the jackknife extension of CALS. CP, LK, RB, and PB discussed and formulated the experiments. LK, RB, and CP discussed the related work section. PB oversaw the entire process. All authors reviewed and approved the final version of the manuscript.
Data Availability Statement
The source code and datasets used in this study are available online: https://github.com/HPAC/CP-CALS/tree/jackknife.
Acknowledgements.
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 333849990/GRK2379 (IRTG Modern Inverse Problems).
References
- Acar et al. (2008) Evrim Acar, Rasmus Bro, and Bonnie Schmidt. 2008. New exploratory clustering tool. Journal of Chemometrics 22, 1 (2008), 91–100. https://doi.org/10.1002/cem.1106
- Acar et al. (2011) E. Acar, D.M. Dunlavy, and T.G. Kolda. 2011. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics 25, 2 (2011), 67–86. https://doi.org/10.1002/cem.1335
- Ang et al. (2019) A.M.S. Ang, J.E. Cohen, L.T.K. Hien, and N. Gillis. 2019. Extrapolated alternating algorithms for approximate canonical polyadic decomposition. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3147–3151. https://doi.org/10.1109/ICASSP40776.2020.9053849
- Battaglino et al. (2018) C. Battaglino, G. Ballard, and T.G. Kolda. 2018. A practical randomized CP tensor decomposition. SIAM J. Matrix Anal. Appl. 39, 2 (2018), 876–901. https://doi.org/10.1137/17M1112303
- Belotti and Peracchi (2020) Federico Belotti and Franco Peracchi. 2020. Fast leave-one-out methods for inference, model selection, and diagnostic checking. The Stata Journal 20, 4 (2020), 785–804. https://doi.org/10.1177/1536867X20976312
- Bro and Andersson (1998) R. Bro and C.A. Andersson. 1998. Improving the speed of multiway algorithms: Part II: Compression. Chemometrics and Intelligent Laboratory Systems 42, 1–2 (1998), 105–113. https://doi.org/10.1016/S0169-7439(98)00011-2
- Buzas (1997) J. S. Buzas. 1997. Fast Estimators of the Jackknife. The American Statistician 51, 3 (1997), 235–240. https://doi.org/10.1080/00031305.1997.10473969
- Farrance and Frenkel (2012) Ian Farrance and Robert Frenkel. 2012. Uncertainty of measurement: a review of the rules for calculating uncertainty components through functional relationships. The Clinical Biochemist Reviews 33, 2 (2012), 49.
- Hinkle and Stromberg (1996) John E. Hinkle and Arnold J. Stromberg. 1996. Efficient computation of statistical procedures based on all subsets of a specified size. Communications in Statistics - Theory and Methods 25, 3 (1996), 489–500. https://doi.org/10.1080/03610929608831709
- Kannan et al. (2016) Ramakrishnan Kannan, Grey Ballard, and Haesun Park. 2016. A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization. SIGPLAN Not. 51, 8, Article 9 (Feb. 2016), 11 pages. https://doi.org/10.1145/3016078.2851152
- Kiers (2004) Henk AL Kiers. 2004. Bootstrap confidence intervals for three-way methods. Journal of Chemometrics: A Journal of the Chemometrics Society 18, 1 (2004), 22–36.
- Kolda and Bader (2009) T.G. Kolda and B.W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (September 2009), 455–500. https://doi.org/10.1137/07070111X
- Lourakis and Liavas (2018) Georgios Lourakis and Athanasios P. Liavas. 2018. Nesterov-Based Alternating Optimization for Nonnegative Tensor Completion: Algorithm and Parallel Implementation. In 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). 1–5. https://doi.org/10.1109/SPAWC.2018.8445941
- Martens and Martens (2000) Harald Martens and Magni Martens. 2000. Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Quality and Preference 11, 1 (2000), 5–16. https://doi.org/10.1016/S0950-3293(99)00039-7
- Murphy et al. (2013) K.R. Murphy, C.A. Stedmon, D. Graeber, and R. Bro. 2013. Fluorescence spectroscopy and multi-way techniques. PARAFAC. Anal. Methods 5 (2013), 6557–6566. Issue 23. https://doi.org/10.1039/C3AY41160E
- Peddada (1993) S.D. Peddada. 1993. Jackknife variance estimation and bias reduction. In Handbook of Statistics, Vol. 9, C.R. Rao (Ed.). 723–744.
- Phan et al. (2013) Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. 2013. Fast Alternating LS Algorithms for High Order CANDECOMP/PARAFAC Tensor Factorizations. IEEE Transactions on Signal Processing 61, 19 (2013), 4834–4846. https://doi.org/10.1109/TSP.2013.2269903
- Phipps and Kolda (2019) Eric T. Phipps and Tamara G. Kolda. 2019. Software for Sparse Tensor Decomposition on Emerging Computing Architectures. SIAM Journal on Scientific Computing 41, 3 (2019), C269–C290. https://doi.org/10.1137/18M1210691
- Psarras et al. (2020) Christos Psarras, Lars Karlsson, and Paolo Bientinesi. 2020. Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions. arXiv:cs.MS/2010.04678
- Psarras et al. (2021) Christos Psarras, Lars Karlsson, Jiajia Li, and Paolo Bientinesi. 2021. The landscape of software for tensor computations. arXiv:cs.MS/2103.13756
- Rajih et al. (2008) M. Rajih, P. Comon, and R.A. Harshman. 2008. Enhanced line search: a novel method to accelerate PARAFAC. SIAM J. Matrix Anal. Appl. 30, 3 (2008), 1128–1147. https://doi.org/10.1137/06065577
- Riu and Bro (2003) Jordi Riu and Rasmus Bro. 2003. Jack-knife technique for outlier detection and estimation of standard errors in PARAFAC models. Chemometrics and Intelligent Laboratory Systems 65, 1 (2003), 35–49. https://doi.org/10.1016/S0169-7439(02)00090-4
- Sanchez and Kowalski (1986) Eugenio Sanchez and Bruce R. Kowalski. 1986. Generalized rank annihilation factor analysis. Analytical Chemistry 58, 2 (1986), 496–499. https://doi.org/10.1021/ac00293a054
- Skov et al. (2008) Thomas Skov, Davide Ballabio, and Rasmus Bro. 2008. Multiblock variance partitioning: A new approach for comparing variation in multiple data blocks. Analytica Chimica Acta 615, 1 (2008), 18–29. https://doi.org/10.1016/j.aca.2008.03.045
- Smith et al. (2015) S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In 2015 IEEE International Parallel and Distributed Processing Symposium. 61–70. https://doi.org/10.1109/IPDPS.2015.27
- Solomonik et al. (2013) E. Solomonik, D. Matthews, J. Hammond, and J. Demmel. 2013. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. 813–824. https://doi.org/10.1109/IPDPS.2013.112
- Vervliet and De Lathauwer (2016) N. Vervliet and L. De Lathauwer. 2016. A randomized block sampling approach to canonical polyadic decomposition of large-scale tensors. IEEE Journal of Selected Topics in Signal Processing 10, 2 (2016), 284–295. https://doi.org/10.1109/JSTSP.2015.2503260
- Westad and Marini (2015) Frank Westad and Federico Marini. 2015. Validation of chemometric models – A tutorial. Analytica Chimica Acta 893 (2015), 14–24. https://doi.org/10.1016/j.aca.2015.06.056
- Wiberg and Jacobsson (2004) Kent Wiberg and Sven P Jacobsson. 2004. Parallel factor analysis of HPLC-DAD data for binary mixtures of lidocaine and prilocaine with different levels of chromatographic separation. Analytica Chimica Acta 514, 2 (2004), 203–209. https://doi.org/10.1016/j.aca.2004.03.062