
Coded Alternating Least Squares for Straggler Mitigation in Distributed Recommendations

Siyuan Wang1, Qifa Yan2, Jingjing Zhang3, Jianping Wang1, Linqi Song1 1City University of Hong Kong, 2University of Illinois at Chicago, 3Fudan University
Email: sywang34-c@my.cityu.edu.hk, qifay2014@163.com, jingjingzhang@fudan.edu.cn, {jianwang, linqi.song}@cityu.edu.hk
(November 2020)
Abstract

Matrix factorization is an important representation learning algorithm in applications such as recommender systems, where a large matrix is factorized into the product of two low-dimensional matrices termed latent representations. This paper investigates the problem of matrix factorization in distributed computing systems with stragglers, i.e., compute nodes that are slow to return computation results. A computation procedure, called coded Alternating Least Squares (ALS), is proposed for mitigating the effect of stragglers in such systems. The coded ALS algorithm iteratively computes the two low-dimensional latent matrices by solving a sequence of linear equations, with the Entangled Polynomial Code (EPC) as a building block. We theoretically characterize the maximum number of stragglers that the algorithm can tolerate (or the recovery threshold) in relation to the redundancy of coding (or the code rate). In addition, we analyze the computation complexity of the coded ALS algorithm and conduct numerical experiments to validate our design.

I Introduction

Matrix factorization is one of the most successful algorithms for many machine learning tasks [1]. For example, recommender systems have played an increasingly important role in Internet business in recent years. Companies such as Amazon and Alibaba have used recommender systems to promote sales, and Netflix, HBO, and YouTube have used video recommender systems to recommend videos to target users. Since the Netflix Prize competition held by Netflix [2], the accuracy of recommendations has been greatly improved by matrix factorization algorithms.

With the large amounts of data available nowadays, distributed computation is an important approach to dealing with large scale data computations. Straggler nodes are one of the most prominent problems in distributed computing systems [3, 4, 5, 6, 7, 8]. A straggler is a node that runs much slower than the others, which may be caused by various software or hardware issues such as hardware unreliability or network congestion. As a result, straggler mitigation in distributed matrix multiplication, a basic building block of many machine learning algorithms, is crucial and has been extensively studied in the literature. Among the proposed solutions, coding techniques have recently attracted much attention in the information theory community, for example, the Entangled Polynomial Code (EPC) [9].

In this paper, we investigate the problem of large scale matrix factorization through distributed computation systems. Consider a data matrix $R$ of size $m\times n$, where the dimensions $m$ and $n$ are typically very large, so that each individual computing node can only deal with computations over matrices with dimensions much smaller than $\min\{m,n\}$. We aim to factorize $R$ approximately into the product $UV^{\top}$, where the dimensions of $U$ and $V$ are $m\times d$ and $n\times d$ respectively, for some latent dimension $d\ll\mathrm{rank}(R)\leq\min\{m,n\}$. The factorization of $R$ can be formalized as minimizing $||R-UV^{\top}||_{2}^{2}$ over $U,V$, where $||\cdot||_{2}$ is the Frobenius norm.

Alternating Least Squares (ALS) is an efficient iterative algorithm to find a solution by updating $U$ and $V$ alternately: in each iteration, the algorithm updates $U$ with the current estimate of $V$ and then updates $V$ with the updated estimate of $U$. We propose a distributed implementation of the ALS algorithm with the ability to tolerate straggler nodes. In distributed ALS, matrix multiplication is a key building block, and we adopt the EPC as a means to realize matrix multiplication, with special tailoring to the ALS algorithm, where multiple matrix multiplications are involved at each iteration. To speed up the iteration, we first derive formulas that update $U$ directly from the current estimate of $U$, or update $V$ directly from the current estimate of $V$. Based on the new update formulas, we only need to iterate on either $U$ or $V$, since the estimates of $U$ and $V$ are connected by the original ALS formulas.

Therefore, we propose a coded ALS framework as follows:

  1. Pre-computation: compute the transformed data matrix $RR^{\top}$ (or $R^{\top}R$) through the distributed computing system;

  2. Iterative computation: update $U$ (or $V$) through the distributed computing system;

  3. Post-computation: compute the estimate of $V$ (or $U$) from the estimate of $U$ (or $V$).

In this computation framework, the bottleneck is the iterative computation phase, as the pre- and post-computations only need to be carried out once. In the iterative computation, both $U$ and $V$ are partitioned into submatrices along the larger dimension (i.e., $m$ or $n$), and the transformed data matrix $RR^{\top}$ or $R^{\top}R$ is partitioned in both row and column dimensions and stored at the workers in coded form. We show that, with given partition parameters, the recovery threshold to compute matrix multiplications using the EPC code is optimal among all linear codes (for matrix multiplication). We characterize the relationship between the coding redundancy and the recovery threshold. In addition, we provide a computation complexity analysis for the proposed coded ALS algorithm. Finally, we conduct numerical experiments to validate our proposed design.

I-A Related Work

The slow machine problem has existed in distributed machine learning for a long time [10]. To tackle it, many different approaches have been proposed. In synchronous machine learning settings, one line of solutions uses speculative execution [11, 12]. However, this type of method requires much more communication time and thus performs poorly.

Adding redundancy is another effective way to cope with the straggler problem. With each worker bearing more information than it otherwise would, the final result can be recovered from the extra information. The idea of using codes to tackle the straggler problem in distributed learning tasks was introduced in [13]; however, that work only focuses on matrix multiplication and data shuffling. Since then, much more research has been devoted to this area. One typical approach is data encoding, where the encoded data is stored at different workers. Works such as [13, 4, 3] encode the data as linear combinations of the original data and recover the result according to the encoding matrix.

Another approach is to encode intermediate parameters of the computation; a typical example is gradient coding [14]. Many gradient-based methods have been proposed in recent years, such as [15, 16, 8]. However, these works focus only on gradient-based distributed learning tasks.

The work in [17] uses coding in iterative matrix multiplication, which shares a similar application scenario with this paper.

I-B Statement of Contributions

In this work, we make the following contributions. 1) We propose a coding scheme for the large scale matrix factorization problem arising in recommender systems. 2) We analyze the complexity of this scheme and the running time of the coded distributed computation. 3) We address the problem that the data partitions become too large when both the number of columns and the number of rows of the data matrix are large.

Notation

Throughout the paper, we use $[x]$ to denote the set $\{1,2,\ldots,x\}$, where $x\in\mathbb{N}_{+}$ is a positive integer. We denote by $||X||=(\sum_{i\in[m],j\in[n]}X_{ij}^{2})^{1/2}$ the Frobenius norm of a matrix $X\in\mathbb{R}^{m\times n}$. For ease of presentation, we simply write $||X||_{F}$ as $||X||$ when there is no confusion.

II Problem Formulation

II-A Matrix Factorization via ALS

Given a data matrix $R\in\mathbb{R}^{m\times n}$ and a latent dimension $d$, we consider the matrix factorization problem of learning the $d$-dimensional representations $U\in\mathbb{R}^{m\times d}$ and $V\in\mathbb{R}^{n\times d}$ as follows:

\operatorname*{minimize}_{U\in\mathbb{R}^{m\times d},\,V\in\mathbb{R}^{n\times d}}\ ||R-UV^{\top}||^{2}. \quad (1)

For example, matrix factorization can be used for the recommendation problem, where $m$ represents the number of users, $n$ represents the number of items, and the $(i,j)$ entry $R_{i,j}$ of the data matrix represents the rating (or preference) of user $i\in[m]$ for item $j\in[n]$. This rating can be approximated by the inner product of the latent vector $u_{i}$ of user $i$ and the latent vector $v_{j}$ of item $j$, where $u_{i}$ and $v_{j}$ are the $i$-th row of $U$ and the $j$-th row of $V$, respectively. The problem is to find representations $u_{i}$ and $v_{j}$ that minimize the differences $\sum_{i,j}||R_{ij}-u_{i}v_{j}^{\top}||^{2}$. Note that $m,n$ are often large and the latent dimension $d\ll\mathrm{rank}(R)\leq\min\{m,n\}$.

Since this matrix factorization problem is non-convex, it is in general not easy to find the optimal solution. A well-known iterative algorithm to solve this problem is the ALS algorithm, which alternates between optimizing $U$ for given $V$ and optimizing $V$ for given $U$. The update formulas for $U$ and $V$ are given as follows:

U^{(t+1)} = RV^{(t)}\big({V^{(t)}}^{\top}V^{(t)}\big)^{-1} \quad (2)
V^{(t+1)} = R^{\top}U^{(t+1)}\big({U^{(t+1)}}^{\top}U^{(t+1)}\big)^{-1} \quad (3)

where $U^{(t)}$ and $V^{(t)}$ are the estimates of $U$ and $V$ in the $t$-th iteration. In (2), $V$ is held fixed and $U$ is updated; the updated $U$ is then held fixed in (3) to update $V$, and the two updates alternate.
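As a point of reference before distributing the computation, the following minimal NumPy sketch runs the centralized updates (2)-(3) on a synthetic low-rank matrix; the sizes, seed, and the noiseless data are arbitrary illustrative choices, not settings from the paper.

```python
# Centralized ALS, Eqs. (2)-(3); a toy sketch with arbitrary sizes.
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 60, 40, 5
R = rng.standard_normal((m, d)) @ rng.standard_normal((d, n))  # rank-d data matrix
V = rng.standard_normal((n, d))                                # random initialization

for t in range(20):
    U = R @ V @ np.linalg.inv(V.T @ V)       # update U with V fixed, Eq. (2)
    V = R.T @ U @ np.linalg.inv(U.T @ U)     # update V with U fixed, Eq. (3)

print(np.linalg.norm(R - U @ V.T))           # residual shrinks across iterations
```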

II-B Distributed Matrix Factorization with Stragglers

For large dimensions $m,n$, the updates in (2) and (3) are impractical on a single computation node. We consider solving the matrix factorization problem via a 'master-worker' distributed computing system with one master and $W$ workers. There may be some straggling workers (i.e., stragglers) among them that perform the calculations slowly and degrade the system performance.

We consider a coding aided framework to solve this matrix factorization problem.

$\bullet$ Data matrix encoding and distribution: we first encode the data matrix $R$ into another matrix $\tilde{R}$ via some encoding function, and the encoded data matrix is distributed among the workers. For worker $w$, the encoding and data distribution can be represented as $\tilde{R}_{w}=\phi_{w}(R)$ for some function $\phi_{w}(\cdot)$.

$\bullet$ Iterative calculation and model aggregation: the calculation is carried out in an iterative manner. In each round $\tau$, each worker computes its model parameters and sends them to the master, namely, $\theta_{w}^{\tau+1}=f^{\tau}_{w}(\tilde{R}_{w},\theta^{\tau})$, where $f^{\tau}_{w}(\cdot)$ is the local computation function of worker $w$ at round $\tau$. However, there is a set of stragglers $W_{s}^{\tau}\subseteq W$ among these workers that either cannot finish the computation in time or cannot transmit successfully. The master aggregates the model parameters from the non-straggling workers in $W\backslash W_{s}^{\tau}$ to obtain a new model, namely, $\theta^{\tau+1}=g^{\tau}(\{\theta_{w}^{\tau+1}\}_{w\in W\backslash W_{s}^{\tau}},\theta^{\tau})$, where $g^{\tau}(\cdot)$ is the decoding and aggregation function at round $\tau$. After that, the master returns the updated parameters to all workers via a lossless broadcast channel.

Our aim is to design a coded ALS scheme $\{\phi_{w},f^{\tau}_{w},g^{\tau}\}$ to solve the distributed matrix factorization problem (in the sense that the coded ALS scheme achieves the same computation result as the centralized counterpart in Eqs. (2) and (3)) when there are no more than $s$ stragglers among the $W$ workers ($|W_{s}^{\tau}|\leq s,\forall\tau$).

Moreover, we would like to study how the coding scheme performs, and its computation complexity, with respect to the straggler mitigation capability $s$. In particular, let $\mathrm{size}(A)$ denote the number of elements in a matrix $A$. We define the redundancy of coding (coding rate) $\mu$ as the ratio between the total coded matrix size and the original data matrix size, $\mu=W\cdot\mathrm{size}(\tilde{R}_{w})/\mathrm{size}(R)$.

We then ask what the relationship between $s$ and $\mu$ is, and how it further affects the computation complexity.

III Distributed Computation of ALS Algorithm

In this section, we will present our framework to implement the ALS algorithm in distributed computing systems.

III-A Preliminary: Entangled Polynomial Code

Entangled Polynomial Code (EPC) [9] is an efficient linear code [9, Definition 1] for computing large scale matrix multiplications in distributed computation systems with stragglers.

The entangled polynomial code computes $A^{\top}B$ via $W$ distributed worker nodes, where $A$ is partitioned into $r\times p$ equal-size blocks $A_{j,i}$ and $B$ into $r\times q$ equal-size blocks $B_{j,k}$. Each worker node stores coded sub-matrices based on the polynomials

\widetilde{A}(x) = \sum_{j=0}^{r-1}\sum_{i=0}^{p-1} A_{j,i}\,x^{j+ir}, \quad (4)
\widetilde{B}(x) = \sum_{j=0}^{r-1}\sum_{k=0}^{q-1} B_{j,k}\,x^{r-1-j+krp}. \quad (5)

Specifically, let $x_{1},\ldots,x_{W}$ be $W$ distinct real numbers. Each worker $w\in[W]$ stores $\widetilde{A}(x_{w})$ and $\widetilde{B}(x_{w})$, computes $\widetilde{A}(x_{w})^{\top}\widetilde{B}(x_{w})$, and returns the result. It was shown in [9] that all the sub-matrices of the product $A^{\top}B$, i.e.,

C_{i,k}=\sum_{j=0}^{r-1}A_{j,i}^{\top}B_{j,k},\quad i\in[p],\ k\in[q], \quad (6)

are embedded as coefficients of the polynomial $\widetilde{C}(x)=\widetilde{A}(x)^{\top}\widetilde{B}(x)$, which has degree $pqr+r-2$. As a result, the master can decode via polynomial interpolation from the responses of any $K$ worker nodes, where

K=rpq+r-1. \quad (7)

Given the parameters $p,q,r$, the number $K$ of responses that the master needs for decoding is called the recovery threshold. It was shown in [9, Theorem 2] that, when $p=1$ or $q=1$, the EPC achieves the optimal recovery threshold among all linear codes.
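To make the EPC construction concrete, the following NumPy sketch encodes $A$ and $B$ according to (4)-(5), simulates $W$ workers (some of which never respond), and decodes $A^{\top}B$ by entrywise polynomial interpolation from $K=pqr+r-1$ responses. The block counts, evaluation points and straggler pattern are illustrative assumptions, not prescriptions from [9].

```python
# A toy EPC simulation for A^T B, following Eqs. (4)-(7).
import numpy as np

def epc_matmul(A, B, p, q, r, W, stragglers=()):
    # Partition A into r x p blocks A[j][i] and B into r x q blocks B[j][k].
    A_blk = [np.hsplit(rb, p) for rb in np.vsplit(A, r)]
    B_blk = [np.hsplit(rb, q) for rb in np.vsplit(B, r)]
    xs = np.arange(1.0, W + 1)                         # distinct evaluation points
    results = {}                                       # responses from non-stragglers
    for w, x in enumerate(xs):
        if w in stragglers:
            continue
        A_t = sum(A_blk[j][i] * x ** (j + i * r) for j in range(r) for i in range(p))
        B_t = sum(B_blk[j][k] * x ** (r - 1 - j + k * r * p) for j in range(r) for k in range(q))
        results[x] = A_t.T @ B_t                       # what worker w returns
    K = p * q * r + r - 1                              # recovery threshold, Eq. (7)
    pts = list(results)[:K]
    V = np.vander(np.array(pts), N=K, increasing=True) # interpolate C~(x) entrywise
    stacked = np.stack([results[x] for x in pts])
    coeffs = np.linalg.solve(V, stacked.reshape(K, -1)).reshape(K, *stacked.shape[1:])
    # Block C_{i,k} sits at degree (r-1) + i*r + k*r*p of C~(x), cf. Eq. (6).
    return np.block([[coeffs[(r - 1) + i * r + k * r * p] for k in range(q)] for i in range(p)])

rng = np.random.default_rng(0)
A, B = rng.standard_normal((6, 4)), rng.standard_normal((6, 6))
C = epc_matmul(A, B, p=2, q=3, r=2, W=16, stragglers={0, 5})
print(np.allclose(C, A.T @ B))                         # expect True
```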

III-B Direct Update Formulas

Unlike the traditional updates (2) and (3) of $U$ and $V$, here we choose to update only one of $U$ and $V$, based on the following new update formulas. Note that here each iteration consists of several rounds.

Lemma 1.

In the ALS algorithm, let $U^{(t)},V^{(t)}$ be the estimates of $U$ and $V$ in the $t$-th iteration. For each $t=1,2,\ldots,T$,

V^{(t+1)}=R^{\top}RV^{(t)}\big(V^{(t)\top}R^{\top}RV^{(t)}\big)^{-1}V^{(t)\top}V^{(t)} \quad (8)
U^{(t+1)}=RR^{\top}U^{(t)}\big(U^{(t)\top}RR^{\top}U^{(t)}\big)^{-1}U^{(t)\top}U^{(t)} \quad (9)
Proof:

Eq. (8) can be obtained by plugging Eq. (2) into Eq. (3), and Eq. (9) can be obtained by plugging Eq. (3) (with the superscript $(t+1)$ replaced by $(t)$) into Eq. (2). For the detailed proof, please refer to Appendix B.

The main computation iterations either update the estimate of $V$ according to (8) or update the estimate of $U$ according to (9), instead of both as in traditional ALS. Before the iterations, the matrix $R^{\top}R$ or $RR^{\top}$ is computed as a pre-computation for updating $V$ or $U$, respectively. After obtaining the estimate of $V$ or $U$, a post-computation recovers the other factor via relation (2) or (3).
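As a sanity check on Lemma 1, the following short NumPy snippet verifies numerically that the direct update (8) coincides with one full round of the classical updates (2)-(3); the sizes and random data are arbitrary.

```python
# Numerical check that Eq. (8) equals Eq. (2) followed by Eq. (3).
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 40, 30, 5
R = rng.standard_normal((m, n))
V = rng.standard_normal((n, d))

U_next = R @ V @ np.linalg.inv(V.T @ V)                        # Eq. (2)
V_classical = R.T @ U_next @ np.linalg.inv(U_next.T @ U_next)  # Eq. (3)

D = R.T @ R                                                    # pre-computed once
V_direct = D @ V @ np.linalg.inv(V.T @ D @ V) @ (V.T @ V)      # Eq. (8)

print(np.allclose(V_classical, V_direct))                      # expect True
```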

III-C Distributed ALS Algorithm

In this subsection, we describe our algorithm in detail. The whole computation consists of three phases.

III-C1 Pre-computation

If $m\geq n$, the master aims to compute $R^{\top}R$ in order to update the estimate of $V$ according to (8); otherwise, the master aims to compute $RR^{\top}$ in order to update the estimate of $U$ according to (9). This can be done using the standard EPC code with appropriate parameters $p,q,r$. For clarity, in the following we denote the matrix to be computed by $D$, and the factor to be updated by $B$, i.e.,

D = \begin{cases} R^{\top}R, & \mathrm{if}\ m\geq n\\ RR^{\top}, & \mathrm{if}\ m<n\end{cases} \quad (12)
B = \begin{cases} V, & \mathrm{if}\ m\geq n\\ U, & \mathrm{if}\ m<n\end{cases} \quad (15)

That is, to reduce the amount of data involved in the computation, we choose $D$ to be the smaller of $R^{\top}R$ and $RR^{\top}$, and we choose $B$ accordingly. Notice that, by (8) and (9), if $B^{(t)}$ denotes the estimate in the $t$-th iteration, then the update formula is

B^{(t+1)}=DB^{(t)}\big(B^{(t)\top}DB^{(t)}\big)^{-1}B^{(t)\top}B^{(t)}, \quad (16)

where $D$ is an $l\times l$ symmetric matrix given by (12), $l=\min\{m,n\}$, and $B^{(t)}$ is of size $l\times d$.

III-C2 Iterative Computation of $B$

To implement the ALS algorithm in the distributed computing system, the matrix $D$ is partitioned into $h^{2}$ equal-size submatrices, each of size $\frac{l}{h}\times\frac{l}{h}$, for some positive integer $h$, i.e.,

D=\begin{bmatrix} D_{0,0} & D_{0,1} & \ldots & D_{0,h-1}\\ D_{1,0} & D_{1,1} & \ldots & D_{1,h-1}\\ \vdots & \vdots & \ddots & \vdots\\ D_{h-1,0} & D_{h-1,1} & \ldots & D_{h-1,h-1}\end{bmatrix} \quad (21)

In accordance with the partition in (21), $B^{(t)}$ is partitioned into $h$ equal-size sub-matrices of size $\frac{l}{h}\times d$, i.e.,

B^{(t)}=\begin{bmatrix} B_{0}^{(t)}\\ \vdots\\ B_{h-1}^{(t)}\end{bmatrix},\quad\forall\, t\geq 0. \quad (25)

The bottleneck of updating the estimate of $B$ according to (16) lies in three matrix products, $DB^{(t)}$, $B^{(t)\top}DB^{(t)}$ and $B^{(t)\top}B^{(t)}$, together with the product of $DB^{(t)}$ and $\big(B^{(t)\top}DB^{(t)}\big)^{-1}B^{(t)\top}B^{(t)}$, all of which need to be computed at the distributed worker nodes. (Notice that, as the matrices $B^{(t)\top}DB^{(t)}$ and $B^{(t)\top}B^{(t)}$ are both of dimension $d\times d$, the inverse $\big(B^{(t)\top}DB^{(t)}\big)^{-1}$ and its product with $B^{(t)\top}B^{(t)}$ can be calculated at the master.)

For clarity, we define the following polynomials:

\widetilde{D}(x)=\sum_{j=0}^{h-1}\sum_{i=0}^{h-1}D_{j,i}\,x^{j+ih}, \quad (26)

where $\widetilde{D}$ represents the encoded version of the matrix $D$.

For any $C\in\mathbb{R}^{l\times d}$, partition $C$ in the same manner as in (25), i.e., $C=[C_{0}^{\top},\ldots,C_{h-1}^{\top}]^{\top}$ with $C_{i}\in\mathbb{R}^{\frac{l}{h}\times d}$, and define

f_{\mathrm{L}}(C,x) = C_{0}+C_{1}x+\ldots+C_{h-1}x^{h-1}, \quad (27)
f_{\mathrm{R}}(C,x) = C_{h-1}+C_{h-2}x+\ldots+C_{0}x^{h-1}. \quad (28)
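The following small sketch (an illustration with arbitrary toy sizes, assuming NumPy) evaluates the encoders $f_{\mathrm{L}}$ and $f_{\mathrm{R}}$ of (27)-(28) and checks that $C^{\top}C$ appears as the degree-$(h-1)$ coefficient of $f_{\mathrm{L}}(C,x)^{\top}f_{\mathrm{R}}(C,x)$, so that $2h-1$ evaluations suffice to decode it.

```python
# Encoders f_L, f_R of Eqs. (27)-(28) and decoding of C^T C from 2h-1 evaluations.
import numpy as np

def f_L(blocks, x):
    h = len(blocks)
    return sum(blocks[i] * x ** i for i in range(h))

def f_R(blocks, x):
    h = len(blocks)
    return sum(blocks[i] * x ** (h - 1 - i) for i in range(h))

rng = np.random.default_rng(3)
h, l, d = 3, 12, 2
C = rng.standard_normal((l, d))
blocks = np.vsplit(C, h)                                # C_0, ..., C_{h-1}

xs = np.arange(1.0, 2 * h)                              # 2h-1 distinct points
responses = np.stack([f_L(blocks, x).T @ f_R(blocks, x) for x in xs])

V = np.vander(xs, N=2 * h - 1, increasing=True)         # entrywise interpolation
coeffs = np.linalg.solve(V, responses.reshape(2 * h - 1, -1)).reshape(2 * h - 1, d, d)
print(np.allclose(coeffs[h - 1], C.T @ C))              # expect True
```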

Let $x_{1},x_{2},\ldots,x_{W}$ be $W$ distinct real numbers. The system operates as follows.

Initially, $B^{(0)}$ is generated randomly according to some continuous distribution. The master sends $\widetilde{D}(x_{w})$, $f_{\mathrm{L}}(B^{(0)},x_{w})$ and $f_{\mathrm{R}}(B^{(0)},x_{w})$ to worker $w$ for each $w\in[W]$.

In each iteration $t=0,1,\ldots,T$:

  1. Each worker $w\in[W]$ computes $\widetilde{D}(x_{w})f_{\mathrm{R}}(B^{(t)},x_{w})$ and $f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(B^{(t)},x_{w})$, then returns the results to the master;

  2. By the EPC, upon receiving any $h^{2}+h-1$ results among

     \{\widetilde{D}(x_{w})f_{\mathrm{R}}(B^{(t)},x_{w}):w\in[W]\}, \quad (29)

     the master can decode the matrix

     E^{(t)}\triangleq DB^{(t)}. \quad (30)

     Upon receiving any $2h-1$ results among

     \{f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(B^{(t)},x_{w}):w\in[W]\}, \quad (31)

     the master can decode the matrix

     F^{(t)}\triangleq B^{(t)\top}B^{(t)}. \quad (32)

     The master then sends $f_{\mathrm{R}}(E^{(t)},x_{w})$ to worker $w$ for each $w\in[W]$;

  3. Each worker $w\in[W]$ replaces the matrix $f_{\mathrm{R}}(B^{(t)},x_{w})$ with the matrix $f_{\mathrm{R}}(E^{(t)},x_{w})$. It then computes $f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(E^{(t)},x_{w})$ and sends the result to the master;

  4. Upon receiving any $2h-1$ responses among

     \{f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(E^{(t)},x_{w}):w\in[W]\}, \quad (33)

     the master decodes $B^{(t)\top}E^{(t)}$, then computes

     G^{(t)}=\big(B^{(t)\top}E^{(t)}\big)^{-1}F^{(t)}, \quad (34)

     and sends $G^{(t)}$ to all the worker nodes;

  5. Each worker $w\in[W]$ computes

     f_{\mathrm{R}}(E^{(t)},x_{w})G^{(t)}=f_{\mathrm{R}}(E^{(t)}G^{(t)},x_{w}) \quad (35)

     and returns $f_{\mathrm{R}}(E^{(t)}G^{(t)},x_{w})$ to the master;

  6. With any $h$ responses among

     \{f_{\mathrm{R}}(E^{(t)}G^{(t)},x_{w}):w\in[W]\}, \quad (36)

     the master decodes

     B^{(t+1)}=E^{(t)}G^{(t)}. \quad (37)

     The master then sends $f_{\mathrm{L}}(B^{(t+1)},x_{w})$ and $f_{\mathrm{R}}(B^{(t+1)},x_{w})$ to worker $w$ for each $w\in[W]$;

  7. Each worker $w\in[W]$ replaces $f_{\mathrm{L}}(B^{(t)},x_{w})$ and $f_{\mathrm{R}}(B^{(t)},x_{w})$ with $f_{\mathrm{L}}(B^{(t+1)},x_{w})$ and $f_{\mathrm{R}}(B^{(t+1)},x_{w})$ respectively, and then starts the $(t+1)$-th iteration.

The procedure iterates until $B^{(t)}$ converges. We now argue that the above iterative procedure is correct.

In fact, (16) is easily verified from (30), (32), (34) and (37). We only need to show the following facts:

  a) With any $h^{2}+h-1$ responses in (29), the master can decode $E^{(t)}$. Notice that the polynomials $\widetilde{D}(x)$ and $f_{\mathrm{R}}(B^{(t)},x)$ are created according to (4) and (5) respectively, with parameters $p=h$, $r=h$, $q=1$. Thus, the claim follows directly from the result on EPC.

  b) With any $2h-1$ responses in (31) and (33), the master can decode $B^{(t)\top}B^{(t)}$ and $B^{(t)\top}E^{(t)}$ respectively. This follows directly by observing that the polynomials $f_{\mathrm{L}}(C,x)$ and $f_{\mathrm{R}}(C,x)$ are created according to (4) and (5) respectively, with parameters $p=q=1$, $r=h$.

  c) With any $h$ responses in (36), the master can decode $E^{(t)}G^{(t)}$. This follows directly from the fact that the polynomial $f_{\mathrm{R}}(C,x)$ has degree $h-1$, so the polynomial $f_{\mathrm{R}}(E^{(t)}G^{(t)},x)$ can be recovered by interpolation from any $h$ responses in (36).

The whole process is illustrated in Figure 1; a toy end-to-end sketch is given after the figure caption.

Figure 1: The overall encoding and decoding process of the algorithm. In each iteration, $B$ is first encoded into the two copies $B_{L}$ and $B_{R}$; these are used to produce $E$ and $F$, then $G$ is computed, and from $G$ and $E$ the next estimate $B^{(t+1)}$ is obtained.
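For concreteness, here is a self-contained toy simulation of one iteration of the above procedure (no real parallelism; stragglers are simulated by dropping responses). The coded copy of $D$ places block $D_{a,b}$ at exponent $b+ah$, which realizes the EPC structure with $p=h$, $q=1$, $r=h$, so that block $a$ of $DB^{(t)}$ appears at degree $h-1+ah$ of the product polynomial; all sizes and the straggler pattern are arbitrary choices for illustration.

```python
# One iteration of the coded update (16), simulated on a single machine.
import numpy as np

rng = np.random.default_rng(2)
h, l, d, W = 3, 12, 2, 15                     # recovery threshold K = h^2+h-1 = 11
A = rng.standard_normal((l, l)); D_full = A @ A.T           # symmetric D
B = rng.standard_normal((l, d))
D_blk = [np.hsplit(rb, h) for rb in np.vsplit(D_full, h)]
B_blk = np.vsplit(B, h)

f_L = lambda blks, x: sum(blks[i] * x ** i for i in range(h))
f_R = lambda blks, x: sum(blks[i] * x ** (h - 1 - i) for i in range(h))
D_t = lambda x: sum(D_blk[a][b] * x ** (b + a * h) for a in range(h) for b in range(h))

def decode(xs, mats, n_coeffs, wanted):
    """Entrywise interpolation of a matrix polynomial; return selected coefficients."""
    V = np.vander(np.array(xs[:n_coeffs]), N=n_coeffs, increasing=True)
    flat = np.stack(mats[:n_coeffs]).reshape(n_coeffs, -1)
    coeffs = np.linalg.solve(V, flat).reshape(n_coeffs, *mats[0].shape)
    return [coeffs[w] for w in wanted]

xs = [float(w + 1) for w in range(W)]
alive = xs[2:]                                # two workers straggle and never respond

# Steps 1-2: decode E = D B from h^2+h-1 responses and F = B^T B from 2h-1 responses.
r1a = [D_t(x) @ f_R(B_blk, x) for x in alive]
r1b = [f_L(B_blk, x).T @ f_R(B_blk, x) for x in alive]
E = np.vstack(decode(alive, r1a, h * h + h - 1, [h - 1 + a * h for a in range(h)]))
(F,) = decode(alive, r1b, 2 * h - 1, [h - 1])
E_blk = np.vsplit(E, h)

# Steps 3-4: decode B^T E from 2h-1 responses and form G = (B^T E)^{-1} F.
r2 = [f_L(B_blk, x).T @ f_R(E_blk, x) for x in alive]
(BtE,) = decode(alive, r2, 2 * h - 1, [h - 1])
G = np.linalg.inv(BtE) @ F

# Steps 5-6: decode B^{(t+1)} = E G from h responses (blocks appear in reverse order).
r3 = [f_R(E_blk, x) @ G for x in alive]
B_next = np.vstack(decode(alive, r3, h, list(range(h)))[::-1])

ref = D_full @ B @ np.linalg.inv(B.T @ D_full @ B) @ (B.T @ B)   # centralized (16)
print(np.allclose(B_next, ref))               # expect True
```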
Lemma 2.

(Recovery threshold) The recovery threshold of the algorithm is given by

K=h^{2}+h-1. \quad (38)

Proof: From the descriptions a), b), c) above, each iteration involves four distributed matrix multiplications. The recovery threshold is given by the maximum recovery threshold over these computations, i.e.,

K=\max(h^{2}+h-1,\ 2h-1,\ h)=h^{2}+h-1. \quad (39) ∎
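For example, with $h=3$ each iteration requires $K=3^{2}+3-1=11$ responses, so a system with $W=15$ workers tolerates up to $s=W-K=4$ stragglers in every round.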
Remark 1 (Optimality of the Recovery Thresholds).

Notice that, for a given parameter $h$, by the results on EPC in [9, Theorem 2], the EPC codes in a) and b) achieve the optimal recovery threshold among all linear codes for the corresponding matrix multiplication problems. The computation in c) also achieves the optimal recovery threshold, by a simple cut-set bound.

III-C3 Post-Computation

After the iterations, the master has obtained the final estimate of $B$, denoted by $\widetilde{B}$. The master first computes $\widetilde{H}=\widetilde{B}(\widetilde{B}^{\top}\widetilde{B})^{-1}$ as follows:

  1. The master sends $f_{\mathrm{L}}(\widetilde{B},x_{w})$ and $f_{\mathrm{R}}(\widetilde{B},x_{w})$ to worker $w$ for each $w\in[W]$;

  2. Each worker $w\in[W]$ computes $f_{\mathrm{L}}(\widetilde{B},x_{w})^{\top}f_{\mathrm{R}}(\widetilde{B},x_{w})$ and sends the result to the master;

  3. With any $2h-1$ responses, the master decodes

     \widetilde{F}=\widetilde{B}^{\top}\widetilde{B} \quad (40)

     by EPC decoding. Then it computes $\widetilde{F}^{-1}$ and sends it to all the worker nodes;

  4. Each worker computes

     f_{\mathrm{L}}(\widetilde{B},x_{w})\widetilde{F}^{-1}=f_{\mathrm{L}}(\widetilde{B}\widetilde{F}^{-1},x_{w}) \quad (41)

     and sends it back to the master;

  5. With any $h$ responses, the master decodes $\widetilde{H}=\widetilde{B}\widetilde{F}^{-1}$ by interpolating the polynomial $f_{\mathrm{L}}(\widetilde{B}\widetilde{F}^{-1},x)$.

Then the master continues to obtain the estimates of $V$ and $U$ as follows:

  1. If $m\geq n$, the estimate of $V$ is given by $\widetilde{V}=\widetilde{B}$, and the estimate of $U$ is given by

     \widetilde{U}=R\widetilde{H}, \quad (42a)

     which is computed with the standard EPC code with appropriate parameters.

  2. If $m<n$, the estimate of $U$ is given by $\widetilde{U}=\widetilde{B}$, and the estimate of $V$ is given by

     \widetilde{V}=R^{\top}\widetilde{H}, \quad (42b)

     which can be computed with the standard EPC code with appropriate parameters.

Remark 2.

Since the main computation load lies in the iterative computation of $B$, we omit the details of the computation of $D$ in (12) and of the computations in (42). One convenient choice of partition parameters is to partition $R$ (if $m\geq n$) or $R^{\top}$ (if $m<n$) into $r\times h$ equal-size sub-matrices for some $r$, and to partition $\widetilde{H}$ in the same form as in (25), so that the result $D$ has the form (21). Under such partitions, the EPC computation of (42) also achieves the optimal recovery threshold among all linear codes, since $q=1$ [9, Theorem 2].

IV Main Results

The following theorem characterizes the maximum number of stragglers that the algorithm can tolerate in relation to the coding redundancy.

Theorem 1.

The relation between the recovery threshold $K$ (or the maximum number of stragglers that the coded ALS algorithm can tolerate, $s=W-K$) and the coding redundancy $\mu$ is given by

K=\frac{W}{\mu}+\sqrt{\frac{W}{\mu}}-1,\quad \mu>1, \quad (43)

where $W\geq h^{2}$.

Proof:

Suppose each worker holds coded data of the same size, with $k$ elements. According to (21), each worker holds the equivalent of one partition of the data matrix, there are $h^{2}$ partitions in total, and the total number of elements is $h^{2}k$. Therefore $\mu$ can be written as

\mu=\frac{Wk}{h^{2}k}=\frac{W}{h^{2}}. \quad (44)

From Lemma 2, we know the relation between $K$ and $h$. Substituting $h=\sqrt{W/\mu}$ from (44) into (38) yields (43). ∎
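For instance, with $W=48$ workers and coding redundancy $\mu=3$, we have $W/\mu=16$, hence $h=4$ and $K=16+4-1=19$, so the coded ALS algorithm tolerates up to $s=W-K=29$ stragglers.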

Theorem 2.

(Computation Complexity Analysis) Given a distributed matrix factorization problem with $W$ workers, an $m\times n$ matrix $R$, latent dimension $d$, and coding redundancy $\mu$, the computation complexity of the proposed coded ALS algorithm is as follows.

  1. The pre-computation complexity (at the master) is $O(\min\{mn^{2},nm^{2}\})$ (one time).

  2. The computation complexity at each worker is $O(\frac{n^{2}d\mu}{W})$ per iteration. (Each iteration involves the updates of both $V$ and $U$.)

  3. The encoding and decoding complexities at the master are $O(nd\sqrt{W\mu})$ and $O(nd(\frac{W}{\mu}+\sqrt{\frac{W}{\mu}}))$ per iteration, respectively.

  4. In terms of the partition parameter $h$, the decoding complexity at the master is $O(nd(h^{2}+h))$ per iteration.

For the whole process depicted in Fig. 1, we use $T$ to denote the number of iterations and $n$ to denote the number of rows of $B$ (here we assume $n<m$ for $R$, so that $l=n$). The complexity of each stage of the algorithm can be derived as follows.

Proof:

Considering the sizes of the matrix multiplications involved, we can measure the computation complexity of the different procedures of the proposed algorithm. In the pre-computation stage, the only task is the computation of $R^{\top}R$ or $RR^{\top}$, whose complexity is $O(\min\{mn^{2},nm^{2}\})$. At each worker, the dominant step is the first one, i.e., computing (the coded block of) $E=DB$; the two matrices involved have $\frac{n^{2}}{h^{2}}$ and $\frac{nd}{h^{2}}$ elements respectively. Substituting $h=\sqrt{W/\mu}$ gives the per-worker complexity. The encoding and decoding operations are weighted sums over the matrices received from the workers and over the data partitions; counting the number of workers $W$ and the number of data partitions determined by $h$ gives the encoding and decoding complexities, respectively. ∎

Remark 3.

(Number of partitions) In the proposed algorithm, for common configurations, e.g., when $W<100$, choosing $h^{*}=\lfloor\sqrt{W+\frac{3}{4}-s}\rfloor$ yields a relatively small expected computation time $\mathbb{E}[T_{cp}]$ for the workers' computation tasks.

By the definition of stragglers, $s+K\leq W$, and since $K=h^{2}+h-1$, we can find the largest $h$ that the algorithm can afford, which saves a large amount of time in practice.

According to the definition of stragglers,

s+K\leq W. \quad (45)

There are four decoding procedures in the algorithm, as shown in Figure 1; the recovery threshold of the first step, which recovers $E$, is the largest, namely

K=f_{K}(h)=h^{2}+h-1. \quad (46)

In Remark 4, we discuss that the computation time $T_{cp}$ decreases as $h$ grows. Therefore, to obtain a smaller $T_{cp}$, we would like to make $h$ as large as possible.

It is easy to see that $f_{K}(h)$ is monotonically increasing for $h\geq 1$ and is upper bounded through (45). Therefore, (45) and (46) can be combined as

h^{2}+h+s-(W+1)\leq 0, \quad (47)

and the best $h^{*}$ is obtained by taking the largest $h$ satisfying (47), which reduces the computation time; a small helper for this is sketched below.
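As a small illustration (the helper below is ours, not part of the paper's pseudocode), the largest admissible $h$ under constraint (47) can be found with a few lines of Python:

```python
# Largest partition parameter h with h^2 + h - 1 + s <= W, cf. (45)-(47).
def largest_h(W: int, s: int) -> int:
    h = 1
    while (h + 1) ** 2 + (h + 1) - 1 + s <= W:
        h += 1
    return h

print(largest_h(W=50, s=4))   # -> 6, since K = 6^2 + 6 - 1 = 41 and 41 + 4 <= 50
```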

Remark 4.

In the workers' computation stage, when performing the matrix multiplications, the computation time $T_{cp}$ decreases as $h$ grows.

We consider two effects of the partitioning. First, more partitions mean that each worker handles a smaller proportion of the data, which speeds up its computation. Second, more partitions require more responding workers, which forces the master to wait for a higher order statistic of the workers' completion times. See Appendix A for the full explanation.

The parameters $\mu_{u}$ and $\sigma_{u}$ are determined by the performance of the machines, which is hard to measure in this setup. Intuitively, with a larger $h$, each worker has less data to process, so the total computation time can be smaller.

V Simulations

In this section, we design a simulation experiment to measure the running time of our algorithm.

We conduct our simulations on synthetic data generated by adding noise to the product of two randomly initialized matrices $U$ and $V$. In the simulation, we set the parameters $m=2400$, $n=1500$, $d=200$ and run the simulations with the number of non-stragglers $k=W-s=10,20,30,40,50$. We test the computation time of the proposed algorithm for different numbers of data partitions and record the results in Table I.

TABLE I: Computation time for different partitions
        h=2       h=3       h=4       h=5       h=6
k=10    7.59468   -         -         -         -
k=20    7.376128  3.952123  2.856271  -         -
k=30    6.581517  3.7818    2.729829  0.113088  -
k=40    6.860791  3.744443  1.285898  0.008618  -
k=50    6.852958  3.778484  0.937565  0.003492  0.000598

Table I shows the relationship between the computation time at the workers and the number of partitions $h$ of the data matrix $R$, for different numbers of non-straggling workers $k$. A dash in the table means that no data point is available, because the condition $h^{2}+h-1+s\leq W$ is not met. Intuitively, a finer partition (larger $h$) results in a shorter computation time at the workers, even though it requires more workers to be involved in the computation. For the same $h$, a larger $k$ makes the overall computation faster, because having more workers in total brings more workers that are fast. Given the same number of workers, we can also see that when the coding redundancy $\mu$ increases ($h$ decreases), the computation time grows while more stragglers can be tolerated, indicating a tradeoff between straggler mitigation ability and computation complexity in the simulation results.

VI Conclusion

We presented a distributed implementation of the ALS algorithm, which solves the matrix factorization problem in a distributed computation system. The procedure takes advantage of the entangled polynomial code as a building block, which resists stragglers. The relationship between the recovery threshold and the storage load is characterized. The simulation results indicate that with more workers and more partitions of the data matrix, we obtain a shorter computation time, thereby fully exploiting the benefit of distributed learning.

Appendix A

Let $T_{cp}$ denote the total time consumed by the workers' computations, so that

T_{cp}=T^{[1]}_{cp}+T^{[2]}_{cp}+T^{[3]}_{cp}+T^{[4]}_{cp}, \quad (48)

where $T^{[A]}_{cp}$ denotes the time of the $A$-th computation stage. Because the computations of $F$ and $B^{\top}E$ involve matrix multiplications of the same sizes, we have

\mathbb{E}T^{[2]}_{cp}=\mathbb{E}T^{[3]}_{cp}, \quad (49)

so the expected computation time can be represented as

\mathbb{E}T_{cp}=\mathbb{E}T_{cp}^{[1]}+2\,\mathbb{E}T_{cp}^{[2]}+\mathbb{E}T_{cp}^{[4]}, \quad (50)

where $T_{cp}^{[A]}=\sum_{t=1}^{T}T_{cp(t)}^{[A]}$, in which $t$ indexes the iterations of the algorithm and $T_{cp(t)}^{[A]}$ is the total time consumed by stage $A$ in the $t$-th iteration.

For each worker, the primary task is to compute (the coded block of) $E$, whose time contributes to $T_{cp}^{[1]}$; it takes $\sum_{l=1}^{n^{2}d/h^{2}}T_{u(l)}$ per worker per iteration. Computing the coded blocks of $F$, $B^{\top}E$ and $EG$, whose times contribute to $T_{cp}^{[2]}$, $T_{cp}^{[3]}$ and $T_{cp}^{[4]}$, takes $\sum_{l=1}^{nd^{2}/h}T_{u(l)}$ per worker per iteration, where $T_{u(l)}\sim N(\mu_{u},\sigma_{u}^{2})$ denotes the time of one element-wise multiplication at a worker.

In each stage, the master has to wait until sufficiently many of the $W-s$ non-straggling workers (the recovery threshold of that stage) have returned their results, so the computation time of stage $A$ in the $t$-th iteration can be represented as

T_{cp(t)}^{[A]}=\max\nolimits_{(i)}^{W-s}\,{}^{i}T_{cp(t)}^{[A]},\quad i\in[W], \quad (51)

where $\max_{(i)}^{W-s}$ denotes the corresponding order statistic of the times ${}^{i}T_{cp(t)}^{[A]}$ taken by the workers to return their results.

The matrix multiplications can be seen as element-wise operations at each worker. According to the Central Limit Theorem, ${}^{i}T_{cp(t)}^{[1]}$ can be approximated as

{}^{i}T_{cp(t)}^{[1]}\sim N\Big(\frac{n^{2}d}{h^{2}}\mu_{u},\ \frac{n^{2}d}{h^{2}}\sigma_{u}^{2}\Big), \quad (52)

where ${}^{i}T_{cp(t)}^{[1]}$ represents the time taken by worker $w_{i}$ in stage 1 of the $t$-th iteration.

The expected computation time can be represented as

\mathbb{E}T_{cp}^{[1]}=T\,\mathbb{E}\Big[\max\nolimits_{(i)}^{W-s}\,{}^{i}T_{cp(t)}^{[1]}\Big],\quad i\in[W]. \quad (53)

Similarly, $\mathbb{E}T_{cp}^{[2]}$ can be represented in the same form, with

{}^{i}T_{cp(t)}^{[2]}\sim N\Big(\frac{nd^{2}}{h}\mu_{u},\ \frac{nd^{2}}{h}\sigma_{u}^{2}\Big). \quad (54)

According to [18], the expected value of the $r$-th order statistic of $n$ i.i.d. normal samples can be approximated as

E(r:n)\approx\mu_{u}+\Phi^{-1}\Big(\frac{r-\alpha}{n-2\alpha+1}\Big)\sigma_{u}, \quad (55)

where $\alpha=0.375$ and $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
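A small sketch of this approximation, assuming SciPy for the standard normal quantile; the numbers plugged in are arbitrary illustrative values:

```python
# Approximate expected r-th order statistic of n i.i.d. N(mu_u, sigma_u^2) samples, Eq. (55).
from scipy.stats import norm

def expected_order_stat(r, n, mu_u, sigma_u, alpha=0.375):
    return mu_u + norm.ppf((r - alpha) / (n - 2 * alpha + 1)) * sigma_u

# e.g. waiting for the K = 11-th fastest of n = W - s = 20 non-straggling workers
print(expected_order_stat(r=11, n=20, mu_u=1.0, sigma_u=0.2))
```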

Because $n^{2}d\gg nd^{2}$, combining (52) and (54), we have $\mathbb{E}T_{cp}^{[1]}\gg\mathbb{E}T_{cp}^{[i]}$ for $i\in\{2,3,4\}$,

so the computation time in (48) can be approximated as

\mathbb{E}T_{cp}\approx T\,\mathbb{E}\Big[\max\nolimits_{(i)}^{W-s}\,{}^{i}T_{cp(t)}^{[1]}\Big] \quad (56)
\approx T\Big(\frac{n^{2}d}{h^{2}}\mu_{u}+\Phi^{-1}\Big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\Big)\frac{\sqrt{d}\,n}{h}\sigma_{u}\Big). \quad (57)

Let $\theta(h)$ denote the right-hand side of (57); then $\theta^{\prime}(h)$ can be written as

\theta^{\prime}(h)=-2\mu_{u}n^{2}dh^{-3}+\frac{(2h+1)\sqrt{d}\,n\sigma_{u}}{\phi\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)\,h}-2\,\Phi^{-1}\Big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\Big)\sqrt{d}\,n\sigma_{u}h^{-2}, \quad (58)

where $\phi(x)$ denotes the probability density function of the standard normal distribution.

Dividing both sides by $2n\sqrt{d}\sigma_{u}$, we get

\theta_{2}^{\prime}(h)=-\frac{\mu_{u}n\sqrt{d}\,h^{-3}}{\sigma_{u}}+\frac{1+\frac{1}{2h}}{\phi\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)}-\Phi^{-1}\Big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\Big)h^{-2}. \quad (60)

Hence we only need to determine whether (60) is positive or negative.

In (60), let us write $\phi(\cdot)$ for $\phi\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)$. The term $\phi(\cdot)$ is decreasing in $h$ for $h\geq 1$. When $h=1$, $\phi(\cdot)$ reaches its maximum $\phi\big(\frac{1-\alpha}{W-s-2\alpha+1}\big)$, which is less than 0.398 when $W-s\leq 100$. When $h=h_{M}$, where $h_{M}$ denotes the largest $h$ such that $s+h^{2}+h-1\leq W$, $\phi(\cdot)$ reaches its minimum $\min\phi(\cdot)=\phi\big(\frac{h_{M}^{2}+h_{M}-1-\alpha}{W-s-2\alpha+1}\big)\geq\phi\big(\frac{W-s-1-\alpha}{W-s-2\alpha+1}\big)$, which is greater than 0.246 when $W-s\leq 100$.

Therefore,

0.246\leq\phi(\cdot)\leq 0.398,\quad W-s\leq 100. \quad (61)

Meanwhile, $1+\frac{1}{2h}$ satisfies

1.056\leq 1+\frac{1}{2h}\leq 1.5, \quad (62)

so the second term in (60) satisfies

2.6537\leq\frac{1+\frac{1}{2h}}{\phi(\cdot)}\leq 6.0975. \quad (63)

Let us write $\Phi^{-1}(\cdot)$ for $\Phi^{-1}\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)$. Then $\Phi^{-1}(\cdot)$ is increasing for $h\geq 1$, so it reaches its maximum when $h=h_{M}$:

\max\Phi^{-1}(\cdot)\leq\Phi^{-1}\Big(\frac{W-s-\alpha}{W-s-2\alpha+1}\Big)\leq\Phi^{-1}(0.98379)=2.14,\quad W\leq 100, \quad (64)

and its minimum when $h=1$:

\min\Phi^{-1}(\cdot)\geq\Phi^{-1}\Big(\frac{1-\alpha}{W-s-2\alpha+1}\Big)\geq-2.5,\quad W\leq 100. \quad (65)

Therefore, $-2.14\leq-\Phi^{-1}(\cdot)h^{-2}\leq 2.5$; in other words, the second and third terms in (60) are both bounded. Hence, if we can guarantee

\frac{\mu_{u}n\sqrt{d}}{h^{3}\sigma_{u}}\geq 8.5975,\quad W\leq 100, \quad (66)

then $\theta_{2}^{\prime}(h)<0$, which implies that $\theta(h)$ decreases monotonically for $h\geq 1$, resulting in a shorter computation time for a larger $h$.

Appendix B

V^{(t+1)}=R^{\top}RV^{(t)}\big(V^{(t)\top}V^{(t)}\big)^{-1}\Big(\big(RV^{(t)}(V^{(t)\top}V^{(t)})^{-1}\big)^{\top}\big(RV^{(t)}(V^{(t)\top}V^{(t)})^{-1}\big)\Big)^{-1}, \quad (67)

in which, writing $V$ for $V^{(t)}$ for brevity, $\big(RV(V^{\top}V)^{-1}\big)^{\top}$ can be written as

\big((V^{\top}V)^{-1}\big)^{\top}V^{\top}R^{\top}. \quad (68)

Because $(A^{-1})^{\top}=(A^{\top})^{-1}$ and $V^{\top}V$ is a symmetric matrix, (68) equals $(V^{\top}V)^{-1}V^{\top}R^{\top}$. Combining (68) with (67), we get

V^{(t+1)}=R^{\top}RV(V^{\top}V)^{-1}\big((V^{\top}V)^{-1}V^{\top}R^{\top}RV(V^{\top}V)^{-1}\big)^{-1}. \quad (69)

Now, $U^{(t+1)\top}U^{(t+1)}=(V^{\top}V)^{-1}V^{\top}R^{\top}RV(V^{\top}V)^{-1}$ is invertible, so the term inside the rightmost brackets of (69) is invertible. From the fact that $(ABC)^{-1}=C^{-1}B^{-1}A^{-1}$, the rightmost factor of (69) can be written as

(V^{\top}V)\big(V^{\top}R^{\top}RV\big)^{-1}(V^{\top}V), \quad (70)

and (69) becomes

V^{(t+1)}=R^{\top}RV(V^{\top}V)^{-1}(V^{\top}V)\big(V^{\top}R^{\top}RV\big)^{-1}(V^{\top}V). \quad (71)

Simplifying (71), we finally get (8). The same procedure yields (9). ∎

References

  • [1] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, no. 8, pp. 30–37, 2009.
  • [2] J. Bennett, S. Lanning et al., “The netflix prize,” in Proceedings of KDD cup and workshop, vol. 2007.   New York, NY, USA., 2007, p. 35.
  • [3] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Straggler mitigation in distributed optimization through data encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 5434–5442.
  • [4] D. Data, L. Song, and S. Diggavi, “Data encoding for byzantine-resilient distributed gradient descent,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2018, pp. 863–870.
  • [5] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially coded regression: Optimal straggler mitigation via data encoding,” arXiv preprint arXiv:1805.09934, 2018.
  • [6] R. Bitar, M. Wootters, and S. E. Rouayheb, “Stochastic gradient coding for straggler mitigation in distributed learning,” arXiv preprint arXiv:1905.05383, 2019.
  • [7] E. Ozfatura, D. Gündüz, and S. Ulukus, “Speeding up distributed gradient descent by utilizing non-persistent stragglers,” in 2019 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2019, pp. 2729–2733.
  • [8] R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding and LDPC codes,” in 2019 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2019, pp. 2734–2738.
  • [9] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,” IEEE Transactions on Information Theory, vol. 66, no. 3, pp. 1920–1933, 2020.
  • [10] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in Advances in neural information processing systems, 2013, pp. 1223–1231.
  • [11] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  • [12] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving mapreduce performance in heterogeneous environments.” in Osdi, vol. 8, no. 4, 2008, p. 7.
  • [13] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2017.
  • [14] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in International Conference on Machine Learning, 2017, pp. 3368–3376.
  • [15] J. M. Neighbors, “The draco approach to constructing software from reusable components,” IEEE Transactions on Software Engineering, no. 5, pp. 564–574, 1984.
  • [16] W. Halbawi, N. Azizan, F. Salehi, and B. Hassibi, “Improving distributed gradient descent using reed-solomon codes,” in 2018 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2018, pp. 2027–2031.
  • [17] S. Dutta, Z. Bai, H. Jeong, T. M. Low, and P. Grover, “A unified coded deep neural network training strategy based on generalized polydot codes,” in 2018 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2018, pp. 1585–1589.
  • [18] J. Royston, “Algorithm as 177: Expected normal order statistics (exact and approximate),” Journal of the royal statistical society. Series C (Applied statistics), vol. 31, no. 2, pp. 161–165, 1982.