
Coded Alternating Least Squares for Straggler Mitigation in Distributed Recommendations

Siyuan Wang1, Qifa Yan2, Jingjing Zhang3, Jianping Wang1, Linqi Song1 1City University of Hong Kong, 2University of Illinois at Chicago, 3Fudan University
Email: sywang34-c@my.cityu.edu.hk, qifay2014@163.com, jingjingzhang@fudan.edu.cn, {jianwang, linqi.song}@cityu.edu.hk
(November 2020)
Abstract

Matrix factorization is an important representation learning algorithm in applications such as recommender systems, where a large matrix is factorized into the product of two low-dimensional matrices termed latent representations. This paper investigates the problem of matrix factorization in distributed computing systems with stragglers, i.e., compute nodes that are slow to return computation results. A computation procedure, called coded Alternating Least Squares (ALS), is proposed for mitigating the effect of stragglers in such systems. The coded ALS algorithm iteratively computes the two low-dimensional latent matrices by solving a sequence of linear equations, with the Entangled Polynomial Code (EPC) as a building block. We theoretically characterize the maximum number of stragglers that the algorithm can tolerate (or the recovery threshold) in relation to the redundancy of coding (or the code rate). In addition, we analyze the computation complexity of the coded ALS algorithm and conduct numerical experiments to validate our design.

I Introduction

Matrix factorization is one of the most successful algorithms for many machine learning tasks [1]. For example, recommender systems have played an increasingly important role in Internet business in recent years. Companies such as Amazon and Alibaba have used recommender systems to promote sales, and Netflix, HBO, and YouTube have used video recommender systems to recommend videos to target users. Since the Netflix Prize competition held by Netflix [2], the accuracy of recommendations has been greatly improved by matrix factorization algorithms.

With the large amounts of data available nowadays, distributed computation is an important approach to dealing with large scale data computations. Straggler nodes are one of the most prominent problems in distributed computing systems [3, 4, 5, 6, 7, 8]. A straggler is a node that runs much slower than the others, which may be caused by various software or hardware issues such as hardware unreliability or network congestion. As a result, straggler mitigation in distributed matrix multiplication, a basic building block of many machine learning algorithms, is crucial and has been extensively studied in the literature. Among the proposed solutions, coding techniques have recently attracted much attention in the information theory community, for example, the Entangled Polynomial Code (EPC) [9].

In this paper, we investigate the problem of large scale matrix factorization through distributed computation systems. Consider a data matrix $R$ of size $m\times n$, where the dimensions $m$ and $n$ are typically very large, so that each individual computing node can only deal with computations over matrices with dimensions much smaller than $\min\{m,n\}$. We aim to factorize $R$ approximately into the product $UV^{\top}$, where the dimensions of $U$ and $V$ are $m\times d$ and $n\times d$ respectively, for some latent dimension $d\ll\mathrm{rank}(R)\leq\min\{m,n\}$. The factorization of $R$ can be formalized as minimizing $||R-UV^{\top}||_{2}^{2}$ over $U,V$, where $||\cdot||_{2}$ is the Frobenius norm.

Alternating Least Squares (ALS) is an efficient iterative algorithm to find a solution by updating $U$ and $V$ alternately: in each iteration, the algorithm updates $U$ with the current estimate of $V$ and then updates $V$ with the updated estimate of $U$. We propose a distributed implementation of the ALS algorithm with the ability to tolerate straggler nodes. In distributed ALS, matrix multiplication is a key building block, and we adopt the EPC as a means to realize matrix multiplication, with special tailoring to the ALS algorithm, where multiple matrix multiplications are involved at each iteration. To speed up the iteration, we first derive formulas that update $U$ directly from the current estimate of $U$, or update $V$ directly from the current estimate of $V$. Based on the new update formulas, we only need to iterate on either $U$ or $V$, since the estimates of $U$ and $V$ are connected by the original ALS formulas.

Therefore, we propose a coded ALS framework as follows:

  1. Pre-computation: compute the transformed data matrix $RR^{\top}$ (or $R^{\top}R$) through the distributed computing system;

  2. Iterative computation: update $U$ (or $V$) through the distributed computing system;

  3. Post-computation: compute the estimate of $V$ (or $U$) from the estimate of $U$ (or $V$).

In this computation framework, the bottleneck is the iterative computation phase, as the pre- and post-computations only need to be carried out once. In the iterative computation, both $U$ and $V$ are partitioned into submatrices along the larger dimension (i.e., $m$ or $n$), and the transformed data matrix $RR^{\top}$ or $R^{\top}R$ is partitioned in both row and column dimensions and stored at the workers in coded form. We show that, with given partition parameters, the recovery threshold to compute matrix multiplications using the EPC code is optimal among all linear codes (for matrix multiplication). We characterize the relationship between the coding redundancy and the recovery threshold. In addition, we provide a computation complexity analysis for the proposed coded ALS algorithm. Finally, we conduct numerical experiments to validate our proposed design.

I-A Related Work

The slow machine problem has existed in distributed machine learning for a long time [10]. To tackle it, many different approaches have been proposed. In synchronous machine learning settings, one line of solutions uses speculative execution [11, 12]. However, this type of method requires much more communication time and thus performs poorly.

Adding redundancy is another effective way to cope with the straggler problem. With each worker bearing more information than it otherwise would, the final result can be recovered from the extra information. The idea of using codes to tackle the straggler problem in distributed learning tasks was introduced in [13]; however, that work only focuses on matrix multiplication and data shuffling. Since then, much more research has been devoted to this area. One typical approach is data encoding, where the encoded data is stored at different workers. Works such as [13, 4, 3] encode the data as linear combinations of the original data and recover the result according to the encoding matrix.

Another approach is to encode intermediate parameters of the computation; a typical example is gradient coding [14]. Many gradient-based methods have been proposed in recent years, such as [15, 16, 8]. However, these works focus only on gradient-based distributed learning tasks.

The work in [17] uses coding in iterative matrix multiplication, which shares a similar application scenario with this paper.

I-B Statement of Contributions

In this work, we make the following contributions. 1) We propose a coding scheme for the large scale matrix factorization problem arising in recommender systems. 2) We analyze the complexity of this scheme and the running time of the coded distributed computation. 3) We address the problem that the data partitions become too large when both the number of columns and the number of rows of the data matrix are large.

Notation

Throughout the paper, we use $[x]$ to denote the set $\{1,2,\ldots,x\}$, where $x\in\mathbb{N}_{+}$ is a positive integer. We denote by $||X||=(\sum_{i\in[m],j\in[n]}X_{ij}^{2})^{1/2}$ the Frobenius norm of a matrix $X\in\mathbb{R}^{m\times n}$. For ease of presentation, we simply write $||X||_{F}$ as $||X||$ when there is no confusion.

II Problem Formulation

II-A Matrix Factorization via ALS

Given a data matrix $R\in\mathbb{R}^{m\times n}$ and a latent dimension $d$, we consider the matrix factorization problem of learning the $d$-dimensional representations $U\in\mathbb{R}^{m\times d}$ and $V\in\mathbb{R}^{n\times d}$ as follows:

\operatorname*{minimize}_{U\in\mathbb{R}^{m\times d},\,V\in\mathbb{R}^{n\times d}}\ ||R-UV^{\top}||^{2}. \quad (1)

For example, matrix factorization can be used for the recommendation problem, where $m$ represents the number of users, $n$ represents the number of items, and the $(i,j)$ entry $R_{i,j}$ of the data matrix represents the rating (or preference) of user $i\in[m]$ for item $j\in[n]$. This rating can be approximated by the inner product of the latent vector $u_{i}$ of user $i$ and the latent vector $v_{j}$ of item $j$, where $u_{i}$ and $v_{j}$ are the $i$-th row of $U$ and the $j$-th row of $V$, respectively. The problem is to find representations $u_{i}$ and $v_{j}$ that minimize the differences $\sum_{i,j}||R_{ij}-u_{i}v_{j}^{\top}||^{2}$. Note that $m,n$ are often large and the latent dimension $d\ll\mathrm{rank}(R)\leq\min\{m,n\}$.

Since this matrix factorization problem is non-convex, it is in general not easy to find the optimal solution. A well-known iterative algorithm to solve this problem is the ALS algorithm, which alternates between optimizing $U$ for given $V$ and optimizing $V$ for given $U$. The update formulas for $U$ and $V$ are given as follows:

U^{(t+1)} = RV^{(t)}\big({V^{(t)}}^{\top}V^{(t)}\big)^{-1} \quad (2)
V^{(t+1)} = R^{\top}U^{(t+1)}\big({U^{(t+1)}}^{\top}U^{(t+1)}\big)^{-1} \quad (3)

where $U^{(t)}$ and $V^{(t)}$ are the estimates of $U$ and $V$ in the $t$-th iteration. In (2), $V$ is held fixed and $U$ is updated; the updated $U$ is then held fixed in (3) to update $V$, and the two updates alternate.
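As a point of reference before distributing the computation, the following minimal NumPy sketch runs the centralized updates (2)-(3) on a synthetic low-rank matrix; the sizes, seed, and the noiseless data are arbitrary illustrative choices, not settings from the paper.

```python
# Centralized ALS, Eqs. (2)-(3); a toy sketch with arbitrary sizes.
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 60, 40, 5
R = rng.standard_normal((m, d)) @ rng.standard_normal((d, n))  # rank-d data matrix
V = rng.standard_normal((n, d))                                # random initialization

for t in range(20):
    U = R @ V @ np.linalg.inv(V.T @ V)       # update U with V fixed, Eq. (2)
    V = R.T @ U @ np.linalg.inv(U.T @ U)     # update V with U fixed, Eq. (3)

print(np.linalg.norm(R - U @ V.T))           # residual shrinks across iterations
```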

II-B Distributed Matrix Factorization with Stragglers

For large dimensions $m,n$, the updates in (2) and (3) are impractical on a single computation node. We consider solving the matrix factorization problem via a 'master-worker' distributed computing system with one master and $W$ workers. There may be some straggling workers (i.e., stragglers) among them that perform the calculations slowly and degrade the system performance.

We consider a coding aided framework to solve this matrix factorization problem.

$\bullet$ Data matrix encoding and distribution: we first encode the data matrix $R$ into another matrix $\tilde{R}$ via some encoding function, and the encoded data matrix is distributed among the workers. For worker $w$, the encoding and data distribution can be represented as $\tilde{R}_{w}=\phi_{w}(R)$ for some function $\phi_{w}(\cdot)$.

$\bullet$ Iterative calculation and model aggregation: the calculation is carried out in an iterative manner. In each round $\tau$, each worker computes its model parameters and sends them to the master, namely, $\theta_{w}^{\tau+1}=f^{\tau}_{w}(\tilde{R}_{w},\theta^{\tau})$, where $f^{\tau}_{w}(\cdot)$ is the local computation function of worker $w$ at round $\tau$. However, there is a set of stragglers $W_{s}^{\tau}\subseteq W$ among these workers that either cannot finish the computation in time or cannot transmit successfully. The master aggregates the model parameters from the non-straggling workers in $W\backslash W_{s}^{\tau}$ to obtain a new model, namely, $\theta^{\tau+1}=g^{\tau}(\{\theta_{w}^{\tau+1}\}_{w\in W\backslash W_{s}^{\tau}},\theta^{\tau})$, where $g^{\tau}(\cdot)$ is the decoding and aggregation function at round $\tau$. After that, the master returns the updated parameters to all workers via a lossless broadcast channel.

Our aim is to design a coded ALS scheme $\{\phi_{w},f^{\tau}_{w},g^{\tau}\}$ to solve the distributed matrix factorization problem (in the sense that the coded ALS scheme achieves the same computation result as the centralized counterpart in Eqs. (2) and (3)) when there are no more than $s$ stragglers among the $W$ workers ($|W_{s}^{\tau}|\leq s,\forall\tau$).

Moreover, we would like to study how the coding scheme performs, and its computation complexity, with respect to the straggler mitigation capability $s$. In particular, let $\mathrm{size}(A)$ denote the number of elements in a matrix $A$. We define the redundancy of coding (coding rate) $\mu$ as the ratio between the total coded matrix size and the original data matrix size, $\mu=W\cdot\mathrm{size}(\tilde{R}_{w})/\mathrm{size}(R)$.

We then ask what the relationship between $s$ and $\mu$ is, and how it further affects the computation complexity.

III Distributed Computation of ALS Algorithm

In this section, we will present our framework to implement the ALS algorithm in distributed computing systems.

III-A Preliminary: Entangled Polynomial Code

Entangled Polynomial Code (EPC) [9] is an efficient linear code [9, Definition 1] for computing large scale matrix multiplications in distributed computation systems with stragglers.

The entangled polynomial code computes $A^{\top}B$ via $W$ distributed worker nodes, where $A$ is partitioned into $r\times p$ equal-size blocks $A_{j,i}$ and $B$ into $r\times q$ equal-size blocks $B_{j,k}$. Each worker node stores coded sub-matrices based on the polynomials

\widetilde{A}(x) = \sum_{j=0}^{r-1}\sum_{i=0}^{p-1} A_{j,i}\,x^{j+ir}, \quad (4)
\widetilde{B}(x) = \sum_{j=0}^{r-1}\sum_{k=0}^{q-1} B_{j,k}\,x^{r-1-j+krp}. \quad (5)

Specifically, let $x_{1},\ldots,x_{W}$ be $W$ distinct real numbers. Each worker $w\in[W]$ stores $\widetilde{A}(x_{w})$ and $\widetilde{B}(x_{w})$, computes $\widetilde{A}(x_{w})^{\top}\widetilde{B}(x_{w})$, and returns the result. It was shown in [9] that all the sub-matrices of the product $A^{\top}B$, i.e.,

C_{i,k}=\sum_{j=0}^{r-1}A_{j,i}^{\top}B_{j,k},\quad i\in[p],\ k\in[q], \quad (6)

are embedded as coefficients of the polynomial $\widetilde{C}(x)=\widetilde{A}(x)^{\top}\widetilde{B}(x)$, which has degree $pqr+r-2$. As a result, the master can decode via polynomial interpolation from the responses of any $K$ worker nodes, where

K=rpq+r-1. \quad (7)

Given the parameters $p,q,r$, the number $K$ of responses that the master needs for decoding is called the recovery threshold. It was shown in [9, Theorem 2] that, when $p=1$ or $q=1$, the EPC achieves the optimal recovery threshold among all linear codes.
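To make the EPC construction concrete, the following NumPy sketch encodes $A$ and $B$ according to (4)-(5), simulates $W$ workers (some of which never respond), and decodes $A^{\top}B$ by entrywise polynomial interpolation from $K=pqr+r-1$ responses. The block counts, evaluation points and straggler pattern are illustrative assumptions, not prescriptions from [9].

```python
# A toy EPC simulation for A^T B, following Eqs. (4)-(7).
import numpy as np

def epc_matmul(A, B, p, q, r, W, stragglers=()):
    # Partition A into r x p blocks A[j][i] and B into r x q blocks B[j][k].
    A_blk = [np.hsplit(rb, p) for rb in np.vsplit(A, r)]
    B_blk = [np.hsplit(rb, q) for rb in np.vsplit(B, r)]
    xs = np.arange(1.0, W + 1)                         # distinct evaluation points
    results = {}                                       # responses from non-stragglers
    for w, x in enumerate(xs):
        if w in stragglers:
            continue
        A_t = sum(A_blk[j][i] * x ** (j + i * r) for j in range(r) for i in range(p))
        B_t = sum(B_blk[j][k] * x ** (r - 1 - j + k * r * p) for j in range(r) for k in range(q))
        results[x] = A_t.T @ B_t                       # what worker w returns
    K = p * q * r + r - 1                              # recovery threshold, Eq. (7)
    pts = list(results)[:K]
    V = np.vander(np.array(pts), N=K, increasing=True) # interpolate C~(x) entrywise
    stacked = np.stack([results[x] for x in pts])
    coeffs = np.linalg.solve(V, stacked.reshape(K, -1)).reshape(K, *stacked.shape[1:])
    # Block C_{i,k} sits at degree (r-1) + i*r + k*r*p of C~(x), cf. Eq. (6).
    return np.block([[coeffs[(r - 1) + i * r + k * r * p] for k in range(q)] for i in range(p)])

rng = np.random.default_rng(0)
A, B = rng.standard_normal((6, 4)), rng.standard_normal((6, 6))
C = epc_matmul(A, B, p=2, q=3, r=2, W=16, stragglers={0, 5})
print(np.allclose(C, A.T @ B))                         # expect True
```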

III-B Direct Update Formulas

Unlike the traditional updates (2) and (3) of $U$ and $V$, here we choose to update only one of $U$ and $V$, based on the following new update formulas. Note that here each iteration consists of several rounds.

Lemma 1.

In the ALS algorithm, let $U^{(t)},V^{(t)}$ be the estimates of $U$ and $V$ in the $t$-th iteration. For each $t=1,2,\ldots,T$,

V^{(t+1)}=R^{\top}RV^{(t)}\big(V^{(t)\top}R^{\top}RV^{(t)}\big)^{-1}V^{(t)\top}V^{(t)} \quad (8)
U^{(t+1)}=RR^{\top}U^{(t)}\big(U^{(t)\top}RR^{\top}U^{(t)}\big)^{-1}U^{(t)\top}U^{(t)} \quad (9)
Proof:

Eq. (8) can be obtained by plugging Eq. (2) into Eq. (3), and Eq. (9) can be obtained by plugging Eq. (3) (with the superscript $(t+1)$ replaced by $(t)$) into Eq. (2). For the detailed proof, please refer to Appendix B.

The main computation iterations either update the estimate of $V$ according to (8) or update the estimate of $U$ according to (9), instead of both as in traditional ALS. Before the iterations, the matrix $R^{\top}R$ or $RR^{\top}$ is computed as a pre-computation for updating $V$ or $U$, respectively. After obtaining the estimate of $V$ or $U$, a post-computation recovers the other factor via relation (2) or (3).
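As a sanity check on Lemma 1, the following short NumPy snippet verifies numerically that the direct update (8) coincides with one full round of the classical updates (2)-(3); the sizes and random data are arbitrary.

```python
# Numerical check that Eq. (8) equals Eq. (2) followed by Eq. (3).
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 40, 30, 5
R = rng.standard_normal((m, n))
V = rng.standard_normal((n, d))

U_next = R @ V @ np.linalg.inv(V.T @ V)                        # Eq. (2)
V_classical = R.T @ U_next @ np.linalg.inv(U_next.T @ U_next)  # Eq. (3)

D = R.T @ R                                                    # pre-computed once
V_direct = D @ V @ np.linalg.inv(V.T @ D @ V) @ (V.T @ V)      # Eq. (8)

print(np.allclose(V_classical, V_direct))                      # expect True
```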

III-C Distributed ALS Algorithm

In this subsection, we describe our algorithm in detail. The whole computation consists of three phases.

III-C1 Pre-computation

If $m\geq n$, the master aims to compute $R^{\top}R$ in order to update the estimate of $V$ according to (8); otherwise, the master aims to compute $RR^{\top}$ in order to update the estimate of $U$ according to (9). This can be done using the standard EPC code with appropriate parameters $p,q,r$. For clarity, in the following we denote the matrix to be computed by $D$, and the factor to be updated by $B$, i.e.,

D = \begin{cases} R^{\top}R, & \mathrm{if}\ m\geq n\\ RR^{\top}, & \mathrm{if}\ m<n\end{cases} \quad (12)
B = \begin{cases} V, & \mathrm{if}\ m\geq n\\ U, & \mathrm{if}\ m<n\end{cases} \quad (15)

That is, to reduce the amount of data involved in the computation, we choose $D$ to be the smaller of $R^{\top}R$ and $RR^{\top}$, and we choose $B$ accordingly. Notice that, by (8) and (9), if $B^{(t)}$ denotes the estimate in the $t$-th iteration, then the update formula is

B^{(t+1)}=DB^{(t)}\big(B^{(t)\top}DB^{(t)}\big)^{-1}B^{(t)\top}B^{(t)}, \quad (16)

where $D$ is an $l\times l$ symmetric matrix given by (12), $l=\min\{m,n\}$, and $B^{(t)}$ is of size $l\times d$.

III-C2 Iterative Computation of $B$

To implement the ALS algorithm in the distributed computing system, the matrix $D$ is partitioned into $h^{2}$ equal-size submatrices, each of size $\frac{l}{h}\times\frac{l}{h}$, for some positive integer $h$, i.e.,

D=\begin{bmatrix} D_{0,0} & D_{0,1} & \ldots & D_{0,h-1}\\ D_{1,0} & D_{1,1} & \ldots & D_{1,h-1}\\ \vdots & \vdots & \ddots & \vdots\\ D_{h-1,0} & D_{h-1,1} & \ldots & D_{h-1,h-1}\end{bmatrix} \quad (21)

In accordance with the partition in (21), $B^{(t)}$ is partitioned into $h$ equal-size sub-matrices of size $\frac{l}{h}\times d$, i.e.,

B^{(t)}=\begin{bmatrix} B_{0}^{(t)}\\ \vdots\\ B_{h-1}^{(t)}\end{bmatrix},\quad\forall\, t\geq 0. \quad (25)

The bottleneck of updating the estimate of $B$ according to (16) lies in three matrix products, $DB^{(t)}$, $B^{(t)\top}DB^{(t)}$ and $B^{(t)\top}B^{(t)}$, together with the product of $DB^{(t)}$ and $\big(B^{(t)\top}DB^{(t)}\big)^{-1}B^{(t)\top}B^{(t)}$, all of which need to be computed at the distributed worker nodes. (Notice that, as the matrices $B^{(t)\top}DB^{(t)}$ and $B^{(t)\top}B^{(t)}$ are both of dimension $d\times d$, the inverse $\big(B^{(t)\top}DB^{(t)}\big)^{-1}$ and its product with $B^{(t)\top}B^{(t)}$ can be calculated at the master.)

For clarity, we define the following polynomials:

\widetilde{D}(x)=\sum_{j=0}^{h-1}\sum_{i=0}^{h-1}D_{j,i}\,x^{j+ih}, \quad (26)

where $\widetilde{D}$ represents the encoded version of the matrix $D$.

For any $C\in\mathbb{R}^{l\times d}$, partition $C$ in the same manner as in (25), i.e., $C=[C_{0}^{\top},\ldots,C_{h-1}^{\top}]^{\top}$ with $C_{i}\in\mathbb{R}^{\frac{l}{h}\times d}$, and define

f_{\mathrm{L}}(C,x) = C_{0}+C_{1}x+\ldots+C_{h-1}x^{h-1}, \quad (27)
f_{\mathrm{R}}(C,x) = C_{h-1}+C_{h-2}x+\ldots+C_{0}x^{h-1}. \quad (28)
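The following small sketch (an illustration with arbitrary toy sizes, assuming NumPy) evaluates the encoders $f_{\mathrm{L}}$ and $f_{\mathrm{R}}$ of (27)-(28) and checks that $C^{\top}C$ appears as the degree-$(h-1)$ coefficient of $f_{\mathrm{L}}(C,x)^{\top}f_{\mathrm{R}}(C,x)$, so that $2h-1$ evaluations suffice to decode it.

```python
# Encoders f_L, f_R of Eqs. (27)-(28) and decoding of C^T C from 2h-1 evaluations.
import numpy as np

def f_L(blocks, x):
    h = len(blocks)
    return sum(blocks[i] * x ** i for i in range(h))

def f_R(blocks, x):
    h = len(blocks)
    return sum(blocks[i] * x ** (h - 1 - i) for i in range(h))

rng = np.random.default_rng(3)
h, l, d = 3, 12, 2
C = rng.standard_normal((l, d))
blocks = np.vsplit(C, h)                                # C_0, ..., C_{h-1}

xs = np.arange(1.0, 2 * h)                              # 2h-1 distinct points
responses = np.stack([f_L(blocks, x).T @ f_R(blocks, x) for x in xs])

V = np.vander(xs, N=2 * h - 1, increasing=True)         # entrywise interpolation
coeffs = np.linalg.solve(V, responses.reshape(2 * h - 1, -1)).reshape(2 * h - 1, d, d)
print(np.allclose(coeffs[h - 1], C.T @ C))              # expect True
```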

Let $x_{1},x_{2},\ldots,x_{W}$ be $W$ distinct real numbers. The system operates as follows.

Initially, $B^{(0)}$ is generated randomly according to some continuous distribution. The master sends $\widetilde{D}(x_{w})$, $f_{\mathrm{L}}(B^{(0)},x_{w})$ and $f_{\mathrm{R}}(B^{(0)},x_{w})$ to worker $w$ for each $w\in[W]$.

In each iteration $t=0,1,\ldots,T$:

  1. Each worker $w\in[W]$ computes $\widetilde{D}(x_{w})f_{\mathrm{R}}(B^{(t)},x_{w})$ and $f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(B^{(t)},x_{w})$, then returns the results to the master;

  2. By the EPC, upon receiving any $h^{2}+h-1$ results among

     \{\widetilde{D}(x_{w})f_{\mathrm{R}}(B^{(t)},x_{w}):w\in[W]\}, \quad (29)

     the master can decode the matrix

     E^{(t)}\triangleq DB^{(t)}. \quad (30)

     Upon receiving any $2h-1$ results among

     \{f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(B^{(t)},x_{w}):w\in[W]\}, \quad (31)

     the master can decode the matrix

     F^{(t)}\triangleq B^{(t)\top}B^{(t)}. \quad (32)

     The master then sends $f_{\mathrm{R}}(E^{(t)},x_{w})$ to worker $w$ for each $w\in[W]$;

  3. Each worker $w\in[W]$ replaces the matrix $f_{\mathrm{R}}(B^{(t)},x_{w})$ with the matrix $f_{\mathrm{R}}(E^{(t)},x_{w})$. It then computes $f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(E^{(t)},x_{w})$ and sends the result to the master;

  4. Upon receiving any $2h-1$ responses among

     \{f_{\mathrm{L}}(B^{(t)},x_{w})^{\top}f_{\mathrm{R}}(E^{(t)},x_{w}):w\in[W]\}, \quad (33)

     the master decodes $B^{(t)\top}E^{(t)}$, then computes

     G^{(t)}=\big(B^{(t)\top}E^{(t)}\big)^{-1}F^{(t)}, \quad (34)

     and sends $G^{(t)}$ to all the worker nodes;

  5. Each worker $w\in[W]$ computes

     f_{\mathrm{R}}(E^{(t)},x_{w})G^{(t)}=f_{\mathrm{R}}(E^{(t)}G^{(t)},x_{w}) \quad (35)

     and returns $f_{\mathrm{R}}(E^{(t)}G^{(t)},x_{w})$ to the master;

  6. With any $h$ responses among

     \{f_{\mathrm{R}}(E^{(t)}G^{(t)},x_{w}):w\in[W]\}, \quad (36)

     the master decodes

     B^{(t+1)}=E^{(t)}G^{(t)}. \quad (37)

     The master then sends $f_{\mathrm{L}}(B^{(t+1)},x_{w})$ and $f_{\mathrm{R}}(B^{(t+1)},x_{w})$ to worker $w$ for each $w\in[W]$;

  7. Each worker $w\in[W]$ replaces $f_{\mathrm{L}}(B^{(t)},x_{w})$ and $f_{\mathrm{R}}(B^{(t)},x_{w})$ with $f_{\mathrm{L}}(B^{(t+1)},x_{w})$ and $f_{\mathrm{R}}(B^{(t+1)},x_{w})$ respectively, and then starts the $(t+1)$-th iteration.

The procedure iterates until $B^{(t)}$ converges. We now argue that the above iterative procedure is correct.

In fact, (16) is easily verified from (30), (32), (34) and (37). We only need to show the following facts:

  a) With any $h^{2}+h-1$ responses in (29), the master can decode $E^{(t)}$. Notice that the polynomials $\widetilde{D}(x)$ and $f_{\mathrm{R}}(B^{(t)},x)$ are created according to (4) and (5) respectively, with parameters $p=h$, $r=h$, $q=1$. Thus, the claim follows directly from the result on EPC.

  b) With any $2h-1$ responses in (31) and (33), the master can decode $B^{(t)\top}B^{(t)}$ and $B^{(t)\top}E^{(t)}$ respectively. This follows directly by observing that the polynomials $f_{\mathrm{L}}(C,x)$ and $f_{\mathrm{R}}(C,x)$ are created according to (4) and (5) respectively, with parameters $p=q=1$, $r=h$.

  c) With any $h$ responses in (36), the master can decode $E^{(t)}G^{(t)}$. This follows directly from the fact that the polynomial $f_{\mathrm{R}}(C,x)$ has degree $h-1$, so the polynomial $f_{\mathrm{R}}(E^{(t)}G^{(t)},x)$ can be recovered by interpolation from any $h$ responses in (36).

The whole process is illustrated in Figure 1; a toy end-to-end sketch is given after the figure caption.

Figure 1: The overall encoding and decoding process of the algorithm. In each iteration, $B$ is first encoded into the two copies $B_{L}$ and $B_{R}$; these are used to produce $E$ and $F$, then $G$ is computed, and from $G$ and $E$ the next estimate $B^{(t+1)}$ is obtained.
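For concreteness, here is a self-contained toy simulation of one iteration of the above procedure (no real parallelism; stragglers are simulated by dropping responses). The coded copy of $D$ places block $D_{a,b}$ at exponent $b+ah$, which realizes the EPC structure with $p=h$, $q=1$, $r=h$, so that block $a$ of $DB^{(t)}$ appears at degree $h-1+ah$ of the product polynomial; all sizes and the straggler pattern are arbitrary choices for illustration.

```python
# One iteration of the coded update (16), simulated on a single machine.
import numpy as np

rng = np.random.default_rng(2)
h, l, d, W = 3, 12, 2, 15                     # recovery threshold K = h^2+h-1 = 11
A = rng.standard_normal((l, l)); D_full = A @ A.T           # symmetric D
B = rng.standard_normal((l, d))
D_blk = [np.hsplit(rb, h) for rb in np.vsplit(D_full, h)]
B_blk = np.vsplit(B, h)

f_L = lambda blks, x: sum(blks[i] * x ** i for i in range(h))
f_R = lambda blks, x: sum(blks[i] * x ** (h - 1 - i) for i in range(h))
D_t = lambda x: sum(D_blk[a][b] * x ** (b + a * h) for a in range(h) for b in range(h))

def decode(xs, mats, n_coeffs, wanted):
    """Entrywise interpolation of a matrix polynomial; return selected coefficients."""
    V = np.vander(np.array(xs[:n_coeffs]), N=n_coeffs, increasing=True)
    flat = np.stack(mats[:n_coeffs]).reshape(n_coeffs, -1)
    coeffs = np.linalg.solve(V, flat).reshape(n_coeffs, *mats[0].shape)
    return [coeffs[w] for w in wanted]

xs = [float(w + 1) for w in range(W)]
alive = xs[2:]                                # two workers straggle and never respond

# Steps 1-2: decode E = D B from h^2+h-1 responses and F = B^T B from 2h-1 responses.
r1a = [D_t(x) @ f_R(B_blk, x) for x in alive]
r1b = [f_L(B_blk, x).T @ f_R(B_blk, x) for x in alive]
E = np.vstack(decode(alive, r1a, h * h + h - 1, [h - 1 + a * h for a in range(h)]))
(F,) = decode(alive, r1b, 2 * h - 1, [h - 1])
E_blk = np.vsplit(E, h)

# Steps 3-4: decode B^T E from 2h-1 responses and form G = (B^T E)^{-1} F.
r2 = [f_L(B_blk, x).T @ f_R(E_blk, x) for x in alive]
(BtE,) = decode(alive, r2, 2 * h - 1, [h - 1])
G = np.linalg.inv(BtE) @ F

# Steps 5-6: decode B^{(t+1)} = E G from h responses (blocks appear in reverse order).
r3 = [f_R(E_blk, x) @ G for x in alive]
B_next = np.vstack(decode(alive, r3, h, list(range(h)))[::-1])

ref = D_full @ B @ np.linalg.inv(B.T @ D_full @ B) @ (B.T @ B)   # centralized (16)
print(np.allclose(B_next, ref))               # expect True
```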
Lemma 2.

(Recovery threshold) The recovery threshold of the algorithm is given by

K=h^{2}+h-1. \quad (38)

Proof: From the descriptions a), b), c) above, each iteration involves four distributed matrix multiplications. The recovery threshold is given by the maximum recovery threshold over these computations, i.e.,

K=\max(h^{2}+h-1,\ 2h-1,\ h)=h^{2}+h-1. \quad (39) ∎
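For example, with $h=3$ each iteration requires $K=3^{2}+3-1=11$ responses, so a system with $W=15$ workers tolerates up to $s=W-K=4$ stragglers in every round.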
Remark 1 (Optimality of the Recovery Thresholds).

Notice that, for a given parameter $h$, by the results on EPC in [9, Theorem 2], the EPC codes in a) and b) achieve the optimal recovery threshold among all linear codes for the corresponding matrix multiplication problems. The computation in c) also achieves the optimal recovery threshold, by a simple cut-set bound.

III-C3 Post-Computation

After the iterations, the master has obtained the final estimate of $B$, denoted by $\widetilde{B}$. The master first computes $\widetilde{H}=\widetilde{B}(\widetilde{B}^{\top}\widetilde{B})^{-1}$ as follows:

  1. The master sends $f_{\mathrm{L}}(\widetilde{B},x_{w})$ and $f_{\mathrm{R}}(\widetilde{B},x_{w})$ to worker $w$ for each $w\in[W]$;

  2. Each worker $w\in[W]$ computes $f_{\mathrm{L}}(\widetilde{B},x_{w})^{\top}f_{\mathrm{R}}(\widetilde{B},x_{w})$ and sends the result to the master;

  3. With any $2h-1$ responses, the master decodes

     \widetilde{F}=\widetilde{B}^{\top}\widetilde{B} \quad (40)

     by EPC decoding. Then it computes $\widetilde{F}^{-1}$ and sends it to all the worker nodes;

  4. Each worker computes

     f_{\mathrm{L}}(\widetilde{B},x_{w})\widetilde{F}^{-1}=f_{\mathrm{L}}(\widetilde{B}\widetilde{F}^{-1},x_{w}) \quad (41)

     and sends it back to the master;

  5. With any $h$ responses, the master decodes $\widetilde{H}=\widetilde{B}\widetilde{F}^{-1}$ by interpolating the polynomial $f_{\mathrm{L}}(\widetilde{B}\widetilde{F}^{-1},x)$.

Then the master continues to obtain the estimates of $V$ and $U$ as follows:

  1. If $m\geq n$, the estimate of $V$ is given by $\widetilde{V}=\widetilde{B}$, and the estimate of $U$ is given by

     \widetilde{U}=R\widetilde{H}, \quad (42a)

     which is computed with the standard EPC code with appropriate parameters.

  2. If $m<n$, the estimate of $U$ is given by $\widetilde{U}=\widetilde{B}$, and the estimate of $V$ is given by

     \widetilde{V}=R^{\top}\widetilde{H}, \quad (42b)

     which can be computed with the standard EPC code with appropriate parameters.

Remark 2.

Since the main computation load lies in the iterative computation of $B$, we omit the details of the computation of $D$ in (12) and of the computations in (42). One convenient choice of partition parameters is to partition $R$ (if $m\geq n$) or $R^{\top}$ (if $m<n$) into $r\times h$ equal-size sub-matrices for some $r$, and to partition $\widetilde{H}$ in the same form as in (25), so that the result $D$ has the form (21). Under such partitions, the EPC computation of (42) also achieves the optimal recovery threshold among all linear codes, since $q=1$ [9, Theorem 2].

IV Main Results

The following theorem characterizes the maximum number of stragglers that the algorithm can tolerate in relation to the coding redundancy.

Theorem 1.

The relation between the recovery threshold $K$ (or the maximum number of stragglers that the coded ALS algorithm can tolerate, $s=W-K$) and the coding redundancy $\mu$ is given by

K=\frac{W}{\mu}+\sqrt{\frac{W}{\mu}}-1,\quad \mu>1, \quad (43)

where $W\geq h^{2}$.

Proof:

Suppose each worker holds coded data of the same size, with $k$ elements. According to (21), each worker holds the equivalent of one partition of the data matrix, there are $h^{2}$ partitions in total, and the total number of elements is $h^{2}k$. Therefore $\mu$ can be written as

\mu=\frac{Wk}{h^{2}k}=\frac{W}{h^{2}}. \quad (44)

From Lemma 2, we know the relation between $K$ and $h$. Substituting $h=\sqrt{W/\mu}$ from (44) into (38) yields (43). ∎
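For instance, with $W=48$ workers and coding redundancy $\mu=3$, we have $W/\mu=16$, hence $h=4$ and $K=16+4-1=19$, so the coded ALS algorithm tolerates up to $s=W-K=29$ stragglers.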

Theorem 2.

(Computation Complexity Analysis) Given a distributed matrix factorization problem with $W$ workers, an $m\times n$ matrix $R$, latent dimension $d$, and coding redundancy $\mu$, the computation complexity of the proposed coded ALS algorithm is as follows.

  1. The pre-computation complexity (at the master) is $O(\min\{mn^{2},nm^{2}\})$ (one time).

  2. The computation complexity at each worker is $O(\frac{n^{2}d\mu}{W})$ per iteration. (Each iteration involves the updates of both $V$ and $U$.)

  3. The encoding and decoding complexities at the master are $O(nd\sqrt{W\mu})$ and $O(nd(\frac{W}{\mu}+\sqrt{\frac{W}{\mu}}))$ per iteration, respectively.

  4. In terms of the partition parameter $h$, the decoding complexity at the master is $O(nd(h^{2}+h))$ per iteration.

For the whole process depicted in Fig. 1, we use $T$ to denote the number of iterations and $n$ to denote the number of rows of $B$ (here we assume $n<m$ for $R$, so that $l=n$). The complexity of each stage of the algorithm can be derived as follows.

Proof:

Considering the sizes of the matrix multiplications involved, we can measure the computation complexity of the different procedures of the proposed algorithm. In the pre-computation stage, the only task is the computation of $R^{\top}R$ or $RR^{\top}$, whose complexity is $O(\min\{mn^{2},nm^{2}\})$. At each worker, the dominant step is the first one, i.e., computing (the coded block of) $E=DB$; the two matrices involved have $\frac{n^{2}}{h^{2}}$ and $\frac{nd}{h^{2}}$ elements respectively. Substituting $h=\sqrt{W/\mu}$ gives the per-worker complexity. The encoding and decoding operations are weighted sums over the matrices received from the workers and over the data partitions; counting the number of workers $W$ and the number of data partitions determined by $h$ gives the encoding and decoding complexities, respectively. ∎

Remark 3.

(Number of partitions) In the proposed algorithm, for common configurations, e.g., when $W<100$, choosing $h^{*}=\lfloor\sqrt{W+\frac{3}{4}-s}\rfloor$ yields a relatively small expected computation time $\mathbb{E}[T_{cp}]$ for the workers' computation tasks.

By the definition of stragglers, $s+K\leq W$, and since $K=h^{2}+h-1$, we can find the largest $h$ that the algorithm can afford, which saves a large amount of time in practice.

According to the definition of stragglers,

s+K\leq W. \quad (45)

There are four decoding procedures in the algorithm, as shown in Figure 1; the recovery threshold of the first step, which recovers $E$, is the largest, namely

K=f_{K}(h)=h^{2}+h-1. \quad (46)

In Remark 4, we discuss that the computation time $T_{cp}$ decreases as $h$ grows. Therefore, to obtain a smaller $T_{cp}$, we would like to make $h$ as large as possible.

It is easy to see that $f_{K}(h)$ is monotonically increasing for $h\geq 1$ and is upper bounded through (45). Therefore, (45) and (46) can be combined as

h^{2}+h+s-(W+1)\leq 0, \quad (47)

and the best $h^{*}$ is obtained by taking the largest $h$ satisfying (47), which reduces the computation time; a small helper for this is sketched below.
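As a small illustration (the helper below is ours, not part of the paper's pseudocode), the largest admissible $h$ under constraint (47) can be found with a few lines of Python:

```python
# Largest partition parameter h with h^2 + h - 1 + s <= W, cf. (45)-(47).
def largest_h(W: int, s: int) -> int:
    h = 1
    while (h + 1) ** 2 + (h + 1) - 1 + s <= W:
        h += 1
    return h

print(largest_h(W=50, s=4))   # -> 6, since K = 6^2 + 6 - 1 = 41 and 41 + 4 <= 50
```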

Remark 4.

In the workers' computation stage, when performing the matrix multiplications, the computation time $T_{cp}$ decreases as $h$ grows.

We consider two effects of the partitioning. First, more partitions mean that each worker handles a smaller proportion of the data, which speeds up its computation. Second, more partitions require more responding workers, which forces the master to wait for a higher order statistic of the workers' completion times. See Appendix A for the full explanation.

The parameters $\mu_{u}$ and $\sigma_{u}$ are determined by the performance of the machines, which is hard to measure in this setup. Intuitively, with a larger $h$, each worker has less data to process, so the total computation time can be smaller.

V Simulations

In this section, we design a simulation experiment to measure the running time of our algorithm.

We conduct our simulations on synthetic data generated by adding noise to the product of two randomly initialized matrices $U$ and $V$. In the simulation, we set the parameters $m=2400$, $n=1500$, $d=200$ and run the simulations with the number of non-stragglers $k=W-s=10,20,30,40,50$. We test the computation time of the proposed algorithm for different numbers of data partitions and record the results in Table I.

TABLE I: Computation time for different partitions
        h=2       h=3       h=4       h=5       h=6
k=10    7.59468   -         -         -         -
k=20    7.376128  3.952123  2.856271  -         -
k=30    6.581517  3.7818    2.729829  0.113088  -
k=40    6.860791  3.744443  1.285898  0.008618  -
k=50    6.852958  3.778484  0.937565  0.003492  0.000598

Table I shows the relationship between the computation time at the workers and the number of partitions $h$ of the data matrix $R$, for different numbers of non-straggling workers $k$. A dash in the table means that no data point is available, because the condition $h^{2}+h-1+s\leq W$ is not met. Intuitively, a finer partition (larger $h$) results in a shorter computation time at the workers, even though it requires more workers to be involved in the computation. For the same $h$, a larger $k$ makes the overall computation faster, because having more workers in total brings more workers that are fast. Given the same number of workers, we can also see that when the coding redundancy $\mu$ increases ($h$ decreases), the computation time grows while more stragglers can be tolerated, indicating a tradeoff between straggler mitigation ability and computation complexity in the simulation results.

VI Conclusion

We presented a distributed implementation of the ALS algorithm, which solves the matrix factorization problem in a distributed computation system. The procedure takes advantage of the entangled polynomial code as a building block, which resists stragglers. The relationship between the recovery threshold and the storage load is characterized. The simulation results indicate that with more workers and more partitions of the data matrix, we obtain a shorter computation time, thereby fully exploiting the benefit of distributed learning.

Appendix A

Let $T_{cp}$ denote the total time consumed by the workers' computations, so that

T_{cp}=T^{[1]}_{cp}+T^{[2]}_{cp}+T^{[3]}_{cp}+T^{[4]}_{cp}, \quad (48)

where $T^{[A]}_{cp}$ denotes the time of the $A$-th computation stage. Because the computations of $F$ and $B^{\top}E$ involve matrix multiplications of the same sizes, we have

\mathbb{E}T^{[2]}_{cp}=\mathbb{E}T^{[3]}_{cp}, \quad (49)

so the expected computation time can be represented as

\mathbb{E}T_{cp}=\mathbb{E}T_{cp}^{[1]}+2\,\mathbb{E}T_{cp}^{[2]}+\mathbb{E}T_{cp}^{[4]}, \quad (50)

where $T_{cp}^{[A]}=\sum_{t=1}^{T}T_{cp(t)}^{[A]}$, in which $t$ indexes the iterations of the algorithm and $T_{cp(t)}^{[A]}$ is the total time consumed by stage $A$ in the $t$-th iteration.

For each worker, the primary task is to compute (the coded block of) $E$, whose time contributes to $T_{cp}^{[1]}$; it takes $\sum_{l=1}^{n^{2}d/h^{2}}T_{u(l)}$ per worker per iteration. Computing the coded blocks of $F$, $B^{\top}E$ and $EG$, whose times contribute to $T_{cp}^{[2]}$, $T_{cp}^{[3]}$ and $T_{cp}^{[4]}$, takes $\sum_{l=1}^{nd^{2}/h}T_{u(l)}$ per worker per iteration, where $T_{u(l)}\sim N(\mu_{u},\sigma_{u}^{2})$ denotes the time of one element-wise multiplication at a worker.

In each stage, the master has to wait until sufficiently many of the $W-s$ non-straggling workers (the recovery threshold of that stage) have returned their results, so the computation time of stage $A$ in the $t$-th iteration can be represented as

T_{cp(t)}^{[A]}=\max\nolimits_{(i)}^{W-s}\,{}^{i}T_{cp(t)}^{[A]},\quad i\in[W], \quad (51)

where $\max_{(i)}^{W-s}$ denotes the corresponding order statistic of the times ${}^{i}T_{cp(t)}^{[A]}$ taken by the workers to return their results.

The matrix multiplications can be seen as element-wise operations at each worker. According to the Central Limit Theorem, ${}^{i}T_{cp(t)}^{[1]}$ can be approximated as

{}^{i}T_{cp(t)}^{[1]}\sim N\Big(\frac{n^{2}d}{h^{2}}\mu_{u},\ \frac{n^{2}d}{h^{2}}\sigma_{u}^{2}\Big), \quad (52)

where ${}^{i}T_{cp(t)}^{[1]}$ represents the time taken by worker $w_{i}$ in stage 1 of the $t$-th iteration.

The expected computation time can be represented as

\mathbb{E}T_{cp}^{[1]}=T\,\mathbb{E}\Big[\max\nolimits_{(i)}^{W-s}\,{}^{i}T_{cp(t)}^{[1]}\Big],\quad i\in[W]. \quad (53)

Similarly, $\mathbb{E}T_{cp}^{[2]}$ can be represented in the same form, with

{}^{i}T_{cp(t)}^{[2]}\sim N\Big(\frac{nd^{2}}{h}\mu_{u},\ \frac{nd^{2}}{h}\sigma_{u}^{2}\Big). \quad (54)

According to [18], the expected value of the $r$-th order statistic of $n$ i.i.d. normal samples can be approximated as

E(r:n)\approx\mu_{u}+\Phi^{-1}\Big(\frac{r-\alpha}{n-2\alpha+1}\Big)\sigma_{u}, \quad (55)

where $\alpha=0.375$ and $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
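A small sketch of this approximation, assuming SciPy for the standard normal quantile; the numbers plugged in are arbitrary illustrative values:

```python
# Approximate expected r-th order statistic of n i.i.d. N(mu_u, sigma_u^2) samples, Eq. (55).
from scipy.stats import norm

def expected_order_stat(r, n, mu_u, sigma_u, alpha=0.375):
    return mu_u + norm.ppf((r - alpha) / (n - 2 * alpha + 1)) * sigma_u

# e.g. waiting for the K = 11-th fastest of n = W - s = 20 non-straggling workers
print(expected_order_stat(r=11, n=20, mu_u=1.0, sigma_u=0.2))
```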

Because $n^{2}d\gg nd^{2}$, combining (52) and (54), we have $\mathbb{E}T_{cp}^{[1]}\gg\mathbb{E}T_{cp}^{[i]}$ for $i\in\{2,3,4\}$,

so the computation time in (48) can be approximated as

\mathbb{E}T_{cp}\approx T\,\mathbb{E}\Big[\max\nolimits_{(i)}^{W-s}\,{}^{i}T_{cp(t)}^{[1]}\Big] \quad (56)
\approx T\Big(\frac{n^{2}d}{h^{2}}\mu_{u}+\Phi^{-1}\Big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\Big)\frac{\sqrt{d}\,n}{h}\sigma_{u}\Big). \quad (57)

Let $\theta(h)$ denote the right-hand side of (57); then $\theta^{\prime}(h)$ can be written as

\theta^{\prime}(h)=-2\mu_{u}n^{2}dh^{-3}+\frac{(2h+1)\sqrt{d}\,n\sigma_{u}}{\phi\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)\,h}-2\,\Phi^{-1}\Big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\Big)\sqrt{d}\,n\sigma_{u}h^{-2}, \quad (58)

where $\phi(x)$ denotes the probability density function of the standard normal distribution.

Dividing both sides by $2n\sqrt{d}\sigma_{u}$, we get

\theta_{2}^{\prime}(h)=-\frac{\mu_{u}n\sqrt{d}\,h^{-3}}{\sigma_{u}}+\frac{1+\frac{1}{2h}}{\phi\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)}-\Phi^{-1}\Big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\Big)h^{-2}. \quad (60)

Hence we only need to determine whether (60) is positive or negative.

In (60), let us write $\phi(\cdot)$ for $\phi\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)$. The term $\phi(\cdot)$ is decreasing in $h$ for $h\geq 1$. When $h=1$, $\phi(\cdot)$ reaches its maximum $\phi\big(\frac{1-\alpha}{W-s-2\alpha+1}\big)$, which is less than 0.398 when $W-s\leq 100$. When $h=h_{M}$, where $h_{M}$ denotes the largest $h$ such that $s+h^{2}+h-1\leq W$, $\phi(\cdot)$ reaches its minimum $\min\phi(\cdot)=\phi\big(\frac{h_{M}^{2}+h_{M}-1-\alpha}{W-s-2\alpha+1}\big)\geq\phi\big(\frac{W-s-1-\alpha}{W-s-2\alpha+1}\big)$, which is greater than 0.246 when $W-s\leq 100$.

Therefore,

0.246\leq\phi(\cdot)\leq 0.398,\quad W-s\leq 100. \quad (61)

Meanwhile, $1+\frac{1}{2h}$ satisfies

1.056\leq 1+\frac{1}{2h}\leq 1.5, \quad (62)

so the second term in (60) satisfies

2.6537\leq\frac{1+\frac{1}{2h}}{\phi(\cdot)}\leq 6.0975. \quad (63)

Let us write $\Phi^{-1}(\cdot)$ for $\Phi^{-1}\big(\frac{h^{2}+h-1-\alpha}{W-s-2\alpha+1}\big)$. Then $\Phi^{-1}(\cdot)$ is increasing for $h\geq 1$, so it reaches its maximum when $h=h_{M}$:

\max\Phi^{-1}(\cdot)\leq\Phi^{-1}\Big(\frac{W-s-\alpha}{W-s-2\alpha+1}\Big)\leq\Phi^{-1}(0.98379)=2.14,\quad W\leq 100, \quad (64)

and its minimum when $h=1$:

\min\Phi^{-1}(\cdot)\geq\Phi^{-1}\Big(\frac{1-\alpha}{W-s-2\alpha+1}\Big)\geq-2.5,\quad W\leq 100. \quad (65)

Therefore, $-2.14\leq-\Phi^{-1}(\cdot)h^{-2}\leq 2.5$; in other words, the second and third terms in (60) are both bounded. Hence, if we can guarantee

\frac{\mu_{u}n\sqrt{d}}{h^{3}\sigma_{u}}\geq 8.5975,\quad W\leq 100, \quad (66)

then $\theta_{2}^{\prime}(h)<0$, which implies that $\theta(h)$ decreases monotonically for $h\geq 1$, resulting in a shorter computation time for a larger $h$.

Appendix B

V^{(t+1)}=R^{\top}RV^{(t)}\big(V^{(t)\top}V^{(t)}\big)^{-1}\Big(\big(RV^{(t)}(V^{(t)\top}V^{(t)})^{-1}\big)^{\top}\big(RV^{(t)}(V^{(t)\top}V^{(t)})^{-1}\big)\Big)^{-1}, \quad (67)

in which, writing $V$ for $V^{(t)}$ for brevity, $\big(RV(V^{\top}V)^{-1}\big)^{\top}$ can be written as

\big((V^{\top}V)^{-1}\big)^{\top}V^{\top}R^{\top}. \quad (68)

Because $(A^{-1})^{\top}=(A^{\top})^{-1}$ and $V^{\top}V$ is a symmetric matrix, (68) equals $(V^{\top}V)^{-1}V^{\top}R^{\top}$. Combining (68) with (67), we get

V^{(t+1)}=R^{\top}RV(V^{\top}V)^{-1}\big((V^{\top}V)^{-1}V^{\top}R^{\top}RV(V^{\top}V)^{-1}\big)^{-1}. \quad (69)

Now, $U^{(t+1)\top}U^{(t+1)}=(V^{\top}V)^{-1}V^{\top}R^{\top}RV(V^{\top}V)^{-1}$ is invertible, so the term inside the rightmost brackets of (69) is invertible. From the fact that $(ABC)^{-1}=C^{-1}B^{-1}A^{-1}$, the rightmost factor of (69) can be written as

(V^{\top}V)\big(V^{\top}R^{\top}RV\big)^{-1}(V^{\top}V), \quad (70)

and (69) becomes

V^{(t+1)}=R^{\top}RV(V^{\top}V)^{-1}(V^{\top}V)\big(V^{\top}R^{\top}RV\big)^{-1}(V^{\top}V). \quad (71)

Simplifying (71), we finally get (8). The same procedure yields (9). ∎

References

  • [1] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, no. 8, pp. 30–37, 2009.
  • [2] J. Bennett, S. Lanning et al., “The netflix prize,” in Proceedings of KDD cup and workshop, vol. 2007.   New York, NY, USA., 2007, p. 35.
  • [3] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Straggler mitigation in distributed optimization through data encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 5434–5442.
  • [4] D. Data, L. Song, and S. Diggavi, “Data encoding for byzantine-resilient distributed gradient descent,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2018, pp. 863–870.
  • [5] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially coded regression: Optimal straggler mitigation via data encoding,” arXiv preprint arXiv:1805.09934, 2018.
  • [6] R. Bitar, M. Wootters, and S. E. Rouayheb, “Stochastic gradient coding for straggler mitigation in distributed learning,” arXiv preprint arXiv:1905.05383, 2019.
  • [7] E. Ozfatura, D. Gündüz, and S. Ulukus, “Speeding up distributed gradient descent by utilizing non-persistent stragglers,” in 2019 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2019, pp. 2729–2733.
  • [8] R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding and LDPC codes,” in 2019 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2019, pp. 2734–2738.
  • [9] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,” IEEE Transactions on Information Theory, vol. 66, no. 3, pp. 1920–1933, 2020.
  • [10] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in Advances in neural information processing systems, 2013, pp. 1223–1231.
  • [11] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  • [12] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving mapreduce performance in heterogeneous environments.” in Osdi, vol. 8, no. 4, 2008, p. 7.
  • [13] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2017.
  • [14] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in International Conference on Machine Learning, 2017, pp. 3368–3376.
  • [15] J. M. Neighbors, “The draco approach to constructing software from reusable components,” IEEE Transactions on Software Engineering, no. 5, pp. 564–574, 1984.
  • [16] W. Halbawi, N. Azizan, F. Salehi, and B. Hassibi, “Improving distributed gradient descent using reed-solomon codes,” in 2018 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2018, pp. 2027–2031.
  • [17] S. Dutta, Z. Bai, H. Jeong, T. M. Low, and P. Grover, “A unified coded deep neural network training strategy based on generalized polydot codes,” in 2018 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2018, pp. 1585–1589.
  • [18] J. Royston, “Algorithm as 177: Expected normal order statistics (exact and approximate),” Journal of the royal statistical society. Series C (Applied statistics), vol. 31, no. 2, pp. 161–165, 1982.