
A federated Kaczmarz algorithm

Halyun Jeong (hjeong2@albany.edu), Department of Mathematics & Statistics, The State University of New York at Albany, Albany, NY 12222, USA

Deanna Needell (deanna@math.ucla.edu), Department of Mathematics, University of California Los Angeles, Los Angeles, CA 90095, USA

Chi-Hao Wu (chwu93@math.ucla.edu), Department of Mathematics, University of California Los Angeles, Los Angeles, CA 90095, USA
Abstract

In this paper, we propose a federated algorithm for solving large linear systems that is inspired by the classic randomized Kaczmarz algorithm. We provide convergence guarantees for the proposed method, and as a corollary of our analysis, we provide a new proof for the convergence of the classic randomized Kaczmarz method. We demonstrate experimentally the behavior of our method when applied to related problems. For underdetermined systems, we demonstrate that our algorithm can be used for sparse approximation. For inconsistent systems, we demonstrate that our algorithm converges to a horizon of the least squares solution. Finally, we apply our algorithm to real data and show that its feature selection is consistent with that of Lasso, while still offering the computational advantages of the Kaczmarz framework and thresholding-based algorithms in the federated setting.

Keywords: federated learning, Kaczmarz method, sparse approximation, feature selection

MSC Classification: 65F10, 65F20

1 Introduction

In this paper, we propose a federated algorithm for solving large linear systems. Federated learning was originally proposed by [1] to train neural networks in a decentralized setting. The global model is trained across multiple clients (e.g., mobile devices, sensors, or edge nodes) without transferring local data to a central server. This method improves privacy, reduces communication overhead, and enables learning from heterogeneous, distributed datasets; for instance, see [2] for more details. On the other hand, the Kaczmarz algorithm [3] is an iterative method for solving overdetermined linear systems, and the randomized Kaczmarz algorithm (RK) [4] is a version with a particular sampling scheme. Given a matrix $A\in\mathbb{R}^{m\times n}$, we denote by $a_{j}$ the $j$-th row of $A$, and we denote the Frobenius norm,

\|A\|_{F}^{2}=\sum_{j=1}^{m}\|a_{j}\|_{2}^{2}.

To solve a linear system $Ax=b$ for $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$, RK considers the iteration,

x_{k+1}=x_{k}+\frac{b_{j_{k}}-\langle a_{j_{k}},x_{k}\rangle}{\|a_{j_{k}}\|^{2}}\,a_{j_{k}},\quad k=0,1,\dotsc,

where the $j_{k}$'s are independent identically distributed (i.i.d.) random variables with distribution $\mathbb{P}(j_{k}=j)=\|a_{j}\|_{2}^{2}/\|A\|_{F}^{2}$ for $k=0,1,\dotsc$. For consistent overdetermined linear systems, it was shown in [4] that RK converges linearly to the solution in expectation. In a broader framework, it is also known that RK can be viewed as a stochastic gradient descent method with carefully chosen step sizes [5].
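For concreteness, the iteration above can be sketched in a few lines of NumPy; the function name `randomized_kaczmarz`, the zero initialization, and the fixed iteration budget below are our own illustrative choices rather than part of the method's specification.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=1000, x0=None, rng=None):
    """Minimal RK sketch: sample row j with probability ||a_j||^2 / ||A||_F^2
    and project the iterate onto the hyperplane {x : <a_j, x> = b_j}."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()      # P(j_k = j) = ||a_j||^2 / ||A||_F^2
    for _ in range(num_iters):
        j = rng.choice(m, p=probs)
        a_j = A[j]
        x = x + (b[j] - a_j @ x) / row_norms_sq[j] * a_j
    return x
```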

We consider the federated setting where there are $M$ local clients, and each client owns a subset of the equations. In particular, let $Ax=b$ be the overdetermined, consistent linear system we aim to solve, where $A\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^{m}$. The system is partitioned into $M$ parts,

A=\begin{bmatrix}A_{1}\\ \vdots\\ A_{M}\end{bmatrix}\quad\mbox{and}\quad b=\begin{bmatrix}b_{1}\\ \vdots\\ b_{M}\end{bmatrix},

and the $i$-th client sees the system $A_{i}x=b_{i}$ for $i\in[M]$; note that locally the system $A_{i}x=b_{i}$ may be underdetermined. We briefly describe the framework of federated algorithms using federated averaging (FedAvg) [1] as an example. At each federated round, a subset of local clients is selected, and the global server broadcasts the current global parameters (position) to the selected local clients. Each selected client performs local updates using the global position as its initialization and, after some iterations, returns the updated local position. The global server averages the local updates returned by the clients and uses the averaged position as the new global position. The process is iterated until convergence. Following the notation used in the federated optimization community [2], we denote by $x^{(t)}\in\mathbb{R}^{n}$ the solution at the global server after $t$ federated rounds, and by $x_{i}^{(t,k)}$ the solution at the $i$-th client after $k$ local iterations during the $(t+1)$-th federated round. We propose Algorithm 1 for solving large linear systems in a federated setting.

We briefly comment on the ideas behind Algorithm 1 and the difficulties in the convergence analysis. In our setting, we allow the linear systems at the local clients, $A_{i}x=b_{i}$, to be underdetermined for $i\in\{1,\dotsc,M\}$, and it is known that RK converges to the orthogonal projection of the initial position onto the affine subsets

C_{i}=\left\{x\in\mathbb{R}^{n}:A_{i}x=b_{i}\right\},\quad i=1,\dotsc,M (1)

for underdetermined linear systems; in particular, if the initial position is orthogonal to the null space of $A_{i}$, RK converges to the least-norm solution of $A_{i}x=b_{i}$; see [6]. Based on this observation, we see that $\Delta_{i}^{(t)}$ defined in Algorithm 1 approximates a normal vector (up to a normalizing constant) to a hyperplane containing $C_{i}$, and projecting $x^{(t)}$ onto the $C_{i}$'s is approximately the same as projecting $x^{(t)}$ onto

\tilde{C}_{i}=\left\{x\in\mathbb{R}^{n}:\Delta_{i}^{(t)}x=d_{i}\right\},\quad i=1,\dotsc,M.

Therefore, instead of treating the local model changes $\Delta_{i}^{(t)}$ simply as displacements, we transform them into the approximate hyperplanes $\tilde{C}_{i}$, which the server then uses to run RK. In some sense, Algorithm 1 is a stochastic process in which, at each time $t\in\{0,1,\dotsc,T-1\}$, we choose a collection of affine subsets, dependent on $x^{(t)}$, onto which to project. Given that it is difficult to study the convergence with a purely algebraic approach, we take a more geometric approach; see Section 2 for details.

Algorithm 1 FedRK
Initial model $x^{(0)}$
for t = 0 to $T-1$ do
     Sample a subset $S^{(t)}$ of clients
     for client $i\in S^{(t)}$ do
         Initialize local model $x_{i}^{(t,0)}=x^{(t)}$
         for $k=0,\dotsc,\tau-1$ do
              Perform one RK iteration on $A_{i}x=b_{i}$ initialized with $x_{i}^{(t,k)}$ to obtain $x_{i}^{(t,k+1)}$
         end for
         Compute the local model change $\Delta_{i}^{(t)}=x_{i}^{(t,\tau)}-x_{i}^{(t,0)}$
     end for
     for $i\in S^{(t)}$ do
         Define $d_{i}=\langle\Delta_{i}^{(t)},\Delta_{i}^{(t)}+x^{(t)}\rangle$
     end for
     Apply RK with a uniform sampling scheme to solve $\Delta^{(t)}x=d$, where the rows of $\Delta^{(t)}$ are the $\Delta_{i}^{(t)}$'s and $d=(d_{i})_{i\in S^{(t)}}$ (ignoring the rows where $\Delta_{i}^{(t)}=0$), and let $x^{(t+1)}$ be the iterate after $\tau_{g}$ iterations
end for
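To make the round structure concrete, below is a minimal NumPy sketch of one FedRK round as we read Algorithm 1. Several details are our own simplifying assumptions rather than part of the pseudocode: client data is passed as a list of $(A_{i},b_{i})$ pairs, both the local and the server-side RK steps use uniform row sampling, the server-side RK is initialized at the current global model, and the helper names (`rk_step`, `fedrk_round`) are hypothetical.

```python
import numpy as np

def rk_step(A, b, x, rng):
    """One RK iteration on A x = b with uniform row sampling (a simplifying choice)."""
    j = rng.integers(A.shape[0])
    a = A[j]
    return x + (b[j] - a @ x) / (a @ a) * a

def fedrk_round(client_data, x_global, num_selected, tau, tau_g, rng):
    """One round of FedRK: local RK runs, then server-side RK on Delta^{(t)} x = d."""
    selected = rng.choice(len(client_data), size=num_selected, replace=False)
    deltas, ds = [], []
    for i in selected:
        A_i, b_i = client_data[i]
        x_local = x_global.copy()
        for _ in range(tau):                          # tau local RK iterations
            x_local = rk_step(A_i, b_i, x_local, rng)
        delta = x_local - x_global                    # local model change Delta_i
        if np.linalg.norm(delta) > 1e-12:             # ignore zero rows
            deltas.append(delta)
            ds.append(delta @ (delta + x_global))     # d_i = <Delta_i, Delta_i + x^{(t)}>
    if not deltas:
        return x_global
    Delta, d = np.vstack(deltas), np.array(ds)
    x = x_global.copy()
    for _ in range(tau_g):                            # tau_g server-side RK iterations
        x = rk_step(Delta, d, x, rng)
    return x
```

A full run would simply iterate `fedrk_round` for $T$ rounds starting from $x^{(0)}$.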

For the convergence analysis, we consider two scenarios. First, we study the scenario where the local clients run finitely many iterations and the server runs one iteration ($\tau_{g}=1$ in Algorithm 1). Second, we study the scenario where the local clients run infinitely many iterations ($\tau=\infty$ in Algorithm 1) and the server runs some finite number of iterations. We now state the assumptions made in our main theorems; they are in force throughout, except in Corollary 1, where the system is allowed to be underdetermined.

Assumption 1.

The linear system $Ax=b$ is overdetermined.

By an appropriate translation, one can assume that the true solution is $x^{*}=0$, and thus it is natural to also include the second assumption:

Assumption 2.

The solution to the linear system is $x^{*}=0$.

The proof for the first scenario ($\tau_{g}=1$) utilizes that of RK applied to a suitable linear system. Specifically, we show the following:

Theorem 1.

Following the notation in Algorithm 1, assume $\tau=T$, $\tau_{g}=1$ and $|S^{(t)}|\equiv N\in\mathbb{N}$. Let $X^{(t+1)}$ be the global update after $t+1$ iterations. Then there exists $0<\beta<1$ such that

\mathbb{E}\|X^{(t+1)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc.

For the second scenario, let us first assume that the server performs one RK iteration ($\tau_{g}=1$ in Algorithm 1). Since we assume that the solution is $x^{*}=0$, the $C_{i}$'s defined in (1) are linear subspaces for $i\in[M]$. Denote by $P_{C_{i}}$ the orthogonal projection operator onto $C_{i}$ for $i\in[M]$. Following the notation in Algorithm 1, we see that

\mathbb{E}X^{(t+1)}=\sum_{s\in S^{(t)}}\frac{1}{|S^{(t)}|}\left(I-\frac{(X^{(t)}-P_{C_{s}}X^{(t)})(X^{(t)}-P_{C_{s}}X^{(t)})^{T}}{\|X^{(t)}-P_{C_{s}}X^{(t)}\|^{2}}\right)X^{(t)} (2)

describes the expected update of our federated Kaczmarz algorithm. We first prove a technical theorem that characterizes the decay of Algorithm 1 when the current position is one unit distance away from the true solution, which involves studying a function related to (2); see Theorem 4. Then we use the technical theorem to prove the following:

Theorem 2.

Following the notation in Algorithm 1, assume $\tau=\infty$, $\tau_{g}=T$ and $|S^{(t)}|\equiv N\in\mathbb{N}$. Let $X^{(t+1)}$ be the global update after $t+1$ iterations. Then there exists $0<\beta<1$ such that

\mathbb{E}\|X^{(t+1)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc.

Perhaps interestingly, Theorem 4 also gives an alternative proof for the convergence of classical RK; see Corollary 2. Finally, in the second scenario, we also deduce that Algorithm 1 converges linearly to the orthogonal projection of the initial global position $x^{(0)}$ onto $\cap_{i=1}^{M}C_{i}$ when the whole system is underdetermined. Specifically, if $\cap_{i=1}^{M}C_{i}$ is a single point, Algorithm 1 converges to the true solution as before.

Corollary 1.

Following the notation in Algorithm 1, assume $\tau=\infty$, $\tau_{g}=T$ and $|S^{(t)}|\equiv N\in\mathbb{N}$. Let $X^{(t+1)}$ be the global update after $t+1$ iterations. Denote $C=\cap_{i=1}^{M}C_{i}$, where the $C_{i}$'s are defined in (1), and by $P_{C}$ the orthogonal projection onto $C$. Then there exists $0<\beta<1$ such that

\mathbb{E}\|X^{(t+1)}-P_{C}X^{(0)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc.

We demonstrate experimentally the behavior of FedRK when applied to related problems. For sparse approximation problems, the system is modeled as $b=Ax^{*}+e$, where $A\in\mathbb{R}^{m\times n}$ is a wide matrix ($m<n$), $x^{*}\in\mathbb{R}^{n}$ is the true sparse signal, and $e\in\mathbb{R}^{m}$ is noise. Given a sparsity level $s\in\mathbb{N}$, the hard thresholding operator $T_{s}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is defined as the orthogonal projection onto the entries with the $s$ largest magnitudes. Combined with a hard thresholding operator, the Kaczmarz algorithm has been proposed as a method to solve such problems; see for instance [7, 8]. In the federated setting, we show that our algorithm can be combined with the hard thresholding operator $T_{s}$ to solve the sparse approximation problem; see Algorithm 2. For the least squares problem, where $A\in\mathbb{R}^{m\times n}$ is a tall matrix ($m>n$) and the goal is to minimize $\|Ax-b\|^{2}$, it is known that the randomized Kaczmarz algorithm does not converge to the least squares solution in general, but its iterates reach within a horizon of the solution; see [9]. In the federated setting, we show that our algorithm exhibits the same behavior. Moreover, we show that by adding a suitable number of noisy columns to $A$, one can shrink the horizon. Finally, we apply Algorithm 2 to the prostate cancer data considered in [10], and we demonstrate that the selection of Algorithm 2 is consistent with that of Lasso.

Algorithm 2 FedRK with thresholding
Initial model $x^{(0)}$
for t = 0 to $T-1$ do
     Sample a subset $S^{(t)}$ of clients
     for client $i\in S^{(t)}$ do
         Initialize local model $x_{i}^{(t,0)}=x^{(t)}$
         for $k=0,\dotsc,\tau-1$ do
              Perform one RK iteration on $A_{i}x=b_{i}$ initialized with $x_{i}^{(t,k)}$ to obtain $x_{i}^{(t,k+1)}$
         end for
         Compute the local model change $\Delta_{i}^{(t)}=x_{i}^{(t,\tau)}-x_{i}^{(t,0)}$
     end for
     for $i\in S^{(t)}$ do
         Define $d_{i}=\langle\Delta_{i}^{(t)},\Delta_{i}^{(t)}+x^{(t)}\rangle$
     end for
     Apply RK with a uniform sampling scheme to solve $\Delta^{(t)}x=d$ (ignoring the rows where $\Delta_{i}^{(t)}=0$), and let $x^{(t+1)}$ be the iterate after $\tau_{g}$ iterations
     Apply the hard thresholding operator: $x^{(t+1)}=T_{s}x^{(t+1)}$
end for
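The only new ingredient in Algorithm 2 relative to Algorithm 1 is the hard thresholding step. A minimal sketch of the operator $T_{s}$ is given below; the function name `hard_threshold` is our own, and the commented line only illustrates how it would compose with the hypothetical `fedrk_round` helper from the earlier sketch.

```python
import numpy as np

def hard_threshold(x, s):
    """T_s: keep the s entries of largest magnitude and zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]   # indices of the s largest magnitudes
    out[keep] = x[keep]
    return out

# One round of Algorithm 2 (sketch):
# x_global = hard_threshold(fedrk_round(client_data, x_global, N, tau, tau_g, rng), s)
```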

1.1 Contribution

We summarize our contributions. First, we propose the federated Kaczmarz algorithm (FedRK) for solving large linear systems in the federated setting, and we prove the linear convergence of our algorithm. As a corollary of our analysis, we give an alternative proof of the linear convergence of the classic randomized Kaczmarz algorithm (RK). Second, we demonstrate experimentally that our algorithm can be combined with hard thresholding to solve sparse approximation problems. We also demonstrate experimentally that, like RK, it converges to a horizon of the least squares solution when applied to inconsistent systems. Finally, we apply our algorithm to real data and show its potential use for feature selection in the federated setting.

1.2 Organization

The rest of the paper is organized as follows. In Section 2, we present our main results. We first present Theorem 4, which characterizes the convergence behavior when a collection of local clients is selected and the current position is one unit distance away from the true solution; the proof of Theorem 4 may be interesting in its own right. In particular, a corollary of Theorem 4 is an alternative proof of the linear convergence of the classic RK; see Corollary 2. We then prove the convergence theorems for our algorithm. The proofs of Theorem 1 and Theorem 2 are based on slightly different strategies, and we provide roadmaps of the proofs as guides. In Section 3, we present the experiments. We demonstrate experimentally the linear convergence of Algorithm 1. Perhaps interestingly, our experiments show that running more local iterations helps the algorithm converge faster, which is usually not seen in other federated algorithms. We then apply Algorithm 2 to sparse approximation problems and Algorithm 1 to least squares problems for inconsistent systems; our results show that our algorithm behaves similarly to classical RK, which suggests that it can be efficiently combined with other Kaczmarz variants to extend them to the federated setting. Finally, we apply Algorithm 2 to real data and show that our algorithm can potentially be used for feature selection in the federated setting. In Section 4, we present some discussion and future directions.

2 Main Results

2.1 Notation

In this section, we define the notation used throughout. Denote $[M]=\{1,\dotsc,M\}$. We partition the system into $M$ parts,

A=\begin{bmatrix}A_{1}\\ \vdots\\ A_{M}\end{bmatrix}\quad\mbox{and}\quad b=\begin{bmatrix}b_{1}\\ \vdots\\ b_{M}\end{bmatrix},

and the $i$-th client sees the system $A_{i}x=b_{i}$ for $i\in[M]$. Let $C_{i}$ be defined as in (1) for $i\in[M]$. Under Assumption 2, the $C_{i}\subseteq\mathbb{R}^{n}$ are linear subspaces for $i\in[M]$. We denote by $P_{C}:\mathbb{R}^{n}\rightarrow C$ the orthogonal projection onto a linear subspace $C\subseteq\mathbb{R}^{n}$.

Let $\mathcal{P}$ be the set of probability measures on $[M]$. Given $x\in\mathbb{R}^{n}$ and $S\subseteq[M]$, we denote by $\mathcal{P}_{S}\subseteq\mathcal{P}$ the set of probability measures such that

\mathcal{P}_{S}=\left\{p\in\mathcal{P}:\operatorname{supp}(p)=S\right\}; (3)

for instance, the uniform distribution over $S\subseteq[M]$ is in $\mathcal{P}_{S}$. An object that forms the core of our analysis is the following function,

f(S,\,x,\,p,\,y)=\sum_{s\in S}p(s)\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2} (4)

for $S\subseteq[M]$, $x\in\cap_{s\in S}C_{s}^{c}$, $p\in\mathcal{P}$ and $y\in\mathbb{R}^{n}$. For $p\in\mathcal{P}_{S}$, this function measures the average decrease of the norm of $y$ when randomly projecting onto the hyperplanes formed by the normal vectors $\{x-P_{C_{s}}x\}_{s\in S}$. In fact, we will study the function on a refined domain; see Section 2.2 for more details.
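To make (4) concrete, the sketch below evaluates $f(S,x,p,y)$ numerically in the setting of Assumption 2, where each $C_{s}$ is the null space of a local matrix $A_{s}$; computing $P_{C_{s}}x$ via a pseudoinverse as $x-A_{s}^{+}A_{s}x$, and the function name `f_value`, are illustrative choices of ours.

```python
import numpy as np

def f_value(A_list, S, x, p, y):
    """Evaluate f(S, x, p, y) from (4) with C_s = null(A_s); p[s] are the weights."""
    total = 0.0
    for s in S:
        A_s = A_list[s]
        v = np.linalg.pinv(A_s) @ (A_s @ x)   # v = x - P_{C_s} x, since P_{C_s} x = x - A_s^+ A_s x
        v = v / np.linalg.norm(v)             # requires x not in C_s
        r = y - (v @ y) * v                   # (I - v v^T) y
        total += p[s] * (r @ r)
    return total
```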

For the readers' convenience, we review some notions from geometry and topology; one can find detailed descriptions in [11] and [12], for instance. A topological space is a pair $(X,\mathcal{T})$, where $X$ is the whole space and $\mathcal{T}$ is the topology, i.e., a collection of open subsets satisfying

  • $\emptyset\in\mathcal{T}$ and $X\in\mathcal{T}$

  • $U,V\in\mathcal{T}$ implies $U\cap V\in\mathcal{T}$

  • $\{U_{\alpha}\}_{\alpha\in A}\subseteq\mathcal{T}$ implies $\cup_{\alpha\in A}U_{\alpha}\in\mathcal{T}$, where $A$ is an arbitrary index set (possibly uncountable)

A set $K\subseteq X$ is compact if every open covering of $K$ admits a finite sub-covering. A map $f:X\rightarrow Y$ between two topological spaces is continuous if $f^{-1}(U)\subseteq X$ is open for all open sets $U\subseteq Y$. If $f:X\rightarrow Y$ between two topological spaces is continuous and $K\subseteq X$ is compact, then $f(K)$ is compact. Another ingredient in our proof is the implicit function theorem, whose statement we now recall:

Theorem 3 (Implicit function theorem).

Let $U\subseteq\mathbb{R}^{n+m}$ be an open subset, and let $f:U\rightarrow\mathbb{R}^{m}$ be smooth. Given $(x_{0},y_{0})\in U$ such that $f(x_{0},y_{0})=0$ and $D_{y}f(x_{0},y_{0}):\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$ is invertible, then, in an open neighborhood of $(x_{0},y_{0})$, the level set

\left\{(x,y)\in\mathbb{R}^{n+m}:f(x,y)=0\right\}

is smoothly parameterized by $x$; i.e., in a neighborhood $V\subseteq\mathbb{R}^{n}$ of $x_{0}$ there exists a smooth function $g:V\rightarrow\mathbb{R}^{m}$ such that

\left\{(x,y)\in\mathbb{R}^{n+m}:f(x,y)=0,\,x\in V\right\}=\left\{(x,g(x))\in\mathbb{R}^{n+m}:x\in V\right\}.

2.2 A technical theorem

In this section, we prove our main technical theorem, and deduce the convergence of the classical randomized Kaczmarz as a corollary. To motivate the setting, we start with a discussion of how one can reduce the analysis to a function on the product of two spheres.

Given $x\in\mathbb{R}^{n}\setminus\{0\}$, one has by linearity

P_{C_{i}}x=\|x\|P_{C_{i}}\frac{x}{\|x\|},

and

x-P_{C_{i}}x=\|x\|\left(\frac{x}{\|x\|}-P_{C_{i}}\frac{x}{\|x\|}\right).

This shows that the normal vector obtained after normalizing $x-P_{C_{i}}x$ is independent of the length of $x\in\mathbb{R}^{n}$. Similarly, for $y\in\mathbb{R}^{n}\setminus\{0\}$, one has

\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y=\|y\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)\frac{y}{\|y\|}.

This suggests that it is natural to study our problem on the unit sphere $S^{n-1}\subseteq\mathbb{R}^{n}$. Given $\epsilon>0$, denote

B_{S}=\left\{x\in S^{n-1}:\operatorname{dim}\left(\operatorname{span}\{x-P_{C_{s}}x\}_{s\in S}\right)<|S|\right\},\quad B_{S,\,\epsilon}=\{x\in S^{n-1}:d(x,B_{S})<\epsilon\} (5)

and $\overset{*}{\mathcal{S}}_{S,\,\epsilon}=S^{n-1}\setminus B_{S,\,\epsilon}$. We will consider $f$ as a function on $2^{M}\times\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times\mathcal{P}\times S^{n-1}$ for technical reasons that will become clear later; see Figure 1 for an illustration. It is clear that $f(S,\,\cdot,\,p,\,\cdot):\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}\rightarrow\mathbb{R}$ is smooth for each $(S,\,p)\in 2^{M}\times\mathcal{P}$. We have the following:

Theorem 4.

Given $S\subseteq[M]$ and $\epsilon>0$, let

h_{S,\,x}=\bigcap_{s\in S}\left\{y\in S^{n-1}:(x-P_{C_{s}}x)^{T}\,y=0\right\},\quad x\in S^{n-1}.

Denote

C_{S,\,\epsilon}=\{(x,y)\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}:d(y,\,h_{S,\,x})<\epsilon\},\quad D_{S,\,\epsilon}=(\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1})\setminus C_{S,\,\epsilon}.

Given $p\in\mathcal{P}_{S}$ defined in (3), consider $f(S,\,\cdot,\,p,\,\cdot):D_{S,\,\epsilon}\rightarrow\mathbb{R}$ defined as

f(S,\,x,\,p,\,y)=\sum_{s\in S}p(s)\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}.

Then there exists $\alpha(S,\,p,\,\epsilon)<1$ such that $f(S,\,x,\,p,\,y)\leq\alpha(S,\,p,\,\epsilon)<1$ for $(x,y)\in D_{S,\,\epsilon}$. Moreover, for all $(x,y)\in D_{S,\,\epsilon}$, there exists $s(y)\in S$ such that

\left\|\left(I-\frac{(x-P_{C_{s(y)}}x)(x-P_{C_{s(y)}}x)^{T}}{\|x-P_{C_{s(y)}}x\|^{2}}\right)y\right\|^{2}\leq\frac{\alpha(S,\,p,\,\epsilon)}{|S|\,p(s(y))}.
Proof.

To simplify the presentation, we may assume that the index set is $S=\{1,\dotsc,|S|\}$. Our first objective is to show that $C_{S,\,\epsilon}\subseteq\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ is open.

Given $(x_{0},y_{0})\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ such that $x_{0}\notin B_{S,\,\epsilon}$ and $y_{0}\in h_{S,\,x_{0}}$, one can find $a_{1},\dotsc,a_{n-|S|-1}$ such that

\left\{x_{0}-P_{C_{1}}x_{0},\dotsc,x_{0}-P_{C_{|S|}}x_{0},a_{1},\dotsc,a_{n-|S|-1},y_{0}\right\}\subseteq\mathbb{R}^{n}

is a basis. One can consider $F:\mathbb{R}^{2n}\rightarrow\mathbb{R}^{n}$ defined as

F(x,y)=\begin{bmatrix}(x-P_{C_{1}}x)^{T}y\\ \vdots\\ (x-P_{C_{|S|}}x)^{T}y\\ a_{1}^{T}y\\ \vdots\\ a_{n-|S|-1}^{T}y\\ \sum_{i=1}^{n}y_{i}^{2}-1\end{bmatrix}.

Denote by $DF:\mathbb{R}^{2n}\rightarrow\mathbb{R}^{n\times 2n}$ the Jacobian matrix of $F$, and denote by $D_{x}F$, $D_{y}F$ the submatrices when the partial derivatives are taken only with respect to $x$, $y$, respectively. A direct calculation shows

DF(x_{0},y_{0})=\begin{bmatrix}D_{x}F(x_{0},y_{0})&D_{y}F(x_{0},y_{0})\end{bmatrix}=\begin{bmatrix}y^{T}(I-P_{C_{1}})&x^{T}(I-P_{C_{1}})^{T}\\ \vdots&\vdots\\ y^{T}(I-P_{C_{|S|}})&x^{T}(I-P_{C_{|S|}})^{T}\\ 0&a_{1}^{T}\\ \vdots&\vdots\\ 0&a_{n-|S|-1}^{T}\\ 0&2y^{T}\end{bmatrix}\in\mathbb{R}^{n\times 2n},

where the right-hand side is evaluated at $(x,y)=(x_{0},y_{0})$, and it is clear from our choice of basis that $D_{y}F(x_{0},y_{0})$ is of full rank. If one considers the level set

\left\{(x,y)\in\mathbb{R}^{2n}:F(x,y)=0\right\},

then, by the implicit function theorem, there exists an open ball $B(x_{0})\subseteq\mathbb{R}^{n}$ on which $y$ is a smooth function of $x$ with $y(x_{0})=y_{0}$.

Given $(x_{0},y_{0})\in C_{S,\,\epsilon}$, there exists $y_{*}\in h_{S,\,x_{0}}$ such that

d(y_{0},y_{*})=d(y_{0},h_{S,\,x_{0}})=\epsilon^{\prime}<\epsilon,

since $\{y_{0}\}\subseteq S^{n-1}$ is compact and $h_{S,\,x_{0}}\subseteq S^{n-1}$ is closed. By the argument above, there exist $\delta_{1}>0$ and a diffeomorphism $g:B_{\delta_{1}}(x_{0})\rightarrow g(B_{\delta_{1}}(x_{0}))$ such that $g(x_{0})=y_{*}$ and

F(x,g(x))=0,\quad x\in B_{\delta_{1}}(x_{0});

i.e., $y$ is a function of $x$ in a neighborhood of $x_{0}$. By continuity of $g:B_{\delta_{1}}(x_{0})\rightarrow g(B_{\delta_{1}}(x_{0}))$, there exists $\delta_{2}>0$ such that

d(x,x_{0})<\delta_{2}\quad\Rightarrow\quad d(g(x),g(x_{0}))<\frac{\epsilon-d(y_{0},y_{*})}{4}.

Define $\delta_{3}=(\epsilon-d(y_{0},y_{*}))/4$, and let $\delta=\min\{\delta_{1},\delta_{2},\delta_{3}\}$. We claim that $B_{\delta}(x_{0})\times B_{\delta}(y_{0})\subseteq C_{S,\,\epsilon}$, so that $(x_{0},y_{0})$ is an interior point of $C_{S,\,\epsilon}$. For $(x,y)\in B_{\delta}(x_{0})\times B_{\delta}(y_{0})$,

d(y,g(x))\leq d(y,y_{0})+d(y_{0},y_{*})+d(y_{*},g(x))<\epsilon.

Finally, one has

d(y,h_{S,\,x})\leq d(y,g(x))<\epsilon,

since $g(x)\in h_{S,\,x}$, and this shows that $(x_{0},y_{0})$ is an interior point of $C_{S,\,\epsilon}$; hence $C_{S,\,\epsilon}$ is open.

Since $C_{S,\,\epsilon}\subseteq\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ is open and $\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ is compact, $D_{S,\,\epsilon}$ is compact. One also has that $f(S,\,x,\,p,\,y)<1$ for all $(x,y)\in D_{S,\,\epsilon}$, and therefore, by continuity of $f$ and compactness of $D_{S,\,\epsilon}$, there exists $\alpha(S,\,p,\,\epsilon)<1$ such that

f(S,\,x,\,p,\,y)\leq\alpha(S,\,p,\,\epsilon)<1,\quad\forall(x,y)\in D_{S,\,\epsilon}.

Finally, by the pigeonhole principle, for fixed $(x,y)\in D_{S,\,\epsilon}$ there exists $s(y)\in S$ such that

|S|\,p(s(y))\,\left\|\left(I-\frac{(x-P_{C_{s(y)}}x)(x-P_{C_{s(y)}}x)^{T}}{\|x-P_{C_{s(y)}}x\|^{2}}\right)y\right\|^{2}\leq\alpha(S,\,p,\,\epsilon).

This concludes our proof. ∎

Figure 1: Suppose that $C_{1}$ is a line intersecting the sphere at the green dots (left), and $C_{2}$ is a plane intersecting the sphere at the red (dashed) diametric circle. Then, the pink (dashed) diametric circles in the second sphere are the points we remove from the $x$-sphere, and the pink dots in the third sphere are the points we remove from the $y$-sphere mentioned in Theorem 4. More precisely, a neighborhood of those points is removed.

As a corollary, we demonstrate how Theorem 4 implies the convergence of the classical randomized Kaczmarz algorithm. Indeed, the classical setting can be viewed as the scenario where one has $M=m$ local clients, and each local client has one equation of the linear system $Ax=b$; see [13]. This corresponds to taking $S=[M]$ in Theorem 4, with the $\{C_{s}\}_{s\in[M]}$ being $(n-1)$-dimensional linear subspaces (hyperplanes).

Recall that we defined $\mathcal{P}_{S}$ in (3). Given a sampling scheme $p\in\mathcal{P}_{[M]}$, if we denote by $a_{s}$ the unit normal vector (unique up to sign) of $C_{s}$ for $s\in[M]$, the randomized Kaczmarz algorithm is defined by

\mathbb{P}\left[\left.Y_{k+1}=\left(I-a_{s}\,a_{s}^{T}\right)Y_{k}\,\right|Y_{k}\right]=p(s),\quad s\in[M]. (6)

We have the following:

Corollary 2.

Given an initial $y_{0}\in\mathbb{R}^{n}$ and a sampling scheme $p\in\mathcal{P}_{[M]}$ defined in (3), define $Y_{1},\dotsc,Y_{k},\dotsc$ by (6). Let $\alpha=\max_{y\in S^{n-1}}y^{T}\sum_{s\in[M]}p(s)\left(I-a_{s}\,a_{s}^{T}\right)y$. Then

\mathbb{E}\left(\left.\|Y_{k+1}\|^{2}\,\right|y_{0}\right)\leq\alpha^{k+1}\|y_{0}\|^{2}.
Proof.

First, we describe what $D_{S,\,\epsilon}$ in Theorem 4 is in this scenario, where $S=[M]$. By picking $\epsilon>0$ small enough, one has that $\overset{*}{\mathcal{S}}_{S,\,\epsilon}\neq\emptyset$. Let $\{a_{s}\}_{s\in[M]}$ be the unit normal vectors of the hyperplanes $\{C_{s}\}_{s\in[M]}$. One has

x-P_{C_{s}}x\in\operatorname{span}\{a_{s}\},\quad\forall x\in\overset{*}{\mathcal{S}}_{S,\,\epsilon},\,s\in[M],

and

\bigcap_{s\in[M]}\left\{y\in\mathbb{R}^{n}:(x-P_{C_{s}}x)^{T}\,y=0\right\}=\bigcap_{s\in[M]}\left\{y\in\mathbb{R}^{n}:a_{s}^{T}\,y=0\right\}=\{0\};

this implies that $h_{S,\,x}=S^{n-1}\cap\{0\}=\emptyset$, $C_{S,\,\epsilon}=\emptyset$ and therefore $D_{S,\,\epsilon}=\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$.

Now, we study the convergence of the randomized Kaczmarz algorithm. By (6),

\mathbb{E}\left(\|Y_{k+1}\|^{2}\,\big|\,Y_{k}\right)=\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)Y_{k}\right\|^{2}=\|Y_{k}\|^{2}\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)\frac{Y_{k}}{\|Y_{k}\|}\right\|^{2}.

Pick some $x\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}$. We recognize that the summation above is equal to $f([M],\,x,\,p,\,Y_{k}/\|Y_{k}\|)$. By Theorem 4, the function $f([M],\,\cdot,\,p,\,\cdot):D_{S,\,\epsilon}\rightarrow\mathbb{R}$ satisfies

f([M],\,x,\,p,\,y)=\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)y\right\|^{2}\leq\alpha([M],\,p,\,\epsilon)<1,\quad\forall y\in S^{n-1}.

(Note that the function is independent of $x\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}$ here.) Therefore,

\mathbb{E}\left(\|Y_{k+1}\|^{2}\,\big|\,Y_{k}\right)=\|Y_{k}\|^{2}\,f([M],\,x,\,p,\,Y_{k}/\|Y_{k}\|)\leq\|Y_{k}\|^{2}\,\alpha([M],\,p,\,\epsilon),

and one can iterate to conclude $\mathbb{E}\left(\left.\|Y_{k+1}\|^{2}\,\right|y_{0}\right)\leq\alpha^{k+1}\|y_{0}\|^{2}$.

Finally, a closer look shows that

\alpha([M],\,p,\,\epsilon)=\max_{y\in S^{n-1}}\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)y\right\|^{2}=\max_{y\in S^{n-1}}y^{T}\sum_{s\in[M]}p(s)\left(I-a_{s}\,a_{s}^{T}\right)y,

which gives the variational formula for the constant $\alpha$. This concludes our proof. ∎

In Corollary 2, we showed the convergence of the randomized Kaczmarz algorithm for all sampling schemes in $\mathcal{P}_{[M]}$. This includes the sampling scheme in [4]. Indeed, if we denote by $p_{SV}$ the sampling scheme in [4],

p_{SV}(s)=\frac{\|a_{s}\|_{2}^{2}}{\|A\|_{F}^{2}}\quad\Rightarrow\quad\operatorname{supp}(p_{SV})=[M],

which clearly implies that $p_{SV}\in\mathcal{P}_{[M]}$. This concludes our discussion here.

As a second corollary, we demonstrate how one can produce different variations of Theorem 4; the following form will be used in the study of Algorithm 1.

Corollary 3.

Given $S\subseteq[M]$ and $\epsilon>0$, denote by $u_{S}$ the uniform distribution over the set $S$ and $D_{S,\,\epsilon}^{\prime}=\cap_{s\in S}D_{\{s\},\,\epsilon}$, where the $D_{\{s\},\,\epsilon}$'s are defined in Theorem 4. Consider $f(S,\,\cdot,\,u_{S},\,\cdot):D_{S,\,\epsilon}^{\prime}\rightarrow\mathbb{R}$ defined as in Theorem 4. Then, $f(S,\,x,\,u_{S},\,y)\leq\alpha^{\prime}(S,\,\epsilon)<1$ for $(x,y)\in D_{S,\,\epsilon}^{\prime}$. Moreover, for $s\in S$, define

E_{S,s,\epsilon}=\left\{(x,y)\in D_{\{s\},\epsilon}:x\not\in\bigcup_{s^{\prime}\in S}C_{s^{\prime}}\right\},

where each $C_{s^{\prime}}$ is the linear subspace associated with the index $s^{\prime}\in S$. Then

f(S,x,u_{S},y)\leq\frac{|S|-1}{|S|}+\frac{1}{|S|}\alpha^{\prime}\left(\{s\},\epsilon\right)=\gamma(s,\epsilon)<1,\quad(x,y)\in E_{S,s,\epsilon}.
Proof.

To simplify the presentation, we assume $S=\left\{1,\dotsc,|S|\right\}$ without loss of generality. Given $s\in S$ and $\delta_{s}\in\mathcal{P}_{\{s\}}$, by Theorem 4 there exists $\alpha(\{s\},\,\delta_{s},\,\epsilon)<1$ such that the function $f(\{s\},\,\cdot,\,\delta_{s},\,\cdot):D_{\{s\},\,\epsilon}\rightarrow\mathbb{R}$ defined as

f(\{s\},\,x,\,\delta_{s},\,y)=\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}

satisfies the bound $f(\{s\},\,x,\,\delta_{s},\,y)\leq\alpha(\{s\},\,\delta_{s},\,\epsilon)<1$. In particular, we have

f(\{s\},\,x,\,\delta_{s},\,y)\leq\alpha(\{s\},\,\delta_{s},\,\epsilon)\quad\forall(x,y)\in D_{S,\epsilon}^{\prime},\,s\in S,

which implies

f(S,\,x,\,u_{S},\,y)=\sum_{s\in S}\frac{1}{|S|}\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}=\sum_{s\in S}\frac{1}{|S|}f(\{s\},\,x,\,\delta_{s},\,y)\leq\max_{s\in S}\left\{\alpha(\{s\},\,\delta_{s},\,\epsilon)\right\}<1.

This concludes the proof of the first statement. For the second statement, we see that $f(S,\,\cdot,\,u_{S},\,\cdot)$ is well defined on $E_{S,s,\epsilon}$, and

f(S,x,u_{S},y)=\sum_{s^{\prime}\neq s}\frac{1}{|S|}\left\|\left(I-\frac{(x-P_{C_{s^{\prime}}}x)(x-P_{C_{s^{\prime}}}x)^{T}}{\|x-P_{C_{s^{\prime}}}x\|^{2}}\right)y\right\|^{2}+\frac{1}{|S|}\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}. (7)

Note that

\left\|\left(I-\frac{(x-P_{C_{s^{\prime}}}x)(x-P_{C_{s^{\prime}}}x)^{T}}{\|x-P_{C_{s^{\prime}}}x\|^{2}}\right)y\right\|^{2}\leq 1\quad\forall(x,y)\in E_{S,s,\epsilon},

and by the first statement

\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}\leq\alpha^{\prime}(\{s\},\epsilon)\quad\forall(x,y)\in D_{\{s\},\epsilon},

which can then be combined with (7) to conclude the proof. ∎

2.3 Proof of Theorem 1, Theorem 2 and Corollary 1

Roadmap of the proof of Theorem 1: When the local clients run only finitely many iterations, the local updates are not exactly orthogonal projections onto a linear subspace; therefore, the idea of the proof differs from that of Theorem 2. For Theorem 1, the key observation is that when running only one global iteration ($\tau_{g}=1$), the global update essentially randomly selects one of the local updates. From this perspective, one can compare the scheme with the classic RK algorithm applied to a suitable overdetermined linear system to deduce convergence.

Proof of Theorem 1.

Denote by $\mathbb{E}\left(X^{(t+1)}\,|\,S,X^{(t)}\right)$ the expected global position conditioned on the subset $S$ of local clients that is selected. One can observe that

\mathbb{E}\left(X^{(t+1)}\,|\,S,X^{(t)}\right)=\sum_{s\in S}\frac{1}{|S|}X^{(t,T)}_{s}, (8)

where $X^{(t,T)}_{s}$ is the position of client $s$ after $T$ local iterations. The idea of the proof is to compare the dynamics with a suitably chosen instance of the RK algorithm.

Fix $s\in[M]$. Denote by $a_{s,1},\dotsc,a_{s,m(s)}$ the rows at client $s$. Then

\mathbb{E}\|X^{(t,T)}_{s}\|\leq\mathbb{E}\|X^{(t,1)}_{s}\|=\frac{1}{m(s)}\sum_{i=1}^{m(s)}\left\|\left(I-\frac{a_{s,i}^{T}a_{s,i}}{\|a_{s,i}\|^{2}}\right)X^{(t,0)}\right\|. (9)

Using (9), we have

\mathbb{E}\|X^{(t+1)}\|=\sum_{S}\mathbb{E}\left(\|X^{(t+1)}\|\,\big|\,S\right)\mathbb{P}\left(S^{(t)}=S\right)
=\sum_{S}\sum_{s\in S}\frac{1}{|S|}\mathbb{E}\|X^{(t,T)}_{s}\|\,\mathbb{P}\left(S^{(t)}=S\right)
\leq\sum_{S}\sum_{s\in S}\frac{1}{|S|}\frac{1}{m(s)}\sum_{i=1}^{m(s)}\left\|\left(I-\frac{a_{s,i}^{T}a_{s,i}}{\|a_{s,i}\|^{2}}\right)X^{(t,0)}\right\|\mathbb{P}\left(S^{(t)}=S\right).

One can see that the right-hand side is the expected decrease when using RK to solve $A^{\prime}x=0$, where $A^{\prime}$ consists of copies of $A$. One can then deduce the convergence using the convergence analysis of classic RK. ∎

Given $S\subseteq[M]$, we consider $S_{x}\subseteq S$ defined as

S_{x}=\left\{s\in S:x\not\in C_{s}\right\},

and $u_{S_{x}}$ the uniform distribution over $S_{x}$. Then we consider the following function:

g(S,x,y)=\begin{cases}f(S_{x},x,u_{S_{x}},y),&\mbox{if $S_{x}\neq\emptyset$}\\ 1,&\mbox{if $S_{x}=\emptyset$}\end{cases},\qquad g(x,y)=\sum_{S\subseteq[M]}\frac{1}{\binom{M}{|S|}}g(S,\,x,\,y). (10)

One can see that $g(x,y)$ characterizes the convergence rate when $\tau_{g}=1$; for instance, see the proof of Corollary 2. We prove two lemmas and then prove Theorem 2 as a consequence of these lemmas. Let us briefly describe the proof strategy.

Roadmap of the proof of Theorem 2: Fix the collection of linear subspaces at the local clients, $\{C_{s}\}_{s=1}^{M}$. First, we demonstrate that for an overdetermined system one can find a small enough $\epsilon>0$ such that the $\epsilon$-neighborhoods of these linear subspaces have empty intersection on the unit sphere; this is done in Lemma 1. Using Corollary 3, we know that if the (normalized) position at a given federated round is at least $\epsilon$ away from some linear subspace $C_{s}$, then projecting onto $C_{s}$ is guaranteed to shrink the distance between the current position and the solution by a constant factor. By Lemma 1, an arbitrary point on the unit sphere is at least $\epsilon$ away from some linear subspace. Hence, at each federated round there is always some chance of selecting a client $s$ whose $C_{s}$ is $\epsilon$-away from the normalized global position, and orthogonally projecting onto this $C_{s}$ shrinks the current position by a uniform factor. Using this fact, we show that on average we can shrink the current position by a constant factor regardless of the position at the start of the federated round; this is done in Lemma 2. Finally, everything is put together with the scaling argument from Corollary 2 to obtain Theorem 2. To deduce Corollary 1 for underdetermined systems, the key observation is to decompose the initial position as $x=x_{1}+x_{2}$, where $x_{1}\in C=\cap_{s}C_{s}$ and $x_{2}\in C^{\perp}$. One can see that $x_{1}$ is a fixed point for the orthogonal projections onto the $C_{s}$'s, and that these projections restricted to $C^{\perp}$ reduce to a determined system; these observations allow us to deduce Corollary 1 from Theorem 2.

Lemma 1.

Let $\{C_{s}\}_{s=1}^{M}$ be a collection of linear subspaces such that $\cap_{s=1}^{M}C_{s}=\{0\}$. Then there exists $\epsilon>0$ such that

\bigcap_{s\in[M]}\left\{x\in S^{n-1}:d(x,\,C_{s}\cap S^{n-1})<\epsilon\right\}=\emptyset.

Proof.

Define $\tilde{d}:S^{n-1}\rightarrow\mathbb{R}$ as

\tilde{d}(x)=\max_{s\in[M]}d(x,\,C_{s}\cap S^{n-1});

one can see that $\tilde{d}$ is continuous. Since $\left(\cap_{s=1}^{M}C_{s}\right)\cap S^{n-1}=\emptyset$, one has

\tilde{d}(x)>0\quad\forall x\in S^{n-1}.

This further implies, by compactness of $S^{n-1}$, that there exists $\epsilon>0$ such that $\tilde{d}(x)\geq\epsilon$ for all $x\in S^{n-1}$. This concludes our proof. ∎

Lemma 2.

Given $0<\epsilon\ll 1$, denote $K_{s,\epsilon}=\{x\in S^{n-1}:d(x,C_{s}\cap S^{n-1})\geq\epsilon\}$ for all $s\in[M]$. Then $g:S^{n-1}\times S^{n-1}\rightarrow\mathbb{R}$ defined in (10) satisfies

g(x,y)\leq\beta<1,\quad\forall(x,y)\in\bigcup_{s\in[M]}\left(K_{s,\epsilon}\times K_{s,\epsilon}\right).

Proof.

Given $x\in K_{s,\epsilon}$, one has

s\in S\quad\Rightarrow\quad s\in S_{x},

which further implies

g(x,y)=\sum_{S\subseteq[M]}\frac{1}{\binom{M}{|S|}}g(S,x,y)=\sum_{S\ni s}\frac{1}{\binom{M}{|S|}}f(S_{x},x,u_{S_{x}},y)+\sum_{S\not\ni s}\frac{1}{\binom{M}{|S|}}g(S,x,y).

By Corollary 3, one has

f(S_{x},x,u_{S_{x}},y)\leq\gamma(s,\epsilon)\quad\forall(x,y)\in E_{S_{x},s,\epsilon},

which can be combined with the fact that there are $\binom{M-1}{|S|-1}$ subsets of $[M]$ of size $|S|$ containing $s$ to get

g(x,y)\leq\frac{\binom{M-1}{|S|-1}}{\binom{M}{|S|}}\gamma(s,\epsilon)+\frac{\binom{M}{|S|}-\binom{M-1}{|S|-1}}{\binom{M}{|S|}}=\beta_{s,\epsilon}<1,\quad(x,y)\in K_{s,\epsilon}\times K_{s,\epsilon}.

One can then take $\beta_{\epsilon}=\max_{s\in[M]}\{\beta_{s,\epsilon}\}$ to conclude the proof. ∎

Now, we combine Lemma 1 and Lemma 2 to prove Theorem 2.

Proof of Theorem 2.

First, we assume that the server runs one global iteration ($\tau_{g}=1$). Let $\epsilon>0$ be small enough so that Lemma 1 holds. For $K_{s,\epsilon}$ defined in Lemma 2, one has

S^{n-1}\setminus\left(\bigcup_{s\in[M]}K_{s,\epsilon}\right)=\bigcap_{s\in[M]}K_{s,\epsilon}^{c}=\emptyset

by Lemma 1; this shows that $\{K_{s,\epsilon}\}_{s\in[M]}$ covers $S^{n-1}$, and therefore

(x,x)\in\cup_{s\in[M]}\left(K_{s,\epsilon}\times K_{s,\epsilon}\right)\quad\forall x\in S^{n-1}.

By Lemma 2, one sees that

\mathbb{E}\left(\left.\|X^{(t+1)}\|^{2}\,\right|X^{(t)}\right)=g(X^{(t)},X^{(t)})\leq\beta_{\epsilon}<1,\quad\mbox{if}\quad\|X^{(t)}\|=1.

One can then iterate as in the proof of Corollary 2 to conclude the proof. For general $\tau_{g}\in\mathbb{N}$, we observe that the operator norm of an orthogonal projection is at most $1$, and therefore the convergence result for general $\tau_{g}\in\mathbb{N}$ follows from the case $\tau_{g}=1$. ∎

Finally, we demonstrate how one can deduce the convergence behavior for underdetermined systems.

Proof of Corollary 1.

Given an initial $X^{(0)}\in\mathbb{R}^{n}$, we consider the orthogonal decomposition $X^{(0)}=P_{C^{\perp}}X^{(0)}+P_{C}X^{(0)}$, where $C$ is defined as in the statement of Corollary 1. Then one has

X^{(1)}=\sum_{s\in S^{(0)}}\frac{1}{|S^{(0)}|}\left(I-\frac{(X^{(0)}-P_{C_{s}}X^{(0)})(X^{(0)}-P_{C_{s}}X^{(0)})^{T}}{\|X^{(0)}-P_{C_{s}}X^{(0)}\|^{2}}\right)X^{(0)} (11)
=\sum_{s\in S^{(0)}}\frac{1}{|S^{(0)}|}\left(I-\frac{(X^{(0)}-P_{C_{s}}X^{(0)})(X^{(0)}-P_{C_{s}}X^{(0)})^{T}}{\|X^{(0)}-P_{C_{s}}X^{(0)}\|^{2}}\right)P_{C^{\perp}}X^{(0)}+P_{C}X^{(0)}, (12)

since $C\subseteq C_{i}$ for all $i\in\{1,\dotsc,M\}$ and $P_{C}X^{(0)}$ is a fixed point of the projections onto the $C_{i}$'s. On the other hand, restricted to the linear subspace $P_{C^{\perp}}(\mathbb{R}^{n})=C^{\perp}$, we have $C^{\perp}\cap C=\{0\}$, so the problem reduces to a determined system, and we can apply Theorem 2 to deduce

\mathbb{E}\|X^{(t+1)}-P_{C}X^{(0)}\|^{2}\leq\beta^{t+1}\|P_{C^{\perp}}X^{(0)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc,

for some $0<\beta<1$. ∎

3 Experiments

In the following, we demonstrate experimentally the convergence of FedRK and its behavior when applied to sparse approximation problems and least squares problems. Finally, we apply Algorithm 2 to real data and compare the result with Lasso.

3.1 FedRK

In this section, we consider solving an overdetermined consistent system. We consider $A\in\mathbb{R}^{2048\times 1024}$, $x^{*}\in\mathbb{R}^{1024}$ and $b=Ax^{*}\in\mathbb{R}^{2048}$, where the entries of $A$ and $x^{*}$ are generated as i.i.d. standard Gaussians. The data is distributed evenly to $M=16$ clients. At each federated round, five local clients are selected to participate in the update, and each of them runs $\tau$ local iterations. The server runs 20 global iterations after receiving the local updates. This demonstrates the convergence behavior of Algorithm 1, and the results suggest that running more local iterations may improve the convergence rate. The results are shown in Figure 2.

Figure 2: FedRK with clients running different numbers of local iterations

3.2 Sparse approximation problems

In this section, we consider the sparse signal recovery problem. For certain types of signals, one can find a good basis so that these signals can be approximated by sparse representations, and an important question is how to reconstruct such sparse approximations [14]. One class of algorithms for this problem is iterative hard thresholding [15]. Given a sparsity level $s\in\mathbb{N}$, the objective is to solve

\min_{x\in\mathbb{R}^{n}}\|b-Ax\|_{2}^{2}\quad\mbox{subject to}\quad\|x\|_{0}\leq s,

where $\|\cdot\|_{0}$ counts the number of non-zero entries, and the iterative hard thresholding algorithm performs a gradient descent step followed by hard thresholding (projection onto the $s$ largest entries) in each iteration. Recently, it has been proposed to replace the gradient descent step with an RK projection [7, 8]. Our experiment demonstrates that such a strategy can be extended to the federated setting. We consider the measurement matrix $A\in\mathbb{R}^{256\times 1024}$, $x^{*}\in\mathbb{R}^{1024}$ the target sparse solution, $e\in\mathbb{R}^{256}$ the measurement noise, and $b=Ax^{*}+0.01e\in\mathbb{R}^{256}$ the perturbed observation; the entries of $A$ and $e$, and the non-zero entries of $x^{*}$, are all generated from the standard Gaussian. Data is distributed evenly to $16$ clients, and at each federated round, $5$ local clients participate. Each of the local clients runs $20$ local RK iterations, and the global server runs $20$ global iterations after receiving the local updates. The sparsity level in this experiment is $9$, and we demonstrate that Algorithm 2 can recover the true signal. The experiment is performed 50 times with different initializations and sampling schemes, and we record the number of times each feature is selected in Figure 3.
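For reproducibility, one way to generate the synthetic data described above and split it across the 16 clients is sketched below; the random seed and the even row split via `np.array_split` are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, M = 256, 1024, 9, 16

A = rng.standard_normal((m, n))                   # measurement matrix
x_true = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_true[support] = rng.standard_normal(s)          # 9-sparse ground truth
e = rng.standard_normal(m)
b = A @ x_true + 0.01 * e                         # perturbed observations

# Each client i holds the pair (A_i, b_i) of 16 consecutive rows.
client_data = list(zip(np.array_split(A, M), np.array_split(b, M)))
```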

Figure 3: The number of times each feature was selected in the $50$ trials when Algorithm 2 with sparsity level $s=9$ was applied.

3.3 Least squares

While the main objective of this paper is to solve consistent linear systems, in many real-world applications the right-hand side $b\in\mathbb{R}^{m}$ is corrupted by noise or modeling error and one instead faces an inconsistent system. To assess the robustness of our method under these more realistic conditions, we consider general least squares problems. When the linear system is inconsistent, recall that classical RK does not converge to the least squares solution; however, its iterates do eventually remain within some horizon of the least squares solution [9]. We consider $A\in\mathbb{R}^{2048\times 256}$ and $b\in\mathbb{R}^{2048}$, where the entries of $A$ and $b$ are generated from the standard Gaussian. Our aim is to solve $\min_{x}\|Ax-b\|^{2}$; the data are distributed evenly to $M=16$ clients. In each federated round, all local clients participate. We apply FedRK to the system, and the algorithm converges to a horizon of the least squares solution; moreover, by adding a suitable amount of noise one can shrink the convergence horizon. Specifically, we expand the matrix $A$ by adding Gaussian noise columns,

A^{\prime}=\begin{bmatrix}A&B\end{bmatrix},

where the entries of $B$ are generated as i.i.d. standard Gaussians, and we apply FedRK to solve the extended system $\min_{x}\|A^{\prime}x-b\|^{2}$; note that the perturbation $B$ can be added at the client level, so the strategy is suitable for the federated setting. To motivate our strategy, recall that in high-dimensional space two independent Gaussian vectors are almost orthogonal with high probability, and therefore the columns of $B$ are likely to be almost orthogonal to the column space of $A$; hence, the columns of $B$ capture the components of $b$ that are perpendicular to the column space of $A$. This is inspired by the randomized extended Kaczmarz algorithm [16], where column operations are used to solve the least squares problem; however, column operations are not suitable in the federated setting, and we therefore create the extra matrix $B$ instead. This is similar in spirit to the approach taken in [17], where a Monte Carlo method is combined with RK to solve the least squares problem. The strategy proposed in [17] does not involve column operations, and it would be interesting to see if it can be extended to the federated setting. The results are summarized in Figure 4.
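A sketch of the column augmentation described above is given below; the helper name `augment_with_noise_columns` and the parameter `n_extra` (the number of appended noisy columns) are our own, and choosing `n_extra` corresponds to the sweep over augmented widths in the experiment.

```python
import numpy as np

def augment_with_noise_columns(A, n_extra, rng):
    """Form A' = [A  B], where B has i.i.d. standard Gaussian entries."""
    B = rng.standard_normal((A.shape[0], n_extra))
    return np.hstack([A, B])

# FedRK is then run on min_x ||A' x - b||^2; only the first A.shape[1] coordinates
# of the iterate are compared against the least squares solution of the original problem.
```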

Figure 4: This experiment applies Algorithm 1 to the augmented system $A^{\prime}x=b$ to solve the least squares problem, where $A^{\prime}$ is obtained from $A$ by appending $n-256$ noisy columns. The result shows that we can shrink the convergence horizon by adding a suitable number of noisy columns.

3.4 Prostate cancer data

In this section, we test our algorithm on the prostate cancer data [18], which was used in [10]. The features are 1. intercept (intcpt), 2. log(cancer volume) (lcavol), 3. log(prostate weight) (lweight), 4. age, 5. log(benign prostatic hyperplasia amount) (lbph), 6. seminal vesicle invasion (svi), 7. log(capsular penetration) (lcp), 8. Gleason score (gleason) and 9. percentage of Gleason scores 4 or 5 (pgg45), and the target variable is log(prostate specific antigen) (lpsa). We adapt a more recent use of this data [19], where the Lasso is fit to the data. Let us briefly recall that, given the data matrix $X\in\mathbb{R}^{N\times p}$ and the target vector $y\in\mathbb{R}^{N}$, Lasso solves

\hat{\beta}=\operatorname*{arg\,min}_{\beta}\frac{1}{2}\sum_{i=1}^{N}\Big(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}\Big)^{2}+\lambda\sum_{j=1}^{p}|\beta_{j}|,

where $\lambda\in\mathbb{R}_{\geq 0}$ is the regularization parameter (here we assume that the intercept is included as the first column of $X$); it is known that when $\lambda$ is large, $\hat{\beta}$ tends to be sparse, and a feature is said to be selected if it has a non-zero coefficient. In this case, for the chosen $\lambda\in\mathbb{R}_{\geq 0}$, 4 features are selected: lcavol, lweight, lbph and svi. Following [19], we first standardize the feature columns and then add an extra column of ones to represent the intercept; we distribute the data evenly to 7 local clients. To compare with the result in [19], we set the sparsity level to 5 (so that we include one slot for the intercept). At each federated round, 3 local clients participate and each of them runs 20 iterations using its local data; after receiving the local updates, the server runs 20 iterations of RK and then applies the hard thresholding operator $T_{5}$. We performed 2000 federated rounds and counted the number of times each feature has a non-zero coefficient after hard thresholding. The result is recorded in Figure 5, and the distribution is consistent with the analogous selection of Lasso if we look at the top 5 features.
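For reference, the Lasso baseline can be reproduced with scikit-learn along the following lines. Note that sklearn's `Lasso` scales the quadratic term by $1/(2N)$, so its `alpha` corresponds to $\lambda/N$ in the objective above, and the value `alpha=0.1` below is only a placeholder rather than the $\lambda$ used in [19].

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selected_features(X, y, alpha=0.1):
    """Fit Lasso on standardized features X (without the intercept column) and
    return the indices of features with non-zero coefficients."""
    model = Lasso(alpha=alpha, fit_intercept=True)
    model.fit(X, y)
    return np.flatnonzero(model.coef_)
```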

Figure 5: Feature selection via Algorithm 2.

4 Discussion

In this paper, we proposed a federated algorithm (Algorithm 1) for solving large linear systems and derived its convergence properties. Our experiments showed that, when applied to inconsistent systems, it converges to a horizon of the least squares solution. We also proposed a modified version of our algorithm (Algorithm 2) for sparse approximation problems. We applied Algorithm 2 to real data and showed that it has potential use for feature selection in the federated setting. For future work, it would be interesting to extend the approach taken here to more general optimization problems.

Acknowledgements: DN was partially supported by NSF DMS 2408912.

References

  • McMahan et al. [2017] McMahan, B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, pp. 1273–1282 (2017). PMLR
  • Wang et al. [2021] Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H.B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al.: A field guide to federated optimization. arXiv preprint arXiv:2107.06917 (2021)
  • Karczmarz [1937] Karczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Pol. Sci. Lett., Cl. Sci. Math. Nat., 355–357 (1937)
  • Strohmer and Vershynin [2009] Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications 15(2), 262–278 (2009)
  • Needell et al. [2014] Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Advances in Neural Information Processing Systems 27 (2014)
  • Ma et al. [2015] Ma, A., Needell, D., Ramdas, A.: Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods. SIAM Journal on Matrix Analysis and Applications 36(4), 1590–1604 (2015)
  • Jeong and Needell [2023] Jeong, H., Needell, D.: Linear convergence of reshuffling Kaczmarz methods with sparse constraints. arXiv preprint arXiv:2304.10123 (2023)
  • Zhang et al. [2015] Zhang, Z., Yu, Y., Zhao, S.: Iterative hard thresholding based on randomized Kaczmarz method. Circuits, Systems, and Signal Processing 34, 2065–2075 (2015)
  • Needell [2010] Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics 50, 395–403 (2010)
  • Tibshirani [1996] Tibshirani, R.: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1), 267–288 (1996)
  • Munkres [2014] Munkres, J.: Topology (2nd Edition). Prentice-Hall, NJ (2014)
  • Spivak [1999] Spivak, M.: A Comprehensive Introduction to Differential Geometry. Publish or Perish, Inc., TX (1999)
  • Huang et al. [2024] Huang, L., Li, X., Needell, D.: Randomized Kaczmarz in adversarial distributed setting. SIAM Journal on Scientific Computing 46(3), 354–376 (2024)
  • Needell and Tropp [2009] Needell, D., Tropp, J.A.: CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis 26(3), 301–321 (2009)
  • Blumensath and Davies [2008] Blumensath, T., Davies, M.E.: Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications 14, 629–654 (2008)
  • Zouzias and Freris [2013] Zouzias, A., Freris, N.M.: Randomized extended Kaczmarz for solving least squares. SIAM Journal on Matrix Analysis and Applications 34(2), 773–793 (2013)
  • Epperly et al. [2024] Epperly, E.N., Goldshlager, G., Webber, R.J.: Randomized Kaczmarz with tail averaging. arXiv preprint arXiv:2411.19877 (2024)
  • Stamey et al. [1989] Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A., Yang, N.: Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of Urology 141(5), 1076–1083 (1989)
  • Hastie et al. [2009] Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, NY (2009)