
A federated Kaczmarz algorithm

Halyun Jeong (hjeong2@albany.edu), Department of Mathematics & Statistics, The State University of New York at Albany, Albany, NY 12222, USA

Deanna Needell (deanna@math.ucla.edu), Department of Mathematics, University of California Los Angeles, Los Angeles, CA 90095, USA

Chi-Hao Wu (chwu93@math.ucla.edu), Department of Mathematics, University of California Los Angeles, Los Angeles, CA 90095, USA
Abstract

In this paper, we propose a federated algorithm for solving large linear systems that is inspired by the classic randomized Kaczmarz algorithm. We provide convergence guarantees for the proposed method, and as a corollary of our analysis, we provide a new proof for the convergence of the classic randomized Kaczmarz method. We demonstrate experimentally the behavior of our method when applied to related problems. For underdetermined systems, we demonstrate that our algorithm can be used for sparse approximation. For inconsistent systems, we demonstrate that our algorithm converges to a horizon of the least squares solution. Finally, we apply our algorithm to real data and show that its feature selection is consistent with that of Lasso, while still offering the computational advantages of the Kaczmarz framework and thresholding-based algorithms in the federated setting.

Keywords: federated learning, Kaczmarz method, sparse approximation, feature selection

MSC Classification: 65F10, 65F20

1 Introduction

In this paper, we propose a federated algorithm for solving large linear systems. Federated learning was originally proposed by [1] to train neural networks in a decentralized setting. The global model is trained across multiple clients (e.g., mobile devices, sensors, or edge nodes) without transferring local data to a central server. This method improves privacy, reduces communication overhead, and enables learning from heterogeneous, distributed datasets; for instance, see [2] for more details. On the other hand, the Kaczmarz algorithm [3] is an iterative method for solving overdetermined linear systems, and the randomized Kaczmarz algorithm (RK) [4] is a version with a particular sampling scheme. Given a matrix $A\in\mathbb{R}^{m\times n}$, we denote by $a_{j}$ the $j$-th row of $A$, and we denote the Frobenius norm,

\|A\|_{F}^{2}=\sum_{j=1}^{m}\|a_{j}\|_{2}^{2}.

To solve a linear system $Ax=b$ for $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$, RK considers the iteration,

x_{k+1}=x_{k}+\frac{b_{j_{k}}-\langle a_{j_{k}},x_{k}\rangle}{\|a_{j_{k}}\|^{2}}\,a_{j_{k}},\quad k=0,1,\dotsc,

where the $j_{k}$'s are independent identically distributed (i.i.d.) random variables with distribution $\mathbb{P}(j_{k}=j)=\|a_{j}\|_{2}^{2}/\|A\|_{F}^{2}$ for $k=0,1,\dotsc$. For consistent overdetermined linear systems, it was shown in [4] that RK converges linearly to the solution in expectation. In a broader framework, it is also known that RK can be viewed as a stochastic gradient descent method with carefully chosen step sizes [5].
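For concreteness, the iteration above can be sketched in a few lines of NumPy; the function name `randomized_kaczmarz`, the zero initialization, and the fixed iteration budget below are our own illustrative choices rather than part of the method's specification.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=1000, x0=None, rng=None):
    """Minimal RK sketch: sample row j with probability ||a_j||^2 / ||A||_F^2
    and project the iterate onto the hyperplane {x : <a_j, x> = b_j}."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()      # P(j_k = j) = ||a_j||^2 / ||A||_F^2
    for _ in range(num_iters):
        j = rng.choice(m, p=probs)
        a_j = A[j]
        x = x + (b[j] - a_j @ x) / row_norms_sq[j] * a_j
    return x
```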

We consider the federated setting where there are $M$ local clients, and each client owns a subset of the equations. In particular, let $Ax=b$ be the overdetermined, consistent linear system we aim to solve, where $A\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^{m}$. The system is partitioned into $M$ parts,

A=\begin{bmatrix}A_{1}\\ \vdots\\ A_{M}\end{bmatrix}\quad\mbox{and}\quad b=\begin{bmatrix}b_{1}\\ \vdots\\ b_{M}\end{bmatrix},

and the $i$-th client sees the system $A_{i}x=b_{i}$ for $i\in[M]$; note that locally the system $A_{i}x=b_{i}$ may be underdetermined. We briefly describe the framework of federated algorithms using federated averaging (FedAvg) [1] as an example. At each federated round, a subset of local clients is selected, and the global server broadcasts the current global parameters (position) to the selected local clients. Each selected client performs local updates using the global position as its initialization and, after some iterations, returns the updated local position. The global server averages the local updates returned by the clients and uses the averaged position as the new global position. The process is iterated until convergence. Following the notation used in the federated optimization community [2], we denote by $x^{(t)}\in\mathbb{R}^{n}$ the solution at the global server after $t$ federated rounds, and by $x_{i}^{(t,k)}$ the solution at the $i$-th client after $k$ local iterations during the $(t+1)$-th federated round. We propose Algorithm 1 for solving large linear systems in a federated setting.

We briefly comment on the ideas behind Algorithm 1 and the difficulties in the convergence analysis. In our setting, we allow the linear systems at the local clients, $A_{i}x=b_{i}$, to be underdetermined for $i\in\{1,\dotsc,M\}$, and it is known that RK converges to the orthogonal projection of the initial position onto the affine subsets

C_{i}=\left\{x\in\mathbb{R}^{n}:A_{i}x=b_{i}\right\},\quad i=1,\dotsc,M (1)

for underdetermined linear systems; in particular, if the initial position is orthogonal to the null space of $A_{i}$, RK converges to the least-norm solution of $A_{i}x=b_{i}$; see [6]. Based on this observation, we see that $\Delta_{i}^{(t)}$ defined in Algorithm 1 approximates a normal vector (up to a normalizing constant) to a hyperplane containing $C_{i}$, and projecting $x^{(t)}$ onto the $C_{i}$'s is approximately the same as projecting $x^{(t)}$ onto

\tilde{C}_{i}=\left\{x\in\mathbb{R}^{n}:\Delta_{i}^{(t)}x=d_{i}\right\},\quad i=1,\dotsc,M.

Therefore, instead of treating the local model changes $\Delta_{i}^{(t)}$ simply as displacements, we transform them into the approximate hyperplanes $\tilde{C}_{i}$, which the server then uses to run RK. In some sense, Algorithm 1 is a stochastic process in which, at each time $t\in\{0,1,\dotsc,T-1\}$, we choose a collection of affine subsets, dependent on $x^{(t)}$, onto which to project. Given that it is difficult to study the convergence with a purely algebraic approach, we take a more geometric approach; see Section 2 for details.

Algorithm 1 FedRK
Initial model $x^{(0)}$
for t = 0 to $T-1$ do
     Sample a subset $S^{(t)}$ of clients
     for client $i\in S^{(t)}$ do
         Initialize local model $x_{i}^{(t,0)}=x^{(t)}$
         for $k=0,\dotsc,\tau-1$ do
              Perform one RK iteration on $A_{i}x=b_{i}$ initialized with $x_{i}^{(t,k)}$ to obtain $x_{i}^{(t,k+1)}$
         end for
         Compute the local model change $\Delta_{i}^{(t)}=x_{i}^{(t,\tau)}-x_{i}^{(t,0)}$
     end for
     for $i\in S^{(t)}$ do
         Define $d_{i}=\langle\Delta_{i}^{(t)},\Delta_{i}^{(t)}+x^{(t)}\rangle$
     end for
     Apply RK with a uniform sampling scheme to solve $\Delta^{(t)}x=d$, where the rows of $\Delta^{(t)}$ are the $\Delta_{i}^{(t)}$'s and $d=(d_{i})_{i\in S^{(t)}}$ (ignoring the rows where $\Delta_{i}^{(t)}=0$), and let $x^{(t+1)}$ be the iterate after $\tau_{g}$ iterations
end for
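To make the round structure concrete, below is a minimal NumPy sketch of one FedRK round as we read Algorithm 1. Several details are our own simplifying assumptions rather than part of the pseudocode: client data is passed as a list of $(A_{i},b_{i})$ pairs, both the local and the server-side RK steps use uniform row sampling, the server-side RK is initialized at the current global model, and the helper names (`rk_step`, `fedrk_round`) are hypothetical.

```python
import numpy as np

def rk_step(A, b, x, rng):
    """One RK iteration on A x = b with uniform row sampling (a simplifying choice)."""
    j = rng.integers(A.shape[0])
    a = A[j]
    return x + (b[j] - a @ x) / (a @ a) * a

def fedrk_round(client_data, x_global, num_selected, tau, tau_g, rng):
    """One round of FedRK: local RK runs, then server-side RK on Delta^{(t)} x = d."""
    selected = rng.choice(len(client_data), size=num_selected, replace=False)
    deltas, ds = [], []
    for i in selected:
        A_i, b_i = client_data[i]
        x_local = x_global.copy()
        for _ in range(tau):                          # tau local RK iterations
            x_local = rk_step(A_i, b_i, x_local, rng)
        delta = x_local - x_global                    # local model change Delta_i
        if np.linalg.norm(delta) > 1e-12:             # ignore zero rows
            deltas.append(delta)
            ds.append(delta @ (delta + x_global))     # d_i = <Delta_i, Delta_i + x^{(t)}>
    if not deltas:
        return x_global
    Delta, d = np.vstack(deltas), np.array(ds)
    x = x_global.copy()
    for _ in range(tau_g):                            # tau_g server-side RK iterations
        x = rk_step(Delta, d, x, rng)
    return x
```

A full run would simply iterate `fedrk_round` for $T$ rounds starting from $x^{(0)}$.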

For the convergence analysis, we consider two scenarios. First, we study the scenario where the local clients run finitely many iterations and the server runs one iteration ($\tau_{g}=1$ in Algorithm 1). Second, we study the scenario where the local clients run infinitely many iterations ($\tau=\infty$ in Algorithm 1) and the server runs some finite number of iterations. We now state the assumptions made in our main theorems; they are in force throughout, except in Corollary 1, where the system is allowed to be underdetermined.

Assumption 1.

The linear system $Ax=b$ is overdetermined.

By an appropriate translation, one can assume that the true solution is $x^{*}=0$, and thus it is natural to also include the second assumption:

Assumption 2.

The solution to the linear system is $x^{*}=0$.

The proof for the first scenario ($\tau_{g}=1$) utilizes that of RK applied to a suitable linear system. Specifically, we show the following:

Theorem 1.

Following the notation in Algorithm 1, assume $\tau=T$, $\tau_{g}=1$ and $|S^{(t)}|\equiv N\in\mathbb{N}$. Let $X^{(t+1)}$ be the global update after $t+1$ iterations. Then there exists $0<\beta<1$ such that

\mathbb{E}\|X^{(t+1)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc.

For the second scenario, let us first assume that the server performs one RK iteration ($\tau_{g}=1$ in Algorithm 1). Since we assume that the solution is $x^{*}=0$, the $C_{i}$'s defined in (1) are linear subspaces for $i\in[M]$. Denote by $P_{C_{i}}$ the orthogonal projection operator onto $C_{i}$ for $i\in[M]$. Following the notation in Algorithm 1, we see that

\mathbb{E}X^{(t+1)}=\sum_{s\in S^{(t)}}\frac{1}{|S^{(t)}|}\left(I-\frac{(X^{(t)}-P_{C_{s}}X^{(t)})(X^{(t)}-P_{C_{s}}X^{(t)})^{T}}{\|X^{(t)}-P_{C_{s}}X^{(t)}\|^{2}}\right)X^{(t)} (2)

describes the expected update of our federated Kaczmarz algorithm. We first prove a technical theorem that characterizes the decay of Algorithm 1 when the current position is one unit distance away from the true solution, which involves studying a function related to (2); see Theorem 4. Then we use the technical theorem to prove the following:

Theorem 2.

Following the notation in Algorithm 1, assume $\tau=\infty$, $\tau_{g}=T$ and $|S^{(t)}|\equiv N\in\mathbb{N}$. Let $X^{(t+1)}$ be the global update after $t+1$ iterations. Then there exists $0<\beta<1$ such that

\mathbb{E}\|X^{(t+1)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc.

Perhaps interestingly, Theorem 4 also gives an alternative proof for the convergence of classical RK; see Corollary 2. Finally, in the second scenario, we also deduce that Algorithm 1 converges linearly to the orthogonal projection of the initial global position $x^{(0)}$ onto $\cap_{i=1}^{M}C_{i}$ when the whole system is underdetermined. Specifically, if $\cap_{i=1}^{M}C_{i}$ is a single point, Algorithm 1 converges to the true solution as before.

Corollary 1.

Following the notation in Algorithm 1, assume $\tau=\infty$, $\tau_{g}=T$ and $|S^{(t)}|\equiv N\in\mathbb{N}$. Let $X^{(t+1)}$ be the global update after $t+1$ iterations. Denote $C=\cap_{i=1}^{M}C_{i}$, where the $C_{i}$'s are defined in (1), and by $P_{C}$ the orthogonal projection onto $C$. Then there exists $0<\beta<1$ such that

\mathbb{E}\|X^{(t+1)}-P_{C}X^{(0)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc.

We demonstrate experimentally the behavior of FedRK when applied to related problems. For sparse approximation problems, the system is modeled as $b=Ax^{*}+e$, where $A\in\mathbb{R}^{m\times n}$ is a wide matrix ($m<n$), $x^{*}\in\mathbb{R}^{n}$ is the true sparse signal, and $e\in\mathbb{R}^{m}$ is noise. Given a sparsity level $s\in\mathbb{N}$, the hard thresholding operator $T_{s}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is defined as the orthogonal projection onto the entries with the $s$ largest magnitudes. Combined with a hard thresholding operator, the Kaczmarz algorithm has been proposed as a method to solve such problems; see for instance [7, 8]. In the federated setting, we show that our algorithm can be combined with the hard thresholding operator $T_{s}$ to solve the sparse approximation problem; see Algorithm 2. For the least squares problem, where $A\in\mathbb{R}^{m\times n}$ is a tall matrix ($m>n$) and the goal is to minimize $\|Ax-b\|^{2}$, it is known that the randomized Kaczmarz algorithm does not converge to the least squares solution in general, but its iterates reach within a horizon of the solution; see [9]. In the federated setting, we show that our algorithm exhibits the same behavior. Moreover, we show that by adding a suitable number of noisy columns to $A$, one can shrink the horizon. Finally, we apply Algorithm 2 to the prostate cancer data considered in [10], and we demonstrate that the selection of Algorithm 2 is consistent with that of Lasso.

Algorithm 2 FedRK with thresholding
Initial model $x^{(0)}$
for t = 0 to $T-1$ do
     Sample a subset $S^{(t)}$ of clients
     for client $i\in S^{(t)}$ do
         Initialize local model $x_{i}^{(t,0)}=x^{(t)}$
         for $k=0,\dotsc,\tau-1$ do
              Perform one RK iteration on $A_{i}x=b_{i}$ initialized with $x_{i}^{(t,k)}$ to obtain $x_{i}^{(t,k+1)}$
         end for
         Compute the local model change $\Delta_{i}^{(t)}=x_{i}^{(t,\tau)}-x_{i}^{(t,0)}$
     end for
     for $i\in S^{(t)}$ do
         Define $d_{i}=\langle\Delta_{i}^{(t)},\Delta_{i}^{(t)}+x^{(t)}\rangle$
     end for
     Apply RK with a uniform sampling scheme to solve $\Delta^{(t)}x=d$ (ignoring the rows where $\Delta_{i}^{(t)}=0$), and let $x^{(t+1)}$ be the iterate after $\tau_{g}$ iterations
     Apply the hard thresholding operator: $x^{(t+1)}=T_{s}x^{(t+1)}$
end for
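The only new ingredient in Algorithm 2 relative to Algorithm 1 is the hard thresholding step. A minimal sketch of the operator $T_{s}$ is given below; the function name `hard_threshold` is our own, and the commented line only illustrates how it would compose with the hypothetical `fedrk_round` helper from the earlier sketch.

```python
import numpy as np

def hard_threshold(x, s):
    """T_s: keep the s entries of largest magnitude and zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]   # indices of the s largest magnitudes
    out[keep] = x[keep]
    return out

# One round of Algorithm 2 (sketch):
# x_global = hard_threshold(fedrk_round(client_data, x_global, N, tau, tau_g, rng), s)
```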

1.1 Contribution

We summarize our contributions. First, we propose the federated Kaczmarz algorithm (FedRK) for solving large linear systems in the federated setting, and we prove the linear convergence of our algorithm. As a corollary of our analysis, we give an alternative proof of the linear convergence of the classic randomized Kaczmarz algorithm (RK). Second, we demonstrate experimentally that our algorithm can be combined with hard thresholding to solve sparse approximation problems. We also demonstrate experimentally that, like RK, it converges to a horizon of the least squares solution when applied to inconsistent systems. Finally, we apply our algorithm to real data and show its potential use for feature selection in the federated setting.

1.2 Organization

The rest of the paper is organized as follows. In Section 2, we present our main results. We first present Theorem 4, which characterizes the convergence behavior when a collection of local clients is selected and the current position is one unit distance away from the true solution; the proof of Theorem 4 may be interesting in its own right. In particular, a corollary of Theorem 4 is an alternative proof of the linear convergence of the classic RK; see Corollary 2. We then prove the convergence theorems for our algorithm. The proofs of Theorem 1 and Theorem 2 are based on slightly different strategies, and we provide roadmaps of the proofs as guides. In Section 3, we present the experiments. We demonstrate experimentally the linear convergence of Algorithm 1. Perhaps interestingly, our experiments show that running more local iterations helps the algorithm converge faster, which is usually not seen in other federated algorithms. We then apply Algorithm 2 to sparse approximation problems and Algorithm 1 to least squares problems for inconsistent systems; our results show that our algorithm behaves similarly to classical RK, which suggests that it can be efficiently combined with other Kaczmarz variants to extend them to the federated setting. Finally, we apply Algorithm 2 to real data and show that our algorithm can potentially be used for feature selection in the federated setting. In Section 4, we present some discussion and future directions.

2 Main Results

2.1 Notation

In this section, we define the notation used throughout. Denote $[M]=\{1,\dotsc,M\}$. We partition the system into $M$ parts,

A=\begin{bmatrix}A_{1}\\ \vdots\\ A_{M}\end{bmatrix}\quad\mbox{and}\quad b=\begin{bmatrix}b_{1}\\ \vdots\\ b_{M}\end{bmatrix},

and the $i$-th client sees the system $A_{i}x=b_{i}$ for $i\in[M]$. Let $C_{i}$ be defined as in (1) for $i\in[M]$. Under Assumption 2, the $C_{i}\subseteq\mathbb{R}^{n}$ are linear subspaces for $i\in[M]$. We denote by $P_{C}:\mathbb{R}^{n}\rightarrow C$ the orthogonal projection onto a linear subspace $C\subseteq\mathbb{R}^{n}$.

Let $\mathcal{P}$ be the set of probability measures on $[M]$. Given $x\in\mathbb{R}^{n}$ and $S\subseteq[M]$, we denote by $\mathcal{P}_{S}\subseteq\mathcal{P}$ the set of probability measures such that

\mathcal{P}_{S}=\left\{p\in\mathcal{P}:\operatorname{supp}(p)=S\right\}; (3)

for instance, the uniform distribution over $S\subseteq[M]$ is in $\mathcal{P}_{S}$. An object that forms the core of our analysis is the following function,

f(S,\,x,\,p,\,y)=\sum_{s\in S}p(s)\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2} (4)

for $S\subseteq[M]$, $x\in\cap_{s\in S}C_{s}^{c}$, $p\in\mathcal{P}$ and $y\in\mathbb{R}^{n}$. For $p\in\mathcal{P}_{S}$, this function measures the average decrease of the norm of $y$ when randomly projecting onto the hyperplanes formed by the normal vectors $\{x-P_{C_{s}}x\}_{s\in S}$. In fact, we will study the function on a refined domain; see Section 2.2 for more details.
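To make (4) concrete, the sketch below evaluates $f(S,x,p,y)$ numerically in the setting of Assumption 2, where each $C_{s}$ is the null space of a local matrix $A_{s}$; computing $P_{C_{s}}x$ via a pseudoinverse as $x-A_{s}^{+}A_{s}x$, and the function name `f_value`, are illustrative choices of ours.

```python
import numpy as np

def f_value(A_list, S, x, p, y):
    """Evaluate f(S, x, p, y) from (4) with C_s = null(A_s); p[s] are the weights."""
    total = 0.0
    for s in S:
        A_s = A_list[s]
        v = np.linalg.pinv(A_s) @ (A_s @ x)   # v = x - P_{C_s} x, since P_{C_s} x = x - A_s^+ A_s x
        v = v / np.linalg.norm(v)             # requires x not in C_s
        r = y - (v @ y) * v                   # (I - v v^T) y
        total += p[s] * (r @ r)
    return total
```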

For the readers' convenience, we review some notions from geometry and topology; one can find detailed descriptions in [11] and [12], for instance. A topological space is a pair $(X,\mathcal{T})$, where $X$ is the whole space and $\mathcal{T}$ is the topology, i.e., a collection of open subsets satisfying

  • $\emptyset\in\mathcal{T}$ and $X\in\mathcal{T}$

  • $U,V\in\mathcal{T}$ implies $U\cap V\in\mathcal{T}$

  • $\{U_{\alpha}\}_{\alpha\in A}\subseteq\mathcal{T}$ implies $\cup_{\alpha\in A}U_{\alpha}\in\mathcal{T}$, where $A$ is an arbitrary index set (possibly uncountable)

A set $K\subseteq X$ is compact if every open covering of $K$ admits a finite sub-covering. A map $f:X\rightarrow Y$ between two topological spaces is continuous if $f^{-1}(U)\subseteq X$ is open for all open sets $U\subseteq Y$. If $f:X\rightarrow Y$ between two topological spaces is continuous and $K\subseteq X$ is compact, then $f(K)$ is compact. Another ingredient in our proof is the implicit function theorem, whose statement we now recall:

Theorem 3 (Implicit function theorem).

Let $U\subseteq\mathbb{R}^{n+m}$ be an open subset, and let $f:U\rightarrow\mathbb{R}^{m}$ be smooth. Given $(x_{0},y_{0})\in U$ such that $f(x_{0},y_{0})=0$ and $D_{y}f(x_{0},y_{0}):\mathbb{R}^{m}\rightarrow\mathbb{R}^{m}$ is invertible, then, in an open neighborhood of $(x_{0},y_{0})$, the level set

\left\{(x,y)\in\mathbb{R}^{n+m}:f(x,y)=0\right\}

is smoothly parameterized by $x$; i.e., in a neighborhood $V\subseteq\mathbb{R}^{n}$ of $x_{0}$ there exists a smooth function $g:V\rightarrow\mathbb{R}^{m}$ such that

\left\{(x,y)\in\mathbb{R}^{n+m}:f(x,y)=0,\,x\in V\right\}=\left\{(x,g(x))\in\mathbb{R}^{n+m}:x\in V\right\}.

2.2 A technical theorem

In this section, we prove our main technical theorem, and deduce the convergence of the classical randomized Kaczmarz as a corollary. To motivate the setting, we start with a discussion of how one can reduce the analysis to a function on the product of two spheres.

Given $x\in\mathbb{R}^{n}\setminus\{0\}$, one has by linearity

P_{C_{i}}x=\|x\|P_{C_{i}}\frac{x}{\|x\|},

and

x-P_{C_{i}}x=\|x\|\left(\frac{x}{\|x\|}-P_{C_{i}}\frac{x}{\|x\|}\right).

This shows that the normal vector obtained after normalizing $x-P_{C_{i}}x$ is independent of the length of $x\in\mathbb{R}^{n}$. Similarly, for $y\in\mathbb{R}^{n}\setminus\{0\}$, one has

\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y=\|y\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)\frac{y}{\|y\|}.

This suggests that it is natural to study our problem on the unit sphere $S^{n-1}\subseteq\mathbb{R}^{n}$. Given $\epsilon>0$, denote

B_{S}=\left\{x\in S^{n-1}:\operatorname{dim}\left(\operatorname{span}\{x-P_{C_{s}}x\}_{s\in S}\right)<|S|\right\},\quad B_{S,\,\epsilon}=\{x\in S^{n-1}:d(x,B_{S})<\epsilon\} (5)

and $\overset{*}{\mathcal{S}}_{S,\,\epsilon}=S^{n-1}\setminus B_{S,\,\epsilon}$. We will consider $f$ as a function on $2^{M}\times\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times\mathcal{P}\times S^{n-1}$ for technical reasons that will become clear later; see Figure 1 for an illustration. It is clear that $f(S,\,\cdot,\,p,\,\cdot):\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}\rightarrow\mathbb{R}$ is smooth for each $(S,\,p)\in 2^{M}\times\mathcal{P}$. We have the following:

Theorem 4.

Given $S\subseteq[M]$ and $\epsilon>0$, let

h_{S,\,x}=\bigcap_{s\in S}\left\{y\in S^{n-1}:(x-P_{C_{s}}x)^{T}\,y=0\right\},\quad x\in S^{n-1}.

Denote

C_{S,\,\epsilon}=\{(x,y)\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}:d(y,\,h_{S,\,x})<\epsilon\},\quad D_{S,\,\epsilon}=(\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1})\setminus C_{S,\,\epsilon}.

Given $p\in\mathcal{P}_{S}$ defined in (3), consider $f(S,\,\cdot,\,p,\,\cdot):D_{S,\,\epsilon}\rightarrow\mathbb{R}$ defined as

f(S,\,x,\,p,\,y)=\sum_{s\in S}p(s)\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}.

Then there exists $\alpha(S,\,p,\,\epsilon)<1$ such that $f(S,\,x,\,p,\,y)\leq\alpha(S,\,p,\,\epsilon)<1$ for $(x,y)\in D_{S,\,\epsilon}$. Moreover, for all $(x,y)\in D_{S,\,\epsilon}$, there exists $s(y)\in S$ such that

\left\|\left(I-\frac{(x-P_{C_{s(y)}}x)(x-P_{C_{s(y)}}x)^{T}}{\|x-P_{C_{s(y)}}x\|^{2}}\right)y\right\|^{2}\leq\frac{\alpha(S,\,p,\,\epsilon)}{|S|\,p(s(y))}.
Proof.

To simplify the presentation, we may assume that the index set is $S=\{1,\dotsc,|S|\}$. Our first objective is to show that $C_{S,\,\epsilon}\subseteq\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ is open.

Given $(x_{0},y_{0})\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ such that $x_{0}\notin B_{S,\,\epsilon}$ and $y_{0}\in h_{S,\,x_{0}}$, one can find $a_{1},\dotsc,a_{n-|S|-1}$ such that

\left\{x_{0}-P_{C_{1}}x_{0},\dotsc,x_{0}-P_{C_{|S|}}x_{0},a_{1},\dotsc,a_{n-|S|-1},y_{0}\right\}\subseteq\mathbb{R}^{n}

is a basis. One can consider $F:\mathbb{R}^{2n}\rightarrow\mathbb{R}^{n}$ defined as

F(x,y)=\begin{bmatrix}(x-P_{C_{1}}x)^{T}y\\ \vdots\\ (x-P_{C_{|S|}}x)^{T}y\\ a_{1}^{T}y\\ \vdots\\ a_{n-|S|-1}^{T}y\\ \sum_{i=1}^{n}y_{i}^{2}-1\end{bmatrix}.

Denote by $DF:\mathbb{R}^{2n}\rightarrow\mathbb{R}^{n\times 2n}$ the Jacobian matrix of $F$, and denote by $D_{x}F$, $D_{y}F$ the submatrices when the partial derivatives are taken only with respect to $x$, $y$, respectively. A direct calculation shows

DF(x_{0},y_{0})=\begin{bmatrix}D_{x}F(x_{0},y_{0})&D_{y}F(x_{0},y_{0})\end{bmatrix}=\begin{bmatrix}y^{T}(I-P_{C_{1}})&x^{T}(I-P_{C_{1}})^{T}\\ \vdots&\vdots\\ y^{T}(I-P_{C_{|S|}})&x^{T}(I-P_{C_{|S|}})^{T}\\ 0&a_{1}^{T}\\ \vdots&\vdots\\ 0&a_{n-|S|-1}^{T}\\ 0&2y^{T}\end{bmatrix}\in\mathbb{R}^{n\times 2n},

where the right-hand side is evaluated at $(x,y)=(x_{0},y_{0})$, and it is clear from our choice of basis that $D_{y}F(x_{0},y_{0})$ is of full rank. If one considers the level set

\left\{(x,y)\in\mathbb{R}^{2n}:F(x,y)=0\right\},

then, by the implicit function theorem, there exists an open ball $B(x_{0})\subseteq\mathbb{R}^{n}$ on which $y$ is a smooth function of $x$ with $y(x_{0})=y_{0}$.

Given $(x_{0},y_{0})\in C_{S,\,\epsilon}$, there exists $y_{*}\in h_{S,\,x_{0}}$ such that

d(y_{0},y_{*})=d(y_{0},h_{S,\,x_{0}})=\epsilon^{\prime}<\epsilon,

since $\{y_{0}\}\subseteq S^{n-1}$ is compact and $h_{S,\,x_{0}}\subseteq S^{n-1}$ is closed. By the argument above, there exist $\delta_{1}>0$ and a diffeomorphism $g:B_{\delta_{1}}(x_{0})\rightarrow g(B_{\delta_{1}}(x_{0}))$ such that $g(x_{0})=y_{*}$ and

F(x,g(x))=0,\quad x\in B_{\delta_{1}}(x_{0});

i.e., $y$ is a function of $x$ in a neighborhood of $x_{0}$. By continuity of $g:B_{\delta_{1}}(x_{0})\rightarrow g(B_{\delta_{1}}(x_{0}))$, there exists $\delta_{2}>0$ such that

d(x,x_{0})<\delta_{2}\quad\Rightarrow\quad d(g(x),g(x_{0}))<\frac{\epsilon-d(y_{0},y_{*})}{4}.

Define $\delta_{3}=(\epsilon-d(y_{0},y_{*}))/4$, and let $\delta=\min\{\delta_{1},\delta_{2},\delta_{3}\}$. We claim that $B_{\delta}(x_{0})\times B_{\delta}(y_{0})\subseteq C_{S,\,\epsilon}$, so that $(x_{0},y_{0})$ is an interior point of $C_{S,\,\epsilon}$. For $(x,y)\in B_{\delta}(x_{0})\times B_{\delta}(y_{0})$,

d(y,g(x))\leq d(y,y_{0})+d(y_{0},y_{*})+d(y_{*},g(x))<\epsilon.

Finally, one has

d(y,h_{S,\,x})\leq d(y,g(x))<\epsilon,

since $g(x)\in h_{S,\,x}$, and this shows that $(x_{0},y_{0})$ is an interior point of $C_{S,\,\epsilon}$; hence $C_{S,\,\epsilon}$ is open.

Since $C_{S,\,\epsilon}\subseteq\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ is open and $\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$ is compact, $D_{S,\,\epsilon}$ is compact. One also has that $f(S,\,x,\,p,\,y)<1$ for all $(x,y)\in D_{S,\,\epsilon}$, and therefore, by continuity of $f$ and compactness of $D_{S,\,\epsilon}$, there exists $\alpha(S,\,p,\,\epsilon)<1$ such that

f(S,\,x,\,p,\,y)\leq\alpha(S,\,p,\,\epsilon)<1,\quad\forall(x,y)\in D_{S,\,\epsilon}.

Finally, by the pigeonhole principle, for fixed $(x,y)\in D_{S,\,\epsilon}$ there exists $s(y)\in S$ such that

|S|\,p(s(y))\,\left\|\left(I-\frac{(x-P_{C_{s(y)}}x)(x-P_{C_{s(y)}}x)^{T}}{\|x-P_{C_{s(y)}}x\|^{2}}\right)y\right\|^{2}\leq\alpha(S,\,p,\,\epsilon).

This concludes our proof. ∎

Figure 1: Suppose that $C_{1}$ is a line intersecting the sphere at the green dots (left), and $C_{2}$ is a plane intersecting the sphere at the red (dashed) diametric circle. Then, the pink (dashed) diametric circles in the second sphere are the points we remove from the $x$-sphere, and the pink dots in the third sphere are the points we remove from the $y$-sphere mentioned in Theorem 4. More precisely, a neighborhood of those points is removed.

As a corollary, we demonstrate how Theorem 4 implies the convergence of the classical randomized Kaczmarz algorithm. Indeed, the classical setting can be viewed as the scenario where one has $M=m$ local clients, and each local client has one equation of the linear system $Ax=b$; see [13]. This corresponds to taking $S=[M]$ in Theorem 4, with the $\{C_{s}\}_{s\in[M]}$ being $(n-1)$-dimensional linear subspaces (hyperplanes).

Recall that we defined $\mathcal{P}_{S}$ in (3). Given a sampling scheme $p\in\mathcal{P}_{[M]}$, if we denote by $a_{s}$ the unit normal vector (unique up to sign) of $C_{s}$ for $s\in[M]$, the randomized Kaczmarz algorithm is defined by

\mathbb{P}\left[\left.Y_{k+1}=\left(I-a_{s}\,a_{s}^{T}\right)Y_{k}\,\right|Y_{k}\right]=p(s),\quad s\in[M]. (6)

We have the following:

Corollary 2.

Given an initial $y_{0}\in\mathbb{R}^{n}$ and a sampling scheme $p\in\mathcal{P}_{[M]}$ defined in (3), define $Y_{1},\dotsc,Y_{k},\dotsc$ by (6). Let $\alpha=\max_{y\in S^{n-1}}y^{T}\sum_{s\in[M]}p(s)\left(I-a_{s}\,a_{s}^{T}\right)y$. Then

\mathbb{E}\left(\left.\|Y_{k+1}\|^{2}\,\right|y_{0}\right)\leq\alpha^{k+1}\|y_{0}\|^{2}.
Proof.

First, we describe what $D_{S,\,\epsilon}$ in Theorem 4 is in this scenario, where $S=[M]$. By picking $\epsilon>0$ small enough, one has that $\overset{*}{\mathcal{S}}_{S,\,\epsilon}\neq\emptyset$. Let $\{a_{s}\}_{s\in[M]}$ be the unit normal vectors of the hyperplanes $\{C_{s}\}_{s\in[M]}$. One has

x-P_{C_{s}}x\in\operatorname{span}\{a_{s}\},\quad\forall x\in\overset{*}{\mathcal{S}}_{S,\,\epsilon},\,s\in[M],

and

\bigcap_{s\in[M]}\left\{y\in\mathbb{R}^{n}:(x-P_{C_{s}}x)^{T}\,y=0\right\}=\bigcap_{s\in[M]}\left\{y\in\mathbb{R}^{n}:a_{s}^{T}\,y=0\right\}=\{0\};

this implies that $h_{S,\,x}=S^{n-1}\cap\{0\}=\emptyset$, $C_{S,\,\epsilon}=\emptyset$ and therefore $D_{S,\,\epsilon}=\overset{*}{\mathcal{S}}_{S,\,\epsilon}\times S^{n-1}$.

Now, we study the convergence of the randomized Kaczmarz algorithm. By (6),

\mathbb{E}\left(\|Y_{k+1}\|^{2}\,\big|\,Y_{k}\right)=\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)Y_{k}\right\|^{2}=\|Y_{k}\|^{2}\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)\frac{Y_{k}}{\|Y_{k}\|}\right\|^{2}.

Pick some $x\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}$. We recognize that the summation above is equal to $f([M],\,x,\,p,\,Y_{k}/\|Y_{k}\|)$. By Theorem 4, the function $f([M],\,\cdot,\,p,\,\cdot):D_{S,\,\epsilon}\rightarrow\mathbb{R}$ satisfies

f([M],\,x,\,p,\,y)=\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)y\right\|^{2}\leq\alpha([M],\,p,\,\epsilon)<1,\quad\forall y\in S^{n-1}.

(Note that the function is independent of $x\in\overset{*}{\mathcal{S}}_{S,\,\epsilon}$ here.) Therefore,

\mathbb{E}\left(\|Y_{k+1}\|^{2}\,\big|\,Y_{k}\right)=\|Y_{k}\|^{2}\,f([M],\,x,\,p,\,Y_{k}/\|Y_{k}\|)\leq\|Y_{k}\|^{2}\,\alpha([M],\,p,\,\epsilon),

and one can iterate to conclude $\mathbb{E}\left(\left.\|Y_{k+1}\|^{2}\,\right|y_{0}\right)\leq\alpha^{k+1}\|y_{0}\|^{2}$.

Finally, a closer look shows that

\alpha([M],\,p,\,\epsilon)=\max_{y\in S^{n-1}}\sum_{s\in[M]}p(s)\left\|\left(I-a_{s}\,a_{s}^{T}\right)y\right\|^{2}=\max_{y\in S^{n-1}}y^{T}\sum_{s\in[M]}p(s)\left(I-a_{s}\,a_{s}^{T}\right)y,

which gives the variational formula for the constant $\alpha$. This concludes our proof. ∎

In Corollary 2, we showed the convergence of the randomized Kaczmarz algorithm for all sampling schemes in $\mathcal{P}_{[M]}$. This includes the sampling scheme in [4]. Indeed, if we denote by $p_{SV}$ the sampling scheme in [4],

p_{SV}(s)=\frac{\|a_{s}\|_{2}^{2}}{\|A\|_{F}^{2}}\quad\Rightarrow\quad\operatorname{supp}(p_{SV})=[M],

which clearly implies that $p_{SV}\in\mathcal{P}_{[M]}$. This concludes our discussion here.

As a second corollary, we demonstrate how one can produce different variations of Theorem 4; the following form will be used in the study of Algorithm 1.

Corollary 3.

Given $S\subseteq[M]$ and $\epsilon>0$, denote by $u_{S}$ the uniform distribution over the set $S$ and $D_{S,\,\epsilon}^{\prime}=\cap_{s\in S}D_{\{s\},\,\epsilon}$, where the $D_{\{s\},\,\epsilon}$'s are defined in Theorem 4. Consider $f(S,\,\cdot,\,u_{S},\,\cdot):D_{S,\,\epsilon}^{\prime}\rightarrow\mathbb{R}$ defined as in Theorem 4. Then, $f(S,\,x,\,u_{S},\,y)\leq\alpha^{\prime}(S,\,\epsilon)<1$ for $(x,y)\in D_{S,\,\epsilon}^{\prime}$. Moreover, for $s\in S$, define

E_{S,s,\epsilon}=\left\{(x,y)\in D_{\{s\},\epsilon}:x\not\in\bigcup_{s^{\prime}\in S}C_{s^{\prime}}\right\},

where each $C_{s^{\prime}}$ is the linear subspace associated with the index $s^{\prime}\in S$. Then

f(S,x,u_{S},y)\leq\frac{|S|-1}{|S|}+\frac{1}{|S|}\alpha^{\prime}\left(\{s\},\epsilon\right)=\gamma(s,\epsilon)<1,\quad(x,y)\in E_{S,s,\epsilon}.
Proof.

To simplify the presentation, we assume $S=\left\{1,\dotsc,|S|\right\}$ without loss of generality. Given $s\in S$ and $\delta_{s}\in\mathcal{P}_{\{s\}}$, by Theorem 4 there exists $\alpha(\{s\},\,\delta_{s},\,\epsilon)<1$ such that the function $f(\{s\},\,\cdot,\,\delta_{s},\,\cdot):D_{\{s\},\,\epsilon}\rightarrow\mathbb{R}$ defined as

f(\{s\},\,x,\,\delta_{s},\,y)=\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}

satisfies the bound $f(\{s\},\,x,\,\delta_{s},\,y)\leq\alpha(\{s\},\,\delta_{s},\,\epsilon)<1$. In particular, we have

f(\{s\},\,x,\,\delta_{s},\,y)\leq\alpha(\{s\},\,\delta_{s},\,\epsilon)\quad\forall(x,y)\in D_{S,\epsilon}^{\prime},\,s\in S,

which implies

f(S,\,x,\,u_{S},\,y)=\sum_{s\in S}\frac{1}{|S|}\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}=\sum_{s\in S}\frac{1}{|S|}f(\{s\},\,x,\,\delta_{s},\,y)\leq\max_{s\in S}\left\{\alpha(\{s\},\,\delta_{s},\,\epsilon)\right\}<1.

This concludes the proof of the first statement. For the second statement, we see that $f(S,\,\cdot,\,u_{S},\,\cdot)$ is well defined on $E_{S,s,\epsilon}$, and

f(S,x,u_{S},y)=\sum_{s^{\prime}\neq s}\frac{1}{|S|}\left\|\left(I-\frac{(x-P_{C_{s^{\prime}}}x)(x-P_{C_{s^{\prime}}}x)^{T}}{\|x-P_{C_{s^{\prime}}}x\|^{2}}\right)y\right\|^{2}+\frac{1}{|S|}\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}. (7)

Note that

\left\|\left(I-\frac{(x-P_{C_{s^{\prime}}}x)(x-P_{C_{s^{\prime}}}x)^{T}}{\|x-P_{C_{s^{\prime}}}x\|^{2}}\right)y\right\|^{2}\leq 1\quad\forall(x,y)\in E_{S,s,\epsilon},

and by the first statement

\left\|\left(I-\frac{(x-P_{C_{s}}x)(x-P_{C_{s}}x)^{T}}{\|x-P_{C_{s}}x\|^{2}}\right)y\right\|^{2}\leq\alpha^{\prime}(\{s\},\epsilon)\quad\forall(x,y)\in D_{\{s\},\epsilon},

which can then be combined with (7) to conclude the proof. ∎

2.3 Proof of Theorem 1, Theorem 2 and Corollary 1

Roadmap of the proof of Theorem 1: When the local clients run only finitely many iterations, the local updates are not exactly orthogonal projections onto a linear subspace; therefore, the idea of the proof differs from that of Theorem 2. For Theorem 1, the key observation is that when running only one global iteration ($\tau_{g}=1$), the global update essentially randomly selects one of the local updates. From this perspective, one can compare the scheme with the classic RK algorithm applied to a suitable overdetermined linear system to deduce convergence.

Proof of Theorem 1.

Denote by $\mathbb{E}\left(X^{(t+1)}\,|\,S,X^{(t)}\right)$ the expected global position conditioned on the subset $S$ of local clients that is selected. One can observe that

\mathbb{E}\left(X^{(t+1)}\,|\,S,X^{(t)}\right)=\sum_{s\in S}\frac{1}{|S|}X^{(t,T)}_{s}, (8)

where $X^{(t,T)}_{s}$ is the position of client $s$ after $T$ local iterations. The idea of the proof is to compare the dynamics with a suitably chosen instance of the RK algorithm.

Fix $s\in[M]$. Denote by $a_{s,1},\dotsc,a_{s,m(s)}$ the rows at client $s$. Then

\mathbb{E}\|X^{(t,T)}_{s}\|\leq\mathbb{E}\|X^{(t,1)}_{s}\|=\frac{1}{m(s)}\sum_{i=1}^{m(s)}\left\|\left(I-\frac{a_{s,i}^{T}a_{s,i}}{\|a_{s,i}\|^{2}}\right)X^{(t,0)}\right\|. (9)

Using (9), we have

\mathbb{E}\|X^{(t+1)}\|=\sum_{S}\mathbb{E}\left(\|X^{(t+1)}\|\,\big|\,S\right)\mathbb{P}\left(S^{(t)}=S\right)
=\sum_{S}\sum_{s\in S}\frac{1}{|S|}\mathbb{E}\|X^{(t,T)}_{s}\|\,\mathbb{P}\left(S^{(t)}=S\right)
\leq\sum_{S}\sum_{s\in S}\frac{1}{|S|}\frac{1}{m(s)}\sum_{i=1}^{m(s)}\left\|\left(I-\frac{a_{s,i}^{T}a_{s,i}}{\|a_{s,i}\|^{2}}\right)X^{(t,0)}\right\|\mathbb{P}\left(S^{(t)}=S\right).

One can see that the right-hand side is the expected decrease when using RK to solve $A^{\prime}x=0$, where $A^{\prime}$ consists of copies of $A$. One can then deduce the convergence using the convergence analysis of classic RK. ∎

Given $S\subseteq[M]$, we consider $S_{x}\subseteq S$ defined as

S_{x}=\left\{s\in S:x\not\in C_{s}\right\},

and $u_{S_{x}}$ the uniform distribution over $S_{x}$. Then we consider the following function:

g(S,x,y)=\begin{cases}f(S_{x},x,u_{S_{x}},y),&\mbox{if $S_{x}\neq\emptyset$}\\ 1,&\mbox{if $S_{x}=\emptyset$}\end{cases},\qquad g(x,y)=\sum_{S\subseteq[M]}\frac{1}{\binom{M}{|S|}}g(S,\,x,\,y). (10)

One can see that $g(x,y)$ characterizes the convergence rate when $\tau_{g}=1$; for instance, see the proof of Corollary 2. We prove two lemmas and then prove Theorem 2 as a consequence of these lemmas. Let us briefly describe the proof strategy.

Roadmap of the proof of Theorem 2: Fix the collection of linear subspaces at the local clients, $\{C_{s}\}_{s=1}^{M}$. First, we demonstrate that for an overdetermined system one can find a small enough $\epsilon>0$ such that the $\epsilon$-neighborhoods of these linear subspaces have empty intersection on the unit sphere; this is done in Lemma 1. Using Corollary 3, we know that if the (normalized) position at a given federated round is at least $\epsilon$ away from some linear subspace $C_{s}$, then projecting onto $C_{s}$ is guaranteed to shrink the distance between the current position and the solution by a constant factor. By Lemma 1, an arbitrary point on the unit sphere is at least $\epsilon$ away from some linear subspace. Hence, at each federated round there is always some chance of selecting a client $s$ whose $C_{s}$ is $\epsilon$-away from the normalized global position, and orthogonally projecting onto this $C_{s}$ shrinks the current position by a uniform factor. Using this fact, we show that on average we can shrink the current position by a constant factor regardless of the position at the start of the federated round; this is done in Lemma 2. Finally, everything is put together with the scaling argument from Corollary 2 to obtain Theorem 2. To deduce Corollary 1 for underdetermined systems, the key observation is to decompose the initial position as $x=x_{1}+x_{2}$, where $x_{1}\in C=\cap_{s}C_{s}$ and $x_{2}\in C^{\perp}$. One can see that $x_{1}$ is a fixed point for the orthogonal projections onto the $C_{s}$'s, and that these projections restricted to $C^{\perp}$ reduce to a determined system; these observations allow us to deduce Corollary 1 from Theorem 2.

Lemma 1.

Let $\{C_{s}\}_{s=1}^{M}$ be a collection of linear subspaces such that $\cap_{s=1}^{M}C_{s}=\{0\}$. Then there exists $\epsilon>0$ such that

\bigcap_{s\in[M]}\left\{x\in S^{n-1}:d(x,\,C_{s}\cap S^{n-1})<\epsilon\right\}=\emptyset.

Proof.

Define $\tilde{d}:S^{n-1}\rightarrow\mathbb{R}$ as

\tilde{d}(x)=\max_{s\in[M]}d(x,\,C_{s}\cap S^{n-1});

one can see that $\tilde{d}$ is continuous. Since $\left(\cap_{s=1}^{M}C_{s}\right)\cap S^{n-1}=\emptyset$, one has

\tilde{d}(x)>0\quad\forall x\in S^{n-1}.

This further implies, by compactness of $S^{n-1}$, that there exists $\epsilon>0$ such that $\tilde{d}(x)\geq\epsilon$ for all $x\in S^{n-1}$. This concludes our proof. ∎

Lemma 2.

Given $0<\epsilon\ll 1$, denote $K_{s,\epsilon}=\{x\in S^{n-1}:d(x,C_{s}\cap S^{n-1})\geq\epsilon\}$ for all $s\in[M]$. Then $g:S^{n-1}\times S^{n-1}\rightarrow\mathbb{R}$ defined in (10) satisfies

g(x,y)\leq\beta<1,\quad\forall(x,y)\in\bigcup_{s\in[M]}\left(K_{s,\epsilon}\times K_{s,\epsilon}\right).

Proof.

Given $x\in K_{s,\epsilon}$, one has

s\in S\quad\Rightarrow\quad s\in S_{x},

which further implies

g(x,y)=\sum_{S\subseteq[M]}\frac{1}{\binom{M}{|S|}}g(S,x,y)=\sum_{S\ni s}\frac{1}{\binom{M}{|S|}}f(S_{x},x,u_{S_{x}},y)+\sum_{S\not\ni s}\frac{1}{\binom{M}{|S|}}g(S,x,y).

By Corollary 3, one has

f(S_{x},x,u_{S_{x}},y)\leq\gamma(s,\epsilon)\quad\forall(x,y)\in E_{S_{x},s,\epsilon},

which can be combined with the fact that there are $\binom{M-1}{|S|-1}$ subsets of $[M]$ of size $|S|$ containing $s$ to get

g(x,y)\leq\frac{\binom{M-1}{|S|-1}}{\binom{M}{|S|}}\gamma(s,\epsilon)+\frac{\binom{M}{|S|}-\binom{M-1}{|S|-1}}{\binom{M}{|S|}}=\beta_{s,\epsilon}<1,\quad(x,y)\in K_{s,\epsilon}\times K_{s,\epsilon}.

One can then take $\beta_{\epsilon}=\max_{s\in[M]}\{\beta_{s,\epsilon}\}$ to conclude the proof. ∎

Now, we combine Lemma 1 and Lemma 2 to prove Theorem 2.

Proof of Theorem 2.

First, we assume that the server runs one global iteration ($\tau_{g}=1$). Let $\epsilon>0$ be small enough so that Lemma 1 holds. For $K_{s,\epsilon}$ defined in Lemma 2, one has

S^{n-1}\setminus\left(\bigcup_{s\in[M]}K_{s,\epsilon}\right)=\bigcap_{s\in[M]}K_{s,\epsilon}^{c}=\emptyset

by Lemma 1; this shows that $\{K_{s,\epsilon}\}_{s\in[M]}$ covers $S^{n-1}$, and therefore

(x,x)\in\cup_{s\in[M]}\left(K_{s,\epsilon}\times K_{s,\epsilon}\right)\quad\forall x\in S^{n-1}.

By Lemma 2, one sees that

\mathbb{E}\left(\left.\|X^{(t+1)}\|^{2}\,\right|X^{(t)}\right)=g(X^{(t)},X^{(t)})\leq\beta_{\epsilon}<1,\quad\mbox{if}\quad\|X^{(t)}\|=1.

One can then iterate as in the proof of Corollary 2 to conclude the proof. For general $\tau_{g}\in\mathbb{N}$, we observe that the operator norm of an orthogonal projection is at most $1$, and therefore the convergence result for general $\tau_{g}\in\mathbb{N}$ follows from the case $\tau_{g}=1$. ∎

Finally, we demonstrate how one can deduce the convergence behavior for underdetermined systems.

Proof of Corollary 1.

Given an initial $X^{(0)}\in\mathbb{R}^{n}$, we consider the orthogonal decomposition $X^{(0)}=P_{C^{\perp}}X^{(0)}+P_{C}X^{(0)}$, where $C$ is defined as in the statement of Corollary 1. Then one has

X^{(1)}=\sum_{s\in S^{(0)}}\frac{1}{|S^{(0)}|}\left(I-\frac{(X^{(0)}-P_{C_{s}}X^{(0)})(X^{(0)}-P_{C_{s}}X^{(0)})^{T}}{\|X^{(0)}-P_{C_{s}}X^{(0)}\|^{2}}\right)X^{(0)} (11)
=\sum_{s\in S^{(0)}}\frac{1}{|S^{(0)}|}\left(I-\frac{(X^{(0)}-P_{C_{s}}X^{(0)})(X^{(0)}-P_{C_{s}}X^{(0)})^{T}}{\|X^{(0)}-P_{C_{s}}X^{(0)}\|^{2}}\right)P_{C^{\perp}}X^{(0)}+P_{C}X^{(0)}, (12)

since $C\subseteq C_{i}$ for all $i\in\{1,\dotsc,M\}$ and $P_{C}X^{(0)}$ is a fixed point of the projections onto the $C_{i}$'s. On the other hand, restricted to the linear subspace $P_{C^{\perp}}(\mathbb{R}^{n})=C^{\perp}$, we have $C^{\perp}\cap C=\{0\}$, so the problem reduces to a determined system, and we can apply Theorem 2 to deduce

\mathbb{E}\|X^{(t+1)}-P_{C}X^{(0)}\|^{2}\leq\beta^{t+1}\|P_{C^{\perp}}X^{(0)}\|^{2}\leq\beta^{t+1}\|X^{(0)}\|^{2},\quad t=0,1,\dotsc,

for some $0<\beta<1$. ∎

3 Experiments

In the following, we demonstrate experimentally the convergence of FedRK and its behavior when applied to sparse approximation problems and least squares problems. Finally, we apply Algorithm 2 to real data and compare the result with Lasso.

3.1 FedRK

In this section, we consider solving an overdetermined consistent system. We consider $A\in\mathbb{R}^{2048\times 1024}$, $x^{*}\in\mathbb{R}^{1024}$ and $b=Ax^{*}\in\mathbb{R}^{2048}$, where the entries of $A$ and $x^{*}$ are generated as i.i.d. standard Gaussians. The data is distributed evenly to $M=16$ clients. At each federated round, five local clients are selected to participate in the update, and each of them runs $\tau$ local iterations. The server runs 20 global iterations after receiving the local updates. This demonstrates the convergence behavior of Algorithm 1, and the results suggest that running more local iterations may improve the convergence rate. The results are shown in Figure 2.

Figure 2: FedRK with clients running different numbers of local iterations

3.2 Sparse approximation problems

In this section, we consider the sparse signal recovery problem. For certain types of signals, one can find a good basis so that these signals can be approximated by sparse representations, and an important question is how to reconstruct such sparse approximations [14]. One class of algorithms for this problem is iterative hard thresholding [15]. Given a sparsity level $s\in\mathbb{N}$, the objective is to solve

\min_{x\in\mathbb{R}^{n}}\|b-Ax\|_{2}^{2}\quad\mbox{subject to}\quad\|x\|_{0}\leq s,

where $\|\cdot\|_{0}$ counts the number of non-zero entries, and the iterative hard thresholding algorithm performs a gradient descent step followed by hard thresholding (projection onto the $s$ largest entries) in each iteration. Recently, it has been proposed to replace the gradient descent step with an RK projection [7, 8]. Our experiment demonstrates that such a strategy can be extended to the federated setting. We consider the measurement matrix $A\in\mathbb{R}^{256\times 1024}$, $x^{*}\in\mathbb{R}^{1024}$ the target sparse solution, $e\in\mathbb{R}^{256}$ the measurement noise, and $b=Ax^{*}+0.01e\in\mathbb{R}^{256}$ the perturbed observation; the entries of $A$ and $e$, and the non-zero entries of $x^{*}$, are all generated from the standard Gaussian. Data is distributed evenly to $16$ clients, and at each federated round, $5$ local clients participate. Each of the local clients runs $20$ local RK iterations, and the global server runs $20$ global iterations after receiving the local updates. The sparsity level in this experiment is $9$, and we demonstrate that Algorithm 2 can recover the true signal. The experiment is performed 50 times with different initializations and sampling schemes, and we record the number of times each feature is selected in Figure 3.
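For reproducibility, one way to generate the synthetic data described above and split it across the 16 clients is sketched below; the random seed and the even row split via `np.array_split` are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, M = 256, 1024, 9, 16

A = rng.standard_normal((m, n))                   # measurement matrix
x_true = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_true[support] = rng.standard_normal(s)          # 9-sparse ground truth
e = rng.standard_normal(m)
b = A @ x_true + 0.01 * e                         # perturbed observations

# Each client i holds the pair (A_i, b_i) of 16 consecutive rows.
client_data = list(zip(np.array_split(A, M), np.array_split(b, M)))
```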

Figure 3: The number of times each feature was selected in the $50$ trials when Algorithm 2 with sparsity level $s=9$ was applied.

3.3 Least squares

While the main objective of this paper is to solve consistent linear systems, in many real-world applications the right-hand side $b\in\mathbb{R}^{m}$ is corrupted by noise or modeling error and one instead faces an inconsistent system. To assess the robustness of our method under these more realistic conditions, we consider general least squares problems. When the linear system is inconsistent, recall that classical RK does not converge to the least squares solution; however, its iterates do eventually remain within some horizon of the least squares solution [9]. We consider $A\in\mathbb{R}^{2048\times 256}$ and $b\in\mathbb{R}^{2048}$, where the entries of $A$ and $b$ are generated from the standard Gaussian. Our aim is to solve $\min_{x}\|Ax-b\|^{2}$; the data are distributed evenly to $M=16$ clients. In each federated round, all local clients participate. We apply FedRK to the system, and the algorithm converges to a horizon of the least squares solution; moreover, by adding a suitable amount of noise one can shrink the convergence horizon. Specifically, we expand the matrix $A$ by adding Gaussian noise columns,

A^{\prime}=\begin{bmatrix}A&B\end{bmatrix},

where the entries of $B$ are generated as i.i.d. standard Gaussians, and we apply FedRK to solve the extended system $\min_{x}\|A^{\prime}x-b\|^{2}$; note that the perturbation $B$ can be added at the client level, so the strategy is suitable for the federated setting. To motivate our strategy, recall that in high-dimensional space two independent Gaussian vectors are almost orthogonal with high probability, and therefore the columns of $B$ are likely to be almost orthogonal to the column space of $A$; hence, the columns of $B$ capture the components of $b$ that are perpendicular to the column space of $A$. This is inspired by the randomized extended Kaczmarz algorithm [16], where column operations are used to solve the least squares problem; however, column operations are not suitable in the federated setting, and we therefore create the extra matrix $B$ instead. This is similar in spirit to the approach taken in [17], where a Monte Carlo method is combined with RK to solve the least squares problem. The strategy proposed in [17] does not involve column operations, and it would be interesting to see if it can be extended to the federated setting. The results are summarized in Figure 4.
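A sketch of the column augmentation described above is given below; the helper name `augment_with_noise_columns` and the parameter `n_extra` (the number of appended noisy columns) are our own, and choosing `n_extra` corresponds to the sweep over augmented widths in the experiment.

```python
import numpy as np

def augment_with_noise_columns(A, n_extra, rng):
    """Form A' = [A  B], where B has i.i.d. standard Gaussian entries."""
    B = rng.standard_normal((A.shape[0], n_extra))
    return np.hstack([A, B])

# FedRK is then run on min_x ||A' x - b||^2; only the first A.shape[1] coordinates
# of the iterate are compared against the least squares solution of the original problem.
```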

Figure 4: This experiment applies Algorithm 1 to the augmented system $A^{\prime}x=b$ to solve the least squares problem, where $A^{\prime}$ is obtained from $A$ by appending $n-256$ noisy columns. The result shows that we can shrink the convergence horizon by adding a suitable number of noisy columns.

3.4 Prostate cancer data

In this section, we test our algorithm on the prostate cancer data [18], which was used in [10]. The features are 1. intercept (intcpt), 2. log(cancer volume) (lcavol), 3. log(prostate weight) (lweight), 4. age, 5. log(benign prostatic hyperplasia amount) (lbph), 6. seminal vesicle invasion (svi), 7. log(capsular penetration) (lcp), 8. Gleason score (gleason) and 9. percentage of Gleason scores 4 or 5 (pgg45), and the target variable is log(prostate specific antigen) (lpsa). We adapt a more recent use of this data [19], where the Lasso is fit to the data. Let us briefly recall that, given the data matrix $X\in\mathbb{R}^{N\times p}$ and the target vector $y\in\mathbb{R}^{N}$, Lasso solves

\hat{\beta}=\operatorname*{arg\,min}_{\beta}\frac{1}{2}\sum_{i=1}^{N}\Big(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}\Big)^{2}+\lambda\sum_{j=1}^{p}|\beta_{j}|,

where $\lambda\in\mathbb{R}_{\geq 0}$ is the regularization parameter (here we assume that the intercept is included as the first column of $X$); it is known that when $\lambda$ is large, $\hat{\beta}$ tends to be sparse, and a feature is said to be selected if it has a non-zero coefficient. In this case, for the chosen $\lambda\in\mathbb{R}_{\geq 0}$, 4 features are selected: lcavol, lweight, lbph and svi. Following [19], we first standardize the feature columns and then add an extra column of ones to represent the intercept; we distribute the data evenly to 7 local clients. To compare with the result in [19], we set the sparsity level to 5 (so that we include one slot for the intercept). At each federated round, 3 local clients participate and each of them runs 20 iterations using its local data; after receiving the local updates, the server runs 20 iterations of RK and then applies the hard thresholding operator $T_{5}$. We performed 2000 federated rounds and counted the number of times each feature has a non-zero coefficient after hard thresholding. The result is recorded in Figure 5, and the distribution is consistent with the analogous selection of Lasso if we look at the top 5 features.
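For reference, the Lasso baseline can be reproduced with scikit-learn along the following lines. Note that sklearn's `Lasso` scales the quadratic term by $1/(2N)$, so its `alpha` corresponds to $\lambda/N$ in the objective above, and the value `alpha=0.1` below is only a placeholder rather than the $\lambda$ used in [19].

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selected_features(X, y, alpha=0.1):
    """Fit Lasso on standardized features X (without the intercept column) and
    return the indices of features with non-zero coefficients."""
    model = Lasso(alpha=alpha, fit_intercept=True)
    model.fit(X, y)
    return np.flatnonzero(model.coef_)
```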

Figure 5: Feature selection via Algorithm 2.

4 Discussion

In this paper, we proposed a federated algorithm (Algorithm 1) for solving large linear systems and derived its convergence properties. Our experiments showed that, when applied to inconsistent systems, it converges to a horizon of the least squares solution. We also proposed a modified version of our algorithm (Algorithm 2) for sparse approximation problems. We applied Algorithm 2 to real data and showed that it has potential use for feature selection in the federated setting. For future work, it would be interesting to extend the approach taken here to more general optimization problems.

Acknowledgements: DN was partially supported by NSF DMS 2408912.

References

  • McMahan et al. [2017] McMahan, B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, pp. 1273–1282 (2017). PMLR
  • Wang et al. [2021] Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H.B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al.: A field guide to federated optimization. arXiv preprint arXiv:2107.06917 (2021)
  • Karczmarz [1937] Karczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Pol. Sci. Lett., Cl. Sci. Math. Nat., 355–357 (1937)
  • Strohmer and Vershynin [2009] Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications 15(2), 262–278 (2009)
  • Needell et al. [2014] Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Advances in Neural Information Processing Systems 27 (2014)
  • Ma et al. [2015] Ma, A., Needell, D., Ramdas, A.: Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods. SIAM Journal on Matrix Analysis and Applications 36(4), 1590–1604 (2015)
  • Jeong and Needell [2023] Jeong, H., Needell, D.: Linear convergence of reshuffling Kaczmarz methods with sparse constraints. arXiv preprint arXiv:2304.10123 (2023)
  • Zhang et al. [2015] Zhang, Z., Yu, Y., Zhao, S.: Iterative hard thresholding based on randomized Kaczmarz method. Circuits, Systems, and Signal Processing 34, 2065–2075 (2015)
  • Needell [2010] Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics 50, 395–403 (2010)
  • Tibshirani [1996] Tibshirani, R.: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1), 267–288 (1996)
  • Munkres [2014] Munkres, J.: Topology (2nd Edition). Prentice-Hall, NJ (2014)
  • Spivak [1999] Spivak, M.: A Comprehensive Introduction to Differential Geometry. Publish or Perish, Inc., TX (1999)
  • Huang et al. [2024] Huang, L., Li, X., Needell, D.: Randomized Kaczmarz in adversarial distributed setting. SIAM Journal on Scientific Computing 46(3), 354–376 (2024)
  • Needell and Tropp [2009] Needell, D., Tropp, J.A.: CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis 26(3), 301–321 (2009)
  • Blumensath and Davies [2008] Blumensath, T., Davies, M.E.: Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications 14, 629–654 (2008)
  • Zouzias and Freris [2013] Zouzias, A., Freris, N.M.: Randomized extended Kaczmarz for solving least squares. SIAM Journal on Matrix Analysis and Applications 34(2), 773–793 (2013)
  • Epperly et al. [2024] Epperly, E.N., Goldshlager, G., Webber, R.J.: Randomized Kaczmarz with tail averaging. arXiv preprint arXiv:2411.19877 (2024)
  • Stamey et al. [1989] Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A., Yang, N.: Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of Urology 141(5), 1076–1083 (1989)
  • Hastie et al. [2009] Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, NY (2009)