
Correlation Aware Sparsified Mean Estimation Using Random Projection

Shuli Jiang
Robotics Institute
Carnegie Mellon University
shulij@andrew.cmu.edu
Pranay Sharma
ECE
Carnegie Mellon University
pranaysh@andrew.cmu.edu
Gauri Joshi
ECE
Carnegie Mellon University
gaurij@andrew.cmu.edu
Abstract

We study the problem of communication-efficient distributed vector mean estimation, a commonly used subroutine in distributed optimization and Federated Learning (FL). Rand-$k$ sparsification is a widely used technique to reduce communication cost, where each client sends $k < d$ of its coordinates to the server. However, Rand-$k$ is agnostic to any correlations that might exist between clients in practical scenarios. The recently proposed Rand-$k$-Spatial estimator leverages the cross-client correlation information at the server to improve Rand-$k$'s performance. Yet, the performance of Rand-$k$-Spatial is suboptimal. We propose the Rand-Proj-Spatial estimator with a more flexible encoding-decoding procedure, which generalizes the encoding of Rand-$k$ by projecting the client vectors to a random $k$-dimensional subspace. We utilize the Subsampled Randomized Hadamard Transform (SRHT) as the projection matrix and show that Rand-Proj-Spatial with SRHT outperforms Rand-$k$-Spatial, using the correlation information more efficiently. Furthermore, we propose an approach to incorporate varying degrees of correlation and suggest a practical variant of Rand-Proj-Spatial for the case when the correlation information is not available to the server. Experiments on real-world distributed optimization tasks showcase the superior performance of Rand-Proj-Spatial compared to Rand-$k$-Spatial and other more sophisticated sparsification techniques.

1 Introduction

In modern machine learning applications, data is naturally distributed across a large number of edge devices or clients. The underlying learning task in such settings is modeled by distributed optimization or the recent paradigm of Federated Learning (FL) [konevcny16federated; fedavg17aistats; kairouz2021advances; wang2021field]. A crucial subtask in distributed learning is for the server to compute the mean of the vectors sent by the clients. In FL, for example, clients run training steps on their local data and periodically send their local models (or local gradients) to the server, which averages them to compute the new global model. However, with the ever-increasing size of machine learning models [simonyan2014very; brown2020language] and the limited battery life of edge clients, communication cost is often the major constraint for the clients. This motivates the problem of (empirical) distributed mean estimation (DME) under communication constraints, as illustrated in Figure 1. Each of the $n$ clients holds a vector $\mathbf{x}_i \in \mathbb{R}^d$, on which there are no distributional assumptions. Given a communication budget, each client sends a compressed version $\widehat{\mathbf{x}}_i$ of its vector to the server, which utilizes these to compute an estimate of the mean vector $\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$.

Quantization and sparsification are two major techniques for reducing the communication costs of DME. Quantization [gubner1993distributed; davies2021new_bounds_dme_var_reduction; vargaftik2022eden; suresh2017dme_icml] compresses each coordinate of the client vector to a given precision and aims to reduce the number of bits representing each coordinate, achieving a constant-factor reduction in the communication cost. However, the communication cost still remains $\Theta(d)$. Sparsification, on the other hand, aims to reduce the number of coordinates each client sends and compresses each client vector to only $k \ll d$ of its coordinates (e.g., Rand-$k$ [konevcny2018rand_dme]). As a result, sparsification reduces communication costs more aggressively than quantization, achieving a cost of only $O(k)$. While in practice one can combine quantization and sparsification for communication cost reduction, in this work we focus on the more aggressive sparsification techniques. We call $k$, the dimension of the vector each client sends to the server, the per-client communication budget.

Figure 1: The problem of distributed mean estimation under limited communication. Each client $i \in [n]$ encodes its vector $\mathbf{x}_i$ as $\widehat{\mathbf{x}}_i$ and sends this compressed version to the server. The server decodes them to compute an estimate of the true mean $\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$.

Most existing works on sparsification ignore the potential correlation (or similarity) among the client vectors, which often exists in practice. For example, the data of a specific client in federated learning can be similar to that of multiple clients. Hence, it is reasonable to expect their models (or gradients) to be similar as well. To the best of our knowledge, [jhunjhunwala2021dme_spatial_temporal] is the first work to account for spatial correlation across individual client vectors. They propose the Rand-$k$-Spatial family of unbiased estimators, which generalizes Rand-$k$ and achieves a better estimation error in the presence of cross-client correlation. However, their approach is focused only on the server-side decoding procedure, while the clients do simple Rand-$k$ encoding.

In this work, we consider a more general encoding scheme that directly compresses a vector from $\mathbb{R}^d$ to $\mathbb{R}^k$ using a (random) linear map. The encoded vector consists of $k$ linear combinations of the original coordinates. Intuitively, this has a higher chance of capturing the large-magnitude coordinates (“heavy hitters”) of the vector than randomly sampling $k$ out of the $d$ coordinates (Rand-$k$), which is crucial for the estimator to recover the true mean vector. For example, consider a vector where only a few coordinates are heavy hitters. For small $k$, Rand-$k$ has a decent chance of missing all the heavy hitters. But with a linear-maps-based general encoding procedure, the large coordinates are more likely to be encoded in the linear measurements, resulting in a more accurate estimator of the mean vector. Guided by this intuition, we ask:

Can we design an improved joint encoding-decoding scheme that utilizes the correlation information and achieves an improved estimation error?

One naïve solution is to apply the same random rotation matrix $\bm{G} \in \mathbb{R}^{d \times d}$ to each client vector, before applying Rand-$k$ or Rand-$k$-Spatial encoding. Indeed, such preprocessing is applied to improve estimators based on quantization techniques for heterogeneous vectors [suresh2022correlated_dme_icml; suresh2017dme_icml]. However, as we show in Appendix A.1, for sparsification this leads to no improvement. But what happens if every client uses a different random matrix, or applies a random $k \times d$-dimensional linear map? How should the corresponding decoding procedure be designed to leverage cross-client correlation, given that the decoding procedure of Rand-$k$-Spatial cannot be applied directly in such cases? To answer these questions, we propose the Rand-Proj-Spatial family estimator. We propose a flexible encoding procedure in which each client applies its own random linear map to encode the vector. Further, our novel decoding procedure can better leverage cross-client correlation. The resulting mean estimator generalizes and improves over the Rand-$k$-Spatial family estimator.

Next, we discuss some reasonable restrictions we expect our mean estimator to obey. 1) Unbiased. An unbiased mean estimator is theoretically more convenient than a biased one [horvath2021induced]. 2) Non-adaptive. We focus on an encoding procedure that does not depend on the actual client data, as opposed to adaptive ones, e.g., Rand-$k$ with vector-based sampling probabilities [konevcny2018rand_dme; wangni2018grad_sparse]. Designing a data-adaptive encoding procedure is computationally expensive, as it might require an iterative procedure to compute the sampling probabilities [konevcny2018rand_dme]. In practice, however, clients often have limited computational power compared to the server. Further, as discussed earlier, mean estimation is often a subroutine in more complicated tasks. For applications with streaming data [nokleby2018stochastic], the additional computational overhead of adaptive schemes is difficult to sustain. Note that both Rand-$k$ and the Rand-$k$-Spatial family estimator [jhunjhunwala2021dme_spatial_temporal] are unbiased and non-adaptive.

In this paper, we focus on the severely communication-constrained case $nk \leq d$, when the server receives very limited information about any single client vector. If $nk \gg d$, we show in Appendix A.2 that the cross-client information provides no additional advantage in improving the mean estimate under either Rand-$k$-Spatial or Rand-Proj-Spatial, for different choices of random linear maps. Furthermore, when $nk \gg d$, the performance of both estimators converges to that of Rand-$k$. Intuitively, this means that when the server receives sufficient information regarding the client vectors, it does not need to leverage cross-client correlation to improve the mean estimator.

Our contributions can be summarized as follows:

  1. We propose the Rand-Proj-Spatial family estimator with a more flexible encoding-decoding procedure, which can better leverage the cross-client correlation information to achieve a more general and improved mean estimator compared to existing ones.

  2. We show the benefit of using the Subsampled Randomized Hadamard Transform (SRHT) as the random linear map in Rand-Proj-Spatial, in terms of a lower mean squared error (MSE). We theoretically analyze the case when the correlation information is known at the server (see Theorems 4.3, 4.4 and Section 4.3). Further, we propose a practical configuration called Rand-Proj-Spatial(Avg) for when the correlation is unknown.

  3. We conduct experiments on common distributed optimization tasks, and demonstrate the superior performance of Rand-Proj-Spatial compared to existing sparsification techniques.

2 Related Work

Quantization and Sparsification. Commonly used techniques to achieve communication efficiency are quantization, sparsification, or more generic compression schemes that generalize the former two [basu2019qsparse]. Quantization involves either representing each coordinate of the vector by a small number of bits [davies2021new_bounds_dme_var_reduction; vargaftik2022eden; suresh2017dme_icml; alistarh2017qsgd_neurips; bernstein2018signsgd; reisizadeh2020fedpaq_aistats], or more involved vector quantization techniques [shlezinger2020uveqfed_tsp; gandikota2021vqsgd_aistats]. Sparsification [wangni2018grad_sparse; alistarh2018convergence; stich2018sparsified; karimireddy2019error; sattler2019robust], on the other hand, involves communicating a small number $k < d$ of coordinates to the server. Common protocols include Rand-$k$ [konevcny2018rand_dme], which sends $k$ uniformly randomly selected coordinates; Top-$k$ [shi2019topk], which sends the $k$ largest-magnitude coordinates; and a combination of the two [barnes2020rtop_jsait]. Some recent works, with a focus on distributed learning, further refine these communication-saving mechanisms by incorporating temporal correlation [ozfatura2021time] or error feedback [horvath2021induced; karimireddy2019error].

Distributed Mean Estimation (DME). DME has wide applications in distributed optimization and FL. Most of the existing literature on DME either considers statistical mean estimation [zhang2013lower_bd; garg2014comm_neurips], assuming that the data across clients is generated i.i.d. according to the same distribution, or empirical mean estimation [suresh2017dme_icml; chen2020breaking; mayekar2021wyner; jhunjhunwala2021dme_spatial_temporal; konevcny2018rand_dme; vargaftik2021drive_neurips; vargaftik2022eden_icml], without making any distributional assumptions on the data. A recent line of work on empirical DME considers applying additional information available to the server to further improve the mean estimate. This side information includes cross-client correlation [jhunjhunwala2021dme_spatial_temporal; suresh2022correlated_dme_icml], or the memory of the past updates sent by the clients [liang2021improved_isit].

Subsampled Randomized Hadamard Transform (SRHT). SRHT was introduced for random dimensionality reduction via sketching [Ailon2006srht_initial; tropp2011improved; lacotte2020optimal_iter_sketching_srht]. Common applications of SRHT include faster computation of matrix problems, such as low-rank approximation [Balabanov2022block_srht_dist_low_rank; boutsidis2013improved_srht], and machine learning tasks, such as ridge regression [lu2013ridge_reg_srht] and least-squares problems [Chehreghani2020graph_reg_srht; dan2022least_sq_srht; lacotte2020optimal_first_order_srht]. SRHT has also been applied to improve communication efficiency in distributed optimization [ivkin2019distSGD_sketch_neurips] and FL [haddadpour2020fedsketch; rothchild2020fetchsgd_icml].

3 Preliminaries

Notation. We use bold lowercase (uppercase) letters, e.g., $\mathbf{x}$ ($\bm{G}$), to denote vectors (matrices). $\mathbf{e}_j \in \mathbb{R}^d$, for $j \in [d]$, denotes the $j$-th canonical basis vector. $\|\cdot\|_2$ denotes the Euclidean norm. For a vector $\mathbf{x}$, $\mathbf{x}(j)$ denotes its $j$-th coordinate. Given an integer $m$, we denote by $[m]$ the set $\{1, 2, \dots, m\}$.

Problem Setup. Consider $n$ geographically separated clients coordinated by a central server. Each client $i \in [n]$ holds a vector $\mathbf{x}_i \in \mathbb{R}^d$, while the server wants to estimate the mean vector $\bar{\mathbf{x}} \triangleq \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$. Given a per-client communication budget of $k \in [d]$, each client $i$ computes $\widehat{\mathbf{x}}_i$ and sends it to the central server. $\widehat{\mathbf{x}}_i$ is an approximation of $\mathbf{x}_i$ that belongs to a random $k$-dimensional subspace. Each client also sends a random seed to the server, which conveys the subspace information and can usually be communicated using a negligible number of bits. Having received the encoded vectors $\{\widehat{\mathbf{x}}_i\}_{i=1}^{n}$, the server then computes $\widehat{\mathbf{x}} \in \mathbb{R}^d$, an estimator of $\bar{\mathbf{x}}$. We consider the severely communication-constrained setting $nk \leq d$, in which the server sees only a limited amount of information about the client vectors.

Error Metric. We measure the quality of the decoded vector $\widehat{\mathbf{x}}$ using the Mean Squared Error (MSE) $\mathbb{E}\big[\|\widehat{\mathbf{x}} - \bar{\mathbf{x}}\|_2^2\big]$, where the expectation is with respect to all the randomness in the encoding-decoding scheme. Our goal is to design an encoding-decoding algorithm that achieves an unbiased estimate $\widehat{\mathbf{x}}$ (i.e., $\mathbb{E}[\widehat{\mathbf{x}}] = \bar{\mathbf{x}}$) and minimizes the MSE, given the per-client communication budget $k$. As an example, in Rand-$k$ sparsification, each client sends $k$ randomly selected coordinates out of its $d$ coordinates to the server. The server then computes the mean estimate as $\widehat{\mathbf{x}}^{(\text{Rand-}k)} = \frac{1}{n}\frac{d}{k}\sum_{i=1}^{n}\widehat{\mathbf{x}}_i$. By [jhunjhunwala2021dme_spatial_temporal, Lemma 1], the MSE of Rand-$k$ sparsification is given by

\mathbb{E}\Big[\|\widehat{\mathbf{x}}^{(\text{Rand-}k)}-\bar{\mathbf{x}}\|_2^2\Big]=\frac{1}{n^2}\Big(\frac{d}{k}-1\Big)\sum_{i=1}^{n}\|\mathbf{x}_i\|_2^2 \qquad (1)
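
As a concrete illustration, the following minimal numpy sketch (our own, not code from the paper) simulates Rand-$k$ sparsification and checks the empirical MSE of the mean estimate against Eq. (1); the dimensions and number of runs are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10, 64, 4
X = rng.standard_normal((n, d))           # client vectors x_1, ..., x_n (no assumptions on them)
x_bar = X.mean(axis=0)                    # true mean

def rand_k_mean(X, k, rng):
    """Each client keeps k uniformly chosen coordinates; the server rescales by d/k and averages."""
    n, d = X.shape
    est = np.zeros(d)
    for x in X:
        kept = rng.choice(d, size=k, replace=False)   # indices conveyed via a shared random seed
        est[kept] += (d / k) * x[kept]
    return est / n

# Empirical MSE over repeated encodings vs. the closed form (1/n^2)(d/k - 1) sum_i ||x_i||^2
errs = [np.sum((rand_k_mean(X, k, rng) - x_bar) ** 2) for _ in range(2000)]
theory = (1 / n**2) * (d / k - 1) * np.sum(np.linalg.norm(X, axis=1) ** 2)
print(np.mean(errs), theory)              # the two values should be close
```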

The Rand-$k$-Spatial Family Estimator. For large values of $\frac{d}{k}$, the Rand-$k$ MSE in Eq. (1) can be prohibitive. [jhunjhunwala2021dme_spatial_temporal] proposed the Rand-$k$-Spatial family estimator, which achieves an improved MSE by leveraging the knowledge of the correlation between client vectors at the server. The encoded vectors $\{\widehat{\mathbf{x}}_i\}$ are the same as in Rand-$k$. However, the $j$-th coordinate of the decoded vector is given as

\widehat{\mathbf{x}}^{(\text{Rand-}k\text{-Spatial})}(j)=\frac{1}{n}\frac{\bar{\beta}}{T(M_j)}\sum_{i=1}^{n}\widehat{\mathbf{x}}_i(j) \qquad (2)

Here, $T:\mathbb{R}\rightarrow\mathbb{R}$ is a pre-defined transformation function of $M_j$, the number of clients which sent their $j$-th coordinate, and $\bar{\beta}$ is a normalization constant that ensures $\widehat{\mathbf{x}}$ is an unbiased estimator of $\bar{\mathbf{x}}$. The resulting MSE is given by

\mathbb{E}\Big[\|\widehat{\mathbf{x}}^{(\text{Rand-}k\text{-Spatial})}-\bar{\mathbf{x}}\|_2^2\Big]=\frac{1}{n^2}\Big(\frac{d}{k}-1\Big)\sum_{i=1}^{n}\|\mathbf{x}_i\|_2^2+\Big(c_1\sum_{i=1}^{n}\|\mathbf{x}_i\|_2^2-c_2\sum_{i=1}^{n}\sum_{l\neq i}\langle\mathbf{x}_i,\mathbf{x}_l\rangle\Big) \qquad (3)

where $c_1, c_2$ are constants dependent on $n, d, k$ and $T$, but independent of the client vectors $\{\mathbf{x}_i\}_{i=1}^{n}$. When the client vectors are orthogonal, i.e., $\langle\mathbf{x}_i,\mathbf{x}_l\rangle = 0$ for all $i \neq l$, [jhunjhunwala2021dme_spatial_temporal] show that with an appropriately chosen $T$, the MSE in Eq. (3) reduces to Eq. (1). However, if there exists a positive correlation between the vectors, the MSE in Eq. (3) is strictly smaller than that of Rand-$k$ in Eq. (1).
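
To make the decode in Eq. (2) concrete, below is a hedged numpy sketch (our own simplification, not the authors' code): coordinate $j$ is rescaled by $\bar{\beta}/T(M_j)$, where $M_j$ counts how many clients sent it, and $\bar{\beta}$ is calibrated by Monte Carlo so that the estimator is unbiased, mirroring the simulation-based calibration described later in Section 4.3. The choices of $n, d, k$ and the numbers of runs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 10, 64, 4
T = lambda m: m                        # T(m) = m is the "Max" choice; T(m) = 1 recovers Rand-k

def sample_masks(n, d, k, rng):
    """Boolean masks of which k coordinates each client keeps (the Rand-k encoding pattern)."""
    masks = np.zeros((n, d), dtype=bool)
    for i in range(n):
        masks[i, rng.choice(d, size=k, replace=False)] = True
    return masks

# Calibrate beta_bar from unbiasedness: beta_bar = 1 / E[ 1{client 0 keeps coord 0} / T(M_0) ]
runs, acc = 20000, 0.0
for _ in range(runs):
    masks = sample_masks(n, d, k, rng)
    if masks[0, 0]:
        acc += 1.0 / T(masks[:, 0].sum())
beta_bar = runs / acc                  # for T = 1 this recovers d / k

def rand_k_spatial_mean(X, masks, beta_bar):
    M = masks.sum(axis=0)                          # M_j: number of clients sending coordinate j
    sums = (X * masks).sum(axis=0)                 # sum_i xhat_i(j)
    scale = np.zeros(X.shape[1])
    scale[M > 0] = beta_bar / np.array([T(m) for m in M[M > 0]])
    return scale * sums / X.shape[0]               # Eq. (2)

X = rng.standard_normal((n, d))
est = np.mean([rand_k_spatial_mean(X, sample_masks(n, d, k, rng), beta_bar)
               for _ in range(5000)], axis=0)
print(np.max(np.abs(est - X.mean(axis=0))))        # small value: the estimator is (nearly) unbiased
```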

4 The Rand-Proj-Spatial Family Estimator

While the Rand-$k$-Spatial family estimator proposed in [jhunjhunwala2021dme_spatial_temporal] focuses only on improving the decoding at the server, we consider a more general encoding-decoding scheme. Rather than simply communicating $k$ out of the $d$ coordinates of its vector $\mathbf{x}_i$ to the server, client $i$ applies a (random) linear map $\bm{G}_i \in \mathbb{R}^{k \times d}$ to $\mathbf{x}_i$ and sends $\widehat{\mathbf{x}}_i = \bm{G}_i\mathbf{x}_i \in \mathbb{R}^k$ to the server. The decoding process on the server first projects the encoded vectors $\{\bm{G}_i\mathbf{x}_i\}_{i=1}^{n}$ back to the $d$-dimensional space and then forms an estimate $\widehat{\mathbf{x}}$. We motivate our new decoding procedure with the following regression problem:

\widehat{\mathbf{x}}^{(\text{Rand-Proj})}=\operatorname*{arg\,min}_{\mathbf{x}}\sum_{i=1}^{n}\|\bm{G}_i\mathbf{x}-\bm{G}_i\mathbf{x}_i\|_2^2 \qquad (4)

To understand the motivation behind Eq. (4), first consider the special case where $\bm{G}_i = \bm{I}_d$ for all $i \in [n]$, that is, the clients communicate their vectors without compression. The server can then exactly compute the mean $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$. Equivalently, $\bar{\mathbf{x}}$ is the solution of $\operatorname*{arg\,min}_{\mathbf{x}}\sum_{i=1}^{n}\|\mathbf{x}-\mathbf{x}_i\|_2^2$. In the more general setting, we require that the mean estimate $\widehat{\mathbf{x}}$, when encoded using the map $\bm{G}_i$, should be “close” to the encoded vector $\bm{G}_i\mathbf{x}_i$ originally sent by client $i$, for all clients $i \in [n]$.

We note that the above intuition can also be translated into different regression problems to motivate the design of the new decoding procedure. We discuss in Appendix B.2 intuitive alternatives which, unfortunately, either do not enable the usage of cross-client correlation information, or do not use such information effectively. We choose the formulation in Eq. (4) due to its analytical tractability and its direct relevance to our target error metric, the MSE. We note that it is also possible to consider the problem in Eq. (4) in other norms, such as the sum of $\ell_2$ norms (without the squares) or the $\ell_\infty$ norm. We leave this as a future direction to explore.

The solution to Eq. (4) is given by $\widehat{\mathbf{x}}^{(\text{Rand-Proj})} = (\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i)^{\dagger}\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i\mathbf{x}_i$, where ${\dagger}$ denotes the Moore-Penrose pseudo-inverse [golub2013matrix_book]. However, while $\widehat{\mathbf{x}}^{(\text{Rand-Proj})}$ minimizes the error of the regression problem, our goal is to design an unbiased estimator that also improves the MSE. Therefore, we make the following two modifications to $\widehat{\mathbf{x}}^{(\text{Rand-Proj})}$. First, to ensure that the mean estimate is unbiased, we scale the solution by a normalization factor $\bar{\beta}$ (we show in Appendix B.1 that it suffices for $\bar{\beta}$ to be a scalar). Second, to incorporate varying degrees of correlation among the clients, we apply a scalar transformation function $T:\mathbb{R}\rightarrow\mathbb{R}$ to each of the eigenvalues of $\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$. The resulting Rand-Proj-Spatial family estimator is given by

\widehat{\mathbf{x}}^{(\text{Rand-Proj-Spatial})}=\bar{\beta}\Big(T\big(\textstyle\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i\big)\Big)^{\dagger}\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i\mathbf{x}_i \qquad (5)

Applying the transformation function $T$ in Rand-Proj-Spatial requires computing the eigendecomposition of $\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$. However, this computation happens only at the server, which has more computational power than the clients. Next, we observe that for an appropriate choice of $\{\bm{G}_i\}_{i=1}^{n}$, the Rand-Proj-Spatial family estimator reduces to the Rand-$k$-Spatial family estimator [jhunjhunwala2021dme_spatial_temporal].
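
The following is a hedged numpy sketch of the decode in Eq. (5) (our own illustration, not the paper's implementation): eigendecompose $\bm{S}=\sum_i\bm{G}_i^T\bm{G}_i$, apply $T$ to its nonzero eigenvalues, pseudo-invert, and apply the result to $\sum_i\bm{G}_i^T\bm{G}_i\mathbf{x}_i$. Rather than relying on the closed-form constants from the theorems, $\bar{\beta}$ is calibrated by Monte Carlo from the unbiasedness condition, as the paper itself does in its simulations; the Gaussian maps and sizes below are placeholder assumptions.

```python
import numpy as np

def decode(Gs, encoded, T, beta_bar=1.0, eps=1e-10):
    """Eq. (5): beta_bar * (T(S))^dagger * sum_i G_i^T (G_i x_i), with T acting on S's eigenvalues."""
    S = sum(G.T @ G for G in Gs)                       # d x d, rank at most n*k
    y = sum(G.T @ z for G, z in zip(Gs, encoded))      # lies in the range of S
    lam, U = np.linalg.eigh(S)
    Tlam = np.array([T(l) if l > eps else 0.0 for l in lam])   # the kernel of S does not contribute
    inv = np.zeros_like(Tlam)
    inv[np.abs(Tlam) > eps] = 1.0 / Tlam[np.abs(Tlam) > eps]
    return beta_bar * (U @ (inv * (U.T @ y)))

def calibrate_beta(sample_maps, T, d, runs=300, rng=None):
    """Estimate beta_bar so that E[decode] matches the true mean (checked with identical client vectors)."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(d)
    acc = np.zeros(d)
    for _ in range(runs):
        Gs = sample_maps(rng)
        acc += decode(Gs, [G @ x for G in Gs], T)
    return float(x @ x) / float(x @ (acc / runs))

n, d, k = 8, 64, 4
rng = np.random.default_rng(3)
gauss_maps = lambda r: [r.standard_normal((k, d)) / np.sqrt(d) for _ in range(n)]  # placeholder maps
T_identity = lambda lam: lam                           # the full-correlation choice of T (Section 4.1)
beta = calibrate_beta(gauss_maps, T_identity, d, rng=rng)

X = rng.standard_normal((n, d))                        # arbitrary client vectors
acc, reps = np.zeros(d), 3000
for _ in range(reps):
    Gs = gauss_maps(rng)
    acc += decode(Gs, [G @ x_i for G, x_i in zip(Gs, X)], T_identity, beta)
print(np.linalg.norm(acc / reps - X.mean(axis=0)))     # small: the estimator is (approximately) unbiased
```

Replacing the placeholder Gaussian maps with the SRHT construction of Eq. (6) below gives the estimator analyzed in the rest of this section.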

Lemma 4.1 (Recovering Rand-$k$-Spatial).

Suppose client $i$ generates a subsampling matrix $\bm{E}_i = [\mathbf{e}_{i_1}, \dots, \mathbf{e}_{i_k}]^{\top}$, where $\{\mathbf{e}_j\}_{j=1}^{d}$ are the canonical basis vectors, and $\{i_1, \dots, i_k\}$ are sampled from $\{1, \dots, d\}$ without replacement. The encoded vectors are given as $\widehat{\mathbf{x}}_i = \bm{E}_i\mathbf{x}_i$. Given a function $T$, $\widehat{\mathbf{x}}$ computed as in Eq. (5) recovers the Rand-$k$-Spatial estimator.

The proof details are in Appendix C.5. We discuss the choice of $T$ and how it compares to Rand-$k$-Spatial in detail in Section 4.3.

Remark 4.2.

In the simple case when the $\bm{G}_i$'s are subsampling matrices (as in Rand-$k$-Spatial [jhunjhunwala2021dme_spatial_temporal]), the $j$-th diagonal entry of $\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$, namely $M_j$, conveys the number of clients which sent the $j$-th coordinate. Rand-$k$-Spatial incorporates correlation among client vectors by applying a function $T$ to $M_j$. Intuitively, this means scaling different coordinates differently. This is in contrast to Rand-$k$, which scales all the coordinates by $d/k$. In our more general case, we apply a function $T$ to the eigenvalues of $\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$ to similarly incorporate correlation in Rand-Proj-Spatial.

To showcase the utility of the Rand-Proj-Spatial family estimator, we propose to set the random linear maps $\bm{G}_i$ to be scaled Subsampled Randomized Hadamard Transform (SRHT) matrices (see, e.g., [tropp2011improved]). Assuming $d$ to be a power of 2, the linear map $\bm{G}_i$ is given as

\bm{G}_i=\frac{1}{\sqrt{d}}\bm{E}_i\bm{H}\bm{D}_i\in\mathbb{R}^{k\times d} \qquad (6)

where $\bm{E}_i \in \mathbb{R}^{k \times d}$ is the subsampling matrix, $\bm{H} \in \mathbb{R}^{d \times d}$ is the (deterministic) Hadamard matrix, and $\bm{D}_i \in \mathbb{R}^{d \times d}$ is a diagonal matrix with independent Rademacher random variables as its diagonal entries. We choose SRHT due to its superior performance compared to other random matrices. Other possible choices of random matrices for the Rand-Proj-Spatial estimator include sketching matrices commonly used for dimensionality reduction, such as Gaussian [weinberger2004learning; tripathy2016gaussian], row-normalized Gaussian, and Count Sketch [minton2013improved_bounds_countsketch] matrices, as well as error-correction coding matrices, such as Low-Density Parity Check (LDPC) [gallager1962LDPC] and Fountain Codes [Shokrollahi2005fountain_codes]. However, in the absence of correlation between client vectors, all these matrices suffer a higher MSE.
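
A hedged sketch of the scaled SRHT map in Eq. (6) is given below (assuming scipy's Hadamard helper and $d$ a power of two; the sizes are illustrative). In practice one would apply a fast Walsh-Hadamard transform in $O(d\log d)$ time rather than materializing $\bm{H}$; the dense form is used here only for clarity.

```python
import numpy as np
from scipy.linalg import hadamard

def srht(d, k, rng):
    """Return G_i = (1/sqrt(d)) E_i H D_i as a dense (k, d) matrix."""
    H = hadamard(d)                                 # +/-1 Hadamard matrix with H @ H.T = d * I
    signs = rng.choice([-1.0, 1.0], size=d)         # diagonal of D_i (Rademacher variables)
    rows = rng.choice(d, size=k, replace=False)     # rows kept by the subsampling matrix E_i
    return (H[rows, :] * signs) / np.sqrt(d)        # E_i H D_i, scaled by 1/sqrt(d)

rng = np.random.default_rng(4)
d, k = 64, 8
G = srht(d, k, rng)
x = rng.standard_normal(d)
print(G.shape, (G @ x).shape)                       # the client transmits G @ x (k numbers) plus its seed
```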

In the following, we first compare the MSE of Rand-Proj-Spatial with SRHT against Rand-$k$ and Rand-$k$-Spatial in two extreme cases: when all the client vectors are identical, and when all the client vectors are orthogonal to each other. In both cases, we highlight the transformation function $T$ used in Rand-Proj-Spatial (Eq. (5)) to incorporate the knowledge of cross-client correlation. We define

\mathcal{R}:=\frac{\sum_{i=1}^{n}\sum_{l\neq i}\langle\mathbf{x}_i,\mathbf{x}_l\rangle}{\sum_{i=1}^{n}\|\mathbf{x}_i\|_2^2} \qquad (7)

to measure the correlation between the client vectors. Note that $\mathcal{R} \in [-1, n-1]$. $\mathcal{R} = 0$ implies all client vectors are orthogonal, while $\mathcal{R} = n-1$ implies identical client vectors.
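
For reference, a small sketch computing $\mathcal{R}$ from Eq. (7) for a set of client vectors (the toy vectors below are just an example):

```python
import numpy as np

def correlation_R(X):
    """X has shape (n, d); returns sum_{i != l} <x_i, x_l> / sum_i ||x_i||_2^2."""
    gram = X @ X.T
    return (gram.sum() - np.trace(gram)) / np.trace(gram)

X = np.eye(4)[[0, 0, 1, 2]]      # two clients share e_1; the others hold e_2 and e_3
print(correlation_R(X))          # 0.5; n identical unit vectors would give n - 1
```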

4.1 Case I: Identical Client Vectors ($\mathcal{R} = n-1$)

When all the client vectors are identical ($\mathbf{x}_i \equiv \mathbf{x}$), [jhunjhunwala2021dme_spatial_temporal] showed that setting the transformation $T$ to identity, i.e., $T(m) = m$ for all $m$, leads to the minimum MSE in the Rand-$k$-Spatial family of estimators. The resulting estimator is called Rand-$k$-Spatial(Max). Under the same setting, using the same transformation $T$ in Rand-Proj-Spatial with SRHT, the decoded vector in Eq. (5) simplifies to

\widehat{\mathbf{x}}^{(\text{Rand-Proj-Spatial})}=\bar{\beta}\Big(\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i\Big)^{\dagger}\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i\mathbf{x}=\bar{\beta}\bm{S}^{\dagger}\bm{S}\mathbf{x}, \qquad (8)

where $\bm{S} := \sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$. By construction, $\text{rank}(\bm{S}) \leq nk$, and we focus on the case $nk \leq d$.

Limitation of Subsampling matrices. As mentioned above, with $\bm{G}_i = \bm{E}_i$ for all $i \in [n]$, we recover the Rand-$k$-Spatial family of estimators. In this case, $\bm{S}$ is a diagonal matrix, where each diagonal entry $\bm{S}_{jj} = M_j$, $j \in [d]$, is the number of clients which sent their $j$-th coordinate to the server. To ensure $\text{rank}(\bm{S}) = nk$, we need $\bm{S}_{jj} \leq 1$ for all $j$, i.e., each of the $d$ coordinates is sent by at most one client. If all the clients sample their matrices $\{\bm{E}_i\}_{i=1}^{n}$ independently, this happens with probability $\frac{\binom{d}{nk}}{\binom{d}{k}^n}$. As an example, for $k = 1$, $\text{Prob}(\text{rank}(\bm{S}) = n) = \frac{\binom{d}{n}}{d^n} \leq \frac{1}{n!}$ (because $\frac{d^n}{n^n} \leq \binom{d}{n} \leq \frac{d^n}{n!}$). Therefore, to guarantee that $\bm{S}$ is full-rank, each client would need the subsampling information of all the other clients. This not only requires additional communication but also has serious privacy implications. Essentially, the limitation with subsampling matrices $\bm{E}_i$ is that the eigenvectors of $\bm{S}$ are restricted to be canonical basis vectors $\{\mathbf{e}_j\}_{j=1}^{d}$. Generalizing the $\bm{G}_i$'s to general rank-$k$ matrices relaxes this constraint, and hence we can ensure that $\bm{S}$ is full-rank with high probability. In the next result, we show the benefit of choosing the $\bm{G}_i$ as SRHT matrices. We call the resulting estimator Rand-Proj-Spatial(Max).
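
The contrast above can be checked numerically. The hedged simulation below (our own, with illustrative $d, n, k$) estimates how often $\bm{S}$ reaches rank $nk$ when the maps are plain subsampling matrices versus SRHT:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(5)
d, n, k, trials = 64, 8, 4, 200          # nk = 32 <= d

def srht(d, k, rng):
    H = hadamard(d)
    signs = rng.choice([-1.0, 1.0], size=d)
    rows = rng.choice(d, size=k, replace=False)
    return (H[rows, :] * signs) / np.sqrt(d)

def subsample(d, k, rng):
    return np.eye(d)[rng.choice(d, size=k, replace=False), :]

for name, make in [("subsampling", subsample), ("SRHT", srht)]:
    full = 0
    for _ in range(trials):
        Gs = [make(d, k, rng) for _ in range(n)]
        S = sum(G.T @ G for G in Gs)
        full += int(np.linalg.matrix_rank(S) == n * k)
    print(name, full / trials)           # SRHT reaches rank nk in (nearly) every trial; subsampling rarely does
```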

Theorem 4.3 (MSE under Full Correlation).

Consider $n$ clients, each holding the same vector $\mathbf{x} \in \mathbb{R}^d$. Suppose we set $T(\lambda) = \lambda$, $\bar{\beta} = \frac{d}{k}$ in Eq. (5), and the random linear map $\bm{G}_i$ at each client to be an SRHT matrix. Let $\delta$ be the probability that $\bm{S} = \sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$ does not have full rank. Then, for $nk \leq d$,

\mathbb{E}\Big[\|\widehat{\mathbf{x}}^{(\text{Rand-Proj-Spatial(Max)})}-\bar{\mathbf{x}}\|_2^2\Big]\leq\Big[\frac{d}{(1-\delta)nk+\delta k}-1\Big]\|\mathbf{x}\|_2^2 \qquad (9)

The proof details are in Appendix C.1. To compare the performance of Rand-Proj-Spatial(Max) against Rand-$k$, we show in Appendix C.2 that for $n \geq 2$, as long as $\delta \leq \frac{2}{3}$, the MSE of Rand-Proj-Spatial(Max) is less than that of Rand-$k$. Furthermore, in Appendix C.3 we empirically demonstrate that with $d \in \{32, 64, 128, \dots, 1024\}$ and different values of $nk \leq d$, the rank of $\bm{S}$ is full with high probability, i.e., $\delta \approx 0$. This implies $\mathbb{E}[\|\widehat{\mathbf{x}}^{(\text{Rand-Proj-Spatial(Max)})}-\bar{\mathbf{x}}\|_2^2] \approx (\frac{d}{nk}-1)\|\mathbf{x}\|_2^2$.

Furthermore, since setting $\bm{G}_i$ as SRHT significantly increases the probability of recovering $nk$ coordinates of $\mathbf{x}$, the MSE of Rand-Proj-Spatial with SRHT (Eq. (9)) is strictly less than that of Rand-$k$-Spatial (Eq. (3)). We also compare the MSEs of the three estimators in Figure 2 in the following setting: $\|\mathbf{x}\|_2 = 1$, $d = 1024$, $n \in \{10, 20, 50, 100\}$, and small $k$ values such that $nk < d$.

Figure 2: MSE comparison of the Rand-$k$, Rand-$k$-Spatial(Max) and Rand-Proj-Spatial(Max) estimators, when all clients have identical vectors (maximum inter-client correlation).

4.2 Case II: Orthogonal Client Vectors ($\mathcal{R} = 0$)

When all the client vectors are orthogonal to each other, [jhunjhunwala2021dme_spatial_temporal] showed that Rand-$k$ has the lowest MSE among the Rand-$k$-Spatial family of decoders. We show in the next result that if we set the random linear map $\bm{G}_i$ at client $i$ to be SRHT, and choose the fixed transformation $T \equiv 1$ as in [jhunjhunwala2021dme_spatial_temporal], Rand-Proj-Spatial achieves the same MSE as that of Rand-$k$.

Theorem 4.4 (MSE under No Correlation).

Consider $n$ clients, each holding a vector $\mathbf{x}_i \in \mathbb{R}^d$, for all $i \in [n]$. Suppose we set $T \equiv 1$, $\bar{\beta} = \frac{d^2}{k}$ in Eq. (5), and the random linear map $\bm{G}_i$ at each client to be an SRHT matrix. Then, for $nk \leq d$,

\mathbb{E}\Big[\|\widehat{\mathbf{x}}^{(\text{Rand-Proj-Spatial})}-\bar{\mathbf{x}}\|_2^2\Big]=\frac{1}{n^2}\Big(\frac{d}{k}-1\Big)\sum_{i=1}^{n}\|\mathbf{x}_i\|_2^2. \qquad (10)

The proof details are in Appendix C.4. Theorem 4.4 above shows that with zero correlation among client vectors, Rand-Proj-Spatial achieves the same MSE as that of Rand-$k$.

4.3 Incorporating Varying Degrees of Correlation

In practice, it is unlikely that all the client vectors are either identical or orthogonal to each other. In general, there is some “imperfect” correlation among the client vectors, i.e., $\mathcal{R} \in (0, n-1)$. Given the correlation level $\mathcal{R}$, [jhunjhunwala2021dme_spatial_temporal] show that the estimator from the Rand-$k$-Spatial family that minimizes the MSE is given by the following transformation:

T(m)=1+\frac{\mathcal{R}}{n-1}(m-1) \qquad (11)

Recall from Sections 4.2 and 4.1 that setting $T(m) = 1$ (respectively, $T(m) = m$) leads to the estimator in the Rand-$k$-Spatial family that minimizes the MSE when there is zero (respectively, maximum) correlation among the client vectors. We observe that the function $T$ defined in Eq. (11) essentially interpolates between these two extreme cases, using the normalized degree of correlation $\frac{\mathcal{R}}{n-1} \in [-\frac{1}{n-1}, 1]$ as the weight. This motivates us to apply the same function $T$ defined in Eq. (11) to the eigenvalues of $\bm{S} = \sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$ in Rand-Proj-Spatial. As we shall see in our results, the resulting Rand-Proj-Spatial family estimator improves over the MSE of both the Rand-$k$ and Rand-$k$-Spatial family estimators.
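
A short sketch of this interpolating transformation (our own illustration); in the decode of Eq. (5), make_T would be applied to the eigenvalues of $\bm{S}$, as in the earlier sketch:

```python
def make_T(R, n):
    """Eq. (11): interpolate between T = 1 (R = 0) and T(m) = m (R = n - 1)."""
    return lambda m: 1.0 + (R / (n - 1)) * (m - 1.0)

n = 21
for R in [0.0, 10.0, 20.0]:
    T = make_T(R, n)
    print(R, [round(T(m), 2) for m in (1, 2, 5)])   # R = 0 gives all ones; R = n - 1 gives m itself
```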

We note that deriving a closed-form expression for the MSE of Rand-Proj-Spatial with SRHT in the general case, with the transformation function $T$ in Eq. (11), is hard (we elaborate on this in Appendix B.3), as it requires a closed-form expression for the non-asymptotic distributions of the eigenvalues and eigenvectors of the random matrix $\bm{S}$. To the best of our knowledge, previous analyses of SRHT, for example in [Ailon2006srht_initial; tropp2011improved; lacotte2020optimal_iter_sketching_srht; lacotte2020optimal_first_order_srht; lei20srht_topk_aaai], rely on asymptotic properties of SRHT, such as the limiting eigenvalue spectrum, or on concentration bounds on the singular values, to derive asymptotic or approximate guarantees. However, to analyze the MSE of Rand-Proj-Spatial, we need an exact, non-asymptotic characterization of the eigenvalue and eigenvector distributions of SRHT. Given the apparent intractability of the theoretical analysis, we compare the MSE of Rand-Proj-Spatial, Rand-$k$-Spatial, and Rand-$k$ via simulations.

Simulations. In each experiment, we first simulate $\bar{\beta}$ in Eq. (5), which ensures our estimator is unbiased, based on 1000 random runs. Given the degree of correlation $\mathcal{R}$, we then compute the squared error $\|\widehat{\mathbf{x}}^{(\text{Rand-Proj-Spatial})}-\bar{\mathbf{x}}\|_2^2$, where Rand-Proj-Spatial uses SRHT matrices $\bm{G}_i$ (Eq. (6)) and $T$ as in Eq. (11). We plot the average over 1000 random runs as an approximation of the MSE. Each client holds a $d$-dimensional canonical basis vector $\mathbf{e}_j$ for some $j \in [d]$, so two clients hold either identical or orthogonal vectors. We control the degree of correlation $\mathcal{R}$ by changing the number of clients which hold the same vector. We consider $d = 1024$ and $n \in \{21, 51\}$. We consider positive correlation values, where $\mathcal{R}$ is chosen to be linearly spaced within $[0, n-1]$. Hence, for $n = 21$, we use $\mathcal{R} \in \{4, 8, 12, 16\}$, and for $n = 51$, we use $\mathcal{R} \in \{10, 20, 30, 40\}$. All results are presented in Figure 3. As expected, given $\mathcal{R}$, Rand-Proj-Spatial consistently achieves a lower MSE than the lowest possible MSE from the Rand-$k$-Spatial family decoder. Additional results with different values of $n, d, k$, including the setting $nk \ll d$, can be found in Appendix B.4.

Figure 3: MSE comparison of the estimators Rand-$k$, Rand-$k$-Spatial(Opt), and Rand-Proj-Spatial, given the degree of correlation $\mathcal{R}$. Rand-$k$-Spatial(Opt) denotes the estimator that gives the lowest possible MSE from the Rand-$k$-Spatial family. We consider $d = 1024$, number of clients $n \in \{21, 51\}$, and $k$ values such that $nk < d$. In each plot, we fix $n, k, d$ and vary the degree of positive correlation $\mathcal{R}$. The y-axis represents the MSE. Notice that since each client has a fixed $\|\mathbf{x}_i\|_2 = 1$, and Rand-$k$ does not leverage cross-client correlation, the MSE of Rand-$k$ in each plot remains the same for different $\mathcal{R}$.

A Practical Configuration. In reality, it is hard to know the correlation information $\mathcal{R}$ among the client vectors. [jhunjhunwala2021dme_spatial_temporal] use the transformation function which interpolates to the middle point between the full-correlation and no-correlation cases, i.e., $T(m) = 1 + \frac{n}{2}\frac{m-1}{n-1}$. Rand-$k$-Spatial with this $T$ is called Rand-$k$-Spatial(Avg). Following this approach, in practical settings we evaluate Rand-Proj-Spatial with SRHT using this $T$, and call the resulting estimator Rand-Proj-Spatial(Avg) (see Figure 4).

5 Experiments

We consider three practical distributed optimization tasks for evaluation: distributed power iteration, distributed $k$-means, and distributed linear regression. We compare Rand-Proj-Spatial(Avg) against Rand-$k$, Rand-$k$-Spatial(Avg), and two more sophisticated but widely used sparsification schemes: non-uniform coordinate-wise gradient sparsification [wangni2018grad_sparse] (which we call Rand-$k$(Wangni)) and the Induced compressor with Rand-$k$ + Top-$k$ [horvath2021induced]. The results are presented in Figure 4.

Dataset. For both distributed power iteration and distributed $k$-means, we use the test set of the Fashion-MNIST dataset [xiao2017fashion], consisting of 10,000 samples. The original images from Fashion-MNIST are $28 \times 28$ in size. We preprocess and resize each image to $32 \times 32$. Resizing images to have dimensions that are a power of 2 is a common technique in computer vision to accelerate the convolution operation. We use the UJIndoor dataset (https://archive.ics.uci.edu/ml/datasets/ujiindoorloc) for distributed linear regression. We subsample 10,000 data points and use the first 512 out of the total 520 features on phone-call signals. The task is to predict the longitude of the location of a phone call. In all the experiments in Figure 4, the datasets are split IID across the clients via random shuffling. In Appendix D.1, we report additional results for non-IID data splits across the clients.

Setup and Metric. Recall that $n$ denotes the number of clients, $k$ the per-client communication budget, and $d$ the vector dimension. For Rand-Proj-Spatial, we use the first 50 iterations to estimate $\bar{\beta}$ (see Eq. (5)). Note that $\bar{\beta}$ only depends on $n, k, d$, and $T$ (the transformation function in Eq. (5)), but is independent of the dataset. We repeat the experiments across 10 independent runs, and report the mean MSE (solid lines) and one standard deviation (shaded regions) for each estimator. For each task, we plot the squared error of the mean estimator $\widehat{\mathbf{x}}$, i.e., $\|\widehat{\mathbf{x}}-\bar{\mathbf{x}}\|_2^2$, and the values of the task-specific loss function, detailed below.

Tasks and Settings:

1. Distributed power iteration. We estimate the principal eigenvector of the covariance matrix, with the dataset (Fashion-MNIST) distributed across the $n$ clients. In each iteration, each client computes a local principal-eigenvector estimate based on a single power iteration and sends an encoded version to the server. The server then computes a global estimate and sends it back to the clients. The task-specific loss here is $\|\mathbf{v}_t - \mathbf{v}_{top}\|_2$, where $\mathbf{v}_t$ is the global estimate of the principal eigenvector at iteration $t$, and $\mathbf{v}_{top}$ is the true principal eigenvector. (A minimal sketch of this compressed aggregation loop appears after the settings note below.)

2. Distributed $k$-means. We perform $k$-means clustering [balcan2013distributed] with the data distributed across $n$ clients (Fashion-MNIST, 10 classes) using Lloyd's algorithm. At each iteration, each client performs a single iteration of $k$-means to find its local centroids and sends the encoded version to the server. The server then computes an estimate of the global centroids and sends them back to the clients. We report the average squared mean-estimation error across the 10 clusters, and the $k$-means loss, i.e., the sum of the squared distances of the data points to the centroids.

For both distributed power iteration and distributed $k$-means, we run the experiments for 30 iterations and consider two different settings: $n = 10, k = 102$ and $n = 50, k = 20$.
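
Below is a hedged sketch of the distributed power-iteration loop described in task 1, with synthetic local datasets and the Rand-$k$ baseline standing in for the compared estimators; swapping in the Rand-Proj-Spatial(Avg) decode from Section 4 is what Figure 4 evaluates. All sizes are illustrative.

```python
import numpy as np

def randk_mean(V, k, rng):
    """Baseline compressed mean of the rows of V: each row keeps k random coordinates, rescaled by d/k."""
    n, d = V.shape
    est = np.zeros(d)
    for v in V:
        kept = rng.choice(d, size=k, replace=False)
        est[kept] += (d / k) * v[kept]
    return est / n

rng = np.random.default_rng(6)
n, d, k, iters = 10, 64, 16, 30
u = rng.standard_normal(d); u /= np.linalg.norm(u)               # planted principal direction
data = [rng.standard_normal((100, d)) + 3.0 * np.outer(rng.standard_normal(100), u) for _ in range(n)]
covs = [D.T @ D / len(D) for D in data]                          # local covariance matrices
v = rng.standard_normal(d); v /= np.linalg.norm(v)
for _ in range(iters):
    local = np.stack([C @ v for C in covs])                      # one local power-iteration step per client
    v = randk_mean(local, k, rng)                                # compressed aggregation at the server
    v /= np.linalg.norm(v)

v_top = np.linalg.eigh(sum(covs) / n)[1][:, -1]                  # true principal eigenvector
print(min(np.linalg.norm(v - v_top), np.linalg.norm(v + v_top))) # task-specific loss, up to sign
```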

Figure 4: Experiment results on three distributed optimization tasks: distributed power iteration, distributed $k$-means, and distributed linear regression. The first two use the Fashion-MNIST dataset with the images resized to $32 \times 32$, hence $d = 1024$. Distributed linear regression uses the UJIndoor dataset with $d = 512$. All experiments are repeated for 10 random runs; we report the mean as solid lines and one standard deviation as the shaded region. The violet line in the plots represents our proposed Rand-Proj-Spatial(Avg) estimator.
Figure 5: The corresponding wall-clock time to encode and decode client vectors (in seconds) using different sparsification schemes, across the three tasks.

3. Distributed linear regression. We perform linear regression on the UJIndoor dataset distributed across $n$ clients using SGD. At each iteration, each client computes a local gradient and sends an encoded version to the server. The server computes a global estimate of the gradient, performs an SGD step, and sends the updated parameter to the clients. We run the experiments for 50 iterations with learning rate 0.001. The task-specific loss is the linear regression loss, i.e., the empirical mean squared error. To use a scale that better showcases the difference in performance of the estimators, we plot the results starting from the 10th iteration.
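
A hedged sketch of this compressed-gradient SGD loop follows (our own, with synthetic data and the Rand-$k$ baseline standing in for UJIndoor and the compared estimators; the learning rate is tuned for the synthetic problem, not the 0.001 used above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k, lr, iters = 10, 32, 4, 0.1, 50
w_true = rng.standard_normal(d)

def make_client(rng):
    A = rng.standard_normal((200, d))
    return A, A @ w_true + 0.1 * rng.standard_normal(200)        # local features and noisy targets

data = [make_client(rng) for _ in range(n)]

def randk(g, k, rng):
    """Rand-k compression of a single gradient vector (unbiased, rescaled by d/k)."""
    kept = rng.choice(len(g), size=k, replace=False)
    out = np.zeros_like(g)
    out[kept] = (len(g) / k) * g[kept]
    return out

w = np.zeros(d)
for _ in range(iters):
    grads = [2 * A.T @ (A @ w - y) / len(y) for A, y in data]    # local least-squares gradients
    g_hat = np.mean([randk(g, k, rng) for g in grads], axis=0)   # compressed aggregation at the server
    w -= lr * g_hat                                              # SGD step on the global parameter
print(np.mean([np.mean((A @ w - y) ** 2) for A, y in data]))     # empirical mean-squared loss
```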

Results. It is evident from Figure 4 that Rand-Proj-Spatial(Avg), our estimator with the practical configuration of $T$ (see Section 4.3) that does not require knowledge of the actual degree of correlation among clients, consistently outperforms the other estimators in all three tasks. Additional experiments for the three tasks are included in Appendix D.1. Furthermore, we present the wall-clock time to encode and decode client vectors using different sparsification schemes in Figure 5. Though Rand-Proj-Spatial(Avg) has the longest decoding time, its encoding time is less than that of the adaptive Rand-$k$(Wangni) sparsifier. In practice, the server has more computational power than the clients and hence can afford a longer decoding time. Therefore, it is more important to have efficient encoding procedures.

6 Limitations

We note two practical limitations of the proposed Rand-Proj-Spatial.
1) Computation Time of Rand-Proj-Spatial. The encoding time of Rand-Proj-Spatial is $O(kd)$, while the decoding time is $O(d^2 \cdot nk)$. The computation bottleneck in decoding is computing the eigendecomposition of the $d \times d$ matrix $\sum_{i=1}^{n}\bm{G}_i^T\bm{G}_i$ of rank at most $nk$. Improving the computation time of both the encoding and decoding schemes is an important direction for future work.
2) Perfect Shared Randomness. It is common to assume perfect shared randomness between the server and the clients in distributed settings [zhou2022ldp_sparse_vec_agg]. However, to perfectly simulate randomness using a Pseudo-Random Number Generator (PRNG), at least $\log_2 d$ bits of the seed need to be exchanged in practice. We acknowledge this gap between theory and practice.
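
For concreteness, the shared-randomness convention amounts to the following (a minimal sketch with a placeholder Gaussian map; the same idea applies to the SRHT maps of Eq. (6)): the client sends a short seed instead of the $k \times d$ map, and the server regenerates an identical $\bm{G}_i$ from it.

```python
import numpy as np

d, k = 64, 8

def map_from_seed(seed, k, d):
    rng = np.random.default_rng(seed)            # PRNG seeded identically at client and server
    return rng.standard_normal((k, d)) / np.sqrt(d)

seed = 12345                                     # communicated alongside the k-dimensional payload
G_client = map_from_seed(seed, k, d)
G_server = map_from_seed(seed, k, d)
print(np.array_equal(G_client, G_server))        # True: both sides hold the same linear map
```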

7 Conclusion

In this paper, we propose the Rand-Proj-Spatial estimator, a novel encoding-decoding scheme, for communication-efficient distributed mean estimation. The proposed client-side encoding generalizes and improves the commonly used Rand-$k$ sparsification, by utilizing projections onto general $k$-dimensional subspaces. On the server side, cross-client correlation is leveraged to improve the approximation error. Compared to existing methods, the proposed scheme consistently achieves better mean estimation error across a variety of tasks. Potential future directions include improving the computation time of Rand-Proj-Spatial and exploring whether the proposed Rand-Proj-Spatial achieves the optimal estimation error among the class of non-adaptive estimators, given correlation information. Furthermore, combining sparsification and quantization techniques and deriving such algorithms with the optimal communication cost-estimation error trade-offs would be interesting.

Acknowledgments

We would like to thank the anonymous reviewers for providing valuable feedback on the title of this work, interesting open problems, alternative motivating regression problems and practical limitations of shared randomness. This work was supported in part by NSF grants CCF 2045694, CCF 2107085, CNS-2112471, and ONR N00014-23-1-2149.

References

  • (1) Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
  • (2) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
  • (3) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • (4) Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
  • (5) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • (6) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • (7) John A Gubner. Distributed estimation and quantization. IEEE Transactions on Information Theory, 39(4):1456–1459, 1993.
  • (8) Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. New bounds for distributed mean estimation and variance reduction, 2021.
  • (9) Shay Vargaftik, Ran Ben Basat, Amit Portnoy, Gal Mendelson, Yaniv Ben-Itzhak, and Michael Mitzenmacher. Eden: Communication-efficient and robust distributed mean estimation for federated learning, 2022.
  • (10) Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation with limited communication. In International conference on machine learning, pages 3329–3337. PMLR, 2017.
  • (11) Jakub Konečnỳ and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs. communication. Frontiers in Applied Mathematics and Statistics, 4:62, 2018.
  • (12) Divyansh Jhunjhunwala, Ankur Mallick, Advait Harshal Gadhikar, Swanand Kadhe, and Gauri Joshi. Leveraging spatial and temporal correlations in sparsified mean estimation. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • (13) Ananda Theertha Suresh, Ziteng Sun, Jae Ro, and Felix Yu. Correlated quantization for distributed mean estimation and optimization. In International Conference on Machine Learning, pages 20856–20876. PMLR, 2022.
  • (14) Samuel Horváth and Peter Richtarik. A better alternative to error feedback for communication-efficient distributed learning. In International Conference on Learning Representations, 2021.
  • (15) Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems, 31, 2018.
  • (16) Matthew Nokleby and Waheed U Bajwa. Stochastic optimization from distributed streaming data in rate-limited networks. IEEE transactions on signal and information processing over networks, 5(1):152–167, 2018.
  • (17) Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019.
  • (18) Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in neural information processing systems, 30, 2017.
  • (19) Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018.
  • (20) Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pages 2021–2031. PMLR, 2020.
  • (21) Nir Shlezinger, Mingzhe Chen, Yonina C Eldar, H Vincent Poor, and Shuguang Cui. Uveqfed: Universal vector quantization for federated learning. IEEE Transactions on Signal Processing, 69:500–514, 2020.
  • (22) Venkata Gandikota, Daniel Kane, Raj Kumar Maity, and Arya Mazumdar. vqsgd: Vector quantized stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 2197–2205. PMLR, 2021.
  • (23) Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. Advances in Neural Information Processing Systems, 31, 2018.
  • (24) Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. Advances in Neural Information Processing Systems, 31, 2018.
  • (25) Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR, 2019.
  • (26) Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems, 31(9):3400–3413, 2019.
  • (27) Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, and Simon See. Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772, 2019.
  • (28) Leighton Pate Barnes, Huseyin A Inan, Berivan Isik, and Ayfer Özgür. rtop-k: A statistical estimation approach to distributed sgd. IEEE Journal on Selected Areas in Information Theory, 1(3):897–907, 2020.
  • (29) Emre Ozfatura, Kerem Ozfatura, and Deniz Gündüz. Time-correlated sparsification for communication-efficient federated learning. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 461–466. IEEE, 2021.
  • (30) Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. Advances in Neural Information Processing Systems, 26, 2013.
  • (31) Ankit Garg, Tengyu Ma, and Huy Nguyen. On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems, 27, 2014.
  • (32) Wei-Ning Chen, Peter Kairouz, and Ayfer Ozgur. Breaking the communication-privacy-accuracy trilemma. Advances in Neural Information Processing Systems, 33:3312–3324, 2020.
  • (33) Prathamesh Mayekar, Ananda Theertha Suresh, and Himanshu Tyagi. Wyner-ziv estimators: Efficient distributed mean estimation with side-information. In International Conference on Artificial Intelligence and Statistics, pages 3502–3510. PMLR, 2021.
  • (34) Shay Vargaftik, Ran Ben-Basat, Amit Portnoy, Gal Mendelson, Yaniv Ben-Itzhak, and Michael Mitzenmacher. Drive: One-bit distributed mean estimation. Advances in Neural Information Processing Systems, 34:362–377, 2021.
  • (35) Shay Vargaftik, Ran Ben Basat, Amit Portnoy, Gal Mendelson, Yaniv Ben Itzhak, and Michael Mitzenmacher. Eden: Communication-efficient and robust distributed mean estimation for federated learning. In International Conference on Machine Learning, pages 21984–22014. PMLR, 2022.
  • (36) Kai Liang and Youlong Wu. Improved communication efficiency for distributed mean estimation with side information. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 3185–3190. IEEE, 2021.
  • (37) Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’06, page 557–563, New York, NY, USA, 2006. Association for Computing Machinery.
  • (38) Joel A. Tropp. Improved analysis of the subsampled randomized hadamard transform, 2011.
  • (39) Jonathan Lacotte, Sifan Liu, Edgar Dobriban, and Mert Pilanci. Optimal iterative sketching with the subsampled randomized hadamard transform, 2020.
  • (40) Oleg Balabanov, Matthias Beaupère, Laura Grigori, and Victor Lederer. Block subsampled randomized Hadamard transform for low-rank approximation on distributed architectures. working paper or preprint, October 2022.
  • (41) Christos Boutsidis and Alex Gittens. Improved matrix algorithms via the subsampled randomized hadamard transform, 2013.
  • (42) Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized hadamard transform. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • (43) Mostafa Haghir Chehreghani. Subsampled randomized hadamard transform for regression of dynamic graphs. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, page 2045–2048, New York, NY, USA, 2020. Association for Computing Machinery.
  • (44) Dan Teng, Xiaowei Zhang, Li Cheng, and Delin Chu. Least squares approximation via sparse subsampled randomized hadamard transform. IEEE Transactions on Big Data, 8(2):446–457, 2022.
  • (45) Jonathan Lacotte and Mert Pilanci. Optimal randomized first-order methods for least-squares problems, 2020.
  • (46) Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, et al. Communication-efficient distributed sgd with sketching. Advances in Neural Information Processing Systems, 32, 2019.
  • (47) Farzin Haddadpour, Belhal Karimi, Ping Li, and Xiaoyun Li. Fedsketch: Communication-efficient and private federated learning via sketching. arXiv preprint arXiv:2008.04975, 2020.
  • (48) Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pages 8253–8265. PMLR, 2020.
  • (49) Gene H Golub and Charles F Van Loan. Matrix computations. JHU press, 2013.
  • (50) Kilian Q Weinberger, Fei Sha, and Lawrence K Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the twenty-first international conference on Machine learning, page 106, 2004.
  • (51) Rohit Tripathy, Ilias Bilionis, and Marcial Gonzalez. Gaussian processes with built-in dimensionality reduction: Applications to high-dimensional uncertainty propagation. Journal of Computational Physics, 321:191–223, 2016.
  • (52) Gregory T. Minton and Eric Price. Improved concentration bounds for count-sketch, 2013.
  • (53) R. Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, 1962.
  • (54) Amin Shokrollahi. Fountain codes. Iee Proceedings-communications - IEE PROC-COMMUN, 152, 01 2005.
  • (55) Zijian Lei and Liang Lan. Improved subsampled randomized hadamard transform for linear svm. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4519–4526, 2020.
  • (56) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • (57) Maria-Florina F Balcan, Steven Ehrlich, and Yingyu Liang. Distributed kk-means and kk-median clustering on general topologies. Advances in neural information processing systems, 26, 2013.
  • (58) Mingxun Zhou, Tianhao Wang, T-H. Hubert Chan, Giulia Fanti, and Elaine Shi. Locally differentially private sparse vector aggregation. In 2022 IEEE Symposium on Security and Privacy (SP), pages 422–439, 2022.
  • (59) Gene H. Golub. Some modified matrix eigenvalue problems. SIAM Review, 15(2):318–334, 1973.
  • (60) Ming Gu and Stanley C. Eisenstat. A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM Journal on Matrix Analysis and Applications, 15(4):1266–1276, 1994.
  • (61) Peter Arbenz, Walter Gander, and Gene H. Golub. Restricted rank modification of the symmetric eigenvalue problem: Theoretical considerations. Linear Algebra and its Applications, 104:75–95, 1988.
  • (62) H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data, 2023.

Appendix A Additional Details on Motivation in Introduction

A.1 Preprocessing all client vectors by the same random matrix does not improve performance

Consider nn clients. Suppose client ii holds a vector 𝐱id{\mathbf{x}}_{i}\in\mathbb{R}^{d}. We want to apply Rand-kk or Rand-kk-Spatial, while also making the encoding process more flexible than just randomly choosing kk out of dd coordinates. One naïve way of doing this is for each client to pre-process its vector by applying an orthogonal matrix 𝑮d×d{\bm{G}}\in\mathbb{R}^{d\times d} that is the same across all clients. Such a technique might be helpful in improving the performance of quantization because the MSE due to quantization often depends on how uniform the coordinates of 𝐱i{\mathbf{x}}_{i}’s are, i.e. whether the coordinates of 𝐱i{\mathbf{x}}_{i} have values close to each other. 𝑮{\bm{G}} is designed to be the random matrix (e.g. SRHT) that rotates 𝐱i{\mathbf{x}}_{i} and makes its coordinates uniform.

Each client sends the server 𝐱^i=𝑬i𝑮𝐱i\widehat{{\mathbf{x}}}_{i}={\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}, where 𝑬ik×d{\bm{E}}_{i}\in\mathbb{R}^{k\times d} is the subsampling matrix. If we use Rand-kk, the server can decode each client vector by first applying the decoding procedure of Rand-kk and then rotating it back to the original space, i.e., 𝐱^i(Naïve)=dk𝑮T𝑬iT𝑬i𝑮𝐱i\widehat{{\mathbf{x}}}_{i}^{\text{(Na\"{i}ve)}}=\frac{d}{k}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}. Note that

𝔼[𝐱^i(Naïve)]=dk𝔼[𝑮T𝑬iT𝑬i𝑮𝐱i]\displaystyle\mathbb{E}[\widehat{{\mathbf{x}}}_{i}^{\text{(Na\"{i}ve)}}]=\frac{d}{k}\mathbb{E}[{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}]
=dk𝑮Tkd𝑰d𝑮𝐱i\displaystyle=\frac{d}{k}{\bm{G}}^{T}\frac{k}{d}{\bm{I}}_{d}{\bm{G}}{\mathbf{x}}_{i}
=𝐱i.\displaystyle={\mathbf{x}}_{i}.

Hence, 𝐱^i(Naïve)\widehat{{\mathbf{x}}}_{i}^{\text{(Na\"{i}ve)}} is unbiased. The MSE of 𝐱^(Naïve)=1ni=1n𝐱^i(Naïve)\widehat{{\mathbf{x}}}^{\text{(Na\"{i}ve)}}=\frac{1}{n}\sum_{i=1}^{n}\widehat{{\mathbf{x}}}_{i}^{\text{(Na\"{i}ve)}} is given as

𝔼𝐱¯𝐱^(Naïve)22=𝔼1ni=1n𝐱i1ndki=1n𝑮T𝑬iT𝑬i𝑮𝐱i22\displaystyle\mathbb{E}\left\|\bar{{\mathbf{x}}}-\widehat{{\mathbf{x}}}^{\text{(Na\"{i}ve)}}\right\|_{2}^{2}=\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{i}-\frac{1}{n}\frac{d}{k}\sum_{i=1}^{n}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}\right\|_{2}^{2}
=1n2𝔼i=1n𝐱idki=1n𝑮T𝑬iT𝑬i𝑮𝐱i22\displaystyle=\frac{1}{n^{2}}\mathbb{E}\left\|\sum_{i=1}^{n}{\mathbf{x}}_{i}-\frac{d}{k}\sum_{i=1}^{n}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}\right\|_{2}^{2}
=1n2{d2k2𝔼i=1n𝑮T𝑬iT𝑬i𝑮𝐱i2i=1n𝐱i2}\displaystyle=\frac{1}{n^{2}}\left\{\frac{d^{2}}{k^{2}}\mathbb{E}\left\|\sum_{i=1}^{n}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}\right\|^{2}-\left\|\sum_{i=1}^{n}{\mathbf{x}}_{i}\right\|^{2}\right\}
=1n2{d2k2(i=1n𝔼𝑮T𝑬iT𝑬i𝑮𝐱i22+il𝔼𝑮T𝑬iT𝑬i𝑮𝐱i,𝑮T𝑬lT𝑬l𝑮𝐱l)i=1n𝐱i2}.\displaystyle=\frac{1}{n^{2}}\left\{\frac{d^{2}}{k^{2}}\left(\sum_{i=1}^{n}\mathbb{E}\|{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}\|_{2}^{2}+\sum_{i\neq l}\mathbb{E}\left\langle{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i},{\bm{G}}^{T}{\bm{E}}_{l}^{T}{\bm{E}}_{l}{\bm{G}}{\mathbf{x}}_{l}\right\rangle\right)-\left\|\sum_{i=1}^{n}{\mathbf{x}}_{i}\right\|^{2}\right\}. (12)

Next, we bound the first term in Eq. 12.

𝔼𝑮T𝑬iT𝑬i𝑮𝐱i22=𝔼[𝐱iT𝑮T𝑬iT𝑬i𝑮𝑮T𝑬iT𝑬i𝑮𝐱i]=𝔼[𝐱iT𝑮T𝑬iT𝑬i𝑬iT𝑬i𝑮𝐱i]\displaystyle\mathbb{E}\|{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}\|_{2}^{2}=\mathbb{E}[{\mathbf{x}}_{i}^{T}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}]=\mathbb{E}[{\mathbf{x}}_{i}^{T}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i}]
=𝐱iT𝑮T𝔼[(𝑬iT𝑬i)2]𝑮𝐱i\displaystyle={\mathbf{x}}_{i}^{T}{\bm{G}}^{T}\mathbb{E}[({\bm{E}}_{i}^{T}{\bm{E}}_{i})^{2}]{\bm{G}}{\mathbf{x}}_{i}
=𝐱iTkd𝑰d𝐱i\displaystyle={\mathbf{x}}_{i}^{T}\frac{k}{d}{\bm{I}}_{d}{\mathbf{x}}_{i} ((𝑬iT𝑬i)2=𝑬iT𝑬i\because({\bm{E}}_{i}^{T}{\bm{E}}_{i})^{2}={\bm{E}}_{i}^{T}{\bm{E}}_{i})
=kd𝐱i22\displaystyle=\frac{k}{d}\|{\mathbf{x}}_{i}\|_{2}^{2} (13)

The second term in Eq. 12 can also be simplified as follows.

𝔼[𝑮T𝑬iT𝑬i𝑮𝐱i,𝑮T𝑬lT𝑬l𝑮𝐱l]\displaystyle\mathbb{E}[\langle{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}}{\mathbf{x}}_{i},{\bm{G}}^{T}{\bm{E}}_{l}^{T}{\bm{E}}_{l}{\bm{G}}{\mathbf{x}}_{l}\rangle]
=𝑮T𝔼[𝑬iT𝑬i]𝑮𝐱i,𝑮T𝔼[𝑬lT𝑬l]𝑮𝐱l\displaystyle=\langle{\bm{G}}^{T}\mathbb{E}[{\bm{E}}_{i}^{T}{\bm{E}}_{i}]{\bm{G}}{\mathbf{x}}_{i},{\bm{G}}^{T}\mathbb{E}[{\bm{E}}_{l}^{T}{\bm{E}}_{l}]{\bm{G}}{\mathbf{x}}_{l}\rangle
=𝑮Tkd𝑰d𝑮𝐱i,𝑮Tkd𝑰d𝑮𝐱l\displaystyle=\langle{\bm{G}}^{T}\frac{k}{d}{\bm{I}}_{d}{\bm{G}}{\mathbf{x}}_{i},{\bm{G}}^{T}\frac{k}{d}{\bm{I}}_{d}{\bm{G}}{\mathbf{x}}_{l}\rangle
=k2d2𝐱i,𝐱l.\displaystyle=\frac{k^{2}}{d^{2}}\langle{\mathbf{x}}_{i},{\mathbf{x}}_{l}\rangle. (14)

Plugging Eq. 13 and Eq. 14 into Eq. 12, we get the MSE is

𝔼𝐱¯𝐱^(Naïve)22\displaystyle\mathbb{E}\|\bar{{\mathbf{x}}}-\widehat{{\mathbf{x}}}^{\text{(Na\"{i}ve)}}\|_{2}^{2}
=1n2{d2k2(i=1nkd𝐱i22+2i=1nl=i+1nk2d2𝐱i,𝐱l)i=1n𝐱i2}\displaystyle=\frac{1}{n^{2}}\Big{\{}\frac{d^{2}}{k^{2}}\Big{(}\sum_{i=1}^{n}\frac{k}{d}\|{\mathbf{x}}_{i}\|_{2}^{2}+2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\frac{k^{2}}{d^{2}}\langle{\mathbf{x}}_{i},{\mathbf{x}}_{l}\rangle\Big{)}-\left\|\sum_{i=1}^{n}{\mathbf{x}}_{i}\right\|^{2}\Big{\}}
=1n2(dk1)i=1n𝐱i22,\displaystyle=\frac{1}{n^{2}}(\frac{d}{k}-1)\sum_{i=1}^{n}\|{\mathbf{x}}_{i}\|_{2}^{2},

which is exactly the same MSE as that of Rand-kk. The problem is that if each client applies the same rotation matrix 𝑮{\bm{G}}, simply rotating the vectors does not change the 2\ell_{2} norm of the estimation error, and hence the MSE. Similarly, if one applies Rand-kk-Spatial on top of this pre-processing, one ends up with exactly the same MSE as Rand-kk-Spatial without it. Hence, we need to design a new decoding procedure when the encoding procedure at the clients is more flexible.
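
The following short Python sketch (not part of the released code; the toy sizes and the orthogonal-matrix construction are our own choices) numerically illustrates the calculation above: pre-rotating every client vector by the same orthogonal matrix and decoding with dk𝑮T𝑬iT𝑬i𝑮\frac{d}{k}{\bm{G}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{G}} leaves the Rand-kk MSE unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, trials = 4, 64, 8, 2000

X = rng.normal(size=(n, d))                   # client vectors x_1, ..., x_n
x_bar = X.mean(axis=0)
G, _ = np.linalg.qr(rng.normal(size=(d, d)))  # one orthogonal matrix shared by all clients

def mean_squared_error(rotate):
    err = 0.0
    for _ in range(trials):
        est = np.zeros(d)
        for i in range(n):
            v = G @ X[i] if rotate else X[i]            # optional pre-rotation
            idx = rng.choice(d, size=k, replace=False)  # E_i: keep k coordinates
            dec = np.zeros(d)
            dec[idx] = (d / k) * v[idx]                 # (d/k) E_i^T E_i v
            if rotate:
                dec = G.T @ dec                         # rotate back to the original space
            est += dec / n
        err += np.sum((est - x_bar) ** 2)
    return err / trials

print("Rand-k MSE                    :", mean_squared_error(rotate=False))
print("pre-rotated Rand-k MSE        :", mean_squared_error(rotate=True))
print("(1/n^2)(d/k - 1) sum ||x_i||^2:", (d / k - 1) / n**2 * np.sum(X ** 2))
```

Both empirical values should concentrate around the same Rand-kk MSE as the number of trials grows.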

A.2 nkdnk\gg d is not interesting

One can rewrite i=1n𝑮iT𝑮i\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i} in the Rand-Proj-Spatial estimator (Eq. 5) as i=1n𝑮iT𝑮i=j=1nk𝐠j𝐠jT\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}=\sum_{j=1}^{nk}{\mathbf{g}}_{j}{\mathbf{g}}_{j}^{T}, where 𝐠jd{\mathbf{g}}_{j}\in\mathbb{R}^{d} and 𝐠(i1)k+1,,𝐠ik{\mathbf{g}}_{(i-1)k+1},\dots,{\mathbf{g}}_{ik} are the rows of 𝑮i{\bm{G}}_{i}. Since, when nkdnk\gg d, j=1nk𝐠j𝐠jT𝔼[j=1nk𝐠j𝐠jT]\sum_{j=1}^{nk}{\mathbf{g}}_{j}{\mathbf{g}}_{j}^{T}\rightarrow\mathbb{E}[\sum_{j=1}^{nk}{\mathbf{g}}_{j}{\mathbf{g}}_{j}^{T}] by the Law of Large Numbers, one way to see the limiting MSE of Rand-Proj-Spatial when nknk is large is to approximate j=1nk𝐠j𝐠jT\sum_{j=1}^{nk}{\mathbf{g}}_{j}{\mathbf{g}}_{j}^{T} by its expectation.

By Lemma 4.1, when 𝑮i=𝑬i{\bm{G}}_{i}={\bm{E}}_{i}, Rand-Proj-Spatial recovers Rand-kk-Spatial. We now discuss the limiting behavior of Rand-kk-Spatial when nkdnk\gg d by leveraging our proposed Rand-Proj-Spatial. In this case, each 𝐠j{\mathbf{g}}_{j} can be viewed as a random basis vector 𝐞w{\mathbf{e}}_{w} for ww chosen uniformly at random from [d][d]. Hence, j=1nk𝐠j𝐠jT𝔼[j=1nk𝐠j𝐠jT]=j=1nk1d𝑰d=nkd𝑰d\sum_{j=1}^{nk}{\mathbf{g}}_{j}{\mathbf{g}}_{j}^{T}\rightarrow\mathbb{E}[\sum_{j=1}^{nk}{\mathbf{g}}_{j}{\mathbf{g}}_{j}^{T}]=\sum_{j=1}^{nk}\frac{1}{d}{\bm{I}}_{d}=\frac{nk}{d}{\bm{I}}_{d}. The scalar β¯\bar{\beta} in Eq. 5 that ensures an unbiased estimator is then computed as

β¯𝔼[(nkd𝑰d)𝑮iT𝑮i]=𝑰d\displaystyle\bar{\beta}\mathbb{E}[(\frac{nk}{d}{\bm{I}}_{d})^{{\dagger}}{\bm{G}}_{i}^{T}{\bm{G}}_{i}]={\bm{I}}_{d}
β¯dnk𝑰d𝔼[𝑮iT𝑮i]=𝑰d\displaystyle\bar{\beta}\frac{d}{nk}{\bm{I}}_{d}\mathbb{E}[{\bm{G}}_{i}^{T}{\bm{G}}_{i}]={\bm{I}}_{d}
β¯dnkkd𝑰d=𝑰d\displaystyle\bar{\beta}\frac{d}{nk}\frac{k}{d}{\bm{I}}_{d}={\bm{I}}_{d}
β¯=n\displaystyle\bar{\beta}=n

And the MSE is now

𝔼[𝐱¯𝐱^22]=𝔼[1ni=1n𝐱i1nβ¯dnk𝑰di=1n𝑬iT𝑬i𝐱i22]\displaystyle\mathbb{E}\Big{[}\|\bar{{\mathbf{x}}}-\hat{{\mathbf{x}}}\|_{2}^{2}\Big{]}=\mathbb{E}\Big{[}\|\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{i}-\frac{1}{n}\bar{\beta}\frac{d}{nk}{\bm{I}}_{d}\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}\Big{]}
=1n2{β¯2d2n2k2𝔼[i=1n𝑬iT𝑬i𝐱i22]i=1n𝐱i22}\displaystyle=\frac{1}{n^{2}}\Big{\{}\bar{\beta}^{2}\frac{d^{2}}{n^{2}k^{2}}\mathbb{E}\Big{[}\|\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}\Big{]}-\|\sum_{i=1}^{n}{\mathbf{x}}_{i}\|_{2}^{2}\Big{\}}
=1n2{n2d2n2k2(i=1n𝔼[𝑬iT𝑬i𝐱i22]+2i=1nl=i+1n𝔼[𝑬iT𝑬i𝐱i,𝑬lT𝑬l𝐱l])i=1n𝐱i22}\displaystyle=\frac{1}{n^{2}}\Big{\{}n^{2}\frac{d^{2}}{n^{2}k^{2}}\Big{(}\sum_{i=1}^{n}\mathbb{E}\Big{[}\|{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}\Big{]}+2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\mathbb{E}\Big{[}\langle{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i},{\bm{E}}_{l}^{T}{\bm{E}}_{l}{\mathbf{x}}_{l}\rangle\Big{]}\Big{)}-\|\sum_{i=1}^{n}{\mathbf{x}}_{i}\|_{2}^{2}\Big{\}}
=1n2{d2k2(i=1n𝔼[𝐱iT(𝑬iT𝑬i)2𝐱i]+2i=1nl=i+1nk2d2𝐱i,𝐱l)i=1n𝐱i22}\displaystyle=\frac{1}{n^{2}}\Big{\{}\frac{d^{2}}{k^{2}}\Big{(}\sum_{i=1}^{n}\mathbb{E}\Big{[}{\mathbf{x}}_{i}^{T}({\bm{E}}_{i}^{T}{\bm{E}}_{i})^{2}{\mathbf{x}}_{i}\Big{]}+2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\frac{k^{2}}{d^{2}}\langle{\mathbf{x}}_{i},{\mathbf{x}}_{l}\rangle\Big{)}-\|\sum_{i=1}^{n}{\mathbf{x}}_{i}\|_{2}^{2}\Big{\}}
=1n2{d2k2(i=1nkd𝐱i22+2i=1nl=i+1nk2d2𝐱i,𝐱l)i=1n𝐱i222i=1nl=i+1n𝐱i,𝐱l}\displaystyle=\frac{1}{n^{2}}\Big{\{}\frac{d^{2}}{k^{2}}\Big{(}\sum_{i=1}^{n}\frac{k}{d}\|{\mathbf{x}}_{i}\|_{2}^{2}+2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\frac{k^{2}}{d^{2}}\langle{\mathbf{x}}_{i},{\mathbf{x}}_{l}\rangle\Big{)}-\sum_{i=1}^{n}\|{\mathbf{x}}_{i}\|_{2}^{2}-2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\langle{\mathbf{x}}_{i},{\mathbf{x}}_{l}\rangle\Big{\}}
=1n2(dk1)i=1n𝐱i22\displaystyle=\frac{1}{n^{2}}(\frac{d}{k}-1)\sum_{i=1}^{n}\|{\mathbf{x}}_{i}\|_{2}^{2}

which is exactly the same MSE as Rand-kk. This implies that when nknk is large, the MSE of Rand-kk-Spatial is not improved over that of Rand-kk, even with correlation information. Intuitively, when nkdnk\gg d, the server already receives enough information from the clients and does not need the correlation to improve its estimator. Hence, we focus on the more interesting case when nk<dnk<d, that is, when the server does not have enough information from the clients, and thus wants to use additional information, i.e. the cross-client correlation, to improve its estimator.

Appendix B Additional Details on the Rand-Proj-Spatial Family Estimator

B.1 β¯\bar{\beta} is a scalar

From Eq. 20 in the proof of Theorem 4.3 and Eq. 25 in the proof of Theorem 4.4, it is evident that the unbiasedness of the mean estimator 𝐱^Rand-Proj-Spatial\widehat{{\mathbf{x}}}^{\text{Rand-Proj-Spatial}} is ensured collectively by the following three ingredients (see also the numerical sketch after this list):

  • The random sampling matrices {𝑬i}\{{\bm{E}}_{i}\}.

  • The orthogonality of scaled Hadamard matrices 𝑯T𝑯=d𝑰d=𝑯𝑯T{\bm{H}}^{T}{\bm{H}}=d{\bm{I}}_{d}={\bm{H}}{\bm{H}}^{T}.

  • The Rademacher diagonal matrices, with the property (𝑫i)2=𝑰d({\bm{D}}_{i})^{2}={\bm{I}}_{d}.
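
A minimal sketch of these three ingredients (toy sizes are ours), assuming the normalization 𝑮i=1d𝑬i𝑯𝑫i{\bm{G}}_{i}=\frac{1}{\sqrt{d}}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i} consistent with the proofs in Appendix C: it empirically checks that 𝔼[𝑮iT𝑮i]kd𝑰d\mathbb{E}[{\bm{G}}_{i}^{T}{\bm{G}}_{i}]\approx\frac{k}{d}{\bm{I}}_{d}, which is what the unbiasedness argument relies on.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)
d, k, trials = 32, 4, 20000
H = hadamard(d)                                  # H^T H = H H^T = d I_d

acc = np.zeros((d, d))
for _ in range(trials):
    D = rng.choice([-1.0, 1.0], size=d)          # Rademacher diagonal, D^2 = I_d
    rows = rng.choice(d, size=k, replace=False)  # E_i: subsample k rows
    G = (H * D)[rows] / np.sqrt(d)               # G_i = (1/sqrt(d)) E_i H D_i
    acc += G.T @ G
acc /= trials

print("max |E[G_i^T G_i] - (k/d) I_d| ~", np.abs(acc - (k / d) * np.eye(d)).max())
```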

B.2 Alternative motivating regression problems

Alternative motivating regression problem 1.

Let 𝑮ik×d{\bm{G}}_{i}\in\mathbb{R}^{k\times d} and 𝑾id×k{\bm{W}}_{i}\in\mathbb{R}^{d\times k} be the encoding and decoding matrices for client ii. One possible alternative estimator, which translates the intuition that each client’s decoded vector should be close to its original vector, is obtained by solving the following regression problem,

𝐱^=argmin𝑾f(𝑾)=𝔼[𝐱¯1ni=1n𝑾i𝑮i𝐱i22]\displaystyle\hat{{\mathbf{x}}}=\operatorname*{arg\,min}_{{\bm{W}}}f({\bm{W}})=\mathbb{E}[\|\bar{{\mathbf{x}}}-\frac{1}{n}\sum_{i=1}^{n}{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}]
subject to 𝐱¯=1ni=1n𝔼[𝑾i𝑮i𝐱i]\displaystyle\text{subject to }\bar{{\mathbf{x}}}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}] (15)

where 𝑾=(𝑾1,𝑾2,,𝑾n){\bm{W}}=({\bm{W}}_{1},{\bm{W}}_{2},\dots,{\bm{W}}_{n}) and the constraint enforces unbiasedness of the estimator. The estimator is then the solution of the above problem. However, we note that optimizing a decoding matrix 𝑾i{\bm{W}}_{i} for each client leads to performing individual decoding of each client’s compressed vector instead of a joint decoding process that considers all clients’ compressed vectors. Only a joint decoding process can achieve the goal of leveraging cross-client information to reduce the estimation error. Indeed, we show below that solving the optimization problem in Eq. 15 recovers the MSE of our baseline Rand-kk. Note

f(𝑾)=𝔼[1ni=1n(𝐱i𝑾i𝑮i𝐱i)22]=𝔼[1ni=1n(𝑰d𝑾i𝑮i)𝐱i22]\displaystyle f({\bm{W}})=\mathbb{E}[\|\frac{1}{n}\sum_{i=1}^{n}({\mathbf{x}}_{i}-{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i})\|_{2}^{2}]=\mathbb{E}[\|\frac{1}{n}\sum_{i=1}^{n}({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}\|_{2}^{2}]
=𝔼[1n2(i=1n(𝑰d𝑾i𝑮i)𝐱i22+ij(𝑰d𝑾i𝑮i)𝐱i,(𝑰d𝑾j𝑮j)𝐱j)]\displaystyle=\mathbb{E}\Big{[}\frac{1}{n^{2}}\Big{(}\sum_{i=1}^{n}\|({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}\|_{2}^{2}+\sum_{i\neq j}\Big{\langle}({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i},({\bm{I}}_{d}-{\bm{W}}_{j}{\bm{G}}_{j}){\mathbf{x}}_{j}\Big{\rangle}\Big{)}\Big{]}
=1n2(i=1n𝔼[(𝑰d𝑾i𝑮i)𝐱i22]+ij𝔼[(𝑰d𝑾i𝑮i)𝐱i,(𝑰d𝑾j𝑮j)𝐱j]).\displaystyle=\frac{1}{n^{2}}\Big{(}\sum_{i=1}^{n}\mathbb{E}\Big{[}\|({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}\|_{2}^{2}\Big{]}+\sum_{i\neq j}\mathbb{E}\Big{[}\Big{\langle}({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i},({\bm{I}}_{d}-{\bm{W}}_{j}{\bm{G}}_{j}){\mathbf{x}}_{j}\Big{\rangle}\Big{]}\Big{)}. (16)

By the constraint of unbiasedness, i.e., 𝐱¯=1ni=1n𝐱i=1ni=1n𝔼[𝑾i𝑮i𝐱i]\bar{{\mathbf{x}}}=\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{i}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}], we have

1ni=1n𝐱i1ni=1n𝔼[𝑾i𝑮i𝐱i]=01ni=1n𝔼[(𝑰d𝑾i𝑮i)𝐱i]=0.\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{i}-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}]=0\Leftrightarrow\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}]=0.

We now show that a sufficient and necessary condition to satisfy the above unbiasedness constraint is that for all i[n]i\in[n], 𝔼[𝑾i𝑮i]=𝑰d\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]={\bm{I}}_{d}.

Sufficiency. It is obvious that if for all i[n]i\in[n], 𝔼[𝑾i𝑮i]=𝑰d\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]={\bm{I}}_{d}, then we have 1n𝔼[(𝑰d𝑾i𝑮i)𝐱i]=0\frac{1}{n}\mathbb{E}[({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}]=0.

Necessity. Consider the special case that for some i[n]i\in[n] and λ[d]\lambda\in[d], 𝐱i=n𝐞λ{\mathbf{x}}_{i}=n{\mathbf{e}}_{\lambda}, where 𝐞λ{\mathbf{e}}_{\lambda} is the λ\lambda-th canonical basis vector, and 𝐱j=0{\mathbf{x}}_{j}=0 for all j[n]{i}j\in[n]\setminus\{i\}. Then,

𝐞λ=𝐱¯=1ni=1n𝔼[𝑾i𝑮i𝐱i]=1n𝔼[𝑾i𝑮i](n𝐞λ)=𝔼[𝑾i𝑮i]𝐞λ=[𝔼[𝑾i𝑮i]]λ,\displaystyle{\mathbf{e}}_{\lambda}=\bar{{\mathbf{x}}}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}]=\frac{1}{n}\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}](n{\mathbf{e}}_{\lambda})=\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]{\mathbf{e}}_{\lambda}=[\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]]_{\lambda},

where []λ[\cdot]_{\lambda} denotes the λ\lambda-th column of matrix 𝔼[𝑾i𝑮i]\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}].

Since our approach is agnostic to the choice of client vectors, the decoder matrices must satisfy this constraint for every such choice. By varying λ\lambda over [d][d], we see that we need 𝔼[𝑾i𝑮i]=𝑰d\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]={\bm{I}}_{d}, and by varying ii over [n][n], we see that we need 𝔼[𝑾j𝑮j]=𝑰d\mathbb{E}[{\bm{W}}_{j}{\bm{G}}_{j}]={\bm{I}}_{d} for all j[n]j\in[n].

Therefore, 𝐱¯=1ni=1n𝔼[𝑾i𝑮i𝐱i]i[n],𝔼[𝑾i𝑮i]=𝑰d\bar{{\mathbf{x}}}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}]\Leftrightarrow\forall i\in[n],\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]={\bm{I}}_{d}.

This implies the second term of f(𝑾)f({\bm{W}}) in Eq. 16 is 0, that is,

ij𝔼[(𝑰d𝑾i𝑮i)𝐱i,(𝑰d𝑾j𝑮j)𝐱j]=0.\displaystyle\sum_{i\neq j}\mathbb{E}\Big{[}\Big{\langle}({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i},({\bm{I}}_{d}-{\bm{W}}_{j}{\bm{G}}_{j}){\mathbf{x}}_{j}\Big{\rangle}\Big{]}=0.

Hence, we only need to solve

𝐱^=argmin𝑾f2(𝑾)=i=1n𝔼[(𝑰d𝑾i𝑮i)𝐱i22]\displaystyle\hat{{\mathbf{x}}}=\operatorname*{arg\,min}_{{\bm{W}}}f_{2}({\bm{W}})=\sum_{i=1}^{n}\mathbb{E}\Big{[}\|({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}\|_{2}^{2}\Big{]} (17)

Since each 𝑾i{\bm{W}}_{i} appears in f2(𝑾)f_{2}({\bm{W}}) separately, each 𝑾i{\bm{W}}_{i} can be optimized separately, via solving

min𝑾i𝔼[(𝑰d𝑾i𝑮i)𝐱i22] subject to 𝔼[𝑾i𝑮i]=𝑰d.\displaystyle\min_{{\bm{W}}_{i}}\mathbb{E}\Big{[}\|({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}\|_{2}^{2}\Big{]}\quad\text{ subject to }\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]={\bm{I}}_{d}.

One natural solution is to take 𝑾i=dk𝑮i{\bm{W}}_{i}=\frac{d}{k}{\bm{G}}_{i}^{{\dagger}}, i[n]\forall i\in[n]. For i[n]i\in[n], let 𝑮i=𝑽iΛi𝑼iT{\bm{G}}_{i}={\bm{V}}_{i}\Lambda_{i}{\bm{U}}_{i}^{T} be its SVD, where 𝑽ik×k{\bm{V}}_{i}\in\mathbb{R}^{k\times k} and 𝑼id×d{\bm{U}}_{i}\in\mathbb{R}^{d\times d} are orthogonal matrices and Λik×d\Lambda_{i}\in\mathbb{R}^{k\times d}. Then,

𝑾i𝑮i=dk𝑼iΛi𝑽iT𝑽iΛi𝑼iT=dk𝑼iΛiΛi𝑼iT=dk𝑼iΣi𝑼iT,\displaystyle{\bm{W}}_{i}{\bm{G}}_{i}=\frac{d}{k}{\bm{U}}_{i}\Lambda_{i}^{{\dagger}}{\bm{V}}_{i}^{T}{\bm{V}}_{i}\Lambda_{i}{\bm{U}}_{i}^{T}=\frac{d}{k}{\bm{U}}_{i}\Lambda_{i}^{{\dagger}}\Lambda_{i}{\bm{U}}_{i}^{T}=\frac{d}{k}{\bm{U}}_{i}\Sigma_{i}{\bm{U}}_{i}^{T},

where Σi\Sigma_{i} is a diagonal matrix with 0s and 1s on the diagonal.

For simplicity, we assume the random matrix 𝑼i{\bm{U}}_{i} follows a continuous distribution; the case where 𝑼i{\bm{U}}_{i} is discrete follows from a similar analysis. Let μ(𝑼i)\mu({\bm{U}}_{i}) be the measure of 𝑼i{\bm{U}}_{i}.

𝔼[𝑾i𝑮i]\displaystyle\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}] =dk𝔼[𝑼iΣi𝑼iT]=dk𝑼i𝔼[𝑼iΣi𝑼iT𝑼i]𝑑μ(𝑼i)\displaystyle=\frac{d}{k}\mathbb{E}[{\bm{U}}_{i}\Sigma_{i}{\bm{U}}_{i}^{T}]=\frac{d}{k}\int_{{\bm{U}}_{i}}\mathbb{E}[{\bm{U}}_{i}\Sigma_{i}{\bm{U}}_{i}^{T}\mid{\bm{U}}_{i}]\cdot d\mu({\bm{U}}_{i})
=dk𝑼i𝑼i𝔼[Σi𝑼i]𝑼iT𝑑μ(𝑼i)\displaystyle=\frac{d}{k}\int_{{\bm{U}}_{i}}{\bm{U}}_{i}\mathbb{E}[\Sigma_{i}\mid{\bm{U}}_{i}]{\bm{U}}_{i}^{T}\cdot d\mu({\bm{U}}_{i})
=dk𝑼i𝑼ikd𝑰d𝑼iT𝑑μ(𝑼i)\displaystyle=\frac{d}{k}\int_{{\bm{U}}_{i}}{\bm{U}}_{i}\frac{k}{d}{\bm{I}}_{d}{\bm{U}}_{i}^{T}\cdot d\mu({\bm{U}}_{i})
=dkkd𝑰d=𝑰d,\displaystyle=\frac{d}{k}\frac{k}{d}{\bm{I}}_{d}={\bm{I}}_{d},

which means the estimator 1ni=1ndk𝑮i𝑮i𝐱i\frac{1}{n}\sum_{i=1}^{n}\frac{d}{k}{\bm{G}}_{i}^{{\dagger}}{\bm{G}}_{i}{\mathbf{x}}_{i} is unbiased. The MSE is now

MSE\displaystyle MSE =𝔼[𝐱¯1ni=1n𝑾i𝑮i𝐱i22]=1n2i=1n𝔼[(𝑰d𝑾i𝑮i)𝐱i22]\displaystyle=\mathbb{E}\Big{[}\|\bar{{\mathbf{x}}}-\frac{1}{n}\sum_{i=1}^{n}{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}\Big{]}=\frac{1}{n^{2}}\sum_{i=1}^{n}\mathbb{E}\Big{[}\|({\bm{I}}_{d}-{\bm{W}}_{i}{\bm{G}}_{i}){\mathbf{x}}_{i}\|_{2}^{2}\Big{]}
=1n2i=1n(𝐱i22+𝔼[𝑾i𝑮i𝐱i22]2𝐱i,𝔼[𝑾i𝑮i]𝐱i)\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\Big{(}\|{\mathbf{x}}_{i}\|_{2}^{2}+\mathbb{E}[\|{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}]-2\langle{\mathbf{x}}_{i},\mathbb{E}[{\bm{W}}_{i}{\bm{G}}_{i}]{\mathbf{x}}_{i}\rangle\Big{)}
=1n2i=1n(𝐱i22+𝔼[𝑾i𝑮i𝐱i22]2𝐱i,𝐱i)\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\Big{(}\|{\mathbf{x}}_{i}\|_{2}^{2}+\mathbb{E}[\|{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}]-2\langle{\mathbf{x}}_{i},{\mathbf{x}}_{i}\rangle\Big{)}
=1n2i=1n(𝔼[𝑾i𝑮i𝐱i22𝐱i22])\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\Big{(}\mathbb{E}[\|{\bm{W}}_{i}{\bm{G}}_{i}{\mathbf{x}}_{i}\|_{2}^{2}-\|{\mathbf{x}}_{i}\|_{2}^{2}]\Big{)}
=1n2i=1n(𝐱iT𝔼[(𝑾i𝑮i)T(𝑾i𝑮i)]𝐱i𝐱i22).\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\Big{(}{\mathbf{x}}_{i}^{T}\mathbb{E}[({\bm{W}}_{i}{\bm{G}}_{i})^{T}({\bm{W}}_{i}{\bm{G}}_{i})]{\mathbf{x}}_{i}-\|{\mathbf{x}}_{i}\|_{2}^{2}\Big{)}.

Again, let 𝑮i=𝑽iΛi𝑼iT{\bm{G}}_{i}={\bm{V}}_{i}\Lambda_{i}{\bm{U}}_{i}^{T} be its SVD and consider 𝑾i𝑮i=dk𝑼iΣi𝑼iT{\bm{W}}_{i}{\bm{G}}_{i}=\frac{d}{k}{\bm{U}}_{i}\Sigma_{i}{\bm{U}}_{i}^{T}, where Σi\Sigma_{i} is a diagonal matrix with 0s and 1s. Then,

MSE\displaystyle MSE =1n2i=1n(𝐱iTd2k2𝔼[𝑼iΣi𝑼iT𝑼iΣi𝑼iT]𝐱i𝐱i22)\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\Big{(}{\mathbf{x}}_{i}^{T}\frac{d^{2}}{k^{2}}\mathbb{E}[{\bm{U}}_{i}\Sigma_{i}{\bm{U}}_{i}^{T}{\bm{U}}_{i}\Sigma_{i}{\bm{U}}_{i}^{T}]{\mathbf{x}}_{i}-\|{\mathbf{x}}_{i}\|_{2}^{2}\Big{)}
=1n2i=1n(d2k2𝐱iT𝔼[𝑼iΣi2𝑼iT]𝐱i𝐱i22).\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\Big{(}\frac{d^{2}}{k^{2}}{\mathbf{x}}_{i}^{T}\mathbb{E}[{\bm{U}}_{i}\Sigma_{i}^{2}{\bm{U}}_{i}^{T}]{\mathbf{x}}_{i}-\|{\mathbf{x}}_{i}\|_{2}^{2}\Big{)}.

Since 𝑮i{\bm{G}}_{i} has rank kk, Σi\Sigma_{i} is a diagonal matrix with kk out of dd entries being 1 and the rest being 0. Let μ(𝑼i)\mu({\bm{U}}_{i}) be the measure of 𝑼i{\bm{U}}_{i}. Hence, for i[n]i\in[n],

𝔼[𝑼iΣi2𝑼iT]\displaystyle\mathbb{E}[{\bm{U}}_{i}\Sigma_{i}^{2}{\bm{U}}_{i}^{T}] =𝑼i𝔼[𝑼iΣi2𝑼iT𝑼i]𝑑μ(𝑼i)\displaystyle=\int_{{\bm{U}}_{i}}\mathbb{E}[{\bm{U}}_{i}\Sigma_{i}^{2}{\bm{U}}_{i}^{T}\mid{\bm{U}}_{i}]d\mu({\bm{U}}_{i})
=𝑼i𝑼i𝔼[Σi2𝑼i]𝑼iT𝑑μ(𝑼i)\displaystyle=\int_{{\bm{U}}_{i}}{\bm{U}}_{i}\mathbb{E}[\Sigma_{i}^{2}\mid{\bm{U}}_{i}]{\bm{U}}_{i}^{T}d\mu({\bm{U}}_{i})
=𝑼ikd𝑼i𝑰d𝑼iT𝑑μ(𝑼i)\displaystyle=\int_{{\bm{U}}_{i}}\frac{k}{d}{\bm{U}}_{i}{\bm{I}}_{d}{\bm{U}}_{i}^{T}d\mu({\bm{U}}_{i})
=kd𝑼i𝑰d𝑑μ(𝑼i)\displaystyle=\frac{k}{d}\int_{{\bm{U}}_{i}}{\bm{I}}_{d}d\mu({\bm{U}}_{i})
=kd𝑰d.\displaystyle=\frac{k}{d}{\bm{I}}_{d}.

Therefore, the MSE of the estimator, which is the solution of the optimization problem in Eq. 15, is

MSE\displaystyle MSE =1n2i=1n(d2k2𝐱iTkd𝑰d𝐱i𝐱i22)=1n2(dk1)i=1n𝐱i22,\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{n}\Big{(}\frac{d^{2}}{k^{2}}{\mathbf{x}}_{i}^{T}\frac{k}{d}{\bm{I}}_{d}{\mathbf{x}}_{i}-\|{\mathbf{x}}_{i}\|_{2}^{2}\Big{)}=\frac{1}{n^{2}}(\frac{d}{k}-1)\sum_{i=1}^{n}\|{\mathbf{x}}_{i}\|_{2}^{2},

which is the same MSE as that of Rand-kk.
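
A quick numerical check of this conclusion (a sketch with toy sizes, using SRHT for 𝑮i{\bm{G}}_{i}; not the paper's code): decoding each client individually with 𝑾i=dk𝑮i{\bm{W}}_{i}=\frac{d}{k}{\bm{G}}_{i}^{{\dagger}} gives an empirical MSE matching the Rand-kk expression above.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n, d, k, trials = 4, 32, 4, 3000
H = hadamard(d)
X = rng.normal(size=(n, d))
x_bar = X.mean(axis=0)

err = 0.0
for _ in range(trials):
    est = np.zeros(d)
    for i in range(n):
        D = rng.choice([-1.0, 1.0], size=d)
        rows = rng.choice(d, size=k, replace=False)
        G = (H * D)[rows] / np.sqrt(d)                       # G_i (SRHT)
        est += (d / k) * np.linalg.pinv(G) @ (G @ X[i]) / n  # W_i G_i x_i with W_i = (d/k) G_i^+
    err += np.sum((est - x_bar) ** 2)

print("empirical MSE:", err / trials)
print("Rand-k MSE   :", (d / k - 1) / n**2 * np.sum(X ** 2))
```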

Alternative motivating regression problem 2.

Another motivating regression problem based on which we can design our estimator is

𝐱^=argmin𝐱1ni=1n𝐆i𝐱1ni=1n𝐆i𝐱i22\displaystyle\widehat{\mathbf{x}}=\operatorname*{arg\,min}_{\mathbf{x}}\|\frac{1}{n}\sum_{i=1}^{n}\mathbf{G}_{i}\mathbf{x}-\frac{1}{n}\sum_{i=1}^{n}\mathbf{G}_{i}\mathbf{x}_{i}\|_{2}^{2} (18)

Note that 𝑮ik×d,i[n]{\bm{G}}_{i}\in\mathbb{R}^{k\times d},\forall i\in[n], and so the solution to the above problem is

𝐱^(solution)=(1ni=1n𝑮i)(1ni=1n𝑮i𝐱i),\displaystyle\widehat{{\mathbf{x}}}^{\text{(solution)}}=\Big{(}\frac{1}{n}\sum_{i=1}^{n}{\bm{G}}_{i}\Big{)}^{{\dagger}}\Big{(}\frac{1}{n}\sum_{i=1}^{n}{\bm{G}}_{i}{\mathbf{x}}_{i}\Big{)},

and to ensure unbiasedness of the estimator, we can introduce a scalar β¯\bar{\beta}\in\mathbb{R} and define the estimator as

𝐱^(estimator)=β¯(1ni=1n𝑮i)(1ni=1n𝑮i𝐱i).\displaystyle\widehat{{\mathbf{x}}}^{\text{(estimator)}}=\bar{\beta}\Big{(}\frac{1}{n}\sum_{i=1}^{n}{\bm{G}}_{i}\Big{)}^{{\dagger}}\Big{(}\frac{1}{n}\sum_{i=1}^{n}{\bm{G}}_{i}{\mathbf{x}}_{i}\Big{)}.

It is not hard to see that this estimator does not lead to an MSE as low as that of Rand-Proj-Spatial. For example, consider the full correlation case, i.e., 𝐱i=𝐱,i[n]{\mathbf{x}}_{i}={\mathbf{x}},\forall i\in[n]; the estimator is then

𝐱^(estimator)=β¯(1ni=1n𝑮i)(1ni=1n𝑮i)𝐱.\displaystyle\widehat{{\mathbf{x}}}^{\text{(estimator)}}=\bar{\beta}\Big{(}\frac{1}{n}\sum_{i=1}^{n}{\bm{G}}_{i}\Big{)}^{{\dagger}}\Big{(}\frac{1}{n}\sum_{i=1}^{n}{\bm{G}}_{i}\Big{)}{\mathbf{x}}.

Note that rank(1ni=1n𝑮i)\text{rank}(\frac{1}{n}\sum_{i=1}^{n}{\bm{G}}_{i}) is at most kk, since 𝑮ik×d{\bm{G}}_{i}\in\mathbb{R}^{k\times d}, i[n]\forall i\in[n]. This limits the amount of information of 𝐱{\mathbf{x}} the server can recover.

In contrast, recall that in this case, the Rand-Proj-Spatial estimator is

𝐱^(Rand-Proj-Spatial)=β¯(i=1n𝑮iT𝑮i)i=1n𝑮iT𝑮i𝐱=β¯𝑺𝑺𝐱,\displaystyle\widehat{{\mathbf{x}}}^{(\text{Rand-Proj-Spatial})}=\bar{\beta}\Big{(}\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}\Big{)}^{{\dagger}}\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}{\mathbf{x}}=\bar{\beta}{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}},

where 𝑺{\bm{S}} can have rank at most nknk.

B.3 Why deriving the MSE of Rand-Proj-Spatial with SRHT is hard

To analyze Eq. 11, one needs to compute the distribution of the eigendecomposition of 𝑺=i=1n𝑮iT𝑮i{\bm{S}}=\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}, i.e. the sum of the covariance matrices of the SRHT matrices. To the best of our knowledge, there is no non-trivial closed-form expression for the distribution of the eigendecomposition of even a single 𝑮iT𝑮i{\bm{G}}_{i}^{T}{\bm{G}}_{i}, when 𝑮i{\bm{G}}_{i} is an SRHT matrix or another commonly used random matrix, e.g. Gaussian. When 𝑮i{\bm{G}}_{i} is SRHT, since 𝑮iT𝑮i=𝑫i𝑯𝑬iT𝑬i𝑯𝑫i{\bm{G}}_{i}^{T}{\bm{G}}_{i}={\bm{D}}_{i}{\bm{H}}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i} and the eigenvalues of 𝑬iT𝑬i{\bm{E}}_{i}^{T}{\bm{E}}_{i} are simply its diagonal entries, one might attempt to analyze 𝑯𝑫i{\bm{H}}{\bm{D}}_{i}. While the eigenvalues and eigenvectors of the Hadamard matrix 𝑯{\bm{H}} are known (see, e.g., the note at https://core.ac.uk/download/pdf/81967428.pdf), this result can hardly be applied to analyze the distribution of the singular values or singular vectors of 𝑯𝑫i{\bm{H}}{\bm{D}}_{i}.

Even if one knows the eigendecomposition of a single 𝑮iT𝑮i{\bm{G}}_{i}^{T}{\bm{G}}_{i}, it is still hard to get the eigendecomposition of 𝑺{\bm{S}}. The eigenvalues of a matrix 𝑨{\bm{A}} can be viewed as a non-linear function of 𝑨{\bm{A}}, and hence it is in general hard to derive closed-form expressions for the eigenvalues of 𝑨+𝑩{\bm{A}}+{\bm{B}} given the eigenvalues of 𝑨{\bm{A}} and those of 𝑩{\bm{B}}. One exception is when 𝑨{\bm{A}} and 𝑩{\bm{B}} share the same eigenvectors, in which case the eigenvalues of 𝑨+𝑩{\bm{A}}+{\bm{B}} are sums of the corresponding eigenvalues of 𝑨{\bm{A}} and 𝑩{\bm{B}}. Recall that when 𝑮i=𝑬i{\bm{G}}_{i}={\bm{E}}_{i}, Rand-Proj-Spatial recovers Rand-kk-Spatial. Since the 𝑬iT𝑬i{\bm{E}}_{i}^{T}{\bm{E}}_{i}’s all have the same eigenvectors (i.e. the same as 𝑰d{\bm{I}}_{d}), the eigenvalues of 𝑺=i=1n𝑬iT𝑬i{\bm{S}}=\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i} are just the sums of the diagonal entries of the 𝑬iT𝑬i{\bm{E}}_{i}^{T}{\bm{E}}_{i}’s. Hence, deriving the MSE of Rand-kk-Spatial is much easier than in the more general case, where the 𝑮iT𝑮i{\bm{G}}_{i}^{T}{\bm{G}}_{i}’s can have different eigenvectors.
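
A tiny illustration of this point (a hypothetical example, not tied to SRHT): the eigenvalues of 𝑨+𝑩{\bm{A}}+{\bm{B}} are the sums of the corresponding eigenvalues when 𝑨{\bm{A}} and 𝑩{\bm{B}} share eigenvectors (e.g. both diagonal), but not in general.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5

# Same eigenvectors: two diagonal matrices.
A = np.diag(rng.uniform(0.0, 1.0, d))
B = np.diag(rng.uniform(0.0, 1.0, d))
print(np.allclose(np.linalg.eigvalsh(A + B),
                  np.sort(np.diag(A) + np.diag(B))))   # True: eigenvalues simply add

# Different eigenvectors: two random rank-one PSD matrices.
u, v = rng.normal(size=d), rng.normal(size=d)
A2, B2 = np.outer(u, u), np.outer(v, v)
naive_sum = np.sort(np.linalg.eigvalsh(A2)) + np.sort(np.linalg.eigvalsh(B2))
print(np.allclose(np.linalg.eigvalsh(A2 + B2), naive_sum))  # False (generically)
```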

Since one can also view 𝑺=i=1nk𝐠i𝐠iT{\bm{S}}=\sum_{i=1}^{nk}{\mathbf{g}}_{i}{\mathbf{g}}_{i}^{T}, i.e. the sum of nknk rank-one matrices, one might attempt to recursively analyze the eigendecomposition of i=1n𝐠i𝐠iT+𝐠n+1𝐠n+1T\sum_{i=1}^{n^{\prime}}{\mathbf{g}}_{i}{\mathbf{g}}_{i}^{T}+{\mathbf{g}}_{n^{\prime}+1}{\mathbf{g}}_{n^{\prime}+1}^{T} for n<nkn^{\prime}<nk. One related problem is the eigendecomposition of a low-rank updated matrix in perturbation analysis: given the eigendecomposition of a matrix 𝑨{\bm{A}}, what is the eigendecomposition of 𝑨+𝑽𝑽T{\bm{A}}+{\bm{V}}{\bm{V}}^{T}, where 𝑽{\bm{V}} is a low-rank (most commonly rank-one) matrix? To compute the eigenvalues of 𝑨+𝑽𝑽T{\bm{A}}+{\bm{V}}{\bm{V}}^{T} directly from those of 𝑨{\bm{A}}, the most effective and widely applied solution is to solve the so-called secular equation, e.g. [59, 60, 61]. While this can be done computationally efficiently, it is hard to get a closed-form expression for the eigenvalues of 𝑨+𝑽𝑽T{\bm{A}}+{\bm{V}}{\bm{V}}^{T} from the secular equation.

Previous analyses of SRHT, e.g. [37, 38, 39, 45, 55], are based on asymptotic properties of SRHT, such as the limiting eigen-spectrum, or on concentration bounds on the singular values. To analyze the MSE of Rand-Proj-Spatial, however, we need an exact, non-asymptotic analysis of the distribution of SRHT. Concentration bounds do not apply here, since computing the pseudo-inverse in Eq. 5 already bounds the eigenvalues, and applying concentration bounds would only lead to a loose upper bound on the MSE.

B.4 More simulation results on incorporating various degrees of correlation

Figure 6: MSE comparison of estimators Rand-kk, Rand-kk-Spatial(Opt), Rand-Proj-Spatial, given the degree of correlation {\mathcal{R}}. Rand-kk-Spatial(Opt) denotes the estimator that gives the lowest possible MSE from the Rand-kk-Spatial family. We consider d=1024d=1024, a smaller number of clients n{5,11}n\in\{5,11\}, and kk values such that nk<dnk<d. In each plot, we fix n,k,dn,k,d and vary the degree of positive correlation {\mathcal{R}}. Note the range of {\mathcal{R}} is [0,n1]{\mathcal{R}}\in[0,n-1]. We choose values of {\mathcal{R}} equally spaced in this range.

Appendix C All Proof Details

C.1 Proof of Theorem 4.3

Theorem 4.3 (MSE under Full Correlation).

Consider nn clients, each holding the same vector 𝐱d{\mathbf{x}}\in\mathbb{R}^{d}. Suppose we set T(λ)=λT(\lambda)=\lambda, β¯=dk\bar{\beta}=\frac{d}{k} in Eq. 5, and the random linear map 𝐆i{\bm{G}}_{i} at each client to be an SRHT matrix. Let δ\delta be the probability that 𝐒=i=1n𝐆iT𝐆i{\bm{S}}=\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i} does not have full rank. Then, for nkdnk\leq d,

𝔼[𝐱^(Rand-Proj-Spatial(Max))𝐱¯22][d(1δ)nk+δk1]𝐱22\displaystyle\mathbb{E}\Big{[}\|\widehat{{\mathbf{x}}}^{(\text{Rand-Proj-Spatial}\text{(Max)})}-\bar{{\mathbf{x}}}\|_{2}^{2}\Big{]}\leq\Big{[}\frac{d}{(1-\delta)nk+\delta k}-1\Big{]}\|{\mathbf{x}}\|_{2}^{2} (19)
Proof.

All clients have the same vector 𝐱1=𝐱2==𝐱n=𝐱d{\mathbf{x}}_{1}={\mathbf{x}}_{2}=\dots={\mathbf{x}}_{n}={\mathbf{x}}\in\mathbb{R}^{d}. Hence, 𝐱¯=1ni=1n𝐱i=𝐱\bar{{\mathbf{x}}}=\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{i}={\mathbf{x}}, and the decoding scheme is

𝐱^(Rand-Proj-Spatial(Max))=β¯(i=1n𝑮iT𝑮i)i=1n𝑮iT𝑮i𝐱=β¯𝑺𝑺𝐱,\displaystyle\widehat{{\mathbf{x}}}^{(\text{Rand-Proj-Spatial}\text{(Max)})}=\bar{\beta}\Big{(}\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}\Big{)}^{{\dagger}}\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}{\mathbf{x}}=\bar{\beta}{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}},

where 𝑺=i=1n𝑮iT𝑮i{\bm{S}}=\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}. Let 𝑺=𝑼Λ𝑼T{\bm{S}}={\bm{U}}\Lambda{\bm{U}}^{T} be its eigendecomposition. Since 𝑺{\bm{S}} is a real symmetric matrix, 𝑼{\bm{U}} is orthogonal, i.e., 𝑼T𝑼=𝑰d=𝑼𝑼T{\bm{U}}^{T}{\bm{U}}={\bm{I}}_{d}={\bm{U}}{\bm{U}}^{T}. Also, 𝑺=𝑼Λ𝑼T{\bm{S}}^{{\dagger}}={\bm{U}}\Lambda^{\dagger}{\bm{U}}^{T}, where Λ\Lambda^{\dagger} is a diagonal matrix, such that

[Λ]ii={1/[Λ]ii if Λii0,0 else.\displaystyle[\Lambda^{\dagger}]_{ii}=\begin{cases}1/[\Lambda]_{ii}&\text{ if }\Lambda_{ii}\neq 0,\\ 0&\text{ else.}\end{cases}

Let δc\delta_{c} be the probability that 𝑺{\bm{S}} has rank cc, for c{k,k+1,,nk1}c\in\{k,k+1,\dots,nk-1\}. Note that δ=c=knk1δc\delta=\sum_{c=k}^{nk-1}\delta_{c}. For vector 𝐦d{\mathbf{m}}\in\mathbb{R}^{d}, we use diag(𝐦)d×d\text{diag}({\mathbf{m}})\in\mathbb{R}^{d\times d} to denote the matrix whose diagonal entries correspond to the coordinates of 𝐦{\mathbf{m}} and the rest of the entries are zeros.

Computing β¯\bar{\beta}. First, we compute β¯\bar{\beta}. To ensure that our estimator 𝐱^(Rand-Proj-Spatial(Max))\widehat{{\mathbf{x}}}^{(\text{Rand-Proj-Spatial}\text{(Max)})} is unbiased, we need β¯𝔼[𝑺𝑺𝐱]=𝐱\bar{\beta}\mathbb{E}[{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}}]={\mathbf{x}}. Consequently,

𝐱\displaystyle{\mathbf{x}} =β¯𝔼[𝑼Λ𝑼T𝑼Λ𝑼T]𝐱\displaystyle=\bar{\beta}\mathbb{E}[{\bm{U}}\Lambda^{{\dagger}}{\bm{U}}^{T}{\bm{U}}\Lambda{\bm{U}}^{T}]{\mathbf{x}}
=β¯[𝑼=ΦPr[𝑼=Φ]𝔼[𝑼ΛΛ𝑼T𝑼=Φ]]𝐱\displaystyle=\bar{\beta}\left[\sum_{{\bm{U}}=\Phi}\Pr[{\bm{U}}=\Phi]\mathbb{E}[{\bm{U}}\Lambda^{{\dagger}}\Lambda{\bm{U}}^{T}\mid{\bm{U}}=\Phi]\right]{\mathbf{x}}
=β¯[𝑼=ΦPr[𝑼=Φ]𝑼𝔼[ΛΛ𝑼=Φ]𝑼T]𝐱\displaystyle=\bar{\beta}\left[\sum_{{\bm{U}}=\Phi}\Pr[{\bm{U}}=\Phi]{\bm{U}}\mathbb{E}[\Lambda^{{\dagger}}\Lambda\mid{\bm{U}}=\Phi]{\bm{U}}^{T}\right]{\mathbf{x}}
=(a)β¯[𝑼=ΦPr[𝑼=Φ]𝑼𝔼[diag(𝐦)𝑼=Φ]𝑼T]𝐱\displaystyle\overset{(a)}{=}\bar{\beta}\left[\sum_{{\bm{U}}=\Phi}\Pr[{\bm{U}}=\Phi]{\bm{U}}\mathbb{E}[\text{diag}(\mathbf{m})\mid{\bm{U}}=\Phi]{\bm{U}}^{T}\right]{\mathbf{x}}
=(b)β¯𝑼=ΦPr[𝑼=Φ][𝑼((1δ)nkd𝑰d+c=knk1δccd𝑰d)𝑼T]𝐱\displaystyle\overset{(b)}{=}\bar{\beta}\sum_{{\bm{U}}=\Phi}\Pr[{\bm{U}}=\Phi]\left[{\bm{U}}\Big{(}(1-\delta)\frac{nk}{d}{\bm{I}}_{d}+\sum_{c=k}^{nk-1}\delta_{c}\frac{c}{d}{\bm{I}}_{d}\Big{)}{\bm{U}}^{T}\right]{\mathbf{x}}
=β¯[(1δ)nkd+c=knk1δccd]𝐱\displaystyle=\bar{\beta}\Big{[}(1-\delta)\frac{nk}{d}+\sum_{c=k}^{nk-1}\delta_{c}\frac{c}{d}\Big{]}{\mathbf{x}}
β¯\displaystyle\Rightarrow\bar{\beta} =d(1δ)nk+c=knk1δcc\displaystyle=\frac{d}{(1-\delta)nk+\sum_{c=k}^{nk-1}\delta_{c}c} (20)

where in (a)(a), 𝐦d\mathbf{m}\in\mathbb{R}^{d} such that

𝐦j={1if Λjj>00else.\displaystyle\mathbf{m}_{j}=\begin{cases}1&\text{if }\Lambda_{jj}>0\\ 0&\text{else.}\end{cases}

Also, by construction of 𝑺{\bm{S}}, rank(diag(𝐦))nk\text{rank}(\text{diag}(\mathbf{m}))\leq nk. Further, (b)(b) follows by symmetry across the dd dimensions.

Since δkc=knk1δccδ(nk1)\delta k\leq\sum_{c=k}^{nk-1}\delta_{c}c\leq\delta(nk-1), we have

d(1δ)nk+δ(nk1)β¯d(1δ)nk+δk\displaystyle\frac{d}{(1-\delta)nk+\delta(nk-1)}\leq\bar{\beta}\leq\frac{d}{(1-\delta)nk+\delta k} (21)

Computing the MSE. Next, we use the value of β¯\bar{\beta} in Eq. 20 to compute MSE.

MSE(Rand-Proj-Spatial(Max))=𝔼[𝐱^(Rand-Proj-Spatial(Max))𝐱¯22]=𝔼[β¯𝑺𝑺𝐱𝐱22]\displaystyle MSE(\text{Rand-Proj-Spatial}\text{(Max)})=\mathbb{E}[\|\widehat{{\mathbf{x}}}^{(\text{Rand-Proj-Spatial}\text{(Max)})}-\bar{{\mathbf{x}}}\|_{2}^{2}]=\mathbb{E}[\|\bar{\beta}{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}}-{\mathbf{x}}\|_{2}^{2}]
=β¯2𝔼[𝑺𝑺𝐱22]+𝐱222β¯𝔼[𝑺𝑺𝐱],𝐱\displaystyle=\bar{\beta}^{2}\mathbb{E}[\|{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}}\|_{2}^{2}]+\|{\mathbf{x}}\|_{2}^{2}-2\Big{\langle}\bar{\beta}\mathbb{E}[{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}}],{\mathbf{x}}\Big{\rangle}
=β¯2𝔼[𝑺𝑺𝐱22]𝐱22\displaystyle=\bar{\beta}^{2}\mathbb{E}[\|{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}}\|_{2}^{2}]-\|{\mathbf{x}}\|_{2}^{2} (Using unbiasedness of 𝐱^(Rand-Proj-Spatial(Max))\widehat{{\mathbf{x}}}^{(\text{Rand-Proj-Spatial}\text{(Max)})})
=β¯2𝐱T𝔼[𝑺T(𝑺)T𝑺𝑺]𝐱𝐱22.\displaystyle=\bar{\beta}^{2}{\mathbf{x}}^{T}\mathbb{E}[{\bm{S}}^{T}({\bm{S}}^{{\dagger}})^{T}{\bm{S}}^{{\dagger}}{\bm{S}}]{\mathbf{x}}-\|{\mathbf{x}}\|_{2}^{2}. (22)

Using 𝑺=𝑼Λ𝑼T{\bm{S}}^{{\dagger}}={\bm{U}}\Lambda^{{\dagger}}{\bm{U}}^{T},

𝔼[𝑺T(𝑺)T𝑺𝑺]\displaystyle\mathbb{E}[{\bm{S}}^{T}({\bm{S}}^{{\dagger}})^{T}{\bm{S}}^{{\dagger}}{\bm{S}}] =𝔼[𝑼Λ𝑼T𝑼Λ𝑼T𝑼Λ𝑼T𝑼Λ𝑼T]\displaystyle=\mathbb{E}[{\bm{U}}\Lambda{\bm{U}}^{T}{\bm{U}}\Lambda^{{\dagger}}{\bm{U}}^{T}{\bm{U}}\Lambda^{{\dagger}}{\bm{U}}^{T}{\bm{U}}\Lambda{\bm{U}}^{T}]
=𝔼[𝑼Λ(Λ)2Λ𝑼T]\displaystyle=\mathbb{E}[{\bm{U}}\Lambda(\Lambda^{{\dagger}})^{2}\Lambda{\bm{U}}^{T}]
=𝑼=Φ𝑼𝔼[Λ(Λ)2Λ]𝑼TPr[𝑼=Φ]\displaystyle=\sum_{{\bm{U}}=\Phi}{\bm{U}}\mathbb{E}[\Lambda(\Lambda^{{\dagger}})^{2}\Lambda]{\bm{U}}^{T}\cdot\Pr[{\bm{U}}=\Phi]
=𝑼=Φ𝑼[(1δ)nkd𝑰d+c=knk1δccd𝑰d]𝑼TPr[𝑼=Φ]\displaystyle=\sum_{{\bm{U}}=\Phi}{\bm{U}}\Big{[}(1-\delta)\frac{nk}{d}{\bm{I}}_{d}+\sum_{c=k}^{nk-1}\delta_{c}\frac{c}{d}{\bm{I}}_{d}\Big{]}{\bm{U}}^{T}\cdot\Pr[{\bm{U}}=\Phi]
=[(1δ)nkd+c=knk1δccd]𝑼=Φ𝑼𝑼TPr[𝑼=Φ]\displaystyle=\Big{[}(1-\delta)\frac{nk}{d}+\sum_{c=k}^{nk-1}\delta_{c}\frac{c}{d}\Big{]}\cdot\sum_{{\bm{U}}=\Phi}{\bm{U}}{\bm{U}}^{T}\cdot\Pr[{\bm{U}}=\Phi]
=[(1δ)nkd+c=knk1δccd]𝑰d\displaystyle=\Big{[}(1-\delta)\frac{nk}{d}+\sum_{c=k}^{nk-1}\delta_{c}\frac{c}{d}\Big{]}{\bm{I}}_{d}
=1β¯𝑰d\displaystyle=\frac{1}{\bar{\beta}}{\bm{I}}_{d} (23)

Substituting Eq. 23 in Eq. 22, we get

MSE(Rand-Proj-Spatial(Max))\displaystyle MSE(\text{Rand-Proj-Spatial}\text{(Max)}) =β¯2𝐱T1β¯𝑰d𝐱𝐱22=(β¯1)𝐱22\displaystyle=\bar{\beta}^{2}{\mathbf{x}}^{T}\frac{1}{\bar{\beta}}{\bm{I}}_{d}{\mathbf{x}}-\|{\mathbf{x}}\|_{2}^{2}=(\bar{\beta}-1)\|{\mathbf{x}}\|_{2}^{2}
[d(1δ)nk+δk1]𝐱22,\displaystyle\leq\Big{[}\frac{d}{(1-\delta)nk+\delta k}-1\Big{]}\|{\mathbf{x}}\|_{2}^{2},

where the inequality is by Eq. 21. ∎
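
A small simulation (toy sizes assumed; it uses the normalization 𝑮i=1d𝑬i𝑯𝑫i{\bm{G}}_{i}=\frac{1}{\sqrt{d}}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i} from the proofs and β¯=dnk\bar{\beta}=\frac{d}{nk}, i.e. δ0\delta\approx 0) checking the full-correlation MSE above: the empirical error of β¯𝑺𝑺𝐱\bar{\beta}{\bm{S}}^{{\dagger}}{\bm{S}}{\mathbf{x}} should be close to (dnk1)𝐱22(\frac{d}{nk}-1)\|{\mathbf{x}}\|_{2}^{2}.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(4)
n, d, k, trials = 4, 32, 2, 4000      # nk = 8 <= d
H = hadamard(d)
x = rng.normal(size=d)                # the common client vector

err = 0.0
for _ in range(trials):
    S = np.zeros((d, d))
    for _ in range(n):
        D = rng.choice([-1.0, 1.0], size=d)
        rows = rng.choice(d, size=k, replace=False)
        G = (H * D)[rows] / np.sqrt(d)                    # G_i = (1/sqrt(d)) E_i H D_i
        S += G.T @ G
    est = (d / (n * k)) * np.linalg.pinv(S) @ (S @ x)     # beta * S^+ S x with beta = d/(nk)
    err += np.sum((est - x) ** 2)

print("empirical MSE        :", err / trials)
print("(d/(nk) - 1) ||x||^2 :", (d / (n * k) - 1) * np.sum(x ** 2))
```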

C.2 Comparing against Rand-kk

Next, we compare the MSE of Rand-Proj-Spatial(Max) with the MSE of the baseline Rand-kk analytically in the full-correlation case. Recall that in this case,

MSE(Rand-k)=1n(dk1)𝐱22.\displaystyle MSE(\text{Rand-$k$})=\frac{1}{n}(\frac{d}{k}-1)\|{\mathbf{x}}\|_{2}^{2}.

We have

MSE(Rand-Proj-Spatial(Max))MSE(Rand-k)\displaystyle MSE(\text{Rand-Proj-Spatial}\text{(Max)})\leq MSE(\text{Rand-$k$})
d(1δ)nk+δk11n(dk1)\displaystyle\Leftrightarrow\frac{d}{(1-\delta)nk+\delta k}-1\leq\frac{1}{n}(\frac{d}{k}-1)
dkn(1δ)nδn((1δ)n+δ)11n\displaystyle\Leftrightarrow\frac{d}{k}\frac{n-(1-\delta)n-\delta}{n((1-\delta)n+\delta)}\leq 1-\frac{1}{n}
dkδδ/n(1δ)n+δn1n\displaystyle\Leftrightarrow\frac{d}{k}\cdot\frac{\delta-\delta/n}{(1-\delta)n+\delta}\leq\frac{n-1}{n}
dδ(11n)nk(n1)((1δ)n+δ)\displaystyle\Leftrightarrow d\delta(1-\frac{1}{n})n\leq k(n-1)\cdot((1-\delta)n+\delta)
dδk((1δ)n+δ)\displaystyle\Leftrightarrow d\delta\leq k\cdot((1-\delta)n+\delta)
dδ+knδkδkn\displaystyle\Leftrightarrow d\delta+kn\delta-k\delta\leq kn
δknd+knk\displaystyle\Leftrightarrow\delta\leq\frac{kn}{d+kn-k}
δ1dkn+11n\displaystyle\Leftrightarrow\delta\leq\frac{1}{\frac{d}{kn}+1-\frac{1}{n}}

That is, Rand-Proj-Spatial(Max) has MSE no larger than that of Rand-kk whenever \delta\leq\frac{nk}{d+nk-k}. Since nk\leq d and n\geq 2, this threshold is at most \frac{1}{1+1/2}=\frac{2}{3} (attained when nk=d and n=2), and it shrinks as d/(nk) grows. As we verify empirically in Appendix C.3, \delta\approx 0 in practice, so this condition is comfortably satisfied, and the MSE of Rand-Proj-Spatial(Max) is less than that of Rand-kk.

C.3 𝑺{\bm{S}} has full rank with high probability

We empirically verify that δ0\delta\approx 0. With d{32,64,128,,1024}d\in\{32,64,128,\dots,1024\} and 4 different nknk values such that nkdnk\leq d for each dd, we compute rank(𝑺)\text{rank}({\bm{S}}) over 10510^{5} trials for each pair of (nk,d)(nk,d) values, and plot the results for all trials. All results are presented in Figure 7. As one can observe from the plots, rank(𝑺)=nk\text{rank}({\bm{S}})=nk with high probability, suggesting δ0\delta\approx 0.

This implies the MSE of Rand-Proj-Spatial(Max) is

MSE(Rand-Proj-Spatial(Max))(dnk1)𝐱22,\displaystyle MSE(\text{Rand-Proj-Spatial}\text{(Max)})\approx(\frac{d}{nk}-1)\|{\mathbf{x}}\|_{2}^{2},

in the full correlation case.
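
A short sketch of the rank check described above (smaller dd and fewer trials than in Figure 7, for speed; not the original experiment code): draw 𝑺=i𝑮iT𝑮i{\bm{S}}=\sum_{i}{\bm{G}}_{i}^{T}{\bm{G}}_{i} with SRHT 𝑮i{\bm{G}}_{i} and count how often rank(𝑺)<nk\text{rank}({\bm{S}})<nk.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(5)
d, n, k, trials = 64, 8, 4, 2000      # nk = 32 <= d
H = hadamard(d)

deficient = 0
for _ in range(trials):
    S = np.zeros((d, d))
    for _ in range(n):
        D = rng.choice([-1.0, 1.0], size=d)
        rows = rng.choice(d, size=k, replace=False)
        G = (H * D)[rows] / np.sqrt(d)
        S += G.T @ G
    if np.linalg.matrix_rank(S) < n * k:
        deficient += 1

print("estimated delta = Pr[rank(S) < nk] ~", deficient / trials)
```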

Figure 7: Simulation results of rank(𝑺{\bm{S}}), where 𝑺=i=1n𝑮iT𝑮i{\bm{S}}=\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i}, with 𝑮i{\bm{G}}_{i} being SRHT. With d{32,64,128,,1024}d\in\{32,64,128,\dots,1024\} and 4 different nknk values such that nkdnk\leq d for each dd, we compute rank(S) over 10510^{5} trials for each pair of (nk,d)(nk,d) values and plot the results for all trials. When d=32d=32 and nk=32nk=32 in the first plot, rank(𝑺)=31\text{rank}({\bm{S}})=31 in 21002100 trials, and rank(𝑺)=nk=32\text{rank}({\bm{S}})=nk=32 in all the rest of the trials. For all other (nk,d)(nk,d) pairs, 𝑺{\bm{S}} always has rank nknk in the 10510^{5} trials. This verifies that δ=Pr[rank(𝑺)<nk]0\delta=\Pr[\text{rank}({\bm{S}})<nk]\approx 0.

C.4 Proof of Theorem 4.4

Theorem 4.4 (MSE under No Correlation).

Consider nn clients, each holding a vector 𝐱id{\mathbf{x}}_{i}\in\mathbb{R}^{d}, i[n]\forall i\in[n]. Suppose we set T1T\equiv 1, β¯=d2k\bar{\beta}=\frac{d^{2}}{k} in Eq. 5, and the random linear map 𝐆i{\bm{G}}_{i} at each client to be an SRHT matrix. Then, for nkdnk\leq d,

𝔼[𝐱^(Rand-Proj-Spatial)𝐱¯22]=1n2(dk1)i=1n𝐱i22.\displaystyle\mathbb{E}\Big{[}\|\widehat{{\mathbf{x}}}^{(\text{Rand-Proj-Spatial})}-\bar{{\mathbf{x}}}\|_{2}^{2}\Big{]}=\frac{1}{n^{2}}\Big{(}\frac{d}{k}-1\Big{)}\sum_{i=1}^{n}\|{\mathbf{x}}_{i}\|_{2}^{2}.
Proof.

When the client vectors are all orthogonal to each other, we define the transformation function on the eigenvalues to be T(λ)=1,λ0T(\lambda)=1,\forall\lambda\geq 0. We show that with this constant TT, Rand-Proj-Spatial with SRHT attains the same MSE as Rand-kk. Recall 𝑺=i=1n𝑮iT𝑮i{\bm{S}}=\sum_{i=1}^{n}{\bm{G}}_{i}^{T}{\bm{G}}_{i} and let 𝑺=𝑼Λ𝑼T{\bm{S}}={\bm{U}}\Lambda{\bm{U}}^{T} be its eigendecomposition. Then,

T(𝑺)=𝑼T(Λ)𝑼T=𝑼𝑰d𝑼T=𝑰d.\displaystyle T({\bm{S}})={\bm{U}}T(\Lambda){\bm{U}}^{T}={\bm{U}}{\bm{I}}_{d}{\bm{U}}^{T}={\bm{I}}_{d}.

Hence, (T(𝑺))=𝑰d\left(T({\bm{S}})\right)^{{\dagger}}={\bm{I}}_{d}. And the decoded vector for client ii becomes

𝐱^i=β¯(T(𝑺))𝑮iT𝑮i𝐱i=β¯𝑮iT𝑮i𝐱i=β¯1d𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i,\displaystyle\widehat{{\mathbf{x}}}_{i}=\bar{\beta}\Big{(}T({\bm{S}})\Big{)}^{{\dagger}}{\bm{G}}_{i}^{T}{\bm{G}}_{i}{\mathbf{x}}_{i}=\bar{\beta}{\bm{G}}_{i}^{T}{\bm{G}}_{i}{\mathbf{x}}_{i}=\bar{\beta}\frac{1}{d}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}, (24)
𝐱^=1ni=1n𝐱^i=1nβ¯i=1n1d𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i\displaystyle\widehat{{\mathbf{x}}}=\frac{1}{n}\sum_{i=1}^{n}\widehat{{\mathbf{x}}}_{i}=\frac{1}{n}\bar{\beta}\sum_{i=1}^{n}\frac{1}{d}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}

𝑫i{\bm{D}}_{i} is a diagonal matrix. Also, 𝑬iT𝑬id×d{\bm{E}}_{i}^{T}{\bm{E}}_{i}\in\mathbb{R}^{d\times d} is a diagonal matrix, whose jj-th diagonal entry is 11 if coordinate jj is sampled by client ii and 0 otherwise.

Computing β¯\bar{\beta}. To ensure that 𝐱^\widehat{{\mathbf{x}}} is an unbiased estimator, from Eq. 24

𝐱i\displaystyle{\mathbf{x}}_{i} =β¯𝔼[𝑮iT𝑮i]𝐱i\displaystyle=\bar{\beta}\mathbb{E}[{\bm{G}}_{i}^{T}{\bm{G}}_{i}]{\mathbf{x}}_{i}
=β¯d𝔼[𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i]𝐱i\displaystyle=\frac{\bar{\beta}}{d}\mathbb{E}[{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}]{\mathbf{x}}_{i}
=β¯d𝔼𝑫i[𝑫i𝑯T𝔼[𝑬iT𝑬i]=(k/d)𝑰d𝑯𝑫i]𝐱i\displaystyle=\frac{\bar{\beta}}{d}\mathbb{E}_{{\bm{D}}_{i}}\Big{[}{\bm{D}}_{i}{\bm{H}}^{T}\underbrace{\mathbb{E}[{\bm{E}}_{i}^{T}{\bm{E}}_{i}]}_{=(k/d){\bm{I}}_{d}}{\bm{H}}{\bm{D}}_{i}\Big{]}{\mathbf{x}}_{i} (𝑬i\because{\bm{E}}_{i} is independent of 𝑫i{\bm{D}}_{i})
=β¯dk𝔼𝑫i[𝑫i2]𝐱i\displaystyle=\frac{\bar{\beta}}{d}k\mathbb{E}_{{\bm{D}}_{i}}\left[{\bm{D}}_{i}^{2}\right]{\mathbf{x}}_{i} (\because 𝑯T𝑯=d𝑰d{\bm{H}}^{T}{\bm{H}}=d{\bm{I}}_{d})
=β¯kd𝐱i\displaystyle=\frac{\bar{\beta}k}{d}{\mathbf{x}}_{i} (𝑫i2=𝑰\because{\bm{D}}_{i}^{2}={\bm{I}} is now deterministic.)
β¯\displaystyle\Rightarrow\bar{\beta} =dk.\displaystyle=\frac{d}{k}. (25)

Computing the MSE.

MSE=𝔼𝐱^𝐱¯22\displaystyle MSE=\mathbb{E}\Big{\|}\widehat{{\mathbf{x}}}-\bar{{\mathbf{x}}}\Big{\|}_{2}^{2}
=𝔼1nβ¯i=1n1d𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i1ni=1n𝐱i22\displaystyle=\mathbb{E}\Big{\|}\frac{1}{n}\bar{\beta}\sum_{i=1}^{n}\frac{1}{d}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}-\frac{1}{n}\sum_{i=1}^{n}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}
=1n2{𝔼β¯i=1n1d𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i22+i=1n𝐱i22\displaystyle=\frac{1}{n^{2}}\left\{\mathbb{E}\Big{\|}\bar{\beta}\sum_{i=1}^{n}\frac{1}{d}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}+\Big{\|}\sum_{i=1}^{n}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}\right.
2β¯𝔼[i=1n1d𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i],i=1n𝐱i}\displaystyle\qquad\qquad\qquad\left.-2\Big{\langle}\bar{\beta}\mathbb{E}[\sum_{i=1}^{n}\frac{1}{d}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}],\sum_{i=1}^{n}{\mathbf{x}}_{i}\Big{\rangle}\right\}
=1n2{β¯2𝔼i=1n1d𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i22i=1n𝐱i22}\displaystyle=\frac{1}{n^{2}}\Bigg{\{}\bar{\beta}^{2}\mathbb{E}\Big{\|}\sum_{i=1}^{n}\frac{1}{d}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}-\Big{\|}\sum_{i=1}^{n}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}\Bigg{\}} (𝔼[𝐱^]=𝐱¯\because\mathbb{E}[\widehat{{\mathbf{x}}}]=\bar{{\mathbf{x}}})
=1n2{i=1nβ¯2d2𝔼𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i22i=1n𝐱i22\displaystyle=\frac{1}{n^{2}}\left\{\sum_{i=1}^{n}\frac{\bar{\beta}^{2}}{d^{2}}\mathbb{E}\Big{\|}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}-\sum_{i=1}^{n}\Big{\|}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}\right. (26)
+2i=1nl=i+1nβ¯2d2𝔼[𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i],𝔼[𝑫l𝑯T𝑬lT𝑬l𝑯𝑫l𝐱l]2i=1nl=i+1n𝐱i,𝐱l}.\displaystyle\quad\left.+2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\frac{\bar{\beta}^{2}}{d^{2}}\Big{\langle}\mathbb{E}[{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}],\mathbb{E}[{\bm{D}}_{l}{\bm{H}}^{T}{\bm{E}}_{l}^{T}{\bm{E}}_{l}{\bm{H}}{\bm{D}}_{l}{\mathbf{x}}_{l}]\Big{\rangle}-2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\Big{\langle}{\mathbf{x}}_{i},{\mathbf{x}}_{l}\Big{\rangle}\right\}.

Note that in Eq. 26

𝔼𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i22=𝔼[𝐱iT𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i]\displaystyle\mathbb{E}\Big{\|}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}=\mathbb{E}[{\mathbf{x}}_{i}^{T}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}]
=d𝔼[𝐱iT𝑫i𝑯T(𝑬iT𝑬i)2𝑯𝑫i𝐱i]\displaystyle=d\mathbb{E}[{\mathbf{x}}_{i}^{T}{\bm{D}}_{i}{\bm{H}}^{T}({\bm{E}}_{i}^{T}{\bm{E}}_{i})^{2}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}] (𝑫i2=𝑰d;𝑯T𝑯=𝑯𝑯T=d𝑰d\because{\bm{D}}_{i}^{2}={\bm{I}}_{d};{\bm{H}}^{T}{\bm{H}}={\bm{H}}{\bm{H}}^{T}=d{\bm{I}}_{d})
=d𝐱iT𝔼𝑫i[𝑫i𝑯T𝔼[𝑬iT𝑬i]𝑯𝑫i]𝐱i\displaystyle=d{\mathbf{x}}_{i}^{T}\mathbb{E}_{{\bm{D}}_{i}}\left[{\bm{D}}_{i}{\bm{H}}^{T}\mathbb{E}[{\bm{E}}_{i}^{T}{\bm{E}}_{i}]{\bm{H}}{\bm{D}}_{i}\right]{\mathbf{x}}_{i} (𝑬i,𝑫i{\bm{E}}_{i},{\bm{D}}_{i} are independent; (𝑬iT𝑬i)2=𝑬iT𝑬i({\bm{E}}_{i}^{T}{\bm{E}}_{i})^{2}={\bm{E}}_{i}^{T}{\bm{E}}_{i})
=kd𝐱i22,\displaystyle=kd\|{\mathbf{x}}_{i}\|_{2}^{2}, (27)

since 𝔼[𝑬iT𝑬i]=(k/d)𝑰d\mathbb{E}[{\bm{E}}_{i}^{T}{\bm{E}}_{i}]=(k/d){\bm{I}}_{d}, 𝑯T𝑯=d𝑰d{\bm{H}}^{T}{\bm{H}}=d{\bm{I}}_{d} and for ili\neq l

𝔼[𝑫i𝑯T𝑬iT𝑬i𝑯𝑫i𝐱i],𝔼[𝑫l𝑯T𝑬lT𝑬l𝑯𝑫l𝐱l]=k𝐱i,k𝐱l=k2𝐱i,𝐱l.\displaystyle\Big{\langle}\mathbb{E}[{\bm{D}}_{i}{\bm{H}}^{T}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\bm{H}}{\bm{D}}_{i}{\mathbf{x}}_{i}],\mathbb{E}[{\bm{D}}_{l}{\bm{H}}^{T}{\bm{E}}_{l}^{T}{\bm{E}}_{l}{\bm{H}}{\bm{D}}_{l}{\mathbf{x}}_{l}]\Big{\rangle}=\Big{\langle}k{\mathbf{x}}_{i},k{\mathbf{x}}_{l}\Big{\rangle}=k^{2}\Big{\langle}{\mathbf{x}}_{i},{\mathbf{x}}_{l}\Big{\rangle}. (28)

Substituting Eq. 27, 28 in Eq. 26, we get

MSE\displaystyle MSE =1n2{(β¯2d2i=1nkd𝐱i22+2i=1nl=i+1nβ¯2k2d2𝐱i,𝐱l)i=1n𝐱i222i=1nl=i+1n𝐱i,𝐱l}\displaystyle=\frac{1}{n^{2}}\Bigg{\{}\Big{(}\frac{\bar{\beta}^{2}}{d^{2}}\sum_{i=1}^{n}kd\|{\mathbf{x}}_{i}\|_{2}^{2}+2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\frac{\bar{\beta}^{2}k^{2}}{d^{2}}\Big{\langle}{\mathbf{x}}_{i},{\mathbf{x}}_{l}\Big{\rangle}\Big{)}-\sum_{i=1}^{n}\Big{\|}{\mathbf{x}}_{i}\Big{\|}_{2}^{2}-2\sum_{i=1}^{n}\sum_{l=i+1}^{n}\Big{\langle}{\mathbf{x}}_{i},{\mathbf{x}}_{l}\Big{\rangle}\Bigg{\}}
=1n2(dk1)i=1n𝐱i22,\displaystyle=\frac{1}{n^{2}}\Big{(}\frac{d}{k}-1\Big{)}\sum_{i=1}^{n}\|{\mathbf{x}}_{i}\|_{2}^{2},

which is exactly the same as the MSE of Rand-kk. ∎
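
A small numerical check of this identity (toy sizes assumed; SRHT with the 1d\frac{1}{\sqrt{d}} normalization as above): with T1T\equiv 1 and β¯=dk\bar{\beta}=\frac{d}{k}, the decoder 1ni=1nβ¯𝑮iT𝑮i𝐱i\frac{1}{n}\sum_{i=1}^{n}\bar{\beta}{\bm{G}}_{i}^{T}{\bm{G}}_{i}{\mathbf{x}}_{i} has empirical MSE close to 1n2(dk1)i=1n𝐱i22\frac{1}{n^{2}}(\frac{d}{k}-1)\sum_{i=1}^{n}\|{\mathbf{x}}_{i}\|_{2}^{2}, for arbitrary client vectors, since the cross terms cancel in expectation as shown above.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(6)
n, d, k, trials = 4, 32, 4, 5000
H = hadamard(d)
X = rng.normal(size=(n, d))
x_bar = X.mean(axis=0)

err = 0.0
for _ in range(trials):
    est = np.zeros(d)
    for i in range(n):
        D = rng.choice([-1.0, 1.0], size=d)
        rows = rng.choice(d, size=k, replace=False)
        G = (H * D)[rows] / np.sqrt(d)
        est += (d / k) * (G.T @ (G @ X[i])) / n   # beta * G_i^T G_i x_i with beta = d/k
    err += np.sum((est - x_bar) ** 2)

print("empirical MSE                  :", err / trials)
print("(1/n^2)(d/k - 1) sum ||x_i||^2 :", (d / k - 1) / n**2 * np.sum(X ** 2))
```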

C.5 Rand-Proj-Spatial recovers Rand-kk-Spatial (Proof of Lemma 4.1)

Lemma 4.1 (Recovering Rand-kk-Spatial).

Suppose client ii generates a subsampling matrix 𝐄i=[𝐞i1,,𝐞ik]{\bm{E}}_{i}=\begin{bmatrix}\mathbf{e}_{i_{1}},&\dots,&\mathbf{e}_{i_{k}}\end{bmatrix}^{\top}, where {𝐞j}j=1d\{\mathbf{e}_{j}\}_{j=1}^{d} are the canonical basis vectors, and {i1,,ik}\{i_{1},\dots,i_{k}\} are sampled from {1,,d}\{1,\dots,d\} without replacement. The encoded vectors are given as 𝐱^i=𝐄i𝐱i\widehat{{\mathbf{x}}}_{i}={\bm{E}}_{i}{\mathbf{x}}_{i}. Given a function TT, 𝐱^\widehat{{\mathbf{x}}} computed as in Eq. 5 recovers the Rand-kk-Spatial estimator.

Proof.

If client ii applies 𝑬ik×d{\bm{E}}_{i}\in\mathbb{R}^{k\times d} as the random matrix to encode 𝐱i{\mathbf{x}}_{i} in Rand-Proj-Spatial, then by Eq. 5, the decoded vector for client ii at the server is

𝐱^i(Rand-Proj-Spatial)=β¯(T(i=1n𝑬iT𝑬i))𝑬iT𝑬i𝐱i\displaystyle\hat{{\mathbf{x}}}_{i}^{(\text{Rand-Proj-Spatial})}=\bar{\beta}\Big{(}T(\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i})\Big{)}^{{\dagger}}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i} (29)

Notice 𝑬iT𝑬i{\bm{E}}_{i}^{T}{\bm{E}}_{i} is a diagonal matrix, where the jj-th diagonal entry is 11 if coordinate jj of 𝐱i{\mathbf{x}}_{i} is chosen. Hence, 𝑬iT𝑬i𝐱i{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i} can be viewed as choosing kk coordinates of 𝐱i{\mathbf{x}}_{i} without replacement, which is exactly the same as Rand-kk-Spatial’s (and Rand-kk’s) encoding procedure.

Notice i=1n𝑬iT𝑬i\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i} is also a diagonal matrix, where the jj-th diagonal entry is exactly MjM_{j}, i.e. the number of clients who selects the jj-th coordinate as in Rand-kk-Spatial [12]. Furthermore, notice (T(i=1n𝑬iT𝑬i))\Big{(}T(\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i})\Big{)}^{{\dagger}} is also a diagonal matrix, where the jj-th diagonal entry is 1T(Mj)\frac{1}{T(M_{j})}, which recovers the scaling factor used in Rand-kk-Spatial’s decoding procedure.

Rand-Proj-Spatial computes β¯\bar{\beta} as β¯𝔼[(T(i=1n𝑬iT𝑬i))𝑬iT𝑬i𝐱i]=𝐱i\bar{\beta}\mathbb{E}\Big{[}\Big{(}T(\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i})\Big{)}^{{\dagger}}{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i}\Big{]}={\mathbf{x}}_{i}. Since (T(i=1n𝑬iT𝑬i))\Big{(}T(\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i})\Big{)}^{{\dagger}} and 𝑬iT𝑬i𝐱i{\bm{E}}_{i}^{T}{\bm{E}}_{i}{\mathbf{x}}_{i} recover the scaling factor and the encoding procedure of Rand-kk-Spatial, and β¯\bar{\beta} is computed in exactly the same way as Rand-kk-Spatial does, β¯\bar{\beta} will be exactly the same as in Rand-kk-Spatial.

Therefore, 𝐱^i(Rand-Proj-Spatial)\hat{{\mathbf{x}}}_{i}^{(\text{Rand-Proj-Spatial})} in Eq. 29 with 𝑬i{\bm{E}}_{i} as the random matrix at client ii recovers 𝐱^i(Rand-k-Spatial)\hat{{\mathbf{x}}}_{i}^{(\text{Rand-$k$-Spatial})}. This implies Rand-Proj-Spatial recovers Rand-kk-Spatial in this case. ∎
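
A short numerical illustration (toy sizes, our own variable names) of the key observation in the proof: with subsampling matrices 𝑬i{\bm{E}}_{i}, the matrix i=1n𝑬iT𝑬i\sum_{i=1}^{n}{\bm{E}}_{i}^{T}{\bm{E}}_{i} is diagonal, with the counts MjM_{j} on its diagonal.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 5, 8, 3
counts = np.zeros(d)
for i in range(n):
    idx = rng.choice(d, size=k, replace=False)  # coordinates sampled by client i
    counts[idx] += 1                            # adds the diagonal of E_i^T E_i
print("diag(sum_i E_i^T E_i) = M =", counts)    # M_j = number of clients selecting coordinate j
```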

Appendix D Additional Experiment Details and Results

Implementation. All experiments are conducted in a cluster of 2020 machines, each of which has 40 cores. The implementation is in Python, mainly based on numpy and scipy. All code used for the experiments can be found at https://github.com/11hifish/Rand-Proj-Spatial.

Data Split. For the non-IID dataset split across the clients, we follow [62] to split Fashion-MNIST, which is used in distributed power iteration and distributed kk-means. Specifically, the data is first sorted by labels and then divided into 2nn shards, with each shard corresponding to the data of a particular label. Each client is then assigned 2 shards (i.e., data from 22 classes). However, this approach only works for datasets with discrete labels (i.e. datasets used in classification tasks). For the other dataset, UJIndoor, which is used in distributed linear regression, we first sort the dataset by the ground-truth target value and then divide the sorted dataset across the clients.
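
A minimal sketch of the label-sorted shard split described above (the function and variable names are ours, and the released code may differ): sort by label, cut into 2nn contiguous shards, and assign two shards to each of the nn clients.

```python
import numpy as np

def non_iid_split(features, labels, n_clients, rng):
    order = np.argsort(labels, kind="stable")        # sort the data by label
    shards = np.array_split(order, 2 * n_clients)    # 2n contiguous shards
    shard_ids = rng.permutation(2 * n_clients)
    per_client = [
        np.concatenate([shards[shard_ids[2 * c]], shards[shard_ids[2 * c + 1]]])
        for c in range(n_clients)                    # 2 shards per client
    ]
    return [(features[idx], labels[idx]) for idx in per_client]

# Toy usage with random data standing in for Fashion-MNIST:
rng = np.random.default_rng(8)
X, y = rng.normal(size=(1000, 16)), rng.integers(0, 10, size=1000)
clients = non_iid_split(X, y, n_clients=10, rng=rng)
print([int(np.unique(yc).size) for _, yc in clients])  # each client sees only a few labels
```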

D.1 Additional experimental results

For each one of the three tasks, distributed power iteration, distributed kk-means, and distributed linear regression, we provide additional results when the data split is IID across the clients for smaller n,kn,k values in Section D.1.1, and when the data split is Non-IID across the clients in Section D.1.2. For the Non-IID case, we use the same settings (i.e. n,k,dn,k,d values) as in the IID case.

Discussion. For n,kn,k values that are small compared to the data dimension dd, the server receives less information about the client vectors and there is less correlation to exploit. Hence, both Rand-kk-Spatial and Rand-Proj-Spatial perform better as nknk increases. When n,kn,k are small, one might notice that Rand-Proj-Spatial performs worse than Rand-kk-Wangni in some settings. However, Rand-kk-Wangni is an adaptive estimator, which optimizes the sampling weights for choosing the client vector coordinates through an iterative process. That means Rand-kk-Wangni requires more computation from the clients, while in practice, the clients often have limited computational power. In contrast, our Rand-Proj-Spatial estimator is non-adaptive and the server does more computation instead of the clients. This is more practical since the central server usually has more computational power than the clients in applications like FL. See the introduction for more discussion.

In most settings, we observe the proposed Rand-Proj-Spatial has a better performance compared to Rand-kk-Spatial. Furthermore, as one would expect, both Rand-kk-Spatial and Rand-Proj-Spatial perform better when the data split is IID across the clients since there is more correlation among the client vectors in the IID case than in the Non-IID case.

D.1.1 More results in the IID case

Distributed Power Iteration and Distributed KK-Means. We use the Fashion-MNIST dataset for both distributed power iteration and distributed kk-means, which has a dimension of d=1024d=1024. We consider more settings for distributed power iteration and distributed kk-means here: n=10,k{5,25,51}n=10,k\in\{5,25,51\}, and n=50,k{5,10}n=50,k\in\{5,10\}.

Figure 8: More results of distributed power iteration on Fashion-MNIST (IID data split) with d=1024d=1024 when n=10n=10, k{5,25,51}k\in\{5,25,51\} and when n=50n=50, k{5,10}k\in\{5,10\}.
Figure 9: More results on distributed kk-means on Fashion-MNIST (IID data split) with d=1024d=1024 when n=10,k{5,25,51}n=10,k\in\{5,25,51\} and when n=50,k{10,51}n=50,k\in\{10,51\}.

Distributed Linear Regression. We use the UJIndoor dataset for distributed linear regression, which has a dimension of d=512d=512. We consider more settings here: n=10,k{5,25}n=10,k\in\{5,25\} and n=50,k{1,5}n=50,k\in\{1,5\}.

Figure 10: More results of distributed linear regression on UJIndoor (IID data split) with d=512d=512, when n=10,k{5,25}n=10,k\in\{5,25\} and when n=50,k{1,5}n=50,k\in\{1,5\}. Note when k=1k=1, the Induced estimator is the same as Rand-kk.

D.1.2 Additional results in the Non-IID case

In this section, we report results when the dataset split across the clients is Non-IID, using the same datasets and exactly the same set of n,kn,k values as in the IID case.

Distributed Power Iteration and Distributed KK-Means. Again, both distributed power iteration and distributed kk-means use the Fashion-MNIST dataset, with a dimension d=1024d=1024. We consider the following settings for both tasks: n=10,k{5,25,51,102}n=10,k\in\{5,25,51,102\} and n=50,k{5,10,20}n=50,k\in\{5,10,20\}.

Figure 11: Results of distributed power iteration when the data split is Non-IID. n=10,k{5,25,51,102}n=10,k\in\{5,25,51,102\} and n=50,k{5,10,20}n=50,k\in\{5,10,20\}.
Figure 12: Results of distributed kk-means when the data split is Non-IID. n=10,k{5,25,51,102}n=10,k\in\{5,25,51,102\} and n=50,k{5,10,20}n=50,k\in\{5,10,20\}.

Distributed Linear Regression. Again, we use the UJIndoor dataset for distributed linear regression, which has a dimension d=512d=512. We consider the following settings: n=10,k{5,25,50}n=10,k\in\{5,25,50\} and n=50,k{1,5,50}n=50,k\in\{1,5,50\}.

Figure 13: Results of distributed linear regression when the data split is Non-IID. n=10,k{5,25,50}n=10,k\in\{5,25,50\} and n=50,k{1,5,50}n=50,k\in\{1,5,50\}.