Quantized Adaptive Subgradient Algorithms and Their Applications

Ke Xu, Jianqiao Wangni, Yifan Zhang, Deheng Ye, Jiaxiang Wu and Peilin Zhao

(2022)

Abstract.

Data explosion and an increase in model size drive the remarkable advances in large-scale machine learning, but also make model training time-consuming and model storage difficult. To address the above issues in the distributed model training setting which has high computation efficiency and less device limitation, there are still two main difficulties. On one hand, the communication costs for exchanging information, e.g., stochastic gradients among different workers, is a key bottleneck for distributed training efficiency. On the other hand, less parameter model is easy for storage and communication, but the risk of damaging the model performance. To balance the communication costs, model capacity and model performance simultaneously, we propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training. To be specific, we explore the combination of gradient quantization and sparse model to reduce the communication cost per iteration in distributed training. A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity. Moreover, we theoretically find that a large quantization error brings in extra noise, which influences the convergence and sparsity of the model. Therefore, a threshold quantization strategy with a relatively small error is adopted in QCMD adagrad and QRDA adagrad to improve the signal-to-noise ratio and preserve the sparsity of the model. Both theoretical analyses and empirical results demonstrate the efficacy and efficiency of the proposed algorithms.

Distributed training, adaptive subgradient, gradient quantization, sparse model.

^†^†copyright: acmcopyright^†^†journalyear: 2022^†^†conference: 2022; July; 2022 ^†^†price: 15.00^†^†ccs: Computing methodologies Distributed computing methodologies

1. Introduction

The large model has recently been extremely successful in machine learning and data mining with the growth of data volume. A great number of complex deep neural networks (zhao2018adaptive, ; zhang2018online, ; krizhevsky2012imagenet, ; zhang2019whole, ; cao2019multi, ) have been devised to solve real-world applications problem. However, many practical applications can only provide limited computing resources, i.e. limited storage devices and unaccelerated hardware such as CPUs. These constrain the model complexity and make model training extremely time-consuming under the huge amount of training datasets. To solve these issues, this paper explores to accelerate model training and reduce model storage costs simultaneously for large-scale machine learning.

Distributed training offers a potential solution to solve the issue of long training time (balcan2016communication, ; chilimbi2014project, ; xing2015petuum, ; hsieh2017communication, ; zhang2017online, ). The data parallelism (li2017scaling, ; gupta2016model, ) is one of the most popular framework in distributed training. As shown in Fig. 1, data parallelism framework has one or more computer workers connected via some communication networks, while multiple model replicas are trained in parallel on each worker. A global parameter server ensures the consistency among replicas of each worker by collecting all gradients computed from different workers and then averaging them to update parameters. The goal is to optimize a global objective function formed by the average of a series of local loss functions derived from local computation on each worker.

Refer to caption — Figure 1. Data parallelism in a typical distributed training scheme, where $g_{t}^{(m)}$ denotes the local gradient computed by the $m^{th}$ worker, and $M$ denotes the total number of workers. The parameter server collects all local gradients from the workers. In addition, $\bar{g}_{t}$ is the averaged gradients, which is synchronized by the parameter server and pulled by each worker to update the model.

In the above distributed framework, time consumption is mainly caused by computation and communication. While increasing the number of workers helps to reduce computation time, the communication overhead for exchanging training information, e.g., stochastic gradients among different workers, is a key bottleneck for training efficiency, especially for high delay communication network. Even worse, some slow workers can adversely affect the progress of fast workers, leading to a drastic slowdown of the overall convergence process (eshraghi2020distributed, ). Asynchronous communication (huo2018asynchronous, ), alleviate the negative effect of slow machines and decentralized algorithms (lian2017can, ; xin2020decentralized, ) get rid of the dependency on high communication costs on the central node. But, the parameter variance among workers may deteriorate model accuracy(ko2021aladdin, ). In this paper, we concentrate on the data parallelism framework which belongs to synchronous centralized algorithms. There are many types of research focusing on how to save communication costs. Some studies (bekkerman2011scaling, ; de2016efficient, ) propose to reduce the number of training rounds. For example, one may use SVRG (johnson2013accelerating, ; shah2016trading, ) to periodically calculate an accurate gradient estimate to reduce the variance introduced by stochastic sampling, which, however, is an operation with high calculation costs. Other methods focus on reducing the precision of gradients. DoReFa-Net (zhou2016dorefa, ) and QSGD (alistarh2017qsgd, ) quantize gradients into fixed-point numbers, so that much fewer bits are needed to be transmitted. More aggressive quantization methods, such as 1-bit-sgd (seide20141, ) and ternary gradient (wen2017terngrad, ), sacrifice a certain degree of expression power to achieve the purpose of reducing communication costs. (tang2019doublesqueeze, ) studies the double squeeze of the gradient that not only the local gradient is compressed but also the synchronous gradient.

The size of the model is not only a determinant of memory usage but also an important factor of save the communication cost in distributed training. Although in data parallelism, we transmit the gradient instead of the model parameters. But for a sparse model , we can avoid transmitting unnecessary gradients corresponding to the parameters that always be 0. The combination of the above bandwidth reduction methods and parallel stochastic gradient descent (PSGD) has been demonstrated to be effective for model training without model sparsity constraints. Therefore, inspired by some online algorithms such as adaptive composite mirror descent and adaptive regularized dual averaging (duchi2011adaptive, ), which perform well in generating sparse models, we design two corresponding communication efficient sparse model training algorithms for distributed framework, named quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual averaging adaptive subgradient (QRDA adagrad).

To be specific, we define the distributed objective not only under a nonsmooth constraint to keep the model sparse, but also construct a proximal function base on a quantized gradient to achieve a balance between communication costs, accuracy, and model sparsity. The quantization is conducted not only for local gradients computed by all local workers but also for the aggregated gradient like the double squeeze (tang2019doublesqueeze, ). We proof the convergence rate for distributed QCMD adagrad and QRDA adagrad is $O(\frac{1}{\sqrt{T}})$ . Besides, our theoretical analysis shows that quantization introduces additional noise which affects the model convergence and sparsity. Hence, we apply a threshold quantization method with small error to the gradient to reduce the influence of noise on model training.

Our major contributions are summarized as follows:

•

We propose two communication efficient sparse model training algorithms for the distributed framework, namely QCMD adagrad and QRDA adagrad.
•

We theoretically find that quantization noise affects the model convergence and sparsity, and thus we adopt a small error quantization method to alleviate noise influence. We provide theoretical results on the regret for the QCMD adagrad and QRDA adagrad. The convergence rate is $O(\frac{1}{\sqrt{T}})$ .
•

We apply QCMD adagrad and QRDA adagrad to both linear models and convolutional neural networks. Experimental results demonstrate the effectiveness and efficiency of the proposed methods in convex and non-convex problems.

2. Related Work

We detail related studies in three aspects as follows.

Gradients sparsification. Gradient sparsification (seide20141, ) imposes sparsity onto gradients, where only a small fraction of elements of gradients are exchanged across workers based on their importance. Lin et al. (lin2017deep, ) find most of the gradient exchange in distributed SGD are redundant, and propose a deep gradient compression method to cut the gradient size of ResNet-50 from 97MB to 0.35MB, and that of DeepSpeach from 488MB to 0.74MB. Aji et al. (aji2017sparse, ) sparsify gradients by removing the $R\%$ smallest gradients regarding absolute value. Simlarly, (stich2018sparsified, ) propose sparsified SGD with top k-sparsification. Wangni et al. (wangni2017gradient, ) analyse how to achieve the optimal sparseness under certain variance constraints and the unbiasedness of the gradient vector is kept after sparsification.

Gradients quantization. Gradient quantization replaces the original gradient with a small number of fixed values. Quantized SGD (QSGD) (alistarh2017qsgd, ) adjusts the number of bits of the exchanging gradients to balance the bandwidth and accuracy. More aggressive quantization methods, such as the binary representation (seide20141, ), cut each component of the gradient to its sign. TernGrad (wen2017terngrad, ) uses three numerical levels {-1, 0, 1} and a scaler, e.g. maximum norm or $l_{2}$ norm of the gradient, to replace the full precision gradient. This aggressive method can be regarded as a variant of QSGD. To reduce the influence of noise introduced by the aggressive quantization, Wu et al. (wu2018error, ) utilize the accumulated quantization error to compensate the quantized gradient. Several applications, such as federated machine learning (ML) at the wireless edge (amiri2020machine, ), benefit from the error compensation. In addition, (magnusson2020maintaining, ) introduce a family of adaptive gradient quantization schemes which can enable linear convergence in any norm for gradient-descent-type algorithms. (alimisis2021communication, ) propose a quantized Newton’s method which is suitable for ill-conditioned but low-dimensional problems, as it reduces the communication complexity by a trade-off between the dependency on the dimension of the input features and its condition number. In (alimisis2021communication, ), lattice quantization (davies2021new, ) that reduces the variance is adapted to quantize the covariance matrix.

Stochastic optimization. In the modern implementation of large-scale machine learning algorithms, stochastic gradient descent (SGD) is commonly used as the optimization method in distributed training frameworks because of its universality and high computational efficiency for each iteration. SGD intrinsically implies gradient noise, which helps to escape saddle points for non-convex problems, like neural networks (jin2017escape, ; kleinberg2018alternative, ). However, when producing a sparse model, simply adding a subgradient of the $l_{1}$ penalty to the gradient of the loss does not essentially produce parameters that are exactly zero. More sophisticated approaches such as composite mirror descent (singer2009efficient, ; duchi2010composite, ) and regularized dual averaging (xiao2010dual, ) do succeed in introducing sparsity, but the sparsity of model is limited. Their adaptive subgradient extensions (Adagrad) with $l_{1}$ regularization (duchi2011adaptive, ) produce even better accuracy vs. sparsity tradeoffs. Compared with SGD which is very sensitive to the learning rate, Adagrad (duchi2011adaptive, ) dynamically incorporates knowledge of the geometry of the data curvature of the loss function to adjust the learning rate of gradients. As a result, it requires no manual tuning of learning rate and appears robust to noisy gradient information and large-scale high-dimensional machine learning.

3. Problem Definition and our approach

Notations: Before the problem definition, we define the following notations. Lower case bold letters, such as ${\bf v}$ , are used to denote vectors. Matrices are capital case bold letters, like ${\bf A}$ . Scalars are lower case italic letters, such as $s$ . We use [d] to denote the set $\{1,2,...,d\}$ . For a vector ${\bf v}\in\mathbb{R}^{d}$ , the $l_{q}$ norm defined as $||{\bf v}||_{q}=(\sum_{i\in[d]}|{\bf v}_{i}|^{q})^{\frac{1}{q}}$ , and $||{\bf v}||_{\infty}=\max_{i\in[d]}|{\bf v}_{i}|$ . For a symmetric matrix ${\bf A}\in\mathbb{R}^{d\times d}$ and any vector ${\bf x}\in\mathbb{R}^{d}$ , ${\bf y}\in\mathbb{R}^{d}$ , the Bregman divergence associated with a strongly convex and differentiable function $\psi({\bf x})=\frac{1}{2}{\bf x}^{\top}{\bf A}{\bf x}$ is defined as $B_{\psi}({\bf x},{\bf y})=\frac{1}{2}({\bf x}-{\bf y})^{\top}{\bf A}({\bf x}-{\bf y})$ . The Mahalanobis norm $||{\bf x}||_{\psi}=({\bf x}^{\top}{\bf A}{\bf x})^{\frac{1}{2}}$ and it’s associated dual norm $||{\bf x}||_{\psi^{*}}=({\bf x}^{\top}{\bf A}^{-1}{\bf x})^{\frac{1}{2}}$ . $<{\bf x},{\bf y}>$ is used to denote the inner product between ${\bf x}$ and ${\bf y}$ . We also make frequent use of the following matrix, where ${\bf v}_{1:t}=[{\bf v}_{1}\cdot\cdot\cdot{\bf v}_{t}]$ denote the matrix obtained by concatenating the vector sequence. We denote the $i^{th}$ row of ${\bf v}_{1:t}$ by ${\bf v}_{1:t,i}$ , which amounts to the concatenation of the $i_{th}$ component of each vector.

3.1. Problem Definition

Stochastic Optimization (Lee:2018:DQA:3219819.3220075, ) is a popular approach in training large-scale machine learning models. We denote the objective function as $f({\bf x})$ , depending on the model parameter ${\bf x}\in\chi$ . The dataset could be split into $M$ workers, where each has $N$ data samples. We denote $f_{m,n}$ as the loss function associated with the $n^{th}$ sample on the $m^{th}$ worker, and $f({\bf x})$ is an average over all loss functions on all workers.

(1)

f({\bf x})=\frac{1}{M}\frac{1}{N}\sum_{m,n}f_{m,n}({\bf x})

When training on a single machine, the stochastic optimization, like SGD, randomly chooses a subset of samples ${\bf{Z}}_{t}$ and calculates the gradient $f^{\prime}_{t}({\bf x}_{t},{\bf z}_{t})$ , where ${\bf z}_{t}\in{\bf{Z}}_{t}$ , to approximates the true gradient $f^{\prime}_{t}({\bf x}_{t})$ at $t^{th}$ iteration by

(2)

f^{\prime}_{t}({\bf x}_{t})=\frac{1}{|{\bf{Z}}_{t}|}\sum_{{\bf z}_{t}\in{\bf{Z}}_{t}}(f^{\prime}_{t}({\bf x}_{t},{\bf z}_{t}))

Similarly, in a distributed setting, each worker calculates the stochastic gradient $f^{\prime}_{t}({\bf x}_{t})^{(m)}$ based on its local data ${\bf{Z}}_{t}^{(m)}$ . The parameter server averages the local gradients to get the synchronized gradient,

(3)

SynG_{t}({\bf x}_{t})=\frac{1}{M}\sum_{m=1}^{M}f^{\prime}_{t}({\bf x}_{t})^{(m)}

As previous work on gradient sparsification and quantization suggested, we could represent $f^{\prime}_{t}({\bf x}_{t})^{(m)}$ in a low-precision way. Since extreme gradient sparsification can also be seen as gradient quantization, this paper focuses on reducing communication costs through gradient quantization. We denote $Q(\cdot)$ as the quantization function and then the synchronized quantized gradient can be obtained by

(4)

SynQ_{t}=\frac{1}{M}\sum_{m=1}^{M}Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})

The update formula for the synchronous distributed stochastic optimization with function $Q(\cdot)$ is

(5)

{\bf x}_{t+1}={\bf x}_{t}-\eta\frac{1}{M}\sum_{m=1}^{M}Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})

where $\eta$ is the learning rate.

3.2. Quantized CMD Adagrad and Quantized RDA Adagrad

General Framework. In the data parallelism distributed training scheme, the number of the exchanged gradients is equal to the number of model learnable parameters. But for a sparse model, we can avoid transmitting unnecessary gradient information, i.e. gradient corresponding to parameters that are always 0 before and after model update. To this end, each worker utilizes their local full precision gradients to obtain the model parameters after the local update. With the assumption that the model parameters change smoothly and be sparsity, we can transmit a small amount of the necessary gradient to save the communication cost. Furthermore, this small amount of the local gradients is quantized and encoded. After the global parameter server receives and decodes the quantized gradients, synchronous gradients are calculated by averaging the local sparse quantized gradients. The synchronous sparse gradients are then quantized and sent to the workers for model parameter update. The general framework of QCMD adagrad and QRDA adagrad is shown in Fig. 2.

Generate Sparse Model. Vanilla SGD is not particularly effective at producing a sparse model, even adding a subgradient of $l_{1}$ penalty to the gradient of the loss. Some other approaches such as proximal gradient descent (also named composite mirror descent) (singer2009efficient, ; duchi2010composite, ) and regularized dual averaging (xiao2010dual, ) introduce limited sparsity with $l_{1}$ regularization. Their adaptive subgradient extensions, CMD adagrad and RDA adagrad, (duchi2011adaptive, ) produce better accuracy vs. sparsity tradeoffs.

The original proximal gradient method (duchi2010composite, ) employs an immediate trade-off among the current gradient $f_{t}^{\prime}({\bf x}_{t})$ , the regularizer $\phi$ and proximal function $\psi$ . The proximal function $\psi$ aims to keep ${\bf x}$ staying close to ${\bf x}_{t}$ and sometimes be simply set to $\psi({\bf x})=\frac{1}{2}||{\bf x}-{\bf x}_{t}||^{2}_{2}$ ). It makes the model parameters meet the assumption of steady change. To achieve better bound of regret, the CMD adagrad adopts $\psi_{t}$ which varies with $t$ . The update for CMD adagrad amounts to solving

(6)

{\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}{\eta f^{\prime}_{t}({\bf x}_{t})^{\top}}{\bf x}+\eta\phi({\bf x})+B_{\psi_{t}}({\bf x},{\bf x}_{t})

Similarly, the RDA adagrad encompasses a trade off among a gradient-dependent linear term, the regularizer $\phi$ and a strongly convex term $\psi_{t}$ for well conditioned predictions. The update for RDA adagrad is shown as:

(7)

\begin{split}{\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}\{{\eta}<\frac{1}{t}\sum_{\tau=1}^{t}f^{\prime}_{\tau}({\bf x}_{\tau}),{\bf x}>+\eta\phi({\bf x})+\frac{1}{t}\psi_{t}({\bf x})\}\end{split}

For both CMD adagrad and RDA adagrad, $\phi({\bf x})=\lambda||{\bf x}||_{1}$ encourages the sparsity of the model parameter. The Bregman divergence associated with $\psi_{t}({\bf x})=\frac{1}{2}{\bf x}^{\top}{\bf A}_{t}{\bf x}$ is defined as

(8)

\displaystyle B_{\psi_{t}}({\bf x},{\bf x}_{t})=\frac{1}{2}({\bf x}-{\bf x}_{t})^{\top}{\bf A}_{t}({\bf x}-{\bf x}_{t})

${\bf A}_{t}$ is a diagonal matrix also named the adaptive learning rate matrix. Concretely, for some small fixed $\delta\geq 0$ , the adaptive learning rate matrix is:

(9)

\displaystyle{\bf A}_{t}=\delta{\bf I}+diag({\bf a}_{t})

where ${\bf a}_{t,d}=||f^{\prime}({\bf x})_{1:t,d}||_{2}$ , $f^{\prime}({\bf x})_{1:t}=[f^{\prime}({\bf x})_{1:t-1},f^{\prime}_{t}({\bf x}_{t})]$ denote the matrix obtained by concatenating the gradient sequence. $f^{\prime}({\bf x})_{1:t,d}$ is the $d^{th}$ row of this matrix and $diag(\cdot)$ converts a vector into a diagonal matrix. ${\bf I}$ is identity matrix.

To apply them to distributed learning, each worker needs to maintain a adaptive learning rate matrix locally. On the $m_{th}$ worker, the elements of the gradient $f^{\prime}({\bf x}_{t})^{(m)}$ are selected by a selected function:

(10)

\displaystyle S(f^{\prime}_{t}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)})=\{f^{\prime}_{t,d}({\bf x}_{t})^{(m)}|\mathbb{I}_{d}^{(m)}=1\}

where $\mathbb{I}^{(m)}$ is an indicator function that

(11)

\displaystyle\mathbb{I}_{d}=\left\{\begin{aligned} 1,&\text{ }{\bf x}_{t,d}\neq 0\text{ or }\hat{\bf x}_{t+1,d}\neq 0\\ 0,&\text{ otherwise}\end{aligned}\right.

$\hat{\bf x}_{t+1}^{(m)}$ is the local parameter that is updated based on the local full precision gradient $f^{\prime}_{t}({\bf x}_{t})^{(m)}$ and local adaptive learning rate matrix ${\bf A}_{t}^{(m)}$ . Both the indicator and selected elements of the local gradients are sent to the global parameter server for decoding. The synchronous indicator is calculated by ”bitwise or” between the $M$ indicators.

Gradient quantization. Besides reducing the transmission amount of the corresponding gradient by generating a sparse model, we also hope to further reduce the communication cost by quantifying the selected elements of the gradient. The quantization function is define as $Q(\cdot)$ . Each worker computes $Q(S(f^{\prime}_{t}({\bf x}_{t})^{(m)},\mathbb{I}^{m}))$ . The global parameter server first decodes the quantized gradient through $S^{-1}$ and computes the synchronized quantized gradient based on Eq.(4). We furtherly quantize the synchronized gradient $SynQ_{t}$ to save more communication costs by

(12)

{\bf q}_{t}=Q(SynQ_{t})

The server pushes the double quantized gradient ${\bf q}_{t}$ to every worker. Specially, the adaptive learning rate is calculated by the worker according to ${\bf q}_{t}$ sequence. So we construct a quantized gradient based adaptive learning rate matrix,

(13)

\displaystyle{\bf H}_{t}=\delta{\bf I}+diag({\bf c}_{t})

where ${\bf c}_{t,d}=||{\bf q}_{1:t,d}||_{2}$ , ${\bf q}_{1:t}=[{\bf q}_{1:t-1},{\bf q}_{t}]$ . ${\bf H}_{t}$ dynamically incorporates knowledge of the geometry of the data and the curvature of the loss function based on quantized gradient, which makes QCMD adagrad and QRDA adagrad achieve a good balance between model sparsity, accuracy and communication cost. Therefore, the objective function for QCMD adagrad becomes

(14)

{\bf x}_{t+1}=\arg\min_{x\in\chi}{\eta{\bf q}_{t}^{\top}}{\bf x}+\eta\phi({\bf x})+B_{\psi_{t}}({\bf x},{\bf x}_{t})

The objective function for QRDA adagrad becomes

(15)

\begin{split}{\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}\{{\eta}<\frac{1}{t}\sum_{\tau=1}^{t}{\bf q}_{\tau},{\bf x}>+\eta\phi({\bf x})+\frac{1}{t}\psi_{t}({\bf x})\}\end{split}

Solving Eq. (14), we have the update rule for QCMD adagrad:

(16)

{\bf x}_{t+1,i}=sign({\bf x}_{t,i}-\eta{\bf H}_{t,ii}^{-1}{\bf q}_{t,i})[|{\bf x}_{t,i}-\eta{\bf H}_{t,ii}^{-1}{\bf q}_{t,i}|-\lambda\eta{\bf H}_{t,i}^{-1}]_{+}

The subscript $(\cdot)_{+}$ represents that we retain the value greater than 0. Solving Eq. (15), the update rule for QRDA adagrad becomes:

(17)

{\bf x}_{t+1,i}=sign(-\sum_{\tau=1}^{t}{\bf q}_{\tau,i})t\eta{\bf H}_{t,ii}^{-1}[|\frac{1}{t}\sum_{\tau=1}^{t}{\bf q}_{\tau,i}|-\lambda]_{+}

The overall QCMD adagrad and QRDA adagrad schemes are presented in Alg. 1.

for

t=1

T

for M worker:

m=1,...,M

do in parallel

Randomly chooses

{\bf{Z}}_{t}^{(m)}

from the local data.

Compute the gradients

f_{t}^{\prime}({\bf x}_{t})^{(m)}

under

{\bf{Z}}_{t}^{(m)}

Compute the

\hat{{\bf x}}_{t+1}^{(m)}

based on

f^{\prime}_{t}({\bf x})^{(m)}

and

\hat{{\bf H}}_{t}=\delta{\bf I}+diag(\hat{{\bf c}}_{t})

, where

\hat{{\bf c}}_{t,d}=||[{\bf q}_{1:t-1},f_{t}^{\prime}({\bf x}_{t})^{(m)}]||_{2}

Compute the indicator function

\mathbb{I}^{m}

Quantize the selected elements of the gradients with

Q(S(f_{t}^{\prime}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)}))

Push

Q(S(f_{t}^{\prime}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)}))

and

\mathbb{I}^{m}

to the server.

Server:

Decode

Q(S(f_{t}^{\prime}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)}))

to get

Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})

Compute the synchronous indicator

\mathbb{I}^{Syn}

Average the quantized gradients

SynQ_{t}=\frac{1}{M}\sum_{m=1}^{M}Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})

Quantize the aggregated gradient

SynQ_{t}

{\bf q}_{t}=Q(SynQ_{t})

Push

S({\bf q}_{t},\mathbb{I}^{Syn})

and

\mathbb{I}^{Syn}

to every worker.

for M worker:

j=1,...,M

do in parallel

Pull

S({\bf q}_{t},\mathbb{I}^{Syn})

and

\mathbb{I}^{Syn}

from the server.

Decode

S({\bf q}_{t},\mathbb{I}^{Syn})

to get

{\bf q}_{t}

Calculate the adaptive learning rate

{\bf H}_{t}=\delta{\bf I}+diag({\bf c}_{t})

where

c_{t,d}=||{\bf q}_{1:t,d}||_{2}

{\bf q}_{1:t}=[{\bf q}_{1:t-1},{\bf q}_{t}]

QCMD adagrad: update

{\bf x}_{t+1}

based on Eq. (16).

QRDA adagrad: update

{\bf x}_{t+1}

based on Eq. (17).

Algorithm 1 QCMD adagrad and QRDA adagrad.

3.3. Quantization error

In this section, we theoretically analyse that the quantization error introduced by the gradient quantization affects the convergence rate of regret for QCMD adagrad and QRDA adagrad.

Proposition 3.1.

Let the sequence ${{\bf x}_{t}}$ be defined by the update (16), $SynQ_{t}$ defined by Eq. (4), ${\bf q}_{t}$ defined by Eq. (12), $\mathbb{E}[{\bf q}_{t}]=f_{t}({\bf x}_{t})$ . The Mahalanobis norm $||\cdot||_{\psi_{t}}=\sqrt{<\cdot,{\bf H}_{t}\cdot>}$ and $||\cdot||_{\psi^{*}_{t}}=\sqrt{<\cdot,\frac{1}{{\bf H}_{t}}\cdot>}$ be the associated dual norm. ${\bf x}^{*}$ is the optimal solution to $f({\bf x})$ , for any ${\bf x}^{*}\in\chi$ ,

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}B_{\psi_{1}}({\bf x}^{*},{\bf x}_{1})\\ &\quad+\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[B_{\psi_{t+1}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})]\\ \end{split}

Proof.

See Appendix for the proof. ∎

Proposition 3.2.

Let the sequence ${{\bf x}_{t}}$ be defined by the update (17), $SynQ_{t}$ defined by Eq. (4), ${\bf q}_{t}$ defined by Eq. (12), $\mathbb{E}[{\bf q}_{t}]=f_{t}({\bf x}_{t})$ . For $\forall{\bf g}\in\mathbb{R}^{d}$ , let $\psi_{t}^{*}({\bf g})$ be the conjugate dual of $t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})$ , $\phi({\bf x})=\lambda||{\bf x}||_{1}$ , $||\cdot||_{\psi_{t}^{*}}=\sqrt{<\cdot,\frac{\eta}{2{\bf H}_{t}}\cdot>}$ . $x^{*}$ is the optimal solution to $f({\bf x})$ , for any ${\bf x}^{*}\in\chi$ , we have

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{\eta}\mathbb{E}_{\bf q}[\psi_{T}({\bf x}^{*})]+\frac{\eta}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\\ \end{split}

Proof.

See Appendix for the proof. ∎

Remark: Since $\mathbb{E}_{\bf q}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}$ is the quantization variance scaled by the adaptive learning rate and

\begin{split}\mathbb{E}_{\bf q}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}&\leq\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}\\ &\leq\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\end{split}

We can simply regard $\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}$ as the error introduced by quantization for QCMD adagrad and regard $\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}$ as the error introduced by quantization for QRDA adagrad. Therefore, Proposition 1. and Proposition 2. show that gradient quantization introduces additional noise which affects the model convergence and sparsity.

3.4. Threshold Quantization

Although gradient quantization can reduce the cost of gradient communication in distributed training, it also introduces additional errors, which affect the convergence of the model and the sparsity of parameters in QCMD adagrad and QRDA adagrad. As an unbiased gradient quantization method, TernGrad (wen2017terngrad, ) has already made a good balance between the encoding cost and accuracy of the general model. However, when it comes to the sparse model, a large quantization error still leads to slower convergence of the $l_{1}$ norm as a part of the objective function, which affects the sparsity of the model. In order to mitigate this problem, we apply the threshold quantization method to the QCMD adagrad and QRDA adagrad.

Threshold quantization is an existing quantization method used for model quantization in (TWN, ). We apply it to gradient quantization since it produces less error than Terngrad(wen2017terngrad, ). In this section, we use ${\bf v}^{t}$ to represent the (stochastic) gradient in the $t_{th}$ iteration. Fig. 3 gives a brief explanation of threshold quantization, and more analysis is provided below.

$Q_{\triangle}(\cdot)$ is the threshold quantization function defined as below

(18)

Q_{\triangle}({\bf v}^{t})=s{\bf v},

where ${\bf v}$ is a ternary vector, $s$ is a non-negative scaling factor and $\triangle$ denotes the threshold. For the $i^{th}$ component of ${\bf v}$ ,

(19)

{\bf v}_{i}=\left\{\begin{array}[]{rcl}+1,&&{\text{if }{\bf v}_{i}^{t}>\Delta};\\ 0,&&{\text{if }|{\bf v}_{i}^{t}|\leq\Delta};\\ -1,&&{\text{if }{\bf v}_{i}^{t}<-\Delta}.\\ \end{array}\right.

The error ${\epsilon}_{t}$ is defined as the difference between the full precision vector ${\bf v}^{t}$ and $Q_{\triangle}({\bf v}^{t})$ :

(20)

\begin{split}{\epsilon}_{t}={\bf v}^{t}-Q_{\triangle}({\bf v}^{t}).\end{split}

In order to keep as more information of ${\bf v}^{t}$ as possible, the quantization method is required to minimize the Euclidean distance between ${\bf v}^{t}$ and $Q_{\triangle}({\bf v}^{t})$ , i.e.,

(21)

\left\{\begin{array}[]{lr}s^{*},{\bf v}^{*}=\arg min_{s,{\bf v}}||{\epsilon_{t}}||_{2}^{2}\\ s.t.\quad s\geq 0,{\bf v}_{i}\in\{-1,0,1\},i=1,2,...,n.\end{array}\right.

Obviously, the above problem can be transformed to the following formulation:

(22)

s^{*},\Delta^{*}=\arg min_{s\geq 0,\Delta>0}|I_{\Delta}|s^{2}-2s\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|+\sum_{i=1}^{n}({\bf v}^{t}_{i})^{2},

where $I_{\Delta}=\{i||{\bf v}^{t}_{i}|>\Delta\}$ and $|I_{\Delta}|$ denotes the number of elements in $I_{\Delta}$ . Thus, for any given $\Delta$ , the optimal $s$ can be computed as follows,

(23)

s^{*}_{\Delta}=\frac{1}{|I_{\Delta}|}\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|.

The optimal $\Delta$ can be computed as follows,

(24)

\Delta^{*}=\arg max_{\Delta>0}\frac{1}{|I_{\Delta}|}(\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|)^{2}.

To solve the above optimal threshold, we can rank the components of the gradient and treat them as potential thresholds. For each potential threshold, we calculate the corresponding objective value $\frac{1}{|I_{\Delta}|}(\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|)^{2}$ and take the potential threshold that maximizes the objective value as the optimal threshold. The computational complexity of this process is $O(d\log d)$ for $d$ -dimensional gradients, where the major cost is the sorting over all elements of gradients.

3.5. Threshold Approximation

We try to improve the computational efficiency of the threshold quantization procedure without harming the optimality of the coding described in Eq. (24). This improvement begins with the assumption that the gradient follows the Gaussian distribution. In case of ${\bf v}^{t}_{i}$ follows $N(0,\sigma^{2})$ , Li and Liu (TWN, ) have given an approximate solution for the optimal threshold $\triangle^{*}$ by $0.6\sigma$ which equals $0.75\cdot\mathbb{E}(|{\bf v}^{t}_{i}|)\approx\frac{0.75}{d}\sum_{i=1}^{d}|{\bf v}_{i}^{t}|$ . We also find in our experiment that most of the gradients satisfy the previous assumption. Fig. 4 shows the experimental result for training AlexNet (krizhevsky2012imagenet, ) on two workers. The left column visualizes the first convolutional layer and the right one visualizes the first fully-connected layer. The distribution of the original floating gradients is close to the Gaussian distribution for both convolutional and fully-connected layers. Based on this observation, we simply use $\frac{0.75}{d}\sum_{i=1}^{d}|{\bf v}_{i}^{t}|$ to approximate the optimal threshold to avoid the expensive cost of solving the optimal threshold $\Delta^{*}$ every iteration.

Encode. The gradient vectors need to be encoded after the threshold quantization. Specifically, we use 2 bits to encode ${\bf v}_{i}$ , and one floating-point number to represent the scaling factor s. For the dense model, the overall communication cost is $(32+2d)$ bits, where $d$ denotes the parameter dimension. The communication cost of threshold quantization is the same as Terngrad. But for the sparse model, assume the number of non-zero parameters is $k$ $(k\ll d)$ , the communication cost of QCMD adagrad and QRDA adagrad is $(32+d+2k)$ bits. Since we need at least $d$ bits to indicate which components of the parameter are non-zero.

4. Convergence Analysis

Two aspects are taken into account to evaluate the distributed optimization algorithm, the number of bits sent and received by the workers (communication complexity) and the number of parallel iterations required for convergence (round complexity). In this section, we theoretically analyze the proposed QCMD adagrad and QRDA adagrad in terms of the convergence rate of regret.

To obtain the regret bound for QCMD adagrad and QRDA adagrad, we provide the follow lemma, which has been proved in Lemma 4 of (duchi2011adaptive, ).

Lemma 4.1.

For any $\delta\geq 0$ , the Mahalanobis norm $||\cdot||_{\psi_{t}}=\sqrt{<\cdot,{\bf H}_{t}\cdot>}$ and $||\cdot||_{\psi^{*}_{t}}=\sqrt{<\cdot,\frac{1}{{\bf H}_{t}}\cdot>}$ be the associated dual norm, we have

\begin{split}\frac{1}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}\leq\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

Lemma 4.2.

For any $\delta\geq\max_{t}||{\bf q}_{t}||_{\infty}$ , $||\cdot||_{\psi^{*}_{t}}=\sqrt{<\cdot,\frac{\eta}{2{\bf H}_{t}}\cdot>}$ , we have

\begin{split}\frac{1}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\leq\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

Combining the above arguments with Proposition.1 and Proposition.2, we have the following theorem.

Theorem 4.3.

For QCMD adagrad, let $D_{\infty}=\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||_{\infty}$ , $G_{\infty}=\max_{t\leq T,i\leq d}||{\bf q}_{1:T,i}||_{2}$ , the regularizer $\phi({\bf x})=\lambda||{\bf x}||_{1}$ , where $\lambda\geq 0$ . Assume $Q(\cdot)$ is an unbiased quantization function, $f_{t}({\bf x})$ is L-smooth function, learning rate $\eta=\frac{2}{1+2L}$ , $\psi_{t}(x)$ is 1-strongly convex function, the regret is as below

\begin{split}&\quad\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{dG_{\infty}}{\sqrt{T}}+\frac{dG_{\infty}D_{\infty}}{2\eta\sqrt{T}}\end{split}

Proof.

See Appendix for the proof. ∎

Theorem 4.4.

For QRDA adagrad, let $D_{\infty}=\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||_{\infty}$ , $G_{\infty}=\max_{t\leq T,i\leq d}||{\bf q}_{1:T,i}||_{2}$ , $Q(\cdot)$ be an unbiased quantization function, the regularizer $\phi({\bf x})=\lambda||{\bf x}||_{1}$ , where $\lambda\geq 0$ . Then the regret is as below

\begin{split}&\quad\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{\delta||{\bf x}^{*}||_{2}^{2}}{\eta T}+\frac{(\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}+\frac{\eta^{2}}{2})dG_{\infty}}{\sqrt{T}}\end{split}

Proof.

See Appendix for the proof. ∎

Remark: The convergence rate of QCMD adagrad and QRDA adagrad is comparable with CMD adagrad and RDA adagrad respectively, in terms of $O(\frac{1}{\sqrt{T}})$ . The quantization error affects the convergence. Only the quantization error is sufficiently small, the proposed methods behave similarly with CMD adagrad and RDA adagrad. The convergence of $f({\bf x})$ determines the accuracy of models and $\phi({\bf x})$ determines the sparsity of models. In QCMD adagrad, the learning rate $\eta$ and the coefficient of regularization $\lambda$ control the sparsity of models, while QRDA adagrad controls the sparsity of models only through $\lambda$ .

5. Experiments

5.1. Experimental Settings

Table 1. Data statistics

Data Set	Number of	Number of	Dimension of
Data Set	Training Data	Test Data	Input Feature
news20	15,994	4,000	1,355,191
rcv1	20,242	677,399	47,236
MNIST	60,000	10,000	12828
CIFAR-10	50,000	10,000	33232

In this section, we first conduct experiments on linear models to validate the effectiveness and efficiency of the proposed QCMD adagrad and QRDA adagrad for binary classification problems. After that, we use our proposed method to train convolutional neural networks, in order to validate the performance of them on non-convex problems.

Baselines. We compare QCMD adagrad and QRDA adagrad with several sparse model distributed optimization methods, including 32 bits Prox-gd(duchi2010composite, ), 32 bits CMD adagrad(duchi2011adaptive, ), 32 bits RDA adagrad(duchi2011adaptive, ) and their corresponding ternary variants(wen2017terngrad, ). ${\dagger}$ is to mark the methods that only the local gradients are quantized but the synchronous gradient.

Implementation Details. All the experiments are carried out in a distributed framework that the network bandwidth is 100MBps. For the linear models training, each worker only utilizes the CPUs. For the training of convolutional neural networks, each worker is allocated with 1 NVIDIA Tesla P40 GPU. The methods are evaluated on four publicly available datasets. news20 and rcv1 are text datasets with a high dimension of input features from LIBSVM (chang2001libsvm, ). MNIST (lecun1998gradient, ) is for handwritten digits classification problem and CIFAR10 (krizhevsky2009learning, ) is for image classification problem. Table 1 shows the details of datasets. For news20 and rcv1, the $\ell_{1}$ norm regularized logistic regression model are trained. As for multi-classification problems, we train LeNet for MNIST(lecun1998gradient, ) and AlexNet (krizhevsky2012imagenet, ) for CIFAR10. To generate a network with large sparsity, the batch normalization layer is added before each convolution layer and fully connected layer. The code is implemented via tensorflow. Experimental results are averaged over 5 runs with a random initialization seed.

Table 2. Settings of hyperparameters

news20
Method	Learning	Coefficient of	Batch size	Number of
Method	rate	regularization	per worker	worker
Prox-gd	0.1	0.001	20	2
QCMD adagrad	0.02	0.00001	20	2
QRDA adagrad	0.02	0.1	20	2
rcv1
Method	Learning	Coefficient of	Batch size	Number of
Method	rate	regularization	per worker	worker
Prox-gd	1.0	0.0001	20	2
QCMD adagrad	1.0	0.000005	20	2
QRDA adagrad	0.1	0.5	20	2
MNIST
Method	Learning	Coefficient of	Batch size	Number of
Method	rate	regularization	per worker	worker
Prox-gd	0.001	0.005	16	4
QCMD adagrad	0.004	0.0001	16	4
QRDA adagrad	0.01	0.001	16	4
CIFAR-10
Method	Learning	Coefficient of	Batch size	Number of
Method	rate	regularization	per worker	worker
Prox-gd	0.01	0.001	64	2
QCMD adagrad	0.01	0.0002	64	2
QRDA adagrad	0.01	0.0004	64	2

Table 3. Comparisons on four datasets in terms of three metrics.

{\dagger}

is to mark the methods that only quantize the local gradient but the synchronous gradient.

news20
Method	Accuracy(%)	Sparsity(%)	Error
32-bits Prox-gd	73.92	56.63	-
32-bits CMD adagrad	97.50	85.76	-
Ternary CMD adagrad^†	96.50	82.64	3.13e-09
Threshold CMD adagrad^†	97.46	84.59	1.49e-09
Ternary CMD adagrad	96.00	80.30	5.24e-09
Threshold CMD adagrad	97.19	83.83	2.39e-09
32-bits RDA adagrad	97.21	98.21	-
Ternary RDA adagrad^†	95.88	97.52	1.63e-08
Threshold RDA adagrad^†	97.16	98.26	2.92e-09
Ternary RDA adagrad	95.50	97.05	2.65e-08
Threshold RDA adagrad	97.09	98.18	2.93e-09
rcv1
Method	Accuracy(%)	Sparsity(%)	Error
32-bits Prox-gd	91.96	33.87	-
32-bits CMD adagrad	95.15	72.66	-
Ternary CMD adagrad^†	94.97	71.70	5.07e-08
Threshold CMD adagrad^†	95.16	73.76	1.40e-08
Ternary CMD adagrad	94.87	70.12	7.77e-08
Threshold CMD adagrad	94.99	73.18	1.45e-08
32-bits RDA adagrad	93.41	97.56	-
Ternary RDA adagrad^†	91.14	96.86	1.73e-07
Threshold RDA adagrad^†	93.21	97.54	2.32e-08
Ternary RDA adagrad	90.90	96.43	3.03e-07
Threshold RDA adagrad	93.01	97.39	2.62e-08
MNIST
Method	Accuracy(%)	Sparsity(%)	Error
32-bits Prox-gd	94.17	7.86	-
32-bits CMD adagrad	98.28	17.19	-
Ternary CMD adagrad^†	98.22	9.61	3.52e-3
Threshold CMD adagrad^†	98.34	22.08	2.77e-3
Ternary CMD adagrad	97.45	10.19	4.82e-3
Threshold CMD adagrad	98.46	21.78	2.95e-3
32-bits RDA adagrad	97.85	98.27	-
Ternary RDA adagrad^†	97.57	95.27	1.30e-1
Threshold RDA adagrad^†	97.84	98.59	1.98e-2
Ternary RDA adagrad	97.32	90.39	1.51e-1
Threshold RDA adagrad	97.70	98.69	2.32e-2
CIFAR-10
Method	Accuracy(%)	Sparsity(%)	Error
32-bits Prox-gd	72.73	76.51	-
32-bits CMD adagrad	86.63	80.50	-
Ternary CMD adagrad^†	85.27	0.42	1.18e-1
Threshold CMD adagrad^†	86.23	62.51	2.45e-2
Ternary CMD adagrad	84.37	0.20	2.58e-1
Threshold CMD adagrad	84.74	39.95	2.60e-2
32-bits RDA adagrad	86.49	87.11	-
Ternary RDA adagrad^†	84.09	80.05	8.70e-1
Threshold RDA adagrad^†	85.57	87.65	2.17e-2
Ternary RDA adagrad	82.84	75.46	9.90e-1
Threshold RDA adagrad	84.22	81.86	2.77e-2

Hyperparameter Discussion. Table 2 lists the related hyperparameters of the sparse model training in distributed settings, including the learning rate $\eta$ , coefficient of regularization $\lambda$ , batch size per worker and the number of the worker. The optimal choice of $\eta$ and $\lambda$ can vary somewhat from dataset to dataset. The structure and random initial of the network also affect them. In order to find the appropriate hyperparameters, we first select the best learning rate based on cross-validations without the regularization and then find the coefficient of regularization that maximizes the sparsity of the model without reducing the accuracy as much as possible. For ease of comparison, we set the same hyper parameters for corresponding ternary quantization methods. Fig. 5 shows the accuracy and model sparsity as the $\eta$ and $\lambda$ change on MINIST. The sparsity increase with the $\lambda$ . A strong coefficient of regularization $\lambda$ enforces a highly sparse model which may deteriorate the accuracy. When the learning rate is high or $\lambda$ is large, it is easy to cause oscillation for QCMD adagrad. For QCMD adagrad, the model sparsity increases not only with the coefficient of regularization $\lambda$ but also the learning rate $\eta$ . For QRDA adagrad, model sparsity is mainly affected by $\lambda$ , while the learning rate has less effect.

5.2. Results and Discussions

We summarize three metrics including the accuracy, model sparsity and error of the quantized gradient in Table 3, and report the time consumption in Fig. 6. From these results, we draw several conclusions.

Firstly, the sparsity of model on the four datasets is 98.18%, 97.39%, 98.69% and 81.86% for QRDA adagrad, respectively. QRDA adagrad is easier to generate extremely sparse models than QCMD adagrad, since it uses the same evaluation criteria $\lambda$ for each dimension of the model parameters to determine whether to retain the parameter values. In the case of highly sparse model for QRDA adagrad, the accuracy decreases slightly compared with QCMD adagrad whose sparseness of model is relatively low. For example, on MNIST, the accuracy for threshold QRDA adagrad achieves 97.70%, which is $0.76\%$ lower than that of threshold QCMD adagrad.

Secondly, although the sparsity of the model generated by QCMD adagrad is limited. However, it still performs better than 32 bits Prox-gd in accuracy and sparsity, since the proximal function $\psi$ use the Bregman divergence to keep the parameter ${\bf x}_{t+1}$ close to ${\bf x}_{t}$ and it adaptively retains the parameter values for each dimension.

Thirdly, the noise introduced by the quantization method affects the convergence of accuracy and sparsity of the model. The one-way quantization (only the local gradients are quantized instead of both local gradients and the synchronous gradient) introduces less noise than the double quantization. The error of threshold quantization is smaller than Terngrad’s and it preserves sparsity and accuracy of the model both for QCMD adagrad and QRDA adagrad. Specifically, for threshold quantization, the accuracy and sparsity of the model for double quantized scheme are relatively close to 32 bits gradient scheme’s, while ternary quantization reduces the sparsity and accuracy of the model on these 4 datasets.

Lastly, double quantization reduces most of the communication cost. Fig. 6 decomposes the time consumption. For example, double quantized threshold QRDA can save $96.06\%$ of the total training time on CIFAR10. Threshold QRDA (only the local gradients are quantized) saves $48.03\%$ of the total training time on CIFAR10. The QRDA adagrad consumes the least communication cost since it generates a highly sparse model.

6. Conclusion

In this paper, we present distributed QCMD adagrad and QRDA adagrad to accelerate large-scale machine learning. QCMD adagrad and QRDA adagrad combine the gradient quantization and sparse model to reduce the communication cost per iteration in distributed training. We Theoretical analysis that the convergence rate of QCMD adagrad and QRDA adagrad is comparable with CMD adagrad and RDA adagrad respectively. Considering that a large error of quantization gradients affects the convergence of QCMD adagrad and QRDA adagrad, we adopt the threshold quantization with less quantized error to preserve the model performance and sparsity. The encouraging empirical results on linear models and convolutional neural networks demonstrate the efficacy and efficiency of the proposed algorithms in balancing communication cost, accuracy, and model sparsity.

References

(1) F. Li and B. Liu, “Ternary weight networks,” CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
(2) Y. Zhang, P. Zhao, J. Cao, W. Ma, J. Huang, Q. Wu, and M. Tan, “Online adaptive asymmetric active learning for budgeted imbalanced data,” in SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2768–2777.
(3) P. Zhao, Y. Zhang, M. Wu, S. C. Hoi, M. Tan, and J. Huang, “Adaptive cost-sensitive online classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 2, pp. 214–228, 2018.
(4) Y. Zhang, H. Chen, Y. Wei, P. Zhao, J. Cao, X. Fan, X. Lou, H. Liu, J. Hou, X. Han et al., “From whole slide imaging to microscopy: Deep microscopy adaptation network for histopathology cancer image classification,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019.
(5) D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017.
(6) J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
(7) J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” arXiv preprint arXiv:1710.09854, 2017.
(8) J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan, “Multi-marginal wasserstein gan,” in Advances in Neural Information Processing Systems, 2019.
(9) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
(10) J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quantized sgd and its applications to large-scale distributed optimization,” in International Conference on Machine Learning. PMLR, 2018, pp. 5325–5333.
(11) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns,” in Annual Conference of the International Speech Communication Association, 2014.
(12) A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” arXiv preprint arXiv:1704.05021, 2017.
(13) A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
(14) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
(15) R. Bekkerman, M. Bilenko, and J. Langford, Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
(16) S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
(17) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
(18) C.-J. Hsieh, S. Si, and I. S. Dhillon, “Communication-efficient distributed block minimization for nonlinear kernel machines,” in SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2017, pp. 245–254.
(19) M. F. Balcan, Y. Liang, L. Song, D. Woodruff, and B. Xie, “Communication efficient distributed kernel principal component analysis,” in SIGKDD International Conference on Knowledge Discovery & Data Mining, 2016.
(20) C.-p. Lee, C. H. Lim, and S. J. Wright, “A distributed quasi-newton algorithm for empirical risk minimization with nonsmooth regularization,” in SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1646–1655.
(21) R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in Neural Information Processing Systems, 2013, pp. 315–323.
(22) T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: building an efficient and scalable deep learning training system.” in USENIX Symposium on Operating Systems Design and Implementation, vol. 14, 2014, pp. 571–582.
(23) E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu, “Petuum: A new platform for distributed machine learning on big data,” IEEE Transactions on Big Data, vol. 1, no. 2, pp. 49–67, 2015.
(24) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
(25) J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, “Composite objective mirror descent,” in Annual Conference on Learning Theory, 2010.
(26) C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “How to escape saddle points efficiently,” in International Conference on Machine Learning, 2017, pp. 1724–1732.
(27) R. Kleinberg, Y. Li, and Y. Yuan, “An alternative view: When does sgd escape local minima?” arXiv preprint arXiv:1802.06175, 2018.
(28) V. Shah, M. Asteris, A. Kyrillidis, and S. Sanghavi, “Trading-off variance and complexity in stochastic gradient descent,” arXiv preprint arXiv:1603.06861, 2016.
(29) Y. Singer and J. C. Duchi, “Efficient learning using forward-backward splitting,” in Advances in Neural Information Processing Systems, 2009, pp. 495–503.
(30) L. Xiao, “Dual averaging methods for regularized stochastic learning and online optimization,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2543–2596, 2010.
(31) C.-C. Chang and C.-J. Lin, “ $\{$ LIBSVM $\}$ : a library for support vector machines (version 2.3),” 2001.
(32) M. Li, “Scaling distributed machine learning with system and algorithm co-design,” Ph.D. dissertation, PhD thesis, Intel, 2017.
(33) X. Zhang, L. Zhao, A. P. Boedihardjo, and C.-T. Lu, “Online and distributed robust regressions under adversarial data corruption,” in IEEE International Conference on Data Mining, 2017, pp. 625–634.
(34) Z. Huo, X. Jiang, and H. Huang, “Asynchronous dual free stochastic dual coordinate ascent for distributed data mining,” in IEEE International Conference on Data Mining, 2018.
(35) S. Gupta, W. Zhang, and F. Wang, “Model accuracy and runtime tradeoff in distributed deep learning: A systematic study,” in IEEE International Conference on Data Mining, 2016, pp. 171–180.
(36) S. De and T. Goldstein, “Efficient distributed sgd with variance reduction,” in IEEE International Conference on Data Mining, 2016.
(37) H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu, “Doublesqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression,” in International Conference on Machine Learning. PMLR, 2019, pp. 6155–6165.
(38) M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Transactions on Signal Processing, vol. 68, pp. 2155–2169, 2020.
(39) S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified sgd with memory,” Advances in Neural Information Processing Systems, vol. 31, 2018.
(40) Y. Ko, K. Choi, H. Jei, D. Lee, and S.-W. Kim, “Aladdin: Asymmetric centralized training for distributed deep learning,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 863–872.
(41) N. Eshraghi and B. Liang, “Distributed online optimization over a heterogeneous network with any-batch mirror descent,” in International Conference on Machine Learning. PMLR, 2020, pp. 2933–2942.
(42) X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” Advances in Neural Information Processing Systems, vol. 30, 2017.
(43) R. Xin, S. Kar, and U. A. Khan, “Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 102–113, 2020.
(44) F. Alimisis, P. Davies, and D. Alistarh, “Communication-efficient distributed optimization with quantized preconditioners,” in International Conference on Machine Learning. PMLR, 2021, pp. 196–206.
(45) S. Magnússon, H. Shokri-Ghadikolaei, and N. Li, “On maintaining linear convergence of distributed learning and optimization under limited communication,” IEEE Transactions on Signal Processing, vol. 68, pp. 6101–6116, 2020.
(46) P. Davies, V. Gurunathan, N. Moshrefi, S. Ashkboos, and D. Alistarh, “New bounds for distributed mean estimation and variance reduction,” International Conference on Learning Representations, 2021.

Appendix A Appendeix

A.1. Proof of Proposition 1

Proof.

Assume $f_{t}({\bf x})$ is L-smoothness, one obtains

\begin{split}f_{t}({\bf x}_{t+1})-f_{t}({\bf x}_{t})\leq f^{\prime}_{t}({\bf x}_{t})^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{L}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}\\ \end{split}

Let $\psi_{t}({\bf x})=\frac{1}{2}{\bf x}^{\top}{\bf H}_{t}{\bf x}$ , the Mahalanobis norm $||\cdot||_{\psi_{t}}=\sqrt{<\cdot,{\bf H}_{t}\cdot>}$ and $||\cdot||_{\psi_{t}^{*}}$ be the associated dual norm. Introduce the quantized gradient, we have

\begin{split}&\quad f_{t}({\bf x}_{t+1})-f_{t}({\bf x}_{t})\\ &\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{L}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}+(f^{\prime}_{t}({\bf x}_{t})-{\bf q}_{t})^{\top}({\bf x}_{t+1}-{\bf x}_{t})\\ &\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{L}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}+\frac{1}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}_{\psi_{t}}\\ &\quad+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}\\ \end{split}

The second inequality follows from Cauchy-Schwarz (with $2ab\leq xa^{2}+\frac{b^{2}}{x}$ for any $x>0$ ). $B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})=\frac{1}{2}({\bf x}_{t+1}-{\bf x}_{t})^{\top}{\bf H}_{t}({\bf x}_{t+1}-{\bf x}_{t})=\frac{1}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}_{\psi_{t}}$ is Bregman divergence. Let $\psi_{t}$ be a mirror map 1-strongly convex on $\chi$ , then

0\leq\frac{1}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}\leq B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})

So,

\begin{split}\quad f_{t}({\bf x}_{t+1})&-f_{t}({\bf x}_{t})\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})\\ &+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}+(\frac{1}{2}+L)B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})\end{split}

Since $\psi_{t}^{\prime}({\bf x}_{t})-\psi_{t}^{\prime}({\bf x}_{t+1})={\bf H}_{t}({\bf x}_{t}-{\bf x}_{t+1})=\eta{\bf q}_{t}+\eta\phi^{\prime}({\bf x}_{t+1})$ ,

\begin{split}&\quad\eta[\phi^{\prime}({\bf x}_{t+1})+{\bf q}_{t}]^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ &=\left(\psi_{t}^{\prime}({\bf x}_{t})-\psi_{t}^{\prime}({\bf x}_{t+1})\right)^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ &=B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})\end{split}

If we set the learning rate $\eta=\frac{2}{1+2L}$ , we have

\begin{split}&\quad f_{t}({\bf x}_{t+1})-f_{t}({\bf x}_{t})\\ &\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}\\ &\quad+\frac{1}{\eta}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &\quad-{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}^{*})-\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ &={\bf q}_{t}^{\top}({\bf x}^{*}-{\bf x}_{t})+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}\\ &\quad+\frac{1}{\eta}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &\quad-\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})\end{split}

From $f_{t}({\bf x}^{*})+\phi({\bf x}^{*})-[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})+(f^{\prime}_{t}({\bf x}_{t})+\phi^{\prime}({\bf x}_{t}))^{\top}({\bf x}^{*}-{\bf x}_{t})]\geq 0$ ,

\begin{split}&\quad f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})\\ &\leq{\bf q}_{t}^{\top}({\bf x}^{*}-{\bf x}_{t})-f^{\prime}_{t}({\bf x}_{t})^{\top}({\bf x}^{*}-{\bf x}_{t})+\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ &\quad+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}\\ &\quad+\frac{1}{\eta}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))-\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ \end{split}

If $\mathbb{E}[{\bf q}_{t}]=f^{\prime}_{t}({\bf x}_{t})$ , we take expectation on both sides of the inequality, we obtains

\begin{split}&\quad\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq 0+\frac{1}{2}\mathbb{E}_{\bf q}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &\leq\frac{1}{2}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ \end{split}

Sum the inequality,

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\sum_{t=1}^{T}\mathbb{E}_{\bf q}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &=\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}B_{\psi_{1}}({\bf x}^{*},{\bf x}_{1})\\ &\quad+\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[B_{\psi_{t+1}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})]\\ \end{split}

∎

A.2. Proof of Theorem 1

Proof.

The third item in the conclusion of Proposition 1 can be transformed as below:

\begin{split}&\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[B_{\psi_{t+1}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})]\\ =&\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[\frac{1}{2}({\bf x}^{*}-{\bf x}_{t+1})^{\top}({\bf H}_{t+1}-{\bf H}_{t})({\bf x}^{*}-{\bf x}_{t+1})]\\ \leq&\frac{1}{2\eta}\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||f^{\prime}({\bf x})_{1:T,i}||_{2}\\ &\quad-\frac{1}{2\eta}||{\bf x}^{*}-{\bf x}_{1}||_{\infty}^{2}\mathbb{E}_{\bf q}<{\bf h}_{1},\bf{1}>\\ \end{split}

$\bf{1}$ is a vector whose elements are all 1. Since $\mathbb{E}_{\bf q}B_{\psi_{1}}({\bf x}^{*},{\bf x}_{1})\leq\frac{1}{2}||{\bf x}^{*}-{\bf x}_{1}||^{2}_{\infty}\mathbb{E}_{\bf q}<{\bf h}_{1},\bf{1}>$ ,

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{2\eta}\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\\ \end{split}

$\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}\leq\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}$ has been proved in Lemma 4 of (duchi2011adaptive, ), then

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}+\frac{1}{2\eta}\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\\ \end{split}

We define that $D_{\infty}=\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||_{\infty}$ , $G_{\infty}=\max_{t\leq T,i\leq d}||{\bf q}_{1:T,i}||_{2}$ .

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq d\sqrt{T}G_{\infty}+\frac{1}{2\eta}D_{\infty}d\sqrt{T}G_{\infty}\\ \end{split}

Jensen’s inequality:

\begin{split}&\quad\mathbb{E}_{\bf q}[f(\bar{{\bf x}})+\phi(\bar{{\bf x}})]-f({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{dG_{\infty}}{\sqrt{T}}+\frac{dG_{\infty}D_{\infty}}{2\eta\sqrt{T}}\end{split}

∎

A.3. Proof of Proposition 2

Proof.

Define $\psi_{t}^{*}({\bf g})$ to be the conjugate dual of $t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})$ :

(25)

\begin{split}\psi^{*}_{t}({\bf g})=\sup_{{\bf x}\in\chi}\{<{\bf g},{\bf x}>-t\phi({\bf x})-\frac{1}{\eta}\psi_{t}({\bf x})\}\end{split}

If set $\phi({\bf x})=\lambda||{\bf x}||_{1}$ , then

\begin{split}\psi^{*}_{ti}({\bf g})=\left\{\begin{array}[]{rcl}({\bf g}_{i}-t\lambda)^{2}\frac{\eta}{4{\bf H}_{t,ii}},&&{g_{i}-t\lambda>0}\\ ({\bf g}_{i}+t\lambda)^{2}\frac{\eta}{4{\bf H}_{t,ii}},&&{g_{i}+t\lambda<0}\\ 0,&&{otherwise}\\ \end{array}\right.\end{split}

So, $||\cdot||_{\psi_{t}^{*}}=\sqrt{<\cdot,\frac{\eta}{2{\bf H}_{t}}\cdot>}$ .

Since $\frac{\psi_{t}}{\eta}$ is $\frac{1}{\eta}$ -strongly convex with respect to the norm $||\cdot||_{\psi_{t}}$ , the function $\psi_{t}^{*}$ is $\eta$ -smooth with respect to $||\cdot||_{\psi^{*}_{t}}$ .

\psi_{t}^{*}({\bf y})\leq\psi_{t}^{*}({\bf x})+\triangledown\psi^{*}_{t}({\bf x})({\bf y}-{\bf x})+\frac{\eta}{2}||{\bf y}-{\bf x}||^{2}_{\psi^{*}_{t}}

Further, a simple argument with the fundamental theorem of the conjugate dual is

(26)

\triangledown\psi^{*}_{t}({\bf g})=\arg\min_{{\bf x}\in\chi}\{-<{\bf g},{\bf x}>+t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})\}

If ${\bf z}_{t}=\sum_{\tau=1}^{t}{\bf q}_{\tau}$ , we have

(27)

\begin{split}\psi^{\prime*}_{t}(-{\bf z}_{t})&=\arg\min_{{\bf x}\in\chi}\{<\sum_{\tau=1}^{t}{\bf q}_{\tau},{\bf x}>+t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})\}\\ &={\bf x}_{t+1}\end{split}

To obtain the regret for quantized rda adagrad, we first calculate

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}({\bf x}_{t}-{\bf x}^{*})+\phi({\bf x}_{t+1})-\phi({\bf x}^{*})]\\ &\leq\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\sup_{{\bf x}\in\chi}\left\{-\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}]-T\phi({\bf x})-\frac{1}{\eta}\psi_{T}({\bf x})\right\}\\ &=\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\\ \end{split}

Since Equation (25) and Equation (27), it is clear that

\begin{split}\psi_{T}^{*}(-{\bf z}_{T})&=-\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{T+1}]-T\phi({\bf x}_{T+1})-\frac{1}{\eta}\psi_{T}({\bf x}_{T+1})\\ &\leq-\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{T+1}]-(T-1)\phi({\bf x}_{T+1})-\phi({\bf x}_{T+1})-\frac{1}{\eta}\psi_{T-1}({\bf x}_{T+1})\\ &\leq\sup_{{\bf x}\in\chi}\left\{-{\bf z}_{T}^{\top}{\bf x}-(T-1)\phi({\bf x})-\frac{1}{\eta}\psi_{T-1}({\bf x})\right\}-\phi({\bf x}_{T+1})\\ &=\psi^{*}_{T-1}(-{\bf z}_{T})-\phi({\bf x}_{T+1})\end{split}

The first inequality above follows from $\psi_{t+1}\geq\psi_{t}$ . Further, the identity (27) and the fact that ${\bf z}_{T}-{\bf z}_{T-1}=-{\bf q}_{T}$ give

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\\ &\leq\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T-1}(-{\bf z}_{T})-\phi({\bf x}_{T+1})\\ &\leq\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})-\phi({\bf x}_{T+1})\\ &\quad+{\psi^{*}}_{T-1}(-{\bf z}_{T-1})-{\psi^{\prime}_{T-1}}^{*}{(-{{\bf z}_{T-1})}^{\top}}{\bf q}_{T}+\frac{\eta}{2}||{\bf q}_{T}||^{2}_{\psi^{*}_{T-1}}\\ &=\sum_{t=1}^{T-1}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T-1}(-{\bf z}_{T-1})+\frac{\eta}{2}||{\bf q}_{T}||^{2}_{\psi^{*}_{T-1}}\\ \end{split}

We can repeat the same sequence of steps to see that

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\\ &\leq\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi_{0}^{*}(-{\bf z}_{0})+\frac{\eta}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\\ &=\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\frac{\eta}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\\ \end{split}

Take expectation on both sides of the inequality, we get the results in proposition 2. ∎

A.4. Proof of Theorem 2

Proof.

We set $\delta\geq\max_{t}||{\bf q}_{t}||^{2}_{\infty}$ , ${\bf H}_{t}=\delta{\bf I}+diag({\bf c}_{t})$ , in which case

\begin{split}||{\bf q}_{t}||^{2}_{\psi_{t-1}^{*}}\leq\frac{\eta}{2}{\bf q}_{t}^{\top}diag({\bf c}_{t})^{-1}{\bf q}_{t}\end{split}

\begin{split}\sum_{t=1}^{T}{\bf q}_{t}^{\top}diag({\bf c}_{t})^{-1}{\bf q}_{t}&\leq 2\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\\ \end{split}

\begin{split}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}&\leq\eta\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

So,

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\\ &\leq\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\frac{\eta^{2}}{2}\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

We also have

\begin{split}\psi_{T}({\bf x}^{*})&=\delta||{\bf x}^{*}||_{2}^{2}+<{\bf x}^{*},diag({\bf c}_{T}){\bf x}^{*}>\\ &\leq\delta||{\bf x}^{*}||_{2}^{2}+||{\bf x}^{*}||^{2}_{\infty}\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

If $\mathbb{E}[{\bf q}_{t}]=f^{\prime}_{t}({\bf x}_{t})$ , take expectation for the quantization operation and set $G_{\infty}=\max_{t\leq T,i\leq d}||{\bf q}_{1:T,i}||_{2}$

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\sum_{t=1}^{T}\mathbb{E}_{\bf q}[(f^{\prime}_{t}({\bf x}_{t})-{\bf q}_{t}^{\top}({\bf x}_{t}-{\bf x}^{*})]\\ &\quad+\mathbb{E}_{\bf q}\left\{\sum_{t=1}^{T}[{\bf q}_{t}^{\top}({\bf x}_{t}-{\bf x}^{*})+\phi({\bf x}_{t+1})-\phi({\bf x}^{*})]\right\}\\ &\leq\mathbb{E}_{\bf q}\left\{\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\right\}\\ &\leq\frac{\delta}{\eta}||{\bf x}^{*}||_{2}^{2}+\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}+\frac{\eta^{2}}{2}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\\ &=\frac{\delta}{\eta}||{\bf x}^{*}||_{2}^{2}+(\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}+\frac{\eta^{2}}{2})dG_{\infty}\sqrt{T}\end{split}

Finally,

\begin{split}&\quad\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{\delta||{\bf x}^{*}||_{2}^{2}}{\eta T}+\frac{(\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}+\frac{\eta^{2}}{2})dG_{\infty}}{\sqrt{T}}\end{split}

∎