
Quantized Adaptive Subgradient Algorithms and Their Applications

Ke Xu, Jianqiao Wangni, Yifan Zhang, Deheng Ye, Jiaxiang Wu and Peilin Zhao
(2022)
Abstract.

Data explosion and increasing model size drive the remarkable advances in large-scale machine learning, but they also make model training time-consuming and model storage difficult. Addressing these issues in the distributed model training setting, which offers high computational efficiency and fewer device limitations, still faces two main difficulties. On the one hand, the communication cost for exchanging information, e.g., stochastic gradients among different workers, is a key bottleneck for distributed training efficiency. On the other hand, a model with fewer parameters is easier to store and communicate, but shrinking the model risks damaging its performance. To balance communication cost, model capacity, and model performance simultaneously, we propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual averaging adaptive subgradient (QRDA adagrad) for distributed training. Specifically, we explore the combination of gradient quantization and sparse models to reduce the communication cost per iteration in distributed training. An adaptive learning rate matrix built from quantized gradients is constructed to achieve a balance between communication cost, accuracy, and model sparsity. Moreover, we theoretically show that a large quantization error introduces extra noise, which influences the convergence and sparsity of the model. Therefore, a threshold quantization strategy with a relatively small error is adopted in QCMD adagrad and QRDA adagrad to improve the signal-to-noise ratio and preserve the sparsity of the model. Both theoretical analyses and empirical results demonstrate the efficacy and efficiency of the proposed algorithms.

Distributed training, adaptive subgradient, gradient quantization, sparse model.

1. Introduction

Large models have recently been extremely successful in machine learning and data mining with the growth of data volume. A great number of complex deep neural networks (zhao2018adaptive, ; zhang2018online, ; krizhevsky2012imagenet, ; zhang2019whole, ; cao2019multi, ) have been devised to solve real-world application problems. However, many practical applications can only provide limited computing resources, i.e., limited storage devices and unaccelerated hardware such as CPUs. These constraints limit the model complexity and make model training extremely time-consuming given the huge amount of training data. To address these issues, this paper explores how to accelerate model training and reduce model storage costs simultaneously for large-scale machine learning.

Distributed training offers a potential solution to the issue of long training time (balcan2016communication, ; chilimbi2014project, ; xing2015petuum, ; hsieh2017communication, ; zhang2017online, ). Data parallelism (li2017scaling, ; gupta2016model, ) is one of the most popular frameworks in distributed training. As shown in Fig. 1, the data parallelism framework has one or more computing workers connected via a communication network, while multiple model replicas are trained in parallel on the workers. A global parameter server ensures consistency among the replicas by collecting all gradients computed by the different workers and then averaging them to update the parameters. The goal is to optimize a global objective function formed by the average of a series of local loss functions derived from local computation on each worker.

Figure 1. Data parallelism in a typical distributed training scheme, where $g_{t}^{(m)}$ denotes the local gradient computed by the $m^{th}$ worker and $M$ denotes the total number of workers. The parameter server collects all local gradients from the workers. In addition, $\bar{g}_{t}$ is the averaged gradient, which is synchronized by the parameter server and pulled by each worker to update the model.

In the above distributed framework, time consumption is mainly caused by computation and communication. While increasing the number of workers helps to reduce computation time, the communication overhead for exchanging training information, e.g., stochastic gradients among different workers, is a key bottleneck for training efficiency, especially for high-latency communication networks. Even worse, some slow workers can adversely affect the progress of fast workers, leading to a drastic slowdown of the overall convergence process (eshraghi2020distributed, ). Asynchronous communication (huo2018asynchronous, ) alleviates the negative effect of slow machines, and decentralized algorithms (lian2017can, ; xin2020decentralized, ) remove the dependency on high communication costs at the central node, but the parameter variance among workers may deteriorate model accuracy (ko2021aladdin, ). In this paper, we concentrate on the data parallelism framework, which belongs to synchronous centralized algorithms. Much research focuses on how to save communication costs. Some studies (bekkerman2011scaling, ; de2016efficient, ) propose to reduce the number of training rounds. For example, one may use SVRG (johnson2013accelerating, ; shah2016trading, ) to periodically calculate an accurate gradient estimate to reduce the variance introduced by stochastic sampling, which, however, is an operation with high computational cost. Other methods focus on reducing the precision of gradients. DoReFa-Net (zhou2016dorefa, ) and QSGD (alistarh2017qsgd, ) quantize gradients into fixed-point numbers so that far fewer bits need to be transmitted. More aggressive quantization methods, such as 1-bit SGD (seide20141, ) and ternary gradients (wen2017terngrad, ), sacrifice a certain degree of expressive power to reduce communication costs. (tang2019doublesqueeze, ) studies the double squeeze of the gradient, in which not only the local gradients but also the synchronous gradient are compressed.

The size of the model is not only a determinant of memory usage but also an important factor in saving communication cost in distributed training. Although data parallelism transmits gradients instead of model parameters, for a sparse model we can avoid transmitting the unnecessary gradients corresponding to parameters that are always 0. The combination of the above bandwidth reduction methods and parallel stochastic gradient descent (PSGD) has been demonstrated to be effective for model training without model sparsity constraints. Therefore, inspired by online algorithms such as adaptive composite mirror descent and adaptive regularized dual averaging (duchi2011adaptive, ), which perform well in generating sparse models, we design two corresponding communication-efficient sparse model training algorithms for the distributed framework, named quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual averaging adaptive subgradient (QRDA adagrad).

To be specific, we define the distributed objective under a nonsmooth constraint to keep the model sparse, and we construct a proximal function based on the quantized gradient to achieve a balance between communication cost, accuracy, and model sparsity. Quantization is applied not only to the local gradients computed by the workers but also to the aggregated gradient, as in the double squeeze (tang2019doublesqueeze, ). We prove that the convergence rate for distributed QCMD adagrad and QRDA adagrad is $O(\frac{1}{\sqrt{T}})$. Besides, our theoretical analysis shows that quantization introduces additional noise which affects model convergence and sparsity. Hence, we apply a threshold quantization method with small error to the gradient to reduce the influence of noise on model training.

Our major contributions are summarized as follows:

  • We propose two communication-efficient sparse model training algorithms for the distributed framework, namely QCMD adagrad and QRDA adagrad.

  • We theoretically find that quantization noise affects the model convergence and sparsity, and we thus adopt a small-error quantization method to alleviate the influence of this noise. We provide theoretical results on the regret for QCMD adagrad and QRDA adagrad. The convergence rate is $O(\frac{1}{\sqrt{T}})$.

  • We apply QCMD adagrad and QRDA adagrad to both linear models and convolutional neural networks. Experimental results demonstrate the effectiveness and efficiency of the proposed methods in convex and non-convex problems.

2. Related Work

We detail related studies in three aspects as follows.

Gradient sparsification. Gradient sparsification (seide20141, ) imposes sparsity on gradients, so that only a small fraction of gradient elements are exchanged across workers based on their importance. Lin et al. (lin2017deep, ) find that most of the gradient exchange in distributed SGD is redundant and propose a deep gradient compression method that cuts the gradient size of ResNet-50 from 97MB to 0.35MB, and that of DeepSpeech from 488MB to 0.74MB. Aji et al. (aji2017sparse, ) sparsify gradients by removing the $R\%$ smallest gradients in absolute value. Similarly, (stich2018sparsified, ) propose sparsified SGD with top-k sparsification. Wangni et al. (wangni2017gradient, ) analyse how to achieve optimal sparsity under a given variance constraint while keeping the sparsified gradient vector unbiased.

Gradient quantization. Gradient quantization replaces the original gradient with a small number of fixed values. Quantized SGD (QSGD) (alistarh2017qsgd, ) adjusts the number of bits of the exchanged gradients to balance bandwidth and accuracy. More aggressive quantization methods, such as the binary representation (seide20141, ), reduce each component of the gradient to its sign. TernGrad (wen2017terngrad, ) uses three numerical levels {-1, 0, 1} and a scaler, e.g., the maximum norm or $l_{2}$ norm of the gradient, to replace the full precision gradient. This aggressive method can be regarded as a variant of QSGD. To reduce the influence of the noise introduced by aggressive quantization, Wu et al. (wu2018error, ) utilize the accumulated quantization error to compensate the quantized gradient. Several applications, such as federated machine learning (ML) at the wireless edge (amiri2020machine, ), benefit from this error compensation. In addition, (magnusson2020maintaining, ) introduce a family of adaptive gradient quantization schemes that enable linear convergence in any norm for gradient-descent-type algorithms. (alimisis2021communication, ) propose a quantized Newton's method suitable for ill-conditioned but low-dimensional problems, as it reduces the communication complexity through a trade-off between the dependency on the input feature dimension and the condition number. In (alimisis2021communication, ), lattice quantization (davies2021new, ), which reduces the variance, is adapted to quantize the covariance matrix.

Stochastic optimization. In modern implementations of large-scale machine learning algorithms, stochastic gradient descent (SGD) is commonly used as the optimization method in distributed training frameworks because of its universality and high computational efficiency per iteration. SGD intrinsically carries gradient noise, which helps to escape saddle points in non-convex problems, like neural networks (jin2017escape, ; kleinberg2018alternative, ). However, when producing a sparse model, simply adding a subgradient of the $l_{1}$ penalty to the gradient of the loss does not essentially produce parameters that are exactly zero. More sophisticated approaches such as composite mirror descent (singer2009efficient, ; duchi2010composite, ) and regularized dual averaging (xiao2010dual, ) do succeed in introducing sparsity, but the sparsity of the model is limited. Their adaptive subgradient extensions (Adagrad) with $l_{1}$ regularization (duchi2011adaptive, ) produce even better accuracy vs. sparsity tradeoffs. Compared with SGD, which is very sensitive to the learning rate, Adagrad (duchi2011adaptive, ) dynamically incorporates knowledge of the geometry of the data and the curvature of the loss function to adjust the learning rate of gradients. As a result, it requires no manual tuning of the learning rate and is robust to noisy gradient information and large-scale high-dimensional machine learning.

Figure 2. General framework of QCMD adagrad and QRDA adagrad. $\mathbb{I}$ is the indicator function defined in Eq. (11). $\hat{\bf x}_{t+1}^{(m)}$ is the local parameter updated based on $f^{\prime}_{t}({\bf x}_{t})^{(m)}$. $S(\cdot)$ is the selection function defined in Eq. (10). Nodes with colors denote the non-zero elements. The quantization function $Q(\cdot)$ quantizes its input. Both the indicators and the quantized selected gradients are sent to the server to enable decoding. Finally, the parameter server synchronizes the indicators and quantized selected gradients from all $M$ workers and sends them to the local workers for the next update step.

3. Problem Definition and our approach

Notations: Before the problem definition, we define the following notations. Lower case bold letters, such as ${\bf v}$, denote vectors. Matrices are capital case bold letters, like ${\bf A}$. Scalars are lower case italic letters, such as $s$. We use $[d]$ to denote the set $\{1,2,...,d\}$. For a vector ${\bf v}\in\mathbb{R}^{d}$, the $l_{q}$ norm is defined as $||{\bf v}||_{q}=(\sum_{i\in[d]}|{\bf v}_{i}|^{q})^{\frac{1}{q}}$, and $||{\bf v}||_{\infty}=\max_{i\in[d]}|{\bf v}_{i}|$. For a symmetric matrix ${\bf A}\in\mathbb{R}^{d\times d}$ and any vectors ${\bf x},{\bf y}\in\mathbb{R}^{d}$, the Bregman divergence associated with a strongly convex and differentiable function $\psi({\bf x})=\frac{1}{2}{\bf x}^{\top}{\bf A}{\bf x}$ is defined as $B_{\psi}({\bf x},{\bf y})=\frac{1}{2}({\bf x}-{\bf y})^{\top}{\bf A}({\bf x}-{\bf y})$. The Mahalanobis norm is $||{\bf x}||_{\psi}=({\bf x}^{\top}{\bf A}{\bf x})^{\frac{1}{2}}$ and its associated dual norm is $||{\bf x}||_{\psi^{*}}=({\bf x}^{\top}{\bf A}^{-1}{\bf x})^{\frac{1}{2}}$. $<{\bf x},{\bf y}>$ denotes the inner product between ${\bf x}$ and ${\bf y}$. We also make frequent use of the matrix ${\bf v}_{1:t}=[{\bf v}_{1}\cdot\cdot\cdot{\bf v}_{t}]$, obtained by concatenating a vector sequence. We denote the $i^{th}$ row of ${\bf v}_{1:t}$ by ${\bf v}_{1:t,i}$, which amounts to the concatenation of the $i^{th}$ component of each vector.
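Since all the matrices used below (e.g., the adaptive learning rate matrices) are diagonal, these quantities reduce to element-wise operations. The following NumPy sketch is only an illustrative reading of the notation under that diagonal assumption; the function names are ours, not the paper's code.

```python
import numpy as np

def bregman_divergence(x, y, a_diag):
    """B_psi(x, y) = 0.5 * (x - y)^T A (x - y) for psi(x) = 0.5 * x^T A x, diagonal A."""
    d = x - y
    return 0.5 * np.sum(a_diag * d * d)

def mahalanobis_norm(x, a_diag):
    """||x||_psi = (x^T A x)^{1/2} for diagonal A."""
    return np.sqrt(np.sum(a_diag * x * x))

def dual_norm(x, a_diag):
    """||x||_{psi*} = (x^T A^{-1} x)^{1/2}, the associated dual norm."""
    return np.sqrt(np.sum(x * x / a_diag))
```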

3.1. Problem Definition

Stochastic optimization (Lee:2018:DQA:3219819.3220075, ) is a popular approach for training large-scale machine learning models. We denote the objective function as $f({\bf x})$, depending on the model parameter ${\bf x}\in\chi$. The dataset is split across $M$ workers, each holding $N$ data samples. We denote $f_{m,n}$ as the loss function associated with the $n^{th}$ sample on the $m^{th}$ worker, so that $f({\bf x})$ is an average over all loss functions on all workers:

(1) $f({\bf x})=\frac{1}{M}\frac{1}{N}\sum_{m,n}f_{m,n}({\bf x})$

When training on a single machine, a stochastic optimization method like SGD randomly chooses a subset of samples ${\bf Z}_{t}$ and calculates the gradients $f^{\prime}_{t}({\bf x}_{t},{\bf z}_{t})$, where ${\bf z}_{t}\in{\bf Z}_{t}$, to approximate the true gradient $f^{\prime}_{t}({\bf x}_{t})$ at the $t^{th}$ iteration by

(2) $f^{\prime}_{t}({\bf x}_{t})=\frac{1}{|{\bf Z}_{t}|}\sum_{{\bf z}_{t}\in{\bf Z}_{t}}f^{\prime}_{t}({\bf x}_{t},{\bf z}_{t})$

Similarly, in a distributed setting, each worker calculates the stochastic gradient $f^{\prime}_{t}({\bf x}_{t})^{(m)}$ based on its local data ${\bf Z}_{t}^{(m)}$. The parameter server averages the local gradients to get the synchronized gradient,

(3) $SynG_{t}({\bf x}_{t})=\frac{1}{M}\sum_{m=1}^{M}f^{\prime}_{t}({\bf x}_{t})^{(m)}$

As previous work on gradient sparsification and quantization suggests, we can represent $f^{\prime}_{t}({\bf x}_{t})^{(m)}$ in a low-precision way. Since extreme gradient sparsification can also be seen as gradient quantization, this paper focuses on reducing communication costs through gradient quantization. We denote $Q(\cdot)$ as the quantization function; the synchronized quantized gradient is then obtained by

(4) $SynQ_{t}=\frac{1}{M}\sum_{m=1}^{M}Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})$

The update formula for synchronous distributed stochastic optimization with the quantization function $Q(\cdot)$ is

(5) ${\bf x}_{t+1}={\bf x}_{t}-\eta\frac{1}{M}\sum_{m=1}^{M}Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})$

where $\eta$ is the learning rate.
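For illustration, a minimal sketch of one step of Eqs. (4)-(5) is given below, assuming an unbiased quantizer `quantize` and a list `local_grads` of per-worker stochastic gradients (both hypothetical names, not from the paper's code).

```python
import numpy as np

def synchronous_quantized_step(x, local_grads, quantize, eta):
    # Each worker quantizes its local gradient before communication.
    quantized = [quantize(g) for g in local_grads]
    # The parameter server averages the quantized gradients, Eq. (4).
    syn_q = np.mean(quantized, axis=0)
    # Every worker applies the same update, Eq. (5).
    return x - eta * syn_q
```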

3.2. Quantized CMD Adagrad and Quantized RDA Adagrad

General Framework. In the data parallelism distributed training scheme, the number of exchanged gradients is equal to the number of learnable model parameters. But for a sparse model, we can avoid transmitting unnecessary gradient information, i.e., gradients corresponding to parameters that are 0 both before and after the model update. To this end, each worker uses its local full precision gradients to obtain the model parameters after the local update. Under the assumption that the model parameters change smoothly and remain sparse, we only need to transmit a small amount of the necessary gradients, which saves communication cost. Furthermore, this small amount of local gradients is quantized and encoded. After the global parameter server receives and decodes the quantized gradients, synchronous gradients are calculated by averaging the local sparse quantized gradients. The synchronous sparse gradients are then quantized and sent back to the workers for the model parameter update. The general framework of QCMD adagrad and QRDA adagrad is shown in Fig. 2.

Generate Sparse Model. Vanilla SGD is not particularly effective at producing a sparse model, even when adding a subgradient of the $l_{1}$ penalty to the gradient of the loss. Other approaches, such as proximal gradient descent (also named composite mirror descent) (singer2009efficient, ; duchi2010composite, ) and regularized dual averaging (xiao2010dual, ), introduce limited sparsity with $l_{1}$ regularization. Their adaptive subgradient extensions, CMD adagrad and RDA adagrad (duchi2011adaptive, ), produce better accuracy vs. sparsity tradeoffs.

The original proximal gradient method (duchi2010composite, ) employs an immediate trade-off among the current gradient $f_{t}^{\prime}({\bf x}_{t})$, the regularizer $\phi$ and a proximal function $\psi$. The proximal function $\psi$ aims to keep ${\bf x}$ close to ${\bf x}_{t}$ and is sometimes simply set to $\psi({\bf x})=\frac{1}{2}||{\bf x}-{\bf x}_{t}||^{2}_{2}$. It makes the model parameters satisfy the assumption of steady change. To achieve a better regret bound, CMD adagrad adopts a proximal function $\psi_{t}$ that varies with $t$. The update for CMD adagrad amounts to solving

(6) ${\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}\eta f^{\prime}_{t}({\bf x}_{t})^{\top}{\bf x}+\eta\phi({\bf x})+B_{\psi_{t}}({\bf x},{\bf x}_{t})$

Similarly, RDA adagrad encompasses a trade-off among a gradient-dependent linear term, the regularizer $\phi$ and a strongly convex term $\psi_{t}$ for well-conditioned predictions. The update for RDA adagrad is:

(7) ${\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}\{\eta<\frac{1}{t}\sum_{\tau=1}^{t}f^{\prime}_{\tau}({\bf x}_{\tau}),{\bf x}>+\eta\phi({\bf x})+\frac{1}{t}\psi_{t}({\bf x})\}$

For both CMD adagrad and RDA adagrad, $\phi({\bf x})=\lambda||{\bf x}||_{1}$ encourages the sparsity of the model parameters. The Bregman divergence associated with $\psi_{t}({\bf x})=\frac{1}{2}{\bf x}^{\top}{\bf A}_{t}{\bf x}$ is defined as

(8) $B_{\psi_{t}}({\bf x},{\bf x}_{t})=\frac{1}{2}({\bf x}-{\bf x}_{t})^{\top}{\bf A}_{t}({\bf x}-{\bf x}_{t})$

${\bf A}_{t}$ is a diagonal matrix, also named the adaptive learning rate matrix. Concretely, for some small fixed $\delta\geq 0$, the adaptive learning rate matrix is:

(9) ${\bf A}_{t}=\delta{\bf I}+diag({\bf a}_{t})$

where ${\bf a}_{t,d}=||f^{\prime}({\bf x})_{1:t,d}||_{2}$ and $f^{\prime}({\bf x})_{1:t}=[f^{\prime}({\bf x})_{1:t-1},f^{\prime}_{t}({\bf x}_{t})]$ denotes the matrix obtained by concatenating the gradient sequence. $f^{\prime}({\bf x})_{1:t,d}$ is the $d^{th}$ row of this matrix and $diag(\cdot)$ converts a vector into a diagonal matrix. ${\bf I}$ is the identity matrix.
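In practice only the diagonal of ${\bf A}_{t}$ needs to be stored, and ${\bf a}_{t}$ can be maintained as a running sum of squared gradient entries. The following is a minimal sketch of such an accumulator (our own illustration, not the paper's code).

```python
import numpy as np

class AdaptiveLearningRate:
    """Maintains the diagonal of A_t = delta * I + diag(a_t) from Eq. (9)."""

    def __init__(self, dim, delta=1e-8):
        self.delta = delta
        self.sum_sq = np.zeros(dim)   # running sum of squared gradient entries

    def update(self, grad):
        self.sum_sq += grad ** 2
        # a_{t,d} = ||f'(x)_{1:t,d}||_2 is the square root of the running sum.
        return self.delta + np.sqrt(self.sum_sq)
```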

To apply them to distributed learning, each worker needs to maintain an adaptive learning rate matrix locally. On the $m^{th}$ worker, the elements of the gradient $f^{\prime}({\bf x}_{t})^{(m)}$ are selected by a selection function:

(10) $S(f^{\prime}_{t}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)})=\{f^{\prime}_{t,d}({\bf x}_{t})^{(m)}|\mathbb{I}_{d}^{(m)}=1\}$

where $\mathbb{I}^{(m)}$ is an indicator function such that

(11) $\mathbb{I}_{d}=\begin{cases}1,&{\bf x}_{t,d}\neq 0\text{ or }\hat{\bf x}_{t+1,d}\neq 0\\ 0,&\text{otherwise}\end{cases}$

$\hat{\bf x}_{t+1}^{(m)}$ is the local parameter updated based on the local full precision gradient $f^{\prime}_{t}({\bf x}_{t})^{(m)}$ and the local adaptive learning rate matrix ${\bf A}_{t}^{(m)}$. Both the indicator and the selected elements of the local gradients are sent to the global parameter server for decoding. The synchronous indicator is calculated by a "bitwise or" over the $M$ indicators.
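A minimal sketch of the indicator in Eq. (11) and the selection in Eq. (10), using boolean masks, is given below; it is our own illustration, not the paper's implementation.

```python
import numpy as np

def build_indicator(x_t, x_hat_next):
    # Eq. (11): a coordinate is marked if the parameter is non-zero
    # before or after the local update.
    return (x_t != 0) | (x_hat_next != 0)

def select(grad, indicator):
    # Eq. (10): only the marked gradient entries are transmitted.
    return grad[indicator]

# The server obtains the synchronous indicator by a "bitwise or" over the
# workers' masks, e.g. np.logical_or.reduce(list_of_worker_masks).
```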

Gradient quantization. Besides reducing the transmission amount of the corresponding gradients by generating a sparse model, we also hope to further reduce the communication cost by quantizing the selected elements of the gradient. The quantization function is defined as $Q(\cdot)$. Each worker computes $Q(S(f^{\prime}_{t}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)}))$. The global parameter server first decodes the quantized gradients through $S^{-1}$ and computes the synchronized quantized gradient based on Eq. (4). We further quantize the synchronized gradient $SynQ_{t}$ to save more communication cost by

(12) ${\bf q}_{t}=Q(SynQ_{t})$

The server pushes the doubly quantized gradient ${\bf q}_{t}$ to every worker. Specifically, the adaptive learning rate is calculated by each worker according to the ${\bf q}_{t}$ sequence. We therefore construct a quantized-gradient-based adaptive learning rate matrix,

(13) ${\bf H}_{t}=\delta{\bf I}+diag({\bf c}_{t})$

where ${\bf c}_{t,d}=||{\bf q}_{1:t,d}||_{2}$ and ${\bf q}_{1:t}=[{\bf q}_{1:t-1},{\bf q}_{t}]$. ${\bf H}_{t}$ dynamically incorporates knowledge of the geometry of the data and the curvature of the loss function based on the quantized gradients, which makes QCMD adagrad and QRDA adagrad achieve a good balance between model sparsity, accuracy and communication cost. Therefore, the objective function for QCMD adagrad becomes

(14) ${\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}\eta{\bf q}_{t}^{\top}{\bf x}+\eta\phi({\bf x})+B_{\psi_{t}}({\bf x},{\bf x}_{t})$

The objective function for QRDA adagrad becomes

(15) ${\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}\{\eta<\frac{1}{t}\sum_{\tau=1}^{t}{\bf q}_{\tau},{\bf x}>+\eta\phi({\bf x})+\frac{1}{t}\psi_{t}({\bf x})\}$

Solving Eq. (14), we have the update rule for QCMD adagrad:

(16) ${\bf x}_{t+1,i}=sign({\bf x}_{t,i}-\eta{\bf H}_{t,ii}^{-1}{\bf q}_{t,i})[|{\bf x}_{t,i}-\eta{\bf H}_{t,ii}^{-1}{\bf q}_{t,i}|-\lambda\eta{\bf H}_{t,ii}^{-1}]_{+}$

The subscript $(\cdot)_{+}$ denotes that we retain only values greater than 0. Solving Eq. (15), the update rule for QRDA adagrad becomes:

(17) ${\bf x}_{t+1,i}=sign(-\sum_{\tau=1}^{t}{\bf q}_{\tau,i})\,t\eta{\bf H}_{t,ii}^{-1}[|\frac{1}{t}\sum_{\tau=1}^{t}{\bf q}_{\tau,i}|-\lambda]_{+}$

The overall QCMD adagrad and QRDA adagrad schemes are presented in Alg. 1.

Algorithm 1 QCMD adagrad and QRDA adagrad.
for $t=1$ to $T$:
  for each worker $m=1,...,M$ do in parallel:
    Randomly choose ${\bf Z}_{t}^{(m)}$ from the local data.
    Compute the gradient $f_{t}^{\prime}({\bf x}_{t})^{(m)}$ on ${\bf Z}_{t}^{(m)}$.
    Compute $\hat{\bf x}_{t+1}^{(m)}$ based on $f^{\prime}_{t}({\bf x}_{t})^{(m)}$ and $\hat{\bf H}_{t}=\delta{\bf I}+diag(\hat{\bf c}_{t})$, where $\hat{\bf c}_{t,d}=||[{\bf q}_{1:t-1},f_{t}^{\prime}({\bf x}_{t})^{(m)}]||_{2}$.
    Compute the indicator function $\mathbb{I}^{(m)}$.
    Quantize the selected elements of the gradient: $Q(S(f_{t}^{\prime}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)}))$.
    Push $Q(S(f_{t}^{\prime}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)}))$ and $\mathbb{I}^{(m)}$ to the server.
  Server:
    Decode $Q(S(f_{t}^{\prime}({\bf x}_{t})^{(m)},\mathbb{I}^{(m)}))$ to get $Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})$.
    Compute the synchronous indicator $\mathbb{I}^{Syn}$.
    Average the quantized gradients: $SynQ_{t}=\frac{1}{M}\sum_{m=1}^{M}Q(f_{t}^{\prime}({\bf x}_{t})^{(m)})$.
    Quantize the aggregated gradient: ${\bf q}_{t}=Q(SynQ_{t})$.
    Push $S({\bf q}_{t},\mathbb{I}^{Syn})$ and $\mathbb{I}^{Syn}$ to every worker.
  for each worker $m=1,...,M$ do in parallel:
    Pull $S({\bf q}_{t},\mathbb{I}^{Syn})$ and $\mathbb{I}^{Syn}$ from the server.
    Decode $S({\bf q}_{t},\mathbb{I}^{Syn})$ to get ${\bf q}_{t}$.
    Calculate the adaptive learning rate matrix ${\bf H}_{t}=\delta{\bf I}+diag({\bf c}_{t})$, where ${\bf c}_{t,d}=||{\bf q}_{1:t,d}||_{2}$, ${\bf q}_{1:t}=[{\bf q}_{1:t-1},{\bf q}_{t}]$.
    QCMD adagrad: update ${\bf x}_{t+1}$ based on Eq. (16).
    QRDA adagrad: update ${\bf x}_{t+1}$ based on Eq. (17).
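To make the worker-side parameter updates in Alg. 1 concrete, the following NumPy sketch implements Eqs. (16) and (17); the variable names (`q_t` for the double-quantized gradient, `q_sum` for its running sum, `h` for the diagonal of ${\bf H}_{t}$) are ours and only illustrative.

```python
import numpy as np

def qcmd_update(x, q_t, h, eta, lam):
    # Eq. (16): adaptive composite mirror descent step followed by
    # soft-thresholding induced by the l1 regularizer.
    step = x - eta * q_t / h
    return np.sign(step) * np.maximum(np.abs(step) - lam * eta / h, 0.0)

def qrda_update(q_sum, h, t, eta, lam):
    # Eq. (17): regularized dual averaging with l1 truncation of the
    # averaged quantized gradient.
    avg = q_sum / t
    return np.sign(-q_sum) * t * eta / h * np.maximum(np.abs(avg) - lam, 0.0)
```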

3.3. Quantization error

In this section, we theoretically analyse how the quantization error introduced by gradient quantization affects the convergence rate of the regret for QCMD adagrad and QRDA adagrad.

Proposition 3.1.

Let the sequence $\{{\bf x}_{t}\}$ be defined by the update (16), $SynQ_{t}$ be defined by Eq. (4), ${\bf q}_{t}$ be defined by Eq. (12), and $\mathbb{E}[{\bf q}_{t}]=f^{\prime}_{t}({\bf x}_{t})$. Let the Mahalanobis norm be $||\cdot||_{\psi_{t}}=\sqrt{<\cdot,{\bf H}_{t}\cdot>}$, with $||\cdot||_{\psi^{*}_{t}}=\sqrt{<\cdot,\frac{1}{{\bf H}_{t}}\cdot>}$ the associated dual norm. ${\bf x}^{*}$ is the optimal solution to $f({\bf x})$. For any ${\bf x}^{*}\in\chi$,

$$\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}B_{\psi_{1}}({\bf x}^{*},{\bf x}_{1})+\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[B_{\psi_{t+1}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})]\end{split}$$
Proof.

See Appendix for the proof. ∎

Proposition 3.2.

Let the sequence $\{{\bf x}_{t}\}$ be defined by the update (17), $SynQ_{t}$ be defined by Eq. (4), ${\bf q}_{t}$ be defined by Eq. (12), and $\mathbb{E}[{\bf q}_{t}]=f^{\prime}_{t}({\bf x}_{t})$. For any ${\bf g}\in\mathbb{R}^{d}$, let $\psi_{t}^{*}({\bf g})$ be the conjugate dual of $t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})$, with $\phi({\bf x})=\lambda||{\bf x}||_{1}$ and $||\cdot||_{\psi_{t}^{*}}=\sqrt{<\cdot,\frac{\eta}{2{\bf H}_{t}}\cdot>}$. ${\bf x}^{*}$ is the optimal solution to $f({\bf x})$. For any ${\bf x}^{*}\in\chi$, we have

$$\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{\eta}\mathbb{E}_{\bf q}[\psi_{T}({\bf x}^{*})]+\frac{\eta}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\end{split}$$
Proof.

See Appendix for the proof. ∎

Remark: Since $\mathbb{E}_{\bf q}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}$ is the quantization variance scaled by the adaptive learning rate and

$$\begin{split}\mathbb{E}_{\bf q}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}&\leq\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}\\ &\leq\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}},\end{split}$$

we can simply regard $\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}$ as the error introduced by quantization for QCMD adagrad and $\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}$ as the error introduced by quantization for QRDA adagrad. Therefore, Proposition 3.1 and Proposition 3.2 show that gradient quantization introduces additional noise which affects the model convergence and sparsity.

3.4. Threshold Quantization

Although gradient quantization can reduce the cost of gradient communication in distributed training, it also introduces additional errors, which affect the convergence of the model and the sparsity of the parameters in QCMD adagrad and QRDA adagrad. As an unbiased gradient quantization method, TernGrad (wen2017terngrad, ) already achieves a good balance between the encoding cost and the accuracy of general models. However, for sparse models, its large quantization error still leads to slower convergence of the $l_{1}$ norm term in the objective function, which affects the sparsity of the model. To mitigate this problem, we apply the threshold quantization method to QCMD adagrad and QRDA adagrad.

Threshold quantization is an existing quantization method used for model quantization in (TWN, ). We apply it to gradient quantization since it produces less error than TernGrad (wen2017terngrad, ). In this section, we use ${\bf v}^{t}$ to represent the (stochastic) gradient at the $t^{th}$ iteration. Fig. 3 gives a brief explanation of threshold quantization, and more analysis is provided below.

Figure 3. An illustration of threshold quantization. Suppose $v^{t}_{min}\leq v^{t}_{i}\leq v^{t}_{max}$. $\triangle^{*}$ denotes the optimal threshold. For $v^{t}_{i}$ within the orange line, $v^{t}_{i}$ is quantized to 0. For $v^{t}_{i}$ within the blue line, $v^{t}_{i}$ is quantized to $-s_{\triangle}^{*}$. For $v^{t}_{i}$ within the green line, $v^{t}_{i}$ is quantized to $s_{\triangle}^{*}$.

$Q_{\triangle}(\cdot)$ is the threshold quantization function, defined as

(18) $Q_{\triangle}({\bf v}^{t})=s{\bf v},$

where ${\bf v}$ is a ternary vector, $s$ is a non-negative scaling factor and $\triangle$ denotes the threshold. For the $i^{th}$ component of ${\bf v}$,

(19) ${\bf v}_{i}=\begin{cases}+1,&\text{if }{\bf v}_{i}^{t}>\Delta;\\ 0,&\text{if }|{\bf v}_{i}^{t}|\leq\Delta;\\ -1,&\text{if }{\bf v}_{i}^{t}<-\Delta.\end{cases}$

The error $\epsilon_{t}$ is defined as the difference between the full precision vector ${\bf v}^{t}$ and $Q_{\triangle}({\bf v}^{t})$:

(20) $\epsilon_{t}={\bf v}^{t}-Q_{\triangle}({\bf v}^{t}).$

In order to keep as much information of ${\bf v}^{t}$ as possible, the quantization method is required to minimize the Euclidean distance between ${\bf v}^{t}$ and $Q_{\triangle}({\bf v}^{t})$, i.e.,

(21) $s^{*},{\bf v}^{*}=\arg\min_{s,{\bf v}}||\epsilon_{t}||_{2}^{2}\quad s.t.\quad s\geq 0,\ {\bf v}_{i}\in\{-1,0,1\},\ i=1,2,...,n.$

Obviously, the above problem can be transformed to the following formulation:

(22) $s^{*},\Delta^{*}=\arg\min_{s\geq 0,\Delta>0}|I_{\Delta}|s^{2}-2s\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|+\sum_{i=1}^{n}({\bf v}^{t}_{i})^{2},$

where $I_{\Delta}=\{i\,|\,|{\bf v}^{t}_{i}|>\Delta\}$ and $|I_{\Delta}|$ denotes the number of elements in $I_{\Delta}$. Thus, for any given $\Delta$, the optimal $s$ can be computed as follows,

(23) $s^{*}_{\Delta}=\frac{1}{|I_{\Delta}|}\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|.$

The optimal $\Delta$ can be computed as follows,

(24) $\Delta^{*}=\arg\max_{\Delta>0}\frac{1}{|I_{\Delta}|}(\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|)^{2}.$

To find the optimal threshold, we can sort the components of the gradient by magnitude and treat each of them as a potential threshold. For each potential threshold, we calculate the corresponding objective value $\frac{1}{|I_{\Delta}|}(\sum_{i\in I_{\Delta}}|{\bf v}^{t}_{i}|)^{2}$ and take the potential threshold that maximizes this value as the optimal threshold. The computational complexity of this process is $O(d\log d)$ for $d$-dimensional gradients, where the major cost is the sorting over all elements of the gradient.
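A minimal sketch of this sorting-based search for $\Delta^{*}$ (our own illustration, not the paper's code) is:

```python
import numpy as np

def optimal_threshold(v):
    """Solve Eq. (24) by treating every sorted magnitude as a candidate cutoff."""
    mags = np.sort(np.abs(v))[::-1]          # magnitudes in descending order
    prefix = np.cumsum(mags)                 # sum over each candidate set I_Delta
    sizes = np.arange(1, len(mags) + 1)      # |I_Delta| for each candidate
    obj = prefix ** 2 / sizes                # objective of Eq. (24)
    k = int(np.argmax(obj))                  # best candidate keeps k + 1 entries
    s = prefix[k] / sizes[k]                 # optimal scale from Eq. (23)
    delta = mags[k]                          # smallest magnitude that is kept
    return delta, s                          # use |v_i| >= delta when quantizing
```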

3.5. Threshold Approximation

We aim to improve the computational efficiency of the threshold quantization procedure without harming the optimality of the coding described in Eq. (24). This improvement begins with the assumption that the gradient follows a Gaussian distribution. When ${\bf v}^{t}_{i}$ follows $N(0,\sigma^{2})$, Li and Liu (TWN, ) give an approximate solution for the optimal threshold $\triangle^{*}$ as $0.6\sigma$, which equals $0.75\cdot\mathbb{E}(|{\bf v}^{t}_{i}|)\approx\frac{0.75}{d}\sum_{i=1}^{d}|{\bf v}_{i}^{t}|$. We also find in our experiments that most of the gradients satisfy this assumption. Fig. 4 shows the experimental result for training AlexNet (krizhevsky2012imagenet, ) on two workers. The left column visualizes the first convolutional layer and the right one visualizes the first fully-connected layer. The distribution of the original floating-point gradients is close to a Gaussian distribution for both convolutional and fully-connected layers. Based on this observation, we simply use $\frac{0.75}{d}\sum_{i=1}^{d}|{\bf v}_{i}^{t}|$ to approximate the optimal threshold, avoiding the expensive cost of solving for $\Delta^{*}$ in every iteration.
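A minimal sketch of threshold quantization with this approximate threshold (our own illustration; the function name is hypothetical) is:

```python
import numpy as np

def threshold_quantize(v):
    # Approximate optimal threshold: Delta ~ 0.75 * mean(|v|).
    delta = 0.75 * np.mean(np.abs(v))
    ternary = np.zeros_like(v)
    ternary[v > delta] = 1.0                 # Eq. (19)
    ternary[v < -delta] = -1.0
    kept = np.abs(v) > delta
    s = np.mean(np.abs(v[kept])) if kept.any() else 0.0   # Eq. (23)
    return s, ternary                        # Q_Delta(v) = s * ternary
```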

Encode. The gradient vectors need to be encoded after threshold quantization. Specifically, we use 2 bits to encode each ${\bf v}_{i}$ and one floating-point number to represent the scaling factor $s$. For a dense model, the overall communication cost is $(32+2d)$ bits, where $d$ denotes the parameter dimension; this is the same communication cost as TernGrad. But for a sparse model, assuming the number of non-zero parameters is $k$ ($k\ll d$), the communication cost of QCMD adagrad and QRDA adagrad is $(32+d+2k)$ bits, since we need at least $d$ bits to indicate which components of the parameter are non-zero.
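As an illustration of this encoding (a sketch under our own assumptions, not the paper's wire format), four ternary entries can be packed into one byte:

```python
import numpy as np

def pack_ternary(ternary):
    # Map {-1, 0, +1} to the 2-bit codes {0, 1, 2} and pack 4 codes per byte.
    codes = (ternary.astype(np.int8) + 1).astype(np.uint8)
    pad = (-len(codes)) % 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    c = codes.reshape(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_ternary(packed, length):
    # Recover the 2-bit codes and map them back to {-1, 0, +1}.
    c = np.stack([(packed >> (2 * i)) & 0b11 for i in range(4)], axis=1)
    return c.reshape(-1)[:length].astype(np.int8) - 1
```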

(a) First convolutional layer
(b) First fully connected layer
Figure 4. Histograms of the original floating-point gradients. These histograms are obtained from distributed training of AlexNet on two workers, where the vertical axis is the training iteration. The left column visualizes the first convolutional layer and the right one visualizes the first fully-connected layer.
Figure 5. The effect of the coefficient of regularization $\lambda$ and the learning rate $\eta$ on model sparsity on MNIST for QCMD adagrad and QRDA adagrad.

4. Convergence Analysis

Two aspects are taken into account to evaluate the distributed optimization algorithm, the number of bits sent and received by the workers (communication complexity) and the number of parallel iterations required for convergence (round complexity). In this section, we theoretically analyze the proposed QCMD adagrad and QRDA adagrad in terms of the convergence rate of regret.

To obtain the regret bound for QCMD adagrad and QRDA adagrad, we provide the following lemma, which has been proved in Lemma 4 of (duchi2011adaptive, ).

Lemma 4.1.

For any $\delta\geq 0$, let the Mahalanobis norm be $||\cdot||_{\psi_{t}}=\sqrt{<\cdot,{\bf H}_{t}\cdot>}$, with $||\cdot||_{\psi^{*}_{t}}=\sqrt{<\cdot,\frac{1}{{\bf H}_{t}}\cdot>}$ the associated dual norm. We have

$$\frac{1}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}\leq\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}$$
Lemma 4.2.

For any $\delta\geq\max_{t}||{\bf q}_{t}||_{\infty}$ and $||\cdot||_{\psi^{*}_{t}}=\sqrt{<\cdot,\frac{\eta}{2{\bf H}_{t}}\cdot>}$, we have

$$\frac{1}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\leq\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}$$

Combining the above arguments with Proposition 3.1 and Proposition 3.2, we have the following theorems.

Theorem 4.3.

For QCMD adagrad, let $D_{\infty}=\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||_{\infty}$, $G_{\infty}=\max_{t\leq T,i\leq d}||{\bf q}_{1:T,i}||_{2}$, and let the regularizer be $\phi({\bf x})=\lambda||{\bf x}||_{1}$ with $\lambda\geq 0$. Assume $Q(\cdot)$ is an unbiased quantization function, $f_{t}({\bf x})$ is an L-smooth function, the learning rate is $\eta=\frac{2}{1+2L}$, and $\psi_{t}({\bf x})$ is a 1-strongly convex function. Then the regret satisfies

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\leq\frac{dG_{\infty}}{\sqrt{T}}+\frac{dG_{\infty}D_{\infty}}{2\eta\sqrt{T}}$$
Proof.

See Appendix for the proof. ∎

Theorem 4.4.

For QRDA adagrad, let $D_{\infty}=\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||_{\infty}$, $G_{\infty}=\max_{t\leq T,i\leq d}||{\bf q}_{1:T,i}||_{2}$, let $Q(\cdot)$ be an unbiased quantization function, and let the regularizer be $\phi({\bf x})=\lambda||{\bf x}||_{1}$ with $\lambda\geq 0$. Then the regret satisfies

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\leq\frac{\delta||{\bf x}^{*}||_{2}^{2}}{\eta T}+\frac{(\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}+\frac{\eta^{2}}{2})dG_{\infty}}{\sqrt{T}}$$
Proof.

See Appendix for the proof. ∎

Remark: The convergence rate of QCMD adagrad and QRDA adagrad is comparable with that of CMD adagrad and RDA adagrad, respectively, namely $O(\frac{1}{\sqrt{T}})$. The quantization error affects the convergence: only when the quantization error is sufficiently small do the proposed methods behave similarly to CMD adagrad and RDA adagrad. The convergence of $f({\bf x})$ determines the accuracy of the model and $\phi({\bf x})$ determines its sparsity. In QCMD adagrad, the learning rate $\eta$ and the coefficient of regularization $\lambda$ control the sparsity of the model, while QRDA adagrad controls the sparsity of the model only through $\lambda$.

5. Experiments

5.1. Experimental Settings

Table 1. Data statistics
Data Set | Number of Training Data | Number of Test Data | Dimension of Input Feature
news20 | 15,994 | 4,000 | 1,355,191
rcv1 | 20,242 | 677,399 | 47,236
MNIST | 60,000 | 10,000 | 1*28*28
CIFAR-10 | 50,000 | 10,000 | 3*32*32

In this section, we first conduct experiments on linear models to validate the effectiveness and efficiency of the proposed QCMD adagrad and QRDA adagrad on binary classification problems. After that, we use the proposed methods to train convolutional neural networks to validate their performance on non-convex problems.

Baselines. We compare QCMD adagrad and QRDA adagrad with several sparse model distributed optimization methods, including 32 bits Prox-gd (duchi2010composite, ), 32 bits CMD adagrad (duchi2011adaptive, ), 32 bits RDA adagrad (duchi2011adaptive, ) and their corresponding ternary variants (wen2017terngrad, ). $\dagger$ marks the methods in which only the local gradients are quantized, not the synchronous gradient.

Implementation Details. All experiments are carried out in a distributed framework with a network bandwidth of 100MBps. For linear model training, each worker only utilizes CPUs. For the training of convolutional neural networks, each worker is allocated 1 NVIDIA Tesla P40 GPU. The methods are evaluated on four publicly available datasets. news20 and rcv1 are text datasets with high-dimensional input features from LIBSVM (chang2001libsvm, ). MNIST (lecun1998gradient, ) is a handwritten digit classification problem and CIFAR-10 (krizhevsky2009learning, ) is an image classification problem. Table 1 shows the details of the datasets. For news20 and rcv1, an $\ell_{1}$ norm regularized logistic regression model is trained. For the multi-class classification problems, we train LeNet for MNIST (lecun1998gradient, ) and AlexNet (krizhevsky2012imagenet, ) for CIFAR-10. To generate a network with large sparsity, a batch normalization layer is added before each convolutional layer and fully connected layer. The code is implemented in TensorFlow. Experimental results are averaged over 5 runs with random initialization seeds.

Table 2. Settings of hyperparameters
news20
Method | Learning rate | Coefficient of regularization | Batch size per worker | Number of workers
Prox-gd | 0.1 | 0.001 | 20 | 2
QCMD adagrad | 0.02 | 0.00001 | 20 | 2
QRDA adagrad | 0.02 | 0.1 | 20 | 2
rcv1
Method | Learning rate | Coefficient of regularization | Batch size per worker | Number of workers
Prox-gd | 1.0 | 0.0001 | 20 | 2
QCMD adagrad | 1.0 | 0.000005 | 20 | 2
QRDA adagrad | 0.1 | 0.5 | 20 | 2
MNIST
Method | Learning rate | Coefficient of regularization | Batch size per worker | Number of workers
Prox-gd | 0.001 | 0.005 | 16 | 4
QCMD adagrad | 0.004 | 0.0001 | 16 | 4
QRDA adagrad | 0.01 | 0.001 | 16 | 4
CIFAR-10
Method | Learning rate | Coefficient of regularization | Batch size per worker | Number of workers
Prox-gd | 0.01 | 0.001 | 64 | 2
QCMD adagrad | 0.01 | 0.0002 | 64 | 2
QRDA adagrad | 0.01 | 0.0004 | 64 | 2
Table 3. Comparisons on four datasets in terms of three metrics. $\dagger$ marks the methods that only quantize the local gradient, not the synchronous gradient.
news20
Method | Accuracy(%) | Sparsity(%) | Error
32-bits Prox-gd | 73.92 | 56.63 | -
32-bits CMD adagrad | 97.50 | 85.76 | -
Ternary CMD adagrad | 96.50 | 82.64 | 3.13e-09
Threshold CMD adagrad | 97.46 | 84.59 | 1.49e-09
Ternary CMD adagrad | 96.00 | 80.30 | 5.24e-09
Threshold CMD adagrad | 97.19 | 83.83 | 2.39e-09
32-bits RDA adagrad | 97.21 | 98.21 | -
Ternary RDA adagrad | 95.88 | 97.52 | 1.63e-08
Threshold RDA adagrad | 97.16 | 98.26 | 2.92e-09
Ternary RDA adagrad | 95.50 | 97.05 | 2.65e-08
Threshold RDA adagrad | 97.09 | 98.18 | 2.93e-09
rcv1
Method | Accuracy(%) | Sparsity(%) | Error
32-bits Prox-gd | 91.96 | 33.87 | -
32-bits CMD adagrad | 95.15 | 72.66 | -
Ternary CMD adagrad | 94.97 | 71.70 | 5.07e-08
Threshold CMD adagrad | 95.16 | 73.76 | 1.40e-08
Ternary CMD adagrad | 94.87 | 70.12 | 7.77e-08
Threshold CMD adagrad | 94.99 | 73.18 | 1.45e-08
32-bits RDA adagrad | 93.41 | 97.56 | -
Ternary RDA adagrad | 91.14 | 96.86 | 1.73e-07
Threshold RDA adagrad | 93.21 | 97.54 | 2.32e-08
Ternary RDA adagrad | 90.90 | 96.43 | 3.03e-07
Threshold RDA adagrad | 93.01 | 97.39 | 2.62e-08
MNIST
Method | Accuracy(%) | Sparsity(%) | Error
32-bits Prox-gd | 94.17 | 7.86 | -
32-bits CMD adagrad | 98.28 | 17.19 | -
Ternary CMD adagrad | 98.22 | 9.61 | 3.52e-3
Threshold CMD adagrad | 98.34 | 22.08 | 2.77e-3
Ternary CMD adagrad | 97.45 | 10.19 | 4.82e-3
Threshold CMD adagrad | 98.46 | 21.78 | 2.95e-3
32-bits RDA adagrad | 97.85 | 98.27 | -
Ternary RDA adagrad | 97.57 | 95.27 | 1.30e-1
Threshold RDA adagrad | 97.84 | 98.59 | 1.98e-2
Ternary RDA adagrad | 97.32 | 90.39 | 1.51e-1
Threshold RDA adagrad | 97.70 | 98.69 | 2.32e-2
CIFAR-10
Method | Accuracy(%) | Sparsity(%) | Error
32-bits Prox-gd | 72.73 | 76.51 | -
32-bits CMD adagrad | 86.63 | 80.50 | -
Ternary CMD adagrad | 85.27 | 0.42 | 1.18e-1
Threshold CMD adagrad | 86.23 | 62.51 | 2.45e-2
Ternary CMD adagrad | 84.37 | 0.20 | 2.58e-1
Threshold CMD adagrad | 84.74 | 39.95 | 2.60e-2
32-bits RDA adagrad | 86.49 | 87.11 | -
Ternary RDA adagrad | 84.09 | 80.05 | 8.70e-1
Threshold RDA adagrad | 85.57 | 87.65 | 2.17e-2
Ternary RDA adagrad | 82.84 | 75.46 | 9.90e-1
Threshold RDA adagrad | 84.22 | 81.86 | 2.77e-2
Figure 6. Comparison of the decomposed time consumption for training logistic regression on news20 and rcv1, LeNet on MNIST, and AlexNet on CIFAR-10. Each histogram shows the one-step decomposed time consumption averaged over the entire training procedure. $\dagger$ marks the methods that only quantize the local gradient, not the synchronous gradient. The lower rectangles are the transmission time and the upper rectangles are the computation time. The network bandwidth between the parameter server and workers is 100MBps. Since LeNet and AlexNet are trained on GPUs, the computation time is insignificant for MNIST and CIFAR-10.

Hyperparameter Discussion. Table 2 lists the related hyperparameters for sparse model training in the distributed setting, including the learning rate $\eta$, the coefficient of regularization $\lambda$, the batch size per worker and the number of workers. The optimal choice of $\eta$ and $\lambda$ can vary somewhat from dataset to dataset; the structure and random initialization of the network also affect them. To find appropriate hyperparameters, we first select the best learning rate based on cross-validation without regularization, and then find the coefficient of regularization that maximizes the sparsity of the model while reducing the accuracy as little as possible. For ease of comparison, we set the same hyperparameters for the corresponding ternary quantization methods. Fig. 5 shows the accuracy and model sparsity as $\eta$ and $\lambda$ change on MNIST. The sparsity increases with $\lambda$; a strong coefficient of regularization $\lambda$ enforces a highly sparse model, which may deteriorate accuracy. When the learning rate is high or $\lambda$ is large, QCMD adagrad tends to oscillate. For QCMD adagrad, the model sparsity increases not only with the coefficient of regularization $\lambda$ but also with the learning rate $\eta$. For QRDA adagrad, model sparsity is mainly affected by $\lambda$, while the learning rate has less effect.

5.2. Results and Discussions

We summarize three metrics including the accuracy, model sparsity and error of the quantized gradient in Table 3, and report the time consumption in Fig. 6. From these results, we draw several conclusions.

Firstly, the sparsity of the QRDA adagrad models on the four datasets is 98.18%, 97.39%, 98.69% and 81.86%, respectively. QRDA adagrad generates extremely sparse models more easily than QCMD adagrad, since it uses the same criterion $\lambda$ for each dimension of the model parameters to determine whether to retain the parameter value. In the case of the highly sparse models of QRDA adagrad, the accuracy decreases slightly compared with QCMD adagrad, whose model sparsity is relatively low. For example, on MNIST, the accuracy of threshold QRDA adagrad reaches 97.70%, which is 0.76% lower than that of threshold QCMD adagrad.

Secondly, although the sparsity of the model generated by QCMD adagrad is limited, it still performs better than 32 bits Prox-gd in both accuracy and sparsity, since the proximal function $\psi$ uses the Bregman divergence to keep the parameter ${\bf x}_{t+1}$ close to ${\bf x}_{t}$ and it adaptively retains the parameter values for each dimension.

Thirdly, the noise introduced by the quantization method affects the convergence of the accuracy and sparsity of the model. One-way quantization (only the local gradients are quantized instead of both the local gradients and the synchronous gradient) introduces less noise than double quantization. The error of threshold quantization is smaller than TernGrad's, and it preserves the sparsity and accuracy of the model for both QCMD adagrad and QRDA adagrad. Specifically, for threshold quantization, the accuracy and sparsity of the model under the double quantized scheme are relatively close to those of the 32 bits gradient scheme, while ternary quantization reduces the sparsity and accuracy of the model on these 4 datasets.

Lastly, double quantization removes most of the communication cost. Fig. 6 decomposes the time consumption. For example, double quantized threshold QRDA saves 96.06% of the total training time on CIFAR-10, while threshold QRDA with only the local gradients quantized saves 48.03%. QRDA adagrad incurs the least communication cost since it generates a highly sparse model.

6. Conclusion

In this paper, we present distributed QCMD adagrad and QRDA adagrad to accelerate large-scale machine learning. QCMD adagrad and QRDA adagrad combine gradient quantization and sparse models to reduce the communication cost per iteration in distributed training. Our theoretical analysis shows that the convergence rate of QCMD adagrad and QRDA adagrad is comparable with that of CMD adagrad and RDA adagrad, respectively. Considering that a large quantization error affects the convergence of QCMD adagrad and QRDA adagrad, we adopt threshold quantization, which has a smaller quantization error, to preserve the model performance and sparsity. The encouraging empirical results on linear models and convolutional neural networks demonstrate the efficacy and efficiency of the proposed algorithms in balancing communication cost, accuracy, and model sparsity.

References

  • (1) F. Li and B. Liu, “Ternary weight networks,” CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
  • (2) Y. Zhang, P. Zhao, J. Cao, W. Ma, J. Huang, Q. Wu, and M. Tan, “Online adaptive asymmetric active learning for budgeted imbalanced data,” in SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2768–2777.
  • (3) P. Zhao, Y. Zhang, M. Wu, S. C. Hoi, M. Tan, and J. Huang, “Adaptive cost-sensitive online classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 2, pp. 214–228, 2018.
  • (4) Y. Zhang, H. Chen, Y. Wei, P. Zhao, J. Cao, X. Fan, X. Lou, H. Liu, J. Hou, X. Han et al., “From whole slide imaging to microscopy: Deep microscopy adaptation network for histopathology cancer image classification,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019.
  • (5) D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017.
  • (6) J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
  • (7) J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” arXiv preprint arXiv:1710.09854, 2017.
  • (8) J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan, “Multi-marginal wasserstein gan,” in Advances in Neural Information Processing Systems, 2019.
  • (9) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
  • (10) J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quantized sgd and its applications to large-scale distributed optimization,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5325–5333.
  • (11) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns,” in Annual Conference of the International Speech Communication Association, 2014.
  • (12) A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” arXiv preprint arXiv:1704.05021, 2017.
  • (13) A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • (14) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • (15) R. Bekkerman, M. Bilenko, and J. Langford, Scaling up machine learning: Parallel and distributed approaches.   Cambridge University Press, 2011.
  • (16) S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
  • (17) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
  • (18) C.-J. Hsieh, S. Si, and I. S. Dhillon, “Communication-efficient distributed block minimization for nonlinear kernel machines,” in SIGKDD International Conference on Knowledge Discovery & Data Mining.   ACM, 2017, pp. 245–254.
  • (19) M. F. Balcan, Y. Liang, L. Song, D. Woodruff, and B. Xie, “Communication efficient distributed kernel principal component analysis,” in SIGKDD International Conference on Knowledge Discovery & Data Mining, 2016.
  • (20) C.-p. Lee, C. H. Lim, and S. J. Wright, “A distributed quasi-newton algorithm for empirical risk minimization with nonsmooth regularization,” in SIGKDD International Conference on Knowledge Discovery & Data Mining.   ACM, 2018, pp. 1646–1655.
  • (21) R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in Neural Information Processing Systems, 2013, pp. 315–323.
  • (22) T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: building an efficient and scalable deep learning training system.” in USENIX Symposium on Operating Systems Design and Implementation, vol. 14, 2014, pp. 571–582.
  • (23) E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu, “Petuum: A new platform for distributed machine learning on big data,” IEEE Transactions on Big Data, vol. 1, no. 2, pp. 49–67, 2015.
  • (24) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • (25) J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, “Composite objective mirror descent,” in Annual Conference on Learning Theory, 2010.
  • (26) C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “How to escape saddle points efficiently,” in International Conference on Machine Learning, 2017, pp. 1724–1732.
  • (27) R. Kleinberg, Y. Li, and Y. Yuan, “An alternative view: When does sgd escape local minima?” arXiv preprint arXiv:1802.06175, 2018.
  • (28) V. Shah, M. Asteris, A. Kyrillidis, and S. Sanghavi, “Trading-off variance and complexity in stochastic gradient descent,” arXiv preprint arXiv:1603.06861, 2016.
  • (29) Y. Singer and J. C. Duchi, “Efficient learning using forward-backward splitting,” in Advances in Neural Information Processing Systems, 2009, pp. 495–503.
  • (30) L. Xiao, “Dual averaging methods for regularized stochastic learning and online optimization,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2543–2596, 2010.
  • (31) C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines (version 2.3)," 2001.
  • (32) M. Li, “Scaling distributed machine learning with system and algorithm co-design,” Ph.D. dissertation, PhD thesis, Intel, 2017.
  • (33) X. Zhang, L. Zhao, A. P. Boedihardjo, and C.-T. Lu, “Online and distributed robust regressions under adversarial data corruption,” in IEEE International Conference on Data Mining, 2017, pp. 625–634.
  • (34) Z. Huo, X. Jiang, and H. Huang, “Asynchronous dual free stochastic dual coordinate ascent for distributed data mining,” in IEEE International Conference on Data Mining, 2018.
  • (35) S. Gupta, W. Zhang, and F. Wang, “Model accuracy and runtime tradeoff in distributed deep learning: A systematic study,” in IEEE International Conference on Data Mining, 2016, pp. 171–180.
  • (36) S. De and T. Goldstein, “Efficient distributed sgd with variance reduction,” in IEEE International Conference on Data Mining, 2016.
  • (37) H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu, “Doublesqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression,” in International Conference on Machine Learning.   PMLR, 2019, pp. 6155–6165.
  • (38) M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Transactions on Signal Processing, vol. 68, pp. 2155–2169, 2020.
  • (39) S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified sgd with memory,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • (40) Y. Ko, K. Choi, H. Jei, D. Lee, and S.-W. Kim, “Aladdin: Asymmetric centralized training for distributed deep learning,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 863–872.
  • (41) N. Eshraghi and B. Liang, “Distributed online optimization over a heterogeneous network with any-batch mirror descent,” in International Conference on Machine Learning.   PMLR, 2020, pp. 2933–2942.
  • (42) X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • (43) R. Xin, S. Kar, and U. A. Khan, “Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 102–113, 2020.
  • (44) F. Alimisis, P. Davies, and D. Alistarh, “Communication-efficient distributed optimization with quantized preconditioners,” in International Conference on Machine Learning.   PMLR, 2021, pp. 196–206.
  • (45) S. Magnússon, H. Shokri-Ghadikolaei, and N. Li, “On maintaining linear convergence of distributed learning and optimization under limited communication,” IEEE Transactions on Signal Processing, vol. 68, pp. 6101–6116, 2020.
  • (46) P. Davies, V. Gurunathan, N. Moshrefi, S. Ashkboos, and D. Alistarh, “New bounds for distributed mean estimation and variance reduction,” International Conference on Learning Representations, 2021.

Appendix A Proofs

A.1. Proof of Proposition 1

Proof.

Assume $f_{t}({\bf x})$ is $L$-smooth. Then

\begin{split}f_{t}({\bf x}_{t+1})-f_{t}({\bf x}_{t})\leq f^{\prime}_{t}({\bf x}_{t})^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{L}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}\end{split}

Let $\psi_{t}({\bf x})=\frac{1}{2}{\bf x}^{\top}{\bf H}_{t}{\bf x}$, let $||\cdot||_{\psi_{t}}=\sqrt{<\cdot,{\bf H}_{t}\cdot>}$ be the associated Mahalanobis norm, and let $||\cdot||_{\psi_{t}^{*}}$ be its dual norm. Introducing the quantized gradient ${\bf q}_{t}$, we have

\begin{split}&\quad f_{t}({\bf x}_{t+1})-f_{t}({\bf x}_{t})\\ &\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{L}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}+(f^{\prime}_{t}({\bf x}_{t})-{\bf q}_{t})^{\top}({\bf x}_{t+1}-{\bf x}_{t})\\ &\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{L}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}+\frac{1}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}_{\psi_{t}}\\ &\quad+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}\end{split}

The second inequality follows from the Cauchy-Schwarz inequality together with $2ab\leq xa^{2}+\frac{b^{2}}{x}$ for any $x>0$. Here $B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})=\frac{1}{2}({\bf x}_{t+1}-{\bf x}_{t})^{\top}{\bf H}_{t}({\bf x}_{t+1}-{\bf x}_{t})=\frac{1}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}_{\psi_{t}}$ is the Bregman divergence induced by $\psi_{t}$. Since $\psi_{t}$ is a mirror map that is 1-strongly convex on $\chi$,

0\leq\frac{1}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}\leq B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})

So,

\begin{split}f_{t}({\bf x}_{t+1})-f_{t}({\bf x}_{t})&\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}\\ &\quad+(\frac{1}{2}+L)B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})\end{split}
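
For clarity, the cross term absorbed above is handled through the dual-norm pair $(||\cdot||_{\psi_{t}},||\cdot||_{\psi_{t}^{*}})$; written out, with $x=1$ in the inequality $2ab\leq xa^{2}+\frac{b^{2}}{x}$,

(f^{\prime}_{t}({\bf x}_{t})-{\bf q}_{t})^{\top}({\bf x}_{t+1}-{\bf x}_{t})\leq||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||_{\psi^{*}_{t}}\,||{\bf x}_{t+1}-{\bf x}_{t}||_{\psi_{t}}\leq\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}+\frac{1}{2}||{\bf x}_{t+1}-{\bf x}_{t}||^{2}_{\psi_{t}}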

Since $\psi_{t}^{\prime}({\bf x}_{t})-\psi_{t}^{\prime}({\bf x}_{t+1})={\bf H}_{t}({\bf x}_{t}-{\bf x}_{t+1})=\eta{\bf q}_{t}+\eta\phi^{\prime}({\bf x}_{t+1})$,

\begin{split}&\quad\eta[\phi^{\prime}({\bf x}_{t+1})+{\bf q}_{t}]^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ &=\left(\psi_{t}^{\prime}({\bf x}_{t})-\psi_{t}^{\prime}({\bf x}_{t+1})\right)^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ &=B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}_{t+1},{\bf x}_{t})\end{split}
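
The identity ${\bf H}_{t}({\bf x}_{t}-{\bf x}_{t+1})=\eta{\bf q}_{t}+\eta\phi^{\prime}({\bf x}_{t+1})$ used here is the first-order optimality condition of the composite mirror descent step. As a sketch, assuming the update takes the proximal form ${\bf x}_{t+1}=\arg\min_{{\bf x}\in\chi}\{\eta<{\bf q}_{t},{\bf x}>+\eta\phi({\bf x})+B_{\psi_{t}}({\bf x},{\bf x}_{t})\}$ (the exact update is given in the main text) and that the minimizer lies in the interior of $\chi$, setting the (sub)gradient of this objective at ${\bf x}_{t+1}$ to zero gives

\eta{\bf q}_{t}+\eta\phi^{\prime}({\bf x}_{t+1})+{\bf H}_{t}({\bf x}_{t+1}-{\bf x}_{t})=0,

which is exactly the identity above.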

If we set the learning rate $\eta=\frac{2}{1+2L}$, we have

\begin{split}&\quad f_{t}({\bf x}_{t+1})-f_{t}({\bf x}_{t})\\ &\leq{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}_{t})+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &\quad-{\bf q}_{t}^{\top}({\bf x}_{t+1}-{\bf x}^{*})-\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})\\ &={\bf q}_{t}^{\top}({\bf x}^{*}-{\bf x}_{t})+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &\quad-\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})\end{split}

By the convexity of $f_{t}$ and $\phi$, $f_{t}({\bf x}^{*})+\phi({\bf x}^{*})-[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})+(f^{\prime}_{t}({\bf x}_{t})+\phi^{\prime}({\bf x}_{t}))^{\top}({\bf x}^{*}-{\bf x}_{t})]\geq 0$, so

\begin{split}&\quad f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})\\ &\leq{\bf q}_{t}^{\top}({\bf x}^{*}-{\bf x}_{t})-f^{\prime}_{t}({\bf x}_{t})^{\top}({\bf x}^{*}-{\bf x}_{t})+\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})+\frac{1}{2}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}\\ &\quad+\frac{1}{\eta}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))-\phi^{\prime}({\bf x}_{t+1})^{\top}({\bf x}_{t+1}-{\bf x}^{*})\end{split}

If $\mathbb{E}[{\bf q}_{t}]=f^{\prime}_{t}({\bf x}_{t})$, taking expectations over the quantization on both sides of the inequality yields

\begin{split}&\quad\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq 0+\frac{1}{2}\mathbb{E}_{\bf q}||{\bf q}_{t}-f^{\prime}_{t}({\bf x}_{t})||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &\leq\frac{1}{2}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\end{split}

Summing the inequality over $t=1,\dots,T$,

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\sum_{t=1}^{T}\mathbb{E}_{\bf q}(B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1}))\\ &=\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{\eta}\mathbb{E}_{\bf q}B_{\psi_{1}}({\bf x}^{*},{\bf x}_{1})+\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[B_{\psi_{t+1}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})]\end{split}
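
The assumption $\mathbb{E}[{\bf q}_{t}]=f^{\prime}_{t}({\bf x}_{t})$ used above holds for any unbiased quantizer. As an illustration only, the following is a minimal Python sketch of a generic stochastic-rounding quantizer satisfying this assumption; it is not necessarily the threshold quantization scheme adopted in the main text.

import numpy as np

def unbiased_quantize(g, num_levels=4, rng=None):
    # Stochastic rounding onto a uniform grid scaled by ||g||_inf, so that E[q] = g.
    rng = np.random.default_rng() if rng is None else rng
    scale = np.max(np.abs(g))
    if scale == 0.0:
        return np.zeros_like(g)
    normalized = np.abs(g) / scale * (num_levels - 1)  # values in [0, num_levels - 1]
    lower = np.floor(normalized)
    prob_up = normalized - lower  # rounding up with this probability keeps the expectation exact
    levels = lower + (rng.random(g.shape) < prob_up)
    return np.sign(g) * levels * scale / (num_levels - 1)

Only the signs, the integer levels, and the single scalar scale need to be communicated, which is where the per-iteration communication saving comes from.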

A.2. Proof of Theorem 1

Proof.

The third term in the conclusion of Proposition 1 can be bounded as follows:

\begin{split}&\quad\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[B_{\psi_{t+1}}({\bf x}^{*},{\bf x}_{t+1})-B_{\psi_{t}}({\bf x}^{*},{\bf x}_{t+1})]\\ &=\frac{1}{\eta}\sum_{t=1}^{T-1}\mathbb{E}_{\bf q}[\frac{1}{2}({\bf x}^{*}-{\bf x}_{t+1})^{\top}({\bf H}_{t+1}-{\bf H}_{t})({\bf x}^{*}-{\bf x}_{t+1})]\\ &\leq\frac{1}{2\eta}\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}-\frac{1}{2\eta}||{\bf x}^{*}-{\bf x}_{1}||_{\infty}^{2}\mathbb{E}_{\bf q}<{\bf h}_{1},{\bf 1}>\end{split}

${\bf 1}$ denotes the all-ones vector and ${\bf h}_{1}$ the diagonal of ${\bf H}_{1}$. Since $\mathbb{E}_{\bf q}B_{\psi_{1}}({\bf x}^{*},{\bf x}_{1})\leq\frac{1}{2}||{\bf x}^{*}-{\bf x}_{1}||^{2}_{\infty}\mathbb{E}_{\bf q}<{\bf h}_{1},{\bf 1}>$,

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}+\frac{1}{2\eta}\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\end{split}

The bound $\frac{1}{2}\sum_{t=1}^{T}\mathbb{E}_{\bf q}||{\bf q}_{t}||^{2}_{\psi^{*}_{t}}\leq\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}$ has been proved in Lemma 4 of (duchi2011adaptive, ), so

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}+\frac{1}{2\eta}\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\end{split}

We define $D_{\infty}=\max_{t\leq T}||{\bf x}^{*}-{\bf x}_{t}||_{\infty}$ and $G_{\infty}=\max_{t\leq T,i\leq d}|{\bf q}_{t,i}|$, so that $||{\bf q}_{1:T,i}||_{2}\leq\sqrt{T}G_{\infty}$ for each coordinate $i$ and hence $\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\leq d\sqrt{T}G_{\infty}$. Then

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq d\sqrt{T}G_{\infty}+\frac{1}{2\eta}D_{\infty}^{2}d\sqrt{T}G_{\infty}\end{split}

Finally, by Jensen's inequality applied with the averaged iterate $\bar{{\bf x}}=\frac{1}{T}\sum_{t=1}^{T}{\bf x}_{t+1}$,

\begin{split}&\quad\mathbb{E}_{\bf q}[f(\bar{{\bf x}})+\phi(\bar{{\bf x}})-f({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t+1})+\phi({\bf x}_{t+1})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{dG_{\infty}}{\sqrt{T}}+\frac{dG_{\infty}D_{\infty}^{2}}{2\eta\sqrt{T}}\end{split}

A.3. Proof of Proposition 2

Proof.

Define $\psi_{t}^{*}({\bf g})$ to be the conjugate dual of $t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})$:

(25) \psi^{*}_{t}({\bf g})=\sup_{{\bf x}\in\chi}\{<{\bf g},{\bf x}>-t\phi({\bf x})-\frac{1}{\eta}\psi_{t}({\bf x})\}

If we set $\phi({\bf x})=\lambda||{\bf x}||_{1}$, then

\begin{split}\psi^{*}_{ti}({\bf g})=\left\{\begin{array}{rcl}({\bf g}_{i}-t\lambda)^{2}\frac{\eta}{4{\bf H}_{t,ii}},&&{\bf g}_{i}-t\lambda>0\\ ({\bf g}_{i}+t\lambda)^{2}\frac{\eta}{4{\bf H}_{t,ii}},&&{\bf g}_{i}+t\lambda<0\\ 0,&&\text{otherwise}\end{array}\right.\end{split}

So, $||\cdot||_{\psi_{t}^{*}}=\sqrt{<\cdot,\frac{\eta}{2}{\bf H}_{t}^{-1}\cdot>}$.

Since $\frac{\psi_{t}}{\eta}$ is $\frac{1}{\eta}$-strongly convex with respect to the norm $||\cdot||_{\psi_{t}}$, the function $\psi_{t}^{*}$ is $\eta$-smooth with respect to $||\cdot||_{\psi^{*}_{t}}$:

\psi_{t}^{*}({\bf y})\leq\psi_{t}^{*}({\bf x})+\triangledown\psi^{*}_{t}({\bf x})^{\top}({\bf y}-{\bf x})+\frac{\eta}{2}||{\bf y}-{\bf x}||^{2}_{\psi^{*}_{t}}
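
In particular, applying this inequality at index $t-1$ with ${\bf x}=-{\bf z}_{t-1}$ and ${\bf y}=-{\bf z}_{t}$, where ${\bf z}_{t}=\sum_{\tau=1}^{t}{\bf q}_{\tau}$ as introduced below, so that ${\bf y}-{\bf x}=-{\bf q}_{t}$, gives the form used in the telescoping argument:

\psi_{t-1}^{*}(-{\bf z}_{t})\leq\psi_{t-1}^{*}(-{\bf z}_{t-1})-\triangledown\psi^{*}_{t-1}(-{\bf z}_{t-1})^{\top}{\bf q}_{t}+\frac{\eta}{2}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}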

Further, a standard argument based on conjugate duality gives

(26) \triangledown\psi^{*}_{t}({\bf g})=\arg\min_{{\bf x}\in\chi}\{-<{\bf g},{\bf x}>+t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})\}

If ${\bf z}_{t}=\sum_{\tau=1}^{t}{\bf q}_{\tau}$, we have

(27) \begin{split}\triangledown\psi^{*}_{t}(-{\bf z}_{t})&=\arg\min_{{\bf x}\in\chi}\{<\sum_{\tau=1}^{t}{\bf q}_{\tau},{\bf x}>+t\phi({\bf x})+\frac{1}{\eta}\psi_{t}({\bf x})\}\\ &={\bf x}_{t+1}\end{split}
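
For intuition, when $\phi({\bf x})=\lambda||{\bf x}||_{1}$ and ${\bf H}_{t}$ is diagonal, the minimizer in Equation (27) has a coordinate-wise soft-thresholding form. The following is a minimal Python sketch assuming the per-coordinate objective ${\bf z}_{t,i}x+t\lambda|x|+\frac{{\bf H}_{t,ii}}{2\eta}x^{2}$; the exact constants depend on the definitions used in the main text.

import numpy as np

def rda_l1_update(z, h, t, lam, eta):
    # Coordinate-wise minimizer of  z_i * x + t * lam * |x| + h_i * x**2 / (2 * eta).
    shrunk = np.maximum(np.abs(z) - t * lam, 0.0)  # coordinates with |z_i| <= t*lam become exactly zero
    return -(eta / h) * np.sign(z) * shrunk

The explicit zeroing of coordinates with small accumulated gradients is what preserves the sparsity of the model discussed in the paper.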

To obtain the regret bound for QRDA adagrad, we first bound

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}({\bf x}_{t}-{\bf x}^{*})+\phi({\bf x}_{t+1})-\phi({\bf x}^{*})]\\ &\leq\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\sup_{{\bf x}\in\chi}\left\{-\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}]-T\phi({\bf x})-\frac{1}{\eta}\psi_{T}({\bf x})\right\}\\ &=\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\end{split}

By Equation (25) and Equation (27), it is clear that

\begin{split}\psi_{T}^{*}(-{\bf z}_{T})&=-\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{T+1}]-T\phi({\bf x}_{T+1})-\frac{1}{\eta}\psi_{T}({\bf x}_{T+1})\\ &\leq-\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{T+1}]-(T-1)\phi({\bf x}_{T+1})-\phi({\bf x}_{T+1})-\frac{1}{\eta}\psi_{T-1}({\bf x}_{T+1})\\ &\leq\sup_{{\bf x}\in\chi}\left\{-{\bf z}_{T}^{\top}{\bf x}-(T-1)\phi({\bf x})-\frac{1}{\eta}\psi_{T-1}({\bf x})\right\}-\phi({\bf x}_{T+1})\\ &=\psi^{*}_{T-1}(-{\bf z}_{T})-\phi({\bf x}_{T+1})\end{split}

The first inequality above follows from $\psi_{t+1}\geq\psi_{t}$. Further, the identity (27) and the fact that ${\bf z}_{T-1}-{\bf z}_{T}=-{\bf q}_{T}$ give

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\\ &\leq\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T-1}(-{\bf z}_{T})-\phi({\bf x}_{T+1})\\ &\leq\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})-\phi({\bf x}_{T+1})+\psi^{*}_{T-1}(-{\bf z}_{T-1})-\triangledown\psi^{*}_{T-1}(-{\bf z}_{T-1})^{\top}{\bf q}_{T}+\frac{\eta}{2}||{\bf q}_{T}||^{2}_{\psi^{*}_{T-1}}\\ &=\sum_{t=1}^{T-1}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T-1}(-{\bf z}_{T-1})+\frac{\eta}{2}||{\bf q}_{T}||^{2}_{\psi^{*}_{T-1}}\end{split}

We can repeat the same sequence of steps to see that

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\\ &\leq\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi_{0}^{*}(-{\bf z}_{0})+\frac{\eta}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\\ &=\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\frac{\eta}{2}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\end{split}
where the last equality uses ${\bf z}_{0}={\bf 0}$ and $\psi_{0}^{*}({\bf 0})=0$.

Taking expectations on both sides of the inequality, we obtain the result in Proposition 2. ∎

A.4. Proof of Theorem 2

Proof.

We set $\delta\geq\max_{t}||{\bf q}_{t}||^{2}_{\infty}$ and ${\bf H}_{t}=\delta{\bf I}+diag({\bf c}_{t})$ with ${\bf c}_{t,i}=||{\bf q}_{1:t,i}||_{2}$, in which case

\begin{split}||{\bf q}_{t}||^{2}_{\psi_{t-1}^{*}}\leq\frac{\eta}{2}{\bf q}_{t}^{\top}diag({\bf c}_{t})^{-1}{\bf q}_{t}\end{split}
and
\begin{split}\sum_{t=1}^{T}{\bf q}_{t}^{\top}diag({\bf c}_{t})^{-1}{\bf q}_{t}\leq 2\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}
hence
\begin{split}\sum_{t=1}^{T}||{\bf q}_{t}||^{2}_{\psi^{*}_{t-1}}\leq\eta\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

So,

\begin{split}&\quad\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\\ &\leq\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\frac{\eta^{2}}{2}\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

We also have

\begin{split}\psi_{T}({\bf x}^{*})&=\delta||{\bf x}^{*}||_{2}^{2}+<{\bf x}^{*},diag({\bf c}_{T}){\bf x}^{*}>\\ &\leq\delta||{\bf x}^{*}||_{2}^{2}+||{\bf x}^{*}||^{2}_{\infty}\sum_{i=1}^{d}||{\bf q}_{1:T,i}||_{2}\end{split}

If $\mathbb{E}[{\bf q}_{t}]=f^{\prime}_{t}({\bf x}_{t})$, we take expectations with respect to the quantization operation and set $G_{\infty}=\max_{t\leq T,i\leq d}|{\bf q}_{t,i}|$, so that $\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\leq d\sqrt{T}G_{\infty}$. Then

\begin{split}&\quad\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\sum_{t=1}^{T}\mathbb{E}_{\bf q}[(f^{\prime}_{t}({\bf x}_{t})-{\bf q}_{t})^{\top}({\bf x}_{t}-{\bf x}^{*})]+\mathbb{E}_{\bf q}\left\{\sum_{t=1}^{T}[{\bf q}_{t}^{\top}({\bf x}_{t}-{\bf x}^{*})+\phi({\bf x}_{t+1})-\phi({\bf x}^{*})]\right\}\\ &\leq\mathbb{E}_{\bf q}\left\{\sum_{t=1}^{T}[{\bf q}_{t}^{\top}{\bf x}_{t}+\phi({\bf x}_{t+1})]+\frac{1}{\eta}\psi_{T}({\bf x}^{*})+\psi^{*}_{T}(-{\bf z}_{T})\right\}\\ &\leq\frac{\delta}{\eta}||{\bf x}^{*}||_{2}^{2}+\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}+\frac{\eta^{2}}{2}\sum_{i=1}^{d}\mathbb{E}_{\bf q}||{\bf q}_{1:T,i}||_{2}\\ &\leq\frac{\delta}{\eta}||{\bf x}^{*}||_{2}^{2}+(\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}+\frac{\eta^{2}}{2})dG_{\infty}\sqrt{T}\end{split}

Finally,

\begin{split}&\quad\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\bf q}[f_{t}({\bf x}_{t})+\phi({\bf x}_{t})-f_{t}({\bf x}^{*})-\phi({\bf x}^{*})]\\ &\leq\frac{\delta||{\bf x}^{*}||_{2}^{2}}{\eta T}+\frac{(\frac{1}{\eta}||{\bf x}^{*}||^{2}_{\infty}+\frac{\eta^{2}}{2})dG_{\infty}}{\sqrt{T}}\end{split}