
Efficient Second Order Online Learning by Sketching

Haipeng Luo
Princeton University, Princeton, NJ USA
haipengl@cs.princeton.edu
Alekh Agarwal
Microsoft Research, New York, NY USA
alekha@microsoft.com
Nicolò Cesa-Bianchi
Università degli Studi di Milano, Italy
nicolo.cesa-bianchi@unimi.it
John Langford
Microsoft Research, New York, NY USA
jcl@microsoft.com
Abstract

We propose Sketched Online Newton (SON), an online second order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data. SON is an enhanced version of the Online Newton Step which, via sketching techniques, enjoys a running time linear in the dimension and sketch size. We further develop sparse forms of the sketching methods (such as Oja's rule), making the computation linear in the sparsity of features. Together, these improvements eliminate all computational obstacles of previous second order online learning approaches.

1 Introduction

Online learning methods are highly successful at rapidly reducing the test error on large, high-dimensional datasets. First order methods are particularly attractive in such problems as they typically enjoy computational complexity linear in the input size. However, the convergence of these methods crucially depends on the geometry of the data; for instance, running the same algorithm on a rotated set of examples can return vastly inferior results. See Fig. 1 for an illustration.

Second order algorithms such as Online Newton Step (Hazan et al., 2007) have the attractive property of being invariant to linear transformations of the data, but typically require space and update time quadratic in the number of dimensions. Furthermore, the dependence on dimension is not improved even if the examples are sparse. These issues lead to the key question in our work: Can we develop (approximately) second order online learning algorithms with efficient updates? We show that the answer is “yes” by developing efficient sketched second order methods with regret guarantees. Specifically, the three main contributions of this work are:

Figure 1: Error rate of SON using Oja's sketch, and AdaGrad on a synthetic ill-conditioned problem. m is the sketch size (m=0 is Online Gradient, m=d resembles Online Newton). SON is nearly invariant to condition number for m=10.

1. Invariant learning setting and optimal algorithms (Section 2).

The typical online regret minimization setting evaluates against a benchmark that is bounded in some fixed norm (such as the \ell_{2}-norm), implicitly putting the problem in a nice geometry. However, if all the features are scaled down, it is desirable to compare with accordingly larger weights, which is precluded by an a priori fixed norm bound. We study an invariant learning setting similar to that of Ross et al. (2013), which compares the learner to a benchmark constrained only to generate bounded predictions on the sequence of examples. We show that a variant of the Online Newton Step (Hazan et al., 2007), while quadratic in computation, remains regret-optimal in this more general setting, with a nearly matching lower bound.

2. Improved efficiency via sketching (Section 3).

To overcome the quadratic running time, we next develop sketched variants of the Newton update, approximating the second order information using a small number of carefully chosen directions, called a sketch. While the idea of data sketching is widely studied (Woodruff, 2014), as far as we know ours is the first work to apply it to a general adversarial online learning setting and provide rigorous regret guarantees. Two different sketching methods are considered: Frequent Directions (Ghashami et al., 2015; Liberty, 2013) and Oja's algorithm (Oja, 1982; Oja and Karhunen, 1985), both of which allow linear running time per round. For the first method, we prove regret bounds similar to the full second order update whenever the sketch size is large enough. Our analysis makes it easy to plug in other sketching and online PCA methods (e.g., Garber et al. (2015)).

3. Sparse updates (Section 4).

For practical implementation, we further develop sparse versions of these updates with a running time linear in the sparsity of the examples. The main challenge here is that even if examples are sparse, the sketch matrix still quickly becomes dense. These are the first known sparse implementations of Frequent Directions and Oja's algorithm (recent work by Ghashami et al. (2016) also studies sparse updates, but for a more complicated variant of Frequent Directions that is randomized and incurs extra approximation error), and they require new sparse eigen-computation routines that may be of independent interest.

Empirically, we evaluate our algorithm using the sparse Oja sketch (called Oja-SON) against first order methods such as diagonalized AdaGrad (Duchi et al., 2011; McMahan and Streeter, 2010) on both ill-conditioned synthetic datasets and a suite of real-world datasets. As Fig. 1 shows for a synthetic problem, we observe substantial performance gains as data conditioning worsens. On the real-world datasets, we find improvements in some instances, while observing no substantial second order signal in the others.

Related work

Our online learning setting is closest to the one proposed in (Ross et al., 2013), which studies scale-invariant algorithms, a special case of the invariance property considered here (see also (Orabona et al., 2015, Section 5)). Computational efficiency, a main concern in this work, is not a problem there since each coordinate is scaled independently. Orabona and Pál (2015) study unrelated notions of invariance. Gao et al. (2013) study a specific randomized sketching method for a special online learning setting.

The L-BFGS algorithm (Liu and Nocedal, 1989) has recently been studied in the stochastic setting, where examples are drawn i.i.d. from a distribution (Byrd et al., 2016; Mokhtari and Ribeiro, 2015; Moritz et al., 2016; Schraudolph et al., 2007; Sohl-Dickstein et al., 2014), but these works rely on strong assumptions with pessimistic rates in theory, and on the use of large mini-batches empirically. Recent works (Erdogdu and Montanari, 2015; Gonen et al., 2016; Gonen and Shalev-Shwartz, 2015; Pilanci and Wainwright, 2015) employ sketching in stochastic optimization, but do not provide sparse implementations or extend in an obvious manner to the online setting. The Frank-Wolfe algorithm (Frank and Wolfe, 1956; Jaggi, 2013) is also invariant to linear transformations, but has worse regret bounds (Hazan and Kale, 2012) without further assumptions and modifications (Garber and Hazan, 2016).

Notation

Vectors are represented by bold letters (e.g., \boldsymbol{x}, \boldsymbol{w}, ...) and matrices by capital letters (e.g., M, A, ...). M_{i,j} denotes the (i,j) entry of matrix M. \boldsymbol{I}_{d} represents the d\times d identity matrix, \boldsymbol{0}_{m\times d} represents the m\times d matrix of zeroes, and \mathrm{diag}\{\boldsymbol{x}\} represents a diagonal matrix with \boldsymbol{x} on the diagonal. \lambda_{i}(A) denotes the i-th largest eigenvalue of A, \left\|{\boldsymbol{w}}\right\|_{A} denotes \sqrt{\boldsymbol{w}^{\top}A\boldsymbol{w}}, |A| is the determinant of A, \textsc{tr}(A) is the trace of A, \left\langle{A,B}\right\rangle denotes \sum_{i,j}A_{i,j}B_{i,j}, and A\preceq B means that B-A is positive semidefinite. The sign function \mbox{\sc sgn}(a) is 1 if a\geq 0 and -1 otherwise.

2 Setup and an Optimal Algorithm

We consider the following setting. On each round t=1,2,\ldots,T: (1) the adversary first presents an example \boldsymbol{x}_{t}\in\mathbb{R}^{d}, (2) the learner chooses \boldsymbol{w}_{t}\in\mathbb{R}^{d} and predicts \boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{t}, (3) the adversary reveals a loss function f_{t}(\boldsymbol{w})=\ell_{t}(\boldsymbol{w}^{\top}\boldsymbol{x}_{t}) for some convex, differentiable \ell_{t}:\mathbb{R}\rightarrow\mathbb{R}_{+}, and (4) the learner suffers loss f_{t}(\boldsymbol{w}_{t}) for this round.

The learner's regret to a comparator \boldsymbol{w} is defined as R_{T}(\boldsymbol{w})=\sum_{t=1}^{T}f_{t}(\boldsymbol{w}_{t})-\sum_{t=1}^{T}f_{t}(\boldsymbol{w}). Typical results study R_{T}(\boldsymbol{w}) against all \boldsymbol{w} with a bounded norm in some geometry. For an invariant update, we relax this requirement and only put bounds on the predictions \boldsymbol{w}^{\top}\boldsymbol{x}_{t}. Specifically, for some pre-chosen constant C we define \mathcal{K}_{t}\stackrel{\rm def}{=}\left\{\boldsymbol{w}\,:\,|\boldsymbol{w}^{\top}\boldsymbol{x}_{t}|\leq C\right\}. We seek to minimize regret to all comparators that generate bounded predictions on every data point, that is:

R_{T}=\sup_{\boldsymbol{w}\in\mathcal{K}}R_{T}(\boldsymbol{w})\quad\mbox{where}\quad\mathcal{K}\stackrel{\rm def}{=}\bigcap_{t=1}^{T}\mathcal{K}_{t}=\left\{\boldsymbol{w}\,:\,\forall t=1,2,\ldots,T,~|\boldsymbol{w}^{\top}\boldsymbol{x}_{t}|\leq C\right\}.

Under this setup, if the data are transformed to M\boldsymbol{x}_{t} for all t and some invertible matrix M\in\mathbb{R}^{d\times d}, the optimal \boldsymbol{w}^{*} simply moves to (M^{-1})^{\top}\boldsymbol{w}^{*}, which still has bounded predictions but might have a significantly larger norm. This relaxation is similar to the comparator set considered in Ross et al. (2013).

We make two structural assumptions on the loss functions.

Assumption 1.

(Scalar Lipschitz) The loss function \ell_{t} satisfies |\ell_{t}^{\prime}(z)|\leq L whenever |z|\leq C.

Assumption 2.

(Curvature) There exists \sigma_{t}\geq 0 such that for all \boldsymbol{u},\boldsymbol{w}\in\mathcal{K}, f_{t}(\boldsymbol{w}) is lower bounded by f_{t}(\boldsymbol{u})+\nabla f_{t}(\boldsymbol{u})^{\top}(\boldsymbol{w}-\boldsymbol{u})+\frac{\sigma_{t}}{2}\left(\nabla f_{t}(\boldsymbol{u})^{\top}(\boldsymbol{u}-\boldsymbol{w})\right)^{2}.

Note that when \sigma_{t}=0, Assumption 2 merely imposes convexity. More generally, it is satisfied by the squared loss f_{t}(\boldsymbol{w})=(\boldsymbol{w}^{\top}\boldsymbol{x}_{t}-y_{t})^{2} with \sigma_{t}=\frac{1}{8C^{2}} whenever |\boldsymbol{w}^{\top}\boldsymbol{x}_{t}| and |y_{t}| are bounded by C, as well as by all exp-concave functions (see (Hazan et al., 2007, Lemma 3)).
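As a quick sanity check of the squared-loss claim (a short calculation of our own, using only the definitions above): write a=\boldsymbol{u}^{\top}\boldsymbol{x}_{t}-y_{t} and b=(\boldsymbol{w}-\boldsymbol{u})^{\top}\boldsymbol{x}_{t}, so that |a|\leq 2C. Then

```latex
% Squared loss f_t(w) = (w^T x_t - y_t)^2, with a = u^T x_t - y_t and b = (w-u)^T x_t:
f_t(\boldsymbol{w}) - f_t(\boldsymbol{u}) - \nabla f_t(\boldsymbol{u})^{\top}(\boldsymbol{w}-\boldsymbol{u})
  = (a+b)^2 - a^2 - 2ab = b^2,
\qquad
\bigl(\nabla f_t(\boldsymbol{u})^{\top}(\boldsymbol{u}-\boldsymbol{w})\bigr)^2
  = 4a^2 b^2 \;\leq\; 16C^2 b^2 .
```

Hence \frac{\sigma_{t}}{2}\bigl(\nabla f_{t}(\boldsymbol{u})^{\top}(\boldsymbol{u}-\boldsymbol{w})\bigr)^{2}\leq 8C^{2}\sigma_{t}\,b^{2}\leq b^{2} whenever \sigma_{t}\leq\frac{1}{8C^{2}}, which is exactly the lower bound required by Assumption 2.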

Enlarging the comparator set might result in worse regret. We next show matching upper and lower bounds qualitatively similar to the standard setting, but with an extra, unavoidable \sqrt{d} factor. (In the standard setting where \boldsymbol{w}_{t} and \boldsymbol{x}_{t} are restricted such that \left\|{\boldsymbol{w}_{t}}\right\|\leq D and \left\|{\boldsymbol{x}_{t}}\right\|\leq X, the minimax regret is \mathcal{O}(DXL\sqrt{T}); this is clearly a special case of our setting with C=DX.)

Theorem 1.

For any online algorithm generating \boldsymbol{w}_{t}\in\mathbb{R}^{d} and all T\geq d, there exists a sequence of T examples \boldsymbol{x}_{t}\in\mathbb{R}^{d} and loss functions \ell_{t} satisfying Assumptions 1 and 2 (with \sigma_{t}=0) such that the regret R_{T} is at least CL\sqrt{dT/2}.

We now give an algorithm that matches the lower bound up to logarithmic constants in the worst case but enjoys much smaller regret when \sigma_{t}\neq 0. At round t+1, with some invertible matrix A_{t} specified later and gradient \boldsymbol{g}_{t}=\nabla f_{t}(\boldsymbol{w}_{t}), the algorithm performs the following update before making the prediction on the example \boldsymbol{x}_{t+1}:

\boldsymbol{u}_{t+1}=\boldsymbol{w}_{t}-A_{t}^{-1}\boldsymbol{g}_{t},\quad\mbox{and}\quad\boldsymbol{w}_{t+1}=\operatorname*{argmin}_{\boldsymbol{w}\in\mathcal{K}_{t+1}}\left\|{\boldsymbol{w}-\boldsymbol{u}_{t+1}}\right\|_{A_{t}}. (1)

The projection onto the set \mathcal{K}_{t+1} differs from typical norm-based projections as it only enforces boundedness on \boldsymbol{x}_{t+1} at round t+1. Moreover, this projection step can be performed in closed form.

Lemma 1.

For any \boldsymbol{x}\neq\boldsymbol{0},\boldsymbol{u}\in\mathbb{R}^{d} and positive definite matrix A\in\mathbb{R}^{d\times d}, we have

\operatorname*{argmin}_{\boldsymbol{w}\,:\,|\boldsymbol{w}^{\top}\boldsymbol{x}|\leq C}\left\|{\boldsymbol{w}-\boldsymbol{u}}\right\|_{A}=\boldsymbol{u}-\frac{\tau_{C}(\boldsymbol{u}^{\top}\boldsymbol{x})}{\boldsymbol{x}^{\top}A^{-1}\boldsymbol{x}}A^{-1}\boldsymbol{x},\quad\mbox{where }\tau_{C}(y)=\mbox{\sc sgn}(y)\max\{|y|-C,0\}.
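For illustration, the following small NumPy sketch (our own, not the paper's implementation) carries out one step of update (1) with the closed-form projection of Lemma 1; the matrix A is assumed positive definite, and `loss-specific` gradients are supplied by the caller.

```python
import numpy as np

def tau(y, C):
    # Shrinkage toward the interval [-C, C]: sgn(y) * max(|y| - C, 0).
    return np.sign(y) * max(abs(y) - C, 0.0)

def ons_step(w, g, A, x_next, C):
    """Descent step u_{t+1} = w_t - A^{-1} g_t followed by the closed-form
    projection onto K_{t+1} = {w : |w^T x_next| <= C} in the A-norm (Lemma 1)."""
    A_inv = np.linalg.inv(A)      # O(d^3) here; the sketched version below avoids this
    u = w - A_inv @ g
    Ax = A_inv @ x_next
    return u - tau(u @ x_next, C) / (x_next @ Ax) * Ax
```

In the full algorithm, A would be maintained as in Eq. (2), i.e., A is incremented by (\sigma_t+\eta_t)\boldsymbol{g}_t\boldsymbol{g}_t^{\top} after each round.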

If A_{t} is a diagonal matrix, updates similar to those of Ross et al. (2013) are recovered. We study a choice of A_{t} that is similar to the Online Newton Step (ONS) (Hazan et al., 2007) (though with different projections):

A_{t}=\alpha\boldsymbol{I}_{d}+\sum_{s=1}^{t}(\sigma_{s}+\eta_{s})\boldsymbol{g}_{s}\boldsymbol{g}_{s}^{\top} (2)

for some parameters \alpha>0 and \eta_{t}\geq 0. The regret guarantee of this algorithm is shown below:

Theorem 2.

Under Assumptions 1 and 2, suppose that \sigma_{t}\geq\sigma\geq 0 for all t, and \eta_{t} is non-increasing. Then using the matrices (2) in the updates (1) yields, for all \boldsymbol{w}\in\mathcal{K},

R_{T}(\boldsymbol{w})\leq\frac{\alpha}{2}\left\|{\boldsymbol{w}}\right\|_{2}^{2}+2(CL)^{2}\sum_{t=1}^{T}\eta_{t}+\frac{d}{2(\sigma+\eta_{T})}\ln\left(1+\frac{(\sigma+\eta_{T})\sum_{t=1}^{T}\left\|{\boldsymbol{g}_{t}}\right\|_{2}^{2}}{d\alpha}\right).

The dependence on \left\|{\boldsymbol{w}}\right\|_{2}^{2} implies that the method is not completely invariant to transformations of the data. This is due to the \alpha\boldsymbol{I}_{d} part of A_{t}. However, this is not critical since \alpha is fixed and small while the other part of the bound grows to eventually become the dominating term. Moreover, we can even set \alpha=0 and replace the inverse with the Moore-Penrose pseudoinverse to obtain a truly invariant algorithm, as discussed in Appendix D. We use \alpha>0 in the remainder for simplicity.

The implication of this regret bound is the following: in the worst case where \sigma=0, we set \eta_{t}=\sqrt{d/C^{2}L^{2}t} and the bound simplifies to

R_{T}(\boldsymbol{w})\leq\frac{\alpha}{2}\left\|{\boldsymbol{w}}\right\|_{2}^{2}+\frac{CL}{2}\sqrt{Td}\ln\left(1+\frac{\sum_{t=1}^{T}\left\|{\boldsymbol{g}_{t}}\right\|_{2}^{2}}{\alpha CL\sqrt{Td}}\right)+4CL\sqrt{Td},

essentially only losing a logarithmic factor compared to the lower bound in Theorem 1. On the other hand, if \sigma_{t}\geq\sigma>0 for all t, then we set \eta_{t}=0 and the regret simplifies to

R_{T}(\boldsymbol{w})\leq\frac{\alpha}{2}\left\|{\boldsymbol{w}}\right\|_{2}^{2}+\frac{d}{2\sigma}\ln\left(1+\frac{\sigma\sum_{t=1}^{T}\left\|{\boldsymbol{g}_{t}}\right\|_{2}^{2}}{d\alpha}\right), (3)

extending the \mathcal{O}(d\ln T) results in (Hazan et al., 2007) to the weaker Assumption 2 and a larger comparator set \mathcal{K}.

3 Efficiency via Sketching

Our algorithm so far requires \Omega(d^{2}) time and space, just as ONS. In this section we show how to achieve regret guarantees nearly as good as the above bounds, while keeping computation within a constant factor of first order methods.

Let G_{t}\in\mathbb{R}^{t\times d} be a matrix such that the t-th row is \widehat{\boldsymbol{g}}_{t}^{\top}, where we define \widehat{\boldsymbol{g}}_{t}=\sqrt{\sigma_{t}+\eta_{t}}\boldsymbol{g}_{t} to be the to-sketch vector. Our previous choice of A_{t} (Eq. (2)) can be written as \alpha\boldsymbol{I}_{d}+G_{t}^{\top}G_{t}. The idea of sketching is to maintain an approximation of G_{t}, denoted by S_{t}\in\mathbb{R}^{m\times d}, where m\ll d is a small constant called the sketch size. If m is chosen so that S_{t}^{\top}S_{t} approximates G_{t}^{\top}G_{t} well, we can redefine A_{t} as \alpha\boldsymbol{I}_{d}+S_{t}^{\top}S_{t} for the algorithm.

Algorithm 1 Sketched Online Newton (SON)
1: Parameters C, \alpha and m.
2: Initialize \boldsymbol{u}_{1}=\boldsymbol{0}_{d\times 1}.
3: Initialize sketch (S,H)\leftarrow\textbf{SketchInit}(\alpha,m).
4: for t=1 to T do
5:   Receive example \boldsymbol{x}_{t}.
6:   Projection step: compute \widehat{\boldsymbol{x}}=S\boldsymbol{x}_{t} and \gamma=\frac{\tau_{C}(\boldsymbol{u}_{t}^{\top}\boldsymbol{x}_{t})}{\boldsymbol{x}_{t}^{\top}\boldsymbol{x}_{t}-\widehat{\boldsymbol{x}}^{\top}H\widehat{\boldsymbol{x}}}, and set \boldsymbol{w}_{t}=\boldsymbol{u}_{t}-\gamma(\boldsymbol{x}_{t}-S^{\top}H\widehat{\boldsymbol{x}}).
7:   Predict label y_{t}=\boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{t} and suffer loss \ell_{t}(y_{t}).
8:   Compute gradient \boldsymbol{g}_{t}=\ell^{\prime}_{t}(y_{t})\boldsymbol{x}_{t} and the to-sketch vector \widehat{\boldsymbol{g}}=\sqrt{\sigma_{t}+\eta_{t}}\boldsymbol{g}_{t}.
9:   (S,H)\leftarrow SketchUpdate(\widehat{\boldsymbol{g}}).
10:  Update weight: \boldsymbol{u}_{t+1}=\boldsymbol{w}_{t}-\frac{1}{\alpha}(\boldsymbol{g}_{t}-S^{\top}HS\boldsymbol{g}_{t}).
11: end for

To see why this admits an efficient algorithm, notice that by the Woodbury formula one has A_{t}^{-1}=\frac{1}{\alpha}\bigl(\boldsymbol{I}_{d}-S_{t}^{\top}(\alpha\boldsymbol{I}_{m}+S_{t}S_{t}^{\top})^{-1}S_{t}\bigr). With the notation H_{t}=(\alpha\boldsymbol{I}_{m}+S_{t}S_{t}^{\top})^{-1}\in\mathbb{R}^{m\times m} and \gamma_{t}=\tau_{C}(\boldsymbol{u}_{t+1}^{\top}\boldsymbol{x}_{t+1})/(\boldsymbol{x}_{t+1}^{\top}\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t+1}^{\top}S_{t}^{\top}H_{t}S_{t}\boldsymbol{x}_{t+1}), update (1) becomes:

\boldsymbol{u}_{t+1}=\boldsymbol{w}_{t}-\tfrac{1}{\alpha}\bigl(\boldsymbol{g}_{t}-S_{t}^{\top}H_{t}S_{t}\boldsymbol{g}_{t}\bigr),\quad\mbox{and}\quad\boldsymbol{w}_{t+1}=\boldsymbol{u}_{t+1}-\gamma_{t}\bigl(\boldsymbol{x}_{t+1}-S_{t}^{\top}H_{t}S_{t}\boldsymbol{x}_{t+1}\bigr).

The operations involving S_{t}\boldsymbol{g}_{t} or S_{t}\boldsymbol{x}_{t+1} require only \mathcal{O}(md) time, while matrix-vector products with H_{t} require only \mathcal{O}(m^{2}). Altogether, these updates are at most m times more expensive than first order algorithms as long as S_{t} and H_{t} can be maintained efficiently. We call this algorithm Sketched Online Newton (SON) and summarize it in Algorithm 1.
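To make the computational savings concrete, here is a minimal NumPy sketch of one SON round (ours, not the paper's Vowpal Wabbit code). It treats the sketch update as a black box `sketch_update` that returns the refreshed pair (S, H) with H = (\alpha I_m + S S^\top)^{-1} kept as an m\times m array; `loss_grad` and `sigma_eta` are likewise caller-supplied assumptions.

```python
import numpy as np

def son_round(u, x, C, alpha, S, H, loss_grad, sigma_eta, sketch_update):
    """One round of Algorithm 1 with sketch S (m x d) and H = (alpha*I_m + S S^T)^{-1}."""
    # Projection step onto {w : |w^T x| <= C}, using A^{-1} = (I - S^T H S) / alpha.
    p = u @ x
    shrink = np.sign(p) * max(abs(p) - C, 0.0)           # tau_C(u^T x)
    x_hat = S @ x                                        # O(md)
    gamma = shrink / (x @ x - x_hat @ (H @ x_hat))
    w = u - gamma * (x - S.T @ (H @ x_hat))
    # Predict, then form the gradient and the to-sketch vector.
    y = w @ x
    g = loss_grad(y) * x
    g_hat = np.sqrt(sigma_eta) * g                       # sigma_t + eta_t supplied by caller
    S, H = sketch_update(g_hat)                          # e.g. an FD or Oja sketch (below)
    # Descent step: u_{t+1} = w_t - A_t^{-1} g_t.
    u_next = w - (g - S.T @ (H @ (S @ g))) / alpha
    return w, y, u_next, S, H
```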

We now discuss two sketching techniques to maintain the matrices S_{t} and H_{t} efficiently, each requiring \mathcal{O}(md) storage and time linear in d.

Frequent Directions (FD).

Algorithm 2 FD-Sketch for FD-SON
Internal state: S and H.
SketchInit(\alpha, m)
1: Set S=\boldsymbol{0}_{m\times d} and H=\tfrac{1}{\alpha}\boldsymbol{I}_{m}.
2: Return (S,H).
SketchUpdate(\widehat{\boldsymbol{g}})
1: Insert \widehat{\boldsymbol{g}} into the last row of S.
2: Compute the eigendecomposition V^{\top}\Sigma V=S^{\top}S and set S=(\Sigma-\Sigma_{m,m}\boldsymbol{I}_{m})^{\frac{1}{2}}V.
3: Set H=\mathrm{diag}\left\{\frac{1}{\alpha+\Sigma_{1,1}-\Sigma_{m,m}},\cdots,\frac{1}{\alpha}\right\}.
4: Return (S,H).

Algorithm 3 Oja's Sketch for Oja-SON
Internal state: t, \Lambda, V and H.
SketchInit(\alpha, m)
1: Set t=0, \Lambda=\boldsymbol{0}_{m\times m}, H=\tfrac{1}{\alpha}\boldsymbol{I}_{m} and V to any m\times d matrix with orthonormal rows.
2: Return (\boldsymbol{0}_{m\times d}, H).
SketchUpdate(\widehat{\boldsymbol{g}})
1: Update t\leftarrow t+1, and update \Lambda and V as in Eqn. (4).
2: Set S=(t\Lambda)^{\frac{1}{2}}V.
3: Set H=\mathrm{diag}\left\{\frac{1}{\alpha+t\Lambda_{1,1}},\cdots,\frac{1}{\alpha+t\Lambda_{m,m}}\right\}.
4: Return (S,H).

Frequent Directions sketch (Ghashami et al., 2015; Liberty, 2013) is a deterministic sketching method. It maintains the invariant that the last row of S_{t} is always \boldsymbol{0}. On each round, the vector \widehat{\boldsymbol{g}}_{t}^{\top} is inserted into the last row of S_{t-1}, then the covariance of the resulting matrix is eigendecomposed into V_{t}^{\top}\Sigma_{t}V_{t} and S_{t} is set to (\Sigma_{t}-\rho_{t}\boldsymbol{I}_{m})^{\frac{1}{2}}V_{t}, where \rho_{t} is the smallest eigenvalue. Since the rows of S_{t} are orthogonal to each other, H_{t} is a diagonal matrix and can be maintained efficiently (see Algorithm 2). The sketch update works in \mathcal{O}(md) time (see (Ghashami et al., 2015) and Appendix F), so the total running time is \mathcal{O}(md) per round. We call this combination FD-SON and prove the following regret bound, with notation \Omega_{k}=\sum_{i=k+1}^{d}\lambda_{i}(G_{T}^{\top}G_{T}) for any k=0,\dots,m-1.
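A compact, unoptimized NumPy rendering of the FD-Sketch update (our own illustration; the O(md) implementation referenced above maintains the decomposition incrementally instead of recomputing an SVD each round):

```python
import numpy as np

def fd_sketch_update(S, g_hat, alpha):
    """One Frequent Directions step: insert g_hat into the (zero) last row,
    then shrink every direction by the smallest squared singular value.
    Returns the new sketch S (m x d) and the diagonal of H = (alpha*I_m + S S^T)^{-1}."""
    S = S.copy()
    S[-1, :] = g_hat
    # Thin SVD: S = U diag(s) Vt, so S^T S = Vt^T diag(s^2) Vt.
    _, s, Vt = np.linalg.svd(S, full_matrices=False)
    rho = s[-1] ** 2                         # smallest eigenvalue of S^T S
    eig = s ** 2 - rho                       # shrunken spectrum; last entry becomes 0
    S_new = np.sqrt(eig)[:, None] * Vt       # (Sigma - rho * I)^{1/2} V
    H_diag = 1.0 / (alpha + eig)             # rows of S_new are orthogonal, so H is diagonal
    return S_new, H_diag
```

Since the rows of the returned sketch are orthogonal, H is stored here as just its diagonal, matching Algorithm 2.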

Theorem 3.

Under Assumptions 1 and 2, suppose that \sigma_{t}\geq\sigma\geq 0 for all t and \eta_{t} is non-increasing. FD-SON ensures that for any \boldsymbol{w}\in\mathcal{K} and k=0,\ldots,m-1, we have

R_{T}(\boldsymbol{w})\leq\frac{\alpha}{2}\left\|{\boldsymbol{w}}\right\|_{2}^{2}+2(CL)^{2}\sum_{t=1}^{T}\eta_{t}+\frac{m}{2(\sigma+\eta_{T})}\ln\left(1+\frac{\textsc{tr}({S_{T}^{\top}S_{T}})}{m\alpha}\right)+\frac{m\Omega_{k}}{2(m-k)(\sigma+\eta_{T})\alpha}.

The bound depends on the spectral decay \Omega_{k}, which essentially is the only extra term compared to the bound in Theorem 2. Similarly to the previous discussion, if \sigma_{t}\geq\sigma, we get the bound \frac{\alpha}{2}\left\|{\boldsymbol{w}}\right\|_{2}^{2}+\frac{m}{2\sigma}\ln\left(1+\frac{\textsc{tr}({S_{T}^{\top}S_{T}})}{m\alpha}\right)+\frac{m\Omega_{k}}{2(m-k)\sigma\alpha}. With \alpha tuned well, we pay logarithmic regret for the top m eigenvectors, but a square root regret \mathcal{O}(\sqrt{\Omega_{k}}) for the remaining directions not controlled by our sketch. This is expected for deterministic sketching, which focuses on the dominant part of the spectrum. When \alpha is not tuned we still get sublinear regret as long as \Omega_{k} is sublinear.

Oja’s Algorithm.

Oja's algorithm (Oja, 1982; Oja and Karhunen, 1985) is not usually considered as a sketching algorithm, but it seems very natural here. This algorithm uses online gradient descent to find eigenvectors and eigenvalues of data in a streaming fashion, with the to-sketch vectors \widehat{\boldsymbol{g}}_{t} as the input. Specifically, let V_{t}\in\mathbb{R}^{m\times d} denote the estimated eigenvectors and let the diagonal matrix \Lambda_{t}\in\mathbb{R}^{m\times m} contain the estimated eigenvalues at the end of round t. Oja's algorithm updates as:

\Lambda_{t}=(\boldsymbol{I}_{m}-\Gamma_{t})\Lambda_{t-1}+\Gamma_{t}\,\mathrm{diag}\{V_{t-1}\widehat{\boldsymbol{g}}_{t}\}^{2},\quad\quad V_{t}\xleftarrow{\text{orth}}V_{t-1}+\Gamma_{t}V_{t-1}\widehat{\boldsymbol{g}}_{t}\widehat{\boldsymbol{g}}_{t}^{\top} (4)

where \Gamma_{t}\in\mathbb{R}^{m\times m} is a diagonal matrix with (possibly different) learning rates of order \Theta(1/t) on the diagonal, and the "\xleftarrow{\text{orth}}" operator represents an orthonormalizing step. (For simplicity, we assume that V_{t-1}+\Gamma_{t}V_{t-1}\widehat{\boldsymbol{g}}_{t}\widehat{\boldsymbol{g}}_{t}^{\top} is always of full rank so that the orthonormalizing step does not reduce the dimension of V_{t}.) The sketch is then S_{t}=(t\Lambda_{t})^{\frac{1}{2}}V_{t}. The rows of S_{t} are orthogonal, and thus H_{t} is an efficiently maintainable diagonal matrix (see Algorithm 3). We call this combination Oja-SON.
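The corresponding dense NumPy sketch of one Oja update (again our own illustration; a QR factorization stands in for the orthonormalization step, and a common 1/t learning rate is assumed as in the experiments):

```python
import numpy as np

def oja_sketch_update(t, Lam, V, g_hat, alpha):
    """One step of Oja's algorithm (Eqn. 4). Lam holds the m estimated eigenvalues,
    V (m x d) the estimated eigenvectors with orthonormal rows."""
    t += 1
    gamma = 1.0 / t                                  # Gamma_t = (1/t) I_m
    proj = V @ g_hat                                 # V_{t-1} g_hat, an m-vector
    Lam = (1 - gamma) * Lam + gamma * proj ** 2      # eigenvalue estimates
    V = V + gamma * np.outer(proj, g_hat)            # gradient step on the eigenvectors
    Q, _ = np.linalg.qr(V.T)                         # orthonormalize the rows of V
    V = Q.T
    S = np.sqrt(t * Lam)[:, None] * V                # S_t = (t * Lambda_t)^{1/2} V_t
    H_diag = 1.0 / (alpha + t * Lam)                 # rows of S are orthogonal
    return t, Lam, V, S, H_diag
```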

The time complexity of Oja's algorithm is \mathcal{O}(m^{2}d) per round due to the orthonormalizing step. To improve the running time to \mathcal{O}(md), one can update the sketch only every m rounds (similar to the block power method (Hardt and Price, 2014; Li et al., 2015)). The regret guarantee of this algorithm is unclear since existing analyses of Oja's algorithm are only for the stochastic setting (see e.g. (Balsubramani et al., 2013; Li et al., 2015)). However, Oja-SON provides good performance experimentally.

4 Sparse Implementation

In many applications, examples (and hence gradients) are sparse, in the sense that \left\|{\boldsymbol{x}_{t}}\right\|_{0}\leq s for all t and some small constant s\ll d. Most online first order methods enjoy a per-example running time depending on s instead of d in such settings. Achieving the same for second order methods is more difficult, since A_{t}^{-1}\boldsymbol{g}_{t} (or its sketched version) is typically dense even if \boldsymbol{g}_{t} is sparse.

We show how to implement our algorithms in sparsity-dependent time: specifically, in \mathcal{O}(m^{2}+ms) for FD-SON and in \mathcal{O}(m^{3}+ms) for Oja-SON. We emphasize that since the sketch still quickly becomes a dense matrix even if the examples are sparse, achieving purely sparsity-dependent time is highly non-trivial and may be of independent interest. Due to space limits, below we only briefly describe how to do this for Oja-SON. A similar discussion for the FD sketch can be found in Appendix F. Note that mathematically these updates are equivalent to their non-sparse counterparts, so the regret guarantees are unchanged.

There are two ingredients to doing this for Oja-SON: (1) the eigenvectors V_{t} are represented as V_{t}=F_{t}Z_{t}, where Z_{t}\in\mathbb{R}^{m\times d} is a sparsely updatable direction (Step 3 in Algorithm 5) and F_{t}\in\mathbb{R}^{m\times m} is a matrix such that F_{t}Z_{t} is orthonormal; (2) the weights \boldsymbol{w}_{t} are split as \bar{\boldsymbol{w}}_{t}+Z_{t-1}^{\top}\boldsymbol{b}_{t}, where \boldsymbol{b}_{t}\in\mathbb{R}^{m} maintains the weights on the subspace captured by V_{t-1} (same as Z_{t-1}), and \bar{\boldsymbol{w}}_{t} captures the weights on the complementary subspace, which are again updated sparsely.

We describe the sparse updates for \bar{\boldsymbol{w}}_{t} and \boldsymbol{b}_{t} below, with the details for F_{t} and Z_{t} deferred to Appendix G. Since S_{t}=(t\Lambda_{t})^{\frac{1}{2}}V_{t}=(t\Lambda_{t})^{\frac{1}{2}}F_{t}Z_{t} and \boldsymbol{w}_{t}=\bar{\boldsymbol{w}}_{t}+Z_{t-1}^{\top}\boldsymbol{b}_{t}, we know that \boldsymbol{u}_{t+1} is

\boldsymbol{w}_{t}-\bigl(\boldsymbol{I}_{d}-S_{t}^{\top}H_{t}S_{t}\bigr)\tfrac{\boldsymbol{g}_{t}}{\alpha}=\underbrace{\bar{\boldsymbol{w}}_{t}-\tfrac{\boldsymbol{g}_{t}}{\alpha}-(Z_{t}-Z_{t-1})^{\top}\boldsymbol{b}_{t}}_{\stackrel{\rm def}{=}\bar{\boldsymbol{u}}_{t+1}}+Z_{t}^{\top}\bigl(\underbrace{\boldsymbol{b}_{t}+\tfrac{1}{\alpha}F_{t}^{\top}(t\Lambda_{t}H_{t})F_{t}Z_{t}\boldsymbol{g}_{t}}_{\stackrel{\rm def}{=}\boldsymbol{b}_{t+1}^{\prime}}\bigr). (5)

Since Z_{t}-Z_{t-1} is sparse by construction and the matrix operations defining \boldsymbol{b}_{t+1}^{\prime} scale with m, overall the update can be done in \mathcal{O}(m^{2}+ms) time. Using the update for \boldsymbol{w}_{t+1} in terms of \boldsymbol{u}_{t+1}, \boldsymbol{w}_{t+1} is equal to

\boldsymbol{u}_{t+1}-\gamma_{t}(\boldsymbol{I}_{d}-S_{t}^{\top}H_{t}S_{t})\boldsymbol{x}_{t+1}=\underbrace{\bar{\boldsymbol{u}}_{t+1}-\gamma_{t}\boldsymbol{x}_{t+1}}_{\stackrel{\rm def}{=}\bar{\boldsymbol{w}}_{t+1}}+Z_{t}^{\top}\bigl(\underbrace{\boldsymbol{b}_{t+1}^{\prime}+\gamma_{t}F_{t}^{\top}(t\Lambda_{t}H_{t})F_{t}Z_{t}\boldsymbol{x}_{t+1}}_{\stackrel{\rm def}{=}\boldsymbol{b}_{t+1}}\bigr). (6)

Again, it is clear that all the computations scale with s and not d, so both \bar{\boldsymbol{w}}_{t+1} and \boldsymbol{b}_{t+1} require only \mathcal{O}(m^{2}+ms) time to maintain. Furthermore, the prediction \boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{t}=\bar{\boldsymbol{w}}_{t}^{\top}\boldsymbol{x}_{t}+\boldsymbol{b}_{t}^{\top}Z_{t-1}\boldsymbol{x}_{t} can also be computed in \mathcal{O}(ms) time. The \mathcal{O}(m^{3}) term in the overall complexity comes from a Gram-Schmidt step in maintaining F_{t} (details in Appendix G).
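A very small sketch of the bookkeeping idea (our simplification; the full sparse routines for F_t and Z_t are in Appendix G): the dense vector Z^{\top}\boldsymbol{b} is never materialized, and each prediction touches only the s nonzeros of the example, here passed as index/value arrays.

```python
import numpy as np

def predict(w_bar, b, Z, idx, val):
    """Prediction w^T x with w kept implicitly as w_bar + Z^T b; the sparse example x
    is given by its nonzero indices `idx` and values `val` (s entries)."""
    Zx = Z[:, idx] @ val                   # Z x using only the s nonzero columns: O(ms)
    return w_bar[idx] @ val + b @ Zx       # O(s) + O(m)
```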

The pseudocode is presented in Algorithms 4 and 5 with some details deferred to Appendix G. This is the first sparse implementation of online eigenvector computation to the best of our knowledge.

Algorithm 4 Sparse Sketched Online Newton with Oja's Algorithm
1: Parameters C, \alpha and m.
2: Initialize \bar{\boldsymbol{u}}=\boldsymbol{0}_{d\times 1} and \boldsymbol{b}=\boldsymbol{0}_{m\times 1}.
3: (\Lambda,F,Z,H)\leftarrow\textbf{SketchInit}(\alpha,m)  (Algorithm 5).
4: for t=1 to T do
5:   Receive example \boldsymbol{x}_{t}.
6:   Projection step: compute \widehat{\boldsymbol{x}}=FZ\boldsymbol{x}_{t} and \gamma=\frac{\tau_{C}(\bar{\boldsymbol{u}}^{\top}\boldsymbol{x}_{t}+\boldsymbol{b}^{\top}Z\boldsymbol{x}_{t})}{\boldsymbol{x}_{t}^{\top}\boldsymbol{x}_{t}-(t-1)\widehat{\boldsymbol{x}}^{\top}\Lambda H\widehat{\boldsymbol{x}}}. Obtain \bar{\boldsymbol{w}}=\bar{\boldsymbol{u}}-\gamma\boldsymbol{x}_{t} and \boldsymbol{b}\leftarrow\boldsymbol{b}+\gamma(t-1)F^{\top}\Lambda H\widehat{\boldsymbol{x}}  (Equation 6).
7:   Predict label y_{t}=\bar{\boldsymbol{w}}^{\top}\boldsymbol{x}_{t}+\boldsymbol{b}^{\top}Z\boldsymbol{x}_{t} and suffer loss \ell_{t}(y_{t}).
8:   Compute gradient \boldsymbol{g}_{t}=\ell^{\prime}_{t}(y_{t})\boldsymbol{x}_{t} and the to-sketch vector \widehat{\boldsymbol{g}}=\sqrt{\sigma_{t}+\eta_{t}}\boldsymbol{g}_{t}.
9:   (\Lambda, F, Z, H, \boldsymbol{\delta}) \leftarrow SketchUpdate(\widehat{\boldsymbol{g}})  (Algorithm 5).
10:  Update weight: \bar{\boldsymbol{u}}=\bar{\boldsymbol{w}}-\tfrac{1}{\alpha}\boldsymbol{g}_{t}-(\boldsymbol{\delta}^{\top}\boldsymbol{b})\widehat{\boldsymbol{g}} and \boldsymbol{b}\leftarrow\boldsymbol{b}+\tfrac{1}{\alpha}tF^{\top}\Lambda HFZ\boldsymbol{g}_{t}  (Equation 5).
11: end for
Algorithm 5 Sparse Oja's Sketch
Internal state: t, \Lambda, F, Z, H and K.
SketchInit(\alpha, m)
1: Set t=0, \Lambda=\boldsymbol{0}_{m\times m}, F=K=\alpha H=\boldsymbol{I}_{m} and Z to any m\times d matrix with orthonormal rows.
2: Return (\Lambda, F, Z, H).
SketchUpdate(\widehat{\boldsymbol{g}})
1: Update t\leftarrow t+1. Pick a diagonal stepsize matrix \Gamma_{t} to update \Lambda\leftarrow(\boldsymbol{I}-\Gamma_{t})\Lambda+\Gamma_{t}\,\mathrm{diag}\{FZ\widehat{\boldsymbol{g}}\}^{2}.
2: Set \boldsymbol{\delta}=A^{-1}\Gamma_{t}FZ\widehat{\boldsymbol{g}} and update K\leftarrow K+\boldsymbol{\delta}\widehat{\boldsymbol{g}}^{\top}Z^{\top}+Z\widehat{\boldsymbol{g}}\boldsymbol{\delta}^{\top}+(\widehat{\boldsymbol{g}}^{\top}\widehat{\boldsymbol{g}})\boldsymbol{\delta}\boldsymbol{\delta}^{\top}.
3: Update Z\leftarrow Z+\boldsymbol{\delta}\widehat{\boldsymbol{g}}^{\top}.
4: (L,Q)\leftarrow\text{Decompose}(F,K) (Algorithm 11), so that LQZ=FZ and QZ is orthogonal. Set F=Q.
5: Set H\leftarrow\mathrm{diag}\left\{\frac{1}{\alpha+t\Lambda_{1,1}},\cdots,\frac{1}{\alpha+t\Lambda_{m,m}}\right\}.
6: Return (\Lambda, F, Z, H, \boldsymbol{\delta}).

5 Experiments

Preliminary experiments revealed that of our two sketching options, Oja's sketch generally has better performance (see Appendix H). For a more thorough evaluation, we implemented the sparse version of Oja-SON in Vowpal Wabbit (an open source machine learning toolkit available at http://hunch.net/~vw). We compare it with AdaGrad (Duchi et al., 2011; McMahan and Streeter, 2010) on both synthetic and real-world datasets. Each algorithm takes a stepsize parameter: \tfrac{1}{\alpha} serves as a stepsize for Oja-SON and as a scaling constant on the gradient matrix for AdaGrad. We try both methods with the parameter set to 2^{j} for j=-3,-2,\ldots,6 and report the best results. We keep the stepsize matrix in Oja-SON fixed as \Gamma_{t}=\frac{1}{t}\boldsymbol{I}_{m} throughout. All methods make one online pass over the data, minimizing squared loss.

5.1 Synthetic Datasets

Figure 2: Oja's algorithm's eigenvalue recovery error.

Figure 3: (a) Comparison of two sketch sizes on real data, and (b) comparison against AdaGrad on real data.

To investigate Oja-SON's performance in the setting it is really designed for, we generated a range of synthetic ill-conditioned datasets as follows. We picked a random Gaussian matrix Z\in\mathbb{R}^{T\times d} (T=10{,}000 and d=100) and a random orthonormal basis V\in\mathbb{R}^{d\times d}. We chose a specific spectrum \boldsymbol{\lambda}\in\mathbb{R}^{d} where the first d-10 coordinates are 1 and the rest increase linearly up to some fixed condition number parameter \kappa. We let X=Z\,\mathrm{diag}\{\boldsymbol{\lambda}\}^{\frac{1}{2}}V^{\top} be our example matrix, and created a binary classification problem with labels y=\mbox{sign}(\boldsymbol{\theta}^{\top}\boldsymbol{x}), where \boldsymbol{\theta}\in\mathbb{R}^{d} is a random vector. We generated 20 such datasets with the same Z, V and labels y but different values of \kappa\in\{10,20,\ldots,200\}. Note that if the algorithm were truly invariant, it would have the same behavior on these 20 datasets.
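A sketch of this data-generation recipe in NumPy (our reconstruction of the description above; the random seed and the exact distribution of \boldsymbol{\theta} are our assumptions):

```python
import numpy as np

def make_ill_conditioned(T=10_000, d=100, kappa=100, seed=0):
    """Synthetic dataset: Gaussian examples rotated and scaled so that the last
    10 directions have eigenvalues growing linearly up to the condition number kappa."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((T, d))                    # random Gaussian matrix
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis
    lam = np.ones(d)
    lam[d - 10:] = np.linspace(1, kappa, 10)           # spectrum: 1,...,1, then up to kappa
    X = Z @ np.diag(np.sqrt(lam)) @ V.T                # example matrix
    theta = rng.standard_normal(d)                     # random ground-truth direction (assumed Gaussian)
    y = np.sign(X @ theta)                             # binary labels
    return X, y
```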

Fig. 1 (in Section 1) shows the final progressive error (i.e., the fraction of misclassified examples after one pass over the data) for AdaGrad and Oja-SON (with sketch sizes m=0,5,10) as the condition number increases. As expected, the plot confirms that the performance of first order methods such as AdaGrad degrades when the data is ill-conditioned. The plot also shows that as the sketch size increases, Oja-SON becomes more accurate: when m=0 (no sketch at all), Oja-SON is vanilla gradient descent and is worse than AdaGrad as expected; when m=5, the accuracy greatly improves; and finally, when m=10, the accuracy of Oja-SON is substantially better and hardly worsens with \kappa.

To further explain the effectiveness of Oja’s algorithm in identifying top eigenvalues and eigenvectors, the plot in Fig. 2 shows the largest relative difference between the true and estimated top 10 eigenvalues as Oja’s algorithm sees more data. This gap drops quickly after seeing just 500 examples.

5.2 Real-world Datasets

Next we evaluated Oja-SON on 23 benchmark datasets from the UCI and LIBSVM repositories (see Appendix H for descriptions of these datasets). Note that some datasets are very high dimensional but very sparse (e.g., for 20news, d\approx 102{,}000 and s\approx 94), and consequently methods with running time quadratic (such as ONS) or even linear in the dimension rather than the sparsity are prohibitive.

In Fig. 3, we show the effect of using sketched second order information by comparing sketch sizes m=0 and m=10 for Oja-SON (concrete error rates are in Appendix H). We observe significant improvements on 5 datasets (acoustic, census, heart, ionosphere, letter), demonstrating the advantage of using second order information. However, we found that Oja-SON was outperformed by AdaGrad on most datasets, mostly because the diagonal adaptation of AdaGrad greatly reduces the condition number on these datasets. Moreover, one disadvantage of SON is that in the directions not captured by the sketch, it is essentially doing vanilla gradient descent. We expect better results using AdaGrad-style diagonal adaptation in the off-sketch directions.

To incorporate this high level idea, we performed a simple modification to Oja-SON: upon seeing example \boldsymbol{x}_{t}, we feed D_{t}^{-\frac{1}{2}}\boldsymbol{x}_{t} to our algorithm instead of \boldsymbol{x}_{t}, where D_{t}\in\mathbb{R}^{d\times d} is the diagonal part of the matrix \sum_{\tau=1}^{t-1}\boldsymbol{g}_{\tau}\boldsymbol{g}_{\tau}^{\top} (with D_{1} defined as 0.1\times\boldsymbol{I}_{d} to avoid division by zero). The intuition is that this diagonal rescaling first homogenizes the scales of all dimensions. Any remaining ill-conditioning is further addressed by the sketching to some degree, while the complementary subspace is no worse off than with AdaGrad. We believe this flexibility in picking the right vectors to sketch is an attractive aspect of our sketching-based approach.
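The modification amounts to a thin preprocessing layer in front of the learner, roughly as follows (our own sketch: `son_update` stands in for whichever SON variant is run and is assumed to return the round's gradient; exactly how the diagonal is accumulated is glossed over here).

```python
import numpy as np

def rescaled_stream(examples, son_update, d, eps=0.1):
    """Feed D_t^{-1/2} x_t to the learner, where D_t is the running diagonal of the
    gradient outer products, initialized to eps * I_d to avoid division by zero."""
    D = np.full(d, eps)                     # diagonal of the accumulated g g^T
    for x, y in examples:
        x_tilde = x / np.sqrt(D)            # AdaGrad-style per-coordinate rescaling
        g = son_update(x_tilde, y)          # hypothetical learner call returning its gradient
        D += g ** 2                         # accumulate diagonal second order statistics
    return D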

With this modification, Oja-SON outperforms AdaGrad on most of the datasets even for m=0, as shown in Fig. 3 (concrete error rates in Appendix H). The improvement over AdaGrad at m=0 is surprising but not impossible, as the updates are not identical: our update is scale invariant, like that of Ross et al. (2013). However, the diagonal adaptation already greatly reduces the condition number on all datasets except splice (see Fig. 4 in Appendix H for detailed results on this dataset), so little improvement is seen for sketch size m=10 over m=0. For several datasets, we verified the accuracy of Oja's method in computing the top few eigenvalues (Appendix H), so the lack of difference between sketch sizes is due to the lack of second order information after the diagonal correction.

The average running time of our algorithm when m=10 is about 11 times slower than AdaGrad, matching expectations. Overall, SON can significantly outperform baselines on ill-conditioned data, while maintaining a practical computational complexity.

Acknowledgements

This work was done when Haipeng Luo and Nicolò Cesa-Bianchi were at Microsoft Research, New York. We thank Lijun Zhang for pointing out our mistake in the regret proof of another sketching method that appeared in an earlier version.

References

  • Balsubramani et al. [2013] A. Balsubramani, S. Dasgupta, and Y. Freund. The fast convergence of incremental PCA. In NIPS, 2013.
  • Byrd et al. [2016] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26:1008–1031, 2016.
  • Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Cesa-Bianchi et al. [2005] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.
  • Duchi et al. [2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.
  • Erdogdu and Montanari [2015] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled newton methods. In NIPS, 2015.
  • Frank and Wolfe [1956] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
  • Gao et al. [2013] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, 2013.
  • Garber and Hazan [2016] D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization, 26:1493–1528, 2016.
  • Garber et al. [2015] D. Garber, E. Hazan, and T. Ma. Online learning of eigenvectors. In ICML, 2015.
  • Ghashami et al. [2015] M. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45:1762–1792, 2015.
  • Ghashami et al. [2016] M. Ghashami, E. Liberty, and J. M. Phillips. Efficient frequent directions algorithm for sparse matrices. In KDD, 2016.
  • Gonen and Shalev-Shwartz [2015] A. Gonen and S. Shalev-Shwartz. Faster SGD using sketched conditioning. arXiv:1506.02649, 2015.
  • Gonen et al. [2016] A. Gonen, F. Orabona, and S. Shalev-Shwartz. Solving ridge regression using sketched preconditioned SVRG. In ICML, 2016.
  • Hardt and Price [2014] M. Hardt and E. Price. The noisy power method: A meta algorithm with applications. In NIPS, 2014.
  • Hazan and Kale [2012] E. Hazan and S. Kale. Projection-free online learning. In ICML, 2012.
  • Hazan et al. [2007] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
  • Jaggi [2013] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
  • Li et al. [2015] C.-L. Li, H.-T. Lin, and C.-J. Lu. Rivalry of two families of algorithms for memory-restricted streaming PCA. arXiv:1506.01490, 2015.
  • Liberty [2013] E. Liberty. Simple and deterministic matrix sketching. In KDD, 2013.
  • Liu and Nocedal [1989] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
  • McMahan and Streeter [2010] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.
  • Mokhtari and Ribeiro [2015] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. JMLR, 16:3151–3181, 2015.
  • Moritz et al. [2016] P. Moritz, R. Nishihara, and M. I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In AISTATS, 2016.
  • Oja [1982] E. Oja. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267–273, 1982.
  • Oja and Karhunen [1985] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of mathematical analysis and applications, 106(1):69–84, 1985.
  • Orabona and Pál [2015] F. Orabona and D. Pál. Scale-free algorithms for online linear optimization. In ALT, 2015.
  • Orabona et al. [2015] F. Orabona, K. Crammer, and N. Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411–435, 2015.
  • Pilanci and Wainwright [2015] M. Pilanci and M. J. Wainwright. Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. arXiv:1505.02250, 2015.
  • Ross et al. [2013] S. Ross, P. Mineiro, and J. Langford. Normalized online learning. In UAI, 2013.
  • Schraudolph et al. [2007] N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online convex optimization. In AISTATS, 2007.
  • Sohl-Dickstein et al. [2014] J. Sohl-Dickstein, B. Poole, and S. Ganguli. Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods. In ICML, 2014.
  • Woodruff [2014] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Machine Learning, 10(1-2):1–157, 2014.

Supplementary material for
“Efficient Second Order Online Learning by Sketching”

Appendix A Proof of Theorem 1

Proof.

Assuming T is a multiple of d without loss of generality, we pick \boldsymbol{x}_{t} from the basis vectors \{\boldsymbol{e}_{1},\ldots,\boldsymbol{e}_{d}\} so that each \boldsymbol{e}_{i} appears T/d times (in an arbitrary order). Note that now \mathcal{K} is just a hypercube:

\mathcal{K}=\left\{\boldsymbol{w}\,:\,|\boldsymbol{w}^{\top}\boldsymbol{x}_{t}|\leq C,\;\;\forall t\right\}=\left\{\boldsymbol{w}\,:\,\left\|{\boldsymbol{w}}\right\|_{\infty}\leq C\right\}.

Let \xi_{1},\dots,\xi_{T} be independent Rademacher random variables such that \Pr(\xi_{t}=+1)=\Pr(\xi_{t}=-1)=\tfrac{1}{2}. For a scalar \theta, we define the loss function \ell_{t}(\theta)=(\xi_{t}L)\theta (by adding a suitable constant, these losses can always be made nonnegative while leaving the regret unchanged), so that Assumptions 1 and 2 are clearly satisfied with \sigma_{t}=0. We show that, for any online algorithm,

\mathbb{E}[R_{T}]=\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}\bigl(\boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{t}\bigr)-\inf_{\boldsymbol{w}\in\mathcal{K}}\sum_{t=1}^{T}\ell_{t}\bigl(\boldsymbol{w}^{\top}\boldsymbol{x}_{t}\bigr)\right]\geq CL\sqrt{\frac{dT}{2}}

which implies the statement of the theorem.

First of all, note that \mathbb{E}\Bigl[\ell_{t}\bigl(\boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{t}\bigr)\,\Big|\,\xi_{1},\dots,\xi_{t-1}\Bigr]=0 for any \boldsymbol{w}_{t}. Hence we have

\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}\bigl(\boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{t}\bigr)-\inf_{\boldsymbol{w}\in\mathcal{K}}\sum_{t=1}^{T}\ell_{t}\bigl(\boldsymbol{w}^{\top}\boldsymbol{x}_{t}\bigr)\right]=\mathbb{E}\left[\sup_{\boldsymbol{w}\in\mathcal{K}}\sum_{t=1}^{T}-\ell_{t}\bigl(\boldsymbol{w}^{\top}\boldsymbol{x}_{t}\bigr)\right]=L\,\mathbb{E}\left[\sup_{\boldsymbol{w}\in\mathcal{K}}\boldsymbol{w}^{\top}\sum_{t=1}^{T}\xi_{t}\boldsymbol{x}_{t}\right],

which, by the construction of \boldsymbol{x}_{t}, is

CL\,\mathbb{E}\left[\left\|{\sum_{t=1}^{T}\xi_{t}\boldsymbol{x}_{t}}\right\|_{1}\right]=CLd\,\mathbb{E}\left[\left|\sum_{t=1}^{T/d}\xi_{t}\right|\right]\geq CLd\sqrt{\frac{T}{2d}}=CL\sqrt{\frac{dT}{2}},

where the final bound is due to the Khintchine inequality (see e.g. Lemma 8.2 in Cesa-Bianchi and Lugosi [2006]). This concludes the proof. ∎

Appendix B Projection

We prove a more general version of Lemma 1 here, which does not require invertibility of the matrix A.

Lemma 2.

For any \boldsymbol{x}\neq\boldsymbol{0},\boldsymbol{u}\in\mathbb{R}^{d\times 1} and positive semidefinite matrix A\in\mathbb{R}^{d\times d}, we have

\boldsymbol{w}^{*}=\operatorname*{argmin}_{\boldsymbol{w}:|\boldsymbol{w}^{\top}\boldsymbol{x}|\leq C}\left\|{\boldsymbol{w}-\boldsymbol{u}}\right\|_{A}=\left\{\begin{array}{cl}\boldsymbol{u}-\frac{\tau_{C}(\boldsymbol{u}^{\top}\boldsymbol{x})}{\boldsymbol{x}^{\top}A^{\dagger}\boldsymbol{x}}A^{\dagger}\boldsymbol{x}&\text{if $\boldsymbol{x}\in\operatorname*{range}(A)$}\\ \boldsymbol{u}-\frac{\tau_{C}(\boldsymbol{u}^{\top}\boldsymbol{x})}{\boldsymbol{x}^{\top}(\boldsymbol{I}-A^{\dagger}A)\boldsymbol{x}}(\boldsymbol{I}-A^{\dagger}A)\boldsymbol{x}&\text{if $\boldsymbol{x}\notin\operatorname*{range}(A)$}\end{array}\right.

where \tau_{C}(y)=\mbox{\sc sgn}(y)\max\{|y|-C,0\} and A^{\dagger} is the Moore-Penrose pseudoinverse of A. (Note that when A is rank deficient, this is one of the many possible solutions.)

Proof.

First consider the case when \boldsymbol{x}\in\operatorname*{range}(A). If |\boldsymbol{u}^{\top}\boldsymbol{x}|\leq C, then it is trivial that \boldsymbol{w}^{*}=\boldsymbol{u}. We thus assume \boldsymbol{u}^{\top}\boldsymbol{x}\geq C below (the last case \boldsymbol{u}^{\top}\boldsymbol{x}\leq-C is similar). The Lagrangian of the problem is

L(\boldsymbol{w},\lambda_{1},\lambda_{2})=\frac{1}{2}(\boldsymbol{w}-\boldsymbol{u})^{\top}A(\boldsymbol{w}-\boldsymbol{u})+\lambda_{1}(\boldsymbol{w}^{\top}\boldsymbol{x}-C)+\lambda_{2}(\boldsymbol{w}^{\top}\boldsymbol{x}+C)

where \lambda_{1}\geq 0 and \lambda_{2}\leq 0 are Lagrange multipliers. Since \boldsymbol{w}^{\top}\boldsymbol{x} cannot be C and -C at the same time, the complementary slackness condition implies that either \lambda_{1}=0 or \lambda_{2}=0. Suppose the latter case is true; then setting the derivative with respect to \boldsymbol{w} to 0, we get \boldsymbol{w}^{*}=\boldsymbol{u}-\lambda_{1}A^{\dagger}\boldsymbol{x}+(\boldsymbol{I}-A^{\dagger}A)\boldsymbol{z}, where \boldsymbol{z}\in\mathbb{R}^{d\times 1} can be arbitrary. However, since A(\boldsymbol{I}-A^{\dagger}A)=0, this part does not affect the objective value at all, and we can simply pick \boldsymbol{z}=0 so that \boldsymbol{w}^{*} has a consistent form regardless of whether A is full rank or not. Now plugging \boldsymbol{w}^{*} back in, we have

L(\boldsymbol{w}^{*},\lambda_{1},0)=-\frac{\lambda_{1}^{2}}{2}\boldsymbol{x}^{\top}A^{\dagger}\boldsymbol{x}+\lambda_{1}(\boldsymbol{u}^{\top}\boldsymbol{x}-C)

which is maximized at \lambda_{1}=\frac{\boldsymbol{u}^{\top}\boldsymbol{x}-C}{\boldsymbol{x}^{\top}A^{\dagger}\boldsymbol{x}}\geq 0. Plugging this optimal \lambda_{1} into \boldsymbol{w}^{*} gives the stated solution. On the other hand, if \lambda_{1}=0 instead, we can proceed similarly and verify that it gives a smaller dual value (0 in fact), proving that the previous solution is indeed optimal.

We now move on to the case when \boldsymbol{x}\notin\operatorname*{range}(A). First of all, the stated solution is well defined since \boldsymbol{x}^{\top}(\boldsymbol{I}-A^{\dagger}A)\boldsymbol{x} is nonzero in this case. Moreover, direct calculation shows that \boldsymbol{w}^{*} is in the valid set: |{\boldsymbol{w}^{*}}^{\top}\boldsymbol{x}|=|\boldsymbol{u}^{\top}\boldsymbol{x}-\tau_{C}(\boldsymbol{u}^{\top}\boldsymbol{x})|\leq C, and it also achieves the minimal possible distance value \left\|{\boldsymbol{w}^{*}-\boldsymbol{u}}\right\|_{A}=0, proving the lemma. ∎

Appendix C Proof of Theorem 2

We first prove a general regret bound that holds for any choice of A_{t} in update (1):

\boldsymbol{u}_{t+1}=\boldsymbol{w}_{t}-A_{t}^{-1}\boldsymbol{g}_{t},\qquad\boldsymbol{w}_{t+1}=\operatorname*{argmin}_{\boldsymbol{w}\in\mathcal{K}_{t+1}}\left\|{\boldsymbol{w}-\boldsymbol{u}_{t+1}}\right\|_{A_{t}}.

This bound will also be useful in proving regret guarantees for the sketched versions.

Proposition 1.

For any sequence of positive definite matrices A_{t} and any sequence of losses satisfying Assumptions 1 and 2, the regret of updates (1) against any comparator \boldsymbol{w}\in\mathcal{K} satisfies

2R_{T}(\boldsymbol{w})\leq\left\|{\boldsymbol{w}}\right\|_{A_{0}}^{2}+\underbrace{\sum_{t=1}^{T}\boldsymbol{g}_{t}^{\top}A_{t}^{-1}\boldsymbol{g}_{t}}_{\text{``Gradient Bound'' }R_{G}}+\underbrace{\sum_{t=1}^{T}(\boldsymbol{w}_{t}-\boldsymbol{w})^{\top}(A_{t}-A_{t-1}-\sigma_{t}\boldsymbol{g}_{t}\boldsymbol{g}_{t}^{\top})(\boldsymbol{w}_{t}-\boldsymbol{w})}_{\text{``Diameter Bound'' }R_{D}}
Proof.

Since \boldsymbol{w}_{t+1} is the projection of \boldsymbol{u}_{t+1} onto \mathcal{K}_{t+1}, by the property of projections (see for example [Hazan and Kale, 2012, Lemma 8]), the algorithm ensures

\left\|{\boldsymbol{w}_{t+1}-\boldsymbol{w}}\right\|_{A_{t}}^{2}\leq\left\|{\boldsymbol{u}_{t+1}-\boldsymbol{w}}\right\|_{A_{t}}^{2}=\left\|{\boldsymbol{w}_{t}-\boldsymbol{w}}\right\|_{A_{t}}^{2}+\boldsymbol{g}_{t}^{\top}A_{t}^{-1}\boldsymbol{g}_{t}-2\boldsymbol{g}_{t}^{\top}(\boldsymbol{w}_{t}-\boldsymbol{w})

for all \boldsymbol{w}\in\mathcal{K}\subseteq\mathcal{K}_{t+1}. By the curvature property in Assumption 2, we then have that

2R_{T}(\boldsymbol{w})\leq\sum_{t=1}^{T}2\boldsymbol{g}_{t}^{\top}(\boldsymbol{w}_{t}-\boldsymbol{w})-\sigma_{t}\bigl(\boldsymbol{g}_{t}^{\top}(\boldsymbol{w}_{t}-\boldsymbol{w})\bigr)^{2}
\leq\sum_{t=1}^{T}\boldsymbol{g}_{t}^{\top}A_{t}^{-1}\boldsymbol{g}_{t}+\left\|{\boldsymbol{w}_{t}-\boldsymbol{w}}\right\|_{A_{t}}^{2}-\left\|{\boldsymbol{w}_{t+1}-\boldsymbol{w}}\right\|_{A_{t}}^{2}-\sigma_{t}\bigl(\boldsymbol{g}_{t}^{\top}(\boldsymbol{w}_{t}-\boldsymbol{w})\bigr)^{2}
\leq\left\|{\boldsymbol{w}}\right\|_{A_{0}}^{2}+\sum_{t=1}^{T}\boldsymbol{g}_{t}^{\top}A_{t}^{-1}\boldsymbol{g}_{t}+(\boldsymbol{w}_{t}-\boldsymbol{w})^{\top}(A_{t}-A_{t-1}-\sigma_{t}\boldsymbol{g}_{t}\boldsymbol{g}_{t}^{\top})(\boldsymbol{w}_{t}-\boldsymbol{w}),

which completes the proof. ∎

Proof of Theorem 2.

We apply Proposition 1 with the choice A_{0}=\alpha\boldsymbol{I}_{d} and A_{t}=A_{t-1}+(\sigma_{t}+\eta_{t})\boldsymbol{g}_{t}\boldsymbol{g}_{t}^{\top}, which gives \left\|{\boldsymbol{w}}\right\|_{A_{0}}^{2}=\alpha\left\|{\boldsymbol{w}}\right\|_{2}^{2} and

R_{D}=\sum_{t=1}^{T}\eta_{t}(\boldsymbol{w}_{t}-\boldsymbol{w})^{\top}\boldsymbol{g}_{t}\boldsymbol{g}_{t}^{\top}(\boldsymbol{w}_{t}-\boldsymbol{w})\leq 4(CL)^{2}\sum_{t=1}^{T}\eta_{t},

where the last inequality uses the Lipschitz property in Assumption 1 and the boundedness of \boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{t} and \boldsymbol{w}^{\top}\boldsymbol{x}_{t}.

For the term R_{G}, define \widehat{A}_{t}=\frac{\alpha}{\sigma+\eta_{T}}\boldsymbol{I}_{d}+\sum_{s=1}^{t}\boldsymbol{g}_{s}\boldsymbol{g}_{s}^{\top}. Since \sigma_{t}\geq\sigma and \eta_{t} is non-increasing, we have \widehat{A}_{t}\preceq\frac{1}{\sigma+\eta_{T}}A_{t}, and therefore:

R_{G}\leq\frac{1}{\sigma+\eta_{T}}\sum_{t=1}^{T}\boldsymbol{g}_{t}^{\top}\widehat{A}_{t}^{-1}\boldsymbol{g}_{t}=\frac{1}{\sigma+\eta_{T}}\sum_{t=1}^{T}\left\langle{\widehat{A}_{t}-\widehat{A}_{t-1},\;\widehat{A}_{t}^{-1}}\right\rangle
\leq\frac{1}{\sigma+\eta_{T}}\sum_{t=1}^{T}\ln\frac{|\widehat{A}_{t}|}{|\widehat{A}_{t-1}|}=\frac{1}{\sigma+\eta_{T}}\ln\frac{|\widehat{A}_{T}|}{|\widehat{A}_{0}|}
=\frac{1}{\sigma+\eta_{T}}\sum_{i=1}^{d}\ln\left(1+\frac{(\sigma+\eta_{T})\lambda_{i}\bigl(\sum_{t=1}^{T}\boldsymbol{g}_{t}\boldsymbol{g}_{t}^{\top}\bigr)}{\alpha}\right)
\leq\frac{d}{\sigma+\eta_{T}}\ln\left(1+\frac{(\sigma+\eta_{T})\sum_{i=1}^{d}\lambda_{i}\bigl(\sum_{t=1}^{T}\boldsymbol{g}_{t}\boldsymbol{g}_{t}^{\top}\bigr)}{d\alpha}\right)
=\frac{d}{\sigma+\eta_{T}}\ln\left(1+\frac{(\sigma+\eta_{T})\sum_{t=1}^{T}\left\|{\boldsymbol{g}_{t}}\right\|_{2}^{2}}{d\alpha}\right)

where the second inequality is by the concavity of the function \ln|X| (see [Hazan et al., 2007, Lemma 12] for an alternative proof), and the last one is by Jensen's inequality. This concludes the proof. ∎
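As a small numeric sanity check of the log-determinant step used above (our own snippet, not part of the proof), the per-round inequality \boldsymbol{g}_{t}^{\top}\widehat{A}_{t}^{-1}\boldsymbol{g}_{t}\leq\ln(|\widehat{A}_{t}|/|\widehat{A}_{t-1}|) for the rank-one update \widehat{A}_{t}=\widehat{A}_{t-1}+\boldsymbol{g}_{t}\boldsymbol{g}_{t}^{\top} can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = np.eye(d)                                    # stands in for A_0 = (alpha/(sigma+eta_T)) I_d
lhs_total, rhs_total = 0.0, 0.0
for _ in range(100):
    g = rng.standard_normal(d)
    A = B + np.outer(g, g)                       # rank-one update, as in the telescoping argument
    lhs_total += g @ np.linalg.solve(A, g)       # g^T A^{-1} g
    rhs_total += np.linalg.slogdet(A)[1] - np.linalg.slogdet(B)[1]   # ln(|A| / |B|)
    B = A
assert lhs_total <= rhs_total + 1e-9             # telescopes to ln(|A_T| / |A_0|)
print(lhs_total, rhs_total)
```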

Appendix D A Truly Invariant Algorithm

In this section we discuss how to make our adaptive online Newton algorithm truly invariant to invertible linear transformations. To achieve this, we set \alpha=0 and replace A_{t}^{-1} with the Moore-Penrose pseudoinverse A_{t}^{\dagger} (see Appendix B for the closed form of the projection step):

\boldsymbol{u}_{t+1}=\boldsymbol{w}_{t}-A_{t}^{\dagger}\boldsymbol{g}_{t},\qquad\boldsymbol{w}_{t+1}=\operatorname*{argmin}_{\boldsymbol{w}\in\mathcal{K}_{t+1}}\left\|{\boldsymbol{w}-\boldsymbol{u}_{t+1}}\right\|_{A_{t}}. (7)

When written in this form, it is not immediately clear that the algorithm has the invariant property. However, one can rewrite the algorithm in a mirror descent form:

\begin{aligned}
\boldsymbol{w}_{t+1}&=\operatorname*{argmin}_{\boldsymbol{w}\in\mathcal{K}_{t+1}}\left\|\boldsymbol{w}-\boldsymbol{w}_t+A_t^{\dagger}\boldsymbol{g}_t\right\|_{A_t}^2\\
&=\operatorname*{argmin}_{\boldsymbol{w}\in\mathcal{K}_{t+1}}\left\|\boldsymbol{w}-\boldsymbol{w}_t\right\|_{A_t}^2+2(\boldsymbol{w}-\boldsymbol{w}_t)^{\top}A_tA_t^{\dagger}\boldsymbol{g}_t\\
&=\operatorname*{argmin}_{\boldsymbol{w}\in\mathcal{K}_{t+1}}\left\|\boldsymbol{w}-\boldsymbol{w}_t\right\|_{A_t}^2+2\boldsymbol{w}^{\top}\boldsymbol{g}_t
\end{aligned}

where we use the fact that \boldsymbol{g}_t is in the range of A_t in the last step. Now suppose all the data \boldsymbol{x}_t are transformed to M\boldsymbol{x}_t for some unknown and invertible matrix M; then one can verify that all the weights are transformed to M^{-\top}\boldsymbol{w}_t accordingly, ensuring that the predictions remain the same.
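The fact used in the last step above, namely A_tA_t^{\dagger}\boldsymbol{g}_t=\boldsymbol{g}_t whenever \boldsymbol{g}_t lies in the range of A_t, is easy to check numerically; the following minimal NumPy snippet (an illustration only, with arbitrary dimensions) does so for a rank-deficient A built from a few outer products:

import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3                               # ambient dimension, rank of A
G = rng.standard_normal((k, d))           # k past gradients spanning the range of A
A = G.T @ G                               # rank-deficient PSD matrix (alpha = 0)
A_pinv = np.linalg.pinv(A)
g = G.T @ rng.standard_normal(k)          # any vector in the range of A
print(np.allclose(A @ A_pinv @ g, g))     # True: A A^+ g = g since g is in range(A)
v = rng.standard_normal(d)                # stand-in for w - w_t
print(np.allclose(2 * v @ A @ A_pinv @ g, 2 * v @ g))   # the linear term simplification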

Moreover, the regret of this algorithm can be bounded as follows. First notice that even when A_t is rank deficient, the projection step still ensures \left\|\boldsymbol{w}_{t+1}-\boldsymbol{w}\right\|_{A_t}^2\leq\left\|\boldsymbol{u}_{t+1}-\boldsymbol{w}\right\|_{A_t}^2, which is proven in [Hazan et al., 2007, Lemma 8]. Therefore, the entire proof of Theorem 2 still holds after replacing A_t^{-1} with A_t^{\dagger}, giving the regret bound:

\frac{1}{2}\sum_{t=1}^{T}\boldsymbol{g}_t^{\top}A_t^{\dagger}\boldsymbol{g}_t+2(CL)^2\eta_t~. \quad (8)

The key now is to bound the term \sum_{t=1}^{T}\boldsymbol{g}_t^{\top}\widehat{A}_t^{\dagger}\boldsymbol{g}_t, where we define \widehat{A}_t=\sum_{s=1}^{t}\boldsymbol{g}_s\boldsymbol{g}_s^{\top}. In order to do this, we proceed similarly to the proof of [Cesa-Bianchi et al., 2005, Theorem 4.2] to show that this term is of order \mathcal{O}(d^2\ln T) in the worst case.

Theorem 4.

Let \lambda^{*} be the minimum among the smallest nonzero eigenvalues of \widehat{A}_t\;(t=1,\ldots,T) and let r be the rank of \widehat{A}_T. We have

\sum_{t=1}^{T}\boldsymbol{g}_t^{\top}\widehat{A}_t^{\dagger}\boldsymbol{g}_t\leq r+\frac{(1+r)r}{2}\ln\left(1+\frac{2\sum_{t=1}^{T}\left\|\boldsymbol{g}_t\right\|_2^2}{(1+r)r\lambda^{*}}\right)~.
Proof.

First, by Cesa-Bianchi et al. [2005, Lemma D.1], we have

\boldsymbol{g}_t^{\top}\widehat{A}_t^{\dagger}\boldsymbol{g}_t=\begin{cases}1&\text{if }\boldsymbol{g}_t\notin\operatorname{range}(\widehat{A}_{t-1})\\ 1-\dfrac{\operatorname{det_+}(\widehat{A}_{t-1})}{\operatorname{det_+}(\widehat{A}_t)}<1&\text{if }\boldsymbol{g}_t\in\operatorname{range}(\widehat{A}_{t-1})\end{cases}

where \operatorname{det_+}(M) denotes the product of the nonzero eigenvalues of the matrix M. We thus separate the steps t such that \boldsymbol{g}_t\in\operatorname{range}(\widehat{A}_{t-1}) from those where \boldsymbol{g}_t\notin\operatorname{range}(\widehat{A}_{t-1}). For each k=1,\dots,r let T_k be the first time step t at which the rank of \widehat{A}_t is k (so that T_1=1). Also let T_{r+1}=T+1 for convenience. With this notation, we have

\begin{aligned}
\sum_{t=1}^{T}\boldsymbol{g}_t^{\top}\widehat{A}_t^{\dagger}\boldsymbol{g}_t&=\sum_{k=1}^{r}\left(\boldsymbol{g}_{T_k}^{\top}\widehat{A}_{T_k}^{\dagger}\boldsymbol{g}_{T_k}+\sum_{t=T_k+1}^{T_{k+1}-1}\boldsymbol{g}_t^{\top}\widehat{A}_t^{\dagger}\boldsymbol{g}_t\right)\\
&=\sum_{k=1}^{r}\left(1+\sum_{t=T_k+1}^{T_{k+1}-1}\left(1-\frac{\operatorname{det_+}(\widehat{A}_{t-1})}{\operatorname{det_+}(\widehat{A}_t)}\right)\right)\\
&=r+\sum_{k=1}^{r}\sum_{t=T_k+1}^{T_{k+1}-1}\left(1-\frac{\operatorname{det_+}(\widehat{A}_{t-1})}{\operatorname{det_+}(\widehat{A}_t)}\right)\\
&\leq r+\sum_{k=1}^{r}\sum_{t=T_k+1}^{T_{k+1}-1}\ln\frac{\operatorname{det_+}(\widehat{A}_t)}{\operatorname{det_+}(\widehat{A}_{t-1})}\\
&=r+\sum_{k=1}^{r}\ln\frac{\operatorname{det_+}(\widehat{A}_{T_{k+1}-1})}{\operatorname{det_+}(\widehat{A}_{T_k})}~.
\end{aligned}

Fix any k and let \lambda_{k,1},\dots,\lambda_{k,k} be the nonzero eigenvalues of \widehat{A}_{T_k} and \lambda_{k,1}+\mu_{k,1},\dots,\lambda_{k,k}+\mu_{k,k} be the nonzero eigenvalues of \widehat{A}_{T_{k+1}-1}. Then

\ln\frac{\operatorname{det_+}(\widehat{A}_{T_{k+1}-1})}{\operatorname{det_+}(\widehat{A}_{T_k})}=\ln\prod_{i=1}^{k}\frac{\lambda_{k,i}+\mu_{k,i}}{\lambda_{k,i}}=\sum_{i=1}^{k}\ln\left(1+\frac{\mu_{k,i}}{\lambda_{k,i}}\right)~.

Hence, we arrive at

\sum_{t=1}^{T}\boldsymbol{g}_t^{\top}\widehat{A}_t^{\dagger}\boldsymbol{g}_t\leq r+\sum_{k=1}^{r}\sum_{i=1}^{k}\ln\left(1+\frac{\mu_{k,i}}{\lambda_{k,i}}\right)~.

To further bound the latter quantity, we use \lambda^{*}\leq\lambda_{k,i} and Jensen's inequality:

\begin{aligned}
\sum_{k=1}^{r}\sum_{i=1}^{k}\ln\left(1+\frac{\mu_{k,i}}{\lambda_{k,i}}\right)&\leq\sum_{k=1}^{r}\sum_{i=1}^{k}\ln\left(1+\frac{\mu_{k,i}}{\lambda^{*}}\right)\\
&\leq\frac{(1+r)r}{2}\ln\left(1+\frac{2\sum_{k=1}^{r}\sum_{i=1}^{k}\mu_{k,i}}{(1+r)r\lambda^{*}}\right)~.
\end{aligned}

Finally noticing that

\sum_{i=1}^{k}\mu_{k,i}=\operatorname{tr}(\widehat{A}_{T_{k+1}-1})-\operatorname{tr}(\widehat{A}_{T_k})=\sum_{t=T_k+1}^{T_{k+1}-1}\operatorname{tr}(\boldsymbol{g}_t\boldsymbol{g}_t^{\top})=\sum_{t=T_k+1}^{T_{k+1}-1}\left\|\boldsymbol{g}_t\right\|_2^2

completes the proof. ∎
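The following small NumPy check (not part of the paper; the dimensions, horizon and the 10^{-8} threshold used to extract nonzero eigenvalues are arbitrary implementation choices) verifies the bound of Theorem 4 on a random stream of gradients:

import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 40
A = np.zeros((d, d))
lhs, sq_norms, lam_star = 0.0, 0.0, np.inf
for _ in range(T):
    g = rng.standard_normal(d)
    A = A + np.outer(g, g)                        # \hat{A}_t
    lhs += g @ np.linalg.pinv(A) @ g              # g_t^T \hat{A}_t^+ g_t
    sq_norms += g @ g
    eigs = np.linalg.eigvalsh(A)
    lam_star = min(lam_star, eigs[eigs > 1e-8].min())   # smallest nonzero eigenvalue so far
r = np.linalg.matrix_rank(A)
rhs = r + (1 + r) * r / 2 * np.log(1 + 2 * sq_norms / ((1 + r) * r * lam_star))
print(lhs <= rhs, lhs, rhs)                       # the bound of Theorem 4 holds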

Taken together, Eq. (8) and Theorem 4 lead to the following regret bounds (recall the definitions of \lambda^{*} and r from Theorem 4).

Corollary 1.

If \sigma_t=0 for all t and \eta_t is set to \frac{1}{CL}\sqrt{\frac{d}{t}}, then the regret of the algorithm defined by Eq. (7) is at most

\frac{CL}{2}\sqrt{\frac{T}{d}}\left(r+\frac{(1+r)r}{2}\ln\left(1+\frac{2\sum_{t=1}^{T}\left\|\boldsymbol{g}_t\right\|_2^2}{(1+r)r\lambda^{*}}\right)\right)+4CL\sqrt{Td}.

On the other hand, if \sigma_t\geq\sigma>0 for all t and \eta_t is set to 0, then the regret is at most

\frac{1}{2\sigma}\left(r+\frac{(1+r)r}{2}\ln\left(1+\frac{2\sum_{t=1}^{T}\left\|\boldsymbol{g}_t\right\|_2^2}{(1+r)r\lambda^{*}}\right)\right)~.

Appendix E Proof of Theorem 3

Proof.

We again first apply Proposition 1 (recall the notation R_G and R_D stated in the proposition). By the construction of the sketch, we have

A_t-A_{t-1}=S_t^{\top}S_t-S_{t-1}^{\top}S_{t-1}=\widehat{\boldsymbol{g}}_t\widehat{\boldsymbol{g}}_t^{\top}-\rho_tV_t^{\top}V_t\preceq\widehat{\boldsymbol{g}}_t\widehat{\boldsymbol{g}}_t^{\top}.

It follows immediately that R_D is again at most 4(CL)^2\sum_{t=1}^{T}\eta_t. For the term R_G, we will apply the following guarantee of Frequent Directions (see the proof of Theorem 1.1 of [Ghashami et al., 2015]): \sum_{t=1}^{T}\rho_t\leq\frac{\Omega_k}{m-k}. Specifically, since \operatorname{tr}(V_tA_t^{-1}V_t^{\top})\leq\frac{1}{\alpha}\operatorname{tr}(V_tV_t^{\top})=\frac{m}{\alpha}, we have

\begin{aligned}
R_G&=\sum_{t=1}^{T}\frac{1}{\sigma_t+\eta_t}\left\langle A_t^{-1},A_t-A_{t-1}+\rho_tV_t^{\top}V_t\right\rangle\\
&\leq\frac{1}{\sigma+\eta_T}\sum_{t=1}^{T}\left\langle A_t^{-1},A_t-A_{t-1}+\rho_tV_t^{\top}V_t\right\rangle\\
&=\frac{1}{\sigma+\eta_T}\sum_{t=1}^{T}\left(\left\langle A_t^{-1},A_t-A_{t-1}\right\rangle+\rho_t\operatorname{tr}(V_tA_t^{-1}V_t^{\top})\right)\\
&\leq\frac{1}{\sigma+\eta_T}\sum_{t=1}^{T}\left\langle A_t^{-1},A_t-A_{t-1}\right\rangle+\frac{m\Omega_k}{(m-k)(\sigma+\eta_T)\alpha}~.
\end{aligned}

Finally, for the term \sum_{t=1}^{T}\left\langle A_t^{-1},A_t-A_{t-1}\right\rangle, we proceed similarly to the proof of Theorem 2:

\begin{aligned}
\sum_{t=1}^{T}\left\langle A_t^{-1},A_t-A_{t-1}\right\rangle&\leq\sum_{t=1}^{T}\ln\frac{|A_t|}{|A_{t-1}|}=\ln\frac{|A_T|}{|A_0|}=\sum_{i=1}^{d}\ln\left(1+\frac{\lambda_i(S_T^{\top}S_T)}{\alpha}\right)\\
&=\sum_{i=1}^{m}\ln\left(1+\frac{\lambda_i(S_T^{\top}S_T)}{\alpha}\right)\leq m\ln\left(1+\frac{\operatorname{tr}(S_T^{\top}S_T)}{m\alpha}\right)
\end{aligned}

where the first inequality is by the concavity of the function \ln|X|, the second one is by Jensen's inequality, and the last equality is by the fact that S_T^{\top}S_T has rank at most m and thus \lambda_i(S_T^{\top}S_T)=0 for any i>m. This concludes the proof. ∎

Appendix F Sparse updates for FD sketch

The sparse version of our algorithm with the Frequent Directions option is much more involved. We begin by taking a detour and introducing a fast, epoch-based variant of the Frequent Directions algorithm proposed in [Ghashami et al., 2015]. The idea is the following: instead of doing an eigendecomposition immediately after inserting a new \widehat{\boldsymbol{g}} in every round, we double the size of the sketch (to 2m), keep up to m recent \widehat{\boldsymbol{g}}'s, do the decomposition only at the end of every m rounds, and finally keep the top m eigenvectors with shrunk eigenvalues. The advantage of this variant is that it can be implemented straightforwardly in \mathcal{O}(md) time on average without a complicated rank-one SVD update, while still ensuring exactly the same guarantee, at the only price of doubling the sketch size.

Algorithm 6 shows the details of this variant and how we maintain H. The sketch S is always represented by two parts: the top part (DV) comes from the last eigendecomposition, and the bottom part (G) collects the recent to-sketch vectors \widehat{\boldsymbol{g}}. Note that within each epoch, the update of H^{-1} is a rank-two update and thus H can be updated efficiently using the Woodbury formula (Lines 4 and 5 of Algorithm 6). A minimal numeric sketch of the epoch logic is given right after Algorithm 6.

Algorithm 6 Frequent Direction Sketch (epoch version)
1: Internal state: \tau, D, V, G and H.
2: SketchInit(\alpha, m):
3: Set \tau=1, D=\boldsymbol{0}_{m\times m}, G=\boldsymbol{0}_{m\times d}, H=\tfrac{1}{\alpha}\boldsymbol{I}_{2m} and let V be any m\times d matrix whose rows are orthonormal.
4: Return (\boldsymbol{0}_{2m\times d},H).
1: SketchUpdate(\widehat{\boldsymbol{g}}):
2: Insert \widehat{\boldsymbol{g}} into the \tau-th row of G.
3: if \tau<m then
4:   Let \boldsymbol{e} be the 2m\times 1 basis vector whose (m+\tau)-th entry is 1 and \boldsymbol{q}=S\widehat{\boldsymbol{g}}-\tfrac{\widehat{\boldsymbol{g}}^{\top}\widehat{\boldsymbol{g}}}{2}\boldsymbol{e}.
5:   Update H\leftarrow H-\frac{H\boldsymbol{q}\boldsymbol{e}^{\top}H}{1+\boldsymbol{e}^{\top}H\boldsymbol{q}} and H\leftarrow H-\frac{H\boldsymbol{e}\boldsymbol{q}^{\top}H}{1+\boldsymbol{q}^{\top}H\boldsymbol{e}}.
6:   Update \tau\leftarrow\tau+1.
7: else
8:   (V,\Sigma)\leftarrow ComputeEigenSystem\left(\begin{pmatrix}DV\\ G\end{pmatrix}\right) (Algorithm 7).
9:   Set D to be the diagonal matrix with D_{i,i}=\sqrt{\Sigma_{i,i}-\Sigma_{m,m}},\;\forall i\in[m].
10:  Set H\leftarrow\mathrm{diag}\left\{\frac{1}{\alpha+D_{1,1}^2},\cdots,\frac{1}{\alpha+D_{m,m}^2},\frac{1}{\alpha},\ldots,\frac{1}{\alpha}\right\}.
11:  Set G=\boldsymbol{0}_{m\times d}.
12:  Set \tau=1.
13: end if
14: Return \left(\begin{pmatrix}DV\\ G\end{pmatrix},H\right).
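To make the epoch structure concrete, here is a minimal NumPy sketch of the buffering and periodic shrinkage in Algorithm 6 (an illustration only: it ignores the maintenance of H and the \alpha regularizer, uses an SVD in place of ComputeEigenSystem, and the data sizes are arbitrary):

import numpy as np

def fd_epoch_sketch(vectors, m):
    # Epoch-based Frequent Directions: eigendecompose only once every m insertions.
    d = vectors.shape[1]
    DV = np.zeros((m, d))                 # top part: D V from the last decomposition
    G = np.zeros((m, d))                  # bottom part: buffer of recent to-sketch vectors
    tau = 0
    for g in vectors:
        G[tau] = g
        tau += 1
        if tau == m:                      # end of an epoch: shrink and reset the buffer
            S = np.vstack([DV, G])        # 2m x d sketch
            _, sing, Vt = np.linalg.svd(S, full_matrices=False)
            eigvals = sing ** 2           # eigenvalues of S^T S, in decreasing order
            shrunk = np.sqrt(np.maximum(eigvals[:m] - eigvals[m - 1], 0.0))
            DV = shrunk[:, None] * Vt[:m]
            G = np.zeros((m, d))
            tau = 0
    return np.vstack([DV, G])             # current sketch S

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))        # stream of to-sketch vectors
S = fd_epoch_sketch(X, m=5)
gap = np.linalg.eigvalsh(X.T @ X - S.T @ S)
print(gap.min() >= -1e-6)                 # True: the sketch never overestimates X^T X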

Although we can use any available algorithm that runs in \mathcal{O}(m^2d) time to do the eigendecomposition (Line 8 in Algorithm 6), we explicitly write down the procedure of reducing this problem to the eigendecomposition of a small square matrix in Algorithm 7, which will be important for deriving the sparse version of the algorithm. Lemma 3 proves that Algorithm 7 correctly finds the top m eigenvectors and eigenvalues.

Algorithm 7 ComputeEigenSystem(S)(S)
1: Input: S=\begin{pmatrix}DV\\ G\end{pmatrix}.
2: Output: V'\in\mathbb{R}^{m\times d} and a diagonal matrix \Sigma\in\mathbb{R}^{m\times m} such that the i-th row of V' and the i-th entry of the diagonal of \Sigma are the i-th eigenvector and eigenvalue of S^{\top}S respectively.
3: Compute M=GV^{\top}.
4: Decompose G-MV into the form LQ where L\in\mathbb{R}^{m\times r}, Q is an r\times d matrix whose rows are orthonormal and r is the rank of G-MV (e.g. by a Gram–Schmidt process).
5: Compute the top m eigenvectors (U\in\mathbb{R}^{m\times(m+r)}) and eigenvalues (\Sigma\in\mathbb{R}^{m\times m}) of the matrix \begin{pmatrix}D^2&\boldsymbol{0}_{m\times r}\\ \boldsymbol{0}_{r\times m}&\boldsymbol{0}_{r\times r}\end{pmatrix}+\begin{pmatrix}M^{\top}\\ L^{\top}\end{pmatrix}\begin{pmatrix}M&L\end{pmatrix}.
6: Return (V',\Sigma) where V'=U\begin{pmatrix}V\\ Q\end{pmatrix}.
Lemma 3.

The outputs of Algorithm 7 are such that the i-th row of V' and the i-th entry of the diagonal of \Sigma are the i-th eigenvector and eigenvalue of S^{\top}S respectively.

Proof.

Let W^{\top}\in\mathbb{R}^{d\times(d-m-r)} be an orthonormal basis of the null space of \begin{pmatrix}V\\ Q\end{pmatrix}. By Line 4, we know that GW^{\top}=\boldsymbol{0} and E=(V^{\top}\;Q^{\top}\;W^{\top}) forms an orthonormal basis of \mathbb{R}^d. Therefore, we have

\begin{aligned}
S^{\top}S&=V^{\top}D^2V+G^{\top}G\\
&=E\begin{pmatrix}D^2&\boldsymbol{0}&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{0}&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{0}&\boldsymbol{0}\end{pmatrix}E^{\top}+EE^{\top}G^{\top}GEE^{\top}\\
&=E\left(\begin{pmatrix}D^2&\boldsymbol{0}&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{0}&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{0}&\boldsymbol{0}\end{pmatrix}+\begin{pmatrix}VG^{\top}\\ QG^{\top}\\ WG^{\top}\end{pmatrix}(GV^{\top}\;GQ^{\top}\;GW^{\top})\right)E^{\top}\\
&=(V^{\top}\;Q^{\top})\underbrace{\left(\begin{pmatrix}D^2&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{0}\end{pmatrix}+\begin{pmatrix}M^{\top}\\ L^{\top}\end{pmatrix}\begin{pmatrix}M&L\end{pmatrix}\right)}_{=C}\begin{pmatrix}V\\ Q\end{pmatrix}
\end{aligned}

where in the last step we use the fact that GQ^{\top}=(MV+LQ)Q^{\top}=L. Now it is clear that the eigenvalues of C are the eigenvalues of S^{\top}S, and the eigenvectors of C become the eigenvectors of S^{\top}S after left multiplication by the matrix (V^{\top}\;Q^{\top}), completing the proof. ∎
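A direct numerical check of this reduction (an illustration only, using QR in place of Gram–Schmidt and assuming the generic case where G-MV has full row rank r=m; all sizes are arbitrary) can be done as follows:

import numpy as np

rng = np.random.default_rng(3)
m, d = 3, 8
V = np.linalg.qr(rng.standard_normal((d, m)))[0].T     # m x d with orthonormal rows
D = np.diag(rng.uniform(1.0, 2.0, size=m))
G = rng.standard_normal((m, d))
S = np.vstack([D @ V, G])

M = G @ V.T
r = m                                                  # generic case: G - MV has full row rank
Qf, Lt = np.linalg.qr((G - M @ V).T)                   # (G - MV)^T = Qf Lt
Q, L = Qf.T, Lt.T                                      # so G - MV = L Q with orthonormal rows of Q
C = np.block([[D @ D, np.zeros((m, r))],
              [np.zeros((r, m)), np.zeros((r, r))]]) + np.vstack([M.T, L.T]) @ np.hstack([M, L])
evals, U = np.linalg.eigh(C)                           # ascending eigenvalues of the small matrix
lifted = (np.vstack([V, Q]).T @ U).T                   # rows: candidate eigenvectors of S^T S
print(np.allclose(np.linalg.eigvalsh(S.T @ S)[-(m + r):], evals))    # nonzero eigenvalues match
print(np.allclose(S.T @ S @ lifted[-1], evals[-1] * lifted[-1]))     # top eigenpair checks out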

We are now ready to present the sparse version of SON with the Frequent Directions sketch (Algorithm 8). The key point is that we represent V_t as F_tZ_t for some F_t\in\mathbb{R}^{m\times m} and Z_t\in\mathbb{R}^{m\times d}, and the weight vector \boldsymbol{w}_t as \bar{\boldsymbol{w}}_t+Z_{t-1}^{\top}\boldsymbol{b}_t, and ensure that the updates of Z_t and \bar{\boldsymbol{w}}_t are always sparse. To see this, denote the sketch S_t by \begin{pmatrix}D_tF_tZ_t\\ G_t\end{pmatrix} and let H_{t,1} and H_{t,2} be the top and bottom halves of H_t. Now the update rule for \boldsymbol{u}_{t+1} can be rewritten as

\begin{aligned}
\boldsymbol{u}_{t+1}&=\boldsymbol{w}_t-\bigl(\boldsymbol{I}_d-S_t^{\top}H_tS_t\bigr)\tfrac{\boldsymbol{g}_t}{\alpha}\\
&=\bar{\boldsymbol{w}}_t+Z_{t-1}^{\top}\boldsymbol{b}_t-\frac{1}{\alpha}\boldsymbol{g}_t+\frac{1}{\alpha}(Z_t^{\top}F_t^{\top}D_t,G_t^{\top})\begin{pmatrix}H_{t,1}S_t\boldsymbol{g}_t\\ H_{t,2}S_t\boldsymbol{g}_t\end{pmatrix}\\
&=\underbrace{\bar{\boldsymbol{w}}_t+\frac{1}{\alpha}(G_t^{\top}H_{t,2}S_t\boldsymbol{g}_t-\boldsymbol{g}_t)-(Z_t-Z_{t-1})^{\top}\boldsymbol{b}_t}_{\bar{\boldsymbol{u}}_{t+1}}+Z_t^{\top}\underbrace{\Bigl(\boldsymbol{b}_t+\frac{1}{\alpha}F_t^{\top}D_tH_{t,1}S_t\boldsymbol{g}_t\Bigr)}_{\boldsymbol{b}'_{t+1}}
\end{aligned}

We will show shortly that Z_t-Z_{t-1}=\Delta_tG_t for some \Delta_t\in\mathbb{R}^{m\times m}, and thus the above update is efficient due to the fact that the rows of G_t are collections of previous sparse vectors \widehat{\boldsymbol{g}}.

Similarly, the update of \boldsymbol{w}_{t+1} can be written as

\begin{aligned}
\boldsymbol{w}_{t+1}&=\boldsymbol{u}_{t+1}-\gamma_t(\boldsymbol{x}_{t+1}-S_t^{\top}H_tS_t\boldsymbol{x}_{t+1})\\
&=\bar{\boldsymbol{u}}_{t+1}+Z_t^{\top}\boldsymbol{b}'_{t+1}-\gamma_t\boldsymbol{x}_{t+1}+\gamma_t(Z_t^{\top}F_t^{\top}D_t,G_t^{\top})\begin{pmatrix}H_{t,1}S_t\boldsymbol{x}_{t+1}\\ H_{t,2}S_t\boldsymbol{x}_{t+1}\end{pmatrix}\\
&=\underbrace{\bar{\boldsymbol{u}}_{t+1}+\gamma_t(G_t^{\top}H_{t,2}S_t\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t+1})}_{\bar{\boldsymbol{w}}_{t+1}}+Z_t^{\top}\underbrace{\bigl(\boldsymbol{b}'_{t+1}+\gamma_tF_t^{\top}D_tH_{t,1}S_t\boldsymbol{x}_{t+1}\bigr)}_{\boldsymbol{b}_{t+1}}.
\end{aligned}

It is clear that \gamma_t can be computed efficiently, and thus the update of \boldsymbol{w}_{t+1} is also efficient. These updates correspond to Lines 7 and 11 of Algorithm 8.

It remains to perform the sketch update efficiently. Algorithm 9 is the sparse version of Algorithm 6. The challenging part is to compute the eigenvectors and eigenvalues efficiently. Fortunately, in light of Algorithm 7, using the new representation V=FZ one can directly translate the process into Algorithm 10 and find that the eigenvectors can be expressed in the form N_1Z+N_2G. To see this, first note that Line 3 of both algorithms computes the same matrix M=GV^{\top}=GZ^{\top}F^{\top}. Then Line 4 decomposes the matrix

G-MV=G-MFZ=\begin{pmatrix}-MF&\boldsymbol{I}_m\end{pmatrix}\begin{pmatrix}Z\\ G\end{pmatrix}\stackrel{\rm def}{=}PR

using Gram–Schmidt into the form LQR such that the rows of QR are orthonormal (that is, QR corresponds to Q in Algorithm 7). While directly applying Gram–Schmidt to PR would take \mathcal{O}(m^2d) time, this step can in fact be implemented efficiently by performing Gram–Schmidt on P (instead of PR) with respect to the inner product \langle\boldsymbol{a},\boldsymbol{b}\rangle=\boldsymbol{a}^{\top}K\boldsymbol{b} with

K=RR^{\top}=\begin{pmatrix}ZZ^{\top}&ZG^{\top}\\ GZ^{\top}&GG^{\top}\end{pmatrix}

being the Gram matrix of R. Since we can maintain the Gram matrix of Z efficiently (see Line 11 of Algorithm 9), and GZ^{\top} and GG^{\top} can be computed sparsely, this decomposition step can also be done efficiently. This modified Gram–Schmidt algorithm is presented in Algorithm 11 (which will also be used in the sparse Oja's sketch), where Line 6 is the key difference compared to standard Gram–Schmidt (see Lemma 4 below for a formal proof of correctness, and the small numeric sketch right after this paragraph).
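The following minimal NumPy sketch (an illustration only; it omits the deletion of zero rows and columns and uses a small tolerance in place of the exact test c\neq 0) implements this K-inner-product Gram–Schmidt and checks the claim of Lemma 4 on random inputs:

import numpy as np

def decompose(P, K):
    # Gram-Schmidt on the rows of P under the inner product <a, b> = a^T K b (cf. Algorithm 11).
    m, n = P.shape
    L = np.zeros((m, m))
    Q = np.zeros((m, n))
    for i in range(m):
        p = P[i]
        alpha = Q @ K @ p
        beta = p - Q.T @ alpha
        c = np.sqrt(beta @ K @ beta)
        if c > 1e-10:
            Q[i] = beta / c
        alpha[i] = c
        L[i] = alpha
    return L, Q

rng = np.random.default_rng(4)
m, n, d = 3, 5, 7
P = rng.standard_normal((m, n))
R = rng.standard_normal((n, d))
K = R @ R.T                                          # Gram matrix of R
L, Q = decompose(P, K)
print(np.allclose(L @ (Q @ R), P @ R))               # L (Q R) = P R
print(np.allclose((Q @ R) @ (Q @ R).T, np.eye(m)))   # rows of Q R are orthonormal (generic case)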

Line 5 of Algorithms 7 and 10 is exactly the same. Finally, the eigenvectors U\begin{pmatrix}V\\ Q\end{pmatrix} in Algorithm 7 now become (with U_1,U_2,Q_1,Q_2,N_1,N_2 defined in Line 6 of Algorithm 10)

\begin{aligned}
U\begin{pmatrix}FZ\\ QR\end{pmatrix}&=(U_1,U_2)\begin{pmatrix}FZ\\ QR\end{pmatrix}=U_1FZ+U_2(Q_1,Q_2)\begin{pmatrix}Z\\ G\end{pmatrix}\\
&=(U_1F+U_2Q_1)Z+U_2Q_2G=N_1Z+N_2G.
\end{aligned}

Therefore, having the eigenvectors in the form N_1Z+N_2G, we can simply update F to N_1 and Z to Z+N_1^{-1}N_2G so that the invariant V=FZ still holds (see Line 12 of Algorithm 9). The update of Z is sparse since G is sparse.

We finally summarize the results of this section in the following theorem.

Theorem 5.

The average running time of Algorithm 8 is \mathcal{O}(m^2+ms) per round, and the regret bound is exactly the same as the one stated in Theorem 3.

Algorithm 8 Sparse Sketched Online Newton with Frequent Directions
1: Parameters C, \alpha and m.
2: Initialize \bar{\boldsymbol{u}}=\boldsymbol{0}_{d\times 1}, \boldsymbol{b}=\boldsymbol{0}_{m\times 1} and (D,F,Z,G,H)\leftarrow SketchInit(\alpha,m) (Algorithm 9).
3: Let S denote the matrix \begin{pmatrix}DFZ\\ G\end{pmatrix} throughout the algorithm (without actually computing it).
4: Let H_1 and H_2 denote the upper and lower halves of H, i.e. H=\begin{pmatrix}H_1\\ H_2\end{pmatrix}.
5: for t=1 to T do
6:   Receive example \boldsymbol{x}_t.
7:   Projection step: compute \widehat{\boldsymbol{x}}=S\boldsymbol{x}_t and \gamma=\frac{\tau_C(\bar{\boldsymbol{u}}^{\top}\boldsymbol{x}_t+\boldsymbol{b}^{\top}Z\boldsymbol{x}_t)}{\boldsymbol{x}_t^{\top}\boldsymbol{x}_t-\widehat{\boldsymbol{x}}^{\top}H\widehat{\boldsymbol{x}}}. Obtain \bar{\boldsymbol{w}}=\bar{\boldsymbol{u}}+\gamma(G^{\top}H_2\widehat{\boldsymbol{x}}-\boldsymbol{x}_t) and \boldsymbol{b}\leftarrow\boldsymbol{b}+\gamma F^{\top}DH_1\widehat{\boldsymbol{x}}.
8:   Predict label y_t=\bar{\boldsymbol{w}}^{\top}\boldsymbol{x}_t+\boldsymbol{b}^{\top}Z\boldsymbol{x}_t and suffer loss \ell_t(y_t).
9:   Compute the gradient \boldsymbol{g}_t=\ell_t'(y_t)\boldsymbol{x}_t and the to-sketch vector \widehat{\boldsymbol{g}}=\sqrt{\sigma_t+\eta_t}\,\boldsymbol{g}_t.
10:  (D,F,Z,G,H,\Delta)\leftarrow SketchUpdate(\widehat{\boldsymbol{g}}) (Algorithm 9).
11:  Update \bar{\boldsymbol{u}}=\bar{\boldsymbol{w}}+\frac{1}{\alpha}(G^{\top}H_2S\boldsymbol{g}-\boldsymbol{g})-G^{\top}\Delta^{\top}\boldsymbol{b} and \boldsymbol{b}\leftarrow\boldsymbol{b}+\frac{1}{\alpha}F^{\top}DH_1S\boldsymbol{g}.
12:end for
Algorithm 9 Sparse Frequent Direction Sketch
1: Internal state: \tau, D, F, Z, G, H and K.
2: SketchInit(\alpha, m):
3: Set \tau=1, D=\boldsymbol{0}_{m\times m}, F=K=\boldsymbol{I}_m, H=\tfrac{1}{\alpha}\boldsymbol{I}_{2m}, G=\boldsymbol{0}_{m\times d}, and let Z be any m\times d matrix whose rows are orthonormal.
4: Return (D,F,Z,G,H).
1: SketchUpdate(\widehat{\boldsymbol{g}}):
2: Insert \widehat{\boldsymbol{g}} into the \tau-th row of G.
3: if \tau<m then
4:   Let \boldsymbol{e} be the 2m\times 1 basis vector whose (m+\tau)-th entry is 1 and compute \boldsymbol{q}=S\widehat{\boldsymbol{g}}-\tfrac{\widehat{\boldsymbol{g}}^{\top}\widehat{\boldsymbol{g}}}{2}\boldsymbol{e}.
5:   Update H\leftarrow H-\frac{H\boldsymbol{q}\boldsymbol{e}^{\top}H}{1+\boldsymbol{e}^{\top}H\boldsymbol{q}} and H\leftarrow H-\frac{H\boldsymbol{e}\boldsymbol{q}^{\top}H}{1+\boldsymbol{q}^{\top}H\boldsymbol{e}}.
6:   Set \Delta=\boldsymbol{0}_{m\times m}.
7:   Set \tau\leftarrow\tau+1.
8: else
9:   (N_1,N_2,\Sigma)\leftarrow ComputeSparseEigenSystem\left(\begin{pmatrix}DFZ\\ G\end{pmatrix},K\right) (Algorithm 10).
10:  Compute \Delta=N_1^{-1}N_2.
11:  Update the Gram matrix K\leftarrow K+\Delta GZ^{\top}+ZG^{\top}\Delta^{\top}+\Delta GG^{\top}\Delta^{\top}.
12:  Update F=N_1, Z\leftarrow Z+\Delta G, and let D be such that D_{i,i}=\sqrt{\Sigma_{i,i}-\Sigma_{m,m}},\;\forall i\in[m].
13:  Set H\leftarrow\mathrm{diag}\left\{\frac{1}{\alpha+D_{1,1}^2},\cdots,\frac{1}{\alpha+D_{m,m}^2},\frac{1}{\alpha},\ldots,\frac{1}{\alpha}\right\}.
14:  Set G=\boldsymbol{0}_{m\times d}.
15:  Set \tau=1.
16: end if
17: Return (D,F,Z,G,H,\Delta).
Algorithm 10 ComputeSparseEigenSystem(S,K)(S,K)
1: Input: S=\begin{pmatrix}DFZ\\ G\end{pmatrix} and the Gram matrix K=ZZ^{\top}.
2: Output: N_1,N_2\in\mathbb{R}^{m\times m} and a diagonal matrix \Sigma\in\mathbb{R}^{m\times m} such that the i-th row of N_1Z+N_2G and the i-th entry of the diagonal of \Sigma are the i-th eigenvector and eigenvalue of the matrix S^{\top}S.
3: Compute M=GZ^{\top}F^{\top}.
4: (L,Q)\leftarrow Decompose\left(\begin{pmatrix}-MF&\boldsymbol{I}_m\end{pmatrix},\begin{pmatrix}K&ZG^{\top}\\ GZ^{\top}&GG^{\top}\end{pmatrix}\right) (Algorithm 11).
5: Let r be the number of columns of L. Compute the top m eigenvectors (U\in\mathbb{R}^{m\times(m+r)}) and eigenvalues (\Sigma\in\mathbb{R}^{m\times m}) of the matrix \begin{pmatrix}D^2&\boldsymbol{0}_{m\times r}\\ \boldsymbol{0}_{r\times m}&\boldsymbol{0}_{r\times r}\end{pmatrix}+\begin{pmatrix}M^{\top}\\ L^{\top}\end{pmatrix}\begin{pmatrix}M&L\end{pmatrix}.
6: Set N_1=U_1F+U_2Q_1 and N_2=U_2Q_2, where U_1 and U_2 are the first m and last r columns of U respectively, and Q_1 and Q_2 are the left and right halves of Q respectively.
7: Return (N_1,N_2,\Sigma).
Lemma 4.

The output of Algorithm 11 ensures that LQR=PR and the rows of QR are orthonormal.

Proof.

It suffices to prove that Algorithm 11 is exactly the same as using standard Gram–Schmidt to decompose the matrix PR into L and an orthonormal matrix that can be written as QR. First note that when K is the identity, Algorithm 11 is simply the standard Gram–Schmidt algorithm applied to P. We will thus go through Lines 1–10 of Algorithm 11 with P replaced by PR and K by the identity matrix \boldsymbol{I}_d, and show that this leads to exactly the same calculations as running Algorithm 11 directly. For clarity, we add "\,\tilde{}\," to symbols to distinguish the two cases (so \tilde{P}=PR and \tilde{K}=\boldsymbol{I}_d). We inductively prove the invariants \tilde{Q}=QR and \tilde{L}=L. The base case \tilde{Q}=QR=\boldsymbol{0} and \tilde{L}=L=\boldsymbol{0} is trivial. Now assume they hold at iteration i-1 and consider iteration i. We have

\tilde{\boldsymbol{\alpha}}=\tilde{Q}\tilde{K}\tilde{\boldsymbol{p}}=QRR^{\top}\boldsymbol{p}=QK\boldsymbol{p}=\boldsymbol{\alpha},
\tilde{\boldsymbol{\beta}}=\tilde{\boldsymbol{p}}-\tilde{Q}^{\top}\tilde{\boldsymbol{\alpha}}=R^{\top}\boldsymbol{p}-(QR)^{\top}\boldsymbol{\alpha}=R^{\top}(\boldsymbol{p}-Q^{\top}\boldsymbol{\alpha})=R^{\top}\boldsymbol{\beta},
\tilde{c}=\sqrt{\tilde{\boldsymbol{\beta}}^{\top}\tilde{K}\tilde{\boldsymbol{\beta}}}=\sqrt{(R^{\top}\boldsymbol{\beta})^{\top}(R^{\top}\boldsymbol{\beta})}=\sqrt{\boldsymbol{\beta}^{\top}K\boldsymbol{\beta}}=c,

which clearly implies that after executing Lines 5–9, we again have \tilde{Q}=QR and \tilde{L}=L, finishing the induction. ∎

Appendix G Details for sparse Oja’s algorithm

We finally provide the missing details for the sparse version of Oja's algorithm. Since we already discussed the updates for \bar{\boldsymbol{w}}_t and \boldsymbol{b}_t in Section 4, we just need to describe how the updates for F_t and Z_t work. Recall that the dense Oja's updates can be written in terms of F and Z as

\begin{aligned}
\Lambda_t&=(\boldsymbol{I}_m-\Gamma_t)\Lambda_{t-1}+\Gamma_t\,\mathrm{diag}\{F_{t-1}Z_{t-1}\widehat{\boldsymbol{g}}_t\}^2\\
F_tZ_t&\xleftarrow{\text{orth}}F_{t-1}Z_{t-1}+\Gamma_tF_{t-1}Z_{t-1}\widehat{\boldsymbol{g}}_t\widehat{\boldsymbol{g}}_t^{\top}=F_{t-1}(Z_{t-1}+F_{t-1}^{-1}\Gamma_tF_{t-1}Z_{t-1}\widehat{\boldsymbol{g}}_t\widehat{\boldsymbol{g}}_t^{\top})~.
\end{aligned} \quad (9)

Here, the update of the eigenvalues is straightforward. For the update of the eigenvectors, we first let Z_t=Z_{t-1}+\boldsymbol{\delta}_t\widehat{\boldsymbol{g}}_t^{\top} where \boldsymbol{\delta}_t=F_{t-1}^{-1}\Gamma_tF_{t-1}Z_{t-1}\widehat{\boldsymbol{g}}_t (note that under the assumption of Footnote 4, F_t is always invertible). Now it is clear that Z_t-Z_{t-1} is a sparse rank-one matrix and the update of \bar{\boldsymbol{u}}_{t+1} is efficient. Finally, it remains to update F_t so that F_tZ_t is the same as orthonormalizing F_{t-1}Z_t, which can be achieved by applying the Gram–Schmidt algorithm to F_{t-1} with respect to the inner product \langle\boldsymbol{a},\boldsymbol{b}\rangle=\boldsymbol{a}^{\top}K_t\boldsymbol{b}, where K_t is the Gram matrix Z_tZ_t^{\top} (see Algorithm 11). Since we can maintain K_t efficiently based on the update of Z_t:

K_t=K_{t-1}+\boldsymbol{\delta}_t\widehat{\boldsymbol{g}}_t^{\top}Z_{t-1}^{\top}+Z_{t-1}\widehat{\boldsymbol{g}}_t\boldsymbol{\delta}_t^{\top}+(\widehat{\boldsymbol{g}}_t^{\top}\widehat{\boldsymbol{g}}_t)\boldsymbol{\delta}_t\boldsymbol{\delta}_t^{\top},

the update of F_t can therefore be implemented in \mathcal{O}(m^3) time.
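For concreteness, here is a minimal NumPy sketch of one such sparse Oja step (an illustration only, reusing the decompose helper sketched earlier in this appendix; the step-size matrix \Gamma, the sparsity pattern of \widehat{\boldsymbol{g}} and all sizes are arbitrary choices):

import numpy as np
# assumes the decompose(P, K) helper from the sketch given earlier

rng = np.random.default_rng(5)
m, d = 3, 10
Z = np.linalg.qr(rng.standard_normal((d, m)))[0].T      # m x d with orthonormal rows
F = np.eye(m)                                           # so V = F Z has orthonormal rows
K = Z @ Z.T                                             # Gram matrix of Z (here the identity)
Gamma = np.diag(rng.uniform(0.05, 0.1, size=m))         # per-direction step sizes
g_hat = np.zeros(d)
g_hat[rng.choice(d, size=3, replace=False)] = rng.standard_normal(3)   # sparse to-sketch vector

delta = np.linalg.solve(F, Gamma @ (F @ (Z @ g_hat)))   # F^{-1} Gamma F Z g_hat
Z_new = Z + np.outer(delta, g_hat)                      # sparse rank-one update of Z
K_new = (K + np.outer(delta, Z @ g_hat) + np.outer(Z @ g_hat, delta)
         + (g_hat @ g_hat) * np.outer(delta, delta))    # maintained Gram matrix
_, F_new = decompose(F, K_new)                          # Gram-Schmidt on F under <a,b> = a^T K_new b
V_new = F_new @ Z_new
print(np.allclose(K_new, Z_new @ Z_new.T))              # Gram matrix maintained correctly
print(np.allclose(V_new @ V_new.T, np.eye(m)))          # rows of F_new Z_new are orthonormal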

Algorithm 11 Decompose(P, K)
1: Input: P\in\mathbb{R}^{m\times n} and K\in\mathbb{R}^{n\times n} such that K is the Gram matrix K=RR^{\top} for some matrix R\in\mathbb{R}^{n\times d} where n\geq m and d\geq m.
2: Output: L\in\mathbb{R}^{m\times r} and Q\in\mathbb{R}^{r\times n} such that LQR=PR, where r is the rank of PR and the rows of QR are orthonormal.
3: Initialize L=\boldsymbol{0}_{m\times m} and Q=\boldsymbol{0}_{m\times n}.
4: for i=1 to m do
5:   Let \boldsymbol{p}^{\top} be the i-th row of P.
6:   Compute \boldsymbol{\alpha}=QK\boldsymbol{p}, \boldsymbol{\beta}=\boldsymbol{p}-Q^{\top}\boldsymbol{\alpha} and c=\sqrt{\boldsymbol{\beta}^{\top}K\boldsymbol{\beta}}.
7:   if c\neq 0 then
8:     Insert \frac{1}{c}\boldsymbol{\beta}^{\top} into the i-th row of Q.
9:   end if
10:  Set the i-th entry of \boldsymbol{\alpha} to be c and insert \boldsymbol{\alpha} into the i-th row of L.
11: end for
12: Delete the all-zero columns of L and the all-zero rows of Q.
13: Return (L,Q).

Appendix H Experiment Details

This section reports some detailed experimental results omitted from Section 5.2. Table 1 includes the description of benchmark datasets; Table 2 reports error rates on relatively small datasets to show that Oja-SON generally has better performance; Table 3 reports concrete error rates for the experiments described in Section 5.2; finally Table 4 shows that Oja’s algorithm estimates the eigenvalues accurately.

As mentioned in Section 5.2, we see substantial improvement for the splice dataset when using Oja's sketch even after the diagonal adaptation. We verify that the condition numbers for this dataset before and after the diagonal adaptation are very close (682 and 668 respectively), explaining why a large improvement is seen using Oja's sketch. Fig. 4 shows the decrease of error rates as Oja-SON with different sketch sizes sees more examples. One can see that even with m=1 Oja-SON already performs very well. This also matches our expectation, since there is a huge gap between the top and second eigenvalues of this dataset (50.7 and 0.4 respectively).

Table 1: Datasets used in experiments
Dataset #examples avg. sparsity #features
20news 18845 93.89 101631
a9a 48841 13.87 123
acoustic 78823 50.00 50
adult 48842 12.00 105
australian 690 11.19 14
breast-cancer 683 10.00 10
census 299284 32.01 401
cod-rna 271617 8.00 8
covtype 581011 11.88 54
diabetes 768 7.01 8
gisette 1000 4971.00 5000
heart 270 9.76 13
ijcnn1 91701 13.00 22
ionosphere 351 30.06 34
letter 20000 15.58 16
magic04 19020 9.99 10
mnist 11791 142.43 780
mushrooms 8124 21.00 112
rcv1 781265 75.72 43001
real-sim 72309 51.30 20958
splice 1000 60.00 60
w1a 2477 11.47 300
w8a 49749 11.65 300
Table 2: Error rates for Sketched Online Newton with different sketching algorithms
Dataset FD-SON Oja-SON
australian 16.0 15.8
breast-cancer 5.3 3.7
diabetes 35.4 32.8
mushrooms 0.5 0.2
splice 22.6 22.9
Table 3: Error rates for different algorithms
Dataset    Oja-SON, no diag. adaptation (m=0)    Oja-SON, no diag. adaptation (m=10)    Oja-SON, with diag. adaptation (m=0)    Oja-SON, with diag. adaptation (m=10)    AdaGrad
20news 0.121338 0.121338 0.049590 0.049590 0.068020
a9a 0.204447 0.195203 0.155953 0.155953 0.156414
acoustic 0.305824 0.260241 0.257894 0.257894 0.259493
adult 0.199763 0.199803 0.150830 0.150830 0.181582
australian 0.366667 0.366667 0.162319 0.157971 0.289855
breast-cancer 0.374817 0.374817 0.036603 0.036603 0.358712
census 0.093610 0.062038 0.051479 0.051439 0.069629
cod-rna 0.175107 0.175107 0.049710 0.049643 0.081066
covtype 0.042304 0.042312 0.050827 0.050818 0.045507
diabetes 0.433594 0.433594 0.329427 0.328125 0.391927
gisette 0.208000 0.208000 0.152000 0.152000 0.154000
heart 0.477778 0.388889 0.244444 0.244444 0.362963
ijcnn1 0.046826 0.046826 0.034536 0.034645 0.036913
ionosphere 0.188034 0.148148 0.182336 0.182336 0.190883
letter 0.306650 0.232300 0.233250 0.230450 0.237350
magic04 0.000263 0.000263 0.000158 0.000158 0.000210
mnist 0.062336 0.062336 0.040031 0.039182 0.046561
mushrooms 0.003323 0.002339 0.002462 0.002462 0.001969
rcv1 0.055976 0.052694 0.052764 0.052766 0.050938
real-sim 0.045140 0.043577 0.029498 0.029498 0.031670
splice 0.343000 0.343000 0.294000 0.229000 0.301000
w1a 0.001615 0.001615 0.004845 0.004845 0.003633
w8a 0.000101 0.000101 0.000422 0.000422 0.000221
Table 4: Largest relative error between true and estimated top 10 eigenvalues using Oja’s rule.
Dataset    Relative eigenvalue difference
a9a 0.90
australian 0.85
breast-cancer 5.38
diabetes 5.13
heart 4.36
ijcnn1 0.57
magic04 11.48
mushrooms 0.91
splice 8.23
w8a 0.95
Figure 4: Error rates for Oja-SON with different sketch sizes on splice dataset