
Fast Algorithms for $\ell_p$-Regression

Deeksha Adil¹
¹ Supported by a Postgraduate Scholarship (PGSD) from NSERC (Natural Sciences and Engineering Research Council of Canada) and Sushant Sachdeva’s NSERC Discovery Grant.
University of Toronto
deeksha@cs.toronto.edu
Rasmus Kyng²
² The research leading to these results has received funding from the grant “Algorithms and complexity for high-accuracy flows and convex optimization” (no. 200021 204787) of the Swiss National Science Foundation.
ETH Zurich
kyng@inf.ethz.ch
Richard Peng³
³ This material is based upon work supported by the National Science Foundation under Grant No. 1846218, and by a Natural Sciences and Engineering Research Council of Canada Discovery Grant. Part of this work was done while the author was at the Georgia Institute of Technology.
University of Waterloo
y5peng@uwaterloo.ca
Sushant Sachdeva⁴
⁴ Sushant Sachdeva’s research is supported by an NSERC (Natural Sciences and Engineering Research Council of Canada) Discovery Grant.
University of Toronto
sachdeva@cs.toronto.edu
Abstract

The $\ell_p$-norm regression problem is a classic problem in optimization with wide-ranging applications in machine learning and theoretical computer science. The goal is to compute $\boldsymbol{\mathit{x}}^{\star}=\arg\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{x}}\|_{p}^{p}$, where $\boldsymbol{\mathit{x}}^{\star}\in\mathbb{R}^{n}$, $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{b}}\in\mathbb{R}^{d}$, and $d\leq n$. Efficient high-accuracy algorithms for the problem have been challenging both in theory and practice, and the state-of-the-art algorithms require $\mathrm{poly}(p)\cdot n^{\frac{1}{2}-\frac{1}{p}}$ linear system solves for $p\geq 2$. In this paper, we provide new algorithms for $\ell_p$-regression (and a more general formulation of the problem) that obtain a high-accuracy solution in $O(pn^{(p-2)/(3p-2)})$ linear system solves. We further propose a new inverse maintenance procedure that speeds up our algorithm to $\widetilde{O}(n^{\omega})$ total runtime, where $O(n^{\omega})$ denotes the running time for multiplying $n\times n$ matrices. Additionally, we give the first Iteratively Reweighted Least Squares (IRLS) algorithm that is guaranteed to converge to an optimum in a few iterations. Our IRLS algorithm has shown exceptional practical performance, beating the currently available implementations in MATLAB/CVX by 10-50x.

1 Introduction

Preliminary versions of the results in this paper have appeared as conference publications [Adi+19, APS19, AS20, Adi+21]. This paper unifies and simplifies results from the preliminary versions.

Linear regression in $\ell_p$-norm seeks to compute a vector $\boldsymbol{\mathit{x}}^{\star}\in\mathbb{R}^{n}$ such that

\boldsymbol{\mathit{x}}^{\star}=\arg\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{x}}\|^{p}_{p},

where $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{b}}\in\mathbb{R}^{d}$, and $d\leq n$. This is a classic convex optimization problem that captures several well-studied questions, including least squares regression ($p=2$), which is equivalent to solving a system of linear equations, and linear programming ($p=\infty$). The $\ell_p$-norm regression problem for $p>1$ has found use across a wide range of applications in machine learning and theoretical computer science, including low-rank matrix approximation [Chi+17], sparse recovery [CT05], graph-based semi-supervised learning [AL11, Cal19, RCL19, Kyn+15], and data clustering and learning problems [ETT15, EDT17, HFE18]. In this paper, we focus on solving the $\ell_p$-norm regression problem for $p\geq 2$. The exact solution to the $\ell_p$-norm regression problem for $p\neq 1,2,\infty$ may not even be expressible using rationals. Thus, the goal is often relaxed to finding an $\varepsilon$-approximate solution to the problem, i.e., finding $\hat{\boldsymbol{\mathit{x}}}$ such that $\boldsymbol{\mathit{A}}\hat{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{b}}$ and

\|\hat{\boldsymbol{\mathit{x}}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p},

for some small $\varepsilon>0$. Furthermore, several applications such as graph-based semi-supervised learning require that $\hat{\boldsymbol{\mathit{x}}}$ is close to $\boldsymbol{\mathit{x}}^{\star}$ coordinate-wise and not just in objective value, necessitating a high-accuracy solution with $\epsilon\approx\frac{1}{\mathrm{poly}(n)}$. In order to find such high-accuracy solutions efficiently, we require an algorithm whose runtime dependence on $\epsilon$ is $\mathrm{poly}\left(\log\frac{1}{\epsilon}\right)$ rather than $\mathrm{poly}\left(\frac{1}{\epsilon}\right)$.

Fast, high-accuracy algorithms for $\ell_p$-regression are challenging both in theory and practice, due to the lack of smoothness and strong convexity of the objective. The Interior Point Method framework of [NN94] can be used to compute a high-accuracy solution for all $p\in[1,\infty]$ in $\widetilde{O}(\sqrt{n})$ iterations (here and throughout, $\widetilde{O}$ hides constants, $p$ dependencies, $\log\frac{1}{\epsilon}$, and $\log n$ factors unless explicitly mentioned), with each iteration requiring the solution of an $n\times n$ system of linear equations. This was the most efficient algorithm for $\ell_p$-regression until 2018. In 2018, [Bub+18] showed that $\Omega(\sqrt{n})$ iterations are necessary for the interior point framework and proposed a new homotopy-based approach that computes a high-accuracy solution in $\widetilde{O}(n^{|\frac{1}{2}-\frac{1}{p}|})$ linear system solves for all $p\in(1,\infty)$. Their algorithms improve over the interior point method by $n^{\Omega(1)}$ factors for values of $p$ bounded away from $1$ and $\infty$. However, for $p$ approaching $1$ or $\infty$, the number of linear system solves required by their algorithm approaches $\sqrt{n}$, the same as required by interior point methods. Finding an algorithm for $\ell_p$-regression requiring $o(n^{1/2})$ linear system solves has been a long-standing open problem.

Among practical algorithms for the $\ell_p$-norm regression problem, the Iteratively Reweighted Least Squares (IRLS) methods stand out due to their simplicity, and have been studied since 1961 [Law61]. For some range of values of $p$, IRLS converges rapidly. However, the method is guaranteed to converge only for $p\in(1.5,3)$ and diverges even for values of $p$ as small as $3.5$ [RCL19]. Over the years, several empirical modifications of the algorithm have been used for various applications in practice (refer to [Bur12] for a full survey). However, an IRLS algorithm that is guaranteed to converge to the optimum in a few iterations for all values of $p$ has again been a long-standing challenge.

1.1 Our Contributions

In this paper, we present the first algorithm for the $\ell_p$-regression problem that finds a high-accuracy solution in at most $O(pn^{1/3})=o(n^{1/2})$ linear system solves, which has been a long sought-after goal in optimization. Our algorithm builds on a new iterative refinement framework for $\ell_p$-norm objectives that allows us to find a high-accuracy solution using low-accuracy solutions to a subproblem. The iterative refinement framework allows the subproblems to be solved to an $n^{o(1)}$-approximation, which has been useful in several follow-up works on graph optimization (see Section 1.3). We further propose a new inverse maintenance framework and show how to speed up our algorithm to solve the $\ell_p$-norm problem to high accuracy in total time $\widetilde{O}(n^{\omega})$. Finally, we give the first IRLS algorithm that provably converges to a high-accuracy solution in a few iterations.

Preliminary versions of the results presented in this paper have appeared in previous conference publications by [Adi+19, APS19, AS20, Adi+21]. In this paper, we present our results for a more general formulation of the $\ell_p$-regression problem,

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}   (1)

for matrices $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{1}\times n}$, $\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{2}\times n}$ with $m_{1},m_{2}\geq n$ and $d\leq n$. Let $m=\max\{m_{1},m_{2}\}$, and assume $\boldsymbol{\mathit{d}}\perp\{\ker(\boldsymbol{\mathit{M}})\cap\ker(\boldsymbol{\mathit{N}})\cap\ker(\boldsymbol{\mathit{A}})\}$ and $\boldsymbol{\mathit{b}}\in\mathrm{im}(\boldsymbol{\mathit{A}})$, so that the above problem has a bounded solution. Our first result is a fast, high-accuracy algorithm for Problem (1).

Theorem 1.1.

Let $\epsilon>0$ and $p\geq 2$. There is an algorithm that, starting from $\boldsymbol{\mathit{x}}^{(0)}$ satisfying $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}$, finds an $\epsilon$-approximate solution to Problem (1) in $O\left(pm^{\frac{p-2}{3p-2}}\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)$ calls to a linear system solver.

As a corollary, for the $\ell_p$-norm regression problem, i.e., $\boldsymbol{\mathit{d}}=\boldsymbol{\mathit{M}}=0$ and $\boldsymbol{\mathit{N}}=\boldsymbol{\mathit{I}}$, our algorithm converges in $O(pn^{\frac{p-2}{3p-2}}\log\frac{n}{\epsilon})$ calls to a linear system solver. This is the first algorithm that converges to a high-accuracy solution at an asymptotic rate of $\widetilde{O}(n^{1/3})=o(n^{1/2})$ for all $p\in[2,\infty)$, and is thus faster than all previously known algorithms by at least a factor of $n^{\Omega(1)}$. As a result, we answer the long-standing open question in optimization of whether such a rate of convergence can be achieved.

Our next result shows how to speed up our algorithms and solve Problem (1) in time $\widetilde{O}(m^{\omega})$ (or $\widetilde{O}(n^{\omega})$ for $\ell_p$-regression), where $\omega\approx 2.37$ and $O(n^{\omega})$ is the current time required for multiplying two $n\times n$ matrices. This is almost as fast as solving a single system of linear equations. We achieve this guarantee via a new inverse maintenance procedure for $\ell_p$-regression and prove the following result.

Theorem 1.2.

If $\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}}$ are explicitly given matrices with polynomially bounded condition numbers, and $p\geq 2$, there is an algorithm for Problem (1) that can be implemented to run in total time $\widetilde{O}(m^{\omega})$.

Our inverse maintenance algorithm is presented in Section 5, where we also give a more fine-grained dependence on the parameters $m_{1},m_{2},n$ and $p$ in the rate of convergence (Theorem 5.1). Our algorithms and techniques for $\ell_p$-regression have motivated a line of work in graph optimization and the study of accelerated width-reduced methods, which we describe in detail in Section 1.3.

Our next contribution concerns the IRLS approach. For the $\ell_p$-regression problem, i.e., $\boldsymbol{\mathit{d}}=\boldsymbol{\mathit{M}}=0$ in (1), we give an IRLS algorithm that globally converges to the optimum in at most $O\left(p^{3}m^{\frac{p-2}{2(p-1)}}\log\frac{m}{\epsilon}\right)$ linear system solves for all $p\geq 2$ (Section 6). This is the first IRLS algorithm that is guaranteed to converge to the optimum for all values of $p\geq 2$, with a quantitative bound on the runtime. Our IRLS algorithm has proven to be very fast and robust in practice and is faster than existing implementations in MATLAB/CVX by 10-50x. These speed-ups are demonstrated in experiments performed in [APS19], and we present these results along with our algorithm in Section 6.

Theorem 1.3.

Let $p\geq 2$. Algorithm 10 returns $\boldsymbol{\mathit{x}}$ such that $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}$ and $\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p}$, in at most $O\left(p^{3}m^{\frac{p-2}{2(p-1)}}\log\left(\frac{m}{\epsilon}\right)\right)$ calls to a linear system solver.

The analysis of our IRLS algorithm fits into the overall framework of this paper. Such an algorithm first appeared in the conference paper by [APS19], where they also ran some experiments to demonstrate the performance of their IRLS algorithm in practice. We include some of their experimental results to show that the rate of convergence in practice is even better than the theoretical bounds.

1.2 Technical Overview

Overall $\log\frac{1}{\epsilon}$ Convergence

Our algorithm follows an overall iterative refinement approach for $p\geq 2$: we show that $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}+\delta)-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})$ can be upper bounded by the function $\boldsymbol{res}_{p}(\delta)=\boldsymbol{\mathit{g}}^{\top}\delta+\|\boldsymbol{\mathit{R}}\delta\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\delta\|_{p}^{p}$, and lower bounded by a similar function. Here, the vector $\boldsymbol{\mathit{g}}$ and matrix $\boldsymbol{\mathit{R}}$ depend on $\boldsymbol{\mathit{x}}$, and the matrix $\boldsymbol{\mathit{N}}$ is as defined in Problem (1). We prove that if we can solve $\min_{\boldsymbol{\mathit{A}}\delta=0}\boldsymbol{res}_{p}(\delta)$ to a $\kappa$-approximation, then $O(p\kappa\log((\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star}))/\varepsilon))$ such solves (iterations) suffice to obtain an $\varepsilon$-approximate solution to Problem (1) (Theorem 2.1). We call this problem the Residual Problem and this process Iterative Refinement for $\ell_p$-norms.

Solving the Residual Problem

We next perform a binary search on the linear term of the residual problem and reduce it to solving $O(\log p)$ problems of the form $\min_{\boldsymbol{\mathit{A}}\delta=\boldsymbol{\mathit{c}}}\|\boldsymbol{\mathit{R}}\delta\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\delta\|_{p}^{p}$ (Lemma 3.1). In order to solve these new problems, we use a multiplicative weight update routine that returns a constant-factor approximate solution in $O(pm^{(p-2)/(3p-2)})$ calls to a linear system solver (Theorem 3.2). We can thus find a constant-factor approximate solution to the residual problem in $O(pm^{(p-2)/(3p-2)}\log p)$ calls to a linear system solver (Corollary 3.7). Combined with iterative refinement, we obtain an algorithm that converges in $O\left(p^{2}m^{\frac{p-2}{3p-2}}\log p\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)\leq\widetilde{O}\left(p^{2}m^{1/3}\log\frac{1}{\epsilon}\right)$ linear system solves.
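To make the shape of such a routine concrete, below is a minimal, schematic numpy sketch of a reweighted-solve loop of this general flavor: every iteration costs one linear system solve, and coordinates where $|\boldsymbol{\mathit{N}}\delta|$ is large receive multiplicatively heavier weights. The function names, the specific update rule, and the omission of the width-reduction steps are illustrative simplifications and not the exact routine of Section 3.

```python
import numpy as np

def weighted_quadratic_step(A, c, R, N, w, reg=1e-12):
    """One oracle call: solve min_{A delta = c} ||R delta||_2^2 + sum_i w_i (N delta)_i^2
    by forming the KKT (saddle-point) system, i.e. a single linear system solve."""
    d, n = A.shape
    H = 2 * (R.T @ R + N.T @ (w[:, None] * N)) + reg * np.eye(n)
    K = np.block([[H, A.T], [A, np.zeros((d, d))]])
    rhs = np.concatenate([np.zeros(n), c])
    return np.linalg.solve(K, rhs)[:n]

def mwu_residual_sketch(A, c, R, N, p, iters=20, alpha=0.1):
    """Schematic multiplicative-weight-update loop for
    min_{A delta = c} ||R delta||_2^2 + ||N delta||_p^p with p >= 2:
    reweight, resolve, and average the iterates."""
    w = np.ones(N.shape[0])
    iterates = []
    for _ in range(iters):
        delta = weighted_quadratic_step(A, c, R, N, w)
        z = np.abs(N @ delta)
        # coordinates with large |N delta| get multiplicatively heavier weights
        w *= (1 + alpha * z / (z.max() + 1e-12)) ** (p - 2)
        iterates.append(delta)
    return np.mean(iterates, axis=0)
```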

Improving $p$ Dependence

Furthermore, we prove that for any $q\neq p$, given a $p$-norm residual problem, we can construct a corresponding $q$-norm residual problem such that a $\beta$-approximate solution to the $q$-norm residual problem roughly gives an $O(\beta^{2})m^{|\frac{1}{p}-\frac{1}{q}|}$-approximate solution to the $p$-norm residual problem (Theorem 4.3). As a consequence, if $p$ is large, i.e. $p\geq\log m$, a constant-factor approximate solution to the corresponding $\log m$-norm residual problem gives an $O(m^{\frac{1}{\log m}})\leq O(1)$-approximate solution to the $p$-norm residual problem in at most $O(\log m\cdot m^{\frac{\log m-2}{3\log m-2}})\leq\widetilde{O}(m^{\frac{p-2}{3p-2}})$ calls to a linear system solver. Combining this with the algorithm described in the previous paragraph, we obtain our final guarantees as described in Theorem 1.1.

$\ell_p$-Regression in Matrix Multiplication Time

We next describe how to obtain the guarantees of Theorem 1.2. While solving the residual problem, the algorithm solves a system of linear equations at every iteration. The key observation for obtaining improved running times is that the weights determining these linear systems change slowly. Thus, we can maintain a spectral approximation to the linear system via a sequence of lazy low-rank updates. The Sherman-Morrison-Woodbury formula then allows us to update the inverse quickly. We can use the spectral approximation as a preconditioner for solving the linear system quickly at each iteration. Thus, we obtain a speed-up since the linear systems do not need to be solved from scratch at each iteration, giving Theorem 1.2.
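The linear-algebraic primitive behind these lazy updates is the Sherman-Morrison-Woodbury identity. The following small numpy sketch (an illustration of the primitive, not the paper's actual data structure) updates a maintained inverse of $\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{W}}\boldsymbol{\mathit{N}}$ after only $k$ of the $m$ diagonal weights change, at cost $O(n^{2}k)$ instead of a fresh inversion.

```python
import numpy as np

def woodbury_update(H_inv, U, C):
    """Given H_inv = H^{-1}, return (H + U C U^T)^{-1} via Sherman-Morrison-Woodbury.
    If U has k << n columns, this costs O(n^2 k) instead of a new O(n^omega) inversion."""
    S = np.linalg.inv(C) + U.T @ H_inv @ U            # k x k capacitance matrix
    return H_inv - H_inv @ U @ np.linalg.solve(S, U.T @ H_inv)

# toy usage: maintain the inverse of N^T diag(w) N as a few weights change
rng = np.random.default_rng(0)
n, m = 50, 200
N = rng.standard_normal((m, n))
w = np.ones(m)
H_inv = np.linalg.inv(N.T @ (w[:, None] * N))

changed = np.array([3, 17, 42])                       # coordinates whose weight changed
dw = rng.uniform(0.1, 0.5, size=changed.size)         # weight increments
H_inv = woodbury_update(H_inv, N[changed].T, np.diag(dw))

# sanity check against recomputing the inverse from scratch
w[changed] += dw
assert np.allclose(H_inv, np.linalg.inv(N.T @ (w[:, None] * N)), atol=1e-6)
```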

Good Starting Solution

For pure $\ell_p$-norm objectives, i.e., $\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}$, we further show how to find a starting solution $\boldsymbol{\mathit{x}}^{(0)}$ such that $\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{(0)}\|_{p}^{p}\leq O(m)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p}$. The key idea is that for any $k$, a constant-factor approximate solution to the $k$-norm problem is an $O(m)$-approximate solution to the $2k$-norm problem (Lemma 2.9). This inspires a homotopy approach, where we first solve an $\ell_2$-norm problem, followed by $\ell_{2^{2}},\ell_{2^{3}},\ldots,\ell_{2^{\lceil\log p\rceil}}$-norm problems to constant-factor approximations. We can thus obtain the required starting solution in at most $O\left(pm^{\frac{p-2}{3p-2}}\log m\log^{2}p\right)$ calls to a linear system solver.

IRLS Algorithm

For the IRLS algorithm, given the residual problem at an iteration, we show how to construct a weighted least squares problem whose solution is an $O\left(p^{2}m^{\frac{p-2}{2(p-1)}}\right)$-approximate solution to the residual problem (Lemma 6.1). This result, along with the overall iterative refinement framework, culminates in our IRLS algorithm, which directly solves one of these weighted least squares problems in every iteration. A minimal sketch of the classical version of such an iteration appears below.
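For intuition, the sketch below is the classical (textbook) IRLS iteration for $\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{x}}\|_{p}$ in numpy: every step is one weighted least-squares solve with weights $|\boldsymbol{\mathit{x}}_{i}|^{p-2}$. This plain version is the scheme that can diverge for larger $p$; Algorithm 10 in Section 6 instead derives the weighted least squares problem from the residual problem, which is what yields the guarantee of Theorem 1.3.

```python
import numpy as np

def irls_lp(A, b, p, iters=50, eps=1e-8):
    """Textbook IRLS for min ||x||_p subject to A x = b (p >= 2).
    Each iteration solves min sum_i w_i x_i^2 s.t. A x = b, whose closed form is
    x = W^{-1} A^T (A W^{-1} A^T)^{-1} b with W = diag(w), w_i = |x_i|^{p-2}."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]        # start from the l2 (least-squares) solution
    for _ in range(iters):
        w = np.maximum(np.abs(x), eps) ** (p - 2)   # clamp to avoid zero weights
        Winv = 1.0 / w
        y = np.linalg.solve(A @ (Winv[:, None] * A.T), b)
        x = Winv * (A.T @ y)
    return x
```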

1.3 Related Works

$\ell_p$-Regression

Until 2018, the fastest high-accuracy algorithms for $\ell_p$-regression, including the Interior Point Method framework of [NN94] and the homotopy method of [Bub+18], asymptotically required $\approx O(\sqrt{n})$ linear system solves. The first algorithm for $\ell_p$-regression to beat the $\sqrt{n}$ iteration bound was that of [Adi+19], which was faster than all previously known algorithms and asymptotically required at most $\approx O(p^{O(p)}n^{1/3})$ iterations for all $p>1$. Concurrently, [Bul18] used tools from convex optimization to give an algorithm for $p=4$ that matches the rates of [Adi+19] up to logarithmic factors. Subsequent works have improved the $p$ dependence [AS20, Adi+21] and proposed alternate methods for obtaining matching rates (up to logarithmic and $p$ factors) [Car+20]. A recent work by [JLS22] shows how to solve $\ell_p$-regression in $\approx n+\mathrm{poly}(p)\cdot d^{\frac{p-2}{3p-2}}$ iterations, where $d$ is the smaller dimension of the constraint matrix $\boldsymbol{\mathit{A}}$.

Width Reduced MWU Algorithms

Width reduction is a technique that has been used repeatedly in multiplicative weight update algorithms to speed up rates of convergence from $m^{1/2}$ to $m^{1/3}$, where $m$ is the size of the input. This technique first appeared in the work of [Chr+11] in the context of the maximum flow problem, where for a graph with $n$ vertices and $m$ edges it improves the iteration complexity from $\widetilde{O}(m^{1/2})$ to $\widetilde{O}(m^{1/3})$. Similar improvements were later obtained in algorithms for $\ell_{1}$- and $\ell_{\infty}$-regression by [Chi+13, EV19], for $\ell_p$-regression ($p\geq 2$) [Adi+19], and in algorithms for matrix scaling [All+17]. In a recent work, [ABS21] extend this technique to improve iteration complexities for all quasi-self-concordant objectives, which include soft-max and logistic regression among others.

Inverse Maintenance

Inverse maintenance is a technique used to speed up iterative algorithms. It was first introduced by [Vai90] in the context of minimum cost and multicommodity flows, and has further been used in interior point methods [LS14, LSW15]. In 2019, [Adi+19] developed an inverse maintenance method for $\ell_p$-regression that reuses inverses, exploiting the controllable rates of change of the underlying variables.

IRLS Algorithms

Iteratively Reweighted Least Squares algorithms are simple to implement and have thus been used in a wide range of applications, including sparse signal reconstruction [GR97], compressive sensing [CY08], and Chebyshev approximation in FIR filter design [BB94]. Refer to [Bur12] for a full survey. The works of [Osb85] and [Kar70] show convergence in the limit and under certain assumptions on the starting solution. For $\ell_{1}$-regression, [SV16b, SV16a, SV16] show quantitative convergence bounds. In 2019, [APS19] gave the first IRLS algorithm with quantitative bounds that is guaranteed to converge with no conditions on the starting point. Their algorithm also works well in practice, as suggested by their experiments.

Follow-up Work in Graph Optimization

The $\ell_p$-norm flow problem, which asks to minimize the $\ell_p$-norm of a flow vector subject to demand constraints, can be modeled as an $\ell_p$-regression problem. The maximum flow problem is the special case $p=\infty$. For graphs with $n$ vertices and $m$ edges, the $\ell_p$-norm regression algorithm of [Adi+19], when combined with fast Laplacian solvers, directly gives an $\approx\widetilde{O}(p^{O(p)}m^{4/3})$ time algorithm for the $\ell_p$-norm flow problem. Building on their work, specifically the iterative refinement framework, which allows one to solve these problems to high accuracy while only requiring an $m^{o(1)}$-approximate solution to an $\ell_p$-norm subproblem, [Kyn+19] give an algorithm for unweighted graphs that runs in time $\exp(p^{3/2})m^{1+\frac{7}{\sqrt{p-1}}+o(1)}$. We note that their algorithm runs in time $m^{1+o(1)}$ for $p=\sqrt{\log m}$. Further works, including [Adi+21], also utilize the iterative refinement guarantees to give an algorithm with runtime $p(m^{1+o(1)}+n^{4/3+o(1)})$ for weighted $\ell_p$-norm flow problems, by designing new sparsification algorithms that preserve $\ell_p$-norm objectives of the subproblem to an $m^{o(1)}$-approximation. For the maximum flow problem, [AS20] give an $m^{1+o(1)}\epsilon^{-1}$ time algorithm for the approximate maximum flow problem on unweighted graphs. [KLS20] build on these works further and give an algorithm that computes a maximum $s$-$t$ flow on graphs with integer edge capacities at most $U$ in time $m^{4/3+o(1)}U^{1/3}$. In a recent breakthrough, [Che+22] give an algorithm for the maximum flow problem and the $\ell_p$-norm flow problem that runs in almost linear time, $m^{1+o(1)}$.

1.4 Organization of Paper

Section 2 describes the overall iterative refinement framework, first for $p\geq 2$ and then for $p\in(1,2)$; at the end, we show how to find good starting solutions for pure $\ell_p$-norm objectives for $p\geq 2$. Section 3 describes the width-reduced multiplicative weight update routine used to solve the residual problem. In Section 4, we show how to solve $p$-norm residual problems using $q$-norm residual problems and give our overall algorithm (Algorithm 6). Section 5 contains our new inverse maintenance algorithm that allows us to solve $\ell_p$-regression almost as fast as linear regression. Finally, in Section 6, we give an IRLS algorithm and present some experimental results from [APS19].

2 Iterative Refinement for $\ell_p$-norms

Recall that we would like to find a high-accuracy solution for the problem,

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}

for matrices $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{1}\times n}$, $\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{2}\times n}$ with $m_{1},m_{2}\geq n$ and $d\leq n$.

A common approach in smooth, convex optimization is to upper bound the function by its first-order Taylor expansion plus a quadratic term (smoothness), and to minimize this bound repeatedly to converge to the optimum. Additionally, when the function has a similar quadratic lower bound (strong convexity), it can be shown that minimizing this upper bound $O\left(\log\frac{1}{\epsilon}\right)$ times (hiding problem-dependent parameters) is sufficient to converge to an $\epsilon$-approximate solution. The $\ell_p$-norm function satisfies no such quadratic upper bound, since it grows too steeply, and no such lower bound, since it is too flat around $0$. In this section, we show that we can instead upper and lower bound the $\ell_p$ function for $p\geq 2$ by a second-order Taylor expansion plus an $\ell_{p}^{p}$ term. We show that it is sufficient to minimize such a bound to a $\kappa$-approximation $O\left(p\kappa\log\frac{1}{\epsilon}\right)$ times. Such an iterative refinement method was previously only known for $p=2$, and we thus call this algorithm Iterative Refinement for $\ell_p$-norms. In later sections, we show different ways to minimize this upper bound approximately to obtain fast algorithms.

For $p\in(1,2)$, we use a smoothed function that is quadratic in a small range around $0$ and grows as $\ell_{p}^{p}$ otherwise. We use this function to give upper and lower bounds and a similar iterative refinement scheme.

We further show how to obtain a good starting solution for Problem (1) in the special case when the vector $\boldsymbol{\mathit{d}}$ and matrix $\boldsymbol{\mathit{M}}$ are zero, i.e., the objective function is only the $\ell_p$-norm function.

These sections are based on the results and proofs from [Adi+19, APS19, AS20, Adi+21].

2.1 Iterative Refinement

We will prove that the following algorithm can be used to obtain a high-accuracy solution, i.e., a $\log\frac{1}{\epsilon}$ rate of convergence, for $\ell_p$-regression.

Algorithm 1 Iterative Refinement
1: procedure Main-Solver($\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},p,\epsilon$)
2:     $\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}^{(0)}$
3:     $\nu\leftarrow$ bound on $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})$ $\triangleright$ If $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\geq 0$, then $\nu\leftarrow\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})$
4:     while $\nu>\epsilon$ do
5:         $\widetilde{\Delta}\leftarrow$ ResidualSolver($\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p$)
6:         if $\boldsymbol{res}_{p}(\widetilde{\Delta})\geq\frac{\nu}{32p\kappa}$ then
7:             $\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}-\frac{\widetilde{\Delta}}{p}$
8:         else
9:             $\nu\leftarrow\frac{\nu}{2}$
10:    return $\boldsymbol{\mathit{x}}$
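For readers who prefer code, a direct Python transcription of Algorithm 1's control flow is given below. The callbacks residual_solver and res_value are placeholders for ResidualSolver and for evaluating $\boldsymbol{res}_{p}$; they are problem-specific and not specified by the pseudocode above.

```python
def main_solver(x0, f, residual_solver, res_value, p, eps, kappa=1.0, nu0=None):
    """Transcription of Algorithm 1 (Iterative Refinement).
    residual_solver(x, nu) should return a kappa-approximate solution of the
    residual problem at x; res_value(x, Delta) should evaluate res_p(Delta)."""
    x = x0.copy()
    # nu is a bound on f(x0) - f(x*); if f(x*) >= 0 then f(x0) itself is such a bound
    nu = nu0 if nu0 is not None else f(x)
    while nu > eps:
        Delta = residual_solver(x, nu)
        if res_value(x, Delta) >= nu / (32 * p * kappa):
            x = x - Delta / p          # refinement step (Line 7)
        else:
            nu = nu / 2                # halve the progress bound (Line 9)
    return x
```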

Specifically, we will prove,

Theorem 2.1.

Let $p\geq 2$ and $\kappa\geq 1$. Let the initial solution $\boldsymbol{\mathit{x}}^{(0)}$ satisfy $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}$. Algorithm 1 returns an $\epsilon$-approximate solution $\boldsymbol{\mathit{x}}$ of Problem (1) in at most $O\left(p\kappa\log\left(\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)\right)$ calls to a $\kappa$-approximate solver for the residual problem (Definition 2.3).

Before we prove the above result, we will define some of the terms used in the above statement.

2.1.1 Preliminaries

Definition 2.2 ($\epsilon$-Approximate Solution).

Let $\boldsymbol{\mathit{x}}^{\star}$ denote the optimizer of Problem (1). We say $\widetilde{\boldsymbol{\mathit{x}}}$ is an $\epsilon$-approximate solution to (1) if $\boldsymbol{\mathit{A}}\widetilde{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{b}}$ and

\boldsymbol{\mathit{f}}(\widetilde{\boldsymbol{\mathit{x}}})\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})+\epsilon.
Definition 2.3 (Residual Problem).

For any $p\geq 2$, we define the residual problem $res_{p}(\Delta)$ for (1) at a feasible $\boldsymbol{\mathit{x}}$ as

\max_{\boldsymbol{\mathit{A}}\Delta=0}\quad\boldsymbol{res}_{p}(\Delta)\stackrel{\mathrm{def}}{=}\boldsymbol{\mathit{g}}^{\top}\Delta-\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p},\text{ where}

\boldsymbol{\mathit{g}}=\frac{1}{p}\boldsymbol{\mathit{d}}+\frac{2}{p}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{N}}^{\top}Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\quad\text{and}\quad\boldsymbol{\mathit{R}}=\frac{2}{p^{2}}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+2\boldsymbol{\mathit{N}}^{\top}Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}.
Definition 2.4 (Approximation to Residual Problem).

Let $p\geq 2$ and let $\Delta^{\star}$ be the optimum of the residual problem. $\widetilde{\Delta}$ is a $\kappa$-approximation to the residual problem if $\boldsymbol{\mathit{A}}\widetilde{\Delta}=0$ and

res_{p}(\widetilde{\Delta})\geq\frac{1}{\kappa}res_{p}(\Delta^{\star}).
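To make Definitions 2.3 and 2.4 concrete, the following small numpy helper (illustrative only) evaluates the vector $\boldsymbol{\mathit{g}}$, the matrix $\boldsymbol{\mathit{R}}$, and the residual objective $\boldsymbol{res}_{p}(\Delta)$ at a feasible $\boldsymbol{\mathit{x}}$.

```python
import numpy as np

def residual_terms(x, d, M, N, p):
    """g and R from Definition 2.3 at the point x."""
    Nx = N @ x
    D = np.abs(Nx) ** (p - 2)                          # diagonal entries |N x|^{p-2}
    g = d / p + (2.0 / p) * M.T @ (M @ x) + N.T @ (D * Nx)
    R = (2.0 / p**2) * M.T @ M + 2.0 * N.T @ (D[:, None] * N)
    return g, R

def res_p(x, Delta, d, M, N, p):
    """Objective value of the residual problem, res_p(Delta)."""
    g, R = residual_terms(x, d, M, N, p)
    return g @ Delta - Delta @ (R @ Delta) - np.sum(np.abs(N @ Delta) ** p)
```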

2.1.2 Bounding Change in Objective

In order to prove our result, we first show that we can upper and lower bound the change in our $\ell_p$-objective by a linear term plus a quadratic term plus an $\ell_p$-norm term.

Lemma 2.5.

For any $\boldsymbol{\mathit{x}},\Delta$ and $p\geq 2$, we have, for vectors $\boldsymbol{\mathit{r}},\boldsymbol{\mathit{g}}$ defined coordinate-wise as $\boldsymbol{\mathit{r}}=|\boldsymbol{\mathit{x}}|^{p-2}$ and $\boldsymbol{\mathit{g}}=p|\boldsymbol{\mathit{x}}|^{p-2}\boldsymbol{\mathit{x}}$,

\frac{p}{8}\sum_{i}\boldsymbol{\mathit{r}}_{i}\Delta_{i}^{2}+\frac{1}{2^{p+1}}\|\Delta\|_{p}^{p}\leq\|\boldsymbol{\mathit{x}}+\Delta\|^{p}_{p}-\|\boldsymbol{\mathit{x}}\|_{p}^{p}-\boldsymbol{\mathit{g}}^{\top}\Delta\leq 2p^{2}\sum_{i}\boldsymbol{\mathit{r}}_{i}\Delta_{i}^{2}+p^{p}\|\Delta\|_{p}^{p}.
Proof.

To show this, we show that the above holds for all coordinates. For a single coordinate, the above expression is equivalent to proving,

p8|x|p2Δ2+12p+1|Δ|p|𝒙+Δ|p|𝒙|pp|x|p1sgn(x)Δ2p2|x|p2Δ2+pp|Δ|p.\frac{p}{8}|x|^{p-2}\Delta^{2}+\frac{1}{2^{p+1}}\mathopen{}\mathclose{{}\left|\Delta}\right|^{p}\leq\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{x}}+\Delta}\right|^{p}-\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{x}}}\right|^{p}-p\mathopen{}\mathclose{{}\left|x}\right|^{p-1}sgn(x)\Delta\leq 2p^{2}|x|^{p-2}\Delta^{2}+p^{p}\mathopen{}\mathclose{{}\left|\Delta}\right|^{p}.

Let Δ=αx\Delta=\alpha x. Since the above clearly holds for x=0x=0, it remains to show for all α\alpha,

p8α2+12p+1|α|p|1+α|p1pα2p2α2+pp|α|p.\frac{p}{8}\alpha^{2}+\frac{1}{2^{p+1}}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}\leq\mathopen{}\mathclose{{}\left|1+\alpha}\right|^{p}-1-p\alpha\leq 2p^{2}\alpha^{2}+p^{p}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}.
  1. 1.

    α1\alpha\geq 1:
    In this case, 1+α2αpα1+\alpha\leq 2\alpha\leq p\cdot\alpha. So, |1+α|ppp|α|p\mathopen{}\mathclose{{}\left|1+\alpha}\right|^{p}\leq p^{p}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p} and the right inequality directly holds. To show the other side, let

    h(α)=(1+α)p1pαp8α212p+1αp.h(\alpha)=(1+\alpha)^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{1}{2^{p+1}}{\alpha}^{p}.

    We have,

    h(α)=p(1+α)p1pp4αp2p+1αp1h^{\prime}(\alpha)=p(1+\alpha)^{p-1}-p-\frac{p}{4}\alpha-\frac{p}{2^{p+1}}{\alpha}^{p-1}

    and

    h′′(α)=p(p1)(1+α)p2p4p(p1)2p+1αp20.h^{\prime\prime}(\alpha)=p(p-1)(1+\alpha)^{p-2}-\frac{p}{4}-\frac{p(p-1)}{2^{p+1}}{\alpha}^{p-2}\geq 0.

    Since h′′(α)0h^{\prime\prime}(\alpha)\geq 0, h(α)h(1)0h^{\prime}(\alpha)\geq h^{\prime}(1)\geq 0. So hh is an increasing function in α\alpha and h(α)h(1)0h(\alpha)\geq h(1)\geq 0.

  2. 2.

    α1\alpha\leq-1:
    Now, |1+α|1+|α|p|α|\mathopen{}\mathclose{{}\left|1+\alpha}\right|\leq 1+\mathopen{}\mathclose{{}\left|\alpha}\right|\leq p\cdot\mathopen{}\mathclose{{}\left|\alpha}\right|, and 2α2p2|α|p02\alpha^{2}p^{2}-\mathopen{}\mathclose{{}\left|\alpha}\right|p\geq 0. As a result,

    |1+α|p|α|p+2α2p2+pp|α|p\mathopen{}\mathclose{{}\left|1+\alpha}\right|^{p}\leq-\mathopen{}\mathclose{{}\left|\alpha}\right|p+2\alpha^{2}p^{2}+p^{p}\cdot\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}

    which gives the right inequality. Consider,

    h(α)=|1+α|p1pαp8α212p+1|α|p.h(\alpha)=|1+\alpha|^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{1}{2^{p+1}}|\alpha|^{p}.
    h(α)=p|1+α|p1pp4α+p12p+1|α|p1.h^{\prime}(\alpha)=-p|1+\alpha|^{p-1}-p-\frac{p}{4}\alpha+p\frac{1}{2^{p+1}}|\alpha|^{p-1}.

    Let β=α\beta=-\alpha. The above expression now becomes,

    p(β1)p1p+p4β+p12p+1βp1.-p(\beta-1)^{p-1}-p+\frac{p}{4}\beta+p\frac{1}{2^{p+1}}\beta^{p-1}.

    We know that β1\beta\geq 1. When β2\beta\geq 2, β2β1\frac{\beta}{2}\leq\beta-1 and β2(β2)p1\frac{\beta}{2}\leq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}. This gives us,

    p4β+p12p+1βp1p2(β2)p1+p2(β2)p1p(β1)p1\frac{p}{4}\beta+p\frac{1}{2^{p+1}}\beta^{p-1}\leq\frac{p}{2}\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}+\frac{p}{2}\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}\leq p(\beta-1)^{p-1}

    giving us h(α)0h^{\prime}(\alpha)\leq 0 for α2\alpha\leq-2. When β2\beta\leq 2, β2(β2)p1\frac{\beta}{2}\geq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1} and β21\frac{\beta}{2}\leq 1.

    p4β+p12p+1βp1p2β2+p2β2p\frac{p}{4}\beta+p\frac{1}{2^{p+1}}\beta^{p-1}\leq\frac{p}{2}\cdot\frac{\beta}{2}+\frac{p}{2}\cdot\frac{\beta}{2}\leq p

    giving us h(α)0h^{\prime}(\alpha)\leq 0 for 2α1-2\leq\alpha\leq-1. Therefore, h(α)0h^{\prime}(\alpha)\leq 0 giving us, h(α)h(1)0h(\alpha)\geq h(-1)\geq 0, thus giving the left inequality.

  3. 3.

    |α|1\mathopen{}\mathclose{{}\left|\alpha}\right|\leq 1:
    Let s(α)=1+pα+2p2α2+pp|α|p(1+α)p.s(\alpha)=1+p\alpha+2p^{2}\alpha^{2}+p^{p}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}-(1+\alpha)^{p}. Now,

    s(α)=p+4p2α+pp+1|α|p1sgn(α)p(1+α)p1.s^{\prime}(\alpha)=p+4p^{2}\alpha+p^{p+1}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-1}sgn(\alpha)-p(1+\alpha)^{p-1}.

    When α0\alpha\leq 0, we have,

    s(α)=p+4p2αpp+1|α|p1p(1+α)p1.s^{\prime}(\alpha)=p+4p^{2}\alpha-p^{p+1}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-1}-p(1+\alpha)^{p-1}.

    and

    s′′(α)=4p2+pp+1(p1)|α|p2p(p1)(1+α)p12p2+pp+1(p1)|α|p2p(p1)0.s^{\prime\prime}(\alpha)=4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)(1+\alpha)^{p-1}\geq 2p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)\geq 0.

    So ss^{\prime} is an increasing function of α\alpha which gives us, s(α)s(0)=0s^{\prime}(\alpha)\leq s^{\prime}(0)=0. Therefore ss is a decreasing function, and the minimum is at 0 which is 0. This gives us our required inequality for α0\alpha\leq 0. When α1p1\alpha\geq\frac{1}{p-1}, 1+αpα1+\alpha\leq p\cdot\alpha and s(α)0s^{\prime}(\alpha)\geq 0. We are left with the range 0α1p10\leq\alpha\leq\frac{1}{p-1}. Again, we have,

    s′′(α)\displaystyle s^{\prime\prime}(\alpha) =4p2+pp+1(p1)|α|p2p(p1)(1+α)p1\displaystyle=4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)(1+\alpha)^{p-1}
    4p2+pp+1(p1)|α|p2p(p1)(1+1p1)p1\displaystyle\geq 4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)(1+\frac{1}{p-1})^{p-1}
    4p2+pp+1(p1)|α|p2p(p1)e,When p gets large the last term approaches e\displaystyle\geq 4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)e,\text{When $p$ gets large the last term approaches $e$}
    0.\displaystyle\geq 0.

    Therefore, ss^{\prime} is an increasing function, s(α)s(0)=0s^{\prime}(\alpha)\geq s^{\prime}(0)=0. This implies ss is an increasing function, giving, s(α)s(0)=0s(\alpha)\geq s(0)=0 as required.

    To show the other direction,

    h(α)=(1+α)p1pαp8α212p+1|α|p(1+α)p1pαp8α2p8α2=(1+α)p1pαp4α2.h(\alpha)=(1+\alpha)^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{1}{2^{p+1}}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}\geq(1+\alpha)^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{p}{8}{\alpha}^{2}=(1+\alpha)^{p}-1-p\alpha-\frac{p}{4}\alpha^{2}.

    Now, since p2p\geq 2,

    ((1+α)p21)sgn(α)0\displaystyle\mathopen{}\mathclose{{}\left((1+\alpha)^{p-2}-1}\right)sgn(\alpha)\geq 0
    \displaystyle\Rightarrow ((1+α)p11α)sgn(α)0\displaystyle\mathopen{}\mathclose{{}\left((1+\alpha)^{p-1}-1-\alpha}\right)sgn(\alpha)\geq 0
    \displaystyle\Rightarrow (p(1+α)p1pp2α)sgn(α)0\displaystyle\mathopen{}\mathclose{{}\left(p(1+\alpha)^{p-1}-p-\frac{p}{2}\alpha}\right)sgn(\alpha)\geq 0

    We thus have, h(α)0h^{\prime}(\alpha)\geq 0 when α\alpha is positive and h(α)0h^{\prime}(\alpha)\leq 0 when α\alpha is negative. The minimum of hh is at 0 which is 0. This concludes the proof of this case.

2.1.3 Proof of Iterative Refinement

In this section, we prove our main result. We start by proving the following lemma, which relates the objective of the residual problem defined in the preliminaries to the change in objective value when $\boldsymbol{\mathit{x}}$ is updated by $\Delta/p$.

Lemma 2.6.

For any $\boldsymbol{\mathit{x}},\Delta$, $p\geq 2$, and $\lambda=16p$,

\boldsymbol{res}_{p}(\Delta)\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\boldsymbol{\mathit{f}}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}\right),

and

\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\boldsymbol{\mathit{f}}\left(\boldsymbol{\mathit{x}}-\lambda\frac{\Delta}{p}\right)\leq\lambda\cdot\boldsymbol{res}_{p}(\Delta).
Proof.

We note,

𝒇(𝒙Δp)=\displaystyle\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)= 𝒅(𝒙Δp)+𝑴(𝒙Δp)22+𝑵(𝒙Δp)pp\displaystyle\boldsymbol{\mathit{d}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{M}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)}\right\rVert_{2}^{2}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)}\right\rVert_{p}^{p}
=\displaystyle= 𝒅𝒙+𝑴𝒙22+𝑵(𝒙Δp)pp1p𝒅Δ2p𝒙𝑴𝑴Δ+1p2𝑴Δ22\displaystyle\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)}\right\rVert_{p}^{p}-\frac{1}{p}\boldsymbol{\mathit{d}}^{\top}\Delta-\frac{2}{p}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\frac{1}{p^{2}}\|\boldsymbol{\mathit{M}}\Delta\|_{2}^{2}
\displaystyle\leq 𝒅𝒙+𝑴𝒙22+𝑵𝒙ppp|𝑵𝒙|p2(𝑵𝒙)𝑵Δp+2p2(𝑵Δ)p(𝑵𝒙)p2𝑵Δp\displaystyle\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-p|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}(\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}})^{\top}\frac{\boldsymbol{\mathit{N}}\Delta}{p}+2p^{2}\frac{(\boldsymbol{\mathit{N}}\Delta)^{\top}}{p}(\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}})^{p-2}\frac{\boldsymbol{\mathit{N}}\Delta}{p}
+pp𝑵Δppp1p𝒅Δ2p𝒙𝑴𝑴Δ+1p2𝑴Δ22\displaystyle+p^{p}\mathopen{}\mathclose{{}\left\lVert\frac{\boldsymbol{\mathit{N}}\Delta}{p}}\right\rVert_{p}^{p}-\frac{1}{p}\boldsymbol{\mathit{d}}^{\top}\Delta-\frac{2}{p}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\frac{1}{p^{2}}\|\boldsymbol{\mathit{M}}\Delta\|_{2}^{2}
(From right inequality of Lemma 2.5)
=𝒇(𝒙)(1p𝒅+2p𝑴𝑴𝒙+𝑵|𝑵𝒙|p2𝑵𝒙)Δ\displaystyle=\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\mathopen{}\mathclose{{}\left(\frac{1}{p}\boldsymbol{\mathit{d}}+\frac{2}{p}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{N}}^{\top}|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}}\right)^{\top}\Delta
Δ(2p2𝑴𝑴+2𝑵Diag(|𝑵𝒙|p2)𝑵)Δ+𝑵Δpp\displaystyle-\Delta^{\top}\mathopen{}\mathclose{{}\left(\frac{2}{p^{2}}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+2\boldsymbol{\mathit{N}}^{\top}Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}}\right)\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p}
=𝒇(𝒙)𝒓𝒆𝒔p(Δ), From Definition 2.3.\displaystyle=\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\boldsymbol{res}_{p}(\Delta),\text{ From Definition \ref{def:residual}.}

Let 𝒈\boldsymbol{\mathit{g}} and 𝑹\boldsymbol{\mathit{R}} be as defined in Definition 2.3. We now use a similar calculation and the left inequality of Lemma 2.5 to get,

𝒇(𝒙λΔp)𝒇(𝒙)λ𝒈Δλ216pΔ𝑹Δλppp2p+1.\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\lambda\frac{\Delta}{p}}\right)\geq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{\lambda^{2}}{16p}\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{\lambda^{p}}{p^{p}2^{p+1}}.

For λ=16p\lambda=16p,

𝒇(𝒙)λ𝒈Δλ216pΔ𝑹Δλppp2p+1\displaystyle\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{\lambda^{2}}{16p}\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{\lambda^{p}}{p^{p}2^{p+1}} 𝒇(𝒙)λ(𝒈Δλ16pΔ𝑹Δλp1pp2p+1)\displaystyle\geq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{\lambda}{16p}\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{\lambda^{p-1}}{p^{p}2^{p+1}}}\right)
𝒇(𝒙)λ𝒓𝒆𝒔p(Δ),\displaystyle\geq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\boldsymbol{res}_{p}(\Delta),

thus concluding the proof of the lemma. ∎

We now track the value of $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})$ with a parameter $\nu$. We will first show that, if we have a $\kappa$-approximate solver for the residual problem, we can either take a step to obtain $\boldsymbol{\mathit{x}}^{(t+1)}$ such that

\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t+1)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\left(1-\frac{1}{32p\kappa}\right)\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\right),   (2)

or we need to reduce the value of $\nu$ by a factor of $2$, since $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})$ is less than $\nu/2$.

Lemma 2.7.

Consider an iterate $t$. Let $\boldsymbol{res}_{p}$ denote the residual problem at $\boldsymbol{\mathit{x}}^{(t)}$ and let $\nu$ be as defined in Algorithm 1. Let $\widetilde{\Delta}$ denote the solution returned by a $\kappa$-approximate solver for the residual problem. Then,

  1. either $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu$ and $\boldsymbol{\mathit{x}}^{(t+1)}=\boldsymbol{\mathit{x}}^{(t)}-\frac{\widetilde{\Delta}}{p}$ satisfies (2),

  2. or $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\frac{\nu}{2}$ and Line 9 in the algorithm is executed.

Proof.

We will first prove that 𝒇(𝒙(t))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu by induction. For t=0t=0, 𝒇(𝒙(0))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu by definition. Now, let us assume this is true for iteration tt. Note that, if the algorithm updates 𝒙\boldsymbol{\mathit{x}} in line 7, since 𝒇(𝒙(t+1))𝒇(𝒙(t))\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t+1)})\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)}) (solution of the residual problem is always non-negative), the relation holds for t+1t+1. Otherwise, the algorithm reduces ν\nu to ν/2\nu/2 and 𝒓𝒆𝒔p(Δ~)<ν32pκ\boldsymbol{res}_{p}({\widetilde{{\Delta}}})<\frac{\nu}{32p\kappa}. For Δ¯{\bar{{\Delta}}} such that 𝒙=𝒙(t)λΔ¯p\boldsymbol{\mathit{x}}^{\star}=\boldsymbol{\mathit{x}}^{(t)}-\lambda\frac{{\bar{{\Delta}}}}{p}, and from Lemma 2.6,

𝒇(𝒙(t))𝒇(𝒙)=𝒇(𝒙(t))𝒇(𝒙(t)λΔ¯p)λ𝒓𝒆𝒔p(Δ¯)λ𝒓𝒆𝒔p(Δ).\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})=\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}^{(t)}-\lambda\frac{{\bar{{\Delta}}}}{p}}\right)\leq\lambda\boldsymbol{res}_{p}({\bar{{\Delta}}})\leq\lambda\boldsymbol{res}_{p}({{{\Delta^{\star}}}}).

Since Δ~{\widetilde{{\Delta}}} is a κ\kappa-approximate solution to the residual problem,

λ𝒓𝒆𝒔p(Δ)λκ𝒓𝒆𝒔p(Δ~)<16pκν32pκν2.\lambda\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\leq\lambda\kappa\boldsymbol{res}_{p}({\widetilde{{\Delta}}})<16p\kappa\frac{\nu}{32p\kappa}\leq\frac{\nu}{2}.

We have thus shown that 𝒇(𝒙(t))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu for all iterates tt and whenever Line 9 of the algorithm is executed, 2 from the lemma statement holds. It remains to prove that if 𝒓𝒆𝒔p(Δ~)ν32pκ\boldsymbol{res}_{p}({\widetilde{{\Delta}}})\geq\frac{\nu}{32p\kappa}, then 𝒙(t+1)=𝒙(t)Δ~p\boldsymbol{\mathit{x}}^{(t+1)}=\boldsymbol{\mathit{x}}^{(t)}-\frac{{\widetilde{{\Delta}}}}{p} satisfies (2). Since, 𝒇(𝒙(t))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu,

𝒓𝒆𝒔p(Δ~)ν32pκ132pκ(𝒇(𝒙(t))𝒇(𝒙)).\boldsymbol{res}_{p}({\widetilde{{\Delta}}})\geq\frac{\nu}{32p\kappa}\geq\frac{1}{32p\kappa}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right).

Now, from Lemma 2.6,

𝒇(𝒙(t+1))𝒇(𝒙)\displaystyle\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}^{(t+1)}}\right)-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star}) 𝒇(𝒙(t))𝒓𝒆𝒔p(Δ~)𝒇(𝒙)\displaystyle\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{res}_{p}({\widetilde{{\Delta}}})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})
(𝒇(𝒙(t))𝒇(𝒙))132pκ(𝒇(𝒙(t))𝒇(𝒙))\displaystyle\leq\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right)-\frac{1}{32p\kappa}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right)
=(1132pκ)(𝒇(𝒙(t))𝒇(𝒙)).\displaystyle=\mathopen{}\mathclose{{}\left(1-\frac{1}{32p\kappa}}\right)\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right).

Corollary 2.8.

The vector $\boldsymbol{\mathit{x}}$ returned by Algorithm 1 is an $\epsilon$-approximate solution to Problem (1).

Proof.

Our starting solution $\boldsymbol{\mathit{x}}^{(0)}$ satisfies $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}$, and the solutions $\widetilde{\Delta}$ of the residual problem added in each iteration satisfy $\boldsymbol{\mathit{A}}\widetilde{\Delta}=0$. Therefore, $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}$. For the second part, note that we always have $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu$. When we stop, $\nu\leq\epsilon$. Thus,

\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\epsilon. ∎

We are now ready to prove our main result (Theorem 2.1).

Proof.

From Corollary 2.8, the solution returned by the algorithm is as required. We next need to bound the runtime. From Lemma 2.7, the algorithm either reduces $\nu$ or Equation (2) holds. The number of times we can reduce $\nu$ is bounded by $\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}$. The number of times Equation (2) holds can be bounded as follows:

\frac{\epsilon}{2}\leq\boldsymbol{\mathit{f}}\left(\boldsymbol{\mathit{x}}^{(t+1)}\right)-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\left(1-\frac{1}{32p\kappa}\right)^{t}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\right).

Therefore, the total number of iterations $T$ is bounded as $T\leq 32p\kappa\log\left(\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)$. ∎

2.2 Starting Solution and Homotopy for Pure $\ell_p$ Objectives

In this section, we consider the case where $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}$, i.e., $\boldsymbol{\mathit{d}}=0$ and $\boldsymbol{\mathit{M}}=0$:

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}   (3)

For such objectives, we show how to find a good starting solution. We note that we can solve the following problem, since it is equivalent to solving a system of linear equations:

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{2}^{2}.

Refer to Appendix A for details on how the above is equivalent to solving a system of linear equations.
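As a minimal illustration of this equivalence (the full argument is in Appendix A), one can obtain the constrained $\ell_2$ minimizer from a single saddle-point (KKT) linear system; a numpy sketch, under the assumption that the system is nonsingular:

```python
import numpy as np

def l2_start(A, N, b):
    """Solve min ||N x||_2^2 subject to A x = b via the KKT system
        [2 N^T N  A^T] [x]   [0]
        [   A      0 ] [y] = [b],
    i.e. one linear system solve."""
    d, n = A.shape
    K = np.block([[2 * N.T @ N, A.T],
                  [A, np.zeros((d, d))]])
    rhs = np.concatenate([np.zeros(n), b])
    return np.linalg.solve(K, rhs)[:n]
```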

We next consider a homotopy on $p$. Specifically, we want to find a starting solution for the $\ell_p$-norm problem by first solving an $\ell_2$-norm problem, followed by $\ell_{2^{2}},\ell_{2^{3}},\ldots,\ell_{2^{\lfloor\log p-1\rfloor}}$-norm problems to a constant approximation. The following lemma relates these solutions.

Lemma 2.9.

Let $\boldsymbol{\mathit{x}}_{k}^{\star}$ denote the optimum of the $k$-norm problem and $\boldsymbol{\mathit{x}}_{2k}^{\star}$ the optimum of the $2k$-norm problem (3). Let $\widetilde{\boldsymbol{\mathit{x}}}$ be an $O(1)$-approximate solution to the $k$-norm problem. Then the following relation holds:

\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}\leq\|\widetilde{\boldsymbol{\mathit{x}}}\|_{2k}^{2k}\leq O(m)\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}.

In other words, $\widetilde{\boldsymbol{\mathit{x}}}$ is an $O(m)$-approximate solution to the $2k$-norm problem.

Proof.

The left side follows from the optimality of $\boldsymbol{\mathit{x}}^{\star}_{2k}$. For the other side, we have the following chain of inequalities:

\|\widetilde{\boldsymbol{\mathit{x}}}\|_{2k}^{2k}\leq\|\widetilde{\boldsymbol{\mathit{x}}}\|_{k}^{2k}\leq O(1)\|\boldsymbol{\mathit{x}}^{\star}_{k}\|_{k}^{2k}\leq O(1)\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{k}^{2k}\leq O(1)m^{2k\left(\frac{1}{k}-\frac{1}{2k}\right)}\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}=O(m)\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}. ∎

Consider the following procedure to obtain a starting point $\boldsymbol{\mathit{x}}^{(0)}$ for the $\ell_p$-norm problem.

Algorithm 2 Homotopy on $p$ for Starting Solution
1: procedure StartSolution($\boldsymbol{\mathit{A}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{b}},p$)
2:     $\boldsymbol{\mathit{x}}^{(0)}\leftarrow 0,\;k\leftarrow 2$
3:     while $k\leq 2^{\lfloor\log p-1\rfloor}$ do
4:         $\boldsymbol{\mathit{x}}^{(0)}\leftarrow$ Main-Solver($\boldsymbol{\mathit{A}},0,\boldsymbol{\mathit{N}},0,\boldsymbol{\mathit{b}},k,1$) $\triangleright$ 2-approximate solution to the $k$-norm problem
5:         $k\leftarrow 2k$
6:     return $\boldsymbol{\mathit{x}}^{(0)}$
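A hedged Python rendering of this homotopy loop follows; solve_k_norm is a placeholder callback standing in for a run of Main-Solver with $\boldsymbol{\mathit{d}}=0$, $\boldsymbol{\mathit{M}}=0$ that returns a 2-approximate solution to the $k$-norm problem, warm-started at the current iterate.

```python
import numpy as np

def start_solution(A, N, b, p, solve_k_norm):
    """Homotopy of Algorithm 2: constant-factor solutions for k = 2, 4, 8, ...,
    each warm-starting the next, up to k = 2^(floor(log2(p) - 1))."""
    x0 = np.zeros(A.shape[1])
    k = 2
    while k <= 2 ** int(np.floor(np.log2(p) - 1)):
        x0 = solve_k_norm(k, x0)
        k *= 2
    return x0
```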
Lemma 2.10.

Let $\boldsymbol{\mathit{x}}^{(0)}$ be as returned by Algorithm 2. Suppose there exists an oracle that solves the residual problem for any norm $\ell_{k}$, i.e., $\boldsymbol{res}_{k}$, to a $\kappa_{k}$-approximation in time $T(k,\kappa_{k})$. We can then compute $\boldsymbol{\mathit{x}}^{(0)}$, which is an $O(m)$-approximate solution to the $\ell_p$-norm problem, in time at most

O\left(p\log m\right)\sum_{k=2^{i},\,i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}).
Proof.

For any $k$, we have an $O(1)$-approximate solution to the $k/2$-norm problem. From Lemma 2.9, this is an $O(m)$-approximate solution to the $k$-norm problem. We then have, from Theorem 2.1, that we require $O\left(k\kappa_{k}T(k,\kappa_{k})\log m\right)$ time to solve the $k$-norm problem to a constant approximation. Summing over all $k$, the total runtime is

T=\sum_{k=2^{i},\,i=2}^{i=\lfloor\log p-1\rfloor}O(k\kappa_{k}T(k,\kappa_{k})\log m)\leq O\left(p\log m\right)\sum_{k=2^{i},\,i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}). ∎

In later sections, we will describe an oracle with $\kappa_{k}=O(1)$ for all values of $k$ and with $T(k,\kappa_{k})$ depending only linearly on $k$.

2.3 Iterative Refinement for p(1,2)p\in(1,2)

We will consider the following pure p\ell_{p} problem here,

min𝑨𝒙=𝒃𝑵𝒙p,\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}, (4)

where p(1,2)p\in(1,2). In the previous sections we saw an iterative refinement framework that worked for p2p\geq 2. In this section, we will show a similar iterative refinement for p(1,2)p\in(1,2). In particular, we will prove the following result from [Adi+19].

Theorem 2.11.

Let p\in(1,2), and \kappa\geq 1. Given an initial solution \boldsymbol{\mathit{x}}^{(0)} satisfying \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}, we can find \widetilde{\boldsymbol{\mathit{x}}} such that \boldsymbol{\mathit{A}}\widetilde{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{b}} and \|\boldsymbol{\mathit{N}}\widetilde{\boldsymbol{\mathit{x}}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p} in O\mathopen{}\mathclose{{}\left(\mathopen{}\mathclose{{}\left(\frac{p}{p-1}}\right)^{\frac{1}{p-1}}\kappa\log\frac{m}{\epsilon}}\right) calls to a \kappa-approximate solver for the residual problem (Definition 2.13).

The key idea in the algorithm for p2p\geq 2 was an upper and lower bound on the function that was an 22+pp\ell_{2}^{2}+\ell_{p}^{p}-norm term (Lemma 2.5). Such a bound does not hold when p<2p<2, however, we will show that a smoothed p\ell_{p}-norm function can be used for providing such bounds. Specifically, we use the following smoothed p\ell_{p}-norm function defined in [Bub+18].

Definition 2.12.

(Smoothed \ell_{p} Function.) Let p\in(1,2), x\in\mathbb{R}, and t\geq 0. We define,

γp(t,x)={p2tp2x2 if |x|t,|x|p(1p2)tp otherwise.\gamma_{p}(t,x)=\begin{cases}\frac{p}{2}t^{p-2}x^{2}&\text{ if }|x|\leq t,\\ |x|^{p}-\mathopen{}\mathclose{{}\left(1-\frac{p}{2}}\right)t^{p}&\text{ otherwise.}\end{cases}

For any vector 𝐱\boldsymbol{\mathit{x}} and 𝐭0\boldsymbol{\mathit{t}}\geq 0, we define γp(𝐭,𝐱)=iγp(𝐭i,𝐱i)\gamma_{p}(\boldsymbol{\mathit{t}},\boldsymbol{\mathit{x}})=\sum_{i}\gamma_{p}(\boldsymbol{\mathit{t}}_{i},\boldsymbol{\mathit{x}}_{i}).
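For concreteness, the smoothed \ell_{p} function can be evaluated coordinate-wise as below. This is a direct transcription of Definition 2.12 into Python with NumPy (the language choice is ours, not the paper's); the guard t_i > 0 only avoids evaluating 0^{p-2} when t_i = 0, in which case the value is |x_i|^p, which is what the code returns.

```python
import numpy as np

def gamma_p(p, t, x):
    """Smoothed l_p function of Definition 2.12, summed over coordinates.

    Coordinate-wise: (p/2) * t_i**(p-2) * x_i**2 if |x_i| <= t_i, and
    |x_i|**p - (1 - p/2) * t_i**p otherwise.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t, x = np.broadcast_arrays(t, x)
    out = np.abs(x) ** p - (1.0 - 0.5 * p) * t ** p        # the |x_i| > t_i branch
    small = (np.abs(x) <= t) & (t > 0)                     # quadratic regime
    out[small] = 0.5 * p * t[small] ** (p - 2) * x[small] ** 2
    return float(out.sum())
```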

We define the following residual problem for this section.

Definition 2.13.

For p(1,2)p\in(1,2), we define the residual problem at any feasible 𝐱\boldsymbol{\mathit{x}} to be,

max𝑨Δ=0𝒓𝒆𝒔p(Δ)=def𝒈Δ2pγp(|𝑵𝒙|,𝑵Δ),\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{res}_{p}(\Delta)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\boldsymbol{\mathit{g}}^{\top}\Delta-2^{p}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta),

where 𝐠=p𝐍|𝐍𝐱|p2𝐍𝐱\boldsymbol{\mathit{g}}=p\boldsymbol{\mathit{N}}^{\top}|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}.

We will follow a similar structure as Section 2.1. We begin by proving analogues of Lemma 2.6 and Lemma 2.5.

Lemma 2.14.

Let p(1,2)p\in(1,2). For any xx and Δ\Delta,

|x|p+p|x|p2xΔ+p1p2pγp(|x|,Δ)|x+Δ|p|x|p+p|x|p2xΔ+2pγp(|x|,Δ)|x|^{p}+p|x|^{p-2}x\Delta+\frac{p-1}{p2^{p}}\gamma_{p}(|x|,\Delta)\leq|x+\Delta|^{p}\leq|x|^{p}+p|x|^{p-2}x\Delta+2^{p}\gamma_{p}(|x|,\Delta)
Proof.

We first show the following inequality holds for |α|1|\alpha|\leq 1

1+αp+(p1)4α2(1+α)p1+αp+p2p1α2.1+\alpha p+\frac{(p-1)}{4}\alpha^{2}\leq(1+\alpha)^{p}\leq 1+\alpha p+p2^{p-1}\alpha^{2}. (5)

Let us first show the left inequality, i.e. 1+αp+p14α2(1+α)p1+\alpha p+\frac{p-1}{4}\alpha^{2}\leq(1+\alpha)^{p}. Define the following function,

h(α)=(1+α)p1αpp14α2.h(\alpha)=(1+\alpha)^{p}-1-\alpha p-\frac{p-1}{4}\alpha^{2}.

At \alpha=1 and \alpha=-1, we have h(\alpha)\geq 0. The derivative of h with respect to \alpha is h^{\prime}(\alpha)=p(1+\alpha)^{p-1}-p-\frac{(p-1)}{2}\alpha. Next, let us see what happens when \mathopen{}\mathclose{{}\left|\alpha}\right|<1.

h′′(α)=p(p1)(1+α)p2p12=(p1)(p(1+α)2p12)0h^{\prime\prime}(\alpha)=p(p-1)(1+\alpha)^{p-2}-\frac{p-1}{2}=(p-1)\mathopen{}\mathclose{{}\left(\frac{p}{(1+\alpha)^{2-p}}-\frac{1}{2}}\right)\geq 0

This implies that h^{\prime}(\alpha) is an increasing function of \alpha, so the point \alpha_{0} at which h^{\prime}(\alpha_{0})=0 is where h attains its minimum value. The only point where h^{\prime} vanishes is \alpha_{0}=0. This implies h(\alpha)\geq h(0)=0, which concludes the proof of the left inequality. For the right inequality, define:

s(α)=1+αp+p2p1α2(1+α)p.s(\alpha)=1+\alpha p+p2^{p-1}\alpha^{2}-(1+\alpha)^{p}.

Note that s(0)=0s(0)=0 and s(1),s(1)0s(1),s(-1)\geq 0. We have,

s(α)=p+p2pαp(1+α)p1,s^{\prime}(\alpha)=p+p2^{p}\alpha-p(1+\alpha)^{p-1},

and

(1+α)p1sign(α)(1+α)sign(α).(1+\alpha)^{p-1}sign(\alpha)\leq(1+\alpha)sign(\alpha).

Using this, we get s^{\prime}(\alpha)sign(\alpha)\geq p|\alpha|(2^{p}-1)\geq 0, which says that s^{\prime}(\alpha) is non-negative for \alpha positive and non-positive for \alpha negative. Thus the minimum of s is attained at \alpha=0, where s(0)=0, and so s(\alpha)\geq 0.

Before we prove the lemma, we will prove the following inequality for β1\beta\geq 1,

(β1)p1+112pβp1.(\beta-1)^{p-1}+1\geq\frac{1}{2^{p}}\beta^{p-1}. (6)

We have (\beta-1)\geq\frac{\beta}{2} for \beta\geq 2, so the claim holds for \beta\geq 2 since (\beta-1)^{p-1}\geq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}\geq\frac{1}{2^{p}}\beta^{p-1}. When 1\leq\beta\leq 2, we have 1\geq\frac{\beta}{2}, so the claim holds since 1\geq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}\geq\frac{1}{2^{p}}\beta^{p-1}.

We now prove the lemma.

Let Δ=αx\Delta=\alpha x. The term p|x|p1sign(x)αx=αp|x|p1|x|=αp|x|pp|x|^{p-1}sign(x)\cdot\alpha x=\alpha p|x|^{p-1}|x|=\alpha p|x|^{p}. Let us first look at the case when |α|1|\alpha|\leq 1. We want to show,

|x|p+αp|x|p+cp2|x|p2|αx|2|x+αx|p|x|p+αp|x|p+Cp2|x|p2|αx|2\displaystyle|x|^{p}+\alpha p|x|^{p}+c\frac{p}{2}|x|^{p-2}|\alpha x|^{2}\leq|x+\alpha x|^{p}\leq|x|^{p}+\alpha p|x|^{p}+C\frac{p}{2}|x|^{p-2}|\alpha x|^{2}
(1+αp)+cp2α2(1+α)p(1+αp)+Cp2α2.\displaystyle\Leftrightarrow(1+\alpha p)+c\frac{p}{2}\alpha^{2}\leq(1+\alpha)^{p}\leq(1+\alpha p)+C\frac{p}{2}\alpha^{2}.

This follows from Equation (5) and the facts that c=\frac{p-1}{p2^{p}} satisfies \frac{cp}{2}\leq\frac{p-1}{4} and C=2^{p} satisfies \frac{Cp}{2}\geq p2^{p-1}. We next look at the case when |\alpha|\geq 1. Now, \gamma_{p}(|x|,\Delta)=|\Delta|^{p}+(\frac{p}{2}-1)|x|^{p}. We need to show

|x|p(1+αp)+|x|p(p1)p2p(|α|p+p21)|x|p|1+α|p|x|p(1+αp)+2p|x|p(|α|p+p21).|x|^{p}(1+\alpha p)+\frac{|x|^{p}(p-1)}{p2^{p}}(|\alpha|^{p}+\frac{p}{2}-1)\leq|x|^{p}|1+\alpha|^{p}\leq|x|^{p}(1+\alpha p)+2^{p}|x|^{p}(|\alpha|^{p}+\frac{p}{2}-1).

When |x|=0|x|=0 it is trivially true. When |x|0|x|\neq 0, let

h(α)=|1+α|p(1+αp)(p1)p2p(|α|p+p21).h(\alpha)=|1+\alpha|^{p}-(1+\alpha p)-\frac{(p-1)}{p2^{p}}(|\alpha|^{p}+\frac{p}{2}-1).

Now, taking the derivative with respect to α\alpha we get,

h(α)=p(|1+α|p1sign(α)1(p1)p2p|α|p1sign(α)).h^{\prime}(\alpha)=p\mathopen{}\mathclose{{}\left(|1+\alpha|^{p-1}sign(\alpha)-1-\frac{(p-1)}{p2^{p}}|\alpha|^{p-1}sign(\alpha)}\right).

We use the mean value theorem to get, for \alpha\geq 1,

(1+α)p11\displaystyle(1+\alpha)^{p-1}-1 =(p1)α(1+z)p2,z(0,α)\displaystyle=(p-1)\alpha(1+z)^{p-2},z\in(0,\alpha)
(p1)α(2α)p2\displaystyle\geq(p-1)\alpha(2\alpha)^{p-2}
p12αp1\displaystyle\geq\frac{p-1}{2}\alpha^{p-1}

which implies h(α)0h^{\prime}(\alpha)\geq 0 in this range as well. When α1\alpha\leq-1 it follows from Equation (6) that h(α)0h^{\prime}(\alpha)\leq 0. So the function hh is increasing for α1\alpha\geq 1 and decreasing for α1\alpha\leq-1. The minimum value of hh is min{h(1),h(1)}0min\{h(1),h(-1)\}\geq 0. It follows that h(α)0h(\alpha)\geq 0 which gives us the left inequality. The other side requires proving,

|1+α|p1+αp+2p(|α|p+p21).|1+\alpha|^{p}\leq 1+\alpha p+2^{p}(|\alpha|^{p}+\frac{p}{2}-1).

Define:

s(α)=1+αp+2p(|α|p+p21)|1+α|p.s(\alpha)=1+\alpha p+2^{p}(|\alpha|^{p}+\frac{p}{2}-1)-|1+\alpha|^{p}.

The derivative s(α)=p+(p2p|α|p1p|1+α|p1)sign(α)s^{\prime}(\alpha)=p+\mathopen{}\mathclose{{}\left(p2^{p}|\alpha|^{p-1}-p|1+\alpha|^{p-1}}\right)sign(\alpha) is non negative for α1\alpha\geq 1 and non positive for α1\alpha\leq-1. The minimum value taken by ss is min{s(1),s(1)}\min\{s(1),s(-1)\} which is non negative. This gives us the right inequality.
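The two-sided bound of Lemma 2.14 is also easy to sanity-check numerically. The snippet below is an illustrative check only (it is not part of the proof): it samples random x, \Delta and p\in(1,2) and reports the largest violation of either inequality, which should be non-positive up to floating-point rounding.

```python
import numpy as np

def gamma_p_scalar(p, t, x):
    # Smoothed l_p function of Definition 2.12 for scalar t >= 0 and x.
    if abs(x) <= t and t > 0:
        return 0.5 * p * t ** (p - 2) * x ** 2
    return abs(x) ** p - (1 - 0.5 * p) * t ** p

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(100_000):
    p = rng.uniform(1.01, 1.99)
    x = rng.normal() * 10.0 ** rng.uniform(-2, 2)
    d = rng.normal() * 10.0 ** rng.uniform(-2, 2)
    base = abs(x) ** p + p * abs(x) ** (p - 2) * x * d     # zeroth- and first-order terms
    g = gamma_p_scalar(p, abs(x), d)
    lower = base + (p - 1) / (p * 2 ** p) * g
    upper = base + 2 ** p * g
    mid = abs(x + d) ** p
    worst = max(worst, lower - mid, mid - upper)
print("largest violation (should be <= 0 up to rounding):", worst)
```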

Lemma 2.15.

Let p(1,2)p\in(1,2) and λp1=p4pp1\lambda^{p-1}=\frac{p4^{p}}{p-1}. Then for any Δ\Delta,

𝒓𝒆𝒔p(Δ)𝑵𝒙pp𝑵(𝒙Δ)pp,\boldsymbol{res}_{p}(\Delta)\leq\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\Delta)\|_{p}^{p},

and

𝑵𝒙pp𝑵(𝒙λΔ)ppλ𝒓𝒆𝒔p(Δ).\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\lambda\Delta)\|_{p}^{p}\leq\lambda\boldsymbol{res}_{p}(\Delta).
Proof.

Applying Lemma 2.14 to all coordinates,

𝒈Δ+p1p2pγp(|𝑵𝒙|,𝑵Δ)𝑵(𝒙Δ)pp𝑵𝒙pp𝒈Δ+2pγp(|𝑵𝒙|,𝑵Δ).-\boldsymbol{\mathit{g}}^{\top}\Delta+\frac{p-1}{p2^{p}}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta)\leq\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\Delta)\|_{p}^{p}-\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq-\boldsymbol{\mathit{g}}^{\top}\Delta+2^{p}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta).

From the definition of the residual problem and the above equation, the first inequality of our lemma directly follows. To see the other inequality, from the above equation,

𝑵𝒙pp𝑵(𝒙λΔ)pp\displaystyle\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\lambda\Delta)\|_{p}^{p} λ𝒈Δp1p2pγp(|𝑵𝒙|,λ𝑵Δ)\displaystyle\leq\lambda\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{p-1}{p2^{p}}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\lambda\boldsymbol{\mathit{N}}\Delta)
λ(𝒈Δλp1p1p2pγp(|𝑵𝒙|,𝑵Δ))\displaystyle\leq\lambda\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\Delta-\lambda^{p-1}\frac{p-1}{p2^{p}}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta)}\right)
=λ𝒓𝒆𝒔p(Δ).\displaystyle=\lambda\cdot\boldsymbol{res}_{p}(\Delta).

Here, we are using the following property of \gamma_{p},

\gamma_{p}(t,\lambda\Delta)\geq\min\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,\Delta).

Since \lambda\geq 1 and p<2, we have \min\{\lambda^{2},\lambda^{p}\}=\lambda^{p}, and our choice of \lambda gives \lambda^{p-1}\cdot\frac{p-1}{p2^{p}}=2^{p}, so the last expression above is exactly \lambda\cdot\boldsymbol{res}_{p}(\Delta).

Lemma 2.15 is similar to Lemma 2.6, and we can follow the proof of Theorem 2.1 to obtain Theorem 2.11.

3 Fast Multiplicative Weight Update Algorithm for p\ell_{p}-norms

In this section, we will show how to solve the residual problem for p\geq 2 as defined in the previous section (Definition 2.3), to a constant approximation. The core of our approach is a multiplicative weight update routine with width reduction that is used to speed up the algorithm. For problem instances of size m, this routine returns a constant-approximate solution in at most O(m^{1/3}) calls to a linear system solver. Such a width-reduced multiplicative weight update algorithm was first seen in the context of the maximum flow problem and \ell_{\infty}-regression in the works of [Chr+11, Chi+13].

The first instance of such a width reduced multiplicative weight update algorithm for p\ell_{p}-regression appeared in the work of [Adi+19]. In a further work, the authors improved the dependence on pp in the runtime [Adi+21]. The following sections are based on the improved algorithm from [Adi+21].

3.1 Algorithm for p\ell_{p}-norm Regression

Recall that our residual problem for p2p\geq 2 is defined as:

max𝑨Δ=0𝒓𝒆𝒔p(Δ)=def𝒈ΔΔ𝑹Δ𝑵Δpp,\max_{\boldsymbol{\mathit{A}}\Delta=0}\quad\boldsymbol{res}_{p}(\Delta)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\boldsymbol{\mathit{g}}^{\top}\Delta-\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p},

for some vector 𝒈\boldsymbol{\mathit{g}} and matrices 𝑹\boldsymbol{\mathit{R}} and 𝑵\boldsymbol{\mathit{N}}. Also recall that in Algorithm 1, we used a parameter ν\nu, which was used to track the value of 𝒇(𝒙(t))𝒇(𝒙)\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star}) at any iteration tt. We will now use this parameter ν\nu to do a binary search on the linear term in 𝒓𝒆𝒔p\boldsymbol{res}_{p} and reduce the residual problem to,

minΔΔ𝑹Δ+𝑵Δpps.t.𝑨Δ=0𝒈Δ=c,\displaystyle\begin{aligned} \min_{\Delta}&\quad\Delta^{\top}\boldsymbol{\mathit{R}}\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p}\\ s.t.&\quad\boldsymbol{\mathit{A}}\Delta=0\\ &\quad\boldsymbol{\mathit{g}}^{\top}\Delta=c,\end{aligned} (7)

for some constant cc. Further, we will use our multiplicative weight update solver to solve problems of this kind to a constant approximation. We start by proving the binary search results.

3.1.1 Binary Search

We first note that, if ν\nu at iteration tt is such that 𝒇(𝒙(t))𝒇(𝒙)(ν/2,ν]\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\in(\nu/2,\nu], then from Lemma 2.6, the residual at 𝒙(t)\boldsymbol{\mathit{x}}^{(t)} has optimum value, 𝒓𝒆𝒔p(Δ)(ν32p,ν]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\frac{\nu}{32p},\nu]. We now consider a parameter ζ\zeta that has value between ν16p\frac{\nu}{16p} and ν\nu such that 𝒓𝒆𝒔p(Δ)(ζ2,ζ]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\frac{\zeta}{2},\zeta]. We have the following lemma that relates the optimum of problem of the type (7) with ζ\zeta.

Lemma 3.1.

Let ζ\zeta be such that the residual problem satisfies 𝐫𝐞𝐬p(Δ)(ζ2,ζ]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\frac{\zeta}{2},\zeta]. The following problem has optimum at most 2ζ2\zeta.

minΔΔ𝑹Δ+𝑵Δpps.t.𝑨Δ=0𝒈Δ=ζ2.\displaystyle\begin{aligned} \min_{\Delta}&\quad\Delta^{\top}\boldsymbol{\mathit{R}}\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p}\\ s.t.&\quad\boldsymbol{\mathit{A}}\Delta=0\\ &\quad\boldsymbol{\mathit{g}}^{\top}\Delta=\frac{\zeta}{2}.\end{aligned} (8)

Further, let Δ~{\widetilde{{\Delta}}} be a solution to the above problem such that Δ~𝐑Δ~a2ζ{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}\leq a^{2}\zeta and 𝐍Δ~ppapζ\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq a^{p}\zeta for some a>1a>1. Then Δ~5a2\frac{{\widetilde{{\Delta}}}}{5a^{2}} is a 100a2100a^{2}-approximation to the residual problem.

Proof.

We have assumed that,

𝒓𝒆𝒔(Δ)=𝒈ΔΔ𝑹Δ𝑵Δpp(ζ2,ζ].\boldsymbol{res}({{{\Delta^{\star}}}})=\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\in\mathopen{}\mathclose{{}\left(\frac{\zeta}{2},\zeta}\right].

Since the last two terms are non-positive, we must have \boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}\geq\frac{\zeta}{2}. Since {{{\Delta^{\star}}}} is the optimum and satisfies \boldsymbol{\mathit{A}}{{{\Delta^{\star}}}}=0,

ddλ(𝒈λΔλ2Δ𝑹Δλp𝑵Δpp)λ=1=0.\frac{d}{d\lambda}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\lambda{{{\Delta^{\star}}}}-\lambda^{2}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\lambda^{p}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}}\right)_{\lambda=1}=0.

Thus,

𝒈ΔΔ𝑹Δ𝑵Δpp=Δ𝑹Δ+(p1)𝑵Δpp.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}={{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+(p-1)\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}.

Since p2,p\geq 2, we get the following

Δ𝑹Δ+𝑵Δpp𝒈ΔΔ𝑹Δ𝑵Δppζ.{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\zeta.

Now, we know that \boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}\geq\frac{\zeta}{2} and \boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\zeta. This gives,

\frac{\zeta}{2}\leq\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}\leq{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}+\zeta\leq 2\zeta.

In particular, scaling {{{\Delta^{\star}}}} by the factor \zeta/(2\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}})\in(0,1] yields a feasible solution to Problem (8) whose objective is at most {{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\zeta, so the optimum of Problem (8) is at most 2\zeta, as claimed.

Now, let Δ~{\widetilde{{\Delta}}} be as described in the lemma. We have,

\displaystyle\boldsymbol{res}_{p}\mathopen{}\mathclose{{}\left(\frac{{\widetilde{{\Delta}}}}{5a^{2}}}\right) \displaystyle\geq\frac{1}{5a^{2}}\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-\frac{\zeta}{25a^{2}}-\frac{\zeta}{5^{p}a^{p}}
ζ10a22ζ25a2\displaystyle\geq\frac{\zeta}{10a^{2}}-\frac{2\zeta}{25a^{2}}
ζ50a21100a2𝒓𝒆𝒔p(Δ)\displaystyle\geq\frac{\zeta}{50a^{2}}\geq\frac{1}{100a^{2}}\boldsymbol{res}_{p}({{{\Delta^{\star}}}})

Algorithm 3 Algorithm for Solving the Residual Problem
1:procedure ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)
2:     ζν\zeta\leftarrow\nu
3:     (𝒈,𝑹,𝑵)𝒓𝒆𝒔p(\boldsymbol{\mathit{g}},\boldsymbol{\mathit{R}},\boldsymbol{\mathit{N}})\leftarrow\boldsymbol{res}_{p}\triangleright Create residual problem at 𝒙\boldsymbol{\mathit{x}}
4:     while ζ>ν32p\zeta>\frac{\nu}{32p} do
5:         Δ~ζ{\widetilde{{\Delta}}}_{\zeta}\leftarrow MWU-Solver([𝑨,𝒈],𝑹1/2,𝑵,[0,ζ2],ζ,p)\mathopen{}\mathclose{{}\left([\boldsymbol{\mathit{A}},\boldsymbol{\mathit{g}}^{\top}],\boldsymbol{\mathit{R}}^{1/2},\boldsymbol{\mathit{N}},[0,\frac{\zeta}{2}]^{\top},\zeta,p}\right)\triangleright Algorithm 5
6:         ζζ2\zeta\leftarrow\frac{\zeta}{2}      
7:     return argminΔ~ζ𝒇(𝒙Δ~ζp)\arg\min_{{\widetilde{{\Delta}}}_{\zeta}}\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{{\widetilde{{\Delta}}}_{\zeta}}{p}}\right)

3.1.2 Width-Reduced Approximate Solver

We are now finally ready to solve problems of the type (7). In this section, we will give an algorithm to solve the following problem,

minΔ\displaystyle\min_{\Delta} Δ𝑴𝑴Δ+𝑵Δpp\displaystyle\quad\Delta^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p} (9)
s.t.𝑨Δ=𝒄.\displaystyle\text{s.t.}\quad\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}.

Here 𝑨d×n,𝑵m1×n,𝑴m2×n\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n},\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}, and vector 𝒄d\boldsymbol{\mathit{c}}\in\mathbb{R}^{d}. Our approach involves a multiplicative weight update method with a width reduction step which allows us to solve these problems faster.

3.1.3 Slow Multiplicative Weight Update Solver

We first give an informal analysis of the multiplicative weight update method without width reduction. We will show that this method converges in \approx m_{1}^{\frac{p-2}{2(p-1)}}\leq m_{1}^{1/2} iterations. For simplicity, we let \boldsymbol{\mathit{M}}=0 in Problem (9) and assume without loss of generality that the optimum {{{\Delta^{\star}}}} satisfies \|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}\leq 1. Consider the following MWU algorithm for a parameter \alpha that we will set later; an illustrative sketch in code is given after the list:

  1. \boldsymbol{\mathit{w}}^{(0)}=1,\boldsymbol{\mathit{x}}^{(0)}=0,T=\alpha^{-1}m_{1}^{1/p}

  2. for t=1,\cdots,T:
     \Delta^{(t)}=\arg\min_{\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}\sum_{i}(\boldsymbol{\mathit{w}}^{(t-1)}_{i})^{p-2}(\boldsymbol{\mathit{N}}\Delta)_{i}^{2},\quad\boldsymbol{\mathit{w}}^{(t)}=\boldsymbol{\mathit{w}}^{(t-1)}+\alpha|\boldsymbol{\mathit{N}}\Delta^{(t)}|,\quad\boldsymbol{\mathit{x}}^{(t)}=\boldsymbol{\mathit{x}}^{(t-1)}+\Delta^{(t)}

  3. Return \widetilde{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{x}}/T
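The following is a minimal Python sketch of this loop, assuming a callable `weighted_least_squares(A, N, c, r)` (a name introduced here for illustration) that returns \arg\min_{\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}\sum_{i}\boldsymbol{\mathit{r}}_{i}(\boldsymbol{\mathit{N}}\Delta)_{i}^{2}, e.g., by solving the corresponding KKT linear system; the paper counts each such call as one linear system solve.

```python
import numpy as np

def slow_mwu(A, N, c, p, weighted_least_squares):
    """MWU without width reduction, for the informal analysis above.

    Assumes (after rescaling) that the optimum satisfies ||N Delta*||_p <= 1.
    """
    m1, n = N.shape
    alpha = m1 ** (-(p ** 2 - 4 * p + 2) / (2 * p * (p - 1)))  # step size from the text
    T = int(np.ceil(m1 ** (1.0 / p) / alpha))                  # ~ m1^{(p-2)/(2(p-1))} iterations
    w = np.ones(m1)
    x = np.zeros(n)
    for _ in range(T):
        delta = weighted_least_squares(A, N, c, w ** (p - 2))  # re-weighted quadratic step
        w = w + alpha * np.abs(N @ delta)                      # additive weight update
        x = x + delta
    return x / T
```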

We claim that the above algorithm returns 𝒙~\widetilde{\boldsymbol{\mathit{x}}} such that 𝑵𝒙~ppOp(1)\|\boldsymbol{\mathit{N}}\widetilde{\boldsymbol{\mathit{x}}}\|_{p}^{p}\leq O_{p}(1), i.e., a constant approximate solution to the residual problem, in m11/2\approx m_{1}^{1/2} iterations. We will bound the value of the returned solution, 𝑵𝒙~pp\|\boldsymbol{\mathit{N}}\widetilde{\boldsymbol{\mathit{x}}}\|_{p}^{p} by looking at how 𝒘(t)pp\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p} grows with tt. From Lemma 2.5,

𝒘(t1)+α𝑵Δ(t)pp𝒘(t1)pp+αpi(𝒘i(t1))p1(𝑵Δ(t))i+2p2α2i(𝒘i(t1))p2(𝑵Δ(t))i2+αppp𝑵Δ(t)pp.\|\boldsymbol{\mathit{w}}^{(t-1)}+\alpha\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}\leq\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p}+\alpha p\sum_{i}(\boldsymbol{\mathit{w}}^{(t-1)}_{i})^{p-1}(\boldsymbol{\mathit{N}}\Delta^{(t)})_{i}\\ +2p^{2}\alpha^{2}\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}\Delta^{(t)})^{2}_{i}+\alpha^{p}p^{p}\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}.

Observe that the third term on the right hand side is exactly the objective of the quadratic problem minimized to obtain Δ(t)\Delta^{(t)}. Using that Δ(t)\Delta^{(t)} must achieve a lower objective than Δ{{{\Delta^{\star}}}}, i.e., i(𝒘i(t1))p2(𝑵Δ(t))i2i(𝒘i(t1))p2(𝑵Δ)i2\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}\Delta^{(t)})^{2}_{i}\leq\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})^{2}_{i} along with Hölder’s inequality and 𝑵Δp1\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}\leq 1, we can bound this term by 𝒘(t1)pp2\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-2}. We can further bound the second term in right hand side of the above inequality by the third term using Hölder’s inequality (refer to Proof of Lemma 3.3 for details). These bounds give,

𝒘(t)pp𝒘(t1)pp+αp𝒘(t1)pp1+2α2p2𝒘(t1)pp2+αppp𝑵Δ(t)pp.\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p}\leq\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p}+\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|^{p-1}_{p}+2\alpha^{2}p^{2}\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-2}+\alpha^{p}p^{p}\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}.

Observe that the growth of 𝒘(t)pp\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p} is controlled by 𝑵Δ(t)pp\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|^{p}_{p}. We next see how large this quantity can be. Assume that, 𝒘(t)p3m11/p\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}\leq 3m_{1}^{1/p} for all tt (one may verify in the end that this holds for all tTt\leq T). Since (𝒘i(t1))p2(𝒘i(0))p2=1(\boldsymbol{\mathit{w}}^{(t-1)}_{i})^{p-2}\geq(\boldsymbol{\mathit{w}}^{(0)}_{i})^{p-2}=1,

\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{2}^{2}\leq\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}\Delta^{(t)})^{2}_{i}\stackrel{{(a)}}{{\leq}}\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-2}\leq 3^{p-2}m_{1}^{(p-2)/p},

where we used Hölder’s inequality in (a)(a). This implies, 𝑵Δ(t)pp3(p2)p/2m1(p2)/2\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}\leq 3^{(p-2)p/2}m_{1}^{(p-2)/2}. Now, for αm1p24p+22p(p1)\alpha\approx m_{1}^{-\frac{p^{2}-4p+2}{2p(p-1)}}, αppp𝑵Δ(t)ppαpm1p1pαp𝒘(t1)pp1\alpha^{p}p^{p}\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}\leq\alpha pm_{1}^{\frac{p-1}{p}}\leq\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-1} and,

𝒘(t)pp𝒘(t1)pp+αp𝒘(t1)pp1+2α2p2+αp𝒘(t1)pp1(𝒘(t1)p+2α)p.\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p}\leq\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p}+\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|^{p-1}_{p}+2\alpha^{2}p^{2}+\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-1}\leq\mathopen{}\mathclose{{}\left(\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}+2\alpha}\right)^{p}.

We can thus prove that,

\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\frac{\boldsymbol{\mathit{x}}^{(T)}}{T}}\right\rVert_{p}^{p}\leq\frac{1}{m_{1}}\|\boldsymbol{\mathit{w}}^{(T)}\|_{p}^{p}\leq\frac{1}{m_{1}}\mathopen{}\mathclose{{}\left(\|\boldsymbol{\mathit{w}}^{(0)}\|_{p}+2\alpha T}\right)^{p}=\frac{1}{m_{1}}\mathopen{}\mathclose{{}\left(m_{1}^{1/p}+2m_{1}^{1/p}}\right)^{p}=3^{p},

as required. The total number of iterations is T=α1m11/pm1p22(p1)T=\alpha^{-1}m_{1}^{1/p}\approx m_{1}^{\frac{p-2}{2(p-1)}}.

To obtain the improved rates of convergence via width reduction, our algorithm uses a hard threshold on \|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p} and performs a width reduction step whenever this quantity exceeds the threshold. The analysis now additionally requires tracking how \|\boldsymbol{\mathit{w}}\|_{p} changes with a width reduction step. Our analysis also tracks the value of an additional energy-like potential \Psi\approx\min_{\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}\sum_{i}\boldsymbol{\mathit{w}}_{i}^{p-2}(\boldsymbol{\mathit{N}}\Delta)_{i}^{2}, defined formally in Section 3.1.5. The interplay of these two potentials, and balancing their changes under primal updates and width reduction steps, gives the improved rate of convergence.

3.1.4 Fast, Width-Reduced MWU Solver

In the previous section, we showed that a multiplicative weight update algorithm without width-reduction obtains a rate of convergence m11/2\approx m_{1}^{1/2}. In this section we will show how width-reduction allows for a faster m11/3\approx m_{1}^{1/3} rate of convergence. We now present the faster width-reduced algorithm. We will prove the following result.

Theorem 3.2.

Let p\geq 2. Consider an instance of Problem (9) described by matrices \boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n},\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}, and vector \boldsymbol{\mathit{c}}\in\mathbb{R}^{d}. If the optimum of this problem is at most \zeta, Procedure MWU-Solver (Algorithm 5) returns an \boldsymbol{\mathit{x}} such that \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{c}}, \boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\leq O(1)\zeta, and \|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq O(3^{p})\zeta. The algorithm makes {O}\mathopen{}\mathclose{{}\left(pm_{1}^{\frac{p-2}{(3p-2)}}}\right) calls to a linear system solver.

The algorithm and analyses of this section are based on [Adi+19] and [Adi+21].

In every iteration of the algorithm, we solve a weighted linear system. The solution returned is used to update the current iterate if it has a small \ell_{p} norm. Otherwise, we do not update the solution, but instead increase the weights corresponding to the coordinates with large value by a constant factor. This step is referred to as the “width reduction step”. The analysis is based on a potential function argument for specially defined potentials.

The following is the oracle used in the algorithm, i.e., the linear system we need to solve. We show in Appendix A how to implement the oracle using a linear system solver.

Algorithm 4 Oracle
1:procedure Oracle(𝑨,𝑴,𝑵,𝒄,𝒘,ζ\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{c}},\boldsymbol{\mathit{w}},\zeta)
2:     𝒓e𝒘ep2\boldsymbol{\mathit{r}}_{e}\leftarrow\boldsymbol{\mathit{w}}_{e}^{p-2}
3:     𝑴~ζp22p𝑴{\widetilde{\boldsymbol{\mathit{M}}}}\leftarrow\zeta^{-\frac{p-2}{2p}}\boldsymbol{\mathit{M}}
4:     Compute,
Δ=argmin𝑨Δ=𝒄m1p2pΔ𝑴~𝑴~Δ+13p2e𝒓e(𝑵Δ)e2\Delta=\arg\min_{\boldsymbol{\mathit{A}}\Delta^{\prime}=\boldsymbol{\mathit{c}}}\quad m_{1}^{\frac{p-2}{p}}{\Delta^{\prime}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}\Delta^{\prime}+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta^{{}^{\prime}}}\right)^{2}_{e}
5:     return Δ\Delta
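The oracle is an equality-constrained weighted least-squares problem, so one linear system solve suffices. The following dense Python sketch is one possible implementation via the KKT system (illustrative only; Appendix A describes the reduction to a linear system solve in general, and the function and variable names below are ours).

```python
import numpy as np

def oracle(A, M, N, c, w, zeta, p):
    """One way to implement Algorithm 4: solve the KKT system of the quadratic.

    Minimizes  m1^{(p-2)/p} d^T Mt^T Mt d + 3^{-(p-2)} sum_e r_e (N d)_e^2
    subject to A d = c, where r = w^{p-2} and Mt = zeta^{-(p-2)/(2p)} M.
    """
    m1 = N.shape[0]
    n = A.shape[1]
    d_rows = A.shape[0]
    r = w ** (p - 2)
    Mt = zeta ** (-(p - 2) / (2.0 * p)) * M
    # Hessian of the objective (times 2): Q = 2 (m1^{(p-2)/p} Mt^T Mt + 3^{-(p-2)} N^T diag(r) N).
    Q = 2.0 * (m1 ** ((p - 2) / p) * Mt.T @ Mt
               + (1.0 / 3 ** (p - 2)) * N.T @ (r[:, None] * N))
    # KKT conditions: [Q  A^T; A  0] [d; y] = [0; c].
    kkt = np.block([[Q, A.T], [A, np.zeros((d_rows, d_rows))]])
    rhs = np.concatenate([np.zeros(n), c])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n]
```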

We now have the following multiplicative weight update algorithm given in Algorithm 5.

Algorithm 5 Width Reduced MWU Algorithm
1:procedure MWU-Solver(𝑨,𝑴,𝑵,𝒄,ζ,p\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{c}},\zeta,p)
2:     𝒘e(0,0)1\boldsymbol{\mathit{w}}^{(0,0)}_{e}\leftarrow 1
3:     𝒙0\boldsymbol{\mathit{x}}\leftarrow 0
4:     ρm1(p24p+2)p(3p2)\rho\leftarrow m_{1}^{\frac{(p^{2}-4p+2)}{p(3p-2)}}\triangleright width parameter
5:     β3p1m1p23p2\beta\leftarrow 3^{p-1}\cdot m_{1}^{\frac{p-2}{3p-2}}\triangleright resistance threshold
6:     α3p1pp1m1p25p+2p(3p2)\alpha\leftarrow 3^{-\frac{p-1}{p}}\cdot p^{-1}m_{1}^{-\frac{p^{2}-5p+2}{p(3p-2)}}\triangleright step size
7:     τ3pm1(p1)(p2)(3p2)\tau\leftarrow 3^{p}\cdot m_{1}^{\frac{(p-1)(p-2)}{(3p-2)}}\triangleright p\ell_{p} threshold
8:     T\leftarrow\alpha^{-1}m_{1}^{1/p}=3^{\frac{p-1}{p}}\mathopen{}\mathclose{{}\left(pm_{1}^{\frac{p-2}{3p-2}}}\right)
9:     i0,k0i\leftarrow 0,k\leftarrow 0
10:     while i<Ti<T do
11:         ΔOracle(𝑨,𝑴,𝑵,𝒄,𝒘(i,k),ζ)\Delta\leftarrow\textsc{Oracle}(\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{c}},\boldsymbol{\mathit{w}}^{(i,k)},\zeta)
12:         𝒓(𝒘(i,k))p2\boldsymbol{\mathit{r}}\leftarrow\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)^{p-2}
13:         if 𝑵Δppτζ\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\Delta}\right\rVert_{p}^{p}\leq\tau\zeta then\triangleright primal step
14:              𝒘(i+1,k)𝒘(i,k)+α|𝑵Δ|ζ1/p\boldsymbol{\mathit{w}}^{(i+1,k)}\leftarrow\boldsymbol{\mathit{w}}^{(i,k)}+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|}{\zeta^{1/p}}
15:              𝒙𝒙+Δ\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}+\Delta
16:              ii+1i\leftarrow i+1
17:         else
18:              For all coordinates ee with |𝑵Δ|eρζ1p|\boldsymbol{\mathit{N}}\Delta|_{e}\geq\rho\zeta^{\frac{1}{p}} and 𝒓eβ\boldsymbol{\mathit{r}}_{e}\leq\beta\triangleright width reduction step
19:              𝒘e(i,k+1)21p2𝒘e\quad\quad\quad\boldsymbol{\mathit{w}}_{e}^{(i,k+1)}\leftarrow 2^{\frac{1}{p-2}}\boldsymbol{\mathit{w}}_{e}
20:              kk+1\quad\quad\quad k\leftarrow k+1               
21:     return 𝒙T\frac{\boldsymbol{\mathit{x}}}{T}
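For completeness, a direct Python transcription of Algorithm 5 follows. It is a sketch meant only to make the bookkeeping of primal versus width-reduction steps explicit; the `oracle` argument is any implementation of Algorithm 4 (for instance the KKT-based sketch above).

```python
import numpy as np

def mwu_solver(A, M, N, c, zeta, p, oracle):
    """Width-reduced MWU (Algorithm 5); `oracle(A, M, N, c, w, zeta, p)` implements Algorithm 4."""
    m1, n = N.shape
    rho = m1 ** ((p ** 2 - 4 * p + 2) / (p * (3 * p - 2)))                 # width parameter
    beta = 3 ** (p - 1) * m1 ** ((p - 2) / (3 * p - 2))                    # resistance threshold
    alpha = 3 ** (-(p - 1) / p) / p * m1 ** (-(p ** 2 - 5 * p + 2) / (p * (3 * p - 2)))  # step size
    tau = 3 ** p * m1 ** ((p - 1) * (p - 2) / (3 * p - 2))                 # l_p threshold
    T = int(np.ceil(m1 ** (1.0 / p) / alpha))
    w = np.ones(m1)
    x = np.zeros(n)
    i = 0
    while i < T:
        delta = oracle(A, M, N, c, w, zeta, p)
        nd = N @ delta
        if np.sum(np.abs(nd) ** p) <= tau * zeta:          # primal step
            w = w + alpha * np.abs(nd) / zeta ** (1.0 / p)
            x = x + delta
            i += 1
        else:                                              # width reduction step
            r = w ** (p - 2)
            wide = (np.abs(nd) >= rho * zeta ** (1.0 / p)) & (r <= beta)
            w[wide] *= 2 ** (1.0 / (p - 2))
    return x / T
```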
Notation

We will use Δ{{{\Delta^{\star}}}} to denote the optimum of (9). Since we assume that the optimum value of (9) is at most ζ\zeta,

Δ𝑴𝑴Δζ and 𝑵Δppζ{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{{{\Delta^{\star}}}}\leq\zeta\quad\text{ and }\quad\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\Delta^{*}}\right\rVert_{p}^{p}\leq\zeta (10)

3.1.5 Analysis of Algorithm 5

Our analysis is based on tracking the following two potential functions. We will show how these potentials change with a primal step (Line 13) and a width reduction step (Line 18) in the algorithm. The proofs of these lemmas appear later in the section.

Φ(𝒘(i))=def𝒘pp\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}
Ψ(𝒓)=defminΔ:𝑨Δ=𝒄m1p2pΔ𝑴~𝑴~Δ+13p2e𝒓e(𝑵Δ)e2.\Psi(\boldsymbol{\mathit{r}})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\min_{\Delta:\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}m_{1}^{\frac{p-2}{p}}{\Delta}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}\Delta+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)^{2}_{e}.

Finally, to prove our runtime bound, we will first show that if the total number of width reduction steps KK is not too large, then Φ\Phi is bounded. We then prove that the number of width reduction steps cannot be too large by using the relation between Φ\Phi and Ψ\Psi and their respective changes throughout the algorithm.

We now begin our analysis. The next two lemmas show how our potentials change with every iteration of the algorithm.

Lemma 3.3.

After ii primal steps, and kk width-reduction steps, provided ppαpτpαm1p1pp^{p}\alpha^{p}\tau\leq p\alpha m_{1}^{\frac{p-1}{p}}, the potential Φ\Phi is bounded as follows:

Φ(𝒘(i,k))(2αi+m11/p)p(1+2pp2ρ2m12/pβ2p2)k.\displaystyle\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)\leq\mathopen{}\mathclose{{}\left(2\alpha i+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}.
Lemma 3.4.

After ii primal steps and kk width reduction steps, if,

  1. 1.

    τ2/pζ2/p43p2Ψ(𝒓)β\tau^{2/p}\zeta^{2/p}\geq 4\cdot 3^{p-2}\frac{\Psi(\boldsymbol{\mathit{r}})}{\beta}, and

  2. 2.

    τζ2/p23p2Ψ(𝒓)ρp2\tau\zeta^{2/p}\geq 2\cdot 3^{p-2}\Psi(\boldsymbol{\mathit{r}})\rho^{p-2},

then,

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))+k4τ2/pζ2/p.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)}+\frac{k}{4}\cdot\tau^{2/p}\zeta^{2/p}.

The next lemma gives an upper bound on the energy \Psi at each step in terms of the potential \Phi.

Lemma 3.5.

Let ii denote the number of primal steps and kk the number of width reduction steps. For any i,k0i,k\geq 0, we have,

Ψ(𝒓(i,k))ζ2/p(m1p2p+13p2Φ(i,k)p2p).\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)\leq\zeta^{2/p}\mathopen{}\mathclose{{}\left(m_{1}^{\frac{p-2}{p}}+\frac{1}{3^{p-2}}\Phi(i,k)^{\frac{p-2}{p}}}\right).

3.1.6 Proof of Theorem 3.2

Proof.

Let 𝒙T\frac{\boldsymbol{\mathit{x}}}{T} be the solution returned by Algorithm 5. We first note that this satisfies the linear constraint required. We next bound the objective value at 𝒙T\frac{\boldsymbol{\mathit{x}}}{T}, i.e., 1T2𝒙𝑴𝑴𝒙\frac{1}{T^{2}}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}} and 1Tp𝑵𝒙pp\frac{1}{T^{p}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

Suppose the algorithm terminates in T=α1m11/pT=\alpha^{-1}m_{1}^{1/p} primal steps and K2pp2ρ2m12/pβ2p2K\leq 2^{-\frac{p}{p-2}}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}} width reduction steps. We next note that our parameter values α\alpha and τ\tau are such that ppαpτpαm1p1pp^{p}\alpha^{p}\tau\leq p\alpha m_{1}^{\frac{p-1}{p}}. We can now apply Lemma 3.3 to get,

Φ(𝒘(T,K))3pm1e1=e3pm1\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(T,K)}}\right)\leq 3^{p}m_{1}e^{1}=e\cdot 3^{p}m_{1}

We next observe from the weight and \boldsymbol{\mathit{x}} update steps in our algorithm that \zeta^{1/p}m_{1}^{-1/p}\boldsymbol{\mathit{w}}^{(T,K)}\geq|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|/T, since \alpha^{-1}=Tm_{1}^{-1/p}. Thus,

\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\frac{\boldsymbol{\mathit{x}}}{T}}\right\rVert_{p}^{p}\leq\frac{\zeta}{m_{1}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}^{(T,K)}}\right\rVert_{p}^{p}=\frac{\zeta}{m_{1}}\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(T,K)}}\right)\leq e\cdot 3^{p}\zeta.

We next bound the quadratic term. Let Δ~(t){\widetilde{{\Delta}}}^{(t)} denote the solution returned by the oracle in iteration tt. Since Φe3pm1\Phi\leq e\cdot 3^{p}m_{1} for all iterations, we always have from Lemma 3.5 that, Ψ(𝒓)4m1p2pζ2/p\Psi(\boldsymbol{\mathit{r}})\leq 4m_{1}^{\frac{p-2}{p}}\zeta^{2/p}. We will first bound (Δ~(t))𝑴𝑴Δ~(t)\mathopen{}\mathclose{{}\left({\widetilde{{\Delta}}}^{(t)}}\right)^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}^{(t)} for every tt.

(Δ~(t))𝑴𝑴Δ~(t)=ζp2p(Δ~(t))𝑴~𝑴~Δ~(t)ζp2pm1p2pΨ(𝒓)4ζ.\mathopen{}\mathclose{{}\left({\widetilde{{\Delta}}}^{(t)}}\right)^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}^{(t)}=\zeta^{\frac{p-2}{p}}\mathopen{}\mathclose{{}\left({\widetilde{{\Delta}}}^{(t)}}\right)^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}{\widetilde{{\Delta}}}^{(t)}\leq\zeta^{\frac{p-2}{p}}m_{1}^{-\frac{p-2}{p}}\Psi(\boldsymbol{\mathit{r}})\leq 4\zeta.

Now from convexity of 𝒙22\|\boldsymbol{\mathit{x}}\|_{2}^{2}, we get

𝑴𝒙T221T2Tt𝑴Δ~(t)224ζ.\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{M}}\frac{\boldsymbol{\mathit{x}}}{T}}\right\rVert_{2}^{2}\leq\frac{1}{T^{2}}\cdot T\sum_{t}\|\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}^{(t)}\|_{2}^{2}\leq 4\zeta.

We have shown that if the number of width reduction steps is bounded by KK then our algorithm returns the required solution. We will next prove that we cannot have more than KK width reduction steps.

Suppose to the contrary that the algorithm takes a width reduction step starting from step (i,k) where i<T and k=2^{-\frac{p}{p-2}}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}. Since the conditions for Lemma 3.3 hold for all preceding steps, we must have \Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)\leq e\cdot 3^{p}m_{1}, which combined with Lemma 3.5 implies \Psi\leq 4m_{1}^{\frac{p-2}{p}}\zeta^{2/p}. Using this bound on \Psi, we note that our parameter values satisfy the conditions of Lemma 3.4. From Lemma 3.4,

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))+14τ2/pζ2/pk.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)}+\frac{1}{4}\tau^{2/p}\zeta^{2/p}k.

Since our parameter choices ensure τ2/pk>14m1\tau^{2/p}k>\frac{1}{4}m_{1},

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))>m116ζ2/p.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}-{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)}>\frac{m_{1}}{16}\zeta^{2/p}.

Since Φ(𝒘(i,k))O(3p)m1\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)\leq O(3^{p})m_{1} and Ψ0\Psi\geq 0, from Lemma 3.5,

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))4m1p2pζ2/p,\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)-\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)\leq 4m_{1}^{\frac{p-2}{p}}\zeta^{2/p},

which is a contradiction. We can thus conclude that we can never have more than K=2pp2ρ2m12/pβ2p2K=2^{\frac{-p}{p-2}}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}} width reduction steps, thus concluding the correctness of the returned solution. We next bound the number of oracle calls required. The total number of iterations is at most,

T+Kα1m11/p+2p/(p2)ρ2m12/pβ2p2O(pm1p23p2).T+K\leq\alpha^{-1}m_{1}^{1/p}+2^{-p/(p-2)}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}\leq O\mathopen{}\mathclose{{}\left(pm_{1}^{\frac{p-2}{3p-2}}}\right).

3.1.7 Proof of Lemma 3.3

We first prove a simple lemma about the solution Δ~{\widetilde{{\Delta}}} returned by the oracle, that we will use in our proof.

Lemma 3.6.

Let p2p\geq 2. For any 𝐰\boldsymbol{\mathit{w}}, let Δ~{\widetilde{{\Delta}}} be the solution returned by Algorithm 4. Then,

\sum_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}\leq\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}\leq\zeta^{\frac{2}{p}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}
Proof.

Since Δ~{\widetilde{{\Delta}}} is the solution returned by Algorithm 4, and Δ{{{\Delta^{\star}}}} satisfies the constraints of the oracle, we have,

e𝒓e(𝑵Δ~)e2e𝒓e(𝑵Δ)e2\displaystyle\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}\leq\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta^{*})_{e}^{2} =e𝒘ep2(𝑵Δ)e2ζ2/p𝒘pp2.\displaystyle=\sum_{e}\boldsymbol{\mathit{w}}_{e}^{p-2}(\boldsymbol{\mathit{N}}\Delta^{*})_{e}^{2}\leq\zeta^{2/p}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert^{p-2}_{p}.

In the last inequality we use,

\displaystyle\sum_{e}\boldsymbol{\mathit{w}}_{e}^{p-2}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})_{e}^{2} \displaystyle\leq\mathopen{}\mathclose{{}\left(\sum_{e}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})_{e}^{2\cdot\frac{p}{2}}}\right)^{2/p}\mathopen{}\mathclose{{}\left(\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{w}}_{e}}\right|^{(p-2)\cdot\frac{p}{p-2}}}\right)^{(p-2)/p}
\displaystyle=\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{2}\|\boldsymbol{\mathit{w}}\|_{p}^{p-2}
\displaystyle\leq\zeta^{2/p}\|\boldsymbol{\mathit{w}}\|_{p}^{p-2},\text{ since }\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\Delta^{*}}\right\rVert^{p}_{p}\leq\zeta.

Finally, using 𝒓e1,\boldsymbol{\mathit{r}}_{e}\geq 1, we have e(𝑵Δ)e2e𝒓e(𝑵Δ)e2,\sum_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\leq\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}, concluding the proof. ∎


Proof.

We prove this claim by induction. Initially, i=k=0,i=k=0, and Φ(𝒘(0,0))=m1,\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(0,0)}}\right)=m_{1}, and thus, the claim holds trivially. Assume that the claim holds for some i,k0.i,k\geq 0. We will use Φ\Phi as an abbreviated notation for Φ(𝒘(i,k))\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right) below.

Primal Step.

For brevity, we use 𝒘\boldsymbol{\mathit{w}} to denote 𝒘(i,k)\boldsymbol{\mathit{w}}^{(i,k)}. If the next step is a primal step,

Φ(𝒘(i+1,k))=\displaystyle\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i+1,k)}}\right)= 𝒘(i,k)+α|𝑵Δ~|ζ1/ppp\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}^{(i,k)}+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|}{\zeta^{1/p}}}\right\rVert_{p}^{p}
\displaystyle\leq 𝒘pp+ζ1/pα|(𝑵Δ~)||𝒘pp|+2p2α2ζ2/pe|𝒘e|p2|𝑵Δ~|e2+αpppζ1𝑵Δ~pp\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}+\zeta^{-1/p}\alpha\mathopen{}\mathclose{{}\left|(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})}\right|^{\top}\mathopen{}\mathclose{{}\left|\nabla\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right|+2p^{2}\alpha^{2}\zeta^{-2/p}\sum_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}^{2}+\alpha^{p}p^{p}\zeta^{-1}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}
by Lemma 2.5

We next bound \mathopen{}\mathclose{{}\left|(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})}\right|^{\top}\mathopen{}\mathclose{{}\left|\nabla\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right| by \zeta^{1/p}p\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-1}. Using the Cauchy–Schwarz inequality,

(e|𝑵Δ~|e|e𝒘pp|)2=\displaystyle\mathopen{}\mathclose{{}\left(\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}\mathopen{}\mathclose{{}\left|\nabla_{e}\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right|}\right)^{2}= p2(e|𝑵Δ~|e|𝒘e|p2|𝒘e|)2\displaystyle p^{2}\mathopen{}\mathclose{{}\left(\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{w}}_{e}}\right|}\right)^{2}
\displaystyle\leq p2(e|𝒘e|p2𝒘e2)(e|𝒘e|p2(𝑵Δ~)e2)\displaystyle p^{2}\mathopen{}\mathclose{{}\left(\sum_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}\boldsymbol{\mathit{w}}_{e}^{2}}\right)\mathopen{}\mathclose{{}\left(\sum_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}}\right)
=\displaystyle= p2𝒘ppe𝒓e(𝑵Δ~)e2\displaystyle p^{2}\|\boldsymbol{\mathit{w}}\|_{p}^{p}\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}
\displaystyle\leq p2𝒘p2p2ζ2/p, From Lemma 3.6.\displaystyle p^{2}\|\boldsymbol{\mathit{w}}\|_{p}^{2p-2}\zeta^{2/p},\text{ From Lemma \ref{lem:Oracle}.}

We thus have,

e|𝑵Δ~|e|e𝒘pp|\displaystyle\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}\mathopen{}\mathclose{{}\left|\nabla_{e}\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right| p𝒘pp1ζ1/p.\displaystyle\leq p\|\boldsymbol{\mathit{w}}\|_{p}^{p-1}\zeta^{1/p}.

Using the above bound, we now have,

Φ(𝒘(i+1,k))\displaystyle\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i+1,k)}}\right)\leq 𝒘pp+pα𝒘pp1+2p2α2𝒘pp2+ppαp𝑵Δ~pp\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}+p\alpha\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-1}+2p^{2}\alpha^{2}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}+p^{p}\alpha^{p}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}
\displaystyle\leq 𝒘pp+pα𝒘pp1+2p2α2𝒘pp2+pαm1p1p,\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}+p\alpha\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-1}+2p^{2}\alpha^{2}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}+p\alpha m_{1}^{\frac{p-1}{p}},
(since ppαpτpαm1p1pp^{p}\alpha^{p}\tau\leq p\alpha m_{1}^{\frac{p-1}{p}})

Recall 𝒘pp=Φ(𝒘).\|\boldsymbol{\mathit{w}}\|_{p}^{p}=\Phi(\boldsymbol{\mathit{w}}). Since Φm1\Phi\geq m_{1}, we have,

Φ(𝒘(i+1,k))Φ(𝒘)+pαΦ(𝒘)p1p+2p2α2Φ(𝒘)p2p+pαΦ(𝒘)p1p(Φ(𝒘)1/p+2α)p.\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i+1,k)}}\right)\leq\Phi(\boldsymbol{\mathit{w}})+p\alpha\Phi(\boldsymbol{\mathit{w}})^{\frac{p-1}{p}}+2p^{2}\alpha^{2}\Phi(\boldsymbol{\mathit{w}})^{\frac{p-2}{p}}+p\alpha\Phi(\boldsymbol{\mathit{w}})^{\frac{p-1}{p}}\leq(\Phi(\boldsymbol{\mathit{w}})^{1/p}+2\alpha)^{p}.

From the inductive assumption, we have

Φ(𝒘)\displaystyle\Phi(\boldsymbol{\mathit{w}}) (2αi+m11/p)p(1+2pp2ρ2m12/pβ2p2)k.\displaystyle\leq\mathopen{}\mathclose{{}\left({2\alpha i}+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}.

Thus,

Φ(i+1,k)(Φ(𝒘)1/p+2α)p(2α(i+1)+m11/p)p(1+2pp2ρ2m12/pβ2p2)k\Phi(i+1,k)\leq(\Phi(\boldsymbol{\mathit{w}})^{1/p}+2\alpha)^{p}\leq\mathopen{}\mathclose{{}\left({2\alpha(i+1)}+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}

proving the inductive claim.

Width Reduction Step.

Let Δ~{\widetilde{{\Delta}}} be the solution returned by the oracle and HH denote the set of indices jj such that |𝑵Δ~|jρζ1/p|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}|_{j}\geq\rho\zeta^{1/p} and 𝒓jβ\boldsymbol{\mathit{r}}_{j}\leq\beta, i.e., the set of indices on which the algorithm performs width reduction. We have the following:

\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}\leq\rho^{-2}\zeta^{-2/p}\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{j}^{2}\leq\rho^{-2}\zeta^{-2/p}\sum_{j}\boldsymbol{\mathit{r}}_{j}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{j}^{2}\leq\rho^{-2}\|\boldsymbol{\mathit{w}}\|_{p}^{p-2}\leq\rho^{-2}\Phi^{\frac{p-2}{p}},

where we use Lemma 3.6 for the second last inequality. Also,

\displaystyle\Phi(\boldsymbol{\mathit{w}}^{(i,k+1)}) \displaystyle\leq\Phi+\sum_{j\in H}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{w}}_{j}^{(i,k+1)}}\right|^{p}\leq\Phi+2^{\frac{p}{p-2}}\sum_{j\in H}|\boldsymbol{\mathit{w}}_{j}|^{p}\leq\Phi+2^{\frac{p}{p-2}}\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}^{\frac{p}{p-2}}
Φ+2pp2(jH𝒓j)(maxjH𝒓j)pp21Φ+2pp2ρ2Φp2pβ2p2.\displaystyle\leq\Phi+2^{\frac{p}{p-2}}\mathopen{}\mathclose{{}\left(\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}}\right)\mathopen{}\mathclose{{}\left(\max_{j\in H}\boldsymbol{\mathit{r}}_{j}}\right)^{\frac{p}{p-2}-1}\leq\Phi+2^{\frac{p}{p-2}}\rho^{-2}\Phi^{\frac{p-2}{p}}\beta^{\frac{2}{p-2}}.

Again, since Φ(𝒘)m1\Phi(\boldsymbol{\mathit{w}})\geq m_{1},

Φ(𝒘(i,k+1))Φ(1+2pp2ρ2m12pβ2p2)(2αi+m11/p)p(1+2pp2ρ2m12/pβ2p2)k\Phi(\boldsymbol{\mathit{w}}^{(i,k+1)})\leq\Phi\mathopen{}\mathclose{{}\left(1+2^{\frac{p}{p-2}}\rho^{-2}m_{1}^{-\frac{2}{p}}\beta^{\frac{2}{p-2}}}\right)\leq\mathopen{}\mathclose{{}\left(2\alpha i+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}

proving the inductive claim. ∎

3.1.8 Proof of Lemma 3.4


Proof.

It will be helpful for our analysis to split the index set into three disjoint parts:

  • S={e:|𝑵Δe|ρζ1/p}S=\mathopen{}\mathclose{{}\left\{e:\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|\leq\rho\zeta^{1/p}}\right\}

  • H={e:|𝑵Δe|>ρζ1/p and 𝒓eβ}H=\mathopen{}\mathclose{{}\left\{e:\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|>\rho\zeta^{1/p}\text{ and }\boldsymbol{\mathit{r}}_{e}\leq\beta}\right\}

  • B={e:|𝑵Δe|>ρζ1/p and 𝒓e>β}B=\mathopen{}\mathclose{{}\left\{e:\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|>\rho\zeta^{1/p}\text{ and }\boldsymbol{\mathit{r}}_{e}>\beta}\right\}.

Firstly, we note

eS|𝑵Δ|epρp2ζp2peS|𝑵Δ|e2ρp2ζp2peS𝒓e|𝑵Δ|e2ρp2ζp2p3p2Ψ(𝒓).\displaystyle\sum_{e\in S}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}\leq\rho^{p-2}\zeta^{\frac{p-2}{p}}\sum_{e\in S}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{2}\leq\rho^{p-2}\zeta^{\frac{p-2}{p}}\sum_{e\in S}\boldsymbol{\mathit{r}}_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{2}\leq\rho^{p-2}\zeta^{\frac{p-2}{p}}3^{p-2}\Psi(\boldsymbol{\mathit{r}}).

Hence, using Assumption 2,

eHB|𝑵Δ|epe|𝑵Δ|epeS|𝑵Δ|epτζρp2ζp2p3p2Ψ(𝒓)12τζ.\displaystyle\sum_{e\in H\cup B}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}\geq\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}-\sum_{e\in S}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}\geq\tau\zeta-\rho^{p-2}\zeta^{\frac{p-2}{p}}3^{p-2}\Psi(\boldsymbol{\mathit{r}})\geq\frac{1}{2}\tau\zeta.

This means,

eHB(𝑵Δ)e2(eHB|𝑵Δ|ep)2/pτ2/pζ2/p2.\sum_{e\in H\cup B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\geq\mathopen{}\mathclose{{}\left(\sum_{e\in H\cup B}|\boldsymbol{\mathit{N}}\Delta|_{e}^{p}}\right)^{2/p}\geq\frac{\tau^{2/p}\zeta^{2/p}}{2}.

Secondly we note that,

eB(𝑵Δ)e2β1eB𝒓e(𝑵Δ)e2β13p2Ψ(𝒓).\sum_{e\in B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\leq\beta^{-1}\sum_{e\in B}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\leq\beta^{-1}3^{p-2}\Psi(\boldsymbol{\mathit{r}}).

So then, using Assumption 1,

eH(𝑵Δ)e2=eHB(𝑵Δ)e2eB(𝑵Δ)e2τ2/pζ2/p2β13p2Ψ(𝒓)τ2/pζ2/p4.\displaystyle\sum_{e\in H}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}=\sum_{e\in H\cup B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}-\sum_{e\in B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\geq\frac{\tau^{2/p}\zeta^{2/p}}{2}-\beta^{-1}3^{p-2}\Psi(\boldsymbol{\mathit{r}})\geq\frac{\tau^{2/p}\zeta^{2/p}}{4}.

As 𝒓e1\boldsymbol{\mathit{r}}_{e}\geq 1, this implies eH𝒓e(𝑵Δ)e2τ2/pζ2/p4\sum_{e\in H}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\geq\frac{\tau^{2/p}\zeta^{2/p}}{4}. We note that in a width reduction step, the resistances change by a factor of 2. Thus, combining our last two observations, and applying Lemma C.1, we get

Ψ(𝒓(i,k+1))Ψ(𝒓(i,k))+14τ2/pζ2/p.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)}+\frac{1}{4}\tau^{2/p}\zeta^{2/p}.

Finally, for the “primal step” case, we use the trivial bound from Lemma C.1, ignoring the second term,

Ψ(𝒓(i,k+1))Ψ(𝒓(i,k)).{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)}.

3.1.9 Proof of Lemma 3.5


Proof.

Lemma 3.6 implies that,

Ψ(𝒓(i,k))\displaystyle{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)} =ζ(p2)/pm1p2pΔ~𝑴𝑴Δ~+13p2e𝒓e(𝑵Δ~)e2\displaystyle=\zeta^{-(p-2)/p}m_{1}^{\frac{p-2}{p}}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}
ζ(p2)/pm1p2pΔ𝑴𝑴Δ+13p2e𝒓e(𝑵Δ)e2\displaystyle\leq\zeta^{-(p-2)/p}m_{1}^{\frac{p-2}{p}}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{{{\Delta^{\star}}}}+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})_{e}^{2}
ζ2/pm1p2p+ζ2/p13p2𝒘pp2\displaystyle\leq\zeta^{2/p}m_{1}^{\frac{p-2}{p}}+\zeta^{2/p}\frac{1}{3^{p-2}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}
ζ2/pm1p2p+ζ2/p13p2Φ(i,k)p2p.\displaystyle\leq\zeta^{2/p}m_{1}^{\frac{p-2}{p}}+\zeta^{2/p}\frac{1}{3^{p-2}}\Phi(i,k)^{\frac{p-2}{p}}.

3.2 Complete Algorithm for p\ell_{p}-Regression

Recall our problem, (1),

min𝑨𝒙=𝒃𝒇(𝒙)=𝒅𝒙+𝑴𝒙22+𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

We will now use all the tools and algorithms described so far to give a complete algorithm for the above problem. We will assume we have a starting solution \boldsymbol{\mathit{x}}^{(0)} satisfying \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}; for purely \ell_{p} objectives, we will use the homotopy analysis from Section 2.2 to obtain such a starting solution.

Our overall algorithm reduces the problem to approximately solving the residual problem (Definition 2.3). In Sections 3.1.1 and 3.1.2, we give an algorithm to solve the residual problem by first doing a binary search on the linear term and then applying a multiplicative weight update routine to solve the resulting subproblems. We have the following result, which follows from Lemma 3.1 and Theorem 3.2.

Corollary 3.7.

Consider the residual problem at iteration t of Algorithm 1. Algorithm 3 using Algorithm 5 as a subroutine finds an O(1)-approximate solution to the corresponding residual problem in O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p}\right) calls to a linear system solver.

Proof.

Let \nu be such that \boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\in(\nu/2,\nu]. Refer to Lemma 2.7 to see that this is the case in which we use the solution of the residual problem. Now, from Lemma 2.6 we know that the optimum of the residual problem satisfies \boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\nu/32p,\nu]. Since we vary \zeta over all such values in the range (\nu/16p,\nu], for one such \zeta we must have \boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\zeta/2,\zeta]. For such a \zeta, consider Problem (8). Using Algorithm 5 for this problem, from Theorem 3.2 we are guaranteed to find a solution {\widetilde{{\Delta}}} such that {\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}\leq O(1)\zeta and \|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq O(3^{p})\zeta. Now from Lemma 3.1, we note that {\widetilde{{\Delta}}} is an O(1)-approximate solution to the residual problem. Since Algorithm 5 requires O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}}\right) calls to a linear system solver, and Algorithm 3 calls it O(\log p) times, we obtain the required runtime. ∎

We are now ready to prove our main result.

Theorem 3.8.

Let p\geq 2, and \kappa\geq 1. Let the initial solution \boldsymbol{\mathit{x}}^{(0)} satisfy \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}. Algorithm 1 using Algorithm 3 as a subroutine returns an \epsilon-approximate solution \boldsymbol{\mathit{x}} to (1) in at most O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log p\log\mathopen{}\mathclose{{}\left(\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}}\right)}\right) calls to a linear system solver.

Proof.

Follows from Theorem 2.1 and Corollary 3.7. ∎
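To summarize the overall structure, the following is a rough Python sketch of the complete method. It is a simplified illustration only: it omits the bookkeeping of the parameter \nu from Algorithm 1 and assumes hypothetical callables `f` (evaluating the objective (1)) and `residual_solver(x)` (e.g., Algorithm 3 built on Algorithm 5) returning an approximate solution to the residual problem at x; the update x \leftarrow x - \widetilde{\Delta}/p mirrors Line 7 of Algorithm 3.

```python
def lp_regression(x0, f, residual_solver, p, num_iters):
    """Simplified outer loop: iterative refinement via approximate residual solves.

    Each successful iteration shrinks f(x) - f(x*) by a multiplicative factor
    (Theorem 2.1), so num_iters ~ O(p * log((f(x0) - f(x*)) / eps)) suffices.
    """
    x = x0
    for _ in range(num_iters):
        delta = residual_solver(x)        # approximate maximizer of res_p at x
        candidate = x - delta / p         # the step used in Algorithm 3, Line 7
        if f(candidate) < f(x):           # keep the step only if it improves the objective
            x = candidate
    return x
```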

3.3 Complete Algorithm for Pure p\ell_{p} Objectives

Consider the special case when our problem is only the p\ell_{p}-norm, i.e., Problem (3),

min𝑨𝒙=𝒃𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

In Section 2.2, we described how to find a good starting point for such problems. Combining that procedure with our algorithm for solving the residual problem, we obtain a complete algorithm for finding a good starting point. Specifically, we prove the following result.

Corollary 3.9.

Algorithm 2 using Algorithm 3 returns 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} such that 𝐀𝐱(0)=𝐛\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}} and 𝐍𝐱(0)ppO(m)𝐍𝐱pp\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{(0)}\|_{p}^{p}\leq O(m)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p} in O(p2mp23p2log2plogm)O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log^{2}p\log m}\right) calls to a linear system solver.

Proof.

From Lemma 2.9 Algorithm 2 finds such a solution in time O(plogm)k=2i,i=2i=logp1κkT(k,κk)O\mathopen{}\mathclose{{}\left(p\log m}\right)\sum_{k=2^{i},i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}), where κk\kappa_{k} and T(k,κk)T(k,\kappa_{k}) denote the approximation and time to solve a k\ell_{k} norm problem. Now consider Algorithm 3 with Algorithm 5 as a subroutine. From Corollary 3.7, we can solve any k\ell_{k}-norm residual problem to a O(1)O(1)-approximation in O(kmk23k2logk)O\mathopen{}\mathclose{{}\left(km^{\frac{k-2}{3k-2}}\log k}\right) calls to a linear system solver. We thus have κk=O(1)\kappa_{k}=O(1) for all kk and T(k,κk)=O(kmk23k2logk)O(pmp23p2logp)T(k,\kappa_{k})=O\mathopen{}\mathclose{{}\left(km^{\frac{k-2}{3k-2}}\log k}\right)\leq O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p}\right). Using these values, we obtain a runtime of,

O(plogm)k=2i,i=2i=logp1κkT(k,κk)O(plogm)logpO(pmp23p2logp)O(p2mp23p2log2plogm).O\mathopen{}\mathclose{{}\left(p\log m}\right)\sum_{k=2^{i},i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k})\leq O\mathopen{}\mathclose{{}\left(p\log m}\right)\cdot\log p\cdot O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p}\right)\leq O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log^{2}p\log m}\right).
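
To make the homotopy schedule concrete, the following schematic Python sketch traces the doubling of the norm parameter; solve_to_constant_approx is a hypothetical stand-in for running Algorithm 1 with the stated norm to an O(1)-approximation, and is not part of the paper's pseudocode.

def lp_starting_point(solve_to_constant_approx, x2, p):
    """Schematic sketch of the homotopy for a pure l_p objective: start from
    the l_2 minimizer x2 and solve l_k-norm problems for k = 4, 8, ...,
    doubling k (i.e., k = 2^i for i = 2, ..., floor(log p) - 1).
    solve_to_constant_approx(x, k) is a hypothetical oracle standing in for
    Algorithm 1 run with norm k to an O(1)-approximation, warm-started at x."""
    x, k = x2, 4
    while 2 * k <= p:               # stop once k reaches roughly p/2
        x = solve_to_constant_approx(x, k)
        k *= 2
    return x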

The following corollary gives the complete runtime for pure p\ell_{p} objectives.

Corollary 3.10.

Let p2p\geq 2, and κ1\kappa\geq 1. Let 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} be the solution returned by Algorithm 2. Algorithm 1 using Algorithm 3 as a subroutine returns 𝐱\boldsymbol{\mathit{x}} such that 𝐀𝐱=𝐛\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}} and 𝐍𝐱pp(1+ϵ)𝐍𝐱pp\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p}, in at most O(p2mp23p2log2plog(mϵ))O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log^{2}p\log\mathopen{}\mathclose{{}\left(\frac{m}{\epsilon}}\right)}\right) calls to a linear system solver.

Proof.

Follows directly from Corollary 3.9 and Theorem 3.8. ∎

4 Solving pp-norm Problems using qq-norm Oracles

In this section, we propose a new technique that allows us to solve p\ell_{p}-norm residual problems by instead solving an q\ell_{q}-norm residual problem, without adding much to the runtime. No such reduction is known for pure p\ell_{p} objectives without a large overhead in the runtime. As a consequence, we also obtain an algorithm for p\ell_{p}-regression with a linear runtime dependence on pp instead of the p2p^{2} dependence of the algorithms from the previous sections. One factor of pp in that p2p^{2} dependence came from solving the pp-norm residual problem; at a high level, we show that when pp is large it is sufficient to solve a logm\log m-norm residual problem instead, thus replacing that pp factor with logm\log m. We prove the following results, which are based on the proofs and results of [AS20].

Theorem 4.1.

Let ϵ>0\epsilon>0, 2ppoly(m)2\leq p\leq poly(m) and consider an instance of Problem (1),

min𝑨𝒙=𝒃𝒇(𝒙)=𝒅𝒙+𝑴𝒙22+𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

Algorithm 6 finds an ϵ\epsilon-approximate solution to (1) in O(pmp23p2logplogmlog𝐟(𝐱(0))𝐟(𝐱)ϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p\log m\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}}\right) calls to a linear system solver.

Theorem 4.2.

Let ϵ>0\epsilon>0, 2ppoly(m)2\leq p\leq poly(m) and consider a pure p\ell_{p} instance,

min𝑨𝒙=𝒃𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

Let 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} be the output of Algorithm 2. Algorithm 6 using 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} as a starting solution finds 𝐱\boldsymbol{\mathit{x}} such that 𝐀𝐱=𝐛\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}} and 𝐍𝐱pp(1+ϵ)min𝐀𝐱=𝐛𝐍𝐱pp\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq(1+\epsilon)\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p} in O(pmp23p2log2plogmlogmϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log^{2}p\log m\log\frac{m}{\epsilon}}\right) calls to a linear system solver.

4.1 Relation between Residual Problems for p\ell_{p} and q\ell_{q} Norms

In this section we show how qq-norm residual problems can be used to solve pp-norm residual problems. This idea first appeared in the work of [AS20], where the results are also applied to the maximum flow problem. In this paper, we provide a much simpler proof of the main technical content and unify the cases p<qp<q and p>qp>q, which were presented separately in previous works. We also unify the treatment of the decision versions of the residual problems (without the linear term) with that of the entire objective. The results for the maximum flow problem and the p\ell_{p}-norm flow problem as described in the original paper still follow, and we refer the reader to the original paper for these applications. The main result of the section is as follows.

Theorem 4.3.

Let p,q2p,q\geq 2 and ζ\zeta be such that 𝐫𝐞𝐬p(Δ)(ζ/2,ζ]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\zeta/2,\zeta], where Δ{{{\Delta^{\star}}}} is the optimum of the p\ell_{p}-norm residual problem (Definition 2.3). The following q\ell_{q}-norm residual problem has optimum at least ζ4\frac{\zeta}{4},

max𝑨Δ=0𝒈ΔΔ𝑹Δ14ζ1qpmmin{qp1,0}𝑵Δqq.\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\Delta-\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}\Delta\|_{q}^{q}. (11)

Let β1\beta\geq 1 and Δ~{\widetilde{{\Delta}}} denote a feasible solution to the above q\ell_{q}-norm residual problem with objective value at least ζ16β\frac{\zeta}{16\beta}. For α=1256βmpp1|1p1q|\alpha=\frac{1}{256\beta}m^{-\frac{p}{p-1}\mathopen{}\mathclose{{}\left|\frac{1}{p}-\frac{1}{q}}\right|}, αΔ~\alpha{\widetilde{{\Delta}}} gives a O(β2)mpp1|1p1q|O(\beta^{2})m^{\frac{p}{p-1}\mathopen{}\mathclose{{}\left|\frac{1}{p}-\frac{1}{q}}\right|}-approximate solution to the p\ell_{p}-norm residual problem 𝐫𝐞𝐬p\boldsymbol{res}_{p}.

Proof.

Consider Δ{{{\Delta^{\star}}}}, the optimum of the p\ell_{p}-norm residual problem. Note that λΔ\lambda{{{\Delta^{\star}}}} is a feasible solution for all λ\lambda since 𝑨(λΔ)=0.\boldsymbol{\mathit{A}}(\lambda{{{\Delta^{\star}}}})=0. We know that the objective is optimum for λ=1\lambda=1. Thus,

[ddλ𝒓𝒆𝒔p(λΔ)]λ=1=0,\mathopen{}\mathclose{{}\left[\frac{d}{d\lambda}\boldsymbol{res}_{p}(\lambda{{{\Delta^{\star}}}})}\right]_{\lambda=1}=0,

Since \boldsymbol{res}_{p}(\lambda{{{\Delta^{\star}}}})=\lambda\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-\lambda^{2}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\lambda^{p}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}, differentiating in \lambda and setting \lambda=1 gives us,

𝒈Δ2Δ𝑹Δp𝑵Δpp=0.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-2{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-p\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}=0.

Rearranging,

Δ𝑹Δ+(p1)𝑵Δpp=𝒈ΔΔ𝑹Δ𝑵Δppζ.{}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+(p-1)\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}=\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}\leq\zeta.

Since p2p\geq 2, 𝑵Δpζ1/p\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}\leq\zeta^{1/p} which implies

𝑵Δq{ζ1/pif, pqm1q1pζ1/potherwise.\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{q}\leq\begin{cases}\zeta^{1/p}&\text{if, $p\leq q$}\\ m^{\frac{1}{q}-\frac{1}{p}}\zeta^{1/p}&\text{otherwise.}\end{cases}

We also note that,

𝒈ΔΔ𝑹Δ>ζ2+𝑵Δpp>ζ2.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}>\frac{\zeta}{2}+\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}>\frac{\zeta}{2}.

Combining these bounds (in both cases m^{\min\{\frac{q}{p}-1,0\}}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{q}^{q}\leq\zeta^{q/p}), we obtain that the optimum of (11) is at least,

𝒈ΔΔ𝑹Δ14ζ1qpmmin{qp1,0}𝑵Δqq>ζ214ζ1qpζq/p>ζ4.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{q}^{q}>\frac{\zeta}{2}-\frac{1}{4}\zeta^{1-\frac{q}{p}}\zeta^{q/p}>\frac{\zeta}{4}.

Since the optimum of (11) is at least ζ/4\zeta/4, there exists a feasible Δ~{\widetilde{{\Delta}}} with objective value at least ζ/16β\zeta/16\beta. We now prove the second part, that a scaling of Δ~{\widetilde{{\Delta}}} gives a good approximation to the p\ell_{p}-norm residual problem. First, let us assume |𝒈Δ~|ζ|\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}|\leq\zeta. Since Δ~{\widetilde{{\Delta}}} has objective value at least ζ/16β\zeta/16\beta,

Δ~𝑹Δ~+14ζ1qpmmin{qp1,0}𝑵Δ~qq𝒈Δ~ζ16βζ.{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}+\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\leq\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-\frac{\zeta}{16\beta}\leq\zeta.

Thus, mmin{1p1q,0}𝑵Δ~q41qζ1pm^{\min\mathopen{}\mathclose{{}\left\{\frac{1}{p}-\frac{1}{q},0}\right\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}\leq 4^{\frac{1}{q}}\zeta^{\frac{1}{p}}, and 𝑵Δ~pp4pqζm|1pq|\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq 4^{\frac{p}{q}}\zeta m^{\mathopen{}\mathclose{{}\left|1-\frac{p}{q}}\right|}. Let Δ¯=αΔ~{\bar{{\Delta}}}=\alpha{\widetilde{{\Delta}}}, where α=1256βmpp1|1p1q|\alpha=\frac{1}{256\beta}m^{-\frac{p}{p-1}\mathopen{}\mathclose{{}\left|\frac{1}{p}-\frac{1}{q}}\right|}. We will show that Δ¯{\bar{{\Delta}}} is a good solution to the p\ell_{p}-norm residual problem.

\displaystyle\boldsymbol{res}_{p}({\bar{{\Delta}}}) \displaystyle=\alpha\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-\alpha{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}-\alpha^{p-1}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}\right)
α(ζ16β1256βζαp14pqζm|1pq|)\displaystyle\geq\alpha\mathopen{}\mathclose{{}\left(\frac{\zeta}{16\beta}-\frac{1}{256\beta}\zeta-\alpha^{p-1}4^{\frac{p}{q}}\zeta m^{\mathopen{}\mathclose{{}\left|1-\frac{p}{q}}\right|}}\right)
α(ζ16βζ256βζ64β)\displaystyle\geq\alpha\mathopen{}\mathclose{{}\left(\frac{\zeta}{16\beta}-\frac{\zeta}{256\beta}-\frac{\zeta}{64\beta}}\right)
α64β𝒓𝒆𝒔p(Δ).\displaystyle\geq\frac{\alpha}{64\beta}\boldsymbol{res}_{p}({{{\Delta^{\star}}}}).

For the case |𝒈Δ~|ζ|\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}|\geq\zeta, consider the vector zΔ~z{\widetilde{{\Delta}}} where z=ζ2|𝒈Δ~|12z=\frac{\zeta}{2|\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}|}\leq\frac{1}{2}. This vector is still feasible for Problem (11) and 𝒈zΔ~=ζ2\boldsymbol{\mathit{g}}^{\top}z{\widetilde{{\Delta}}}=\frac{\zeta}{2}. Moreover, since z\leq\frac{1}{2}, q\geq 2, and the objective value of {\widetilde{{\Delta}}} is nonnegative (so that {\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}+\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\{\frac{q}{p}-1,0\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\leq\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}), we have z^{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}+z^{q}\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\{\frac{q}{p}-1,0\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\leq z^{2}\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}=z\frac{\zeta}{2}\leq\frac{\zeta}{4}. Therefore,

z\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-z^{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}-z^{q}\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\geq\frac{\zeta}{2}-\frac{\zeta}{4}=\frac{\zeta}{4}.

We can now repeat the same argument as above. ∎

4.2 Faster Algorithm for p\ell_{p}-Regression

In this section, we combine the tools developed in the previous sections with the reduction of Section 4.1 to obtain an algorithm for Problem (1) that requires O(pmp23p2logplogmlog𝒇(𝒙(0))𝒇(𝒙)ϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p\log m\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}}\right) calls to a linear system solver. For pure p\ell_{p} objectives, we can further combine this algorithm with the algorithm in Section 2.2 to obtain a convergence rate of O(pmp23p2log2plogmlogmϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log^{2}p\log m\log\frac{m}{\epsilon}}\right) linear system solves.

Algorithm 6 Complete Algorithm with Linear pp-dependence
1:procedure p\ell_{p}-Solver(𝑨,𝑴,𝑵,𝒅,𝒃,p,ϵ\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},p,\epsilon)
2:     𝒙𝒙(0)\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}^{(0)}
3:     ν\nu\leftarrow Upper bound on 𝒇(𝒙(0))𝒇(𝒙)\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\triangleright If 𝒇(𝒙)0\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\geq 0, then ν𝒇(𝒙(0))\nu\leftarrow\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})
4:     while ν>ϵ\nu>\epsilon do
5:         if plogmp\geq\log m then
6:              Δ~{\widetilde{{\Delta}}}\leftarrow logm\log m-ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)
7:         else
8:              Δ~{\widetilde{{\Delta}}}\leftarrow ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)          
9:         if 𝒓𝒆𝒔p(Δ~)ν32pκ\boldsymbol{res}_{p}({\widetilde{{\Delta}}})\geq\frac{\nu}{32p\kappa} then
10:              𝒙𝒙Δ~p\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}-\frac{{\widetilde{{\Delta}}}}{p}
11:         else
12:              νν2\nu\leftarrow\frac{\nu}{2}               
13:     return 𝒙\boldsymbol{\mathit{x}}
Algorithm 7 Residual Solver using logm\log m-norm
1:procedure logm\log m-ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)
2:     ζν\zeta\leftarrow\nu
3:     αm1p1\alpha\leftarrow m^{-\frac{1}{p-1}}
4:     (𝒈,𝑹,𝑵)𝒓𝒆𝒔p(\boldsymbol{\mathit{g}},\boldsymbol{\mathit{R}},\boldsymbol{\mathit{N}})\leftarrow\boldsymbol{res}_{p}\triangleright Create residual problem at 𝒙\boldsymbol{\mathit{x}}
5:     while ζ>ν32p\zeta>\frac{\nu}{32p} do
6:         𝑵~141/logmζ1logm1pmmin{1p1logm,0}𝑵\widetilde{\boldsymbol{\mathit{N}}}\leftarrow\frac{1}{4^{1/\log m}}\zeta^{\frac{1}{\log m}-\frac{1}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{1}{p}-\frac{1}{\log m},0}\right\}}\boldsymbol{\mathit{N}}
7:         Δ~ζ{\widetilde{{\Delta}}}_{\zeta}\leftarrow MWU-Solver([𝑨,𝒈],𝑹1/2,𝑵~,[0,ζ2],ζ,logm)\mathopen{}\mathclose{{}\left([\boldsymbol{\mathit{A}},\boldsymbol{\mathit{g}}^{\top}],\boldsymbol{\mathit{R}}^{1/2},\widetilde{\boldsymbol{\mathit{N}}},[0,\frac{\zeta}{2}]^{\top},\zeta,\log m}\right)\triangleright Algorithm 5
8:         ζζ2\zeta\leftarrow\frac{\zeta}{2}      
9:     return αΔ~argminΔ~ζ𝒇(𝒙αΔ~ζp)\alpha{\widetilde{{\Delta}}}\leftarrow\arg\min_{{\widetilde{{\Delta}}}_{\zeta}}\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\alpha{\widetilde{{\Delta}}}_{\zeta}}{p}}\right)
Lemma 4.4.

Let poly(m)plogmpoly(m)\geq p\geq\log m. Algorithm 7 returns an O(m1p1)O(m^{\frac{1}{p-1}})-approximate solution to the p\ell_{p}-residual problem 𝐫𝐞𝐬p\boldsymbol{res}_{p} at 𝐱\boldsymbol{\mathit{x}} in at most O(mp23p2logmlogp)O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log m\log p}\right) calls to a linear system solver.

Proof.

Let ν\nu be such that 𝒇(𝒙(t))𝒇(𝒙)(ν/2,ν]\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\in(\nu/2,\nu]. Refer to Lemma 2.7 to see that this is the case in which we use the solution of the residual problem. Now, from Lemma 2.6 we know that the optimum of the residual problem satisfies 𝒓𝒆𝒔p(Δ)(ν/32p,ν]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\nu/32p,\nu]. Since we vary ζ\zeta to take all such values in the range (ν/16p,ν](\nu/16p,\nu] for one such ζ\zeta we must have 𝒓𝒆𝒔p(Δ)(ζ/2,ζ].\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\zeta/2,\zeta]. For such a ζ\zeta, consider the logm\log m-norm residual problem (11). Using Algorithm 5 for this problem, from Theorem 3.2 we are guaranteed to find a solution Δ~{\widetilde{{\Delta}}} such that Δ~𝑹Δ~O(1)ζ{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}\leq O(1)\zeta and 𝑵~Δ~logmlogmO(3p)ζ\|\widetilde{\boldsymbol{\mathit{N}}}{\widetilde{{\Delta}}}\|_{\log m}^{\log m}\leq O(3^{p})\zeta. Now from Lemma 3.1, we note that Δ~{\widetilde{{\Delta}}} is an O(1)O(1)-approximate solution to the logm\log m-residual problem. We now use Theorem 4.3, which states that αΔ~\alpha{\widetilde{{\Delta}}} is a O(m1p1)O\mathopen{}\mathclose{{}\left(m^{\frac{1}{p-1}}}\right)-approximate solution to the required residual problem 𝒓𝒆𝒔p\boldsymbol{res}_{p}.

Since for plogmp\geq\log m, Algorithm 5 requires O(mlogm23logm2logm)O(mp23p2logm)O\mathopen{}\mathclose{{}\left(m^{\frac{\log m-2}{3\log m-2}}\log m}\right)\leq O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log m}\right) calls to a linear system solver, and Algorithm 7 calls this algorithm logp\log p times, we obtain the required runtime. ∎

See 4.1

Proof.

We note that Algorithm 6 is essentially Algorithm 1, calling different residual solvers depending on the value of pp. If plogmp\leq\log m, then from Theorem 3.8 we obtain the required solution in O(p^{2}m^{\frac{p-2}{3p-2}}\log p\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon})\leq O(pm^{\frac{p-2}{3p-2}}\log p\log m\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}) calls to a linear system solver, since p2plogmp^{2}\leq p\log m. If plogmp\geq\log m, from Lemma 4.4, we obtain an O(m1p1)O(m1logm)O(1)O(m^{\frac{1}{p-1}})\leq O(m^{\frac{1}{\log m}})\leq O(1)-approximate solution to the residual problem at any iteration in O(mp23p2logmlogp)O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log m\log p}\right) calls to a linear system solver. Combining this with Theorem 2.1, we obtain our result. ∎

See 4.2

Proof.

From Lemma 2.9 we can find an O(m)O(m)-approximation to the above problem in time

O(plogm)k=2i,i=2i=logp1κkT(k,κk),O(p\log m)\sum_{k=2^{i},i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}),

where κk\kappa_{k} is the approximation factor to which we solve the residual problem for the kk-norm problem and T(k,κk)T(k,\kappa_{k}) is the time required to do so. If klogmk\geq\log m, we use Algorithm 7 to solve such residual problems. Thus κk=m1k1m1logmO(1)\kappa_{k}=m^{\frac{1}{k-1}}\leq m^{\frac{1}{\log m}}\leq O(1) and T(k,κk)=O(mp23p2logp)T(k,\kappa_{k})=O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log p}\right). If klogmk\leq\log m, we can use Algorithm 3 and κk=O(1)\kappa_{k}=O(1), T(k,κk)=O(mp23p2logp)T(k,\kappa_{k})=O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log p}\right). Thus, the total runtime is O(pmp23p2logmlog2p)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log m\log^{2}p}\right). We now combine this with Theorem 4.1 to obtain the required rates of convergence. ∎

5 Speedups for General Matrices via Inverse Maintenance

Inverse maintenance was first introduced by Vaidya in 1990 [Vai90] for speeding up algorithms for minimum cost and multicommodity flow problems. The key idea is to reuse the inverse of matrices, which is possible due to the controllable rates at which variables are updated in some algorithms. In the work by [Adi+19], the authors design a new inverse maintenance algorithm for p\ell_{p}-regression that can solve p\ell_{p}-regression for any p>2p>2 almost as fast as linear regression. This section is based on Section 6 of [Adi+19] and we give a more fine grained and simplified analysis of the original result. In particular, we simplify the proofs and give the result with explicit dependencies on both matrix dimensions as opposed to just the larger dimension.

Our inverse maintenance procedure is based on the same high-level ideas of combining low-rank updates and matrix multiplication as in [Vai90] and [LS15]. However, recall that the rate of convergence of our algorithm is controlled by two potentials, which change at different rates depending on the two different kinds of weight update steps in our algorithm. In order to handle these updates, our inverse maintenance algorithm uses a new fine-grained bucketing scheme, inspired by lazy updates in data structures; it differs from previous works on inverse maintenance, which usually update weights based on fixed thresholds. Our scheme is also simpler than those used in [Vai90, LS15]. We now present our algorithm in detail.

Consider the weighted linear system being solved at each iteration of Algorithm 5. Each weighted linear system is of the form,

min𝑨𝒙=𝒄𝒙(𝑴𝑴+𝑵𝑹𝑵)𝒙\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{c}}}\boldsymbol{\mathit{x}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)\boldsymbol{\mathit{x}}

where 𝑨d×n,𝑵m1×n,𝑴m2×n\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n},\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}. From Equation (15) in Section 3, the solution of the above linear system is given by,

𝒙=(𝑴𝑴+𝑵𝑹𝑵)1𝑨(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝒄.\boldsymbol{\mathit{x}}^{\star}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\boldsymbol{\mathit{c}}.

In order to compute the above expression, we require the following products, in order; a short code sketch assembling them is given after the list. The stated runtimes use the fact that ω2\omega\geq 2.

  • 𝑴𝑴\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}} and 𝑵𝑹𝑵\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}: require time m2nω1m_{2}n^{\omega-1} and m1nω1m_{1}n^{\omega-1} respectively

  • (𝑴𝑴+𝑵𝑹𝑵)1\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}: requires time nωn^{\omega}

  • (𝑴𝑴+𝑵𝑹𝑵)1𝑨\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top} and 𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}: require time n2dω2n^{2}d^{\omega-2}

  • (𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}: requires time dωd^{\omega}

  • (𝑴𝑴+𝑵𝑹𝑵)1𝑨(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}: requires time ndω1nd^{\omega-1}
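
The following is a minimal NumPy sketch assembling these products into a single solve. It is meant only as an illustration of the closed form above (the name solve_weighted_system is ours, and dense, well-conditioned inputs are assumed); it is not the faster implementation developed in this section.

import numpy as np

def solve_weighted_system(A, M, N, r, c):
    """Illustrative solve of min_{Ax = c} x^T (M^T M + N^T Diag(r) N) x
    via the closed form x = H^{-1} A^T (A H^{-1} A^T)^{-1} c.
    Uses explicit linear solves in place of explicit inverses."""
    H = M.T @ M + N.T @ (r[:, None] * N)       # M^T M + N^T Diag(r) N
    HinvAt = np.linalg.solve(H, A.T)           # H^{-1} A^T
    schur = A @ HinvAt                          # A H^{-1} A^T
    return HinvAt @ np.linalg.solve(schur, c)   # H^{-1} A^T (A H^{-1} A^T)^{-1} c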

The cost of solving the above problem is dominated by the first step, and we thus require time O(mnω1)O(mn^{\omega-1}), where m=max{m1,m2}m=\max\{m_{1},m_{2}\}. This directly gives the runtime of Algorithm 5 to be O(pmp2(3p2)mnω1)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{(3p-2)}}mn^{\omega-1}}\right). In this section, we show that we can implement Algorithm 5 in time similar to solving a system of linear equations for all p2p\geq 2. In particular, we prove the following result.

Theorem 5.1.

If 𝐀,𝐌,𝐍\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}} are explicitly given matrices with polynomially bounded condition numbers, and p2p\geq 2, then Algorithm 5 as given in Section 3.1.2 can be implemented to run in total time

O(mnω1+p3ωn2mω2+p3ωn2mp(104ω)3p2).O\mathopen{}\mathclose{{}\left(mn^{\omega-1}+p^{3-\omega}n^{2}m^{\omega-2}+p^{3-\omega}n^{2}m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}}\right).

5.1 Inverse Maintenance Algorithm

We first note that the weights 𝒘e(i)\boldsymbol{\mathit{w}}_{e}^{(i)}, and thus the resistances 𝒓e(i)\boldsymbol{\mathit{r}}_{e}^{(i)}, are monotonically increasing. Our algorithm in Section 3.1.2 updates both in every iteration. Here, we will instead update these gradually, only when there is a significant increase in the values; we thus give a lazy update scheme. The update can be done via the following consequence of the Woodbury matrix formula. The main idea is to initially compute the inverse of the required matrix explicitly; afterwards, we only update the coordinates that have increased significantly, and since the remaining values stay within a good factor approximation of the stored ones, we can use the stored inverse as a preconditioner to solve the new linear systems faster.

5.1.1 Low Rank Update

The following lemma is the same as Lemma 6.2 of [Adi+19].

Lemma 5.2.

Given matrices 𝐍m1×n,𝐌m2×n\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}, and vectors 𝐫\boldsymbol{\mathit{r}} and 𝐫~\tilde{\boldsymbol{\mathit{r}}} that differ in kk entries, as well as the matrix 𝐙^=(𝐌𝐌+𝐍Diag(𝐫)𝐍)1\widehat{\boldsymbol{\mathit{Z}}}=(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}})^{-1}, we can construct (𝐌𝐌+𝐍Diag(𝐫~)𝐍)1(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}})^{-1} in O(kω2n2)O(k^{\omega-2}n^{2}) time.

Proof.

Let SS denote the entries that differ in 𝒓\boldsymbol{\mathit{r}} and 𝒓~\tilde{\boldsymbol{\mathit{r}}}. Then we have

𝑴𝑴+𝑵Diag(𝒓~)𝑵=𝑴𝑴+𝑵Diag(𝒓)𝑵+𝑵:,S(Diag(𝒓~S)Diag(𝒓S))𝑵S,:.\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}}=\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}}+\boldsymbol{\mathit{N}}_{:,S}^{\top}\mathopen{}\mathclose{{}\left(Diag(\tilde{\boldsymbol{\mathit{r}}}_{S})-Diag(\boldsymbol{\mathit{r}}_{S})}\right)\boldsymbol{\mathit{N}}_{S,:}.

This is a low rank perturbation, so by Woodbury matrix identity we get:

(𝑴𝑴+𝑵Diag(𝒓~)𝑵)1=𝒁^𝒁^𝑵:,S((Diag(𝒓~S)Diag(𝒓S))1+𝑵S,:𝒁^𝑵:,S)1𝑵S,:𝒁^,\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}}}\right)^{-1}=\widehat{\boldsymbol{\mathit{Z}}}-\widehat{\boldsymbol{\mathit{Z}}}\boldsymbol{\mathit{N}}_{:,S}^{\top}\mathopen{}\mathclose{{}\left(\mathopen{}\mathclose{{}\left(Diag({\tilde{\boldsymbol{\mathit{r}}}_{S}})-Diag(\boldsymbol{\mathit{r}}_{S})}\right)^{-1}+\boldsymbol{\mathit{N}}_{S,:}\widehat{\boldsymbol{\mathit{Z}}}\boldsymbol{\mathit{N}}_{:,S}^{\top}}\right)^{-1}\boldsymbol{\mathit{N}}_{S,:}\widehat{\boldsymbol{\mathit{Z}}},

where we use 𝒁^=𝒁^\widehat{\boldsymbol{\mathit{Z}}}^{\top}=\widehat{\boldsymbol{\mathit{Z}}} because 𝑴𝑴+𝑵Diag(𝒓)𝑵\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}} is a symmetric matrix. To explicitly compute this matrix, we need to:

  1. compute the matrix 𝑵S,:𝒁^\boldsymbol{\mathit{N}}_{S,:}\widehat{\boldsymbol{\mathit{Z}}},

  2. compute 𝑵:,S𝒁^𝑵:,S\boldsymbol{\mathit{N}}_{:,S}\widehat{\boldsymbol{\mathit{Z}}}\boldsymbol{\mathit{N}}_{:,S}^{\top},

  3. invert the middle term.

This cost is dominated by the first step, which can be viewed as multiplying n/k\lceil n/k\rceil pairs of k×nk\times n and n×kn\times k matrices. Each such multiplication takes time kω1nk^{\omega-1}n, for a total cost of O(kω2n2)O(k^{\omega-2}n^{2}). The other steps all involve matrices with dimension at most k×nk\times n, and are thus lower order terms. ∎
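
For concreteness, a minimal NumPy sketch of this update is given below; the name woodbury_update is ours, and dense inputs are assumed.

import numpy as np

def woodbury_update(Z_hat, N, r_old, r_new):
    """Illustrative low-rank inverse update in the spirit of Lemma 5.2.
    Z_hat = (M^T M + N^T Diag(r_old) N)^{-1} is given explicitly; returns
    (M^T M + N^T Diag(r_new) N)^{-1}, where r_old and r_new differ in k entries."""
    S = np.flatnonzero(r_new != r_old)         # entries whose resistance changed
    if S.size == 0:
        return Z_hat
    NS = N[S, :]                               # the k reweighted rows of N
    D = np.diag(1.0 / (r_new[S] - r_old[S]))   # (Diag(r_new_S) - Diag(r_old_S))^{-1}
    NSZ = NS @ Z_hat                           # k x n product dominating the cost
    middle = D + NSZ @ NS.T                    # k x k middle term of Woodbury
    return Z_hat - NSZ.T @ np.linalg.solve(middle, NSZ)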

5.1.2 Approximation and Fast Linear Systems Solver

We now define the notion of approximation we use and how to solve linear systems fast given a good preconditioner.

Definition 5.3.

We use a\approx_{c}b for positive numbers a and b iff c^{-1}a\leq b\leq c\cdot a, and for vectors \boldsymbol{\mathit{a}} and \boldsymbol{\mathit{b}} we use \boldsymbol{\mathit{a}}\approx_{c}\boldsymbol{\mathit{b}} to denote \boldsymbol{\mathit{a}}_{i}\approx_{c}\boldsymbol{\mathit{b}}_{i} entry-wise.

In our algorithm, we only update kk resistances that have increased by a constant factor. We can therefore use a constant factor preconditioner to solve the new linear system. We will use the following result on solving preconditioned systems of linear equations.

Lemma 5.4.

If 𝐫\boldsymbol{\mathit{r}} and 𝐫~\tilde{\boldsymbol{\mathit{r}}} are vectors such that 𝐫O~(1)𝐫~\boldsymbol{\mathit{r}}\approx_{\widetilde{O}(1)}\tilde{\boldsymbol{\mathit{r}}}, and we’re given the matrix 𝐙^1=(𝐌𝐌+𝐍Diag(𝐫)𝐍)1\widehat{\boldsymbol{\mathit{Z}}}^{-1}=(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}})^{-1} explicitly, then we can solve a system of linear equations involving 𝐙=𝐌𝐌+𝐍Diag(𝐫~)𝐍\boldsymbol{\mathit{Z}}=\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}} to 1/poly(n)1/poly(n) accuracy in O~(n2)\widetilde{O}(n^{2}) time.

Proof.

Suppose we want to solve the system,

𝒁𝒙=𝒃.\boldsymbol{\mathit{Z}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}.

We know 𝒁^1\widehat{\boldsymbol{\mathit{Z}}}^{-1} and that for some constant cc, 1c𝑰𝒁^1/2𝒁𝒁^1/2c𝑰\frac{1}{c}\boldsymbol{\mathit{I}}\preceq\widehat{\boldsymbol{\mathit{Z}}}^{-1/2}\boldsymbol{\mathit{Z}}\widehat{\boldsymbol{\mathit{Z}}}^{-1/2}\preceq c\boldsymbol{\mathit{I}}. The following iterative method (which is essentially gradient descent),

𝒙(k+1)𝒙(k)𝒁^1(𝒁𝒙𝒃)\boldsymbol{\mathit{x}}^{(k+1)}\rightarrow\boldsymbol{\mathit{x}}^{(k)}-\hat{\boldsymbol{\mathit{Z}}}^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{Z}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right)

converges to an ϵ\epsilon-approximate solution in O(clog1ϵ)O\mathopen{}\mathclose{{}\left(c\log\frac{1}{\epsilon}}\right) iterations. Each iteration can be computed via matrix-vector products. Since matrix-vector products with n×nn\times n matrices require at most O(n2)O(n^{2}) time, we get the above lemma for ϵ=1/poly(n)\epsilon=1/poly(n). ∎
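
A minimal sketch of this preconditioned iteration follows; the function and parameter names are ours, Z_hat_inv is the explicitly stored inverse, and the damping factor step is an addition beyond the lemma's statement that can be lowered if the approximation constant is large.

import numpy as np

def preconditioned_solve(Z_hat_inv, Z, b, iters=60, step=1.0):
    """Illustrative preconditioned iteration for Z x = b, where Z_hat_inv is
    the stored inverse of a matrix spectrally close to Z. Each iteration costs
    two n x n matrix-vector products, i.e. O(n^2) time."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x - step * (Z_hat_inv @ (Z @ x - b))   # x <- x - Z_hat^{-1}(Z x - b)
    return x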

5.1.3 Algorithm

The algorithm is the same as that in Section 6 of [Adi+19]. The algorithm has two parts: an initialization routine InverseInit, which is called only at the first iteration, and the inverse maintenance procedure UpdateInverse, which is called from Algorithm 4 (Oracle). Algorithm Oracle is called every time the resistances are updated in Algorithm 5. For this section, we will assume access to all variables from these routines, and maintain the following global variables:

  1. 𝒓^\boldsymbol{\widehat{\mathit{r}}}: resistances from the last time we updated each entry.

  2. counter(η)ecounter(\eta)_{e}: for each entry, track the number of times that it changed (relative to 𝒓^\boldsymbol{\widehat{\mathit{r}}}) by a factor of about 2η2^{-\eta} since the previous update.

  3. 𝒁^\widehat{\boldsymbol{\mathit{Z}}}: the inverse of the matrix given by 𝑴𝑴+𝑵Diag(𝒓^)𝑵\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\widehat{\mathit{r}}})\boldsymbol{\mathit{N}}.

Algorithm 8 Inverse Maintenance Initialization
1:procedure InverseInit(𝑴,𝑵,𝒓(0)\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{r}}^{(0)})
2:     Set 𝒓^𝒓(0)\boldsymbol{\widehat{\mathit{r}}}\leftarrow\boldsymbol{\mathit{r}}^{(0)}.
3:     Set counter(η)e0counter(\eta)_{e}\leftarrow 0 for all 0ηlog(m)0\leq\eta\leq\log(m) and ee.
4:     Set 𝒁^(𝑴𝑴+𝑵Diag(𝒓)𝑵)1\widehat{\boldsymbol{\mathit{Z}}}\leftarrow(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}})^{-1} by explicitly inverting the matrix.
Algorithm 9 Inverse Maintenance Procedure
1:procedure UpdateInverse
2:     for all entries ee do
3:         Find the least non-negative integer η\eta such that
12η𝒓e(i)𝒓e(i1)𝒓^e.\frac{1}{2^{\eta}}\leq\frac{\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i}\right)}_{e}-\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i-1}\right)}_{e}}{\boldsymbol{\widehat{\mathit{r}}}_{e}}.
4:         Increment counter(η)ecounter(\eta)_{e}.      
5:     Echangedη:i(mod2η)0{e:counter(η)e2η}E_{changed}\leftarrow\cup_{\eta:i\pmod{2^{\eta}}\equiv 0}\{e:counter(\eta)_{e}\geq 2^{\eta}\}
6:     𝒓~𝒓^\tilde{\boldsymbol{\mathit{r}}}\leftarrow\boldsymbol{\widehat{\mathit{r}}}
7:     for all eEchangede\in E_{changed} do
8:         𝒓~e𝒓e(i)\tilde{\boldsymbol{\mathit{r}}}_{e}\leftarrow\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}.
9:         Set counter(η)e0counter(\eta)_{e}\leftarrow 0 for all η\eta.      
10:     𝒁^LowRankUpdate(𝒁^,𝒓^,𝒓~)\widehat{\boldsymbol{\mathit{Z}}}\leftarrow\textsc{LowRankUpdate}(\widehat{\boldsymbol{\mathit{Z}}},\boldsymbol{\widehat{\mathit{r}}},\tilde{\boldsymbol{\mathit{r}}}).
11:     𝒓^𝒓~\boldsymbol{\widehat{\mathit{r}}}\leftarrow\tilde{\boldsymbol{\mathit{r}}}.

5.1.4 Analysis

We first verify that the maintained inverse is always a good preconditioner to the actual matrix, 𝑴𝑴+𝑵Diag(𝒓(i))𝑵\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}}^{(i)})\boldsymbol{\mathit{N}}.

Lemma 5.5 (Lemma 6.5, [Adi+19]).

After each call to UpdateInverse, the vector 𝐫^\boldsymbol{\widehat{\mathit{r}}} satisfies

𝒓^O~(1)𝒓(i).\boldsymbol{\widehat{\mathit{r}}}\approx_{\widetilde{O}\mathopen{}\mathclose{{}\left(1}\right)}\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i}\right)}.
Proof.

First, observe that any change in resistance exceeding 11 is reflected immediately. Otherwise, every time we update counter(j)ecounter(j)_{e}, 𝒓e\boldsymbol{\mathit{r}}_{e} can only increase additively by at most

2j+1𝒓^e.2^{-j+1}\boldsymbol{\widehat{\mathit{r}}}_{e}.

Once counter(j)ecounter(j)_{e} exceeds 2j2^{j}, ee will be added to EchangedE_{changed} after at most 2j2^{j} steps. So when we start from 𝒓^e\boldsymbol{\widehat{\mathit{r}}}_{e}, ee is added to EchangedE_{changed} after counter(j)e2j+2j=2j+1counter(j)_{e}\leq 2^{j}+2^{j}=2^{j+1} iterations. The maximum possible increase in resistance due to the bucket jj is,

2j+1𝒓^e2j+1=4𝒓^e.2^{-j+1}\boldsymbol{\widehat{\mathit{r}}}_{e}\cdot 2^{j+1}=4\boldsymbol{\widehat{\mathit{r}}}_{e}.

Since there are only at most m1/3m^{1/3} iterations, the contributions of buckets with j>logmj>\log{m} are negligible. Now the change in resistance is influenced by all buckets jj, each contributing at most 4𝒓^e4\boldsymbol{\widehat{\mathit{r}}}_{e} increase. The total change is at most 4𝒓^elogm4\boldsymbol{\widehat{\mathit{r}}}_{e}\log m since there are at most logm\log m buckets. We therefore have

𝒓^e𝒓e(i)5𝒓^elogm.\boldsymbol{\widehat{\mathit{r}}}_{e}\leq\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i}\right)}_{e}\leq 5\boldsymbol{\widehat{\mathit{r}}}_{e}\log m.

for every ii. ∎

It remains to bound the number and sizes of calls made to Lemma 5.2. For this we define variables k(η)(i)k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)} to denote the number of edges added to EchangedE_{changed} at iteration ii due to the value of counter(η)ecounter(\eta)_{e}. Note that k(η)(i)k(\eta)^{(i)} is non-zero only if i0(mod2η)i\equiv 0\pmod{2^{\eta}}, and

|Echanged(i)|ηk(η)(i).\mathopen{}\mathclose{{}\left|E_{changed}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right|\leq\sum_{\eta}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}.

We divide our analysis into two cases: when the relative change in resistance is at least 11, and when it is at most 11. To begin, let us first look at the following lemma, which relates the change in weights to the relative change in resistance.

Lemma 5.6.

Consider a primal step from Algorithm 5. We have

𝒓e(i+1)𝒓e(i)𝒓e(i)(1+α|𝑵Δ|eζ1/p)p21\frac{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i+1}\right)}-\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\leq\mathopen{}\mathclose{{}\left(1+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}-1

where Δ\Delta is the solution produced by the oracle Algorithm 4.

Proof.

Recall from Algorithm 4 that

𝒓e(i)=(𝒘e(i))p2.\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{p-2}.

For a primal step of Algorithm 5, we have

𝒘e(i+1)𝒘e(i)=αζ1/p|𝑵Δ|e.\boldsymbol{\mathit{w}}^{\mathopen{}\mathclose{{}\left(i+1}\right)}_{e}-\boldsymbol{\mathit{w}}^{\mathopen{}\mathclose{{}\left(i}\right)}_{e}=\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}.

Substituting this in gives

𝒓e(i+1)𝒓e(i)𝒓e(i)=(𝒘e(i)+αζ1/p|𝑵Δ|e)p2(𝒘e(i))p2(𝒘e(i))p2(1+αζ1/p|𝑵Δ|e𝒘e(i))p21(1+αζ1/p|𝑵Δ|e)p21,\frac{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i+1}\right)}-\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}=\frac{\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}+\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}\right)^{p-2}-\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{p-2}}{\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{p-2}}\leq\mathopen{}\mathclose{{}\left(1+\frac{\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}}\right)^{p-2}-1\leq\mathopen{}\mathclose{{}\left(1+\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}\right)^{p-2}-1,

where the last inequality utilizes 𝒘e(i)1\boldsymbol{\mathit{w}}_{e}^{(i)}\geq 1. ∎

We now consider the case when the relative change in resistance is at least 11.

Lemma 5.7.

Throughout the course of a run of Algorithm 5, the number of edges added to EchangedE_{changed} due to relative resistance increase of at least 11,

1iTk(0)(i)O(mp+23p2).\sum_{1\leq i\leq T}k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O\mathopen{}\mathclose{{}\left(m^{\frac{p+2}{3p-2}}}\right).
Proof.

From Lemma C.1, we know that the change in energy over one iteration is at least,

e(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1)).\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right).

Over all iterations, the change in energy is at least,

ie(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1))\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right)

which is upper bounded by O(mp2p)ζ2/pO(m^{\frac{p-2}{p}})\zeta^{2/p}. When iteration ii is a width reduction step, the relative resistance change is always at least 11. In this case |𝑵Δ|ρζ1/p\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|\geq\rho\zeta^{1/p}. When we have a primal step, Lemma 5.6 implies that when the relative change in resistance is at least 11 then,

|𝑵Δ|eΩ(1)α1ζ1/p.\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\Omega(1)\alpha^{-1}\zeta^{1/p}.

Using the bound |𝑵Δ|eΩ(p1)α1ζ1/p\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\Omega(p^{-1})\alpha^{-1}\zeta^{1/p} is sufficient since ρ>Ω(p1α1)\rho>\Omega(p^{-1}\alpha^{-1}) and both kinds of iterations are accounted for. The total change in energy can now be bounded.

p2α2ζ2/pie𝟙[𝒓e(i+1)𝒓e(i)𝒓e(i)1]O(mp2p)ζ2/p\displaystyle p^{-2}\alpha^{-2}\zeta^{2/p}\sum_{i}\sum_{e}\mathbb{1}_{\mathopen{}\mathclose{{}\left[\frac{\boldsymbol{\mathit{r}}^{(i+1)}_{e}-\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i)}_{e}}\geq 1}\right]}\leq O(m^{\frac{p-2}{p}})\zeta^{2/p}
p2α2ik(0)(i)O(mp2p)\displaystyle\Leftrightarrow p^{-2}\alpha^{-2}\sum_{i}k(0)^{(i)}\leq O(m^{\frac{p-2}{p}})
ik(0)(i)O(p2m(p2)/pα2).\displaystyle\Leftrightarrow\sum_{i}k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O(p^{2}m^{(p-2)/p}\alpha^{2}).

The Lemma follows by substituting α=Θ(p1mp25p+2p(3p2))\alpha=\Theta\mathopen{}\mathclose{{}\left(p^{-1}m^{-\frac{p^{2}-5p+2}{p(3p-2)}}}\right) in the above equation, since p^{2}m^{\frac{p-2}{p}}\alpha^{2}=m^{\frac{(p-2)(3p-2)-2(p^{2}-5p+2)}{p(3p-2)}}=m^{\frac{p^{2}+2p}{p(3p-2)}}=m^{\frac{p+2}{3p-2}}. ∎

Lemma 5.8.

Throughout the course of a run of Algorithm 5, the number of edges added to EchangedE_{changed} due to relative resistance increase between 2η2^{-\eta} and 2η+12^{-\eta+1},

1iTk(η)(i){0if 2ηT,O(mp+23p222η)otherwise.\sum_{1\leq i\leq T}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq\begin{cases}0&\text{if $2^{\eta}\geq T$},\\ O\mathopen{}\mathclose{{}\left(m^{\frac{p+2}{3p-2}}2^{2\eta}}\right)&\text{otherwise}.\end{cases}
Proof.

From Lemma C.1, the total change in energy is at least,

ie(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1)).\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right).

We know that 𝒓e(i+1)𝒓e(i)𝒓e(i)2η\frac{\boldsymbol{\mathit{r}}^{(i+1)}_{e}-\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i)}_{e}}\geq 2^{-\eta}. Using Lemma 5.6, we have,

(1+α|𝑵Δ|eζ1/p)p212η.\mathopen{}\mathclose{{}\left(1+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}-1\geq 2^{-\eta}.

We thus obtain,

(1+α|𝑵Δ|eζ1/p)p21{α|𝑵Δ|eζ1/p when α|Δe|ζ1/p or p21(2α|𝑵Δ|eζ1/p)p2 otherwise. \mathopen{}\mathclose{{}\left(1+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}-1\leq\begin{cases}\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}&\text{ when $\alpha\mathopen{}\mathclose{{}\left|\Delta_{e}}\right|\leq\zeta^{1/p}$ or $p-2\leq 1$}\\ \mathopen{}\mathclose{{}\left(2\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}&\text{ otherwise. }\end{cases}

Now, in the second case, when α|𝑵Δe|ζ1/p\alpha\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|\geq\zeta^{1/p} and p2>1p-2>1,

(2α|𝑵Δ|eζ1/p)p22ηα|𝑵Δ|e(12η)1/(p2)+1ζ1/p2η1ζ1/p\mathopen{}\mathclose{{}\left(2\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}\geq 2^{-\eta}\Rightarrow\alpha\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\mathopen{}\mathclose{{}\left(\frac{1}{2^{\eta}}}\right)^{1/(p-2)+1}\zeta^{1/p}\geq 2^{-\eta-1}\zeta^{1/p}

Therefore, for both cases we have,

α|𝑵Δ|e(2η1)ζ1/p.\alpha\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\mathopen{}\mathclose{{}\left(2^{-\eta-1}}\right)\zeta^{1/p}.

Using the above bound and the fact that the total change in energy is at most O(mp2p)ζ2/pO(m^{\frac{p-2}{p}})\zeta^{2/p}, gives,

ie(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1))O(mp2p)ζ2/p\displaystyle\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right)\leq O(m^{\frac{p-2}{p}})\zeta^{2/p}
\displaystyle\Rightarrow 14ie(α12ηζ1/p)2(2η𝟙2η+1𝒓e(i+1)𝒓e(i)𝒓e(i)2η)O(mp2p)ζ2/p\displaystyle\frac{1}{4}\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\alpha^{-1}2^{-\eta}\zeta^{1/p}}\right)^{2}\cdot\mathopen{}\mathclose{{}\left(2^{-\eta}\mathbb{1}_{2^{-\eta+1}\geq\frac{\boldsymbol{\mathit{r}}^{(i+1)}_{e}-\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i)}_{e}}\geq 2^{-\eta}}}\right)\leq O(m^{\frac{p-2}{p}})\zeta^{2/p}
\displaystyle\Rightarrow α223ηi2ηk(η)(i)O(mp2p)\displaystyle\alpha^{-2}2^{-3\eta}\sum_{i}2^{\eta}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O(m^{\frac{p-2}{p}})
\displaystyle\Rightarrow ik(η)(i)O(m(p2)/pα222η)\displaystyle\sum_{i}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O\mathopen{}\mathclose{{}\left(m^{(p-2)/p}\alpha^{2}2^{2\eta}}\right)

The Lemma follows substituting α=Θ(p1mp25p+2p(3p2))\alpha=\Theta\mathopen{}\mathclose{{}\left(p^{-1}m^{-\frac{p^{2}-5p+2}{p(3p-2)}}}\right) in the above equation. ∎

We can now use the concavity of f(z)=zω2f(z)=z^{\omega-2} to upper bound the contribution of these terms.

Corollary 5.9.

Let k(η)(i)k(\eta)^{(i)} be as defined. Over all iterations we have,

i(k(0)(i))ω2O(p3ωmp(104ω)3p2)\sum_{i}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}\leq O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}}\right)

and for every η\eta,

iT(k(η)(i))ω2{0if 2ηT,O(p3ωmp2+4(ω2)3p22η(3ω7))otherwise.\sum_{i}^{T}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}\leq\begin{cases}0&\text{if $2^{\eta}\geq T$},\\ O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}\cdot 2^{\eta\mathopen{}\mathclose{{}\left(3\omega-7}\right)}}\right)&\text{otherwise}.\end{cases}
Proof.

Due to the concavity of the ω20.3727<1\omega-2\approx 0.3727<1 power, this total is maximized when it’s equally distributed over all iterations. In the first sum, the number of terms is equal to the number of iterations, i.e., O(pmp23p2)O(pm^{\frac{p-2}{3p-2}}). In the second sum, the number of terms is O(pmp23p2)2ηO(pm^{\frac{p-2}{3p-2}})2^{-\eta}. Distributing the sums equally over these numbers of terms gives,

iT(k(0)(i))ω2(O(p1mp+23p2p23p2))ω2O(pmp23p2)=O(p3ωmp2+4(ω2)3p2)=O(p3ωmp(104ω)3p2)\sum_{i}^{T}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}\leq\mathopen{}\mathclose{{}\left(O\mathopen{}\mathclose{{}\left(p^{-1}m^{\frac{p+2}{3p-2}-\frac{p-2}{3p-2}}}\right)}\right)^{\omega-2}\cdot O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}}\right)=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}}\right)=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}}\right)

and

iT(k(η)(i))ω2\displaystyle\sum_{i}^{T}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2} O(pmp23p22η)(p1mp+23p222ηmp23p22η)ω2\displaystyle\leq O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}2^{-\eta}}\right)\cdot\mathopen{}\mathclose{{}\left(p^{-1}\frac{m^{\frac{p+2}{3p-2}}2^{2\eta}}{m^{\frac{p-2}{3p-2}}2^{-\eta}}}\right)^{\omega-2}
=O(p3ωmp2+4(ω2)3p22η23η(ω2))\displaystyle=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}2^{-\eta}\cdot 2^{3\eta(\omega-2)}}\right)
=O(p3ωmp2+4(ω2)3p22η(3ω7)).\displaystyle=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}2^{\eta(3\omega-7)}}\right).

5.2 Proof of Theorem 5.1

See 5.1

Proof.

By Lemma 5.5, the vector 𝒓^\boldsymbol{\widehat{\mathit{r}}} that the maintained inverse corresponds to always satisfies 𝒓^O~(1)𝒓(i)\boldsymbol{\widehat{\mathit{r}}}\approx_{\widetilde{O}(1)}\boldsymbol{\mathit{r}}^{(i)}. So by the iterative linear system solver outlined in Lemma 5.4, we can implement each call to Oracle (Algorithm 4) in time O(n2)O(n^{2}), in addition to the cost of performing inverse maintenance. This leads to a total cost of

O~(pn2mp23p2)\widetilde{O}\mathopen{}\mathclose{{}\left(pn^{2}m^{\frac{p-2}{3p-2}}}\right)

across the T=Θ(pmp23p2)T=\Theta(pm^{\frac{p-2}{3p-2}}) iterations.

The costs of inverse maintenance is dominated by the calls to the low-rank update procedure outlined in Lemma 5.2. Its total cost is bounded by

O(i|Echanged(i)|ω2n2)=O(n2i(ηk(η)(i))ω2).O\mathopen{}\mathclose{{}\left(\sum_{i}\mathopen{}\mathclose{{}\left|E_{changed}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right|^{\omega-2}n^{2}}\right)=O\mathopen{}\mathclose{{}\left(n^{2}\sum_{i}\mathopen{}\mathclose{{}\left(\sum_{\eta}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}}\right).

Because there are only O(logm)O(\log{m}) values of η\eta, and each k(η)(i)k(\eta)^{(i)} is non-negative, we can bound the total cost by:

O~(n2iη(k(η)(i))ω2)O~(p3ωn2η:2ηTmp2+4(ω2)3p22η(3ω7)),\widetilde{O}\mathopen{}\mathclose{{}\left(n^{2}\sum_{i}\sum_{\eta}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}}\right)\leq\widetilde{O}\mathopen{}\mathclose{{}\left(p^{3-\omega}n^{2}\sum_{\eta:2^{\eta}\leq T}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}\cdot 2^{\eta\mathopen{}\mathclose{{}\left(3\omega-7}\right)}}\right),

where the inequality follows from substituting in the result of Corollary 5.9. Depending on the sign of 3ω73\omega-7, this sum is dominated either at η=0\eta=0 or at η=logT\eta=\log{T}. Including both terms then gives

O~(p3ωn2(mp2+4(ω2)3p2+mp2+4(ω2)+(p2)(3ω7)3p2)),\widetilde{O}\mathopen{}\mathclose{{}\left(p^{3-\omega}n^{2}\mathopen{}\mathclose{{}\left(m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}+m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)+\mathopen{}\mathclose{{}\left(p-2}\right)\mathopen{}\mathclose{{}\left(3\omega-7}\right)}{3p-2}}}\right)}\right),

with the exponent on the trailing term simplifying to ω2\omega-2 to give,

O~(p3ωn2(mp(104ω)3p2+mω2)).\widetilde{O}\mathopen{}\mathclose{{}\left(p^{3-\omega}n^{2}\mathopen{}\mathclose{{}\left(m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}+m^{\omega-2}}\right)}\right).

6 Iteratively Reweighted Least Squares Algorithm

Iteratively Reweighted Least Squares (IRLS) Algorithms are a family of algorithms for solving p\ell_{p}-regression. These algorithms have been studied extensively for about 60 years [Law61, Ric64, GR97] and the classical form solves the following version of p\ell_{p}-regression,

min𝒙𝑨𝒙𝒃p,\min_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}\|_{p}, (12)

where 𝑨\boldsymbol{\mathit{A}} is a tall thin matrix and 𝒃\boldsymbol{\mathit{b}} is a vector. The main idea in IRLS algorithms is to solve a weighted least squares problem in every iteration to obtain the next iterate,

\boldsymbol{\mathit{x}}^{(t+1)}=\arg\min_{\boldsymbol{\mathit{x}}}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}\boldsymbol{\mathit{R}}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}) (13)

starting from 𝒙(0)\boldsymbol{\mathit{x}}^{(0)}, which is usually argmin𝒙𝑨𝒙𝒃22\arg\min_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}\|_{2}^{2}. Here 𝑹\boldsymbol{\mathit{R}} is picked to be Diag(|𝑨𝒙(t)𝒃|p2)Diag(|\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(t)}-\boldsymbol{\mathit{b}}|^{p-2}); note that the above update is then a fixed-point iteration for the p\ell_{p}-regression problem. It is known that the fixed point is unique for p(1,)p\in(1,\infty).
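
For illustration, a minimal NumPy sketch of this classical iteration (not the algorithm analyzed in this section) is given below; the tall full-rank A is an assumption, and the small eps added to the weights is a standard practical safeguard against zero residuals rather than part of (13).

import numpy as np

def classical_irls(A, b, p, iters=50, eps=1e-8):
    """Illustrative classical IRLS for min_x ||Ax - b||_p: each step solves a
    weighted least-squares problem with weights |Ax - b|^{p-2}. No convergence
    guarantee is implied, in particular for larger p."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]       # start from the l2 solution
    for _ in range(iters):
        w = np.abs(A @ x - b) ** (p - 2) + eps     # weights R = Diag(|Ax-b|^{p-2})
        x = np.linalg.lstsq(np.sqrt(w)[:, None] * A, np.sqrt(w) * b, rcond=None)[0]
    return x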

The basic version of the above IRLS algorithm is guaranteed to converge for p(1.5,3)p\in(1.5,3); however, even for pp as small as 3.53.5, the algorithm can diverge [RCL19]. Over the years there have been several studies on IRLS algorithms and attempts to show convergence [Kar70, Osb85], but these either do not show quantitative bounds or require a starting solution that is already close to the optimum. Refer to [Bur12] for a complete survey of these methods.

In this section we propose an IRLS algorithm and prove that it is guaranteed to converge geometrically to the optimum. Our algorithm is based on the algorithm of [APS19], and we present some experimental results from that paper demonstrating that our algorithm works very well in practice. We provide a much simpler analysis and integrate it with the framework we have built so far.

We will focus on the following pure p\ell_{p} setting for better readability,

min𝑨𝒙=𝒃𝑵𝒙p.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}.

We note that our algorithm also works for the setting described in Equation (12). We will first describe our algorithm in the next section, and then present some experimental results from [APS19].

6.1 IRLS Algorithm

Our IRLS algorithm is based on our overall iterative refinement framework (Algorithm 1), where we directly use a weighted least squares problem to solve the residual problem. Consider Algorithm 10 and compare it with Algorithm 1. We note that it is the same overall, except that we now have an extra LineSearch step and we update the solution (Line 7) in every iteration. These steps do not affect the overall convergence guarantees of the iterative refinement framework in Algorithm 1, since they only ensure that, given a solution from ResidualSolver-IRLS, we take the step that reduces the objective value the most, as opposed to the fixed update defined in Algorithm 1. In other words, we reduce the objective value in each iteration by at least as much as in Algorithm 1. It thus suffices to prove the guarantees of ResidualSolver-IRLS (Algorithm 11) and combine them with Theorem 2.1 to obtain our final convergence guarantees. We will prove the following result on our IRLS algorithm (Algorithm 10).

See 1.3 The key connection with IRLS algorithms is that we are able to show that solving a weighted least squares problem suffices to solve the residual problem. The two main differences are that, in every iteration, we add a small systematic padding to 𝑹\boldsymbol{\mathit{R}}, and that we perform a line search. These tricks are common empirical modifications used to avoid ill-conditioned matrices and to obtain faster convergence [Kar70, VB99].

Algorithm 10 Iteratively Reweighted Least Squares
1:procedure IRLS(𝑨,𝑵,𝒃,p,ϵ\boldsymbol{\mathit{A}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{b}},p,\epsilon)
2:     𝒙argmin𝑨𝒙=𝒃𝑵𝒙22\boldsymbol{\mathit{x}}\leftarrow{\arg\min}_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{2}^{2}
3:     ν𝑵𝒙pp\nu\leftarrow\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}
4:     while ν>ϵ2𝑵𝒙pp\nu>\frac{\epsilon}{2}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p} do
5:         Δ~,κ{\widetilde{{\Delta}}},\kappa\leftarrow ResidualSolver-IRLS(𝒙,𝑵,𝑨,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{b}},\nu,p)
6:         α\alpha\leftarrow LineSearch(𝑵,𝒙,Δ~\boldsymbol{\mathit{N}},\boldsymbol{\mathit{x}},{\widetilde{{\Delta}}}) \triangleright α=argminβ𝑵(𝒙βΔ~)pp\alpha=\arg\min_{\beta}\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\beta{\widetilde{{\Delta}}})\|_{p}^{p}
7:         𝒙𝒙αΔ~p\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}-\alpha\frac{{\widetilde{{\Delta}}}}{p}
8:         if 𝒓𝒆𝒔p(αΔ~)<ν32pκ\boldsymbol{res}_{p}(\alpha{\widetilde{{\Delta}}})<\frac{\nu}{32p\kappa} then
9:              νν2\nu\leftarrow\frac{\nu}{2}               
10:     return 𝒙\boldsymbol{\mathit{x}}
Algorithm 11 Residual Solver for IRLS
1:procedure ResidualSolver-IRLS(𝒙,𝑵,𝑨,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{b}},\nu,p)
2:     𝒈Diag(|𝑵𝒙|p2)𝑵𝒙\boldsymbol{\mathit{g}}\leftarrow Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}
3:     𝑹2Diag(|𝑵𝒙|p2)\boldsymbol{\mathit{R}}\leftarrow 2Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})
4:     𝒔νp2pmp2p\boldsymbol{\mathit{s}}\leftarrow\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}
5:     Δ~argmax𝑨Δ=0𝒈𝑵ΔΔ𝑵(R+𝒔𝑰)𝑵Δ{\widetilde{{\Delta}}}\leftarrow{\arg\max}_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}\Delta-\Delta^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\boldsymbol{\mathit{s}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}\Delta\triangleright Problem (14)
6:     k𝑵Δ~ppΔ~𝑵(𝑹+s𝑰)𝑵Δ~k\leftarrow\frac{\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}{{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}
7:     α0min{12,12k1/(p1)}\alpha_{0}\leftarrow\min\mathopen{}\mathclose{{}\left\{\frac{1}{2},\frac{1}{2k^{1/(p-1)}}}\right\}
8:     return Δ~,213p2α01{\widetilde{{\Delta}}},2^{13}p^{2}\alpha_{0}^{-1}
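To make the structure of Algorithms 10 and 11 concrete, the following is a hedged NumPy/SciPy sketch, not the paper's implementation (the reference code accompanying [APS19] is in MATLAB). The dense KKT solves, the helper solve_kkt, the use of scipy.optimize.minimize_scalar for the LineSearch step, and the iteration cap are all illustrative choices.

import numpy as np
from scipy.optimize import minimize_scalar

def solve_kkt(Q, A, rhs_top, rhs_bottom):
    """Solve the KKT system [Q, A^T; A, 0] [z; y] = [rhs_top; rhs_bottom]
    and return z. Dense solve for illustration; assumes the KKT matrix is
    invertible."""
    d, n = A.shape
    K = np.block([[Q, A.T], [A, np.zeros((d, d))]])
    return np.linalg.solve(K, np.concatenate([rhs_top, rhs_bottom]))[:n]

def residual_solver_irls(x, N, A, nu, p):
    """Sketch of ResidualSolver-IRLS (Algorithm 11): solve the padded
    weighted least squares problem (14) and return the step and kappa."""
    m = N.shape[0]
    g = np.abs(N @ x) ** (p - 2) * (N @ x)
    r = 2 * np.abs(N @ x) ** (p - 2)
    s = nu ** ((p - 2) / p) * m ** (-(p - 2) / p)      # systematic padding
    Q = N.T @ ((r + s)[:, None] * N)                   # N^T (R + s I) N
    # max_{A Delta = 0} g^T N Delta - Delta^T Q Delta, via its KKT system
    Delta = solve_kkt(2 * Q, A, N.T @ g, np.zeros(A.shape[0]))
    k = np.sum(np.abs(N @ Delta) ** p) / (Delta @ (Q @ Delta))
    alpha0 = min(0.5, 0.5 / k ** (1.0 / (p - 1)))
    return Delta, 2 ** 13 * p ** 2 / alpha0

def irls(A, N, b, p, eps=1e-8, max_iter=500):
    """Sketch of Algorithm 10: iterative refinement with padding and a
    line search. The iteration cap and dense solves are illustrative."""
    n = N.shape[1]
    # x <- argmin_{Ax = b} ||N x||_2^2
    x = solve_kkt(2 * N.T @ N, A, np.zeros(n), b)
    nu = np.sum(np.abs(N @ x) ** p)
    for _ in range(max_iter):
        if nu <= (eps / 2) * np.sum(np.abs(N @ x) ** p):
            break
        g = np.abs(N @ x) ** (p - 2) * (N @ x)       # gradient-type term
        r = 2 * np.abs(N @ x) ** (p - 2)             # diagonal of R
        Delta, kappa = residual_solver_irls(x, N, A, nu, p)
        # LineSearch: alpha = argmin_beta ||N (x - beta * Delta)||_p^p
        alpha = minimize_scalar(
            lambda beta: np.sum(np.abs(N @ (x - beta * Delta)) ** p)).x
        step = N @ (alpha * Delta)
        res = g @ step - step @ (r * step) - np.sum(np.abs(step) ** p)
        x = x - alpha * Delta / p
        if res < nu / (32 * p * kappa):              # residual progress check
            nu /= 2
    return x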

We will prove the following result about solving the residual problem.

Lemma 6.1.

Let 𝐱\boldsymbol{\mathit{x}} be the current iterate and ν\nu be such that 𝐍𝐱ppOPT(ν/2,ν]\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-OPT\in(\nu/2,\nu]. Let Δ~{\widetilde{{\Delta}}} be the solution of (14). Then for α0\alpha_{0} and α\alpha as defined in Algorithm 11 and Algorithm 10 respectively, αΔ~\alpha{\widetilde{{\Delta}}} is a O(p2α01)=O(p2mp22(p1))O\mathopen{}\mathclose{{}\left(p^{2}\alpha_{0}^{-1}}\right)=O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{2(p-1)}}}\right)-approximate solution to the residual problem.

We note that Theorem 1.3 directly follows from Lemma 6.1, Lemma 2.7 and Theorem 2.1. Therefore, in the next section, we will prove Lemma 6.1.

6.1.1 Solving the Residual Problem

Recall the residual problem (Definition 2.3),

max𝑨Δ=0𝒈𝑵ΔΔ𝑵𝑹𝑵Δ𝑵Δpp,\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}\Delta-\Delta^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\Delta-\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p},

with 𝒈=Diag(|𝑵𝒙|p2)𝑵𝒙\boldsymbol{\mathit{g}}=Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}} and 𝑹=2Diag(|𝑵𝒙|p2)\boldsymbol{\mathit{R}}=2Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}). Let ν\nu be as in Algorithm 10. We will show that the solution of the following weighted least squares problem is a good approximation to the residual problem,

max𝑨Δ=0𝒈𝑵ΔΔ𝑵(𝑹+νp2pmp2p𝑰)𝑵Δ.\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}\Delta-\Delta^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}\Delta. (14)

6.1.2 Proof of Lemma 6.1

Proof.

Since 𝑵𝒙ppOPT(ν/2,ν]\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-OPT\in(\nu/2,\nu], from Lemma 2.6 the optimum of the residual problem satisfies resp(Δ)(ν/32p,ν]res_{p}({{{\Delta^{\star}}}})\in(\nu/32p,\nu]. We will next prove that the objective of (14) at its optimum is at most ν\nu and at least ν213p2\frac{\nu}{2^{13}p^{2}}. Before proving this bound, we first show how αΔ~\alpha{\widetilde{{\Delta}}} gives the required approximation to the residual problem. Recall from Algorithm 11 that α0=min{12,12k1/(p1)}\alpha_{0}=\min\mathopen{}\mathclose{{}\left\{\frac{1}{2},\frac{1}{2k^{1/(p-1)}}}\right\}.

resp(αΔ~)\displaystyle res_{p}(\alpha{\widetilde{{\Delta}}}) 16presp(αΔ~/16p)\displaystyle\geq 16p\cdot res_{p}(\alpha{\widetilde{{\Delta}}}/16p)
𝑵𝒙pp𝑵(𝒙αΔ~)pp\displaystyle\geq\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\alpha{\widetilde{{\Delta}}})\|_{p}^{p}
𝑵𝒙pp𝑵(𝒙α0Δ~)pp\displaystyle\geq\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\alpha_{0}{\widetilde{{\Delta}}})\|_{p}^{p}
resp(α0Δ~)\displaystyle\geq res_{p}(\alpha_{0}{\widetilde{{\Delta}}})
=α0(𝒈𝑵Δ~α0Δ~𝑵𝑹𝑵Δ~α0p1𝑵Δ~pp)\displaystyle=\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}^{p-1}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}\right)
α0(𝒈𝑵Δ~α0Δ~𝑵(𝑹+s𝑰)𝑵Δ~α0p1kΔ~𝑵(𝑹+s𝑰)𝑵Δ~)\displaystyle\geq\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}^{p-1}k{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right)
α0(𝒈𝑵Δ~12Δ~𝑵(𝑹+s𝑰)𝑵Δ~12Δ~𝑵(𝑹+s𝑰)𝑵Δ~)\displaystyle\geq\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\frac{1}{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\frac{1}{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right)
=α0(𝒈𝑵Δ~Δ~𝑵(𝑹+s𝑰)𝑵Δ~)\displaystyle=\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right)
α0ν213p2α0213p2OPT.\displaystyle\geq\frac{\alpha_{0}\nu}{2^{13}p^{2}}\geq\frac{\alpha_{0}}{2^{13}p^{2}}OPT.

It remains to prove the bound on the optimal objective of (14) and to bound α0\alpha_{0}; for the latter, it is sufficient to find an upper bound on kk,

k=𝑵Δ~ppΔ~𝑵(𝑹+s𝑰)𝑵Δ~.k=\frac{\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}{{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}.

We first bound kk. Since s𝑰𝑹+s𝑰s\boldsymbol{\mathit{I}}\preceq\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}},

𝑵Δ~221sΔ~𝑵(𝑹+s𝑰)𝑵Δ~,\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{2}\leq\frac{1}{s}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}},

and

𝑵Δ~pp𝑵Δ~2p1s𝑵Δ~2p2Δ~𝑵(𝑹+s𝑰)𝑵Δ~.\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{p}\leq\frac{1}{s}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{p-2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}.

Therefore it is sufficient to bound 𝑵Δ~2\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}, as

k=𝑵Δ~ppΔ~𝑵(𝑹+s𝑰)𝑵Δ~1s𝑵Δ~2p2.k=\frac{\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}{{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\leq\frac{1}{s}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{p-2}.

To bound 𝑵Δ~2\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}, we start by assuming |𝒈𝑵Δ~|ν|\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}|\leq\nu. Now, since the optimal objective of (14) is lower bounded by ν213p2\frac{\nu}{2^{13}p^{2}} (as shown at the end of this proof),

Δ~𝑵(R+νp2pmp2p𝑰)𝑵Δ~𝒈𝑵Δ~ν213p2ν.{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\leq\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\frac{\nu}{2^{13}p^{2}}\leq\nu.

We thus have,

νp2pmp2p𝑵Δ~22ν.\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{2}\leq\nu.

Using this we get,

k1νp2pmp2pνp22ν(p2)22pm(p2)22p=mp22.k\leq\frac{1}{\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}}\frac{\nu^{\frac{p-2}{2}}}{\nu^{\frac{(p-2)^{2}}{2p}}m^{-\frac{(p-2)^{2}}{2p}}}=m^{\frac{p-2}{2}}.

We thus have α0\alpha_{0} lower bounded by mp22(p1)m^{-\frac{p-2}{2(p-1)}}, which gives us our result. It remains to give a lower bound to the optimal objective of (14).

Let Δ{{{\Delta^{\star}}}} denote the optimum of the residual problem. We know that 𝑵Δppν,Δ𝑵𝑹𝑵Δν\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}\leq\nu,{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\leq\nu and 𝒈𝑵Δ>ν/32p\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}>\nu/32p. Since 𝑵Δppν\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}\leq\nu we have 𝑵Δ22m(p2)/pν2/p\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{2}^{2}\leq m^{(p-2)/p}\nu^{2/p}. For a=1/27p,a=1/2^{7}p, aΔa{{{\Delta^{\star}}}} is a feasible solution for (14).

𝒈𝑵Δ~Δ~𝑵(R+νp2pmp2p𝑰)𝑵Δ~\displaystyle\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}} a𝒈𝑵Δa2Δ𝑵(R+νp2pmp2p𝑰)𝑵Δ\displaystyle\geq a\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}-a^{2}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}
a(ν32paΔ𝑵𝑹𝑵Δaνp2pmp2p𝑵Δ22)\displaystyle\geq a\mathopen{}\mathclose{{}\left(\frac{\nu}{32p}-a{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}-a\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{2}^{2}}\right)
a(ν32paνaνp2pmp2pm(p2)/pν2/p)\displaystyle\geq a\mathopen{}\mathclose{{}\left(\frac{\nu}{32p}-a\nu-a\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}m^{(p-2)/p}\nu^{2/p}}\right)
=a(ν32paνaν)\displaystyle=a\mathopen{}\mathclose{{}\left(\frac{\nu}{32p}-a\nu-a\nu}\right)
=aν26p=ν213p2\displaystyle=a\frac{\nu}{2^{6}p}=\frac{\nu}{2^{13}p^{2}}

Thus, the optimal objective of (14) is lower bounded by ν213p2\frac{\nu}{2^{13}p^{2}}. ∎

6.2 Experiments

In this section, we include the experimental results from [APS19], which are based on the algorithm pp-IRLS described in that paper. We note that pp-IRLS is similar in spirit to Algorithm 10, and we thus expect similar performance from an implementation of Algorithm 10. Algorithm pp-IRLS is described for the setting of Equation (12) and is available at https://github.com/fast-algos/pIRLS [APS19a]. We now give a brief summary of the experiments.

6.2.1 Experiments on p-IRLS

Figure 1: Random Matrix instances. Comparing the number of iterations and time taken by our algorithm with the parameters. Averaged over 100 random samples for 𝑨\boldsymbol{\mathit{A}} and 𝒃\boldsymbol{\mathit{b}}. Linear solver used: backslash. (a) Size of 𝑨\boldsymbol{\mathit{A}} fixed to 1000×8501000\times 850. (b) Sizes of 𝑨\boldsymbol{\mathit{A}}: (50+100(k1))×100k(50+100(k-1))\times 100k. Error ϵ=108.\epsilon=10^{-8}. (c) Size of 𝑨\boldsymbol{\mathit{A}} fixed to 1000×8501000\times 850. Error ϵ=108.\epsilon=10^{-8}.

Figure 2: Averaged over 100 random samples. Graph: 10001000 nodes (50005000-60006000 edges). Solver: PCG with Cholesky preconditioner.

Figure 3: Graph Instances. Comparing the number of iterations and time taken by our algorithm with the parameters. Averaged over 100 graph samples. Linear solver used: backslash. (a) Size of graph fixed to 10001000 nodes (around 50005000-60006000 edges). (b) Number of nodes: 100k100k. Error ϵ=108.\epsilon=10^{-8}. (c) Size of graph fixed to 10001000 nodes (around 50005000-60006000 edges). Error ϵ=108.\epsilon=10^{-8}.

Figure 4: Averaged over 100 samples. Precision set to ϵ=108\epsilon=10^{-8}. CVX solver used: SDPT3 for Matrices and Sedumi for Graphs. (a) Fixed p=8p=8. Size of matrices: 100k×(50+100(k1))100k\times(50+100(k-1)). (b) Size of matrices fixed to 500×450500\times 450. (c) Fixed p=8p=8. Number of nodes: 50k,k=1,2,,1050k,k=1,2,...,10. (d) Size of graphs fixed to 400400 nodes (around 20002000 edges).

All implementations were done in MATLAB 2018b on a desktop Ubuntu machine with an Intel Core i5-4570 CPU @ 3.20GHz × 4 processor and 4GB RAM. The two kinds of instances considered are random matrices and graph instances for the problem minx𝑨𝒙𝒃p\min_{x}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right\rVert_{p}.

  1. Random Matrices: the matrix 𝑨\boldsymbol{\mathit{A}} and the vector 𝒃\boldsymbol{\mathit{b}} are generated randomly, i.e., every entry is chosen uniformly at random between 0 and 11.

  2. Graphs: instances are generated as in [RCL19]. Vertices are uniform random vectors in [0,1]10[0,1]^{10} and edges are created by connecting the 1010 nearest neighbors. The weight of every edge is determined by a Gaussian function (Eq 3.1, [RCL19]). Around 10 vertices have labels chosen uniformly at random between 0 and 11. The problem is to minimize the p\ell_{p}-Laplacian; Appendix B contains details on how to formulate this problem in our standard form. These instances were generated using the code by [Rio19]; a hedged sketch of this construction is given after the list.
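As referenced in the list above, the following is a hedged NumPy/SciPy sketch of how such graph instances can be generated. The Gaussian kernel exp(-dist^2 / sigma^2) and the parameter sigma are assumed forms chosen only for illustration; the exact weight function is Eq. 3.1 of [RCL19], and the reference generator is the MATLAB code of [Rio19].

import numpy as np
from scipy.spatial.distance import cdist

def random_graph_instance(n=1000, dim=10, k=10, n_labels=10, sigma=1.0, seed=0):
    """Sketch of the graph instances above: random points in [0,1]^dim,
    k-nearest-neighbour edges, Gaussian edge weights (assumed kernel form),
    and roughly n_labels labelled vertices with random labels in [0, 1]."""
    rng = np.random.default_rng(seed)
    V = rng.random((n, dim))                       # vertices in [0,1]^dim
    D = cdist(V, V)                                # pairwise distances
    edges, weights = [], []
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:        # k nearest neighbours of i
            edges.append((i, int(j)))
            weights.append(np.exp(-D[i, j] ** 2 / sigma ** 2))
    labelled = rng.choice(n, size=n_labels, replace=False)
    labels = dict(zip(labelled.tolist(), rng.random(n_labels)))
    return V, edges, np.array(weights), labels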

The performance of pp-IRLS is compared against the MATLAB/CVX solver [GB14, GB08] and the IRLS/homotopy based implementation from [RCL19]. More details on the experiments are in [APS19]; the plots and specific details of the implementation are included in Figures 1, 2, 3, and 4.

References

  • [ABS21] Deeksha Adil, Brian Bullins and Sushant Sachdeva “Unifying Width-Reduced Methods for Quasi-Self-Concordant Optimization” In arXiv preprint arXiv:2107.02432, 2021
  • [Adi+19] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Iterative Refinement for p\ell_{p}-norm Regression” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 1405–1424 SIAM
  • [Adi+21] Deeksha Adil, Brian Bullins, Rasmus Kyng and Sushant Sachdeva “Almost-Linear-Time Weighted p\ell_{p}-Norm Solvers in Slightly Dense Graphs via Sparsification” In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), 2021 Schloss Dagstuhl-Leibniz-Zentrum für Informatik
  • [AL11] Morteza Alamgir and Ulrike Luxburg “Phase transition in the family of p-resistances” In Advances in neural information processing systems 24, 2011, pp. 379–387
  • [All+17] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira and Avi Wigderson “Much faster algorithms for matrix scaling” In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 890–901 IEEE
  • [APS19] Deeksha Adil, Richard Peng and Sushant Sachdeva “Fast, Provably Convergent IRLS Algorithm for p-norm Linear Regression” In Advances in Neural Information Processing Systems, 2019, pp. 14189–14200
  • [APS19a] Deeksha Adil, Richard Peng and Sushant Sachdeva “pIRLS” In https://github.com/fast-algos/pIRLS GitHub, https://github.com/fast-algos/pIRLS, 2019
  • [AS20] Deeksha Adil and Sushant Sachdeva “Faster p-norm Minimizing Flows, via Smoothed q-norm Problems” In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2020, pp. 892–910 SIAM
  • [BB94] Jose Antonio Barreto and C Sidney Burrus “L/sub p/-complex approximation using iterative reweighted least squares for FIR digital filters” In Proceedings of ICASSP’94. IEEE International Conference on Acoustics, Speech and Signal Processing 3, 1994, pp. III–545 IEEE
  • [Bub+18] Sébastien Bubeck, Michael B Cohen, Yin Tat Lee and Yuanzhi Li “An homotopy method for p\ell_{p} regression provably beyond self-concordance and in input-sparsity time” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018, pp. 1130–1137
  • [Bul18] Brian Bullins “Fast minimization of structured convex quartics” In arXiv preprint arXiv:1812.10349, 2018
  • [Bur12] C Burrus “Iterative re-weighted least-squares OpenStax-CNX”, 2012
  • [Cal19] Jeff Calder “Consistency of Lipschitz learning with infinite unlabeled data and finite labeled data” In SIAM Journal on Mathematics of Data Science 1.4 SIAM, 2019, pp. 780–812
  • [Car+20] Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford and Kevin Tian “Acceleration with a Ball Optimization Oracle” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 19052–19063 URL: https://proceedings.neurips.cc/paper/2020/file/dba4c1a117472f6aca95211285d0587e-Paper.pdf
  • [Che+22] Li Chen, Rasmus Kyng, Yang P Liu, Richard Peng, Maximilian Probst Gutenberg and Sushant Sachdeva “Maximum flow and minimum-cost flow in almost-linear time” In arXiv preprint arXiv:2203.00671, 2022
  • [Chi+13] Hui Han Chin, Aleksander Madry, Gary L Miller and Richard Peng “Runtime guarantees for regression problems” In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, 2013, pp. 269–282
  • [Chi+17] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy and David P Woodruff “Algorithms for p\ell_{p} low-rank approximation” In International Conference on Machine Learning, 2017, pp. 806–814 PMLR
  • [Chr+11] Paul Christiano, Jonathan A Kelner, Aleksander Madry, Daniel A Spielman and Shang-Hua Teng “Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs” In Proceedings of the forty-third annual ACM symposium on Theory of computing, 2011, pp. 273–282
  • [CT05] Emmanuel J Candes and Terence Tao “Decoding by linear programming” In IEEE transactions on information theory 51.12 IEEE, 2005, pp. 4203–4215
  • [CY08] Rick Chartrand and Wotao Yin “Iteratively reweighted algorithms for compressive sensing” In 2008 IEEE international conference on acoustics, speech and signal processing, 2008, pp. 3869–3872 IEEE
  • [EDT17] Abderrahim Elmoataz, X Desquesnes and M Toutain “On the game p-Laplacian on weighted graphs with applications in image processing and data clustering” In European Journal of Applied Mathematics 28.6 Cambridge University Press, 2017, pp. 922–948
  • [ETT15] Abderrahim Elmoataz, Matthieu Toutain and Daniel Tenbrinck “On the p-Laplacian and \infty-Laplacian on graphs with applications in image and data processing” In SIAM Journal on Imaging Sciences 8.4 SIAM, 2015, pp. 2412–2451
  • [EV19] Alina Ene and Adrian Vladu “Improved Convergence for 1\ell_{1} and \ell_{\infty} Regression via Iteratively Reweighted Least Squares” In International Conference on Machine Learning, 2019, pp. 1794–1801 PMLR
  • [GB08] M. Grant and S. Boyd “Graph implementations for nonsmooth convex programs” http://stanford.edu/~boyd/graph_dcp.html In Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences Springer-Verlag Limited, 2008, pp. 95–110
  • [GB14] M. Grant and S. Boyd “CVX: Matlab Software for Disciplined Convex Programming, version 2.1”, http://cvxr.com/cvx, 2014
  • [GR97] Irina F Gorodnitsky and Bhaskar D Rao “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm” In IEEE Transactions on signal processing 45.3 IEEE, 1997, pp. 600–616
  • [HFE18] Yosra Hafiene, Jalal Fadili and Abderrahim Elmoataz “Nonlocal pp-Laplacian Variational problems on graphs” In arXiv preprint arXiv:1810.12817, 2018
  • [JLS22] Arun Jambulapati, Yang P Liu and Aaron Sidford “Improved iteration complexities for overconstrained p-norm regression” In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, 2022, pp. 529–542
  • [Kar70] LA Karlovitz “Construction of nearest points in the LpL_{p}, pp even, and LL_{\infty} norms. I” In Journal of Approximation Theory 3.2 Academic Press, 1970, pp. 123–127
  • [KLS20] Tarun Kathuria, Yang P Liu and Aaron Sidford “Unit Capacity Maxflow in Almost O(m4/3)O(m^{4/3}) Time” In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020, pp. 119–130 IEEE
  • [Kyn+15] R. Kyng, A.. Rao, S. Sachdeva and D. Spielman “Algorithms for Lipschitz learning on graphs” In COLT, 2015
  • [Kyn+19] Rasmus Kyng, Richard Peng, Sushant Sachdeva and Di Wang “Flows in almost linear time via adaptive preconditioning” In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019, pp. 902–913
  • [Law61] Charles Lawrence Lawson “Contribution to the theory of linear least maximum approximation” In Ph. D. dissertation, Univ. Calif., 1961
  • [LS14] Yin Tat Lee and Aaron Sidford “Path Finding Methods for Linear Programming: Solving Linear Programs in O~(rank)\tilde{O}(\sqrt{rank}) Iterations and Faster Algorithms for Maximum Flow” Available at http://arxiv.org/abs/1312.6677 and http://arxiv.org/abs/1312.6713 In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, 2014, pp. 424–433 IEEE
  • [LS15] Yin Tat Lee and Aaron Sidford “Efficient Inverse Maintenance and Faster Algorithms for Linear Programming” Available at: https://arxiv.org/abs/1503.01752 In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, 2015, pp. 230–249
  • [LSW15] Yin Tat Lee, Aaron Sidford and Sam Chiu-wai Wong “A faster cutting plane method and its implications for combinatorial and convex optimization” In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, 2015, pp. 1049–1065 IEEE
  • [NN94] Yurii Nesterov and Arkadii Nemirovskii “Interior-point polynomial algorithms in convex programming” SIAM, 1994
  • [Osb85] Michael Robert Osborne “Finite algorithms in optimization and data analysis” John Wiley & Sons, Inc., 1985
  • [RCL19] Mauricio Flores Rios, Jeff Calder and Gilad Lerman “Algorithms for p\ell_{p}-based semi-supervised learning on graphs” In arXiv preprint arXiv:1901.05031, 2019
  • [Ric64] John Rischard Rice “The approximation of functions” Addison-Wesley Reading, Mass., 1964
  • [Rio19] M. F. Rios “Laplacian_Lp_Graph_SSL” In GitHub repository GitHub, https://github.com/mauriciofloresML/Laplacian_Lp_Graph_SSL, 2019
  • [SV16] Damian Straszak and Nisheeth K Vishnoi “IRLS and slime mold: Equivalence and convergence” In arXiv preprint arXiv:1601.02712, 2016
  • [SV16a] Damian Straszak and Nisheeth K Vishnoi “Natural algorithms for flow problems” In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, 2016, pp. 1868–1883 SIAM
  • [SV16b] Damian Straszak and Nisheeth K. Vishnoi “On a Natural Dynamics for Linear Programming” In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016
  • [Vai90] P Vaidya “Solving linear equations with diagonally dominant matrices by constructing good preconditioners”, 1990
  • [VB99] Ricardo A Vargas and Charles S Burrus “Adaptive iterative reweighted least squares design of L/sub p/FIR filters” In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258) 3, 1999, pp. 1129–1132 IEEE

Appendix A Solving 2\ell_{2} Problems under Subspace Constraints

We will show how to solve general problems of the following form using a linear system solver.

min𝒙\displaystyle\min_{\boldsymbol{\mathit{x}}} 𝑨𝒙𝒃22\displaystyle\quad\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right\rVert_{2}^{2}
𝑪𝒙=𝒅.\displaystyle\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{d}}.

We first write the Lagrangian of the problem,

L(𝒙,𝒗)=min𝒙max𝒗(𝑨𝒙𝒃)(𝑨𝒙𝒃)+𝒗(𝒅𝑪𝒙)L(\boldsymbol{\mathit{x}},\boldsymbol{\mathit{v}})=\min_{\boldsymbol{\mathit{x}}}\max_{\boldsymbol{\mathit{v}}}\quad(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})+\boldsymbol{\mathit{v}}^{\top}(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}})

Using Lagrangian duality and noting that strong duality holds, we can write the above as,

L(𝒙,𝒗)=\displaystyle L(\boldsymbol{\mathit{x}},\boldsymbol{\mathit{v}})= min𝒙max𝒗(𝑨𝒙𝒃)(𝑨𝒙𝒃)+𝒗(𝒅𝑪𝒙)\displaystyle\min_{\boldsymbol{\mathit{x}}}\max_{\boldsymbol{\mathit{v}}}\quad(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})+\boldsymbol{\mathit{v}}^{\top}(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}})
=\displaystyle= max𝒗min𝒙(𝑨𝒙𝒃)(𝑨𝒙𝒃)+𝒗(𝒅𝑪𝒙).\displaystyle\max_{\boldsymbol{\mathit{v}}}\min_{\boldsymbol{\mathit{x}}}\quad(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})+\boldsymbol{\mathit{v}}^{\top}(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}}).

We first find 𝒙\boldsymbol{\mathit{x}}^{\star} that minimizes the above objective by setting the gradient with respect to 𝒙\boldsymbol{\mathit{x}} to 0. We thus have,

𝒙=(𝑨𝑨)1(2𝑨𝒃+𝑪𝒗2).\boldsymbol{\mathit{x}}^{\star}=(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\mathopen{}\mathclose{{}\left(\frac{2\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{C}}^{\top}\boldsymbol{\mathit{v}}}{2}}\right).

Using this value of 𝒙\boldsymbol{\mathit{x}} we arrive at the following dual program.

L(𝒗)=max𝒗14𝒗𝑪(𝑨𝑨)1𝑪𝒗𝒃𝑨(𝑨𝑨)1𝑨𝒃𝒗𝑪(𝑨𝑨)1𝑨𝒃+𝒃𝒃+𝒗𝒅,L(\boldsymbol{\mathit{v}})=\max_{\boldsymbol{\mathit{v}}}\quad-\frac{1}{4}\boldsymbol{\mathit{v}}^{\top}\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}\boldsymbol{\mathit{v}}-\boldsymbol{\mathit{b}}^{\top}\boldsymbol{\mathit{A}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}-\boldsymbol{\mathit{v}}^{\top}\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{b}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{v}}^{\top}\boldsymbol{\mathit{d}},

which is optimized at,

𝒗=2(𝑪(𝑨𝑨)1𝑪)1(𝒅𝑪(𝑨𝑨)1𝑨𝒃).\boldsymbol{\mathit{v}}^{\star}=2\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}}\right)^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}}\right).

Strong duality also implies that L(𝒙,𝒗)L(\boldsymbol{\mathit{x}},\boldsymbol{\mathit{v}}^{\star}) is optimized at 𝒙\boldsymbol{\mathit{x}}^{\star}, which gives us,

𝒙=(𝑨𝑨)1(𝑨𝒃+𝑪(𝑪(𝑨𝑨)1𝑪)1(𝒅𝑪(𝑨𝑨)1𝑨𝒃)).\boldsymbol{\mathit{x}}^{\star}=(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{C}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}}\right)^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}}\right)}\right).

We now note that we can compute 𝒙\boldsymbol{\mathit{x}}^{\star} by solving the following linear systems in order:

  1. Find the inverse of 𝑨𝑨\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}}.

  2. Solve (𝑪(𝑨𝑨)1𝑪)x=(𝒅𝑪(𝑨𝑨)1𝑨𝒃)\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}}\right)x=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}}\right).
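The following is a minimal NumPy sketch of this procedure, assuming A has full column rank and C has full row rank. It uses dense linear solves purely for illustration; the function name and the random check at the end are not from the paper.

import numpy as np

def constrained_least_squares(A, b, C, d):
    """Solve  min_x ||Ax - b||_2^2  subject to  Cx = d  via the two linear
    solves described above."""
    AtA, Atb = A.T @ A, A.T @ b
    AtA_inv_Ct = np.linalg.solve(AtA, C.T)         # (A^T A)^{-1} C^T
    AtA_inv_Atb = np.linalg.solve(AtA, Atb)        # (A^T A)^{-1} A^T b
    # (C (A^T A)^{-1} C^T) v = d - C (A^T A)^{-1} A^T b
    v = np.linalg.solve(C @ AtA_inv_Ct, d - C @ AtA_inv_Atb)
    # x* = (A^T A)^{-1} (A^T b + C^T v)
    return np.linalg.solve(AtA, Atb + C.T @ v)

# quick usage check: the returned point satisfies the constraint C x = d
rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
C, d = rng.normal(size=(3, 10)), rng.normal(size=3)
x = constrained_least_squares(A, b, C, d)
assert np.allclose(C @ x, d)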

Appendix B Converting p\ell_{p}-Laplacian Minimization to Regression Form

Define the following terms:

  • nn denotes the number of vertices.

  • ll denotes the number of labels.

  • 𝑩\boldsymbol{\mathit{B}} denotes the edge-vertex adjacency matrix.

  • 𝒈\boldsymbol{\mathit{g}} denotes the vector of labels for the ll labelled vertices.

  • 𝑾\boldsymbol{\mathit{W}} denotes the diagonal matrix with the weights of the edges.

Set 𝑨=𝑾1/p𝑩\boldsymbol{\mathit{A}}=\boldsymbol{\mathit{W}}^{1/p}\boldsymbol{\mathit{B}} and 𝒃=𝑩[:,n:n+l]𝒈\boldsymbol{\mathit{b}}=-\boldsymbol{\mathit{B}}[:,n:n+l]\boldsymbol{\mathit{g}}. Now 𝑨𝒙𝒃pp\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right\rVert_{p}^{p} is equal to the p\ell_{p}-Laplacian objective, and we can use our IRLS algorithm from Section 6 to find the 𝒙\boldsymbol{\mathit{x}} that minimizes it.
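A hedged NumPy sketch of this reduction follows. The column layout of B (unlabelled vertices in the first n columns, labelled vertices in the last l columns) and the application of the edge weights to the label columns as well are assumptions made only for this sketch, beyond what the text above specifies.

import numpy as np

def lp_laplacian_to_regression(B, w, g, n, l, p):
    """Build A and b so that ||Ax - b||_p^p equals the weighted l_p-Laplacian
    sum_e w_e |x_u - x_v|^p with the labelled values fixed to g.
    Assumptions (sketch only): B is the (edges x (n + l)) edge-vertex
    incidence matrix with unlabelled vertices in columns 0..n-1 and labelled
    vertices in columns n..n+l-1, and w is the edge-weight vector."""
    Wp = w ** (1.0 / p)
    A = Wp[:, None] * B[:, :n]                  # W^{1/p} B, unknown columns
    b = -(Wp[:, None] * B[:, n:n + l]) @ g      # move the labelled part to b
    return A, b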

Appendix C Increasing Resistances

We first prove the following lemma that shows how much Ψ\Psi changes with a change in resistance.

Lemma C.1.

Let Δ~=argmin𝐀Δ=cΔ𝐌𝐌Δ+e𝐫e(𝐍Δ)e2{\widetilde{{\Delta}}}=\arg\min_{\boldsymbol{\mathit{A}}\Delta=c}\Delta^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}. Then one has for any 𝐫\boldsymbol{\mathit{r}}^{\prime} and 𝐫\boldsymbol{\mathit{r}} such that 𝐫𝐫\boldsymbol{\mathit{r}}^{\prime}\geq\boldsymbol{\mathit{r}},

Ψ(𝒓)Ψ(𝒓)+e(1𝒓e𝒓e)𝒓e(𝑵Δ~)e2.{\Psi({\boldsymbol{\mathit{r}}^{\prime}})}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}}}\right)}+\sum_{e}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}_{e}}{\boldsymbol{\mathit{r}}^{\prime}_{e}}}\right)\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}.
Proof.

For this proof, we use 𝑹=Diag(𝒓)\boldsymbol{\mathit{R}}=Diag(\boldsymbol{\mathit{r}}).

Ψ(𝒓)=min𝑨𝒙=𝒄𝒙𝑴𝑴𝒙+𝒙𝑵𝑹𝑵𝒙.\Psi(\boldsymbol{\mathit{r}})=\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{c}}}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}.

Constructing the Lagrangian and noting that strong duality holds,

Ψ(𝒓)\displaystyle\Psi(\boldsymbol{\mathit{r}}) =min𝒙max𝒚𝒙𝑴𝑴𝒙+𝒙𝑵𝑹𝑵𝒙+2𝒚(𝒄𝑨𝒙)\displaystyle=\min_{\boldsymbol{\mathit{x}}}\max_{\boldsymbol{\mathit{y}}}\quad\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}+2\boldsymbol{\mathit{y}}^{\top}(\boldsymbol{\mathit{c}}-\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}})
=max𝒚min𝒙𝒙𝑴𝑴𝒙+𝒙𝑵𝑹𝑵𝒙+2𝒚(𝒄𝑨𝒙).\displaystyle=\max_{\boldsymbol{\mathit{y}}}\min_{\boldsymbol{\mathit{x}}}\quad\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}+2\boldsymbol{\mathit{y}}^{\top}(\boldsymbol{\mathit{c}}-\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}).

Optimality conditions with respect to 𝒙\boldsymbol{\mathit{x}} give us,

2𝑴𝑴𝒙+2𝑵𝑹𝑵𝒙=2𝑨𝒚.2\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}^{\star}+2\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}=2\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{y}}.

Substituting this in Ψ\Psi gives us,

Ψ(𝒓)=max𝒚2𝒚𝒄𝒚𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝒚.\Psi(\boldsymbol{\mathit{r}})=\max_{\boldsymbol{\mathit{y}}}\quad 2\boldsymbol{\mathit{y}}^{\top}\boldsymbol{\mathit{c}}-\boldsymbol{\mathit{y}}^{\top}\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{y}}.

Optimality conditions with respect to 𝒚\boldsymbol{\mathit{y}} now give us,

2𝒄=2𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝒚,2\boldsymbol{\mathit{c}}=2\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{y}}^{\star},

which upon re-substitution gives,

Ψ(𝒓)=𝒄(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝒄.\Psi(\boldsymbol{\mathit{r}})=\boldsymbol{\mathit{c}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\boldsymbol{\mathit{c}}.

We also note that

𝒙=(𝑴𝑴+𝑵𝑹𝑵)1𝑨(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝒄.\boldsymbol{\mathit{x}}^{\star}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\boldsymbol{\mathit{c}}. (15)

We now want to see what happens when we change 𝒓\boldsymbol{\mathit{r}}. Let 𝑹\boldsymbol{\mathit{R}} denote the diagonal matrix with entries 𝒓\boldsymbol{\mathit{r}} and let 𝑹=𝑹+SS\boldsymbol{\mathit{R}}^{\prime}=\boldsymbol{\mathit{R}}+\SS, where SS\SS is the diagonal matrix with the changes in the resistances. We will use the following version of the Sherman-Morrison-Woodbury formula multiple times,

(𝑿+𝑼𝑪𝑽)1=𝑿1𝑿1𝑼(𝑪1+𝑽𝑿1𝑼)1𝑽𝑿1.(\boldsymbol{\mathit{X}}+\boldsymbol{\mathit{U}}\boldsymbol{\mathit{C}}\boldsymbol{\mathit{V}})^{-1}=\boldsymbol{\mathit{X}}^{-1}-\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}(\boldsymbol{\mathit{C}}^{-1}+\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}})^{-1}\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}.

We begin by applying the above formula for 𝑿=𝑴𝑴+𝑵𝑹𝑵\boldsymbol{\mathit{X}}=\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}, 𝑪=𝑰\boldsymbol{\mathit{C}}=\boldsymbol{\mathit{I}}, 𝑼=𝑵SS1/2\boldsymbol{\mathit{U}}=\boldsymbol{\mathit{N}}^{\top}\SS^{1/2} and 𝑽=SS1/2𝑵\boldsymbol{\mathit{V}}=\SS^{1/2}\boldsymbol{\mathit{N}}. We thus get,

(𝑴𝑴+𝑵𝑹𝑵)1=(𝑴𝑴+𝑵𝑹𝑵)1(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2(𝑰+SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2)1SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1.\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}-\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}\\ \mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}}\right)^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}. (16)

We next claim that,

𝑰+SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2𝑰+SS1/2𝑹1SS1/2,\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}\preceq\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2},

which gives us,

(𝑴𝑴+𝑵𝑹𝑵)1(𝑴𝑴+𝑵𝑹𝑵)1(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2(𝑰+SS1/2𝑹1SS1/2)1SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1.\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}\preceq\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}-\\ \mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}. (17)

This further implies,

𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2(𝑰+SS1/2𝑹1SS1/2)1SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑨.\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\preceq\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}-\\ \boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}. (18)

We apply the Sherman-Morrison formula again for, 𝑿=𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨\boldsymbol{\mathit{X}}=\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}, 𝑪=(𝑰+SS1/2𝑹1SS1/2)1\boldsymbol{\mathit{C}}=-(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}, 𝑼=𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2\boldsymbol{\mathit{U}}=\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2} and 𝑽=SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑨\boldsymbol{\mathit{V}}=\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}. Let us look at the term 𝑪1+𝑽𝑿1𝑼\boldsymbol{\mathit{C}}^{-1}+\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}.

(𝑪1+𝑽𝑿1𝑼)1=(𝑰+SS1/2𝑹1SS1/2𝑽𝑿1𝑼)1(𝑰+SS1/2𝑹1SS1/2)1.-\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}^{-1}+\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}}\right)^{-1}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2}-\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}}\right)^{-1}\succeq(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}.

Using this, we get,

(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝑿1+𝑿1𝑼(𝑰+SS1/2𝑹1SS1/2)1𝑽𝑿1,\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\succeq\boldsymbol{\mathit{X}}^{-1}+\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1},

which on multiplying by 𝒄\boldsymbol{\mathit{c}}^{\top} and 𝒄\boldsymbol{\mathit{c}} gives,

Ψ(𝒓)Ψ(𝒓)+𝒄𝑿1𝑼(𝑰+SS1/2𝑹1SS1/2)1𝑽𝑿1𝒄.\Psi(\boldsymbol{\mathit{r}}^{\prime})\geq\Psi(\boldsymbol{\mathit{r}})+\boldsymbol{\mathit{c}}^{\top}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{c}}.

We note from Equation (15) that 𝒙=(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝑿1𝒄\boldsymbol{\mathit{x}}^{\star}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{c}}. We thus have,

Ψ(𝒓)\displaystyle\Psi(\boldsymbol{\mathit{r}}^{\prime}) Ψ(𝒓)+(𝒙)𝑵SS1/2(𝑰+SS1/2𝑹1SS1/2)1SS1/2𝑵𝒙\displaystyle\geq\Psi(\boldsymbol{\mathit{r}})+\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}^{\star}}\right)^{\top}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}
=Ψ(𝒓)+e(𝒓e𝒓e𝒓e)𝒓e(𝑵𝒙)e2.\displaystyle=\Psi(\boldsymbol{\mathit{r}})+\sum_{e}\mathopen{}\mathclose{{}\left(\frac{\boldsymbol{\mathit{r}}^{\prime}_{e}-\boldsymbol{\mathit{r}}_{e}}{\boldsymbol{\mathit{r}}^{\prime}_{e}}}\right)\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star})^{2}_{e}.
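As a quick numerical sanity check of Lemma C.1 (not part of the analysis above), the following hedged NumPy sketch evaluates Ψ through the closed form derived in this proof and verifies the claimed inequality on a small random instance; all dimensions and distributions are arbitrary illustrative choices.

import numpy as np

def psi(r, M, N, A, c):
    """Psi(r) = min_{Ax = c} x^T M^T M x + x^T N^T Diag(r) N x, computed via
    the closed form c^T (A (M^T M + N^T R N)^{-1} A^T)^{-1} c derived above."""
    H = M.T @ M + N.T @ (r[:, None] * N)
    return c @ np.linalg.solve(A @ np.linalg.solve(H, A.T), c)

def check_lemma_c1(seed=0):
    """Check that increasing resistances from r to r' >= r increases Psi by
    at least sum_e (1 - r_e / r'_e) r_e (N x*)_e^2, where x* is the minimizer
    at r given by Equation (15)."""
    rng = np.random.default_rng(seed)
    m, n, d = 20, 12, 4
    M, N = rng.normal(size=(m, n)), rng.normal(size=(m, n))
    A, c = rng.normal(size=(d, n)), rng.normal(size=d)
    r = rng.random(m) + 0.1
    r_new = r + rng.random(m)                      # r' >= r entrywise
    H = M.T @ M + N.T @ (r[:, None] * N)
    x_star = np.linalg.solve(
        H, A.T @ np.linalg.solve(A @ np.linalg.solve(H, A.T), c))
    lower = psi(r, M, N, A, c) + np.sum((1 - r / r_new) * r * (N @ x_star) ** 2)
    return psi(r_new, M, N, A, c) >= lower - 1e-9

print(check_lemma_c1())   # expected output: True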