
Fast Algorithms for $\ell_p$-Regression

Deeksha Adil¹
¹ Supported by a Postgraduate Scholarship (PGSD) from NSERC (Natural Sciences and Engineering Research Council of Canada) and Sushant Sachdeva’s NSERC Discovery Grant.
University of Toronto
deeksha@cs.toronto.edu
Rasmus Kyng²
² The research leading to these results has received funding from the grant “Algorithms and complexity for high-accuracy flows and convex optimization” (no. 200021 204787) of the Swiss National Science Foundation.
ETH Zurich
kyng@inf.ethz.ch
Richard Peng³
³ This material is based upon work supported by the National Science Foundation under Grant No. 1846218, and by a Natural Sciences and Engineering Research Council of Canada Discovery Grant. Part of this work was done while the author was at the Georgia Institute of Technology.
University of Waterloo
y5peng@uwaterloo.ca
Sushant Sachdeva⁴
⁴ Sushant Sachdeva’s research is supported by an NSERC (Natural Sciences and Engineering Research Council of Canada) Discovery Grant.
University of Toronto
sachdeva@cs.toronto.edu
Abstract

The $\ell_p$-norm regression problem is a classic problem in optimization with wide-ranging applications in machine learning and theoretical computer science. The goal is to compute $\boldsymbol{\mathit{x}}^{\star}=\arg\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{x}}\|_{p}^{p}$, where $\boldsymbol{\mathit{x}}^{\star}\in\mathbb{R}^{n}$, $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{b}}\in\mathbb{R}^{d}$, and $d\leq n$. Efficient high-accuracy algorithms for the problem have been challenging both in theory and practice, and the state-of-the-art algorithms require $\mathrm{poly}(p)\cdot n^{\frac{1}{2}-\frac{1}{p}}$ linear system solves for $p\geq 2$. In this paper, we provide new algorithms for $\ell_p$-regression (and a more general formulation of the problem) that obtain a high-accuracy solution in $O(pn^{(p-2)/(3p-2)})$ linear system solves. We further propose a new inverse maintenance procedure that speeds up our algorithm to $\widetilde{O}(n^{\omega})$ total runtime, where $O(n^{\omega})$ denotes the running time for multiplying $n\times n$ matrices. Additionally, we give the first Iteratively Reweighted Least Squares (IRLS) algorithm that is guaranteed to converge to an optimum in a few iterations. Our IRLS algorithm has shown exceptional practical performance, beating the currently available implementations in MATLAB/CVX by 10-50x.

1 Introduction

Preliminary versions of the results in this paper have appeared as conference publications [Adi+19, APS19, AS20, Adi+21]. This paper unifies and simplifies results from the preliminary versions.

Linear regression in $\ell_p$-norm seeks to compute a vector $\boldsymbol{\mathit{x}}^{\star}\in\mathbb{R}^{n}$ such that

\boldsymbol{\mathit{x}}^{\star}=\arg\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{x}}\|^{p}_{p},

where $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{b}}\in\mathbb{R}^{d}$, and $d\leq n$. This is a classic convex optimization problem that captures several well-studied questions, including least squares regression ($p=2$), which is equivalent to solving a system of linear equations, and linear programming ($p=\infty$). The $\ell_p$-norm regression problem for $p>1$ has found use across a wide range of applications in machine learning and theoretical computer science, including low-rank matrix approximation [Chi+17], sparse recovery [CT05], graph-based semi-supervised learning [AL11, Cal19, RCL19, Kyn+15], and data clustering and learning problems [ETT15, EDT17, HFE18]. In this paper, we focus on solving the $\ell_p$-norm regression problem for $p\geq 2$. The exact solution to the $\ell_p$-norm regression problem for $p\neq 1,2,\infty$ may not even be expressible using rationals. Thus, the goal is often relaxed to finding an $\varepsilon$-approximate solution to the problem, i.e., finding $\hat{\boldsymbol{\mathit{x}}}$ such that $\boldsymbol{\mathit{A}}\hat{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{b}}$ and

\|\hat{\boldsymbol{\mathit{x}}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p},

for some small $\varepsilon>0$. Furthermore, several applications such as graph-based semi-supervised learning require that $\hat{\boldsymbol{\mathit{x}}}$ is close to $\boldsymbol{\mathit{x}}^{\star}$ coordinate-wise and not just in objective value, necessitating a high-accuracy solution with $\epsilon\approx\frac{1}{\mathrm{poly}(n)}$. In order to find such high-accuracy solutions efficiently, we require an algorithm whose runtime dependence on $\epsilon$ is $\mathrm{poly}\left(\log\frac{1}{\epsilon}\right)$ rather than $\mathrm{poly}\left(\frac{1}{\epsilon}\right)$.

Fast, high-accuracy algorithms for $\ell_p$-regression are challenging both in theory and practice, due to the lack of smoothness and strong convexity of the objective. The Interior Point Method framework of [NN94] can be used to compute a high-accuracy solution for all $p\in[1,\infty]$ in $\widetilde{O}(\sqrt{n})$ iterations (here and throughout, $\widetilde{O}$ hides constants, $p$ dependencies, $\log\frac{1}{\epsilon}$, and $\log n$ factors unless explicitly mentioned), with each iteration requiring the solution of an $n\times n$ system of linear equations. This was the most efficient algorithm for $\ell_p$-regression until 2018. In 2018, [Bub+18] showed that $\Omega(\sqrt{n})$ iterations are necessary for the interior point framework and proposed a new homotopy-based approach that computes a high-accuracy solution in $\widetilde{O}(n^{|\frac{1}{2}-\frac{1}{p}|})$ linear system solves for all $p\in(1,\infty)$. Their algorithms improve over the interior point method by $n^{\Omega(1)}$ factors for values of $p$ bounded away from $1$ and $\infty$. However, for $p$ approaching $1$ or $\infty$, the number of linear system solves required by their algorithm approaches $\sqrt{n}$, the same as required by interior point methods. Finding an algorithm for $\ell_p$-regression requiring $o(n^{1/2})$ linear system solves has been a long-standing open problem.

Among practical algorithms for the $\ell_p$-norm regression problem, the Iteratively Reweighted Least Squares (IRLS) methods stand out due to their simplicity, and have been studied since 1961 [Law61]. For some range of values of $p$, IRLS converges rapidly. However, the method is guaranteed to converge only for $p\in(1.5,3)$ and diverges even for values of $p$ as small as $3.5$ [RCL19]. Over the years, several empirical modifications of the algorithm have been used for various applications in practice (refer to [Bur12] for a full survey). However, an IRLS algorithm that is guaranteed to converge to the optimum in a few iterations for all values of $p$ has again been a long-standing challenge.

1.1 Our Contributions

In this paper, we present the first algorithm for the $\ell_p$-regression problem that finds a high-accuracy solution in at most $O(pn^{1/3})=o(n^{1/2})$ linear system solves, which has been a long sought-after goal in optimization. Our algorithm builds on a new iterative refinement framework for $\ell_p$-norm objectives that allows us to find a high-accuracy solution using low-accuracy solutions to a subproblem. The iterative refinement framework allows the subproblems to be solved to an $n^{o(1)}$-approximation, which has been useful in several follow-up works on graph optimization (see Section 1.3). We further propose a new inverse maintenance framework and show how to speed up our algorithm to solve the $\ell_p$-norm problem to high accuracy in total time $\widetilde{O}(n^{\omega})$. Finally, we give the first IRLS algorithm that provably converges to a high-accuracy solution in a few iterations.

Preliminary versions of the results presented in this paper have appeared in previous conference publications by [Adi+19, APS19, AS20, Adi+21]. In this paper, we present our results for a more general formulation of the $\ell_p$-regression problem,

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}   (1)

for matrices $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{1}\times n}$, $\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{2}\times n}$ with $m_{1},m_{2}\geq n$ and $d\leq n$. Let $m=\max\{m_{1},m_{2}\}$, and assume $\boldsymbol{\mathit{d}}\perp\{\ker(\boldsymbol{\mathit{M}})\cap\ker(\boldsymbol{\mathit{N}})\cap\ker(\boldsymbol{\mathit{A}})\}$ and $\boldsymbol{\mathit{b}}\in\mathrm{im}(\boldsymbol{\mathit{A}})$, so that the above problem has a bounded solution. Our first result is a fast, high-accuracy algorithm for Problem (1).

Theorem 1.1.

Let $\epsilon>0$ and $p\geq 2$. There is an algorithm that, starting from $\boldsymbol{\mathit{x}}^{(0)}$ satisfying $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}$, finds an $\epsilon$-approximate solution to Problem (1) in $O\left(pm^{\frac{p-2}{3p-2}}\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)$ calls to a linear system solver.

As a corollary, for the $\ell_p$-norm regression problem, i.e., $\boldsymbol{\mathit{d}}=\boldsymbol{\mathit{M}}=0$ and $\boldsymbol{\mathit{N}}=\boldsymbol{\mathit{I}}$, our algorithm converges in $O(pn^{\frac{p-2}{3p-2}}\log\frac{n}{\epsilon})$ calls to a linear system solver. This is the first algorithm that converges to a high-accuracy solution at an asymptotic rate of $\widetilde{O}(n^{1/3})=o(n^{1/2})$ for all $p\in[2,\infty)$, and is thus faster than all previously known algorithms by at least a factor of $n^{\Omega(1)}$. As a result, we answer the long-standing open question in optimization of whether such a rate of convergence can be achieved.

Our next result shows how to speed up our algorithms and solve Problem (1) in time $\widetilde{O}(m^{\omega})$ (or $\widetilde{O}(n^{\omega})$ for $\ell_p$-regression), where $\omega\approx 2.37$ and $O(n^{\omega})$ is the current time required for multiplying two $n\times n$ matrices. This is almost as fast as solving a single system of linear equations. We achieve this guarantee via a new inverse maintenance procedure for $\ell_p$-regression and prove the following result.

Theorem 1.2.

If $\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}}$ are explicitly given matrices with polynomially bounded condition numbers, and $p\geq 2$, there is an algorithm for Problem (1) that can be implemented to run in total time $\widetilde{O}(m^{\omega})$.

Our inverse maintenance algorithm is presented in Section 5, where we also give a more fine-grained dependence on the parameters $m_{1},m_{2},n$ and $p$ in the rate of convergence (Theorem 5.1). Our algorithms and techniques for $\ell_p$-regression have motivated a line of work in graph optimization and the study of accelerated width-reduced methods, which we describe in detail in Section 1.3.

Our next contribution concerns the IRLS approach. For the $\ell_p$-regression problem, i.e., $\boldsymbol{\mathit{d}}=\boldsymbol{\mathit{M}}=0$ in (1), we give an IRLS algorithm that globally converges to the optimum in at most $O\left(p^{3}m^{\frac{p-2}{2(p-1)}}\log\frac{m}{\epsilon}\right)$ linear system solves for all $p\geq 2$ (Section 6). This is the first IRLS algorithm that is guaranteed to converge to the optimum for all values of $p\geq 2$, with a quantitative bound on the runtime. Our IRLS algorithm has proven to be very fast and robust in practice and is faster than existing implementations in MATLAB/CVX by 10-50x. These speed-ups are demonstrated in experiments performed in [APS19], and we present these results along with our algorithm in Section 6.

Theorem 1.3.

Let $p\geq 2$. Algorithm 10 returns $\boldsymbol{\mathit{x}}$ such that $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}$ and $\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p}$, in at most $O\left(p^{3}m^{\frac{p-2}{2(p-1)}}\log\left(\frac{m}{\epsilon}\right)\right)$ calls to a linear system solver.

The analysis of our IRLS algorithm fits into the overall framework of this paper. Such an algorithm first appeared in the conference paper by [APS19], where they also ran some experiments to demonstrate the performance of their IRLS algorithm in practice. We include some of their experimental results to show that the rate of convergence in practice is even better than the theoretical bounds.

1.2 Technical Overview

Overall $\log\frac{1}{\epsilon}$ Convergence

Our algorithm follows an overall iterative refinement approach for $p\geq 2$: we show that $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}+\delta)-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})$ can be upper bounded by the function $\boldsymbol{res}_{p}(\delta)=\boldsymbol{\mathit{g}}^{\top}\delta+\|\boldsymbol{\mathit{R}}\delta\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\delta\|_{p}^{p}$, and lower bounded by a similar function. Here, the vector $\boldsymbol{\mathit{g}}$ and matrix $\boldsymbol{\mathit{R}}$ depend on $\boldsymbol{\mathit{x}}$, and the matrix $\boldsymbol{\mathit{N}}$ is as defined in Problem (1). We prove that if we can solve $\min_{\boldsymbol{\mathit{A}}\delta=0}\boldsymbol{res}_{p}(\delta)$ to a $\kappa$-approximation, then $O(p\kappa\log((\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star}))/\varepsilon))$ such solves (iterations) suffice to obtain an $\varepsilon$-approximate solution to Problem (1) (Theorem 2.1). We call this problem the Residual Problem and this process Iterative Refinement for $\ell_p$-norms.

Solving the Residual Problem

We next perform a binary search on the linear term of the residual problem and reduce it to solving $O(\log p)$ problems of the form $\min_{\boldsymbol{\mathit{A}}\delta=\boldsymbol{\mathit{c}}}\|\boldsymbol{\mathit{R}}\delta\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\delta\|_{p}^{p}$ (Lemma 3.1). In order to solve these new problems, we use a multiplicative weight update routine that returns a constant-factor approximate solution in $O(pm^{(p-2)/(3p-2)})$ calls to a linear system solver (Theorem 3.2). We can thus find a constant-factor approximate solution to the residual problem in $O(pm^{(p-2)/(3p-2)}\log p)$ calls to a linear system solver (Corollary 3.7). Combined with iterative refinement, we obtain an algorithm that converges in $O\left(p^{2}m^{\frac{p-2}{3p-2}}\log p\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)\leq\widetilde{O}\left(p^{2}m^{1/3}\log\frac{1}{\epsilon}\right)$ linear system solves.
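To make the shape of such a routine concrete, below is a minimal, schematic numpy sketch of a reweighted-solve loop of this general flavor: every iteration costs one linear system solve, and coordinates where $|\boldsymbol{\mathit{N}}\delta|$ is large receive multiplicatively heavier weights. The function names, the specific update rule, and the omission of the width-reduction steps are illustrative simplifications and not the exact routine of Section 3.

```python
import numpy as np

def weighted_quadratic_step(A, c, R, N, w, reg=1e-12):
    """One oracle call: solve min_{A delta = c} ||R delta||_2^2 + sum_i w_i (N delta)_i^2
    by forming the KKT (saddle-point) system, i.e. a single linear system solve."""
    d, n = A.shape
    H = 2 * (R.T @ R + N.T @ (w[:, None] * N)) + reg * np.eye(n)
    K = np.block([[H, A.T], [A, np.zeros((d, d))]])
    rhs = np.concatenate([np.zeros(n), c])
    return np.linalg.solve(K, rhs)[:n]

def mwu_residual_sketch(A, c, R, N, p, iters=20, alpha=0.1):
    """Schematic multiplicative-weight-update loop for
    min_{A delta = c} ||R delta||_2^2 + ||N delta||_p^p with p >= 2:
    reweight, resolve, and average the iterates."""
    w = np.ones(N.shape[0])
    iterates = []
    for _ in range(iters):
        delta = weighted_quadratic_step(A, c, R, N, w)
        z = np.abs(N @ delta)
        # coordinates with large |N delta| get multiplicatively heavier weights
        w *= (1 + alpha * z / (z.max() + 1e-12)) ** (p - 2)
        iterates.append(delta)
    return np.mean(iterates, axis=0)
```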

Improving $p$ Dependence

Furthermore, we prove that for any $q\neq p$, given a $p$-norm residual problem, we can construct a corresponding $q$-norm residual problem such that a $\beta$-approximate solution to the $q$-norm residual problem roughly gives an $O(\beta^{2})m^{|\frac{1}{p}-\frac{1}{q}|}$-approximate solution to the $p$-norm residual problem (Theorem 4.3). As a consequence, if $p$ is large, i.e. $p\geq\log m$, a constant-factor approximate solution to the corresponding $\log m$-norm residual problem gives an $O(m^{\frac{1}{\log m}})\leq O(1)$-approximate solution to the $p$-norm residual problem in at most $O(\log m\cdot m^{\frac{\log m-2}{3\log m-2}})\leq\widetilde{O}(m^{\frac{p-2}{3p-2}})$ calls to a linear system solver. Combining this with the algorithm described in the previous paragraph, we obtain our final guarantees as described in Theorem 1.1.

$\ell_p$-Regression in Matrix Multiplication Time

We next describe how to obtain the guarantees of Theorem 1.2. While solving the residual problem, the algorithm solves a system of linear equations at every iteration. The key observation for obtaining improved running times is that the weights determining these linear systems change slowly. Thus, we can maintain a spectral approximation to the linear system via a sequence of lazy low-rank updates. The Sherman-Morrison-Woodbury formula then allows us to update the inverse quickly. We can use the spectral approximation as a preconditioner for solving the linear system quickly at each iteration. Thus, we obtain a speed-up since the linear systems do not need to be solved from scratch at each iteration, giving Theorem 1.2.
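The linear-algebraic primitive behind these lazy updates is the Sherman-Morrison-Woodbury identity. The following small numpy sketch (an illustration of the primitive, not the paper's actual data structure) updates a maintained inverse of $\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{W}}\boldsymbol{\mathit{N}}$ after only $k$ of the $m$ diagonal weights change, at cost $O(n^{2}k)$ instead of a fresh inversion.

```python
import numpy as np

def woodbury_update(H_inv, U, C):
    """Given H_inv = H^{-1}, return (H + U C U^T)^{-1} via Sherman-Morrison-Woodbury.
    If U has k << n columns, this costs O(n^2 k) instead of a new O(n^omega) inversion."""
    S = np.linalg.inv(C) + U.T @ H_inv @ U            # k x k capacitance matrix
    return H_inv - H_inv @ U @ np.linalg.solve(S, U.T @ H_inv)

# toy usage: maintain the inverse of N^T diag(w) N as a few weights change
rng = np.random.default_rng(0)
n, m = 50, 200
N = rng.standard_normal((m, n))
w = np.ones(m)
H_inv = np.linalg.inv(N.T @ (w[:, None] * N))

changed = np.array([3, 17, 42])                       # coordinates whose weight changed
dw = rng.uniform(0.1, 0.5, size=changed.size)         # weight increments
H_inv = woodbury_update(H_inv, N[changed].T, np.diag(dw))

# sanity check against recomputing the inverse from scratch
w[changed] += dw
assert np.allclose(H_inv, np.linalg.inv(N.T @ (w[:, None] * N)), atol=1e-6)
```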

Good Starting Solution

For pure $\ell_p$-norm objectives, i.e., $\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}$, we further show how to find a starting solution $\boldsymbol{\mathit{x}}^{(0)}$ such that $\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{(0)}\|_{p}^{p}\leq O(m)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p}$. The key idea is that for any $k$, a constant-factor approximate solution to the $k$-norm problem is an $O(m)$-approximate solution to the $2k$-norm problem (Lemma 2.9). This inspires a homotopy approach, where we first solve an $\ell_2$-norm problem, followed by $\ell_{2^{2}},\ell_{2^{3}},\ldots,\ell_{2^{\lceil\log p\rceil}}$-norm problems to constant-factor approximations. We can thus obtain the required starting solution in at most $O\left(pm^{\frac{p-2}{3p-2}}\log m\log^{2}p\right)$ calls to a linear system solver.

IRLS Algorithm

For the IRLS algorithm, given the residual problem at an iteration, we show how to construct a weighted least squares problem whose solution is an $O\left(p^{2}m^{\frac{p-2}{2(p-1)}}\right)$-approximate solution to the residual problem (Lemma 6.1). This result, along with the overall iterative refinement framework, culminates in our IRLS algorithm, which directly solves one of these weighted least squares problems in every iteration. A minimal sketch of the classical version of such an iteration appears below.
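For intuition, the sketch below is the classical (textbook) IRLS iteration for $\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{x}}\|_{p}$ in numpy: every step is one weighted least-squares solve with weights $|\boldsymbol{\mathit{x}}_{i}|^{p-2}$. This plain version is the scheme that can diverge for larger $p$; Algorithm 10 in Section 6 instead derives the weighted least squares problem from the residual problem, which is what yields the guarantee of Theorem 1.3.

```python
import numpy as np

def irls_lp(A, b, p, iters=50, eps=1e-8):
    """Textbook IRLS for min ||x||_p subject to A x = b (p >= 2).
    Each iteration solves min sum_i w_i x_i^2 s.t. A x = b, whose closed form is
    x = W^{-1} A^T (A W^{-1} A^T)^{-1} b with W = diag(w), w_i = |x_i|^{p-2}."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]        # start from the l2 (least-squares) solution
    for _ in range(iters):
        w = np.maximum(np.abs(x), eps) ** (p - 2)   # clamp to avoid zero weights
        Winv = 1.0 / w
        y = np.linalg.solve(A @ (Winv[:, None] * A.T), b)
        x = Winv * (A.T @ y)
    return x
```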

1.3 Related Works

$\ell_p$-Regression

Until 2018, the fastest high-accuracy algorithms for $\ell_p$-regression, including the Interior Point Method framework of [NN94] and the homotopy method of [Bub+18], asymptotically required $\approx O(\sqrt{n})$ linear system solves. The first algorithm for $\ell_p$-regression to beat the $\sqrt{n}$ iteration bound was that of [Adi+19], which was faster than all previously known algorithms and asymptotically required at most $\approx O(p^{O(p)}n^{1/3})$ iterations for all $p>1$. Concurrently, [Bul18] used tools from convex optimization to give an algorithm for $p=4$ that matches the rates of [Adi+19] up to logarithmic factors. Subsequent works have improved the $p$ dependence [AS20, Adi+21] and proposed alternate methods for obtaining matching rates (up to logarithmic and $p$ factors) [Car+20]. A recent work by [JLS22] shows how to solve $\ell_p$-regression in $\approx n+\mathrm{poly}(p)\cdot d^{\frac{p-2}{3p-2}}$ iterations, where $d$ is the smaller dimension of the constraint matrix $\boldsymbol{\mathit{A}}$.

Width Reduced MWU Algorithms

Width reduction is a technique that has been used repeatedly in multiplicative weight update algorithms to speed up rates of convergence from $m^{1/2}$ to $m^{1/3}$, where $m$ is the size of the input. This technique first appeared in the work of [Chr+11] in the context of the maximum flow problem, where for a graph with $n$ vertices and $m$ edges it improves the iteration complexity from $\widetilde{O}(m^{1/2})$ to $\widetilde{O}(m^{1/3})$. Similar improvements were later obtained in algorithms for $\ell_{1}$- and $\ell_{\infty}$-regression by [Chi+13, EV19], for $\ell_p$-regression ($p\geq 2$) [Adi+19], and in algorithms for matrix scaling [All+17]. In a recent work, [ABS21] extend this technique to improve iteration complexities for all quasi-self-concordant objectives, which include soft-max and logistic regression among others.

Inverse Maintenance

Inverse maintenance is a technique used to speed up iterative algorithms. It was first introduced by [Vai90] in the context of minimum cost and multicommodity flows, and has further been used in interior point methods [LS14, LSW15]. In 2019, [Adi+19] developed an inverse maintenance method for $\ell_p$-regression that reuses inverses, exploiting the controllable rates of change of the underlying variables.

IRLS Algorithms

Iteratively Reweighted Least Squares algorithms are simple to implement and have thus been used in a wide range of applications, including sparse signal reconstruction [GR97], compressive sensing [CY08], and Chebyshev approximation in FIR filter design [BB94]. Refer to [Bur12] for a full survey. The works of [Osb85] and [Kar70] show convergence in the limit and under certain assumptions on the starting solution. For $\ell_{1}$-regression, [SV16b, SV16a, SV16] show quantitative convergence bounds. In 2019, [APS19] gave the first IRLS algorithm with quantitative bounds that is guaranteed to converge with no conditions on the starting point. Their algorithm also works well in practice, as suggested by their experiments.

Follow-up Work in Graph Optimization

The $\ell_p$-norm flow problem, which asks to minimize the $\ell_p$-norm of a flow vector subject to demand constraints, can be modeled as an $\ell_p$-regression problem. The maximum flow problem is the special case $p=\infty$. For graphs with $n$ vertices and $m$ edges, the $\ell_p$-norm regression algorithm of [Adi+19], when combined with fast Laplacian solvers, directly gives an $\approx\widetilde{O}(p^{O(p)}m^{4/3})$ time algorithm for the $\ell_p$-norm flow problem. Building on their work, specifically the iterative refinement framework, which allows one to solve these problems to high accuracy while only requiring an $m^{o(1)}$-approximate solution to an $\ell_p$-norm subproblem, [Kyn+19] give an algorithm for unweighted graphs that runs in time $\exp(p^{3/2})m^{1+\frac{7}{\sqrt{p-1}}+o(1)}$. We note that their algorithm runs in time $m^{1+o(1)}$ for $p=\sqrt{\log m}$. Further works, including [Adi+21], also utilize the iterative refinement guarantees to give an algorithm with runtime $p(m^{1+o(1)}+n^{4/3+o(1)})$ for weighted $\ell_p$-norm flow problems, by designing new sparsification algorithms that preserve $\ell_p$-norm objectives of the subproblem to an $m^{o(1)}$-approximation. For the maximum flow problem, [AS20] give an $m^{1+o(1)}\epsilon^{-1}$ time algorithm for the approximate maximum flow problem on unweighted graphs. [KLS20] build on these works further and give an algorithm that computes a maximum $s$-$t$ flow on graphs with integer edge capacities at most $U$ in time $m^{4/3+o(1)}U^{1/3}$. In a recent breakthrough, [Che+22] give an algorithm for the maximum flow problem and the $\ell_p$-norm flow problem that runs in almost linear time, $m^{1+o(1)}$.

1.4 Organization of Paper

Section 2 describes the overall iterative refinement framework, first for $p\geq 2$ and then for $p\in(1,2)$; at the end, we show how to find good starting solutions for pure $\ell_p$-norm objectives for $p\geq 2$. Section 3 describes the width-reduced multiplicative weight update routine used to solve the residual problem. In Section 4, we show how to solve $p$-norm residual problems using $q$-norm residual problems and give our overall algorithm (Algorithm 6). Section 5 contains our new inverse maintenance algorithm that allows us to solve $\ell_p$-regression almost as fast as linear regression. Finally, in Section 6, we give an IRLS algorithm and present some experimental results from [APS19].

2 Iterative Refinement for $\ell_p$-norms

Recall that we would like to find a high-accuracy solution for the problem,

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}

for matrices $\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n}$, $\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{1}\times n}$, $\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{2}\times n}$ with $m_{1},m_{2}\geq n$ and $d\leq n$.

A common approach in smooth, convex optimization is to upper bound the function by its first-order Taylor expansion plus a quadratic term (smoothness), and to minimize this bound repeatedly to converge to the optimum. Additionally, when the function has a similar quadratic lower bound (strong convexity), it can be shown that minimizing this upper bound $O\left(\log\frac{1}{\epsilon}\right)$ times (hiding problem-dependent parameters) is sufficient to converge to an $\epsilon$-approximate solution. The $\ell_p$-norm function satisfies no such quadratic upper bound, since it grows too steeply, and no such lower bound, since it is too flat around $0$. In this section, we show that we can instead upper and lower bound the $\ell_p$ function for $p\geq 2$ by a second-order Taylor expansion plus an $\ell_{p}^{p}$ term. We show that it is sufficient to minimize such a bound to a $\kappa$-approximation $O\left(p\kappa\log\frac{1}{\epsilon}\right)$ times. Such an iterative refinement method was previously only known for $p=2$, and we thus call this algorithm Iterative Refinement for $\ell_p$-norms. In later sections, we show different ways to minimize this upper bound approximately to obtain fast algorithms.

For $p\in(1,2)$, we use a smoothed function that is quadratic in a small range around $0$ and grows as $\ell_{p}^{p}$ otherwise. We use this function to give upper and lower bounds and a similar iterative refinement scheme.

We further show how to obtain a good starting solution for Problem (1) in the special case when the vector $\boldsymbol{\mathit{d}}$ and matrix $\boldsymbol{\mathit{M}}$ are zero, i.e., the objective function is only the $\ell_p$-norm function.

These sections are based on the results and proofs from [Adi+19, APS19, AS20, Adi+21].

2.1 Iterative Refinement

We will prove that the following algorithm can be used to obtain a high-accuracy solution, i.e., a $\log\frac{1}{\epsilon}$ rate of convergence, for $\ell_p$-regression.

Algorithm 1 Iterative Refinement
1: procedure Main-Solver($\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},p,\epsilon$)
2:     $\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}^{(0)}$
3:     $\nu\leftarrow$ bound on $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})$ $\triangleright$ If $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\geq 0$, then $\nu\leftarrow\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})$
4:     while $\nu>\epsilon$ do
5:         $\widetilde{\Delta}\leftarrow$ ResidualSolver($\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p$)
6:         if $\boldsymbol{res}_{p}(\widetilde{\Delta})\geq\frac{\nu}{32p\kappa}$ then
7:             $\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}-\frac{\widetilde{\Delta}}{p}$
8:         else
9:             $\nu\leftarrow\frac{\nu}{2}$
10:    return $\boldsymbol{\mathit{x}}$
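For readers who prefer code, a direct Python transcription of Algorithm 1's control flow is given below. The callbacks residual_solver and res_value are placeholders for ResidualSolver and for evaluating $\boldsymbol{res}_{p}$; they are problem-specific and not specified by the pseudocode above.

```python
def main_solver(x0, f, residual_solver, res_value, p, eps, kappa=1.0, nu0=None):
    """Transcription of Algorithm 1 (Iterative Refinement).
    residual_solver(x, nu) should return a kappa-approximate solution of the
    residual problem at x; res_value(x, Delta) should evaluate res_p(Delta)."""
    x = x0.copy()
    # nu is a bound on f(x0) - f(x*); if f(x*) >= 0 then f(x0) itself is such a bound
    nu = nu0 if nu0 is not None else f(x)
    while nu > eps:
        Delta = residual_solver(x, nu)
        if res_value(x, Delta) >= nu / (32 * p * kappa):
            x = x - Delta / p          # refinement step (Line 7)
        else:
            nu = nu / 2                # halve the progress bound (Line 9)
    return x
```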

Specifically, we will prove,

Theorem 2.1.

Let $p\geq 2$ and $\kappa\geq 1$. Let the initial solution $\boldsymbol{\mathit{x}}^{(0)}$ satisfy $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}$. Algorithm 1 returns an $\epsilon$-approximate solution $\boldsymbol{\mathit{x}}$ of Problem (1) in at most $O\left(p\kappa\log\left(\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)\right)$ calls to a $\kappa$-approximate solver for the residual problem (Definition 2.3).

Before we prove the above result, we will define some of the terms used in the above statement.

2.1.1 Preliminaries

Definition 2.2 ($\epsilon$-Approximate Solution).

Let $\boldsymbol{\mathit{x}}^{\star}$ denote the optimizer of Problem (1). We say $\widetilde{\boldsymbol{\mathit{x}}}$ is an $\epsilon$-approximate solution to (1) if $\boldsymbol{\mathit{A}}\widetilde{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{b}}$ and

\boldsymbol{\mathit{f}}(\widetilde{\boldsymbol{\mathit{x}}})\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})+\epsilon.
Definition 2.3 (Residual Problem).

For any $p\geq 2$, we define the residual problem $res_{p}(\Delta)$ for (1) at a feasible $\boldsymbol{\mathit{x}}$ as

\max_{\boldsymbol{\mathit{A}}\Delta=0}\quad\boldsymbol{res}_{p}(\Delta)\stackrel{\mathrm{def}}{=}\boldsymbol{\mathit{g}}^{\top}\Delta-\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p},\text{ where}

\boldsymbol{\mathit{g}}=\frac{1}{p}\boldsymbol{\mathit{d}}+\frac{2}{p}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{N}}^{\top}Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\quad\text{and}\quad\boldsymbol{\mathit{R}}=\frac{2}{p^{2}}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+2\boldsymbol{\mathit{N}}^{\top}Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}.
Definition 2.4 (Approximation to Residual Problem).

Let $p\geq 2$ and let $\Delta^{\star}$ be the optimum of the residual problem. $\widetilde{\Delta}$ is a $\kappa$-approximation to the residual problem if $\boldsymbol{\mathit{A}}\widetilde{\Delta}=0$ and

res_{p}(\widetilde{\Delta})\geq\frac{1}{\kappa}res_{p}(\Delta^{\star}).
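To make Definitions 2.3 and 2.4 concrete, the following small numpy helper (illustrative only) evaluates the vector $\boldsymbol{\mathit{g}}$, the matrix $\boldsymbol{\mathit{R}}$, and the residual objective $\boldsymbol{res}_{p}(\Delta)$ at a feasible $\boldsymbol{\mathit{x}}$.

```python
import numpy as np

def residual_terms(x, d, M, N, p):
    """g and R from Definition 2.3 at the point x."""
    Nx = N @ x
    D = np.abs(Nx) ** (p - 2)                          # diagonal entries |N x|^{p-2}
    g = d / p + (2.0 / p) * M.T @ (M @ x) + N.T @ (D * Nx)
    R = (2.0 / p**2) * M.T @ M + 2.0 * N.T @ (D[:, None] * N)
    return g, R

def res_p(x, Delta, d, M, N, p):
    """Objective value of the residual problem, res_p(Delta)."""
    g, R = residual_terms(x, d, M, N, p)
    return g @ Delta - Delta @ (R @ Delta) - np.sum(np.abs(N @ Delta) ** p)
```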

2.1.2 Bounding Change in Objective

In order to prove our result, we first show that we can upper and lower bound the change in our $\ell_p$-objective by a linear term plus a quadratic term plus an $\ell_p$-norm term.

Lemma 2.5.

For any $\boldsymbol{\mathit{x}},\Delta$ and $p\geq 2$, we have, for vectors $\boldsymbol{\mathit{r}},\boldsymbol{\mathit{g}}$ defined coordinate-wise as $\boldsymbol{\mathit{r}}=|\boldsymbol{\mathit{x}}|^{p-2}$ and $\boldsymbol{\mathit{g}}=p|\boldsymbol{\mathit{x}}|^{p-2}\boldsymbol{\mathit{x}}$,

\frac{p}{8}\sum_{i}\boldsymbol{\mathit{r}}_{i}\Delta_{i}^{2}+\frac{1}{2^{p+1}}\|\Delta\|_{p}^{p}\leq\|\boldsymbol{\mathit{x}}+\Delta\|^{p}_{p}-\|\boldsymbol{\mathit{x}}\|_{p}^{p}-\boldsymbol{\mathit{g}}^{\top}\Delta\leq 2p^{2}\sum_{i}\boldsymbol{\mathit{r}}_{i}\Delta_{i}^{2}+p^{p}\|\Delta\|_{p}^{p}.
Proof.

To show this, we show that the above holds for all coordinates. For a single coordinate, the above expression is equivalent to proving,

p8|x|p2Δ2+12p+1|Δ|p|𝒙+Δ|p|𝒙|pp|x|p1sgn(x)Δ2p2|x|p2Δ2+pp|Δ|p.\frac{p}{8}|x|^{p-2}\Delta^{2}+\frac{1}{2^{p+1}}\mathopen{}\mathclose{{}\left|\Delta}\right|^{p}\leq\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{x}}+\Delta}\right|^{p}-\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{x}}}\right|^{p}-p\mathopen{}\mathclose{{}\left|x}\right|^{p-1}sgn(x)\Delta\leq 2p^{2}|x|^{p-2}\Delta^{2}+p^{p}\mathopen{}\mathclose{{}\left|\Delta}\right|^{p}.

Let Δ=αx\Delta=\alpha x. Since the above clearly holds for x=0x=0, it remains to show for all α\alpha,

p8α2+12p+1|α|p|1+α|p1pα2p2α2+pp|α|p.\frac{p}{8}\alpha^{2}+\frac{1}{2^{p+1}}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}\leq\mathopen{}\mathclose{{}\left|1+\alpha}\right|^{p}-1-p\alpha\leq 2p^{2}\alpha^{2}+p^{p}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}.
  1. 1.

    α1\alpha\geq 1:
    In this case, 1+α2αpα1+\alpha\leq 2\alpha\leq p\cdot\alpha. So, |1+α|ppp|α|p\mathopen{}\mathclose{{}\left|1+\alpha}\right|^{p}\leq p^{p}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p} and the right inequality directly holds. To show the other side, let

    h(α)=(1+α)p1pαp8α212p+1αp.h(\alpha)=(1+\alpha)^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{1}{2^{p+1}}{\alpha}^{p}.

    We have,

    h(α)=p(1+α)p1pp4αp2p+1αp1h^{\prime}(\alpha)=p(1+\alpha)^{p-1}-p-\frac{p}{4}\alpha-\frac{p}{2^{p+1}}{\alpha}^{p-1}

    and

    h′′(α)=p(p1)(1+α)p2p4p(p1)2p+1αp20.h^{\prime\prime}(\alpha)=p(p-1)(1+\alpha)^{p-2}-\frac{p}{4}-\frac{p(p-1)}{2^{p+1}}{\alpha}^{p-2}\geq 0.

    Since h′′(α)0h^{\prime\prime}(\alpha)\geq 0, h(α)h(1)0h^{\prime}(\alpha)\geq h^{\prime}(1)\geq 0. So hh is an increasing function in α\alpha and h(α)h(1)0h(\alpha)\geq h(1)\geq 0.

  2. 2.

    α1\alpha\leq-1:
    Now, |1+α|1+|α|p|α|\mathopen{}\mathclose{{}\left|1+\alpha}\right|\leq 1+\mathopen{}\mathclose{{}\left|\alpha}\right|\leq p\cdot\mathopen{}\mathclose{{}\left|\alpha}\right|, and 2α2p2|α|p02\alpha^{2}p^{2}-\mathopen{}\mathclose{{}\left|\alpha}\right|p\geq 0. As a result,

    |1+α|p|α|p+2α2p2+pp|α|p\mathopen{}\mathclose{{}\left|1+\alpha}\right|^{p}\leq-\mathopen{}\mathclose{{}\left|\alpha}\right|p+2\alpha^{2}p^{2}+p^{p}\cdot\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}

    which gives the right inequality. Consider,

    h(α)=|1+α|p1pαp8α212p+1|α|p.h(\alpha)=|1+\alpha|^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{1}{2^{p+1}}|\alpha|^{p}.
    h(α)=p|1+α|p1pp4α+p12p+1|α|p1.h^{\prime}(\alpha)=-p|1+\alpha|^{p-1}-p-\frac{p}{4}\alpha+p\frac{1}{2^{p+1}}|\alpha|^{p-1}.

    Let β=α\beta=-\alpha. The above expression now becomes,

    p(β1)p1p+p4β+p12p+1βp1.-p(\beta-1)^{p-1}-p+\frac{p}{4}\beta+p\frac{1}{2^{p+1}}\beta^{p-1}.

    We know that β1\beta\geq 1. When β2\beta\geq 2, β2β1\frac{\beta}{2}\leq\beta-1 and β2(β2)p1\frac{\beta}{2}\leq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}. This gives us,

    p4β+p12p+1βp1p2(β2)p1+p2(β2)p1p(β1)p1\frac{p}{4}\beta+p\frac{1}{2^{p+1}}\beta^{p-1}\leq\frac{p}{2}\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}+\frac{p}{2}\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}\leq p(\beta-1)^{p-1}

    giving us h(α)0h^{\prime}(\alpha)\leq 0 for α2\alpha\leq-2. When β2\beta\leq 2, β2(β2)p1\frac{\beta}{2}\geq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1} and β21\frac{\beta}{2}\leq 1.

    p4β+p12p+1βp1p2β2+p2β2p\frac{p}{4}\beta+p\frac{1}{2^{p+1}}\beta^{p-1}\leq\frac{p}{2}\cdot\frac{\beta}{2}+\frac{p}{2}\cdot\frac{\beta}{2}\leq p

    giving us h(α)0h^{\prime}(\alpha)\leq 0 for 2α1-2\leq\alpha\leq-1. Therefore, h(α)0h^{\prime}(\alpha)\leq 0 giving us, h(α)h(1)0h(\alpha)\geq h(-1)\geq 0, thus giving the left inequality.

  3. 3.

    |α|1\mathopen{}\mathclose{{}\left|\alpha}\right|\leq 1:
    Let s(α)=1+pα+2p2α2+pp|α|p(1+α)p.s(\alpha)=1+p\alpha+2p^{2}\alpha^{2}+p^{p}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}-(1+\alpha)^{p}. Now,

    s(α)=p+4p2α+pp+1|α|p1sgn(α)p(1+α)p1.s^{\prime}(\alpha)=p+4p^{2}\alpha+p^{p+1}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-1}sgn(\alpha)-p(1+\alpha)^{p-1}.

    When α0\alpha\leq 0, we have,

    s(α)=p+4p2αpp+1|α|p1p(1+α)p1.s^{\prime}(\alpha)=p+4p^{2}\alpha-p^{p+1}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-1}-p(1+\alpha)^{p-1}.

    and

    s′′(α)=4p2+pp+1(p1)|α|p2p(p1)(1+α)p12p2+pp+1(p1)|α|p2p(p1)0.s^{\prime\prime}(\alpha)=4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)(1+\alpha)^{p-1}\geq 2p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)\geq 0.

    So ss^{\prime} is an increasing function of α\alpha which gives us, s(α)s(0)=0s^{\prime}(\alpha)\leq s^{\prime}(0)=0. Therefore ss is a decreasing function, and the minimum is at 0 which is 0. This gives us our required inequality for α0\alpha\leq 0. When α1p1\alpha\geq\frac{1}{p-1}, 1+αpα1+\alpha\leq p\cdot\alpha and s(α)0s^{\prime}(\alpha)\geq 0. We are left with the range 0α1p10\leq\alpha\leq\frac{1}{p-1}. Again, we have,

    s′′(α)\displaystyle s^{\prime\prime}(\alpha) =4p2+pp+1(p1)|α|p2p(p1)(1+α)p1\displaystyle=4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)(1+\alpha)^{p-1}
    4p2+pp+1(p1)|α|p2p(p1)(1+1p1)p1\displaystyle\geq 4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)(1+\frac{1}{p-1})^{p-1}
    4p2+pp+1(p1)|α|p2p(p1)e,When p gets large the last term approaches e\displaystyle\geq 4p^{2}+p^{p+1}(p-1)\mathopen{}\mathclose{{}\left|\alpha}\right|^{p-2}-p(p-1)e,\text{When $p$ gets large the last term approaches $e$}
    0.\displaystyle\geq 0.

    Therefore, ss^{\prime} is an increasing function, s(α)s(0)=0s^{\prime}(\alpha)\geq s^{\prime}(0)=0. This implies ss is an increasing function, giving, s(α)s(0)=0s(\alpha)\geq s(0)=0 as required.

    To show the other direction,

    h(α)=(1+α)p1pαp8α212p+1|α|p(1+α)p1pαp8α2p8α2=(1+α)p1pαp4α2.h(\alpha)=(1+\alpha)^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{1}{2^{p+1}}\mathopen{}\mathclose{{}\left|\alpha}\right|^{p}\geq(1+\alpha)^{p}-1-p\alpha-\frac{p}{8}\alpha^{2}-\frac{p}{8}{\alpha}^{2}=(1+\alpha)^{p}-1-p\alpha-\frac{p}{4}\alpha^{2}.

    Now, since p2p\geq 2,

    ((1+α)p21)sgn(α)0\displaystyle\mathopen{}\mathclose{{}\left((1+\alpha)^{p-2}-1}\right)sgn(\alpha)\geq 0
    \displaystyle\Rightarrow ((1+α)p11α)sgn(α)0\displaystyle\mathopen{}\mathclose{{}\left((1+\alpha)^{p-1}-1-\alpha}\right)sgn(\alpha)\geq 0
    \displaystyle\Rightarrow (p(1+α)p1pp2α)sgn(α)0\displaystyle\mathopen{}\mathclose{{}\left(p(1+\alpha)^{p-1}-p-\frac{p}{2}\alpha}\right)sgn(\alpha)\geq 0

    We thus have, h(α)0h^{\prime}(\alpha)\geq 0 when α\alpha is positive and h(α)0h^{\prime}(\alpha)\leq 0 when α\alpha is negative. The minimum of hh is at 0 which is 0. This concludes the proof of this case.

2.1.3 Proof of Iterative Refinement

In this section, we prove our main result. We start by proving the following lemma, which relates the objective of the residual problem defined in the preliminaries to the change in objective value when $\boldsymbol{\mathit{x}}$ is updated by $\Delta/p$.

Lemma 2.6.

For any $\boldsymbol{\mathit{x}},\Delta$, $p\geq 2$, and $\lambda=16p$,

\boldsymbol{res}_{p}(\Delta)\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\boldsymbol{\mathit{f}}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}\right),

and

\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\boldsymbol{\mathit{f}}\left(\boldsymbol{\mathit{x}}-\lambda\frac{\Delta}{p}\right)\leq\lambda\cdot\boldsymbol{res}_{p}(\Delta).
Proof.

We note,

𝒇(𝒙Δp)=\displaystyle\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)= 𝒅(𝒙Δp)+𝑴(𝒙Δp)22+𝑵(𝒙Δp)pp\displaystyle\boldsymbol{\mathit{d}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{M}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)}\right\rVert_{2}^{2}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)}\right\rVert_{p}^{p}
=\displaystyle= 𝒅𝒙+𝑴𝒙22+𝑵(𝒙Δp)pp1p𝒅Δ2p𝒙𝑴𝑴Δ+1p2𝑴Δ22\displaystyle\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\Delta}{p}}\right)}\right\rVert_{p}^{p}-\frac{1}{p}\boldsymbol{\mathit{d}}^{\top}\Delta-\frac{2}{p}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\frac{1}{p^{2}}\|\boldsymbol{\mathit{M}}\Delta\|_{2}^{2}
\displaystyle\leq 𝒅𝒙+𝑴𝒙22+𝑵𝒙ppp|𝑵𝒙|p2(𝑵𝒙)𝑵Δp+2p2(𝑵Δ)p(𝑵𝒙)p2𝑵Δp\displaystyle\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-p|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}(\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}})^{\top}\frac{\boldsymbol{\mathit{N}}\Delta}{p}+2p^{2}\frac{(\boldsymbol{\mathit{N}}\Delta)^{\top}}{p}(\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}})^{p-2}\frac{\boldsymbol{\mathit{N}}\Delta}{p}
+pp𝑵Δppp1p𝒅Δ2p𝒙𝑴𝑴Δ+1p2𝑴Δ22\displaystyle+p^{p}\mathopen{}\mathclose{{}\left\lVert\frac{\boldsymbol{\mathit{N}}\Delta}{p}}\right\rVert_{p}^{p}-\frac{1}{p}\boldsymbol{\mathit{d}}^{\top}\Delta-\frac{2}{p}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\frac{1}{p^{2}}\|\boldsymbol{\mathit{M}}\Delta\|_{2}^{2}
(From right inequality of Lemma 2.5)
=𝒇(𝒙)(1p𝒅+2p𝑴𝑴𝒙+𝑵|𝑵𝒙|p2𝑵𝒙)Δ\displaystyle=\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\mathopen{}\mathclose{{}\left(\frac{1}{p}\boldsymbol{\mathit{d}}+\frac{2}{p}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{N}}^{\top}|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}}\right)^{\top}\Delta
Δ(2p2𝑴𝑴+2𝑵Diag(|𝑵𝒙|p2)𝑵)Δ+𝑵Δpp\displaystyle-\Delta^{\top}\mathopen{}\mathclose{{}\left(\frac{2}{p^{2}}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+2\boldsymbol{\mathit{N}}^{\top}Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}}\right)\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p}
=𝒇(𝒙)𝒓𝒆𝒔p(Δ), From Definition 2.3.\displaystyle=\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\boldsymbol{res}_{p}(\Delta),\text{ From Definition \ref{def:residual}.}

Let 𝒈\boldsymbol{\mathit{g}} and 𝑹\boldsymbol{\mathit{R}} be as defined in Definition 2.3. We now use a similar calculation and the left inequality of Lemma 2.5 to get,

𝒇(𝒙λΔp)𝒇(𝒙)λ𝒈Δλ216pΔ𝑹Δλppp2p+1.\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\lambda\frac{\Delta}{p}}\right)\geq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{\lambda^{2}}{16p}\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{\lambda^{p}}{p^{p}2^{p+1}}.

For λ=16p\lambda=16p,

𝒇(𝒙)λ𝒈Δλ216pΔ𝑹Δλppp2p+1\displaystyle\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{\lambda^{2}}{16p}\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{\lambda^{p}}{p^{p}2^{p+1}} 𝒇(𝒙)λ(𝒈Δλ16pΔ𝑹Δλp1pp2p+1)\displaystyle\geq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{\lambda}{16p}\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{\lambda^{p-1}}{p^{p}2^{p+1}}}\right)
𝒇(𝒙)λ𝒓𝒆𝒔p(Δ),\displaystyle\geq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})-\lambda\boldsymbol{res}_{p}(\Delta),

thus concluding the proof of the lemma. ∎

We now track the value of $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})$ with a parameter $\nu$. We will first show that, if we have a $\kappa$-approximate solver for the residual problem, we can either take a step to obtain $\boldsymbol{\mathit{x}}^{(t+1)}$ such that

\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t+1)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\left(1-\frac{1}{32p\kappa}\right)\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\right),   (2)

or we need to reduce the value of $\nu$ by a factor of $2$, since $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})$ is less than $\nu/2$.

Lemma 2.7.

Consider an iterate $t$. Let $\boldsymbol{res}_{p}$ denote the residual problem at $\boldsymbol{\mathit{x}}^{(t)}$ and let $\nu$ be as defined in Algorithm 1. Let $\widetilde{\Delta}$ denote the solution returned by a $\kappa$-approximate solver for the residual problem. Then,

  1. either $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu$ and $\boldsymbol{\mathit{x}}^{(t+1)}=\boldsymbol{\mathit{x}}^{(t)}-\frac{\widetilde{\Delta}}{p}$ satisfies (2),

  2. or $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\frac{\nu}{2}$ and Line 9 in the algorithm is executed.

Proof.

We will first prove that 𝒇(𝒙(t))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu by induction. For t=0t=0, 𝒇(𝒙(0))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu by definition. Now, let us assume this is true for iteration tt. Note that, if the algorithm updates 𝒙\boldsymbol{\mathit{x}} in line 7, since 𝒇(𝒙(t+1))𝒇(𝒙(t))\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t+1)})\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)}) (solution of the residual problem is always non-negative), the relation holds for t+1t+1. Otherwise, the algorithm reduces ν\nu to ν/2\nu/2 and 𝒓𝒆𝒔p(Δ~)<ν32pκ\boldsymbol{res}_{p}({\widetilde{{\Delta}}})<\frac{\nu}{32p\kappa}. For Δ¯{\bar{{\Delta}}} such that 𝒙=𝒙(t)λΔ¯p\boldsymbol{\mathit{x}}^{\star}=\boldsymbol{\mathit{x}}^{(t)}-\lambda\frac{{\bar{{\Delta}}}}{p}, and from Lemma 2.6,

𝒇(𝒙(t))𝒇(𝒙)=𝒇(𝒙(t))𝒇(𝒙(t)λΔ¯p)λ𝒓𝒆𝒔p(Δ¯)λ𝒓𝒆𝒔p(Δ).\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})=\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}^{(t)}-\lambda\frac{{\bar{{\Delta}}}}{p}}\right)\leq\lambda\boldsymbol{res}_{p}({\bar{{\Delta}}})\leq\lambda\boldsymbol{res}_{p}({{{\Delta^{\star}}}}).

Since Δ~{\widetilde{{\Delta}}} is a κ\kappa-approximate solution to the residual problem,

λ𝒓𝒆𝒔p(Δ)λκ𝒓𝒆𝒔p(Δ~)<16pκν32pκν2.\lambda\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\leq\lambda\kappa\boldsymbol{res}_{p}({\widetilde{{\Delta}}})<16p\kappa\frac{\nu}{32p\kappa}\leq\frac{\nu}{2}.

We have thus shown that 𝒇(𝒙(t))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu for all iterates tt and whenever Line 9 of the algorithm is executed, 2 from the lemma statement holds. It remains to prove that if 𝒓𝒆𝒔p(Δ~)ν32pκ\boldsymbol{res}_{p}({\widetilde{{\Delta}}})\geq\frac{\nu}{32p\kappa}, then 𝒙(t+1)=𝒙(t)Δ~p\boldsymbol{\mathit{x}}^{(t+1)}=\boldsymbol{\mathit{x}}^{(t)}-\frac{{\widetilde{{\Delta}}}}{p} satisfies (2). Since, 𝒇(𝒙(t))𝒇(𝒙)ν\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu,

𝒓𝒆𝒔p(Δ~)ν32pκ132pκ(𝒇(𝒙(t))𝒇(𝒙)).\boldsymbol{res}_{p}({\widetilde{{\Delta}}})\geq\frac{\nu}{32p\kappa}\geq\frac{1}{32p\kappa}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right).

Now, from Lemma 2.6,

𝒇(𝒙(t+1))𝒇(𝒙)\displaystyle\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}^{(t+1)}}\right)-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star}) 𝒇(𝒙(t))𝒓𝒆𝒔p(Δ~)𝒇(𝒙)\displaystyle\leq\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{res}_{p}({\widetilde{{\Delta}}})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})
(𝒇(𝒙(t))𝒇(𝒙))132pκ(𝒇(𝒙(t))𝒇(𝒙))\displaystyle\leq\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right)-\frac{1}{32p\kappa}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right)
=(1132pκ)(𝒇(𝒙(t))𝒇(𝒙)).\displaystyle=\mathopen{}\mathclose{{}\left(1-\frac{1}{32p\kappa}}\right)\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}\right).

Corollary 2.8.

The vector $\boldsymbol{\mathit{x}}$ returned by Algorithm 1 is an $\epsilon$-approximate solution to Problem (1).

Proof.

Our starting solution $\boldsymbol{\mathit{x}}^{(0)}$ satisfies $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}$, and the solutions $\widetilde{\Delta}$ of the residual problem added in each iteration satisfy $\boldsymbol{\mathit{A}}\widetilde{\Delta}=0$. Therefore, $\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}$. For the second part, note that we always have $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\nu$. When we stop, $\nu\leq\epsilon$. Thus,

\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\epsilon. ∎

We are now ready to prove our main result (Theorem 2.1).

Proof.

From Corollary 2.8, the solution returned by the algorithm is as required. We next need to bound the runtime. From Lemma 2.7, the algorithm either reduces $\nu$ or Equation (2) holds. The number of times we can reduce $\nu$ is bounded by $\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}$. The number of times Equation (2) holds can be bounded as follows:

\frac{\epsilon}{2}\leq\boldsymbol{\mathit{f}}\left(\boldsymbol{\mathit{x}}^{(t+1)}\right)-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\leq\left(1-\frac{1}{32p\kappa}\right)^{t}\left(\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\right).

Therefore, the total number of iterations $T$ is bounded as $T\leq 32p\kappa\log\left(\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}\right)$. ∎

2.2 Starting Solution and Homotopy for Pure $\ell_p$ Objectives

In this section, we consider the case where $\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}$, i.e., $\boldsymbol{\mathit{d}}=0$ and $\boldsymbol{\mathit{M}}=0$:

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}   (3)

For such objectives, we show how to find a good starting solution. We note that we can solve the following problem, since it is equivalent to solving a system of linear equations:

\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{2}^{2}.

Refer to Appendix A for details on how the above is equivalent to solving a system of linear equations.
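As a minimal illustration of this equivalence (the full argument is in Appendix A), one can obtain the constrained $\ell_2$ minimizer from a single saddle-point (KKT) linear system; a numpy sketch, under the assumption that the system is nonsingular:

```python
import numpy as np

def l2_start(A, N, b):
    """Solve min ||N x||_2^2 subject to A x = b via the KKT system
        [2 N^T N  A^T] [x]   [0]
        [   A      0 ] [y] = [b],
    i.e. one linear system solve."""
    d, n = A.shape
    K = np.block([[2 * N.T @ N, A.T],
                  [A, np.zeros((d, d))]])
    rhs = np.concatenate([np.zeros(n), b])
    return np.linalg.solve(K, rhs)[:n]
```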

We next consider a homotopy on $p$. Specifically, we want to find a starting solution for the $\ell_p$-norm problem by first solving an $\ell_2$-norm problem, followed by $\ell_{2^{2}},\ell_{2^{3}},\ldots,\ell_{2^{\lfloor\log p-1\rfloor}}$-norm problems to a constant approximation. The following lemma relates these solutions.

Lemma 2.9.

Let $\boldsymbol{\mathit{x}}_{k}^{\star}$ denote the optimum of the $k$-norm problem and $\boldsymbol{\mathit{x}}_{2k}^{\star}$ the optimum of the $2k$-norm problem (3). Let $\widetilde{\boldsymbol{\mathit{x}}}$ be an $O(1)$-approximate solution to the $k$-norm problem. Then the following relation holds:

\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}\leq\|\widetilde{\boldsymbol{\mathit{x}}}\|_{2k}^{2k}\leq O(m)\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}.

In other words, $\widetilde{\boldsymbol{\mathit{x}}}$ is an $O(m)$-approximate solution to the $2k$-norm problem.

Proof.

The left side follows from the optimality of $\boldsymbol{\mathit{x}}^{\star}_{2k}$. For the other side, we have the following chain of inequalities:

\|\widetilde{\boldsymbol{\mathit{x}}}\|_{2k}^{2k}\leq\|\widetilde{\boldsymbol{\mathit{x}}}\|_{k}^{2k}\leq O(1)\|\boldsymbol{\mathit{x}}^{\star}_{k}\|_{k}^{2k}\leq O(1)\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{k}^{2k}\leq O(1)m^{2k\left(\frac{1}{k}-\frac{1}{2k}\right)}\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}=O(m)\|\boldsymbol{\mathit{x}}^{\star}_{2k}\|_{2k}^{2k}. ∎

Consider the following procedure to obtain a starting point $\boldsymbol{\mathit{x}}^{(0)}$ for the $\ell_p$-norm problem.

Algorithm 2 Homotopy on $p$ for Starting Solution
1: procedure StartSolution($\boldsymbol{\mathit{A}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{b}},p$)
2:     $\boldsymbol{\mathit{x}}^{(0)}\leftarrow 0,\;k\leftarrow 2$
3:     while $k\leq 2^{\lfloor\log p-1\rfloor}$ do
4:         $\boldsymbol{\mathit{x}}^{(0)}\leftarrow$ Main-Solver($\boldsymbol{\mathit{A}},0,\boldsymbol{\mathit{N}},0,\boldsymbol{\mathit{b}},k,1$) $\triangleright$ 2-approximate solution to the $k$-norm problem
5:         $k\leftarrow 2k$
6:     return $\boldsymbol{\mathit{x}}^{(0)}$
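A hedged Python rendering of this homotopy loop follows; solve_k_norm is a placeholder callback standing in for a run of Main-Solver with $\boldsymbol{\mathit{d}}=0$, $\boldsymbol{\mathit{M}}=0$ that returns a 2-approximate solution to the $k$-norm problem, warm-started at the current iterate.

```python
import numpy as np

def start_solution(A, N, b, p, solve_k_norm):
    """Homotopy of Algorithm 2: constant-factor solutions for k = 2, 4, 8, ...,
    each warm-starting the next, up to k = 2^(floor(log2(p) - 1))."""
    x0 = np.zeros(A.shape[1])
    k = 2
    while k <= 2 ** int(np.floor(np.log2(p) - 1)):
        x0 = solve_k_norm(k, x0)
        k *= 2
    return x0
```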
Lemma 2.10.

Let $\boldsymbol{\mathit{x}}^{(0)}$ be as returned by Algorithm 2. Suppose there exists an oracle that solves the residual problem for any norm $\ell_{k}$, i.e., $\boldsymbol{res}_{k}$, to a $\kappa_{k}$-approximation in time $T(k,\kappa_{k})$. We can then compute $\boldsymbol{\mathit{x}}^{(0)}$, which is an $O(m)$-approximate solution to the $\ell_p$-norm problem, in time at most

O\left(p\log m\right)\sum_{k=2^{i},\,i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}).
Proof.

For any $k$, we have an $O(1)$-approximate solution to the $k/2$-norm problem. From Lemma 2.9, this is an $O(m)$-approximate solution to the $k$-norm problem. We then have, from Theorem 2.1, that we require $O\left(k\kappa_{k}T(k,\kappa_{k})\log m\right)$ time to solve the $k$-norm problem to a constant approximation. Summing over all $k$, the total runtime is

T=\sum_{k=2^{i},\,i=2}^{i=\lfloor\log p-1\rfloor}O(k\kappa_{k}T(k,\kappa_{k})\log m)\leq O\left(p\log m\right)\sum_{k=2^{i},\,i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}). ∎

In later sections, we will describe an oracle with $\kappa_{k}=O(1)$ for all values of $k$ and with $T(k,\kappa_{k})$ depending only linearly on $k$.

2.3 Iterative Refinement for p(1,2)p\in(1,2)

We will consider the following pure p\ell_{p} problem here,

min𝑨𝒙=𝒃𝑵𝒙p,\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}, (4)

where p(1,2)p\in(1,2). In the previous sections we saw an iterative refinement framework that worked for p2p\geq 2. In this section, we will show a similar iterative refinement for p(1,2)p\in(1,2). In particular, we will prove the following result from [Adi+19].

Theorem 2.11.

Let p\in(1,2), and \kappa\geq 1. Given an initial solution \boldsymbol{\mathit{x}}^{(0)} satisfying \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}, we can find \widetilde{\boldsymbol{\mathit{x}}} such that \boldsymbol{\mathit{A}}\widetilde{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{b}} and \|\boldsymbol{\mathit{N}}\widetilde{\boldsymbol{\mathit{x}}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p} in O\mathopen{}\mathclose{{}\left(\mathopen{}\mathclose{{}\left(\frac{p}{p-1}}\right)^{\frac{1}{p-1}}\kappa\log\frac{m}{\epsilon}}\right) calls to a \kappa-approximate solver for the residual problem (Definition 2.13).

The key idea in the algorithm for p2p\geq 2 was an upper and lower bound on the function that was an 22+pp\ell_{2}^{2}+\ell_{p}^{p}-norm term (Lemma 2.5). Such a bound does not hold when p<2p<2, however, we will show that a smoothed p\ell_{p}-norm function can be used for providing such bounds. Specifically, we use the following smoothed p\ell_{p}-norm function defined in [Bub+18].

Definition 2.12.

(Smoothed \ell_{p} Function.) Let p\in(1,2), x\in\mathbb{R}, and t\geq 0. We define,

γp(t,x)={p2tp2x2 if |x|t,|x|p(1p2)tp otherwise.\gamma_{p}(t,x)=\begin{cases}\frac{p}{2}t^{p-2}x^{2}&\text{ if }|x|\leq t,\\ |x|^{p}-\mathopen{}\mathclose{{}\left(1-\frac{p}{2}}\right)t^{p}&\text{ otherwise.}\end{cases}

For any vector 𝐱\boldsymbol{\mathit{x}} and 𝐭0\boldsymbol{\mathit{t}}\geq 0, we define γp(𝐭,𝐱)=iγp(𝐭i,𝐱i)\gamma_{p}(\boldsymbol{\mathit{t}},\boldsymbol{\mathit{x}})=\sum_{i}\gamma_{p}(\boldsymbol{\mathit{t}}_{i},\boldsymbol{\mathit{x}}_{i}).
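For concreteness, the smoothed \ell_{p} function can be evaluated coordinate-wise as below. This is a direct transcription of Definition 2.12 into Python with NumPy (the language choice is ours, not the paper's); the guard t_i > 0 only avoids evaluating 0^{p-2} when t_i = 0, in which case the value is |x_i|^p, which is what the code returns.

```python
import numpy as np

def gamma_p(p, t, x):
    """Smoothed l_p function of Definition 2.12, summed over coordinates.

    Coordinate-wise: (p/2) * t_i**(p-2) * x_i**2 if |x_i| <= t_i, and
    |x_i|**p - (1 - p/2) * t_i**p otherwise.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t, x = np.broadcast_arrays(t, x)
    out = np.abs(x) ** p - (1.0 - 0.5 * p) * t ** p        # the |x_i| > t_i branch
    small = (np.abs(x) <= t) & (t > 0)                     # quadratic regime
    out[small] = 0.5 * p * t[small] ** (p - 2) * x[small] ** 2
    return float(out.sum())
```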

We define the following residual problem for this section.

Definition 2.13.

For p(1,2)p\in(1,2), we define the residual problem at any feasible 𝐱\boldsymbol{\mathit{x}} to be,

max𝑨Δ=0𝒓𝒆𝒔p(Δ)=def𝒈Δ2pγp(|𝑵𝒙|,𝑵Δ),\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{res}_{p}(\Delta)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\boldsymbol{\mathit{g}}^{\top}\Delta-2^{p}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta),

where 𝐠=p𝐍|𝐍𝐱|p2𝐍𝐱\boldsymbol{\mathit{g}}=p\boldsymbol{\mathit{N}}^{\top}|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}.

We will follow a similar structure as Section 2.1. We begin by proving analogues of Lemma 2.6 and Lemma 2.5.

Lemma 2.14.

Let p(1,2)p\in(1,2). For any xx and Δ\Delta,

|x|p+p|x|p2xΔ+p1p2pγp(|x|,Δ)|x+Δ|p|x|p+p|x|p2xΔ+2pγp(|x|,Δ)|x|^{p}+p|x|^{p-2}x\Delta+\frac{p-1}{p2^{p}}\gamma_{p}(|x|,\Delta)\leq|x+\Delta|^{p}\leq|x|^{p}+p|x|^{p-2}x\Delta+2^{p}\gamma_{p}(|x|,\Delta)
Proof.

We first show the following inequality holds for |α|1|\alpha|\leq 1

1+αp+(p1)4α2(1+α)p1+αp+p2p1α2.1+\alpha p+\frac{(p-1)}{4}\alpha^{2}\leq(1+\alpha)^{p}\leq 1+\alpha p+p2^{p-1}\alpha^{2}. (5)

Let us first show the left inequality, i.e. 1+αp+p14α2(1+α)p1+\alpha p+\frac{p-1}{4}\alpha^{2}\leq(1+\alpha)^{p}. Define the following function,

h(α)=(1+α)p1αpp14α2.h(\alpha)=(1+\alpha)^{p}-1-\alpha p-\frac{p-1}{4}\alpha^{2}.

At \alpha=1 and \alpha=-1, we have h(\alpha)\geq 0. The derivative of h with respect to \alpha is h^{\prime}(\alpha)=p(1+\alpha)^{p-1}-p-\frac{(p-1)}{2}\alpha. Next, let us see what happens when \mathopen{}\mathclose{{}\left|\alpha}\right|<1.

h′′(α)=p(p1)(1+α)p2p12=(p1)(p(1+α)2p12)0h^{\prime\prime}(\alpha)=p(p-1)(1+\alpha)^{p-2}-\frac{p-1}{2}=(p-1)\mathopen{}\mathclose{{}\left(\frac{p}{(1+\alpha)^{2-p}}-\frac{1}{2}}\right)\geq 0

This implies that h^{\prime}(\alpha) is an increasing function of \alpha, so the point \alpha_{0} at which h^{\prime}(\alpha_{0})=0 is where h attains its minimum value. The only point where h^{\prime} vanishes is \alpha_{0}=0. This implies h(\alpha)\geq h(0)=0, which concludes the proof of the left inequality. For the right inequality, define:

s(α)=1+αp+p2p1α2(1+α)p.s(\alpha)=1+\alpha p+p2^{p-1}\alpha^{2}-(1+\alpha)^{p}.

Note that s(0)=0s(0)=0 and s(1),s(1)0s(1),s(-1)\geq 0. We have,

s(α)=p+p2pαp(1+α)p1,s^{\prime}(\alpha)=p+p2^{p}\alpha-p(1+\alpha)^{p-1},

and

(1+α)p1sign(α)(1+α)sign(α).(1+\alpha)^{p-1}sign(\alpha)\leq(1+\alpha)sign(\alpha).

Using this, we get s^{\prime}(\alpha)sign(\alpha)\geq p|\alpha|(2^{p}-1)\geq 0, which says that s^{\prime}(\alpha) is non-negative for \alpha positive and non-positive for \alpha negative. Thus the minimum of s is attained at \alpha=0, where s(0)=0, and so s(\alpha)\geq 0.

Before we prove the lemma, we will prove the following inequality for β1\beta\geq 1,

(β1)p1+112pβp1.(\beta-1)^{p-1}+1\geq\frac{1}{2^{p}}\beta^{p-1}. (6)

We have (\beta-1)\geq\frac{\beta}{2} for \beta\geq 2, so the claim holds for \beta\geq 2 since (\beta-1)^{p-1}\geq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}\geq\frac{1}{2^{p}}\beta^{p-1}. When 1\leq\beta\leq 2, we have 1\geq\frac{\beta}{2}, so the claim holds since 1\geq\mathopen{}\mathclose{{}\left(\frac{\beta}{2}}\right)^{p-1}\geq\frac{1}{2^{p}}\beta^{p-1}.

We now prove the lemma.

Let Δ=αx\Delta=\alpha x. The term p|x|p1sign(x)αx=αp|x|p1|x|=αp|x|pp|x|^{p-1}sign(x)\cdot\alpha x=\alpha p|x|^{p-1}|x|=\alpha p|x|^{p}. Let us first look at the case when |α|1|\alpha|\leq 1. We want to show,

|x|p+αp|x|p+cp2|x|p2|αx|2|x+αx|p|x|p+αp|x|p+Cp2|x|p2|αx|2\displaystyle|x|^{p}+\alpha p|x|^{p}+c\frac{p}{2}|x|^{p-2}|\alpha x|^{2}\leq|x+\alpha x|^{p}\leq|x|^{p}+\alpha p|x|^{p}+C\frac{p}{2}|x|^{p-2}|\alpha x|^{2}
(1+αp)+cp2α2(1+α)p(1+αp)+Cp2α2.\displaystyle\Leftrightarrow(1+\alpha p)+c\frac{p}{2}\alpha^{2}\leq(1+\alpha)^{p}\leq(1+\alpha p)+C\frac{p}{2}\alpha^{2}.

This follows from Equation (5) and the facts that c=\frac{p-1}{p2^{p}} satisfies \frac{cp}{2}\leq\frac{p-1}{4} and C=2^{p} satisfies \frac{Cp}{2}\geq p2^{p-1}. We next look at the case when |\alpha|\geq 1. Now, \gamma_{p}(|x|,\Delta)=|\Delta|^{p}+(\frac{p}{2}-1)|x|^{p}. We need to show

|x|p(1+αp)+|x|p(p1)p2p(|α|p+p21)|x|p|1+α|p|x|p(1+αp)+2p|x|p(|α|p+p21).|x|^{p}(1+\alpha p)+\frac{|x|^{p}(p-1)}{p2^{p}}(|\alpha|^{p}+\frac{p}{2}-1)\leq|x|^{p}|1+\alpha|^{p}\leq|x|^{p}(1+\alpha p)+2^{p}|x|^{p}(|\alpha|^{p}+\frac{p}{2}-1).

When |x|=0|x|=0 it is trivially true. When |x|0|x|\neq 0, let

h(α)=|1+α|p(1+αp)(p1)p2p(|α|p+p21).h(\alpha)=|1+\alpha|^{p}-(1+\alpha p)-\frac{(p-1)}{p2^{p}}(|\alpha|^{p}+\frac{p}{2}-1).

Now, taking the derivative with respect to α\alpha we get,

h(α)=p(|1+α|p1sign(α)1(p1)p2p|α|p1sign(α)).h^{\prime}(\alpha)=p\mathopen{}\mathclose{{}\left(|1+\alpha|^{p-1}sign(\alpha)-1-\frac{(p-1)}{p2^{p}}|\alpha|^{p-1}sign(\alpha)}\right).

We use the mean value theorem to get, for \alpha\geq 1,

(1+α)p11\displaystyle(1+\alpha)^{p-1}-1 =(p1)α(1+z)p2,z(0,α)\displaystyle=(p-1)\alpha(1+z)^{p-2},z\in(0,\alpha)
(p1)α(2α)p2\displaystyle\geq(p-1)\alpha(2\alpha)^{p-2}
p12αp1\displaystyle\geq\frac{p-1}{2}\alpha^{p-1}

which implies h(α)0h^{\prime}(\alpha)\geq 0 in this range as well. When α1\alpha\leq-1 it follows from Equation (6) that h(α)0h^{\prime}(\alpha)\leq 0. So the function hh is increasing for α1\alpha\geq 1 and decreasing for α1\alpha\leq-1. The minimum value of hh is min{h(1),h(1)}0min\{h(1),h(-1)\}\geq 0. It follows that h(α)0h(\alpha)\geq 0 which gives us the left inequality. The other side requires proving,

|1+α|p1+αp+2p(|α|p+p21).|1+\alpha|^{p}\leq 1+\alpha p+2^{p}(|\alpha|^{p}+\frac{p}{2}-1).

Define:

s(α)=1+αp+2p(|α|p+p21)|1+α|p.s(\alpha)=1+\alpha p+2^{p}(|\alpha|^{p}+\frac{p}{2}-1)-|1+\alpha|^{p}.

The derivative s(α)=p+(p2p|α|p1p|1+α|p1)sign(α)s^{\prime}(\alpha)=p+\mathopen{}\mathclose{{}\left(p2^{p}|\alpha|^{p-1}-p|1+\alpha|^{p-1}}\right)sign(\alpha) is non negative for α1\alpha\geq 1 and non positive for α1\alpha\leq-1. The minimum value taken by ss is min{s(1),s(1)}\min\{s(1),s(-1)\} which is non negative. This gives us the right inequality.
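The two-sided bound of Lemma 2.14 is also easy to sanity-check numerically. The snippet below is an illustrative check only (it is not part of the proof): it samples random x, \Delta and p\in(1,2) and reports the largest violation of either inequality, which should be non-positive up to floating-point rounding.

```python
import numpy as np

def gamma_p_scalar(p, t, x):
    # Smoothed l_p function of Definition 2.12 for scalar t >= 0 and x.
    if abs(x) <= t and t > 0:
        return 0.5 * p * t ** (p - 2) * x ** 2
    return abs(x) ** p - (1 - 0.5 * p) * t ** p

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(100_000):
    p = rng.uniform(1.01, 1.99)
    x = rng.normal() * 10.0 ** rng.uniform(-2, 2)
    d = rng.normal() * 10.0 ** rng.uniform(-2, 2)
    base = abs(x) ** p + p * abs(x) ** (p - 2) * x * d     # zeroth- and first-order terms
    g = gamma_p_scalar(p, abs(x), d)
    lower = base + (p - 1) / (p * 2 ** p) * g
    upper = base + 2 ** p * g
    mid = abs(x + d) ** p
    worst = max(worst, lower - mid, mid - upper)
print("largest violation (should be <= 0 up to rounding):", worst)
```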

Lemma 2.15.

Let p(1,2)p\in(1,2) and λp1=p4pp1\lambda^{p-1}=\frac{p4^{p}}{p-1}. Then for any Δ\Delta,

𝒓𝒆𝒔p(Δ)𝑵𝒙pp𝑵(𝒙Δ)pp,\boldsymbol{res}_{p}(\Delta)\leq\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\Delta)\|_{p}^{p},

and

𝑵𝒙pp𝑵(𝒙λΔ)ppλ𝒓𝒆𝒔p(Δ).\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\lambda\Delta)\|_{p}^{p}\leq\lambda\boldsymbol{res}_{p}(\Delta).
Proof.

Applying Lemma 2.14 to all coordinates,

𝒈Δ+p1p2pγp(|𝑵𝒙|,𝑵Δ)𝑵(𝒙Δ)pp𝑵𝒙pp𝒈Δ+2pγp(|𝑵𝒙|,𝑵Δ).-\boldsymbol{\mathit{g}}^{\top}\Delta+\frac{p-1}{p2^{p}}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta)\leq\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\Delta)\|_{p}^{p}-\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq-\boldsymbol{\mathit{g}}^{\top}\Delta+2^{p}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta).

From the definition of the residual problem and the above equation, the first inequality of our lemma directly follows. To see the other inequality, from the above equation,

𝑵𝒙pp𝑵(𝒙λΔ)pp\displaystyle\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\lambda\Delta)\|_{p}^{p} λ𝒈Δp1p2pγp(|𝑵𝒙|,λ𝑵Δ)\displaystyle\leq\lambda\boldsymbol{\mathit{g}}^{\top}\Delta-\frac{p-1}{p2^{p}}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\lambda\boldsymbol{\mathit{N}}\Delta)
λ(𝒈Δλp1p1p2pγp(|𝑵𝒙|,𝑵Δ))\displaystyle\leq\lambda\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\Delta-\lambda^{p-1}\frac{p-1}{p2^{p}}\gamma_{p}(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|,\boldsymbol{\mathit{N}}\Delta)}\right)
=λ𝒓𝒆𝒔p(Δ).\displaystyle=\lambda\cdot\boldsymbol{res}_{p}(\Delta).

Here, we are using the following property of \gamma_{p},

\gamma_{p}(t,\lambda\Delta)\geq\min\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,\Delta).

Since \lambda\geq 1 and p<2, we have \min\{\lambda^{2},\lambda^{p}\}=\lambda^{p}, and our choice of \lambda gives \lambda^{p-1}\cdot\frac{p-1}{p2^{p}}=2^{p}, so the last expression above is exactly \lambda\cdot\boldsymbol{res}_{p}(\Delta).

Lemma 2.15 is similar to Lemma 2.6, and we can follow the proof of Theorem 2.1 to obtain Theorem 2.11.

3 Fast Multiplicative Weight Update Algorithm for p\ell_{p}-norms

In this section, we will show how to solve the residual problem for p\geq 2 as defined in the previous section (Definition 2.3), to a constant approximation. The core of our approach is a multiplicative weight update routine with width reduction that is used to speed up the algorithm. For problem instances of size m, this routine returns a constant-approximate solution in at most O(m^{1/3}) calls to a linear system solver. Such a width-reduced multiplicative weight update algorithm was first seen in the context of the maximum flow problem and \ell_{\infty}-regression in the works of [Chr+11, Chi+13].

The first instance of such a width reduced multiplicative weight update algorithm for p\ell_{p}-regression appeared in the work of [Adi+19]. In a further work, the authors improved the dependence on pp in the runtime [Adi+21]. The following sections are based on the improved algorithm from [Adi+21].

3.1 Algorithm for p\ell_{p}-norm Regression

Recall that our residual problem for p2p\geq 2 is defined as:

max𝑨Δ=0𝒓𝒆𝒔p(Δ)=def𝒈ΔΔ𝑹Δ𝑵Δpp,\max_{\boldsymbol{\mathit{A}}\Delta=0}\quad\boldsymbol{res}_{p}(\Delta)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\boldsymbol{\mathit{g}}^{\top}\Delta-\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p},

for some vector 𝒈\boldsymbol{\mathit{g}} and matrices 𝑹\boldsymbol{\mathit{R}} and 𝑵\boldsymbol{\mathit{N}}. Also recall that in Algorithm 1, we used a parameter ν\nu, which was used to track the value of 𝒇(𝒙(t))𝒇(𝒙)\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star}) at any iteration tt. We will now use this parameter ν\nu to do a binary search on the linear term in 𝒓𝒆𝒔p\boldsymbol{res}_{p} and reduce the residual problem to,

minΔΔ𝑹Δ+𝑵Δpps.t.𝑨Δ=0𝒈Δ=c,\displaystyle\begin{aligned} \min_{\Delta}&\quad\Delta^{\top}\boldsymbol{\mathit{R}}\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p}\\ s.t.&\quad\boldsymbol{\mathit{A}}\Delta=0\\ &\quad\boldsymbol{\mathit{g}}^{\top}\Delta=c,\end{aligned} (7)

for some constant cc. Further, we will use our multiplicative weight update solver to solve problems of this kind to a constant approximation. We start by proving the binary search results.

3.1.1 Binary Search

We first note that, if ν\nu at iteration tt is such that 𝒇(𝒙(t))𝒇(𝒙)(ν/2,ν]\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\in(\nu/2,\nu], then from Lemma 2.6, the residual at 𝒙(t)\boldsymbol{\mathit{x}}^{(t)} has optimum value, 𝒓𝒆𝒔p(Δ)(ν32p,ν]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\frac{\nu}{32p},\nu]. We now consider a parameter ζ\zeta that has value between ν16p\frac{\nu}{16p} and ν\nu such that 𝒓𝒆𝒔p(Δ)(ζ2,ζ]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\frac{\zeta}{2},\zeta]. We have the following lemma that relates the optimum of problem of the type (7) with ζ\zeta.

Lemma 3.1.

Let ζ\zeta be such that the residual problem satisfies 𝐫𝐞𝐬p(Δ)(ζ2,ζ]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\frac{\zeta}{2},\zeta]. The following problem has optimum at most 2ζ2\zeta.

minΔΔ𝑹Δ+𝑵Δpps.t.𝑨Δ=0𝒈Δ=ζ2.\displaystyle\begin{aligned} \min_{\Delta}&\quad\Delta^{\top}\boldsymbol{\mathit{R}}\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p}\\ s.t.&\quad\boldsymbol{\mathit{A}}\Delta=0\\ &\quad\boldsymbol{\mathit{g}}^{\top}\Delta=\frac{\zeta}{2}.\end{aligned} (8)

Further, let Δ~{\widetilde{{\Delta}}} be a solution to the above problem such that Δ~𝐑Δ~a2ζ{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}\leq a^{2}\zeta and 𝐍Δ~ppapζ\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq a^{p}\zeta for some a>1a>1. Then Δ~5a2\frac{{\widetilde{{\Delta}}}}{5a^{2}} is a 100a2100a^{2}-approximation to the residual problem.

Proof.

We have assumed that,

𝒓𝒆𝒔(Δ)=𝒈ΔΔ𝑹Δ𝑵Δpp(ζ2,ζ].\boldsymbol{res}({{{\Delta^{\star}}}})=\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\in\mathopen{}\mathclose{{}\left(\frac{\zeta}{2},\zeta}\right].

Since the last two terms are non-positive, we must have \boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}\geq\frac{\zeta}{2}. Since {{{\Delta^{\star}}}} is the optimum and satisfies \boldsymbol{\mathit{A}}{{{\Delta^{\star}}}}=0,

ddλ(𝒈λΔλ2Δ𝑹Δλp𝑵Δpp)λ=1=0.\frac{d}{d\lambda}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\lambda{{{\Delta^{\star}}}}-\lambda^{2}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\lambda^{p}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}}\right)_{\lambda=1}=0.

Thus,

𝒈ΔΔ𝑹Δ𝑵Δpp=Δ𝑹Δ+(p1)𝑵Δpp.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}={{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+(p-1)\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}.

Since p2,p\geq 2, we get the following

Δ𝑹Δ+𝑵Δpp𝒈ΔΔ𝑹Δ𝑵Δppζ.{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\zeta.

Now, we know that \boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}\geq\frac{\zeta}{2} and \boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\zeta. This gives,

\frac{\zeta}{2}\leq\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}\leq{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}+\zeta\leq 2\zeta.

In particular, scaling {{{\Delta^{\star}}}} by the factor \zeta/(2\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}})\in(0,1] yields a feasible solution to Problem (8) whose objective is at most {{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}}\right\rVert_{p}^{p}\leq\zeta, so the optimum of Problem (8) is at most 2\zeta, as claimed.

Now, let Δ~{\widetilde{{\Delta}}} be as described in the lemma. We have,

\displaystyle\boldsymbol{res}_{p}\mathopen{}\mathclose{{}\left(\frac{{\widetilde{{\Delta}}}}{5a^{2}}}\right) \displaystyle\geq\frac{1}{5a^{2}}\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-\frac{\zeta}{25a^{2}}-\frac{\zeta}{5^{p}a^{p}}
ζ10a22ζ25a2\displaystyle\geq\frac{\zeta}{10a^{2}}-\frac{2\zeta}{25a^{2}}
ζ50a21100a2𝒓𝒆𝒔p(Δ)\displaystyle\geq\frac{\zeta}{50a^{2}}\geq\frac{1}{100a^{2}}\boldsymbol{res}_{p}({{{\Delta^{\star}}}})

Algorithm 3 Algorithm for Solving the Residual Problem
1:procedure ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)
2:     ζν\zeta\leftarrow\nu
3:     (𝒈,𝑹,𝑵)𝒓𝒆𝒔p(\boldsymbol{\mathit{g}},\boldsymbol{\mathit{R}},\boldsymbol{\mathit{N}})\leftarrow\boldsymbol{res}_{p}\triangleright Create residual problem at 𝒙\boldsymbol{\mathit{x}}
4:     while ζ>ν32p\zeta>\frac{\nu}{32p} do
5:         Δ~ζ{\widetilde{{\Delta}}}_{\zeta}\leftarrow MWU-Solver([𝑨,𝒈],𝑹1/2,𝑵,[0,ζ2],ζ,p)\mathopen{}\mathclose{{}\left([\boldsymbol{\mathit{A}},\boldsymbol{\mathit{g}}^{\top}],\boldsymbol{\mathit{R}}^{1/2},\boldsymbol{\mathit{N}},[0,\frac{\zeta}{2}]^{\top},\zeta,p}\right)\triangleright Algorithm 5
6:         ζζ2\zeta\leftarrow\frac{\zeta}{2}      
7:     return argminΔ~ζ𝒇(𝒙Δ~ζp)\arg\min_{{\widetilde{{\Delta}}}_{\zeta}}\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{{\widetilde{{\Delta}}}_{\zeta}}{p}}\right)

3.1.2 Width-Reduced Approximate Solver

We are now finally ready to solve problems of the type (7). In this section, we will give an algorithm to solve the following problem,

minΔ\displaystyle\min_{\Delta} Δ𝑴𝑴Δ+𝑵Δpp\displaystyle\quad\Delta^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p} (9)
s.t.𝑨Δ=𝒄.\displaystyle\text{s.t.}\quad\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}.

Here 𝑨d×n,𝑵m1×n,𝑴m2×n\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n},\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}, and vector 𝒄d\boldsymbol{\mathit{c}}\in\mathbb{R}^{d}. Our approach involves a multiplicative weight update method with a width reduction step which allows us to solve these problems faster.

3.1.3 Slow Multiplicative Weight Update Solver

We first give an informal analysis of the multiplicative weight update method without width reduction. We will show that this method converges in \approx m_{1}^{\frac{p-2}{2(p-1)}}\leq m_{1}^{1/2} iterations. For simplicity, we let \boldsymbol{\mathit{M}}=0 in Problem (9) and assume without loss of generality that the optimum {{{\Delta^{\star}}}} satisfies \|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}\leq 1. Consider the following MWU algorithm for a parameter \alpha that we will set later; an illustrative sketch in code is given after the list:

  1. \boldsymbol{\mathit{w}}^{(0)}=1,\boldsymbol{\mathit{x}}^{(0)}=0,T=\alpha^{-1}m_{1}^{1/p}

  2. for t=1,\cdots,T:
     \Delta^{(t)}=\arg\min_{\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}\sum_{i}(\boldsymbol{\mathit{w}}^{(t-1)}_{i})^{p-2}(\boldsymbol{\mathit{N}}\Delta)_{i}^{2},\quad\boldsymbol{\mathit{w}}^{(t)}=\boldsymbol{\mathit{w}}^{(t-1)}+\alpha|\boldsymbol{\mathit{N}}\Delta^{(t)}|,\quad\boldsymbol{\mathit{x}}^{(t)}=\boldsymbol{\mathit{x}}^{(t-1)}+\Delta^{(t)}

  3. Return \widetilde{\boldsymbol{\mathit{x}}}=\boldsymbol{\mathit{x}}/T
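The following is a minimal Python sketch of this loop, assuming a callable `weighted_least_squares(A, N, c, r)` (a name introduced here for illustration) that returns \arg\min_{\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}\sum_{i}\boldsymbol{\mathit{r}}_{i}(\boldsymbol{\mathit{N}}\Delta)_{i}^{2}, e.g., by solving the corresponding KKT linear system; the paper counts each such call as one linear system solve.

```python
import numpy as np

def slow_mwu(A, N, c, p, weighted_least_squares):
    """MWU without width reduction, for the informal analysis above.

    Assumes (after rescaling) that the optimum satisfies ||N Delta*||_p <= 1.
    """
    m1, n = N.shape
    alpha = m1 ** (-(p ** 2 - 4 * p + 2) / (2 * p * (p - 1)))  # step size from the text
    T = int(np.ceil(m1 ** (1.0 / p) / alpha))                  # ~ m1^{(p-2)/(2(p-1))} iterations
    w = np.ones(m1)
    x = np.zeros(n)
    for _ in range(T):
        delta = weighted_least_squares(A, N, c, w ** (p - 2))  # re-weighted quadratic step
        w = w + alpha * np.abs(N @ delta)                      # additive weight update
        x = x + delta
    return x / T
```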

We claim that the above algorithm returns 𝒙~\widetilde{\boldsymbol{\mathit{x}}} such that 𝑵𝒙~ppOp(1)\|\boldsymbol{\mathit{N}}\widetilde{\boldsymbol{\mathit{x}}}\|_{p}^{p}\leq O_{p}(1), i.e., a constant approximate solution to the residual problem, in m11/2\approx m_{1}^{1/2} iterations. We will bound the value of the returned solution, 𝑵𝒙~pp\|\boldsymbol{\mathit{N}}\widetilde{\boldsymbol{\mathit{x}}}\|_{p}^{p} by looking at how 𝒘(t)pp\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p} grows with tt. From Lemma 2.5,

𝒘(t1)+α𝑵Δ(t)pp𝒘(t1)pp+αpi(𝒘i(t1))p1(𝑵Δ(t))i+2p2α2i(𝒘i(t1))p2(𝑵Δ(t))i2+αppp𝑵Δ(t)pp.\|\boldsymbol{\mathit{w}}^{(t-1)}+\alpha\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}\leq\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p}+\alpha p\sum_{i}(\boldsymbol{\mathit{w}}^{(t-1)}_{i})^{p-1}(\boldsymbol{\mathit{N}}\Delta^{(t)})_{i}\\ +2p^{2}\alpha^{2}\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}\Delta^{(t)})^{2}_{i}+\alpha^{p}p^{p}\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}.

Observe that the third term on the right hand side is exactly the objective of the quadratic problem minimized to obtain Δ(t)\Delta^{(t)}. Using that Δ(t)\Delta^{(t)} must achieve a lower objective than Δ{{{\Delta^{\star}}}}, i.e., i(𝒘i(t1))p2(𝑵Δ(t))i2i(𝒘i(t1))p2(𝑵Δ)i2\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}\Delta^{(t)})^{2}_{i}\leq\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})^{2}_{i} along with Hölder’s inequality and 𝑵Δp1\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}\leq 1, we can bound this term by 𝒘(t1)pp2\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-2}. We can further bound the second term in right hand side of the above inequality by the third term using Hölder’s inequality (refer to Proof of Lemma 3.3 for details). These bounds give,

𝒘(t)pp𝒘(t1)pp+αp𝒘(t1)pp1+2α2p2𝒘(t1)pp2+αppp𝑵Δ(t)pp.\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p}\leq\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p}+\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|^{p-1}_{p}+2\alpha^{2}p^{2}\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-2}+\alpha^{p}p^{p}\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}.

Observe that the growth of 𝒘(t)pp\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p} is controlled by 𝑵Δ(t)pp\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|^{p}_{p}. We next see how large this quantity can be. Assume that, 𝒘(t)p3m11/p\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}\leq 3m_{1}^{1/p} for all tt (one may verify in the end that this holds for all tTt\leq T). Since (𝒘i(t1))p2(𝒘i(0))p2=1(\boldsymbol{\mathit{w}}^{(t-1)}_{i})^{p-2}\geq(\boldsymbol{\mathit{w}}^{(0)}_{i})^{p-2}=1,

\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{2}^{2}\leq\sum_{i}(\boldsymbol{\mathit{w}}_{i}^{(t-1)})^{p-2}(\boldsymbol{\mathit{N}}\Delta^{(t)})^{2}_{i}\stackrel{{(a)}}{{\leq}}\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-2}\leq 3^{p-2}m_{1}^{(p-2)/p},

where we used Hölder’s inequality in (a)(a). This implies, 𝑵Δ(t)pp3(p2)p/2m1(p2)/2\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}\leq 3^{(p-2)p/2}m_{1}^{(p-2)/2}. Now, for αm1p24p+22p(p1)\alpha\approx m_{1}^{-\frac{p^{2}-4p+2}{2p(p-1)}}, αppp𝑵Δ(t)ppαpm1p1pαp𝒘(t1)pp1\alpha^{p}p^{p}\|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p}\leq\alpha pm_{1}^{\frac{p-1}{p}}\leq\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-1} and,

𝒘(t)pp𝒘(t1)pp+αp𝒘(t1)pp1+2α2p2+αp𝒘(t1)pp1(𝒘(t1)p+2α)p.\|\boldsymbol{\mathit{w}}^{(t)}\|_{p}^{p}\leq\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p}+\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|^{p-1}_{p}+2\alpha^{2}p^{2}+\alpha p\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}^{p-1}\leq\mathopen{}\mathclose{{}\left(\|\boldsymbol{\mathit{w}}^{(t-1)}\|_{p}+2\alpha}\right)^{p}.

We can thus prove that,

\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\frac{\boldsymbol{\mathit{x}}^{(T)}}{T}}\right\rVert_{p}^{p}\leq\frac{1}{m_{1}}\|\boldsymbol{\mathit{w}}^{(T)}\|_{p}^{p}\leq\frac{1}{m_{1}}\mathopen{}\mathclose{{}\left(\|\boldsymbol{\mathit{w}}^{(0)}\|_{p}+2\alpha T}\right)^{p}=\frac{1}{m_{1}}\mathopen{}\mathclose{{}\left(m_{1}^{1/p}+2m_{1}^{1/p}}\right)^{p}=3^{p},

as required. The total number of iterations is T=α1m11/pm1p22(p1)T=\alpha^{-1}m_{1}^{1/p}\approx m_{1}^{\frac{p-2}{2(p-1)}}.

To obtain the improved rates of convergence via width reduction, our algorithm uses a hard threshold on \|\boldsymbol{\mathit{N}}\Delta^{(t)}\|_{p}^{p} and performs a width reduction step whenever this quantity exceeds the threshold. The analysis now additionally requires tracking how \|\boldsymbol{\mathit{w}}\|_{p} changes with a width reduction step. Our analysis also tracks the value of an additional energy-like potential \Psi\approx\min_{\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}\sum_{i}\boldsymbol{\mathit{w}}_{i}^{p-2}(\boldsymbol{\mathit{N}}\Delta)_{i}^{2}, defined formally in Section 3.1.5. The interplay of these two potentials, and balancing their changes under primal updates and width reduction steps, gives the improved rate of convergence.

3.1.4 Fast, Width-Reduced MWU Solver

In the previous section, we showed that a multiplicative weight update algorithm without width-reduction obtains a rate of convergence m11/2\approx m_{1}^{1/2}. In this section we will show how width-reduction allows for a faster m11/3\approx m_{1}^{1/3} rate of convergence. We now present the faster width-reduced algorithm. We will prove the following result.

Theorem 3.2.

Let p\geq 2. Consider an instance of Problem (9) described by matrices \boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n},\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}, and vector \boldsymbol{\mathit{c}}\in\mathbb{R}^{d}. If the optimum of this problem is at most \zeta, Procedure MWU-Solver (Algorithm 5) returns an \boldsymbol{\mathit{x}} such that \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{c}}, \boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\leq O(1)\zeta, and \|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq O(3^{p})\zeta. The algorithm makes {O}\mathopen{}\mathclose{{}\left(pm_{1}^{\frac{p-2}{(3p-2)}}}\right) calls to a linear system solver.

The algorithm and analyses of this section are based on [Adi+19] and [Adi+21].

In every iteration of the algorithm, we solve a weighted linear system. The solution returned is used to update the current iterate if it has a small \ell_{p} norm. Otherwise, we do not update the solution, but instead increase the weights corresponding to the coordinates with large value by a constant factor. This step is referred to as the “width reduction step”. The analysis is based on a potential function argument for specially defined potentials.

The following is the oracle used in the algorithm, i.e., the linear system we need to solve. We show in Appendix A how to implement the oracle using a linear system solver.

Algorithm 4 Oracle
1:procedure Oracle(𝑨,𝑴,𝑵,𝒄,𝒘,ζ\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{c}},\boldsymbol{\mathit{w}},\zeta)
2:     𝒓e𝒘ep2\boldsymbol{\mathit{r}}_{e}\leftarrow\boldsymbol{\mathit{w}}_{e}^{p-2}
3:     𝑴~ζp22p𝑴{\widetilde{\boldsymbol{\mathit{M}}}}\leftarrow\zeta^{-\frac{p-2}{2p}}\boldsymbol{\mathit{M}}
4:     Compute,
Δ=argmin𝑨Δ=𝒄m1p2pΔ𝑴~𝑴~Δ+13p2e𝒓e(𝑵Δ)e2\Delta=\arg\min_{\boldsymbol{\mathit{A}}\Delta^{\prime}=\boldsymbol{\mathit{c}}}\quad m_{1}^{\frac{p-2}{p}}{\Delta^{\prime}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}\Delta^{\prime}+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta^{{}^{\prime}}}\right)^{2}_{e}
5:     return Δ\Delta
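The oracle is an equality-constrained weighted least-squares problem, so one linear system solve suffices. The following dense Python sketch is one possible implementation via the KKT system (illustrative only; Appendix A describes the reduction to a linear system solve in general, and the function and variable names below are ours).

```python
import numpy as np

def oracle(A, M, N, c, w, zeta, p):
    """One way to implement Algorithm 4: solve the KKT system of the quadratic.

    Minimizes  m1^{(p-2)/p} d^T Mt^T Mt d + 3^{-(p-2)} sum_e r_e (N d)_e^2
    subject to A d = c, where r = w^{p-2} and Mt = zeta^{-(p-2)/(2p)} M.
    """
    m1 = N.shape[0]
    n = A.shape[1]
    d_rows = A.shape[0]
    r = w ** (p - 2)
    Mt = zeta ** (-(p - 2) / (2.0 * p)) * M
    # Hessian of the objective (times 2): Q = 2 (m1^{(p-2)/p} Mt^T Mt + 3^{-(p-2)} N^T diag(r) N).
    Q = 2.0 * (m1 ** ((p - 2) / p) * Mt.T @ Mt
               + (1.0 / 3 ** (p - 2)) * N.T @ (r[:, None] * N))
    # KKT conditions: [Q  A^T; A  0] [d; y] = [0; c].
    kkt = np.block([[Q, A.T], [A, np.zeros((d_rows, d_rows))]])
    rhs = np.concatenate([np.zeros(n), c])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n]
```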

We now have the following multiplicative weight update algorithm given in Algorithm 5.

Algorithm 5 Width Reduced MWU Algorithm
1:procedure MWU-Solver(𝑨,𝑴,𝑵,𝒄,ζ,p\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{c}},\zeta,p)
2:     𝒘e(0,0)1\boldsymbol{\mathit{w}}^{(0,0)}_{e}\leftarrow 1
3:     𝒙0\boldsymbol{\mathit{x}}\leftarrow 0
4:     ρm1(p24p+2)p(3p2)\rho\leftarrow m_{1}^{\frac{(p^{2}-4p+2)}{p(3p-2)}}\triangleright width parameter
5:     β3p1m1p23p2\beta\leftarrow 3^{p-1}\cdot m_{1}^{\frac{p-2}{3p-2}}\triangleright resistance threshold
6:     α3p1pp1m1p25p+2p(3p2)\alpha\leftarrow 3^{-\frac{p-1}{p}}\cdot p^{-1}m_{1}^{-\frac{p^{2}-5p+2}{p(3p-2)}}\triangleright step size
7:     τ3pm1(p1)(p2)(3p2)\tau\leftarrow 3^{p}\cdot m_{1}^{\frac{(p-1)(p-2)}{(3p-2)}}\triangleright p\ell_{p} threshold
8:     T\leftarrow\alpha^{-1}m_{1}^{1/p}=3^{\frac{p-1}{p}}\mathopen{}\mathclose{{}\left(pm_{1}^{\frac{p-2}{3p-2}}}\right)
9:     i0,k0i\leftarrow 0,k\leftarrow 0
10:     while i<Ti<T do
11:         ΔOracle(𝑨,𝑴,𝑵,𝒄,𝒘(i,k),ζ)\Delta\leftarrow\textsc{Oracle}(\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{c}},\boldsymbol{\mathit{w}}^{(i,k)},\zeta)
12:         𝒓(𝒘(i,k))p2\boldsymbol{\mathit{r}}\leftarrow\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)^{p-2}
13:         if 𝑵Δppτζ\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\Delta}\right\rVert_{p}^{p}\leq\tau\zeta then\triangleright primal step
14:              𝒘(i+1,k)𝒘(i,k)+α|𝑵Δ|ζ1/p\boldsymbol{\mathit{w}}^{(i+1,k)}\leftarrow\boldsymbol{\mathit{w}}^{(i,k)}+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|}{\zeta^{1/p}}
15:              𝒙𝒙+Δ\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}+\Delta
16:              ii+1i\leftarrow i+1
17:         else
18:              For all coordinates ee with |𝑵Δ|eρζ1p|\boldsymbol{\mathit{N}}\Delta|_{e}\geq\rho\zeta^{\frac{1}{p}} and 𝒓eβ\boldsymbol{\mathit{r}}_{e}\leq\beta\triangleright width reduction step
19:              𝒘e(i,k+1)21p2𝒘e\quad\quad\quad\boldsymbol{\mathit{w}}_{e}^{(i,k+1)}\leftarrow 2^{\frac{1}{p-2}}\boldsymbol{\mathit{w}}_{e}
20:              kk+1\quad\quad\quad k\leftarrow k+1               
21:     return 𝒙T\frac{\boldsymbol{\mathit{x}}}{T}
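For completeness, a direct Python transcription of Algorithm 5 follows. It is a sketch meant only to make the bookkeeping of primal versus width-reduction steps explicit; the `oracle` argument is any implementation of Algorithm 4 (for instance the KKT-based sketch above).

```python
import numpy as np

def mwu_solver(A, M, N, c, zeta, p, oracle):
    """Width-reduced MWU (Algorithm 5); `oracle(A, M, N, c, w, zeta, p)` implements Algorithm 4."""
    m1, n = N.shape
    rho = m1 ** ((p ** 2 - 4 * p + 2) / (p * (3 * p - 2)))                 # width parameter
    beta = 3 ** (p - 1) * m1 ** ((p - 2) / (3 * p - 2))                    # resistance threshold
    alpha = 3 ** (-(p - 1) / p) / p * m1 ** (-(p ** 2 - 5 * p + 2) / (p * (3 * p - 2)))  # step size
    tau = 3 ** p * m1 ** ((p - 1) * (p - 2) / (3 * p - 2))                 # l_p threshold
    T = int(np.ceil(m1 ** (1.0 / p) / alpha))
    w = np.ones(m1)
    x = np.zeros(n)
    i = 0
    while i < T:
        delta = oracle(A, M, N, c, w, zeta, p)
        nd = N @ delta
        if np.sum(np.abs(nd) ** p) <= tau * zeta:          # primal step
            w = w + alpha * np.abs(nd) / zeta ** (1.0 / p)
            x = x + delta
            i += 1
        else:                                              # width reduction step
            r = w ** (p - 2)
            wide = (np.abs(nd) >= rho * zeta ** (1.0 / p)) & (r <= beta)
            w[wide] *= 2 ** (1.0 / (p - 2))
    return x / T
```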
Notation

We will use Δ{{{\Delta^{\star}}}} to denote the optimum of (9). Since we assume that the optimum value of (9) is at most ζ\zeta,

Δ𝑴𝑴Δζ and 𝑵Δppζ{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{{{\Delta^{\star}}}}\leq\zeta\quad\text{ and }\quad\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\Delta^{*}}\right\rVert_{p}^{p}\leq\zeta (10)

3.1.5 Analysis of Algorithm 5

Our analysis is based on tracking the following two potential functions. We will show how these potentials change with a primal step (Line 13) and a width reduction step (Line 18) in the algorithm. The proofs of these lemmas appear later in the section.

Φ(𝒘(i))=def𝒘pp\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}
Ψ(𝒓)=defminΔ:𝑨Δ=𝒄m1p2pΔ𝑴~𝑴~Δ+13p2e𝒓e(𝑵Δ)e2.\Psi(\boldsymbol{\mathit{r}})\stackrel{{\scriptstyle\mathrm{def}}}{{=}}\min_{\Delta:\boldsymbol{\mathit{A}}\Delta=\boldsymbol{\mathit{c}}}m_{1}^{\frac{p-2}{p}}{\Delta}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}\Delta+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)^{2}_{e}.

Finally, to prove our runtime bound, we will first show that if the total number of width reduction steps KK is not too large, then Φ\Phi is bounded. We then prove that the number of width reduction steps cannot be too large by using the relation between Φ\Phi and Ψ\Psi and their respective changes throughout the algorithm.

We now begin our analysis. The next two lemmas show how our potentials change with every iteration of the algorithm.

Lemma 3.3.

After ii primal steps, and kk width-reduction steps, provided ppαpτpαm1p1pp^{p}\alpha^{p}\tau\leq p\alpha m_{1}^{\frac{p-1}{p}}, the potential Φ\Phi is bounded as follows:

Φ(𝒘(i,k))(2αi+m11/p)p(1+2pp2ρ2m12/pβ2p2)k.\displaystyle\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)\leq\mathopen{}\mathclose{{}\left(2\alpha i+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}.
Lemma 3.4.

After ii primal steps and kk width reduction steps, if,

  1. 1.

    τ2/pζ2/p43p2Ψ(𝒓)β\tau^{2/p}\zeta^{2/p}\geq 4\cdot 3^{p-2}\frac{\Psi(\boldsymbol{\mathit{r}})}{\beta}, and

  2. 2.

    τζ2/p23p2Ψ(𝒓)ρp2\tau\zeta^{2/p}\geq 2\cdot 3^{p-2}\Psi(\boldsymbol{\mathit{r}})\rho^{p-2},

then,

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))+k4τ2/pζ2/p.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)}+\frac{k}{4}\cdot\tau^{2/p}\zeta^{2/p}.

The next lemma gives an upper bound on the energy \Psi at each step in terms of the potential \Phi.

Lemma 3.5.

Let ii denote the number of primal steps and kk the number of width reduction steps. For any i,k0i,k\geq 0, we have,

Ψ(𝒓(i,k))ζ2/p(m1p2p+13p2Φ(i,k)p2p).\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)\leq\zeta^{2/p}\mathopen{}\mathclose{{}\left(m_{1}^{\frac{p-2}{p}}+\frac{1}{3^{p-2}}\Phi(i,k)^{\frac{p-2}{p}}}\right).

3.1.6 Proof of Theorem 3.2

Proof.

Let 𝒙T\frac{\boldsymbol{\mathit{x}}}{T} be the solution returned by Algorithm 5. We first note that this satisfies the linear constraint required. We next bound the objective value at 𝒙T\frac{\boldsymbol{\mathit{x}}}{T}, i.e., 1T2𝒙𝑴𝑴𝒙\frac{1}{T^{2}}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}} and 1Tp𝑵𝒙pp\frac{1}{T^{p}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

Suppose the algorithm terminates in T=α1m11/pT=\alpha^{-1}m_{1}^{1/p} primal steps and K2pp2ρ2m12/pβ2p2K\leq 2^{-\frac{p}{p-2}}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}} width reduction steps. We next note that our parameter values α\alpha and τ\tau are such that ppαpτpαm1p1pp^{p}\alpha^{p}\tau\leq p\alpha m_{1}^{\frac{p-1}{p}}. We can now apply Lemma 3.3 to get,

Φ(𝒘(T,K))3pm1e1=e3pm1\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(T,K)}}\right)\leq 3^{p}m_{1}e^{1}=e\cdot 3^{p}m_{1}

We next observe from the weight and \boldsymbol{\mathit{x}} update steps in our algorithm that \zeta^{1/p}m_{1}^{-1/p}\boldsymbol{\mathit{w}}^{(T,K)}\geq|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|/T, since \alpha^{-1}=Tm_{1}^{-1/p}. Thus,

\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\frac{\boldsymbol{\mathit{x}}}{T}}\right\rVert_{p}^{p}\leq\frac{\zeta}{m_{1}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}^{(T,K)}}\right\rVert_{p}^{p}=\frac{\zeta}{m_{1}}\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(T,K)}}\right)\leq e\cdot 3^{p}\zeta.

We next bound the quadratic term. Let Δ~(t){\widetilde{{\Delta}}}^{(t)} denote the solution returned by the oracle in iteration tt. Since Φe3pm1\Phi\leq e\cdot 3^{p}m_{1} for all iterations, we always have from Lemma 3.5 that, Ψ(𝒓)4m1p2pζ2/p\Psi(\boldsymbol{\mathit{r}})\leq 4m_{1}^{\frac{p-2}{p}}\zeta^{2/p}. We will first bound (Δ~(t))𝑴𝑴Δ~(t)\mathopen{}\mathclose{{}\left({\widetilde{{\Delta}}}^{(t)}}\right)^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}^{(t)} for every tt.

(Δ~(t))𝑴𝑴Δ~(t)=ζp2p(Δ~(t))𝑴~𝑴~Δ~(t)ζp2pm1p2pΨ(𝒓)4ζ.\mathopen{}\mathclose{{}\left({\widetilde{{\Delta}}}^{(t)}}\right)^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}^{(t)}=\zeta^{\frac{p-2}{p}}\mathopen{}\mathclose{{}\left({\widetilde{{\Delta}}}^{(t)}}\right)^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}^{\top}{\widetilde{\boldsymbol{\mathit{M}}}}{\widetilde{{\Delta}}}^{(t)}\leq\zeta^{\frac{p-2}{p}}m_{1}^{-\frac{p-2}{p}}\Psi(\boldsymbol{\mathit{r}})\leq 4\zeta.

Now from convexity of 𝒙22\|\boldsymbol{\mathit{x}}\|_{2}^{2}, we get

𝑴𝒙T221T2Tt𝑴Δ~(t)224ζ.\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{M}}\frac{\boldsymbol{\mathit{x}}}{T}}\right\rVert_{2}^{2}\leq\frac{1}{T^{2}}\cdot T\sum_{t}\|\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}^{(t)}\|_{2}^{2}\leq 4\zeta.

We have shown that if the number of width reduction steps is bounded by KK then our algorithm returns the required solution. We will next prove that we cannot have more than KK width reduction steps.

Suppose to the contrary that the algorithm takes a width reduction step starting from step (i,k) where i<T and k=2^{-\frac{p}{p-2}}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}. Since the conditions for Lemma 3.3 hold for all preceding steps, we must have \Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)\leq e\cdot 3^{p}m_{1}, which combined with Lemma 3.5 implies \Psi\leq 4m_{1}^{\frac{p-2}{p}}\zeta^{2/p}. Using this bound on \Psi, we note that our parameter values satisfy the conditions of Lemma 3.4. From Lemma 3.4,

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))+14τ2/pζ2/pk.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)}+\frac{1}{4}\tau^{2/p}\zeta^{2/p}k.

Since our parameter choices ensure τ2/pk>14m1\tau^{2/p}k>\frac{1}{4}m_{1},

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))>m116ζ2/p.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}-{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)}>\frac{m_{1}}{16}\zeta^{2/p}.

Since Φ(𝒘(i,k))O(3p)m1\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right)\leq O(3^{p})m_{1} and Ψ0\Psi\geq 0, from Lemma 3.5,

Ψ(𝒓(i,k+1))Ψ(𝒓(0,0))4m1p2pζ2/p,\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)-\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(0,0)}}}\right)\leq 4m_{1}^{\frac{p-2}{p}}\zeta^{2/p},

which is a contradiction. We can thus conclude that we can never have more than K=2pp2ρ2m12/pβ2p2K=2^{\frac{-p}{p-2}}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}} width reduction steps, thus concluding the correctness of the returned solution. We next bound the number of oracle calls required. The total number of iterations is at most,

T+Kα1m11/p+2p/(p2)ρ2m12/pβ2p2O(pm1p23p2).T+K\leq\alpha^{-1}m_{1}^{1/p}+2^{-p/(p-2)}\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}\leq O\mathopen{}\mathclose{{}\left(pm_{1}^{\frac{p-2}{3p-2}}}\right).

3.1.7 Proof of Lemma 3.3

We first prove a simple lemma about the solution Δ~{\widetilde{{\Delta}}} returned by the oracle, that we will use in our proof.

Lemma 3.6.

Let p2p\geq 2. For any 𝐰\boldsymbol{\mathit{w}}, let Δ~{\widetilde{{\Delta}}} be the solution returned by Algorithm 4. Then,

\sum_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}\leq\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}\leq\zeta^{\frac{2}{p}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}
Proof.

Since Δ~{\widetilde{{\Delta}}} is the solution returned by Algorithm 4, and Δ{{{\Delta^{\star}}}} satisfies the constraints of the oracle, we have,

e𝒓e(𝑵Δ~)e2e𝒓e(𝑵Δ)e2\displaystyle\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}\leq\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta^{*})_{e}^{2} =e𝒘ep2(𝑵Δ)e2ζ2/p𝒘pp2.\displaystyle=\sum_{e}\boldsymbol{\mathit{w}}_{e}^{p-2}(\boldsymbol{\mathit{N}}\Delta^{*})_{e}^{2}\leq\zeta^{2/p}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert^{p-2}_{p}.

In the last inequality we use,

\displaystyle\sum_{e}\boldsymbol{\mathit{w}}_{e}^{p-2}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})_{e}^{2} \displaystyle\leq\mathopen{}\mathclose{{}\left(\sum_{e}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})_{e}^{2\cdot\frac{p}{2}}}\right)^{2/p}\mathopen{}\mathclose{{}\left(\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{w}}_{e}}\right|^{(p-2)\cdot\frac{p}{p-2}}}\right)^{(p-2)/p}
\displaystyle=\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{2}\|\boldsymbol{\mathit{w}}\|_{p}^{p-2}
\displaystyle\leq\zeta^{2/p}\|\boldsymbol{\mathit{w}}\|_{p}^{p-2},\text{ since }\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{N}}\Delta^{*}}\right\rVert^{p}_{p}\leq\zeta.

Finally, using 𝒓e1,\boldsymbol{\mathit{r}}_{e}\geq 1, we have e(𝑵Δ)e2e𝒓e(𝑵Δ)e2,\sum_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\leq\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}, concluding the proof. ∎


Proof.

We prove this claim by induction. Initially, i=k=0,i=k=0, and Φ(𝒘(0,0))=m1,\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(0,0)}}\right)=m_{1}, and thus, the claim holds trivially. Assume that the claim holds for some i,k0.i,k\geq 0. We will use Φ\Phi as an abbreviated notation for Φ(𝒘(i,k))\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i,k)}}\right) below.

Primal Step.

For brevity, we use 𝒘\boldsymbol{\mathit{w}} to denote 𝒘(i,k)\boldsymbol{\mathit{w}}^{(i,k)}. If the next step is a primal step,

Φ(𝒘(i+1,k))=\displaystyle\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i+1,k)}}\right)= 𝒘(i,k)+α|𝑵Δ~|ζ1/ppp\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}^{(i,k)}+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|}{\zeta^{1/p}}}\right\rVert_{p}^{p}
\displaystyle\leq 𝒘pp+ζ1/pα|(𝑵Δ~)||𝒘pp|+2p2α2ζ2/pe|𝒘e|p2|𝑵Δ~|e2+αpppζ1𝑵Δ~pp\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}+\zeta^{-1/p}\alpha\mathopen{}\mathclose{{}\left|(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})}\right|^{\top}\mathopen{}\mathclose{{}\left|\nabla\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right|+2p^{2}\alpha^{2}\zeta^{-2/p}\sum_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}^{2}+\alpha^{p}p^{p}\zeta^{-1}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}
by Lemma 2.5

We next bound \mathopen{}\mathclose{{}\left|(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})}\right|^{\top}\mathopen{}\mathclose{{}\left|\nabla\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right| by \zeta^{1/p}p\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-1}. Using the Cauchy–Schwarz inequality,

(e|𝑵Δ~|e|e𝒘pp|)2=\displaystyle\mathopen{}\mathclose{{}\left(\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}\mathopen{}\mathclose{{}\left|\nabla_{e}\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right|}\right)^{2}= p2(e|𝑵Δ~|e|𝒘e|p2|𝒘e|)2\displaystyle p^{2}\mathopen{}\mathclose{{}\left(\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{w}}_{e}}\right|}\right)^{2}
\displaystyle\leq p2(e|𝒘e|p2𝒘e2)(e|𝒘e|p2(𝑵Δ~)e2)\displaystyle p^{2}\mathopen{}\mathclose{{}\left(\sum_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}\boldsymbol{\mathit{w}}_{e}^{2}}\right)\mathopen{}\mathclose{{}\left(\sum_{e}|\boldsymbol{\mathit{w}}_{e}|^{p-2}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}}\right)
=\displaystyle= p2𝒘ppe𝒓e(𝑵Δ~)e2\displaystyle p^{2}\|\boldsymbol{\mathit{w}}\|_{p}^{p}\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}
\displaystyle\leq p2𝒘p2p2ζ2/p, From Lemma 3.6.\displaystyle p^{2}\|\boldsymbol{\mathit{w}}\|_{p}^{2p-2}\zeta^{2/p},\text{ From Lemma \ref{lem:Oracle}.}

We thus have,

e|𝑵Δ~|e|e𝒘pp|\displaystyle\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right|_{e}\mathopen{}\mathclose{{}\left|\nabla_{e}\|\boldsymbol{\mathit{w}}\|_{p}^{p}}\right| p𝒘pp1ζ1/p.\displaystyle\leq p\|\boldsymbol{\mathit{w}}\|_{p}^{p-1}\zeta^{1/p}.

Using the above bound, we now have,

Φ(𝒘(i+1,k))\displaystyle\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i+1,k)}}\right)\leq 𝒘pp+pα𝒘pp1+2p2α2𝒘pp2+ppαp𝑵Δ~pp\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}+p\alpha\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-1}+2p^{2}\alpha^{2}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}+p^{p}\alpha^{p}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}
\displaystyle\leq 𝒘pp+pα𝒘pp1+2p2α2𝒘pp2+pαm1p1p,\displaystyle\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p}+p\alpha\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-1}+2p^{2}\alpha^{2}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}+p\alpha m_{1}^{\frac{p-1}{p}},
(since ppαpτpαm1p1pp^{p}\alpha^{p}\tau\leq p\alpha m_{1}^{\frac{p-1}{p}})

Recall 𝒘pp=Φ(𝒘).\|\boldsymbol{\mathit{w}}\|_{p}^{p}=\Phi(\boldsymbol{\mathit{w}}). Since Φm1\Phi\geq m_{1}, we have,

Φ(𝒘(i+1,k))Φ(𝒘)+pαΦ(𝒘)p1p+2p2α2Φ(𝒘)p2p+pαΦ(𝒘)p1p(Φ(𝒘)1/p+2α)p.\Phi\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}^{(i+1,k)}}\right)\leq\Phi(\boldsymbol{\mathit{w}})+p\alpha\Phi(\boldsymbol{\mathit{w}})^{\frac{p-1}{p}}+2p^{2}\alpha^{2}\Phi(\boldsymbol{\mathit{w}})^{\frac{p-2}{p}}+p\alpha\Phi(\boldsymbol{\mathit{w}})^{\frac{p-1}{p}}\leq(\Phi(\boldsymbol{\mathit{w}})^{1/p}+2\alpha)^{p}.

From the inductive assumption, we have

Φ(𝒘)\displaystyle\Phi(\boldsymbol{\mathit{w}}) (2αi+m11/p)p(1+2pp2ρ2m12/pβ2p2)k.\displaystyle\leq\mathopen{}\mathclose{{}\left({2\alpha i}+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}.

Thus,

Φ(i+1,k)(Φ(𝒘)1/p+2α)p(2α(i+1)+m11/p)p(1+2pp2ρ2m12/pβ2p2)k\Phi(i+1,k)\leq(\Phi(\boldsymbol{\mathit{w}})^{1/p}+2\alpha)^{p}\leq\mathopen{}\mathclose{{}\left({2\alpha(i+1)}+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}

proving the inductive claim.

Width Reduction Step.

Let Δ~{\widetilde{{\Delta}}} be the solution returned by the oracle and HH denote the set of indices jj such that |𝑵Δ~|jρζ1/p|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}|_{j}\geq\rho\zeta^{1/p} and 𝒓jβ\boldsymbol{\mathit{r}}_{j}\leq\beta, i.e., the set of indices on which the algorithm performs width reduction. We have the following:

\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}\leq\rho^{-2}\zeta^{-2/p}\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{j}^{2}\leq\rho^{-2}\zeta^{-2/p}\sum_{j}\boldsymbol{\mathit{r}}_{j}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{j}^{2}\leq\rho^{-2}\|\boldsymbol{\mathit{w}}\|_{p}^{p-2}\leq\rho^{-2}\Phi^{\frac{p-2}{p}},

where we use Lemma 3.6 for the second last inequality. Also,

\displaystyle\Phi(\boldsymbol{\mathit{w}}^{(i,k+1)}) \displaystyle\leq\Phi+\sum_{j\in H}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{w}}_{j}^{(i,k+1)}}\right|^{p}\leq\Phi+2^{\frac{p}{p-2}}\sum_{j\in H}|\boldsymbol{\mathit{w}}_{j}|^{p}\leq\Phi+2^{\frac{p}{p-2}}\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}^{\frac{p}{p-2}}
Φ+2pp2(jH𝒓j)(maxjH𝒓j)pp21Φ+2pp2ρ2Φp2pβ2p2.\displaystyle\leq\Phi+2^{\frac{p}{p-2}}\mathopen{}\mathclose{{}\left(\sum_{j\in H}\boldsymbol{\mathit{r}}_{j}}\right)\mathopen{}\mathclose{{}\left(\max_{j\in H}\boldsymbol{\mathit{r}}_{j}}\right)^{\frac{p}{p-2}-1}\leq\Phi+2^{\frac{p}{p-2}}\rho^{-2}\Phi^{\frac{p-2}{p}}\beta^{\frac{2}{p-2}}.

Again, since Φ(𝒘)m1\Phi(\boldsymbol{\mathit{w}})\geq m_{1},

Φ(𝒘(i,k+1))Φ(1+2pp2ρ2m12pβ2p2)(2αi+m11/p)p(1+2pp2ρ2m12/pβ2p2)k\Phi(\boldsymbol{\mathit{w}}^{(i,k+1)})\leq\Phi\mathopen{}\mathclose{{}\left(1+2^{\frac{p}{p-2}}\rho^{-2}m_{1}^{-\frac{2}{p}}\beta^{\frac{2}{p-2}}}\right)\leq\mathopen{}\mathclose{{}\left(2\alpha i+m_{1}^{\nicefrac{{1}}{{p}}}}\right)^{p}\mathopen{}\mathclose{{}\left(1+\frac{2^{\frac{p}{p-2}}}{\rho^{2}m_{1}^{2/p}\beta^{-\frac{2}{p-2}}}}\right)^{k}

proving the inductive claim. ∎

3.1.8 Proof of Lemma 3.4


Proof.

It will be helpful for our analysis to split the index set into three disjoint parts:

  • S={e:|𝑵Δe|ρζ1/p}S=\mathopen{}\mathclose{{}\left\{e:\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|\leq\rho\zeta^{1/p}}\right\}

  • H={e:|𝑵Δe|>ρζ1/p and 𝒓eβ}H=\mathopen{}\mathclose{{}\left\{e:\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|>\rho\zeta^{1/p}\text{ and }\boldsymbol{\mathit{r}}_{e}\leq\beta}\right\}

  • B={e:|𝑵Δe|>ρζ1/p and 𝒓e>β}B=\mathopen{}\mathclose{{}\left\{e:\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|>\rho\zeta^{1/p}\text{ and }\boldsymbol{\mathit{r}}_{e}>\beta}\right\}.

Firstly, we note

eS|𝑵Δ|epρp2ζp2peS|𝑵Δ|e2ρp2ζp2peS𝒓e|𝑵Δ|e2ρp2ζp2p3p2Ψ(𝒓).\displaystyle\sum_{e\in S}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}\leq\rho^{p-2}\zeta^{\frac{p-2}{p}}\sum_{e\in S}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{2}\leq\rho^{p-2}\zeta^{\frac{p-2}{p}}\sum_{e\in S}\boldsymbol{\mathit{r}}_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{2}\leq\rho^{p-2}\zeta^{\frac{p-2}{p}}3^{p-2}\Psi(\boldsymbol{\mathit{r}}).

Hence, using Assumption 2,

eHB|𝑵Δ|epe|𝑵Δ|epeS|𝑵Δ|epτζρp2ζp2p3p2Ψ(𝒓)12τζ.\displaystyle\sum_{e\in H\cup B}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}\geq\sum_{e}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}-\sum_{e\in S}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}^{p}\geq\tau\zeta-\rho^{p-2}\zeta^{\frac{p-2}{p}}3^{p-2}\Psi(\boldsymbol{\mathit{r}})\geq\frac{1}{2}\tau\zeta.

This means,

eHB(𝑵Δ)e2(eHB|𝑵Δ|ep)2/pτ2/pζ2/p2.\sum_{e\in H\cup B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\geq\mathopen{}\mathclose{{}\left(\sum_{e\in H\cup B}|\boldsymbol{\mathit{N}}\Delta|_{e}^{p}}\right)^{2/p}\geq\frac{\tau^{2/p}\zeta^{2/p}}{2}.

Secondly we note that,

eB(𝑵Δ)e2β1eB𝒓e(𝑵Δ)e2β13p2Ψ(𝒓).\sum_{e\in B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\leq\beta^{-1}\sum_{e\in B}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\leq\beta^{-1}3^{p-2}\Psi(\boldsymbol{\mathit{r}}).

So then, using Assumption 1,

eH(𝑵Δ)e2=eHB(𝑵Δ)e2eB(𝑵Δ)e2τ2/pζ2/p2β13p2Ψ(𝒓)τ2/pζ2/p4.\displaystyle\sum_{e\in H}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}=\sum_{e\in H\cup B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}-\sum_{e\in B}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\geq\frac{\tau^{2/p}\zeta^{2/p}}{2}-\beta^{-1}3^{p-2}\Psi(\boldsymbol{\mathit{r}})\geq\frac{\tau^{2/p}\zeta^{2/p}}{4}.

As 𝒓e1\boldsymbol{\mathit{r}}_{e}\geq 1, this implies eH𝒓e(𝑵Δ)e2τ2/pζ2/p4\sum_{e\in H}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}\geq\frac{\tau^{2/p}\zeta^{2/p}}{4}. We note that in a width reduction step, the resistances change by a factor of 2. Thus, combining our last two observations, and applying Lemma C.1, we get

Ψ(𝒓(i,k+1))Ψ(𝒓(i,k))+14τ2/pζ2/p.{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)}+\frac{1}{4}\tau^{2/p}\zeta^{2/p}.

Finally, for the “primal step” case, we use the trivial bound from Lemma C.1, ignoring the second term,

Ψ(𝒓(i,k+1))Ψ(𝒓(i,k)).{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k+1)}}}\right)}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)}.

3.1.9 Proof of Lemma 3.5


Proof.

Lemma 3.6 implies that,

Ψ(𝒓(i,k))\displaystyle{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}^{(i,k)}}}\right)} =ζ(p2)/pm1p2pΔ~𝑴𝑴Δ~+13p2e𝒓e(𝑵Δ~)e2\displaystyle=\zeta^{-(p-2)/p}m_{1}^{\frac{p-2}{p}}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{\widetilde{{\Delta}}}+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}
ζ(p2)/pm1p2pΔ𝑴𝑴Δ+13p2e𝒓e(𝑵Δ)e2\displaystyle\leq\zeta^{-(p-2)/p}m_{1}^{\frac{p-2}{p}}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}{{{\Delta^{\star}}}}+\frac{1}{3^{p-2}}\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}})_{e}^{2}
ζ2/pm1p2p+ζ2/p13p2𝒘pp2\displaystyle\leq\zeta^{2/p}m_{1}^{\frac{p-2}{p}}+\zeta^{2/p}\frac{1}{3^{p-2}}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{w}}}\right\rVert_{p}^{p-2}
ζ2/pm1p2p+ζ2/p13p2Φ(i,k)p2p.\displaystyle\leq\zeta^{2/p}m_{1}^{\frac{p-2}{p}}+\zeta^{2/p}\frac{1}{3^{p-2}}\Phi(i,k)^{\frac{p-2}{p}}.

3.2 Complete Algorithm for p\ell_{p}-Regression

Recall our problem, (1),

min𝑨𝒙=𝒃𝒇(𝒙)=𝒅𝒙+𝑴𝒙22+𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

We will now use all the tools and algorithms described so far to give a complete algorithm for the above problem. We will assume we have a starting solution \boldsymbol{\mathit{x}}^{(0)} satisfying \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}; for purely \ell_{p} objectives, we will use the homotopy analysis from Section 2.2 to obtain such a starting solution.

Our overall algorithm reduces the problem to approximately solving the residual problem (Definition 2.3). In Sections 3.1.1 and 3.1.2, we give an algorithm to solve the residual problem by first doing a binary search on the linear term and then applying a multiplicative weight update routine to solve the resulting subproblems. We have the following result, which follows from Lemma 3.1 and Theorem 3.2.

Corollary 3.7.

Consider the residual problem at iteration t of Algorithm 1. Algorithm 3 using Algorithm 5 as a subroutine finds an O(1)-approximate solution to the corresponding residual problem in O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p}\right) calls to a linear system solver.

Proof.

Let \nu be such that \boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\in(\nu/2,\nu]. Refer to Lemma 2.7 to see that this is the case in which we use the solution of the residual problem. Now, from Lemma 2.6 we know that the optimum of the residual problem satisfies \boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\nu/32p,\nu]. Since we vary \zeta over all such values in the range (\nu/16p,\nu], for one such \zeta we must have \boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\zeta/2,\zeta]. For such a \zeta, consider Problem (8). Using Algorithm 5 for this problem, from Theorem 3.2 we are guaranteed to find a solution {\widetilde{{\Delta}}} such that {\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}\leq O(1)\zeta and \|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq O(3^{p})\zeta. Now from Lemma 3.1, we note that {\widetilde{{\Delta}}} is an O(1)-approximate solution to the residual problem. Since Algorithm 5 requires O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}}\right) calls to a linear system solver, and Algorithm 3 calls it O(\log p) times, we obtain the required runtime. ∎

We are now ready to prove our main result.

Theorem 3.8.

Let p\geq 2, and \kappa\geq 1. Let the initial solution \boldsymbol{\mathit{x}}^{(0)} satisfy \boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}}. Algorithm 1 using Algorithm 3 as a subroutine returns an \epsilon-approximate solution \boldsymbol{\mathit{x}} to (1) in at most O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log p\log\mathopen{}\mathclose{{}\left(\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}}\right)}\right) calls to a linear system solver.

Proof.

Follows from Theorem 2.1 and Corollary 3.7. ∎
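To summarize the overall structure, the following is a rough Python sketch of the complete method. It is a simplified illustration only: it omits the bookkeeping of the parameter \nu from Algorithm 1 and assumes hypothetical callables `f` (evaluating the objective (1)) and `residual_solver(x)` (e.g., Algorithm 3 built on Algorithm 5) returning an approximate solution to the residual problem at x; the update x \leftarrow x - \widetilde{\Delta}/p mirrors Line 7 of Algorithm 3.

```python
def lp_regression(x0, f, residual_solver, p, num_iters):
    """Simplified outer loop: iterative refinement via approximate residual solves.

    Each successful iteration shrinks f(x) - f(x*) by a multiplicative factor
    (Theorem 2.1), so num_iters ~ O(p * log((f(x0) - f(x*)) / eps)) suffices.
    """
    x = x0
    for _ in range(num_iters):
        delta = residual_solver(x)        # approximate maximizer of res_p at x
        candidate = x - delta / p         # the step used in Algorithm 3, Line 7
        if f(candidate) < f(x):           # keep the step only if it improves the objective
            x = candidate
    return x
```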

3.3 Complete Algorithm for Pure p\ell_{p} Objectives

Consider the special case when our problem is only the p\ell_{p}-norm, i.e., Problem (3),

min𝑨𝒙=𝒃𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

In Section 2.2, we described how to find a good starting point for such problems. Combining that procedure with our algorithm for solving the residual problem, we obtain a complete algorithm for finding a good starting point. Specifically, we prove the following result.

Corollary 3.9.

Algorithm 2 using Algorithm 3 returns 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} such that 𝐀𝐱(0)=𝐛\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(0)}=\boldsymbol{\mathit{b}} and 𝐍𝐱(0)ppO(m)𝐍𝐱pp\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{(0)}\|_{p}^{p}\leq O(m)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p} in O(p2mp23p2log2plogm)O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log^{2}p\log m}\right) calls to a linear system solver.

Proof.

From Lemma 2.9 Algorithm 2 finds such a solution in time O(plogm)k=2i,i=2i=logp1κkT(k,κk)O\mathopen{}\mathclose{{}\left(p\log m}\right)\sum_{k=2^{i},i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}), where κk\kappa_{k} and T(k,κk)T(k,\kappa_{k}) denote the approximation and time to solve a k\ell_{k} norm problem. Now consider Algorithm 3 with Algorithm 5 as a subroutine. From Corollary 3.7, we can solve any k\ell_{k}-norm residual problem to a O(1)O(1)-approximation in O(kmk23k2logk)O\mathopen{}\mathclose{{}\left(km^{\frac{k-2}{3k-2}}\log k}\right) calls to a linear system solver. We thus have κk=O(1)\kappa_{k}=O(1) for all kk and T(k,κk)=O(kmk23k2logk)O(pmp23p2logp)T(k,\kappa_{k})=O\mathopen{}\mathclose{{}\left(km^{\frac{k-2}{3k-2}}\log k}\right)\leq O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p}\right). Using these values, we obtain a runtime of,

O(plogm)k=2i,i=2i=logp1κkT(k,κk)O(plogm)logpO(pmp23p2logp)O(p2mp23p2log2plogm).O\mathopen{}\mathclose{{}\left(p\log m}\right)\sum_{k=2^{i},i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k})\leq O\mathopen{}\mathclose{{}\left(p\log m}\right)\cdot\log p\cdot O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p}\right)\leq O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log^{2}p\log m}\right).
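
To make the homotopy schedule concrete, the following schematic Python sketch traces the doubling of the norm parameter; solve_to_constant_approx is a hypothetical stand-in for running Algorithm 1 with the stated norm to an O(1)-approximation, and is not part of the paper's pseudocode.

def lp_starting_point(solve_to_constant_approx, x2, p):
    """Schematic sketch of the homotopy for a pure l_p objective: start from
    the l_2 minimizer x2 and solve l_k-norm problems for k = 4, 8, ...,
    doubling k (i.e., k = 2^i for i = 2, ..., floor(log p) - 1).
    solve_to_constant_approx(x, k) is a hypothetical oracle standing in for
    Algorithm 1 run with norm k to an O(1)-approximation, warm-started at x."""
    x, k = x2, 4
    while 2 * k <= p:               # stop once k reaches roughly p/2
        x = solve_to_constant_approx(x, k)
        k *= 2
    return x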

The following corollary gives the complete runtime for pure p\ell_{p} objectives.

Corollary 3.10.

Let p2p\geq 2, and κ1\kappa\geq 1. Let 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} be the solution returned by Algorithm 2. Algorithm 1 using Algorithm 3 as a subroutine returns 𝐱\boldsymbol{\mathit{x}} such that 𝐀𝐱=𝐛\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}} and 𝐍𝐱pp(1+ϵ)𝐍𝐱pp\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq(1+\epsilon)\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}\|_{p}^{p}, in at most O(p2mp23p2log2plog(mϵ))O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{3p-2}}\log^{2}p\log\mathopen{}\mathclose{{}\left(\frac{m}{\epsilon}}\right)}\right) calls to a linear system solver.

Proof.

Follows directly from Corollary 3.9 and Theorem 3.8. ∎

4 Solving pp-norm Problems using qq-norm Oracles

In this section, we propose a new technique that allows us to solve p\ell_{p}-norm residual problems by instead solving an q\ell_{q}-norm residual problem, without adding much to the runtime. No such reduction is known for pure p\ell_{p} objectives without a large overhead in the runtime. As a consequence, we also obtain an algorithm for p\ell_{p}-regression with a linear runtime dependence on pp instead of the p2p^{2} dependence of the algorithms from the previous sections. One factor of pp in that p2p^{2} dependence came from solving the pp-norm residual problem; at a high level, we show that when pp is large it is sufficient to solve a logm\log m-norm residual problem instead, thus replacing that pp factor with logm\log m. We prove the following results, which are based on the proofs and results of [AS20].

Theorem 4.1.

Let ϵ>0\epsilon>0, 2ppoly(m)2\leq p\leq poly(m) and consider an instance of Problem (1),

min𝑨𝒙=𝒃𝒇(𝒙)=𝒅𝒙+𝑴𝒙22+𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}})=\boldsymbol{\mathit{d}}^{\top}\boldsymbol{\mathit{x}}+\|\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}\|_{2}^{2}+\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

Algorithm 6 finds an ϵ\epsilon-approximate solution to (1) in O(pmp23p2logplogmlog𝐟(𝐱(0))𝐟(𝐱)ϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p\log m\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}}\right) calls to a linear system solver.

Theorem 4.2.

Let ϵ>0\epsilon>0, 2ppoly(m)2\leq p\leq poly(m) and consider a pure p\ell_{p} instance,

min𝑨𝒙=𝒃𝑵𝒙pp.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\quad\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}.

Let 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} be the output of Algorithm 2. Algorithm 6 using 𝐱(0)\boldsymbol{\mathit{x}}^{(0)} as a starting solution finds 𝐱\boldsymbol{\mathit{x}} such that 𝐀𝐱=𝐛\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}} and 𝐍𝐱pp(1+ϵ)min𝐀𝐱=𝐛𝐍𝐱pp\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}\leq(1+\epsilon)\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p} in O(pmp23p2log2plogmlogmϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log^{2}p\log m\log\frac{m}{\epsilon}}\right) calls to a linear system solver.

4.1 Relation between Residual Problems for p\ell_{p} and q\ell_{q} Norms

In this section we show how qq-norm residual problems can be used to solve pp-norm residual problems. This idea first appeared in the work of [AS20], where the results are also applied to the maximum flow problem. In this paper, we provide a much simpler proof of the main technical content and unify the cases p<qp<q and p>qp>q, which were presented separately in previous works. We also unify the treatment of the decision versions of the residual problems (without the linear term) with that of the entire objective. The results for the maximum flow problem and the p\ell_{p}-norm flow problem as described in the original paper still follow, and we refer the reader to the original paper for these applications. The main result of the section is as follows.

Theorem 4.3.

Let p,q2p,q\geq 2 and ζ\zeta be such that 𝐫𝐞𝐬p(Δ)(ζ/2,ζ]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\zeta/2,\zeta], where Δ{{{\Delta^{\star}}}} is the optimum of the p\ell_{p}-norm residual problem (Definition 2.3). The following q\ell_{q}-norm residual problem has optimum at least ζ4\frac{\zeta}{4},

max𝑨Δ=0𝒈ΔΔ𝑹Δ14ζ1qpmmin{qp1,0}𝑵Δqq.\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\Delta-\Delta^{\top}\boldsymbol{\mathit{R}}\Delta-\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}\Delta\|_{q}^{q}. (11)

Let β1\beta\geq 1 and Δ~{\widetilde{{\Delta}}} denote a feasible solution to the above q\ell_{q}-norm residual problem with objective value at least ζ16β\frac{\zeta}{16\beta}. For α=1256βmpp1|1p1q|\alpha=\frac{1}{256\beta}m^{-\frac{p}{p-1}\mathopen{}\mathclose{{}\left|\frac{1}{p}-\frac{1}{q}}\right|}, αΔ~\alpha{\widetilde{{\Delta}}} gives a O(β2)mpp1|1p1q|O(\beta^{2})m^{\frac{p}{p-1}\mathopen{}\mathclose{{}\left|\frac{1}{p}-\frac{1}{q}}\right|}-approximate solution to the p\ell_{p}-norm residual problem 𝐫𝐞𝐬p\boldsymbol{res}_{p}.

Proof.

Consider Δ{{{\Delta^{\star}}}}, the optimum of the p\ell_{p}-norm residual problem. Note that λΔ\lambda{{{\Delta^{\star}}}} is a feasible solution for all λ\lambda since 𝑨(λΔ)=0.\boldsymbol{\mathit{A}}(\lambda{{{\Delta^{\star}}}})=0. We know that the objective is optimum for λ=1\lambda=1. Thus,

[ddλ𝒓𝒆𝒔p(λΔ)]λ=1=0,\mathopen{}\mathclose{{}\left[\frac{d}{d\lambda}\boldsymbol{res}_{p}(\lambda{{{\Delta^{\star}}}})}\right]_{\lambda=1}=0,

Since \boldsymbol{res}_{p}(\lambda{{{\Delta^{\star}}}})=\lambda\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-\lambda^{2}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\lambda^{p}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}, differentiating in \lambda and setting \lambda=1 gives us,

𝒈Δ2Δ𝑹Δp𝑵Δpp=0.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-2{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-p\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}=0.

Rearranging,

Δ𝑹Δ+(p1)𝑵Δpp=𝒈ΔΔ𝑹Δ𝑵Δppζ.{}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}+(p-1)\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}=\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}\leq\zeta.

Since p2p\geq 2, 𝑵Δpζ1/p\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}\leq\zeta^{1/p} which implies

𝑵Δq{ζ1/pif, pqm1q1pζ1/potherwise.\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{q}\leq\begin{cases}\zeta^{1/p}&\text{if, $p\leq q$}\\ m^{\frac{1}{q}-\frac{1}{p}}\zeta^{1/p}&\text{otherwise.}\end{cases}

We also note that,

𝒈ΔΔ𝑹Δ>ζ2+𝑵Δpp>ζ2.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}>\frac{\zeta}{2}+\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}>\frac{\zeta}{2}.

Combining these bounds (in both cases m^{\min\{\frac{q}{p}-1,0\}}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{q}^{q}\leq\zeta^{q/p}), we obtain that the optimum of (11) is at least,

𝒈ΔΔ𝑹Δ14ζ1qpmmin{qp1,0}𝑵Δqq>ζ214ζ1qpζq/p>ζ4.\boldsymbol{\mathit{g}}^{\top}{{{\Delta^{\star}}}}-{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{R}}{{{\Delta^{\star}}}}-\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{q}^{q}>\frac{\zeta}{2}-\frac{1}{4}\zeta^{1-\frac{q}{p}}\zeta^{q/p}>\frac{\zeta}{4}.

Since the optimum of (11) is at least ζ/4\zeta/4, there exists a feasible Δ~{\widetilde{{\Delta}}} with objective value at least ζ/16β\zeta/16\beta. We now prove the second part, that a scaling of Δ~{\widetilde{{\Delta}}} gives a good approximation to the p\ell_{p}-norm residual problem. First, let us assume |𝒈Δ~|ζ|\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}|\leq\zeta. Since Δ~{\widetilde{{\Delta}}} has objective value at least ζ/16β\zeta/16\beta,

Δ~𝑹Δ~+14ζ1qpmmin{qp1,0}𝑵Δ~qq𝒈Δ~ζ16βζ.{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}+\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\leq\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-\frac{\zeta}{16\beta}\leq\zeta.

Thus, mmin{1p1q,0}𝑵Δ~q41qζ1pm^{\min\mathopen{}\mathclose{{}\left\{\frac{1}{p}-\frac{1}{q},0}\right\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}\leq 4^{\frac{1}{q}}\zeta^{\frac{1}{p}}, and 𝑵Δ~pp4pqζm|1pq|\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq 4^{\frac{p}{q}}\zeta m^{\mathopen{}\mathclose{{}\left|1-\frac{p}{q}}\right|}. Let Δ¯=αΔ~{\bar{{\Delta}}}=\alpha{\widetilde{{\Delta}}}, where α=1256βmpp1|1p1q|\alpha=\frac{1}{256\beta}m^{-\frac{p}{p-1}\mathopen{}\mathclose{{}\left|\frac{1}{p}-\frac{1}{q}}\right|}. We will show that Δ¯{\bar{{\Delta}}} is a good solution to the p\ell_{p}-norm residual problem.

\displaystyle\boldsymbol{res}_{p}({\bar{{\Delta}}}) \displaystyle=\alpha\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-\alpha{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}-\alpha^{p-1}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}\right)
α(ζ16β1256βζαp14pqζm|1pq|)\displaystyle\geq\alpha\mathopen{}\mathclose{{}\left(\frac{\zeta}{16\beta}-\frac{1}{256\beta}\zeta-\alpha^{p-1}4^{\frac{p}{q}}\zeta m^{\mathopen{}\mathclose{{}\left|1-\frac{p}{q}}\right|}}\right)
α(ζ16βζ256βζ64β)\displaystyle\geq\alpha\mathopen{}\mathclose{{}\left(\frac{\zeta}{16\beta}-\frac{\zeta}{256\beta}-\frac{\zeta}{64\beta}}\right)
α64β𝒓𝒆𝒔p(Δ).\displaystyle\geq\frac{\alpha}{64\beta}\boldsymbol{res}_{p}({{{\Delta^{\star}}}}).

For the case |𝒈Δ~|ζ|\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}|\geq\zeta, consider the vector zΔ~z{\widetilde{{\Delta}}} where z=ζ2|𝒈Δ~|12z=\frac{\zeta}{2|\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}|}\leq\frac{1}{2}. This vector is still feasible for Problem (11) and 𝒈zΔ~=ζ2\boldsymbol{\mathit{g}}^{\top}z{\widetilde{{\Delta}}}=\frac{\zeta}{2}. Moreover, since z\leq\frac{1}{2}, q\geq 2, and the objective value of {\widetilde{{\Delta}}} is nonnegative (so that {\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}+\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\{\frac{q}{p}-1,0\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\leq\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}), we have z^{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}+z^{q}\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\{\frac{q}{p}-1,0\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\leq z^{2}\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}=z\frac{\zeta}{2}\leq\frac{\zeta}{4}. Therefore,

z\boldsymbol{\mathit{g}}^{\top}{\widetilde{{\Delta}}}-z^{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}-z^{q}\frac{1}{4}\zeta^{1-\frac{q}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{q}{p}-1,0}\right\}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{q}^{q}\geq\frac{\zeta}{2}-\frac{\zeta}{4}=\frac{\zeta}{4}.

We can now repeat the same argument as above. ∎

4.2 Faster Algorithm for p\ell_{p}-Regression

In this section, we combine the tools developed in the previous sections with the reduction of Section 4.1 to obtain an algorithm for Problem (1) that requires O(pmp23p2logplogmlog𝒇(𝒙(0))𝒇(𝒙)ϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log p\log m\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}}\right) calls to a linear system solver. For pure p\ell_{p} objectives, we can further combine this algorithm with the algorithm in Section 2.2 to obtain a convergence rate of O(pmp23p2log2plogmlogmϵ)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log^{2}p\log m\log\frac{m}{\epsilon}}\right) linear system solves.

Algorithm 6 Complete Algorithm with Linear pp-dependence
1:procedure p\ell_{p}-Solver(𝑨,𝑴,𝑵,𝒅,𝒃,p,ϵ\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},p,\epsilon)
2:     𝒙𝒙(0)\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}^{(0)}
3:     ν\nu\leftarrow Upper bound on 𝒇(𝒙(0))𝒇(𝒙)\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\triangleright If 𝒇(𝒙)0\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\geq 0, then ν𝒇(𝒙(0))\nu\leftarrow\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})
4:     while ν>ϵ\nu>\epsilon do
5:         if plogmp\geq\log m then
6:              Δ~{\widetilde{{\Delta}}}\leftarrow logm\log m-ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)
7:         else
8:              Δ~{\widetilde{{\Delta}}}\leftarrow ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)          
9:         if 𝒓𝒆𝒔p(Δ~)ν32pκ\boldsymbol{res}_{p}({\widetilde{{\Delta}}})\geq\frac{\nu}{32p\kappa} then
10:              𝒙𝒙Δ~p\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}-\frac{{\widetilde{{\Delta}}}}{p}
11:         else
12:              νν2\nu\leftarrow\frac{\nu}{2}               
13:     return 𝒙\boldsymbol{\mathit{x}}
Algorithm 7 Residual Solver using logm\log m-norm
1:procedure logm\log m-ResidualSolver(𝒙,𝑴,𝑵,𝑨,𝒅,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{d}},\boldsymbol{\mathit{b}},\nu,p)
2:     ζν\zeta\leftarrow\nu
3:     αm1p1\alpha\leftarrow m^{-\frac{1}{p-1}}
4:     (𝒈,𝑹,𝑵)𝒓𝒆𝒔p(\boldsymbol{\mathit{g}},\boldsymbol{\mathit{R}},\boldsymbol{\mathit{N}})\leftarrow\boldsymbol{res}_{p}\triangleright Create residual problem at 𝒙\boldsymbol{\mathit{x}}
5:     while ζ>ν32p\zeta>\frac{\nu}{32p} do
6:         𝑵~141/logmζ1logm1pmmin{1p1logm,0}𝑵\widetilde{\boldsymbol{\mathit{N}}}\leftarrow\frac{1}{4^{1/\log m}}\zeta^{\frac{1}{\log m}-\frac{1}{p}}m^{\min\mathopen{}\mathclose{{}\left\{\frac{1}{p}-\frac{1}{\log m},0}\right\}}\boldsymbol{\mathit{N}}
7:         Δ~ζ{\widetilde{{\Delta}}}_{\zeta}\leftarrow MWU-Solver([𝑨,𝒈],𝑹1/2,𝑵~,[0,ζ2],ζ,logm)\mathopen{}\mathclose{{}\left([\boldsymbol{\mathit{A}},\boldsymbol{\mathit{g}}^{\top}],\boldsymbol{\mathit{R}}^{1/2},\widetilde{\boldsymbol{\mathit{N}}},[0,\frac{\zeta}{2}]^{\top},\zeta,\log m}\right)\triangleright Algorithm 5
8:         ζζ2\zeta\leftarrow\frac{\zeta}{2}      
9:     return αΔ~argminΔ~ζ𝒇(𝒙αΔ~ζp)\alpha{\widetilde{{\Delta}}}\leftarrow\arg\min_{{\widetilde{{\Delta}}}_{\zeta}}\boldsymbol{\mathit{f}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}-\frac{\alpha{\widetilde{{\Delta}}}_{\zeta}}{p}}\right)
Lemma 4.4.

Let poly(m)plogmpoly(m)\geq p\geq\log m. Algorithm 7 returns an O(m1p1)O(m^{\frac{1}{p-1}})-approximate solution to the p\ell_{p}-residual problem 𝐫𝐞𝐬p\boldsymbol{res}_{p} at 𝐱\boldsymbol{\mathit{x}} in at most O(mp23p2logmlogp)O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log m\log p}\right) calls to a linear system solver.

Proof.

Let ν\nu be such that 𝒇(𝒙(t))𝒇(𝒙)(ν/2,ν]\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(t)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})\in(\nu/2,\nu]. Refer to Lemma 2.7 to see that this is the case in which we use the solution of the residual problem. Now, from Lemma 2.6 we know that the optimum of the residual problem satisfies 𝒓𝒆𝒔p(Δ)(ν/32p,ν]\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\nu/32p,\nu]. Since we vary ζ\zeta to take all such values in the range (ν/16p,ν](\nu/16p,\nu] for one such ζ\zeta we must have 𝒓𝒆𝒔p(Δ)(ζ/2,ζ].\boldsymbol{res}_{p}({{{\Delta^{\star}}}})\in(\zeta/2,\zeta]. For such a ζ\zeta, consider the logm\log m-norm residual problem (11). Using Algorithm 5 for this problem, from Theorem 3.2 we are guaranteed to find a solution Δ~{\widetilde{{\Delta}}} such that Δ~𝑹Δ~O(1)ζ{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{R}}{\widetilde{{\Delta}}}\leq O(1)\zeta and 𝑵~Δ~logmlogmO(3p)ζ\|\widetilde{\boldsymbol{\mathit{N}}}{\widetilde{{\Delta}}}\|_{\log m}^{\log m}\leq O(3^{p})\zeta. Now from Lemma 3.1, we note that Δ~{\widetilde{{\Delta}}} is an O(1)O(1)-approximate solution to the logm\log m-residual problem. We now use Theorem 4.3, which states that αΔ~\alpha{\widetilde{{\Delta}}} is a O(m1p1)O\mathopen{}\mathclose{{}\left(m^{\frac{1}{p-1}}}\right)-approximate solution to the required residual problem 𝒓𝒆𝒔p\boldsymbol{res}_{p}.

Since for plogmp\geq\log m, Algorithm 5 requires O(mlogm23logm2logm)O(mp23p2logm)O\mathopen{}\mathclose{{}\left(m^{\frac{\log m-2}{3\log m-2}}\log m}\right)\leq O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log m}\right) calls to a linear system solver, and Algorithm 7 calls this algorithm logp\log p times, we obtain the required runtime. ∎

See 4.1

Proof.

We note that Algorithm 6 is essentially Algorithm 1, calling different residual solvers depending on the value of pp. If plogmp\leq\log m, then from Theorem 3.8 we obtain the required solution in O(p^{2}m^{\frac{p-2}{3p-2}}\log p\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon})\leq O(pm^{\frac{p-2}{3p-2}}\log p\log m\log\frac{\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{(0)})-\boldsymbol{\mathit{f}}(\boldsymbol{\mathit{x}}^{\star})}{\epsilon}) calls to a linear system solver, since p2plogmp^{2}\leq p\log m. If plogmp\geq\log m, from Lemma 4.4, we obtain an O(m1p1)O(m1logm)O(1)O(m^{\frac{1}{p-1}})\leq O(m^{\frac{1}{\log m}})\leq O(1)-approximate solution to the residual problem at any iteration in O(mp23p2logmlogp)O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log m\log p}\right) calls to a linear system solver. Combining this with Theorem 2.1, we obtain our result. ∎

See 4.2

Proof.

From Lemma 2.9 we can find an O(m)O(m)-approximation to the above problem in time

O(plogm)k=2i,i=2i=logp1κkT(k,κk),O(p\log m)\sum_{k=2^{i},i=2}^{i=\lfloor\log p-1\rfloor}\kappa_{k}T(k,\kappa_{k}),

where κk\kappa_{k} is the approximation factor to which we solve the residual problem for the kk-norm problem and T(k,κk)T(k,\kappa_{k}) is the time required to do so. If klogmk\geq\log m, we use Algorithm 7 to solve such residual problems. Thus κk=m1k1m1logmO(1)\kappa_{k}=m^{\frac{1}{k-1}}\leq m^{\frac{1}{\log m}}\leq O(1) and T(k,κk)=O(mp23p2logp)T(k,\kappa_{k})=O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log p}\right). If klogmk\leq\log m, we can use Algorithm 3 and κk=O(1)\kappa_{k}=O(1), T(k,κk)=O(mp23p2logp)T(k,\kappa_{k})=O\mathopen{}\mathclose{{}\left(m^{\frac{p-2}{3p-2}}\log p}\right). Thus, the total runtime is O(pmp23p2logmlog2p)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}\log m\log^{2}p}\right). We now combine this with Theorem 4.1 to obtain the required rates of convergence. ∎

5 Speedups for General Matrices via Inverse Maintenance

Inverse maintenance was first introduced by Vaidya in 1990 [Vai90] for speeding up algorithms for minimum cost and multicommodity flow problems. The key idea is to reuse the inverse of matrices, which is possible due to the controllable rates at which variables are updated in some algorithms. In the work by [Adi+19], the authors design a new inverse maintenance algorithm for p\ell_{p}-regression that can solve p\ell_{p}-regression for any p>2p>2 almost as fast as linear regression. This section is based on Section 6 of [Adi+19] and we give a more fine grained and simplified analysis of the original result. In particular, we simplify the proofs and give the result with explicit dependencies on both matrix dimensions as opposed to just the larger dimension.

Our inverse maintenance procedure is based on the same high-level ideas of combining low-rank updates and matrix multiplication as in [Vai90] and [LS15]. However, recall that the rate of convergence of our algorithm is controlled by two potentials, which change at different rates depending on the two different kinds of weight update steps in our algorithm. In order to handle these updates, our inverse maintenance algorithm uses a new fine-grained bucketing scheme, inspired by lazy updates in data structures; it differs from previous works on inverse maintenance, which usually update weights based on fixed thresholds. Our scheme is also simpler than those used in [Vai90, LS15]. We now present our algorithm in detail.

Consider the weighted linear system being solved at each iteration of Algorithm 5. Each weighted linear system is of the form,

min𝑨𝒙=𝒄𝒙(𝑴𝑴+𝑵𝑹𝑵)𝒙\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{c}}}\boldsymbol{\mathit{x}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)\boldsymbol{\mathit{x}}

where 𝑨d×n,𝑵m1×n,𝑴m2×n\boldsymbol{\mathit{A}}\in\mathbb{R}^{d\times n},\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}. From Equation (15) in Section 3, the solution of the above linear system is given by,

𝒙=(𝑴𝑴+𝑵𝑹𝑵)1𝑨(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝒄.\boldsymbol{\mathit{x}}^{\star}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\boldsymbol{\mathit{c}}.

In order to compute the above expression, we require the following products, in order; a short code sketch assembling them is given after the list. The stated runtimes use the fact that ω2\omega\geq 2.

  • 𝑴𝑴\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}} and 𝑵𝑹𝑵\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}: require time m2nω1m_{2}n^{\omega-1} and m1nω1m_{1}n^{\omega-1} respectively

  • (𝑴𝑴+𝑵𝑹𝑵)1\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}: requires time nωn^{\omega}

  • (𝑴𝑴+𝑵𝑹𝑵)1𝑨\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top} and 𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}: require time n2dω2n^{2}d^{\omega-2}

  • (𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}: requires time dωd^{\omega}

  • (𝑴𝑴+𝑵𝑹𝑵)1𝑨(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}: requires time ndω1nd^{\omega-1}
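
The following is a minimal NumPy sketch assembling these products into a single solve. It is meant only as an illustration of the closed form above (the name solve_weighted_system is ours, and dense, well-conditioned inputs are assumed); it is not the faster implementation developed in this section.

import numpy as np

def solve_weighted_system(A, M, N, r, c):
    """Illustrative solve of min_{Ax = c} x^T (M^T M + N^T Diag(r) N) x
    via the closed form x = H^{-1} A^T (A H^{-1} A^T)^{-1} c.
    Uses explicit linear solves in place of explicit inverses."""
    H = M.T @ M + N.T @ (r[:, None] * N)       # M^T M + N^T Diag(r) N
    HinvAt = np.linalg.solve(H, A.T)           # H^{-1} A^T
    schur = A @ HinvAt                          # A H^{-1} A^T
    return HinvAt @ np.linalg.solve(schur, c)   # H^{-1} A^T (A H^{-1} A^T)^{-1} c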

The cost of solving the above problem is dominated by the first step, and we thus require time O(mnω1)O(mn^{\omega-1}), where m=max{m1,m2}m=\max\{m_{1},m_{2}\}. This directly gives the runtime of Algorithm 5 to be O(pmp2(3p2)mnω1)O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{(3p-2)}}mn^{\omega-1}}\right). In this section, we show that we can implement Algorithm 5 in time similar to solving a system of linear equations for all p2p\geq 2. In particular, we prove the following result.

Theorem 5.1.

If 𝐀,𝐌,𝐍\boldsymbol{\mathit{A}},\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}} are explicitly given matrices with polynomially bounded condition numbers, and p2p\geq 2, then Algorithm 5 as given in Section 3.1.2 can be implemented to run in total time

O(mnω1+p3ωn2mω2+p3ωn2mp(104ω)3p2).O\mathopen{}\mathclose{{}\left(mn^{\omega-1}+p^{3-\omega}n^{2}m^{\omega-2}+p^{3-\omega}n^{2}m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}}\right).

5.1 Inverse Maintenance Algorithm

We first note that the weights 𝒘e(i)\boldsymbol{\mathit{w}}_{e}^{(i)}, and thus the resistances 𝒓e(i)\boldsymbol{\mathit{r}}_{e}^{(i)}, are monotonically increasing. Our algorithm in Section 3.1.2 updates both in every iteration. Here, we will instead update these gradually, only when there is a significant increase in the values; we thus give a lazy update scheme. The update can be done via the following consequence of the Woodbury matrix formula. The main idea is to initially compute the inverse of the required matrix explicitly; afterwards, we only update the coordinates that have increased significantly, and since the remaining values stay within a good factor approximation of the stored ones, we can use the stored inverse as a preconditioner to solve the new linear systems faster.

5.1.1 Low Rank Update

The following lemma is the same as Lemma 6.2 of [Adi+19].

Lemma 5.2.

Given matrices 𝐍m1×n,𝐌m2×n\boldsymbol{\mathit{N}}\in\mathbb{R}^{m_{1}\times n},\boldsymbol{\mathit{M}}\in\mathbb{R}^{m_{2}\times n}, and vectors 𝐫\boldsymbol{\mathit{r}} and 𝐫~\tilde{\boldsymbol{\mathit{r}}} that differ in kk entries, as well as the matrix 𝐙^=(𝐌𝐌+𝐍Diag(𝐫)𝐍)1\widehat{\boldsymbol{\mathit{Z}}}=(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}})^{-1}, we can construct (𝐌𝐌+𝐍Diag(𝐫~)𝐍)1(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}})^{-1} in O(kω2n2)O(k^{\omega-2}n^{2}) time.

Proof.

Let SS denote the entries that differ in 𝒓\boldsymbol{\mathit{r}} and 𝒓~\tilde{\boldsymbol{\mathit{r}}}. Then we have

𝑴𝑴+𝑵Diag(𝒓~)𝑵=𝑴𝑴+𝑵Diag(𝒓)𝑵+𝑵:,S(Diag(𝒓~S)Diag(𝒓S))𝑵S,:.\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}}=\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}}+\boldsymbol{\mathit{N}}_{:,S}^{\top}\mathopen{}\mathclose{{}\left(Diag(\tilde{\boldsymbol{\mathit{r}}}_{S})-Diag(\boldsymbol{\mathit{r}}_{S})}\right)\boldsymbol{\mathit{N}}_{S,:}.

This is a low rank perturbation, so by Woodbury matrix identity we get:

(𝑴𝑴+𝑵Diag(𝒓~)𝑵)1=𝒁^𝒁^𝑵:,S((Diag(𝒓~S)Diag(𝒓S))1+𝑵S,:𝒁^𝑵:,S)1𝑵S,:𝒁^,\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}}}\right)^{-1}=\widehat{\boldsymbol{\mathit{Z}}}-\widehat{\boldsymbol{\mathit{Z}}}\boldsymbol{\mathit{N}}_{:,S}^{\top}\mathopen{}\mathclose{{}\left(\mathopen{}\mathclose{{}\left(Diag({\tilde{\boldsymbol{\mathit{r}}}_{S}})-Diag(\boldsymbol{\mathit{r}}_{S})}\right)^{-1}+\boldsymbol{\mathit{N}}_{S,:}\widehat{\boldsymbol{\mathit{Z}}}\boldsymbol{\mathit{N}}_{:,S}^{\top}}\right)^{-1}\boldsymbol{\mathit{N}}_{S,:}\widehat{\boldsymbol{\mathit{Z}}},

where we use 𝒁^=𝒁^\widehat{\boldsymbol{\mathit{Z}}}^{\top}=\widehat{\boldsymbol{\mathit{Z}}} because 𝑴𝑴+𝑵Diag(𝒓)𝑵\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}} is a symmetric matrix. To explicitly compute this matrix, we need to:

  1. compute the matrix 𝑵S,:𝒁^\boldsymbol{\mathit{N}}_{S,:}\widehat{\boldsymbol{\mathit{Z}}},

  2. compute 𝑵:,S𝒁^𝑵:,S\boldsymbol{\mathit{N}}_{:,S}\widehat{\boldsymbol{\mathit{Z}}}\boldsymbol{\mathit{N}}_{:,S}^{\top},

  3. invert the middle term.

This cost is dominated by the first step, which can be viewed as multiplying n/k\lceil n/k\rceil pairs of k×nk\times n and n×kn\times k matrices. Each such multiplication takes time kω1nk^{\omega-1}n, for a total cost of O(kω2n2)O(k^{\omega-2}n^{2}). The other steps all involve matrices with dimension at most k×nk\times n, and are thus lower order terms. ∎
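
For concreteness, a minimal NumPy sketch of this update is given below; the name woodbury_update is ours, and dense inputs are assumed.

import numpy as np

def woodbury_update(Z_hat, N, r_old, r_new):
    """Illustrative low-rank inverse update in the spirit of Lemma 5.2.
    Z_hat = (M^T M + N^T Diag(r_old) N)^{-1} is given explicitly; returns
    (M^T M + N^T Diag(r_new) N)^{-1}, where r_old and r_new differ in k entries."""
    S = np.flatnonzero(r_new != r_old)         # entries whose resistance changed
    if S.size == 0:
        return Z_hat
    NS = N[S, :]                               # the k reweighted rows of N
    D = np.diag(1.0 / (r_new[S] - r_old[S]))   # (Diag(r_new_S) - Diag(r_old_S))^{-1}
    NSZ = NS @ Z_hat                           # k x n product dominating the cost
    middle = D + NSZ @ NS.T                    # k x k middle term of Woodbury
    return Z_hat - NSZ.T @ np.linalg.solve(middle, NSZ)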

5.1.2 Approximation and Fast Linear Systems Solver

We now define the notion of approximation we use and how to solve linear systems fast given a good preconditioner.

Definition 5.3.

We use a\approx_{c}b for positive numbers a and b iff c^{-1}a\leq b\leq c\cdot a, and for vectors \boldsymbol{\mathit{a}} and \boldsymbol{\mathit{b}} we use \boldsymbol{\mathit{a}}\approx_{c}\boldsymbol{\mathit{b}} to denote \boldsymbol{\mathit{a}}_{i}\approx_{c}\boldsymbol{\mathit{b}}_{i} entry-wise.

In our algorithm, we only update kk resistances that have increased by a constant factor. We can therefore use a constant factor preconditioner to solve the new linear system. We will use the following result on solving preconditioned systems of linear equations.

Lemma 5.4.

If 𝐫\boldsymbol{\mathit{r}} and 𝐫~\tilde{\boldsymbol{\mathit{r}}} are vectors such that 𝐫O~(1)𝐫~\boldsymbol{\mathit{r}}\approx_{\widetilde{O}(1)}\tilde{\boldsymbol{\mathit{r}}}, and we’re given the matrix 𝐙^1=(𝐌𝐌+𝐍Diag(𝐫)𝐍)1\widehat{\boldsymbol{\mathit{Z}}}^{-1}=(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}})^{-1} explicitly, then we can solve a system of linear equations involving 𝐙=𝐌𝐌+𝐍Diag(𝐫~)𝐍\boldsymbol{\mathit{Z}}=\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\tilde{\boldsymbol{\mathit{r}}})\boldsymbol{\mathit{N}} to 1/poly(n)1/poly(n) accuracy in O~(n2)\widetilde{O}(n^{2}) time.

Proof.

Suppose we want to solve the system,

𝒁𝒙=𝒃.\boldsymbol{\mathit{Z}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}.

We know 𝒁^1\widehat{\boldsymbol{\mathit{Z}}}^{-1} and that for some constant cc, 1c𝑰𝒁^1/2𝒁𝒁^1/2c𝑰\frac{1}{c}\boldsymbol{\mathit{I}}\preceq\widehat{\boldsymbol{\mathit{Z}}}^{-1/2}\boldsymbol{\mathit{Z}}\widehat{\boldsymbol{\mathit{Z}}}^{-1/2}\preceq c\boldsymbol{\mathit{I}}. The following iterative method (which is essentially gradient descent),

𝒙(k+1)𝒙(k)𝒁^1(𝒁𝒙𝒃)\boldsymbol{\mathit{x}}^{(k+1)}\rightarrow\boldsymbol{\mathit{x}}^{(k)}-\hat{\boldsymbol{\mathit{Z}}}^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{Z}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right)

converges to an ϵ\epsilon-approximate solution in O(clog1ϵ)O\mathopen{}\mathclose{{}\left(c\log\frac{1}{\epsilon}}\right) iterations. Each iteration can be computed via matrix-vector products. Since matrix-vector products with n×nn\times n matrices require at most O(n2)O(n^{2}) time, we get the above lemma for ϵ=1/poly(n)\epsilon=1/poly(n). ∎
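
A minimal sketch of this preconditioned iteration follows; the function and parameter names are ours, Z_hat_inv is the explicitly stored inverse, and the damping factor step is an addition beyond the lemma's statement that can be lowered if the approximation constant is large.

import numpy as np

def preconditioned_solve(Z_hat_inv, Z, b, iters=60, step=1.0):
    """Illustrative preconditioned iteration for Z x = b, where Z_hat_inv is
    the stored inverse of a matrix spectrally close to Z. Each iteration costs
    two n x n matrix-vector products, i.e. O(n^2) time."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x - step * (Z_hat_inv @ (Z @ x - b))   # x <- x - Z_hat^{-1}(Z x - b)
    return x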

5.1.3 Algorithm

The algorithm is the same as that in Section 6 of [Adi+19]. The algorithm has two parts: an initialization routine InverseInit, which is called only at the first iteration, and the inverse maintenance procedure UpdateInverse, which is called from Algorithm 4 (Oracle). Algorithm Oracle is called every time the resistances are updated in Algorithm 5. For this section, we will assume access to all variables from these routines, and maintain the following global variables:

  1. 𝒓^\boldsymbol{\widehat{\mathit{r}}}: resistances from the last time we updated each entry.

  2. counter(η)ecounter(\eta)_{e}: for each entry, track the number of times that it changed (relative to 𝒓^\boldsymbol{\widehat{\mathit{r}}}) by a factor of about 2η2^{-\eta} since the previous update.

  3. 𝒁^\widehat{\boldsymbol{\mathit{Z}}}: the inverse of the matrix given by 𝑴𝑴+𝑵Diag(𝒓^)𝑵\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\widehat{\mathit{r}}})\boldsymbol{\mathit{N}}.

Algorithm 8 Inverse Maintenance Initialization
1:procedure InverseInit(𝑴,𝑵,𝒓(0)\boldsymbol{\mathit{M}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{r}}^{(0)})
2:     Set 𝒓^𝒓(0)\boldsymbol{\widehat{\mathit{r}}}\leftarrow\boldsymbol{\mathit{r}}^{(0)}.
3:     Set counter(η)e0counter(\eta)_{e}\leftarrow 0 for all 0ηlog(m)0\leq\eta\leq\log(m) and ee.
4:     Set 𝒁^(𝑴𝑴+𝑵Diag(𝒓)𝑵)1\widehat{\boldsymbol{\mathit{Z}}}\leftarrow(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}})\boldsymbol{\mathit{N}})^{-1} by explicitly inverting the matrix.
Algorithm 9 Inverse Maintenance Procedure
1:procedure UpdateInverse
2:     for all entries ee do
3:         Find the least non-negative integer η\eta such that
12η𝒓e(i)𝒓e(i1)𝒓^e.\frac{1}{2^{\eta}}\leq\frac{\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i}\right)}_{e}-\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i-1}\right)}_{e}}{\boldsymbol{\widehat{\mathit{r}}}_{e}}.
4:         Increment counter(η)ecounter(\eta)_{e}.      
5:     Echangedη:i(mod2η)0{e:counter(η)e2η}E_{changed}\leftarrow\cup_{\eta:i\pmod{2^{\eta}}\equiv 0}\{e:counter(\eta)_{e}\geq 2^{\eta}\}
6:     𝒓~𝒓^\tilde{\boldsymbol{\mathit{r}}}\leftarrow\boldsymbol{\widehat{\mathit{r}}}
7:     for all eEchangede\in E_{changed} do
8:         𝒓~e𝒓e(i)\tilde{\boldsymbol{\mathit{r}}}_{e}\leftarrow\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}.
9:         Set counter(η)e0counter(\eta)_{e}\leftarrow 0 for all η\eta.      
10:     𝒁^LowRankUpdate(𝒁^,𝒓^,𝒓~)\widehat{\boldsymbol{\mathit{Z}}}\leftarrow\textsc{LowRankUpdate}(\widehat{\boldsymbol{\mathit{Z}}},\boldsymbol{\widehat{\mathit{r}}},\tilde{\boldsymbol{\mathit{r}}}).
11:     𝒓^𝒓~\boldsymbol{\widehat{\mathit{r}}}\leftarrow\tilde{\boldsymbol{\mathit{r}}}.

5.1.4 Analysis

We first verify that the maintained inverse is always a good preconditioner to the actual matrix, 𝑴𝑴+𝑵Diag(𝒓(i))𝑵\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}Diag(\boldsymbol{\mathit{r}}^{(i)})\boldsymbol{\mathit{N}}.

Lemma 5.5 (Lemma 6.5, [Adi+19]).

After each call to UpdateInverse, the vector 𝐫^\boldsymbol{\widehat{\mathit{r}}} satisfies

𝒓^O~(1)𝒓(i).\boldsymbol{\widehat{\mathit{r}}}\approx_{\widetilde{O}\mathopen{}\mathclose{{}\left(1}\right)}\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i}\right)}.
Proof.

First, observe that any change in resistance exceeding 11 is reflected immediately. Otherwise, every time we update counter(j)ecounter(j)_{e}, 𝒓e\boldsymbol{\mathit{r}}_{e} can only increase additively by at most

2j+1𝒓^e.2^{-j+1}\boldsymbol{\widehat{\mathit{r}}}_{e}.

Once counter(j)ecounter(j)_{e} exceeds 2j2^{j}, ee will be added to EchangedE_{changed} after at most 2j2^{j} steps. So when we start from 𝒓^e\boldsymbol{\widehat{\mathit{r}}}_{e}, ee is added to EchangedE_{changed} after counter(j)e2j+2j=2j+1counter(j)_{e}\leq 2^{j}+2^{j}=2^{j+1} iterations. The maximum possible increase in resistance due to the bucket jj is,

2j+1𝒓^e2j+1=4𝒓^e.2^{-j+1}\boldsymbol{\widehat{\mathit{r}}}_{e}\cdot 2^{j+1}=4\boldsymbol{\widehat{\mathit{r}}}_{e}.

Since there are only at most m1/3m^{1/3} iterations, the contributions of buckets with j>logmj>\log{m} are negligible. Now the change in resistance is influenced by all buckets jj, each contributing at most 4𝒓^e4\boldsymbol{\widehat{\mathit{r}}}_{e} increase. The total change is at most 4𝒓^elogm4\boldsymbol{\widehat{\mathit{r}}}_{e}\log m since there are at most logm\log m buckets. We therefore have

𝒓^e𝒓e(i)5𝒓^elogm.\boldsymbol{\widehat{\mathit{r}}}_{e}\leq\boldsymbol{\mathit{r}}^{\mathopen{}\mathclose{{}\left(i}\right)}_{e}\leq 5\boldsymbol{\widehat{\mathit{r}}}_{e}\log m.

for every ii. ∎

It remains to bound the number and sizes of calls made to Lemma 5.2. For this we define variables k(η)(i)k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)} to denote the number of edges added to EchangedE_{changed} at iteration ii due to the value of counter(η)ecounter(\eta)_{e}. Note that k(η)(i)k(\eta)^{(i)} is non-zero only if i0(mod2η)i\equiv 0\pmod{2^{\eta}}, and

|Echanged(i)|ηk(η)(i).\mathopen{}\mathclose{{}\left|E_{changed}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right|\leq\sum_{\eta}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}.

We divide our analysis into two cases: when the relative change in resistance is at least 11, and when it is at most 11. To begin, let us first look at the following lemma, which relates the change in weights to the relative change in resistance.

Lemma 5.6.

Consider a primal step from Algorithm 5. We have

𝒓e(i+1)𝒓e(i)𝒓e(i)(1+α|𝑵Δ|eζ1/p)p21\frac{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i+1}\right)}-\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\leq\mathopen{}\mathclose{{}\left(1+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}-1

where Δ\Delta is the solution produced by the oracle Algorithm 4.

Proof.

Recall from Algorithm 4 that

𝒓e(i)=(𝒘e(i))p2.\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{p-2}.

For a primal step of Algorithm 5, we have

𝒘e(i+1)𝒘e(i)=αζ1/p|𝑵Δ|e.\boldsymbol{\mathit{w}}^{\mathopen{}\mathclose{{}\left(i+1}\right)}_{e}-\boldsymbol{\mathit{w}}^{\mathopen{}\mathclose{{}\left(i}\right)}_{e}=\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}.

Substituting this in gives

𝒓e(i+1)𝒓e(i)𝒓e(i)=(𝒘e(i)+αζ1/p|𝑵Δ|e)p2(𝒘e(i))p2(𝒘e(i))p2(1+αζ1/p|𝑵Δ|e𝒘e(i))p21(1+αζ1/p|𝑵Δ|e)p21,\frac{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i+1}\right)}-\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}{\boldsymbol{\mathit{r}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}=\frac{\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}+\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}\right)^{p-2}-\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{p-2}}{\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{p-2}}\leq\mathopen{}\mathclose{{}\left(1+\frac{\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\boldsymbol{\mathit{w}}_{e}^{\mathopen{}\mathclose{{}\left(i}\right)}}}\right)^{p-2}-1\leq\mathopen{}\mathclose{{}\left(1+\frac{\alpha}{\zeta^{1/p}}\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}\right)^{p-2}-1,

where the last inequality utilizes 𝒘e(i)1\boldsymbol{\mathit{w}}_{e}^{(i)}\geq 1. ∎

We now consider the case when the relative change in resistance is at least 11.

Lemma 5.7.

Throughout the course of a run of Algorithm 5, the number of edges added to EchangedE_{changed} due to relative resistance increase of at least 11,

1iTk(0)(i)O(mp+23p2).\sum_{1\leq i\leq T}k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O\mathopen{}\mathclose{{}\left(m^{\frac{p+2}{3p-2}}}\right).
Proof.

From Lemma C.1, we know that the change in energy over one iteration is at least,

e(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1)).\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right).

Over all iterations, the change in energy is at least,

ie(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1))\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right)

which is upper bounded by O(mp2p)ζ2/pO(m^{\frac{p-2}{p}})\zeta^{2/p}. When iteration ii is a width reduction step, the relative resistance change is always at least 11. In this case |𝑵Δ|ρζ1/p\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|\geq\rho\zeta^{1/p}. When we have a primal step, Lemma 5.6 implies that when the relative change in resistance is at least 11 then,

|𝑵Δ|eΩ(1)α1ζ1/p.\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\Omega(1)\alpha^{-1}\zeta^{1/p}.

Using the bound |𝑵Δ|eΩ(p1)α1ζ1/p\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\Omega(p^{-1})\alpha^{-1}\zeta^{1/p} is sufficient since ρ>Ω(p1α1)\rho>\Omega(p^{-1}\alpha^{-1}) and both kinds of iterations are accounted for. The total change in energy can now be bounded.

p2α2ζ2/pie𝟙[𝒓e(i+1)𝒓e(i)𝒓e(i)1]O(mp2p)ζ2/p\displaystyle p^{-2}\alpha^{-2}\zeta^{2/p}\sum_{i}\sum_{e}\mathbb{1}_{\mathopen{}\mathclose{{}\left[\frac{\boldsymbol{\mathit{r}}^{(i+1)}_{e}-\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i)}_{e}}\geq 1}\right]}\leq O(m^{\frac{p-2}{p}})\zeta^{2/p}
p2α2ik(0)(i)O(mp2p)\displaystyle\Leftrightarrow p^{-2}\alpha^{-2}\sum_{i}k(0)^{(i)}\leq O(m^{\frac{p-2}{p}})
ik(0)(i)O(p2m(p2)/pα2).\displaystyle\Leftrightarrow\sum_{i}k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O(p^{2}m^{(p-2)/p}\alpha^{2}).

The Lemma follows by substituting α=Θ(p1mp25p+2p(3p2))\alpha=\Theta\mathopen{}\mathclose{{}\left(p^{-1}m^{-\frac{p^{2}-5p+2}{p(3p-2)}}}\right) in the above equation, since p^{2}m^{\frac{p-2}{p}}\alpha^{2}=m^{\frac{(p-2)(3p-2)-2(p^{2}-5p+2)}{p(3p-2)}}=m^{\frac{p^{2}+2p}{p(3p-2)}}=m^{\frac{p+2}{3p-2}}. ∎

Lemma 5.8.

Throughout the course of a run of Algorithm 5, the number of edges added to EchangedE_{changed} due to relative resistance increase between 2η2^{-\eta} and 2η+12^{-\eta+1},

1iTk(η)(i){0if 2ηT,O(mp+23p222η)otherwise.\sum_{1\leq i\leq T}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq\begin{cases}0&\text{if $2^{\eta}\geq T$},\\ O\mathopen{}\mathclose{{}\left(m^{\frac{p+2}{3p-2}}2^{2\eta}}\right)&\text{otherwise}.\end{cases}
Proof.

From Lemma C.1, the total change in energy is at least,

ie(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1)).\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right).

We know that 𝒓e(i+1)𝒓e(i)𝒓e(i)2η\frac{\boldsymbol{\mathit{r}}^{(i+1)}_{e}-\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i)}_{e}}\geq 2^{-\eta}. Using Lemma 5.6, we have,

(1+α|𝑵Δ|eζ1/p)p212η.\mathopen{}\mathclose{{}\left(1+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}-1\geq 2^{-\eta}.

We thus obtain,

(1+α|𝑵Δ|eζ1/p)p21{α|𝑵Δ|eζ1/p when α|Δe|ζ1/p or p21(2α|𝑵Δ|eζ1/p)p2 otherwise. \mathopen{}\mathclose{{}\left(1+\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}-1\leq\begin{cases}\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}&\text{ when $\alpha\mathopen{}\mathclose{{}\left|\Delta_{e}}\right|\leq\zeta^{1/p}$ or $p-2\leq 1$}\\ \mathopen{}\mathclose{{}\left(2\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}&\text{ otherwise. }\end{cases}

Now, in the second case, when α|𝑵Δe|ζ1/p\alpha\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta_{e}}\right|\geq\zeta^{1/p} and p2>1p-2>1,

(2α|𝑵Δ|eζ1/p)p22ηα|𝑵Δ|e(12η)1/(p2)+1ζ1/p2η1ζ1/p\mathopen{}\mathclose{{}\left(2\alpha\frac{\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}}{\zeta^{1/p}}}\right)^{p-2}\geq 2^{-\eta}\Rightarrow\alpha\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\mathopen{}\mathclose{{}\left(\frac{1}{2^{\eta}}}\right)^{1/(p-2)+1}\zeta^{1/p}\geq 2^{-\eta-1}\zeta^{1/p}

Therefore, for both cases we have,

α|𝑵Δ|e(2η1)ζ1/p.\alpha\mathopen{}\mathclose{{}\left|\boldsymbol{\mathit{N}}\Delta}\right|_{e}\geq\mathopen{}\mathclose{{}\left(2^{-\eta-1}}\right)\zeta^{1/p}.

Using the above bound and the fact that the total change in energy is at most O(mp2p)ζ2/pO(m^{\frac{p-2}{p}})\zeta^{2/p}, gives,

ie(𝑵Δ)e2(1𝒓e(i)𝒓e(i+1))O(mp2p)ζ2/p\displaystyle\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{N}}\Delta}\right)_{e}^{2}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i+1)}_{e}}}\right)\leq O(m^{\frac{p-2}{p}})\zeta^{2/p}
\displaystyle\Rightarrow 14ie(α12ηζ1/p)2(2η𝟙2η+1𝒓e(i+1)𝒓e(i)𝒓e(i)2η)O(mp2p)ζ2/p\displaystyle\frac{1}{4}\sum_{i}\sum_{e}\mathopen{}\mathclose{{}\left(\alpha^{-1}2^{-\eta}\zeta^{1/p}}\right)^{2}\cdot\mathopen{}\mathclose{{}\left(2^{-\eta}\mathbb{1}_{2^{-\eta+1}\geq\frac{\boldsymbol{\mathit{r}}^{(i+1)}_{e}-\boldsymbol{\mathit{r}}^{(i)}_{e}}{\boldsymbol{\mathit{r}}^{(i)}_{e}}\geq 2^{-\eta}}}\right)\leq O(m^{\frac{p-2}{p}})\zeta^{2/p}
\displaystyle\Rightarrow α223ηi2ηk(η)(i)O(mp2p)\displaystyle\alpha^{-2}2^{-3\eta}\sum_{i}2^{\eta}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O(m^{\frac{p-2}{p}})
\displaystyle\Rightarrow ik(η)(i)O(m(p2)/pα222η)\displaystyle\sum_{i}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}\leq O\mathopen{}\mathclose{{}\left(m^{(p-2)/p}\alpha^{2}2^{2\eta}}\right)

The Lemma follows substituting α=Θ(p1mp25p+2p(3p2))\alpha=\Theta\mathopen{}\mathclose{{}\left(p^{-1}m^{-\frac{p^{2}-5p+2}{p(3p-2)}}}\right) in the above equation. ∎

We can now use the concavity of f(z)=zω2f(z)=z^{\omega-2} to upper bound the contribution of these terms.

Corollary 5.9.

Let k(η)(i)k(\eta)^{(i)} be as defined. Over all iterations we have,

i(k(0)(i))ω2O(p3ωmp(104ω)3p2)\sum_{i}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}\leq O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}}\right)

and for every η\eta,

iT(k(η)(i))ω2{0if 2ηT,O(p3ωmp2+4(ω2)3p22η(3ω7))otherwise.\sum_{i}^{T}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}\leq\begin{cases}0&\text{if $2^{\eta}\geq T$},\\ O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}\cdot 2^{\eta\mathopen{}\mathclose{{}\left(3\omega-7}\right)}}\right)&\text{otherwise}.\end{cases}
Proof.

Due to the concavity of the ω20.3727<1\omega-2\approx 0.3727<1 power, this total is maximized when it’s equally distributed over all iterations. In the first sum, the number of terms is equal to the number of iterations, i.e., O(pmp23p2)O(pm^{\frac{p-2}{3p-2}}). In the second sum, the number of terms is O(pmp23p2)2ηO(pm^{\frac{p-2}{3p-2}})2^{-\eta}. Distributing the sums equally over these numbers of terms gives,

iT(k(0)(i))ω2(O(p1mp+23p2p23p2))ω2O(pmp23p2)=O(p3ωmp2+4(ω2)3p2)=O(p3ωmp(104ω)3p2)\sum_{i}^{T}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(0}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}\leq\mathopen{}\mathclose{{}\left(O\mathopen{}\mathclose{{}\left(p^{-1}m^{\frac{p+2}{3p-2}-\frac{p-2}{3p-2}}}\right)}\right)^{\omega-2}\cdot O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}}\right)=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}}\right)=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}}\right)

and

iT(k(η)(i))ω2\displaystyle\sum_{i}^{T}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2} O(pmp23p22η)(p1mp+23p222ηmp23p22η)ω2\displaystyle\leq O\mathopen{}\mathclose{{}\left(pm^{\frac{p-2}{3p-2}}2^{-\eta}}\right)\cdot\mathopen{}\mathclose{{}\left(p^{-1}\frac{m^{\frac{p+2}{3p-2}}2^{2\eta}}{m^{\frac{p-2}{3p-2}}2^{-\eta}}}\right)^{\omega-2}
=O(p3ωmp2+4(ω2)3p22η23η(ω2))\displaystyle=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}2^{-\eta}\cdot 2^{3\eta(\omega-2)}}\right)
=O(p3ωmp2+4(ω2)3p22η(3ω7)).\displaystyle=O\mathopen{}\mathclose{{}\left(p^{3-\omega}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}2^{\eta(3\omega-7)}}\right).

5.2 Proof of Theorem 5.1

See 5.1

Proof.

By Lemma 5.5, the vector 𝒓^\boldsymbol{\widehat{\mathit{r}}} that the maintained inverse corresponds to always satisfies 𝒓^O~(1)𝒓(i)\boldsymbol{\widehat{\mathit{r}}}\approx_{\widetilde{O}(1)}\boldsymbol{\mathit{r}}^{(i)}. So by the iterative linear system solver outlined in Lemma 5.4, we can implement each call to Oracle (Algorithm 4) in time O(n2)O(n^{2}), in addition to the cost of performing inverse maintenance. This leads to a total cost of

O~(pn2mp23p2)\widetilde{O}\mathopen{}\mathclose{{}\left(pn^{2}m^{\frac{p-2}{3p-2}}}\right)

across the T=Θ(pmp23p2)T=\Theta(pm^{\frac{p-2}{3p-2}}) iterations.

The costs of inverse maintenance is dominated by the calls to the low-rank update procedure outlined in Lemma 5.2. Its total cost is bounded by

O(i|Echanged(i)|ω2n2)=O(n2i(ηk(η)(i))ω2).O\mathopen{}\mathclose{{}\left(\sum_{i}\mathopen{}\mathclose{{}\left|E_{changed}^{\mathopen{}\mathclose{{}\left(i}\right)}}\right|^{\omega-2}n^{2}}\right)=O\mathopen{}\mathclose{{}\left(n^{2}\sum_{i}\mathopen{}\mathclose{{}\left(\sum_{\eta}k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}}\right).

Because there are only O(logm)O(\log{m}) values of η\eta, and each k(η)(i)k(\eta)^{(i)} is non-negative, we can bound the total cost by:

O~(n2iη(k(η)(i))ω2)O~(p3ωn2η:2ηTmp2+4(ω2)3p22η(3ω7)),\widetilde{O}\mathopen{}\mathclose{{}\left(n^{2}\sum_{i}\sum_{\eta}\mathopen{}\mathclose{{}\left(k\mathopen{}\mathclose{{}\left(\eta}\right)^{\mathopen{}\mathclose{{}\left(i}\right)}}\right)^{\omega-2}}\right)\leq\widetilde{O}\mathopen{}\mathclose{{}\left(p^{3-\omega}n^{2}\sum_{\eta:2^{\eta}\leq T}m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}\cdot 2^{\eta\mathopen{}\mathclose{{}\left(3\omega-7}\right)}}\right),

where the inequality follows from substituting in the result of Corollary 5.9. Depending on the sign of 3ω73\omega-7, this sum is dominated either at η=0\eta=0 or at η=logT\eta=\log{T}. Including both terms then gives

O~(p3ωn2(mp2+4(ω2)3p2+mp2+4(ω2)+(p2)(3ω7)3p2)),\widetilde{O}\mathopen{}\mathclose{{}\left(p^{3-\omega}n^{2}\mathopen{}\mathclose{{}\left(m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)}{3p-2}}+m^{\frac{p-2+4\mathopen{}\mathclose{{}\left(\omega-2}\right)+\mathopen{}\mathclose{{}\left(p-2}\right)\mathopen{}\mathclose{{}\left(3\omega-7}\right)}{3p-2}}}\right)}\right),

with the exponent on the trailing term simplifying to ω2\omega-2 to give,

O~(p3ωn2(mp(104ω)3p2+mω2)).\widetilde{O}\mathopen{}\mathclose{{}\left(p^{3-\omega}n^{2}\mathopen{}\mathclose{{}\left(m^{\frac{p-\mathopen{}\mathclose{{}\left(10-4\omega}\right)}{3p-2}}+m^{\omega-2}}\right)}\right).

6 Iteratively Reweighted Least Squares Algorithm

Iteratively Reweighted Least Squares (IRLS) Algorithms are a family of algorithms for solving p\ell_{p}-regression. These algorithms have been studied extensively for about 60 years [Law61, Ric64, GR97] and the classical form solves the following version of p\ell_{p}-regression,

min𝒙𝑨𝒙𝒃p,\min_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}\|_{p}, (12)

where 𝑨\boldsymbol{\mathit{A}} is a tall thin matrix and 𝒃\boldsymbol{\mathit{b}} is a vector. The main idea in IRLS algorithms is to solve a weighted least squares problem in every iteration to obtain the next iterate,

\boldsymbol{\mathit{x}}^{(t+1)}=\arg\min_{\boldsymbol{\mathit{x}}}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}\boldsymbol{\mathit{R}}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}) (13)

starting from 𝒙(0)\boldsymbol{\mathit{x}}^{(0)}, which is usually argmin𝒙𝑨𝒙𝒃22\arg\min_{\boldsymbol{\mathit{x}}}\|\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}\|_{2}^{2}. Here 𝑹\boldsymbol{\mathit{R}} is picked to be Diag(|𝑨𝒙(t)𝒃|p2)Diag(|\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}^{(t)}-\boldsymbol{\mathit{b}}|^{p-2}); note that the above update is then a fixed-point iteration for the p\ell_{p}-regression problem. It is known that the fixed point is unique for p(1,)p\in(1,\infty).
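
For illustration, a minimal NumPy sketch of this classical iteration (not the algorithm analyzed in this section) is given below; the tall full-rank A is an assumption, and the small eps added to the weights is a standard practical safeguard against zero residuals rather than part of (13).

import numpy as np

def classical_irls(A, b, p, iters=50, eps=1e-8):
    """Illustrative classical IRLS for min_x ||Ax - b||_p: each step solves a
    weighted least-squares problem with weights |Ax - b|^{p-2}. No convergence
    guarantee is implied, in particular for larger p."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]       # start from the l2 solution
    for _ in range(iters):
        w = np.abs(A @ x - b) ** (p - 2) + eps     # weights R = Diag(|Ax-b|^{p-2})
        x = np.linalg.lstsq(np.sqrt(w)[:, None] * A, np.sqrt(w) * b, rcond=None)[0]
    return x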

The basic version of the above IRLS algorithm is guaranteed to converge for p(1.5,3)p\in(1.5,3); however, even for pp as small as 3.53.5, the algorithm can diverge [RCL19]. Over the years there have been several studies on IRLS algorithms and attempts to show convergence [Kar70, Osb85], but these either do not show quantitative bounds or require a starting solution that is already close to the optimum. Refer to [Bur12] for a complete survey of these methods.

In this section we propose an IRLS algorithm and prove that it is guaranteed to converge geometrically to the optimum. Our algorithm is based on the algorithm of [APS19], and we present some experimental results from that paper demonstrating that our algorithm works very well in practice. We provide a much simpler analysis and integrate it with the framework we have built so far.

We will focus on the following pure p\ell_{p} setting for better readability,

min𝑨𝒙=𝒃𝑵𝒙p.\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}.

We note that our algorithm also works for the setting described in Equation (12). We will first describe our algorithm in the next section, and then present some experimental results from [APS19].

6.1 IRLS Algorithm

Our IRLS algorithm is based on our overall iterative refinement framework (Algorithm 1), where we directly use a weighted least squares problem to solve the residual problem. Consider Algorithm 10 and compare it with Algorithm 1. We note that it is the same overall, except that we now have an extra LineSearch step and we update the solution (Line 7) in every iteration. These steps do not affect the overall convergence guarantees of the iterative refinement framework in Algorithm 1, since they only ensure that, given a solution from ResidualSolver-IRLS, we take the step that reduces the objective value the most, as opposed to the fixed update defined in Algorithm 1. In other words, we reduce the objective value in each iteration by at least as much as in Algorithm 1. It thus suffices to prove the guarantees of ResidualSolver-IRLS (Algorithm 11) and combine them with Theorem 2.1 to obtain our final convergence guarantees. We will prove the following result on our IRLS algorithm (Algorithm 10).

See 1.3 The key connection with IRLS algorithms is that we are able to show that solving a weighted least squares problem suffices to solve the residual problem. The two main differences are that, in every iteration, we add a small systematic padding to 𝑹\boldsymbol{\mathit{R}}, and that we perform a line search. These tricks are common empirical modifications used to avoid ill-conditioned matrices and to obtain faster convergence [Kar70, VB99].

Algorithm 10 Iteratively Reweighted Least Squares
1:procedure IRLS(𝑨,𝑵,𝒃,p,ϵ\boldsymbol{\mathit{A}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{b}},p,\epsilon)
2:     𝒙argmin𝑨𝒙=𝒃𝑵𝒙22\boldsymbol{\mathit{x}}\leftarrow{\arg\min}_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{b}}}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{2}^{2}
3:     ν𝑵𝒙pp\nu\leftarrow\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}
4:     while ν>ϵ2𝑵𝒙pp\nu>\frac{\epsilon}{2}\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p} do
5:         Δ~,κ{\widetilde{{\Delta}}},\kappa\leftarrow ResidualSolver-IRLS(𝒙,𝑵,𝑨,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{b}},\nu,p)
6:         α\alpha\leftarrow LineSearch(𝑵,𝒙,Δ~\boldsymbol{\mathit{N}},\boldsymbol{\mathit{x}},{\widetilde{{\Delta}}}) \triangleright α=argminβ𝑵(𝒙βΔ~)pp\alpha=\arg\min_{\beta}\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\beta{\widetilde{{\Delta}}})\|_{p}^{p}
7:         𝒙𝒙αΔ~p\boldsymbol{\mathit{x}}\leftarrow\boldsymbol{\mathit{x}}-\alpha\frac{{\widetilde{{\Delta}}}}{p}
8:         if 𝒓𝒆𝒔p(αΔ~)<ν32pκ\boldsymbol{res}_{p}(\alpha{\widetilde{{\Delta}}})<\frac{\nu}{32p\kappa} then
9:              νν2\nu\leftarrow\frac{\nu}{2}               
10:     return 𝒙\boldsymbol{\mathit{x}}
Algorithm 11 Residual Solver for IRLS
1:procedure ResidualSolver-IRLS(𝒙,𝑵,𝑨,𝒃,ν,p\boldsymbol{\mathit{x}},\boldsymbol{\mathit{N}},\boldsymbol{\mathit{A}},\boldsymbol{\mathit{b}},\nu,p)
2:     𝒈Diag(|𝑵𝒙|p2)𝑵𝒙\boldsymbol{\mathit{g}}\leftarrow Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}
3:     𝑹2Diag(|𝑵𝒙|p2)\boldsymbol{\mathit{R}}\leftarrow 2Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})
4:     𝒔νp2pmp2p\boldsymbol{\mathit{s}}\leftarrow\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}
5:     Δ~argmax𝑨Δ=0𝒈𝑵ΔΔ𝑵(R+𝒔𝑰)𝑵Δ{\widetilde{{\Delta}}}\leftarrow{\arg\max}_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}\Delta-\Delta^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\boldsymbol{\mathit{s}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}\Delta\triangleright Problem (14)
6:     k𝑵Δ~ppΔ~𝑵(𝑹+s𝑰)𝑵Δ~k\leftarrow\frac{\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}{{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}
7:     α0min{12,12k1/(p1)}\alpha_{0}\leftarrow\min\mathopen{}\mathclose{{}\left\{\frac{1}{2},\frac{1}{2k^{1/(p-1)}}}\right\}
8:     return Δ~,213p2α01{\widetilde{{\Delta}}},2^{13}p^{2}\alpha_{0}^{-1}
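To make the structure of Algorithms 10 and 11 concrete, the following is a hedged NumPy/SciPy sketch, not the paper's implementation (the reference code accompanying [APS19] is in MATLAB). The dense KKT solves, the helper solve_kkt, the use of scipy.optimize.minimize_scalar for the LineSearch step, and the iteration cap are all illustrative choices.

import numpy as np
from scipy.optimize import minimize_scalar

def solve_kkt(Q, A, rhs_top, rhs_bottom):
    """Solve the KKT system [Q, A^T; A, 0] [z; y] = [rhs_top; rhs_bottom]
    and return z. Dense solve for illustration; assumes the KKT matrix is
    invertible."""
    d, n = A.shape
    K = np.block([[Q, A.T], [A, np.zeros((d, d))]])
    return np.linalg.solve(K, np.concatenate([rhs_top, rhs_bottom]))[:n]

def residual_solver_irls(x, N, A, nu, p):
    """Sketch of ResidualSolver-IRLS (Algorithm 11): solve the padded
    weighted least squares problem (14) and return the step and kappa."""
    m = N.shape[0]
    g = np.abs(N @ x) ** (p - 2) * (N @ x)
    r = 2 * np.abs(N @ x) ** (p - 2)
    s = nu ** ((p - 2) / p) * m ** (-(p - 2) / p)      # systematic padding
    Q = N.T @ ((r + s)[:, None] * N)                   # N^T (R + s I) N
    # max_{A Delta = 0} g^T N Delta - Delta^T Q Delta, via its KKT system
    Delta = solve_kkt(2 * Q, A, N.T @ g, np.zeros(A.shape[0]))
    k = np.sum(np.abs(N @ Delta) ** p) / (Delta @ (Q @ Delta))
    alpha0 = min(0.5, 0.5 / k ** (1.0 / (p - 1)))
    return Delta, 2 ** 13 * p ** 2 / alpha0

def irls(A, N, b, p, eps=1e-8, max_iter=500):
    """Sketch of Algorithm 10: iterative refinement with padding and a
    line search. The iteration cap and dense solves are illustrative."""
    n = N.shape[1]
    # x <- argmin_{Ax = b} ||N x||_2^2
    x = solve_kkt(2 * N.T @ N, A, np.zeros(n), b)
    nu = np.sum(np.abs(N @ x) ** p)
    for _ in range(max_iter):
        if nu <= (eps / 2) * np.sum(np.abs(N @ x) ** p):
            break
        g = np.abs(N @ x) ** (p - 2) * (N @ x)       # gradient-type term
        r = 2 * np.abs(N @ x) ** (p - 2)             # diagonal of R
        Delta, kappa = residual_solver_irls(x, N, A, nu, p)
        # LineSearch: alpha = argmin_beta ||N (x - beta * Delta)||_p^p
        alpha = minimize_scalar(
            lambda beta: np.sum(np.abs(N @ (x - beta * Delta)) ** p)).x
        step = N @ (alpha * Delta)
        res = g @ step - step @ (r * step) - np.sum(np.abs(step) ** p)
        x = x - alpha * Delta / p
        if res < nu / (32 * p * kappa):              # residual progress check
            nu /= 2
    return x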

We will prove the following result about solving the residual problem.

Lemma 6.1.

Let 𝐱\boldsymbol{\mathit{x}} be the current iterate and ν\nu be such that 𝐍𝐱ppOPT(ν/2,ν]\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-OPT\in(\nu/2,\nu]. Let Δ~{\widetilde{{\Delta}}} be the solution of (14). Then for α0\alpha_{0} and α\alpha as defined in Algorithm 11 and Algorithm 10 respectively, αΔ~\alpha{\widetilde{{\Delta}}} is a O(p2α01)=O(p2mp22(p1))O\mathopen{}\mathclose{{}\left(p^{2}\alpha_{0}^{-1}}\right)=O\mathopen{}\mathclose{{}\left(p^{2}m^{\frac{p-2}{2(p-1)}}}\right)-approximate solution to the residual problem.

We note that Theorem 1.3 directly follows from Lemma 6.1, Lemma 2.7 and Theorem 2.1. Therefore, in the next section, we will prove Lemma 6.1.

6.1.1 Solving the Residual Problem

Recall the residual problem (Definition 2.3),

max𝑨Δ=0𝒈𝑵ΔΔ𝑵𝑹𝑵Δ𝑵Δpp,\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}\Delta-\Delta^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\Delta-\|\boldsymbol{\mathit{N}}\Delta\|_{p}^{p},

with 𝒈=Diag(|𝑵𝒙|p2)𝑵𝒙\boldsymbol{\mathit{g}}=Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2})\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}} and 𝑹=2Diag(|𝑵𝒙|p2)\boldsymbol{\mathit{R}}=2Diag(|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}|^{p-2}). Let ν\nu be as in Algorithm 10. We will show that the solution of the following weighted least squares problem is a good approximation to the residual problem,

max𝑨Δ=0𝒈𝑵ΔΔ𝑵(𝑹+νp2pmp2p𝑰)𝑵Δ.\max_{\boldsymbol{\mathit{A}}\Delta=0}\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}\Delta-\Delta^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}\Delta. (14)

6.1.2 Proof of Lemma 6.1

Proof.

Since 𝑵𝒙ppOPT(ν/2,ν]\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-OPT\in(\nu/2,\nu], from Lemma 2.6 the optimum of the residual problem satisfies resp(Δ)(ν/32p,ν]res_{p}({{{\Delta^{\star}}}})\in(\nu/32p,\nu]. We will next prove that the objective of (14) at its optimum is at most ν\nu and at least ν213p2\frac{\nu}{2^{13}p^{2}}. Before proving this bound, we first show how αΔ~\alpha{\widetilde{{\Delta}}} gives the required approximation to the residual problem. Recall from Algorithm 11 that α0=min{12,12k1/(p1)}\alpha_{0}=\min\mathopen{}\mathclose{{}\left\{\frac{1}{2},\frac{1}{2k^{1/(p-1)}}}\right\}.

resp(αΔ~)\displaystyle res_{p}(\alpha{\widetilde{{\Delta}}}) 16presp(αΔ~/16p)\displaystyle\geq 16p\cdot res_{p}(\alpha{\widetilde{{\Delta}}}/16p)
𝑵𝒙pp𝑵(𝒙αΔ~)pp\displaystyle\geq\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\alpha{\widetilde{{\Delta}}})\|_{p}^{p}
𝑵𝒙pp𝑵(𝒙α0Δ~)pp\displaystyle\geq\|\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}\|_{p}^{p}-\|\boldsymbol{\mathit{N}}(\boldsymbol{\mathit{x}}-\alpha_{0}{\widetilde{{\Delta}}})\|_{p}^{p}
resp(α0Δ~)\displaystyle\geq res_{p}(\alpha_{0}{\widetilde{{\Delta}}})
=α0(𝒈𝑵Δ~α0Δ~𝑵𝑹𝑵Δ~α0p1𝑵Δ~pp)\displaystyle=\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}^{p-1}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}\right)
α0(𝒈𝑵Δ~α0Δ~𝑵(𝑹+s𝑰)𝑵Δ~α0p1kΔ~𝑵(𝑹+s𝑰)𝑵Δ~)\displaystyle\geq\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\alpha_{0}^{p-1}k{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right)
α0(𝒈𝑵Δ~12Δ~𝑵(𝑹+s𝑰)𝑵Δ~12Δ~𝑵(𝑹+s𝑰)𝑵Δ~)\displaystyle\geq\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\frac{1}{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\frac{1}{2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right)
=α0(𝒈𝑵Δ~Δ~𝑵(𝑹+s𝑰)𝑵Δ~)\displaystyle=\alpha_{0}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\right)
α0ν213p2α0213p2OPT.\displaystyle\geq\frac{\alpha_{0}\nu}{2^{13}p^{2}}\geq\frac{\alpha_{0}}{2^{13}p^{2}}OPT.

It remains to prove the bound on the optimal objective of (14) and to bound α0\alpha_{0}; for the latter, it is sufficient to find an upper bound on kk,

k=𝑵Δ~ppΔ~𝑵(𝑹+s𝑰)𝑵Δ~.k=\frac{\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}{{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}.

We first bound kk. Since s𝑰𝑹+s𝑰s\boldsymbol{\mathit{I}}\preceq\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}},

𝑵Δ~221sΔ~𝑵(𝑹+s𝑰)𝑵Δ~,\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{2}\leq\frac{1}{s}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}},

and

𝑵Δ~pp𝑵Δ~2p1s𝑵Δ~2p2Δ~𝑵(𝑹+s𝑰)𝑵Δ~.\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}\leq\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{p}\leq\frac{1}{s}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{p-2}{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}.

Therefore it is sufficient to bound 𝑵Δ~2\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}, as

k=𝑵Δ~ppΔ~𝑵(𝑹+s𝑰)𝑵Δ~1s𝑵Δ~2p2.k=\frac{\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{p}^{p}}{{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(\boldsymbol{\mathit{R}}+s\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}}\leq\frac{1}{s}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{p-2}.

To bound 𝑵Δ~2\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}, we start by assuming |𝒈𝑵Δ~|ν|\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}|\leq\nu. Now, since the optimal objective of (14) is lower bounded by ν213p2\frac{\nu}{2^{13}p^{2}} (as shown at the end of this proof),

Δ~𝑵(R+νp2pmp2p𝑰)𝑵Δ~𝒈𝑵Δ~ν213p2ν.{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\leq\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-\frac{\nu}{2^{13}p^{2}}\leq\nu.

We thus have,

νp2pmp2p𝑵Δ~22ν.\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\|\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}\|_{2}^{2}\leq\nu.

Using this we get,

k1νp2pmp2pνp22ν(p2)22pm(p2)22p=mp22.k\leq\frac{1}{\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}}\frac{\nu^{\frac{p-2}{2}}}{\nu^{\frac{(p-2)^{2}}{2p}}m^{-\frac{(p-2)^{2}}{2p}}}=m^{\frac{p-2}{2}}.

We thus have α0\alpha_{0} lower bounded by mp22(p1)m^{-\frac{p-2}{2(p-1)}}, which gives us our result. It remains to give a lower bound to the optimal objective of (14).

Let Δ{{{\Delta^{\star}}}} denote the optimum of the residual problem. We know that 𝑵Δppν,Δ𝑵𝑹𝑵Δν\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}\leq\nu,{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\leq\nu and 𝒈𝑵Δ>ν/32p\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}>\nu/32p. Since 𝑵Δppν\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{p}^{p}\leq\nu we have 𝑵Δ22m(p2)/pν2/p\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{2}^{2}\leq m^{(p-2)/p}\nu^{2/p}. For a=1/27p,a=1/2^{7}p, aΔa{{{\Delta^{\star}}}} is a feasible solution for (14).

𝒈𝑵Δ~Δ~𝑵(R+νp2pmp2p𝑰)𝑵Δ~\displaystyle\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}}-{\widetilde{{\Delta}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}} a𝒈𝑵Δa2Δ𝑵(R+νp2pmp2p𝑰)𝑵Δ\displaystyle\geq a\boldsymbol{\mathit{g}}^{\top}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}-a^{2}{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{N}}^{\top}(R+\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\boldsymbol{\mathit{I}})\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}
a(ν32paΔ𝑵𝑹𝑵Δaνp2pmp2p𝑵Δ22)\displaystyle\geq a\mathopen{}\mathclose{{}\left(\frac{\nu}{32p}-a{{{\Delta^{\star}}}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}-a\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}\|\boldsymbol{\mathit{N}}{{{\Delta^{\star}}}}\|_{2}^{2}}\right)
a(ν32paνaνp2pmp2pm(p2)/pν2/p)\displaystyle\geq a\mathopen{}\mathclose{{}\left(\frac{\nu}{32p}-a\nu-a\nu^{\frac{p-2}{p}}m^{-\frac{p-2}{p}}m^{(p-2)/p}\nu^{2/p}}\right)
=a(ν32paνaν)\displaystyle=a\mathopen{}\mathclose{{}\left(\frac{\nu}{32p}-a\nu-a\nu}\right)
=aν26p=ν213p2\displaystyle=a\frac{\nu}{2^{6}p}=\frac{\nu}{2^{13}p^{2}}

Thus, the optimal objective of (14) is lower bounded by ν213p2\frac{\nu}{2^{13}p^{2}}. ∎

6.2 Experiments

In this section, we include the experimental results from [APS19], which are based on the algorithm pp-IRLS described in that paper. We note that pp-IRLS is similar in spirit to Algorithm 10, and we thus expect similar performance from an implementation of Algorithm 10. Algorithm pp-IRLS is described for the setting of Equation (12) and is available at https://github.com/fast-algos/pIRLS [APS19a]. We now give a brief summary of the experiments.

6.2.1 Experiments on p-IRLS

Figure 1: Random Matrix instances. Comparing the number of iterations and time taken by our algorithm with the parameters. Averaged over 100 random samples for 𝑨\boldsymbol{\mathit{A}} and 𝒃\boldsymbol{\mathit{b}}. Linear solver used: backslash. (a) Size of 𝑨\boldsymbol{\mathit{A}} fixed to 1000×8501000\times 850. (b) Sizes of 𝑨\boldsymbol{\mathit{A}}: (50+100(k1))×100k(50+100(k-1))\times 100k. Error ϵ=108.\epsilon=10^{-8}. (c) Size of 𝑨\boldsymbol{\mathit{A}} fixed to 1000×8501000\times 850. Error ϵ=108.\epsilon=10^{-8}.

Figure 2: Averaged over 100 random samples. Graph: 10001000 nodes (50005000-60006000 edges). Solver: PCG with Cholesky preconditioner.

Figure 3: Graph Instances. Comparing the number of iterations and time taken by our algorithm with the parameters. Averaged over 100 graph samples. Linear solver used: backslash. (a) Size of graph fixed to 10001000 nodes (around 50005000-60006000 edges). (b) Number of nodes: 100k100k. Error ϵ=108.\epsilon=10^{-8}. (c) Size of graph fixed to 10001000 nodes (around 50005000-60006000 edges). Error ϵ=108.\epsilon=10^{-8}.

Figure 4: Averaged over 100 samples. Precision set to ϵ=108\epsilon=10^{-8}. CVX solver used: SDPT3 for Matrices and Sedumi for Graphs. (a) Fixed p=8p=8. Size of matrices: 100k×(50+100(k1))100k\times(50+100(k-1)). (b) Size of matrices fixed to 500×450500\times 450. (c) Fixed p=8p=8. Number of nodes: 50k,k=1,2,,1050k,k=1,2,...,10. (d) Size of graphs fixed to 400400 nodes (around 20002000 edges).

All implementations were done in MATLAB 2018b on a desktop Ubuntu machine with an Intel Core i5-4570 CPU @ 3.20GHz × 4 processor and 4GB RAM. The two kinds of instances considered are random matrices and graph instances for the problem minx𝑨𝒙𝒃p\min_{x}\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right\rVert_{p}.

  1. Random Matrices: the matrix 𝑨\boldsymbol{\mathit{A}} and the vector 𝒃\boldsymbol{\mathit{b}} are generated randomly, i.e., every entry is chosen uniformly at random between 0 and 11.

  2. Graphs: instances are generated as in [RCL19]. Vertices are uniform random vectors in [0,1]10[0,1]^{10} and edges are created by connecting the 1010 nearest neighbors. The weight of every edge is determined by a Gaussian function (Eq 3.1, [RCL19]). Around 10 vertices have labels chosen uniformly at random between 0 and 11. The problem is to minimize the p\ell_{p}-Laplacian; Appendix B contains details on how to formulate this problem in our standard form. These instances were generated using the code by [Rio19]; a hedged sketch of this construction is given after the list.
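As referenced in the list above, the following is a hedged NumPy/SciPy sketch of how such graph instances can be generated. The Gaussian kernel exp(-dist^2 / sigma^2) and the parameter sigma are assumed forms chosen only for illustration; the exact weight function is Eq. 3.1 of [RCL19], and the reference generator is the MATLAB code of [Rio19].

import numpy as np
from scipy.spatial.distance import cdist

def random_graph_instance(n=1000, dim=10, k=10, n_labels=10, sigma=1.0, seed=0):
    """Sketch of the graph instances above: random points in [0,1]^dim,
    k-nearest-neighbour edges, Gaussian edge weights (assumed kernel form),
    and roughly n_labels labelled vertices with random labels in [0, 1]."""
    rng = np.random.default_rng(seed)
    V = rng.random((n, dim))                       # vertices in [0,1]^dim
    D = cdist(V, V)                                # pairwise distances
    edges, weights = [], []
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:        # k nearest neighbours of i
            edges.append((i, int(j)))
            weights.append(np.exp(-D[i, j] ** 2 / sigma ** 2))
    labelled = rng.choice(n, size=n_labels, replace=False)
    labels = dict(zip(labelled.tolist(), rng.random(n_labels)))
    return V, edges, np.array(weights), labels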

The performance of pp-IRLS is compared against the MATLAB/CVX solver [GB14, GB08] and the IRLS/homotopy based implementation from [RCL19]. More details on the experiments are in [APS19]; the plots and specific details of the implementation are included in Figures 1, 2, 3, and 4.

References

  • [ABS21] Deeksha Adil, Brian Bullins and Sushant Sachdeva “Unifying Width-Reduced Methods for Quasi-Self-Concordant Optimization” In arXiv preprint arXiv:2107.02432, 2021
  • [Adi+19] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Iterative Refinement for p\ell_{p}-norm Regression” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 1405–1424 SIAM
  • [Adi+21] Deeksha Adil, Brian Bullins, Rasmus Kyng and Sushant Sachdeva “Almost-Linear-Time Weighted p\ell_{p}-Norm Solvers in Slightly Dense Graphs via Sparsification” In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), 2021 Schloss Dagstuhl-Leibniz-Zentrum für Informatik
  • [AL11] Morteza Alamgir and Ulrike Luxburg “Phase transition in the family of p-resistances” In Advances in neural information processing systems 24, 2011, pp. 379–387
  • [All+17] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira and Avi Wigderson “Much faster algorithms for matrix scaling” In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 890–901 IEEE
  • [APS19] Deeksha Adil, Richard Peng and Sushant Sachdeva “Fast, Provably Convergent IRLS Algorithm for p-norm Linear Regression” In Advances in Neural Information Processing Systems, 2019, pp. 14189–14200
  • [APS19a] Deeksha Adil, Richard Peng and Sushant Sachdeva “pIRLS” In https://github.com/fast-algos/pIRLS GitHub, https://github.com/fast-algos/pIRLS, 2019
  • [AS20] Deeksha Adil and Sushant Sachdeva “Faster p-norm Minimizing Flows, via Smoothed q-norm Problems” In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2020, pp. 892–910 SIAM
  • [BB94] Jose Antonio Barreto and C Sidney Burrus “L/sub p/-complex approximation using iterative reweighted least squares for FIR digital filters” In Proceedings of ICASSP’94. IEEE International Conference on Acoustics, Speech and Signal Processing 3, 1994, pp. III–545 IEEE
  • [Bub+18] Sébastien Bubeck, Michael B Cohen, Yin Tat Lee and Yuanzhi Li “An homotopy method for p\ell_{p} regression provably beyond self-concordance and in input-sparsity time” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018, pp. 1130–1137
  • [Bul18] Brian Bullins “Fast minimization of structured convex quartics” In arXiv preprint arXiv:1812.10349, 2018
  • [Bur12] C Burrus “Iterative re-weighted least-squares OpenStax-CNX”, 2012
  • [Cal19] Jeff Calder “Consistency of Lipschitz learning with infinite unlabeled data and finite labeled data” In SIAM Journal on Mathematics of Data Science 1.4 SIAM, 2019, pp. 780–812
  • [Car+20] Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford and Kevin Tian “Acceleration with a Ball Optimization Oracle” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 19052–19063 URL: https://proceedings.neurips.cc/paper/2020/file/dba4c1a117472f6aca95211285d0587e-Paper.pdf
  • [Che+22] Li Chen, Rasmus Kyng, Yang P Liu, Richard Peng, Maximilian Probst Gutenberg and Sushant Sachdeva “Maximum flow and minimum-cost flow in almost-linear time” In arXiv preprint arXiv:2203.00671, 2022
  • [Chi+13] Hui Han Chin, Aleksander Madry, Gary L Miller and Richard Peng “Runtime guarantees for regression problems” In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, 2013, pp. 269–282
  • [Chi+17] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy and David P Woodruff “Algorithms for p\ell_{p} low-rank approximation” In International Conference on Machine Learning, 2017, pp. 806–814 PMLR
  • [Chr+11] Paul Christiano, Jonathan A Kelner, Aleksander Madry, Daniel A Spielman and Shang-Hua Teng “Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs” In Proceedings of the forty-third annual ACM symposium on Theory of computing, 2011, pp. 273–282
  • [CT05] Emmanuel J Candes and Terence Tao “Decoding by linear programming” In IEEE transactions on information theory 51.12 IEEE, 2005, pp. 4203–4215
  • [CY08] Rick Chartrand and Wotao Yin “Iteratively reweighted algorithms for compressive sensing” In 2008 IEEE international conference on acoustics, speech and signal processing, 2008, pp. 3869–3872 IEEE
  • [EDT17] Abderrahim Elmoataz, X Desquesnes and M Toutain “On the game p-Laplacian on weighted graphs with applications in image processing and data clustering” In European Journal of Applied Mathematics 28.6 Cambridge University Press, 2017, pp. 922–948
  • [ETT15] Abderrahim Elmoataz, Matthieu Toutain and Daniel Tenbrinck “On the p-Laplacian and \infty-Laplacian on graphs with applications in image and data processing” In SIAM Journal on Imaging Sciences 8.4 SIAM, 2015, pp. 2412–2451
  • [EV19] Alina Ene and Adrian Vladu “Improved Convergence for 1\ell_{1} and \ell_{\infty} Regression via Iteratively Reweighted Least Squares” In International Conference on Machine Learning, 2019, pp. 1794–1801 PMLR
  • [GB08] M. Grant and S. Boyd “Graph implementations for nonsmooth convex programs” http://stanford.edu/~boyd/graph_dcp.html In Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences Springer-Verlag Limited, 2008, pp. 95–110
  • [GB14] M. Grant and S. Boyd “CVX: Matlab Software for Disciplined Convex Programming, version 2.1”, http://cvxr.com/cvx, 2014
  • [GR97] Irina F Gorodnitsky and Bhaskar D Rao “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm” In IEEE Transactions on signal processing 45.3 IEEE, 1997, pp. 600–616
  • [HFE18] Yosra Hafiene, Jalal Fadili and Abderrahim Elmoataz “Nonlocal pp-Laplacian Variational problems on graphs” In arXiv preprint arXiv:1810.12817, 2018
  • [JLS22] Arun Jambulapati, Yang P Liu and Aaron Sidford “Improved iteration complexities for overconstrained p-norm regression” In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, 2022, pp. 529–542
  • [Kar70] LA Karlovitz “Construction of nearest points in the LpL_{p}, pp even, and LL_{\infty} norms. I” In Journal of Approximation Theory 3.2 Academic Press, 1970, pp. 123–127
  • [KLS20] Tarun Kathuria, Yang P Liu and Aaron Sidford “Unit Capacity Maxflow in Almost O(m4/3)O(m^{4/3}) Time” In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020, pp. 119–130 IEEE
  • [Kyn+15] R. Kyng, A.. Rao, S. Sachdeva and D. Spielman “Algorithms for Lipschitz learning on graphs” In COLT, 2015
  • [Kyn+19] Rasmus Kyng, Richard Peng, Sushant Sachdeva and Di Wang “Flows in almost linear time via adaptive preconditioning” In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019, pp. 902–913
  • [Law61] Charles Lawrence Lawson “Contribution to the theory of linear least maximum approximation” In Ph. D. dissertation, Univ. Calif., 1961
  • [LS14] Yin Tat Lee and Aaron Sidford “Path Finding Methods for Linear Programming: Solving Linear Programs in O~(rank)\tilde{O}(\sqrt{rank}) Iterations and Faster Algorithms for Maximum Flow” Available at http://arxiv.org/abs/1312.6677 and http://arxiv.org/abs/1312.6713 In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, 2014, pp. 424–433 IEEE
  • [LS15] Yin Tat Lee and Aaron Sidford “Efficient Inverse Maintenance and Faster Algorithms for Linear Programming” Available at: https://arxiv.org/abs/1503.01752 In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, 2015, pp. 230–249
  • [LSW15] Yin Tat Lee, Aaron Sidford and Sam Chiu-wai Wong “A faster cutting plane method and its implications for combinatorial and convex optimization” In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, 2015, pp. 1049–1065 IEEE
  • [NN94] Yurii Nesterov and Arkadii Nemirovskii “Interior-point polynomial algorithms in convex programming” SIAM, 1994
  • [Osb85] Michael Robert Osborne “Finite algorithms in optimization and data analysis” John Wiley & Sons, Inc., 1985
  • [RCL19] Mauricio Flores Rios, Jeff Calder and Gilad Lerman “Algorithms for p\ell_{p}-based semi-supervised learning on graphs” In arXiv preprint arXiv:1901.05031, 2019
  • [Ric64] John Rischard Rice “The approximation of functions” Addison-Wesley Reading, Mass., 1964
  • [Rio19] M. F. Rios “Laplacian_Lp_Graph_SSL” In GitHub repository GitHub, https://github.com/mauriciofloresML/Laplacian_Lp_Graph_SSL, 2019
  • [SV16] Damian Straszak and Nisheeth K Vishnoi “IRLS and slime mold: Equivalence and convergence” In arXiv preprint arXiv:1601.02712, 2016
  • [SV16a] Damian Straszak and Nisheeth K Vishnoi “Natural algorithms for flow problems” In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, 2016, pp. 1868–1883 SIAM
  • [SV16b] Damian Straszak and Nisheeth K. Vishnoi “On a Natural Dynamics for Linear Programming” In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016
  • [Vai90] P Vaidya “Solving linear equations with diagonally dominant matrices by constructing good preconditioners”, 1990
  • [VB99] Ricardo A Vargas and Charles S Burrus “Adaptive iterative reweighted least squares design of L/sub p/FIR filters” In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258) 3, 1999, pp. 1129–1132 IEEE

Appendix A Solving 2\ell_{2} Problems under Subspace Constraints

We will show how to solve general problems of the following form using a linear system solver.

min𝒙\displaystyle\min_{\boldsymbol{\mathit{x}}} 𝑨𝒙𝒃22\displaystyle\quad\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right\rVert_{2}^{2}
𝑪𝒙=𝒅.\displaystyle\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{d}}.

We first write the Lagrangian of the problem,

L(𝒙,𝒗)=min𝒙max𝒗(𝑨𝒙𝒃)(𝑨𝒙𝒃)+𝒗(𝒅𝑪𝒙)L(\boldsymbol{\mathit{x}},\boldsymbol{\mathit{v}})=\min_{\boldsymbol{\mathit{x}}}\max_{\boldsymbol{\mathit{v}}}\quad(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})+\boldsymbol{\mathit{v}}^{\top}(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}})

Using Lagrangian duality and noting that strong duality holds, we can write the above as,

L(𝒙,𝒗)=\displaystyle L(\boldsymbol{\mathit{x}},\boldsymbol{\mathit{v}})= min𝒙max𝒗(𝑨𝒙𝒃)(𝑨𝒙𝒃)+𝒗(𝒅𝑪𝒙)\displaystyle\min_{\boldsymbol{\mathit{x}}}\max_{\boldsymbol{\mathit{v}}}\quad(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})+\boldsymbol{\mathit{v}}^{\top}(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}})
=\displaystyle= max𝒗min𝒙(𝑨𝒙𝒃)(𝑨𝒙𝒃)+𝒗(𝒅𝑪𝒙).\displaystyle\max_{\boldsymbol{\mathit{v}}}\min_{\boldsymbol{\mathit{x}}}\quad(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})^{\top}(\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}})+\boldsymbol{\mathit{v}}^{\top}(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}\boldsymbol{\mathit{x}}).

We first find 𝒙\boldsymbol{\mathit{x}}^{\star} that minimizes the above objective by setting the gradient with respect to 𝒙\boldsymbol{\mathit{x}} to 0. We thus have,

𝒙=(𝑨𝑨)1(2𝑨𝒃+𝑪𝒗2).\boldsymbol{\mathit{x}}^{\star}=(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\mathopen{}\mathclose{{}\left(\frac{2\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{C}}^{\top}\boldsymbol{\mathit{v}}}{2}}\right).

Using this value of 𝒙\boldsymbol{\mathit{x}} we arrive at the following dual program.

L(𝒗)=max𝒗14𝒗𝑪(𝑨𝑨)1𝑪𝒗𝒃𝑨(𝑨𝑨)1𝑨𝒃𝒗𝑪(𝑨𝑨)1𝑨𝒃+𝒃𝒃+𝒗𝒅,L(\boldsymbol{\mathit{v}})=\max_{\boldsymbol{\mathit{v}}}\quad-\frac{1}{4}\boldsymbol{\mathit{v}}^{\top}\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}\boldsymbol{\mathit{v}}-\boldsymbol{\mathit{b}}^{\top}\boldsymbol{\mathit{A}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}-\boldsymbol{\mathit{v}}^{\top}\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{b}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{v}}^{\top}\boldsymbol{\mathit{d}},

which is optimized at,

𝒗=2(𝑪(𝑨𝑨)1𝑪)1(𝒅𝑪(𝑨𝑨)1𝑨𝒃).\boldsymbol{\mathit{v}}^{\star}=2\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}}\right)^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}}\right).

Strong duality also implies that L(𝒙,𝒗)L(\boldsymbol{\mathit{x}},\boldsymbol{\mathit{v}}^{\star}) is optimized at 𝒙\boldsymbol{\mathit{x}}^{\star}, which gives us,

𝒙=(𝑨𝑨)1(𝑨𝒃+𝑪(𝑪(𝑨𝑨)1𝑪)1(𝒅𝑪(𝑨𝑨)1𝑨𝒃)).\boldsymbol{\mathit{x}}^{\star}=(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}+\boldsymbol{\mathit{C}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}}\right)^{-1}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}}\right)}\right).

We now note that we can compute 𝒙\boldsymbol{\mathit{x}}^{\star} by solving the following linear systems in order:

  1. Find the inverse of 𝑨𝑨\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}}.

  2. Solve (𝑪(𝑨𝑨)1𝑪)x=(𝒅𝑪(𝑨𝑨)1𝑨𝒃)\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{C}}^{\top}}\right)x=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{d}}-\boldsymbol{\mathit{C}}(\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{A}})^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{b}}}\right).
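The following is a minimal NumPy sketch of this procedure, assuming A has full column rank and C has full row rank. It uses dense linear solves purely for illustration; the function name and the random check at the end are not from the paper.

import numpy as np

def constrained_least_squares(A, b, C, d):
    """Solve  min_x ||Ax - b||_2^2  subject to  Cx = d  via the two linear
    solves described above."""
    AtA, Atb = A.T @ A, A.T @ b
    AtA_inv_Ct = np.linalg.solve(AtA, C.T)         # (A^T A)^{-1} C^T
    AtA_inv_Atb = np.linalg.solve(AtA, Atb)        # (A^T A)^{-1} A^T b
    # (C (A^T A)^{-1} C^T) v = d - C (A^T A)^{-1} A^T b
    v = np.linalg.solve(C @ AtA_inv_Ct, d - C @ AtA_inv_Atb)
    # x* = (A^T A)^{-1} (A^T b + C^T v)
    return np.linalg.solve(AtA, Atb + C.T @ v)

# quick usage check: the returned point satisfies the constraint C x = d
rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
C, d = rng.normal(size=(3, 10)), rng.normal(size=3)
x = constrained_least_squares(A, b, C, d)
assert np.allclose(C @ x, d)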

Appendix B Converting p\ell_{p}-Laplacian Minimization to Regression Form

Define the following terms:

  • nn denotes the number of vertices.

  • ll denotes the number of labels.

  • 𝑩\boldsymbol{\mathit{B}} denotes the edge-vertex adjacency matrix.

  • 𝒈\boldsymbol{\mathit{g}} denotes the vector of labels for the ll labelled vertices.

  • 𝑾\boldsymbol{\mathit{W}} denotes the diagonal matrix with the weights of the edges.

Set 𝑨=𝑾1/p𝑩\boldsymbol{\mathit{A}}=\boldsymbol{\mathit{W}}^{1/p}\boldsymbol{\mathit{B}} and 𝒃=𝑩[:,n:n+l]𝒈\boldsymbol{\mathit{b}}=-\boldsymbol{\mathit{B}}[:,n:n+l]\boldsymbol{\mathit{g}}. Now 𝑨𝒙𝒃pp\mathopen{}\mathclose{{}\left\lVert\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}-\boldsymbol{\mathit{b}}}\right\rVert_{p}^{p} is equal to the p\ell_{p}-Laplacian objective, and we can use our IRLS algorithm from Section 6 to find the 𝒙\boldsymbol{\mathit{x}} that minimizes it.
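A hedged NumPy sketch of this reduction follows. The column layout of B (unlabelled vertices in the first n columns, labelled vertices in the last l columns) and the application of the edge weights to the label columns as well are assumptions made only for this sketch, beyond what the text above specifies.

import numpy as np

def lp_laplacian_to_regression(B, w, g, n, l, p):
    """Build A and b so that ||Ax - b||_p^p equals the weighted l_p-Laplacian
    sum_e w_e |x_u - x_v|^p with the labelled values fixed to g.
    Assumptions (sketch only): B is the (edges x (n + l)) edge-vertex
    incidence matrix with unlabelled vertices in columns 0..n-1 and labelled
    vertices in columns n..n+l-1, and w is the edge-weight vector."""
    Wp = w ** (1.0 / p)
    A = Wp[:, None] * B[:, :n]                  # W^{1/p} B, unknown columns
    b = -(Wp[:, None] * B[:, n:n + l]) @ g      # move the labelled part to b
    return A, b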

Appendix C Increasing Resistances

We first prove the following lemma that shows how much Ψ\Psi changes with a change in resistance.

Lemma C.1.

Let Δ~=argmin𝐀Δ=cΔ𝐌𝐌Δ+e𝐫e(𝐍Δ)e2{\widetilde{{\Delta}}}=\arg\min_{\boldsymbol{\mathit{A}}\Delta=c}\Delta^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\Delta+\sum_{e}\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\Delta)_{e}^{2}. Then one has for any 𝐫\boldsymbol{\mathit{r}}^{\prime} and 𝐫\boldsymbol{\mathit{r}} such that 𝐫𝐫\boldsymbol{\mathit{r}}^{\prime}\geq\boldsymbol{\mathit{r}},

Ψ(𝒓)Ψ(𝒓)+e(1𝒓e𝒓e)𝒓e(𝑵Δ~)e2.{\Psi({\boldsymbol{\mathit{r}}^{\prime}})}\geq{\Psi\mathopen{}\mathclose{{}\left({\boldsymbol{\mathit{r}}}}\right)}+\sum_{e}\mathopen{}\mathclose{{}\left(1-\frac{\boldsymbol{\mathit{r}}_{e}}{\boldsymbol{\mathit{r}}^{\prime}_{e}}}\right)\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}{\widetilde{{\Delta}}})_{e}^{2}.
Proof.

For this proof, we use 𝑹=Diag(𝒓)\boldsymbol{\mathit{R}}=Diag(\boldsymbol{\mathit{r}}).

Ψ(𝒓)=min𝑨𝒙=𝒄𝒙𝑴𝑴𝒙+𝒙𝑵𝑹𝑵𝒙.\Psi(\boldsymbol{\mathit{r}})=\min_{\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}=\boldsymbol{\mathit{c}}}\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}.

Constructing the Lagrangian and noting that strong duality holds,

Ψ(𝒓)\displaystyle\Psi(\boldsymbol{\mathit{r}}) =min𝒙max𝒚𝒙𝑴𝑴𝒙+𝒙𝑵𝑹𝑵𝒙+2𝒚(𝒄𝑨𝒙)\displaystyle=\min_{\boldsymbol{\mathit{x}}}\max_{\boldsymbol{\mathit{y}}}\quad\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}+2\boldsymbol{\mathit{y}}^{\top}(\boldsymbol{\mathit{c}}-\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}})
=max𝒚min𝒙𝒙𝑴𝑴𝒙+𝒙𝑵𝑹𝑵𝒙+2𝒚(𝒄𝑨𝒙).\displaystyle=\max_{\boldsymbol{\mathit{y}}}\min_{\boldsymbol{\mathit{x}}}\quad\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}+\boldsymbol{\mathit{x}}^{\top}\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}+2\boldsymbol{\mathit{y}}^{\top}(\boldsymbol{\mathit{c}}-\boldsymbol{\mathit{A}}\boldsymbol{\mathit{x}}).

Optimality conditions with respect to 𝒙\boldsymbol{\mathit{x}} give us,

2𝑴𝑴𝒙+2𝑵𝑹𝑵𝒙=2𝑨𝒚.2\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}\boldsymbol{\mathit{x}}^{\star}+2\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}=2\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{y}}.

Substituting this in Ψ\Psi gives us,

Ψ(𝒓)=max𝒚2𝒚𝒄𝒚𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝒚.\Psi(\boldsymbol{\mathit{r}})=\max_{\boldsymbol{\mathit{y}}}\quad 2\boldsymbol{\mathit{y}}^{\top}\boldsymbol{\mathit{c}}-\boldsymbol{\mathit{y}}^{\top}\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{y}}.

Optimality conditions with respect to 𝒚\boldsymbol{\mathit{y}} now give us,

2𝒄=2𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝒚,2\boldsymbol{\mathit{c}}=2\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{y}}^{\star},

which upon re-substitution gives,

Ψ(𝒓)=𝒄(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝒄.\Psi(\boldsymbol{\mathit{r}})=\boldsymbol{\mathit{c}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\boldsymbol{\mathit{c}}.

We also note that

𝒙=(𝑴𝑴+𝑵𝑹𝑵)1𝑨(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝒄.\boldsymbol{\mathit{x}}^{\star}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\boldsymbol{\mathit{c}}. (15)

We now want to see what happens when we change 𝒓\boldsymbol{\mathit{r}}. Let 𝑹\boldsymbol{\mathit{R}} denote the diagonal matrix with entries 𝒓\boldsymbol{\mathit{r}} and let 𝑹=𝑹+SS\boldsymbol{\mathit{R}}^{\prime}=\boldsymbol{\mathit{R}}+\SS, where SS\SS is the diagonal matrix with the changes in the resistances. We will use the following version of the Sherman-Morrison-Woodbury formula multiple times,

(𝑿+𝑼𝑪𝑽)1=𝑿1𝑿1𝑼(𝑪1+𝑽𝑿1𝑼)1𝑽𝑿1.(\boldsymbol{\mathit{X}}+\boldsymbol{\mathit{U}}\boldsymbol{\mathit{C}}\boldsymbol{\mathit{V}})^{-1}=\boldsymbol{\mathit{X}}^{-1}-\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}(\boldsymbol{\mathit{C}}^{-1}+\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}})^{-1}\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}.

We begin by applying the above formula for 𝑿=𝑴𝑴+𝑵𝑹𝑵\boldsymbol{\mathit{X}}=\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}, 𝑪=𝑰\boldsymbol{\mathit{C}}=\boldsymbol{\mathit{I}}, 𝑼=𝑵SS1/2\boldsymbol{\mathit{U}}=\boldsymbol{\mathit{N}}^{\top}\SS^{1/2} and 𝑽=SS1/2𝑵\boldsymbol{\mathit{V}}=\SS^{1/2}\boldsymbol{\mathit{N}}. We thus get,

(𝑴𝑴+𝑵𝑹𝑵)1=(𝑴𝑴+𝑵𝑹𝑵)1(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2(𝑰+SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2)1SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1.\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}-\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}\\ \mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}}\right)^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}. (16)

We next claim that,

𝑰+SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2𝑰+SS1/2𝑹1SS1/2,\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}\preceq\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2},

which gives us,

(𝑴𝑴+𝑵𝑹𝑵)1(𝑴𝑴+𝑵𝑹𝑵)1(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2(𝑰+SS1/2𝑹1SS1/2)1SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1.\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}\preceq\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}-\\ \mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}. (17)

This further implies,

𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2(𝑰+SS1/2𝑹1SS1/2)1SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑨.\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\preceq\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}-\\ \boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}. (18)

We apply the Sherman-Morrison formula again for, 𝑿=𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨\boldsymbol{\mathit{X}}=\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}, 𝑪=(𝑰+SS1/2𝑹1SS1/2)1\boldsymbol{\mathit{C}}=-(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}, 𝑼=𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑵SS1/2\boldsymbol{\mathit{U}}=\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2} and 𝑽=SS1/2𝑵(𝑴𝑴+𝑵𝑹𝑵)1𝑨\boldsymbol{\mathit{V}}=\SS^{1/2}\boldsymbol{\mathit{N}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}. Let us look at the term 𝑪1+𝑽𝑿1𝑼\boldsymbol{\mathit{C}}^{-1}+\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}.

(𝑪1+𝑽𝑿1𝑼)1=(𝑰+SS1/2𝑹1SS1/2𝑽𝑿1𝑼)1(𝑰+SS1/2𝑹1SS1/2)1.-\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{C}}^{-1}+\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}}\right)^{-1}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2}-\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}}\right)^{-1}\succeq(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}.

Using this, we get,

(𝑨(𝑴𝑴+𝑵𝑹𝑵)1𝑨)1𝑿1+𝑿1𝑼(𝑰+SS1/2𝑹1SS1/2)1𝑽𝑿1,\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{A}}\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}^{\prime}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}}\right)^{-1}\succeq\boldsymbol{\mathit{X}}^{-1}+\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1},

which on multiplying by 𝒄\boldsymbol{\mathit{c}}^{\top} and 𝒄\boldsymbol{\mathit{c}} gives,

Ψ(𝒓)Ψ(𝒓)+𝒄𝑿1𝑼(𝑰+SS1/2𝑹1SS1/2)1𝑽𝑿1𝒄.\Psi(\boldsymbol{\mathit{r}}^{\prime})\geq\Psi(\boldsymbol{\mathit{r}})+\boldsymbol{\mathit{c}}^{\top}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{U}}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\boldsymbol{\mathit{V}}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{c}}.

We note from Equation (15) that 𝒙=(𝑴𝑴+𝑵𝑹𝑵)1𝑨𝑿1𝒄\boldsymbol{\mathit{x}}^{\star}=\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{M}}^{\top}\boldsymbol{\mathit{M}}+\boldsymbol{\mathit{N}}^{\top}\boldsymbol{\mathit{R}}\boldsymbol{\mathit{N}}}\right)^{-1}\boldsymbol{\mathit{A}}^{\top}\boldsymbol{\mathit{X}}^{-1}\boldsymbol{\mathit{c}}. We thus have,

Ψ(𝒓)\displaystyle\Psi(\boldsymbol{\mathit{r}}^{\prime}) Ψ(𝒓)+(𝒙)𝑵SS1/2(𝑰+SS1/2𝑹1SS1/2)1SS1/2𝑵𝒙\displaystyle\geq\Psi(\boldsymbol{\mathit{r}})+\mathopen{}\mathclose{{}\left(\boldsymbol{\mathit{x}}^{\star}}\right)^{\top}\boldsymbol{\mathit{N}}^{\top}\SS^{1/2}(\boldsymbol{\mathit{I}}+\SS^{1/2}\boldsymbol{\mathit{R}}^{-1}\SS^{1/2})^{-1}\SS^{1/2}\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star}
=Ψ(𝒓)+e(𝒓e𝒓e𝒓e)𝒓e(𝑵𝒙)e2.\displaystyle=\Psi(\boldsymbol{\mathit{r}})+\sum_{e}\mathopen{}\mathclose{{}\left(\frac{\boldsymbol{\mathit{r}}^{\prime}_{e}-\boldsymbol{\mathit{r}}_{e}}{\boldsymbol{\mathit{r}}^{\prime}_{e}}}\right)\boldsymbol{\mathit{r}}_{e}(\boldsymbol{\mathit{N}}\boldsymbol{\mathit{x}}^{\star})^{2}_{e}.
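As a quick numerical sanity check of Lemma C.1 (not part of the analysis above), the following hedged NumPy sketch evaluates Ψ through the closed form derived in this proof and verifies the claimed inequality on a small random instance; all dimensions and distributions are arbitrary illustrative choices.

import numpy as np

def psi(r, M, N, A, c):
    """Psi(r) = min_{Ax = c} x^T M^T M x + x^T N^T Diag(r) N x, computed via
    the closed form c^T (A (M^T M + N^T R N)^{-1} A^T)^{-1} c derived above."""
    H = M.T @ M + N.T @ (r[:, None] * N)
    return c @ np.linalg.solve(A @ np.linalg.solve(H, A.T), c)

def check_lemma_c1(seed=0):
    """Check that increasing resistances from r to r' >= r increases Psi by
    at least sum_e (1 - r_e / r'_e) r_e (N x*)_e^2, where x* is the minimizer
    at r given by Equation (15)."""
    rng = np.random.default_rng(seed)
    m, n, d = 20, 12, 4
    M, N = rng.normal(size=(m, n)), rng.normal(size=(m, n))
    A, c = rng.normal(size=(d, n)), rng.normal(size=d)
    r = rng.random(m) + 0.1
    r_new = r + rng.random(m)                      # r' >= r entrywise
    H = M.T @ M + N.T @ (r[:, None] * N)
    x_star = np.linalg.solve(
        H, A.T @ np.linalg.solve(A @ np.linalg.solve(H, A.T), c))
    lower = psi(r, M, N, A, c) + np.sum((1 - r / r_new) * r * (N @ x_star) ** 2)
    return psi(r_new, M, N, A, c) >= lower - 1e-9

print(check_lemma_c1())   # expected output: True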