
Convergence of $\mathsf{log}(1/\epsilon)$ for Gradient-Based Algorithms in Zero-Sum Games without the Condition Number:
A Smoothed Analysis

Ioannis Anagnostides Tuomas Sandholm Optimized Markets, Inc.
Abstract

Gradient-based algorithms have shown great promise in solving large (two-player) zero-sum games. However, their success has been mostly confined to the low-precision regime, since the number of iterations grows polynomially in $1/\epsilon$, where $\epsilon>0$ is the duality gap. While it has been well-documented that linear convergence—an iteration complexity scaling as $\mathsf{log}(1/\epsilon)$—can be attained even with gradient-based algorithms, that comes at the cost of introducing a dependency on certain condition number-like quantities which can be exponentially large in the description of the game.

To address this shortcoming, we examine the iteration complexity of several gradient-based algorithms in the celebrated framework of smoothed analysis, and we show that they have polynomial smoothed complexity, in that their number of iterations grows as a polynomial in the dimensions of the game, $\mathsf{log}(1/\epsilon)$, and $1/\sigma$, where $\sigma$ measures the magnitude of the smoothing perturbation. Our result applies to optimistic gradient and extra-gradient descent/ascent, as well as a certain iterative variant of Nesterov’s smoothing technique. From a technical standpoint, the proof proceeds by characterizing and performing a smoothed analysis of a certain error bound, the key ingredient driving linear convergence in zero-sum games. En route, our characterization also makes a natural connection between the convergence rate of such algorithms and perturbation-stability properties of the equilibrium, which is of interest beyond the model of smoothed complexity.

1 Introduction

We consider the fundamental problem of computing an equilibrium strategy for a (two-player) zero-sum game

\min_{\bm{x}\in\Delta^{n}}\max_{\bm{y}\in\Delta^{m}}\langle\bm{x},\mathbf{A}\bm{y}\rangle, (1)

where $\Delta^{d+1}\coloneqq\{\bm{x}\in{\mathbb{R}}_{\geq 0}^{d+1}:\bm{x}^{\top}\bm{1}_{d+1}=1\}$ represents the $d$-dimensional probability simplex and $\mathbf{A}\in{\mathbb{R}}^{n\times m}$ is the payoff matrix of the game. Tracing all the way back to von Neumann’s celebrated minimax theorem (von Neumann, 1928), zero-sum games played a pivotal role in the early development of game theory (von Neumann and Morgenstern, 1947) and the crystallization of linear programming duality (Dantzig, 1951). Indeed, in light of the equivalence between zero-sum games and linear programming (Adler, 2013; von Stengel, 2023; Brooks and Reny, 2023), many central optimization problems can be cast as (1).

State of the art algorithms for solving zero-sum games can be coarsely classified based on the desired accuracy of a feasible solution $(\bm{x},\bm{y})$, measured in terms of the duality gap

\Phi(\bm{x},\bm{y})\coloneqq\max_{\bm{y}^{\prime}\in\Delta^{m}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-\min_{\bm{x}^{\prime}\in\Delta^{n}}\langle\bm{x}^{\prime},\mathbf{A}\bm{y}\rangle. (2)
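Since both the inner maximization and the inner minimization in (2) are over probability simplices, each is attained at a vertex, so the duality gap reduces to a maximum over columns minus a minimum over rows. The following minimal sketch (an illustration in Python/NumPy, not part of the formal development; the example matrix is arbitrary) makes this concrete.

```python
import numpy as np

def duality_gap(A, x, y):
    """Duality gap (2) for the game min_x max_y <x, A y> over simplices.

    Over the simplex, max_{y'} <x, A y'> = max_j (A^T x)_j and
    min_{x'} <x', A y> = min_i (A y)_i.
    """
    return np.max(A.T @ x) - np.min(A @ y)

# Example: matching pennies; the uniform strategies are the (unique) equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = y = np.array([0.5, 0.5])
print(duality_gap(A, x, y))  # 0.0
```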

In the so-called low-precision regime, where one is content with a crude solution $(\bm{x}^{\star},\bm{y}^{\star})$ such that $\Phi(\bm{x}^{\star},\bm{y}^{\star})\eqqcolon\epsilon\gg 0$, the best available algorithms typically revolve around the framework of regret minimization, both in practice (Farina et al., 2021; Brown and Sandholm, 2019; Zinkevich et al., 2007; Tang et al., 2023) and in theory (Carmon et al., 2020, 2019, 2024; Grigoriadis and Khachiyan, 1995; Clarkson et al., 2012; Alacaoglu and Malitsky, 2022)—in conjunction with other techniques to speed up the per-iteration complexity, such as variance reduction, data structure design, and matrix sparsification (Zhang and Sandholm, 2020; Farina and Sandholm, 2022). Such algorithms have been central to landmark results in practical computation of equilibrium strategies even in enormous games (Brown and Sandholm, 2018; Bowling et al., 2015; Moravčík et al., 2017; Perolat et al., 2022).

The high-precision regime, where $\epsilon\ll\frac{1}{\poly(nm)}$, has turned out to be more elusive, with current LP-based techniques struggling to scale favorably in large instances. This deficiency can be in part attributed to the relatively high per-iteration complexity of LP-based approaches, such as interior-point methods or the ellipsoid algorithm, as well as their intense memory requirements. A promising antidote is to instead rely on iterative gradient-based methods that have a minimal per-iteration cost. Indeed, in a line of work pioneered by Tseng (1995), it is by now well-documented that linear convergence—an iteration complexity scaling only as $\mathsf{log}(1/\epsilon)$—can be achieved even with such methods (Tseng, 1995; Gilpin et al., 2012; Wei et al., 2021; Applegate et al., 2023; Fercoq, 2023). There is, however, a major caveat to those results: the number of iterations no longer grows polynomially with the dimensions of the game $n$ and $m$, but instead depends on certain condition number-like quantities that could be exponentially large in the description of the problem; it is thus unclear how to interpret those results from a computational standpoint.

To address those shortcomings, in this paper we work in the celebrated framework of smoothed analysis pioneered by Spielman and Teng (2004). Namely, our goal is to characterize the iteration complexity of certain gradient-based algorithms in zero-sum games when the payoff matrix $\mathbf{A}$ is subjected to small but random perturbations, as formally introduced below.

Definition 1.1 (Zero-sum games under Gaussian perturbations).

Let $\bar{\mathbf{A}}\in[-1,1]^{n\times m}$. We assume that the payoff matrix is given by $\mathbf{A}\coloneqq\bar{\mathbf{A}}+\mathbf{G}$, where each entry of $\mathbf{G}$ is an independent (univariate) Gaussian random variable with zero mean and variance $\sigma^{2}\leq 1$.
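For illustration, a smoothed instance per Definition 1.1 can be drawn as follows (a minimal sketch; the base matrix and the value of $\sigma$ are arbitrary choices).

```python
import numpy as np

def smoothed_payoff_matrix(A_bar, sigma, seed=None):
    """Return A = A_bar + G with i.i.d. N(0, sigma^2) entries in G (Definition 1.1)."""
    rng = np.random.default_rng(seed)
    return A_bar + sigma * rng.standard_normal(A_bar.shape)

A_bar = np.zeros((4, 5))                      # any base matrix with entries in [-1, 1]
A = smoothed_payoff_matrix(A_bar, sigma=0.1, seed=0)
```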

Randomness here is only injected into the payoff matrix and not the set of constraints (that is, the probability simplex), which is the natural model; after applying the perturbation, the problem should still be a zero-sum game in the form of (1). Under this model, we investigate the convergence of the following gradient-based algorithms; their formal description is given later in Appendix B, and a minimal sketch of the first of them appears right after the list below. (The vanilla gradient descent/ascent algorithm does not even converge (in a last-iterate sense) in zero-sum games (e.g., Mertikopoulos et al., 2018), which is why our analysis revolves around certain variants thereof. It is worth noting that regret minimization techniques provide guarantees concerning the average iterates, a distinction blurred in our introduction.)

  1. optimistic gradient descent/ascent (OGDA) (Popov, 1980);

  2. optimistic multiplicative weights update (OMWU) (Syrgkanis et al., 2015; Chiang et al., 2012; Rakhlin and Sridharan, 2013);

  3. extra-gradient descent/ascent (EGDA) (Korpelevich, 1976); and

  4. an iterative variant of Nesterov’s smoothing technique (IterSmooth) (Gilpin et al., 2012; Nesterov, 2005).
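To make Item 1 concrete, the sketch below shows one common formulation of OGDA on the product of simplices with Euclidean projections; the precise variant, step size, and stopping rule used in our analysis are the ones specified in Appendix B, which may differ in details from this illustration.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sorting-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ogda(A, eta=0.05, iters=5000):
    """Optimistic GDA: z_{t+1} = Proj(z_t - eta * (2 F(z_t) - F(z_{t-1}))), F(z) = (A y, -A^T x)."""
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m
    gx_prev, gy_prev = A @ y, A.T @ x          # gradients at the previous iterate
    for _ in range(iters):
        gx, gy = A @ y, A.T @ x                # gradients at the current iterate
        x = project_simplex(x - eta * (2 * gx - gx_prev))
        y = project_simplex(y + eta * (2 * gy - gy_prev))
        gx_prev, gy_prev = gx, gy
    return x, y

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 6))
x, y = ogda(A)
print(np.max(A.T @ x) - np.min(A @ y))  # duality gap of the last iterate; should be small
```

The same skeleton yields EGDA by replacing the optimistic step with an explicit extrapolation step.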

Smoothed complexity allows interpolating between worst-case analysis—when the variance of the noise $\sigma^{2}$ is negligible—and average-case analysis—when the noise dominates over the underlying input. An average-case analysis is often unreliable since—as Edelman (1993) convincingly argued—a fully random matrix does not necessarily capture typical instances encountered in practice. Spielman and Teng (2004) put forward the framework of smoothed analysis as an attempt to explain the performance of algorithms in realistic scenarios; to understand how brittle worst-case instances really are. They famously proved that the simplex algorithm, under a certain pivoting rule, enjoys polynomial smoothed complexity, meaning that its running time is bounded by some polynomial in the size of the input and $1/\sigma$. Smoothed analysis is by now a well-accepted algorithmic framework with a tremendous impact on the analysis of algorithms. We also argue that it is particularly well-motivated from a game-theoretic perspective: there is often misspecification or noise when modeling a game, so smoothed analysis offers a compelling way of bypassing pathological instances that are perhaps artificial in the first place.

Nevertheless, we are not aware of any prior work operating in the smoothed complexity model per Definition 1.1 in the context of zero-sum games. To clarify this point, it is important to stress here that although zero-sum games can be immediately reduced to linear programs, that reduction is less clear in the smoothed complexity model. In particular, one set of constraints in the induced linear program takes the form $\mathbf{A}\bm{y}\leq v\bm{1}_{n}\eqqcolon\bm{b}$, where $\bm{1}_{n}\in{\mathbb{R}}^{n}$ is the all-ones vector. According to the usual model of smoothed complexity in the context of linear programs, randomness has to be injected into both $\mathbf{A}$ and $\bm{b}$, but that clearly disturbs the validity of the equivalence. More broadly, reductions in the smoothed complexity model are quite delicate (Bläser and Manthey, 2015); as a further example, even reductions involving solely linear transformations can break in the smoothed complexity model since independence—a crucial assumption in this framework—is not guaranteed to carry over. Relatedly, one interesting direction arising from the work of Spielman and Teng (2003) is to perform smoothed analysis in linear programs which are guaranteed to be feasible and bounded, no matter the perturbation; zero-sum games under Definition 1.1 constitute such a class. Besides the point above, different algorithms designed for the same problem can have entirely different properties, not least in terms of their smoothed complexity. The class of algorithms we consider in this paper is quite distinct from the ones shown to have polynomial smoothed complexity in the context of linear programs (described further in Appendix A). In many ways, gradient-based methods are simpler and more natural, which partly justifies their tremendous practical use. As a result, understanding their smoothed complexity is an important question.

1.1 Our results

Our main contribution is to show that, with the exception of OMWU, the other gradient-based algorithms mentioned above (Items 1, 3 and 4) have polynomial smoothed complexity with high probability—that is to say, with probability at least $1-\frac{1}{\poly(nm)}$.

Theorem 1.2.

With high probability over the randomness of $\mathbf{A}\in{\mathbb{R}}^{n\times m}$ (Definition 1.1), OGDA, EGDA and IterSmooth converge to an $\epsilon$-equilibrium after $\poly(n,m,1/\sigma)\cdot\mathsf{log}(1/\epsilon)$ iterations.

The main takeaway of this result is that, modulo pathological instances, certain gradient-based algorithms are reliable solvers in zero-sum games even in the high-precision regime. Similarly to earlier endeavors in the context of linear programs (Spielman and Teng, 2004; Blum and Dunagan, 2002), a dependency of $\poly(1/\sigma)$ (as in Theorem 1.2) is what we should expect; the one exception is the class of interior-point methods whose running time grows as $\mathsf{log}(1/\sigma)$, but those algorithms are (weakly) polynomial even in the worst case. We further remark that the polynomial dependency on $n$ and $m$ in Theorem 1.2 can almost certainly be improved, and we made no effort to optimize it.

Regarding OMWU, which is not covered by Theorem 1.2, we also obtain a significant improvement in the iteration complexity compared to the worst-case analysis of Wei et al. (2021), but our bound is still not polynomial. As we explain further in Section C.3, the main difficulty pertaining to OMWU is that the analysis of Wei et al. (2021) gives (at best) an exponential bound no matter the geometry of the problem. With that in mind, our result is essentially the best one could hope for without refining the worst-case analysis of OMWU, which is not within our scope here. We anticipate that our characterization herein will prove useful in conjunction with future developments in the worst-case complexity of OMWU, as well as in the analysis of other iterative methods.

The error bound

The central ingredient that enables gradient-based algorithms to exhibit linear convergence is a certain error bound, given below as Definition 1.3. For compactness in our notation, we let ${\mathcal{X}}\coloneqq\Delta^{n}$ and ${\mathcal{Y}}\coloneqq\Delta^{m}$. We then let $\bm{z}\coloneqq(\bm{x},\bm{y})$, ${\mathcal{Z}}\coloneqq{\mathcal{X}}\times{\mathcal{Y}}$, and $\mathcal{Z}^{\star}\coloneqq\mathcal{X}^{\star}\times\mathcal{Y}^{\star}$, where $\mathcal{X}^{\star}$ and $\mathcal{Y}^{\star}$ represent the (convex) set of equilibria for Player $x$ and Player $y$, respectively.

Definition 1.3 (Error bound).

Let $\Phi(\bm{z})$ denote the duality gap as introduced in (2). We say that the zero-sum game (1) satisfies an error bound with modulus $\kappa\in{\mathbb{R}}_{>0}$ if

\Phi(\bm{z})\geq\kappa\|\bm{z}-\Pi_{\mathcal{Z}^{\star}}(\bm{z})\|\quad\forall\bm{z}\in{\mathcal{Z}}. (3)

Above, $\Pi_{\mathcal{Z}^{\star}}(\cdot)$ denotes the (Euclidean) projection operator; the set of games with a unique equilibrium has measure one, so we can safely replace $\Pi_{\mathcal{Z}^{\star}}(\bm{z})$ by the unique equilibrium $\bm{z}^{\star}\in\mathcal{Z}^{\star}$. It has been known at least since the work of Tseng (1995) that affine variational inequalities indeed satisfy (3). Nevertheless, it should come as no surprise that, even in $3\times 3$ games, $\kappa$ can be arbitrarily small (Proposition 3.1), which in turn means that, linear convergence notwithstanding, the number of iterations prescribed by an analysis revolving around (3) can be arbitrarily large. In fact, with the exception of OMWU, which is to be discussed further below, Definition 1.3 suffices to establish linear convergence (essentially) based on existing results. (Definition 1.3 also readily establishes linear convergence for other compelling primal-dual algorithms, as shown recently by Applegate et al. (2023); in that paper, the error bound was referred to as “sharpness,” a terminology employed in other papers as well (e.g., Zarifis et al., 2024).) Our main result pertaining to Definition 1.3 is that the modulus $\kappa$ is likely to be polynomial in the smoothed complexity model:

Theorem 1.4.

With high probability over the randomness of $\mathbf{A}$ (Definition 1.1), the error bound per Definition 1.3 is satisfied for any sufficiently small $\kappa\geq\poly(\sigma,1/(nm))$.

To establish this result, the first step is to lower bound $\kappa$ in terms of certain natural geometric features of the problem (Theorem 3.6), which is discussed further in Section 3.1. Establishing Theorem 1.4 then reduces to analyzing each of those quantities under Definition 1.1. It turns out that bounding those quantities also suffices for characterizing OMWU, whose existing analysis due to Wei et al. (2021) involves some further ingredients besides the error bound of Definition 1.3.

Further implications

Our characterization of the error bound given in Theorem 3.6 has some further important implications. First, a well-known vexing issue regarding computing equilibria even in zero-sum games is that a solution with small duality gap can still be relatively far from the equilibrium in the geometric sense, a phenomenon further exacerbated in multi-player games (Etessami and Yannakakis, 2007). Therefore, results providing guarantees in terms of the duality gap are not particularly informative when it comes to computing strategies close to the equilibrium in a geometric sense. At the same time, there are ample reasons why the latter guarantee is more appealing (Etessami and Yannakakis, 2007). Theorem 1.4 implies that such concerns can be alleviated in the smoothed complexity model:

Corollary 1.5.

With high probability over the randomness of $\mathbf{A}$ (Definition 1.1), any point $\bm{z}\in{\mathcal{Z}}$ with $\Phi(\bm{z})\leq\epsilon$ satisfies $\|\bm{z}-\bm{z}^{\star}\|\leq\epsilon\cdot\poly(n,m,1/\sigma)$.

Beyond smoothed analysis, Theorem 3.6 applies to any non-degenerate game (in the sense of Definition 3.2), and can thereby be used to parameterize the rate of convergence of gradient-based algorithms based on natural and interpretable game-theoretic quantities of the underlying game, which has eluded prior work. In particular, we make a natural connection between the complexity of gradient-based algorithms and perturbation-stability properties of the equilibrium. In light of misspecifications which are often present in game-theoretic modeling, focusing on games with perturbation-stable equilibria is well-motivated and has already received ample interest in prior work (Balcan and Braverman, 2017; Awasthi et al., 2010); more broadly, perturbation stability is a common assumption in the analysis of algorithms beyond the worst-case model (Makarychev and Makarychev, 2021). There are different natural ways of defining perturbation-stable games; here, we assume that any perturbation with magnitude below $\delta>0$, in that $\|\mathbf{A}^{\prime}-\mathbf{A}\|_{2}\leq\delta$, maintains the support of the equilibrium and the non-degeneracy of the game; we call such games $\delta$-support-stable (Definition 4.1). In this context, we show the following result.

Corollary 1.6.

For any $\delta$-support-stable zero-sum game, OGDA, EGDA and IterSmooth converge to an $\epsilon$-equilibrium after $\poly(n,m,1/\delta)\cdot\mathsf{log}(1/\epsilon)$ iterations.

That is, games in which $\delta$ is not too close to $0$ are more amenable to gradient-based algorithms, which is quite a natural connection. Corollary 1.6 is shown by relating each of the quantities involved in Theorem 3.6 to the parameter $\delta$ defined above.

2 Notation

Before we proceed with our technical content, we first take the opportunity to streamline our notation; further background on smoothed analysis and a description of the algorithms referred to earlier (Items 1, 2, 3 and 4) are given later in Appendix B, as they are not important for the purpose of the main body.

We use boldface letters, such as $\bm{x},\bm{y},\bm{b},\bm{c}$, to represent vectors in a Euclidean space. For a vector $\bm{x}\in{\mathbb{R}}^{n}$, we access its $i$th coordinate via a subscript, namely $\bm{x}_{i}$. Superscripts (together with parentheses) are typically reserved for the (discrete) time index. We denote by $\|\bm{x}\|$ the Euclidean norm, $\|\bm{x}\|\coloneqq\sqrt{\sum_{i=1}^{n}\bm{x}_{i}^{2}}$, the $\ell_{\infty}$ norm by $\|\bm{x}\|_{\infty}\coloneqq\max_{1\leq i\leq n}|\bm{x}_{i}|$, and the $\ell_{1}$ norm by $\|\bm{x}\|_{1}\coloneqq\sum_{i=1}^{n}|\bm{x}_{i}|$. For $\bm{x},\bm{x}^{\prime}\in{\mathbb{R}}^{n}$, we let $\mathsf{dist}(\bm{x},\bm{x}^{\prime})\coloneqq\|\bm{x}-\bm{x}^{\prime}\|$. $\mathsf{span}(\cdot)$ represents the linear space spanned by a given set of vectors. For $\bm{x}\in{\mathbb{R}}^{n}$ and a subset $B\subseteq[n]$, we denote by $\bm{x}_{B}\in{\mathbb{R}}^{B}$ the subvector of $\bm{x}$ induced by $B$. We let $\bm{1}_{n}\in{\mathbb{R}}^{n}$ be the all-ones vector of dimension $n$; we will typically omit the subscript when it is clear from the context. For vectors $\bm{x}\in{\mathbb{R}}^{n}$ and $\bm{y}\in{\mathbb{R}}^{m}$, we write $(\bm{x},\bm{y})\in{\mathbb{R}}^{n+m}$ to denote their concatenation. Throughout this paper, we use $\bm{x}$ and $\bm{y}$ to denote the strategy of Player $x$ and Player $y$, respectively.

To represent matrices, we use boldface capital letters, such as $\mathbf{A},\mathbf{Q}$. It will sometimes be convenient to use $\mathbf{A}^{\flat}\in{\mathbb{R}}^{nm}$ to represent a vectorization of $\mathbf{A}\in{\mathbb{R}}^{n\times m}$. We overload notation by letting $\|\mathbf{A}\|$ be the spectral norm of $\mathbf{A}$. For a matrix $\mathbf{A}\in{\mathbb{R}}^{n\times m}$ and subsets $B\subseteq[n],N\subseteq[m]$, we denote by $\mathbf{A}_{B,N}\in{\mathbb{R}}^{B\times N}$ the submatrix of $\mathbf{A}$ induced by $B$ and $N$. $\mathbf{A}_{i,:}$ and $\mathbf{A}_{:,j}$ represent the $i$th row and $j$th column of $\mathbf{A}$, respectively. The singular values of a matrix $\mathbf{M}\in{\mathbb{R}}^{d\times d}$ are denoted by $\sigma_{1}(\mathbf{M})\geq\sigma_{2}(\mathbf{M})\geq\dots\geq\sigma_{d}(\mathbf{M})\geq 0$ (not to be confused with our notation for the variance $\sigma^{2}$). To be more explicit, we may also use $\sigma_{\mathrm{max}}(\mathbf{M})\coloneqq\sigma_{1}(\mathbf{M})$ and $\sigma_{\mathrm{min}}(\mathbf{M})\coloneqq\sigma_{d}(\mathbf{M})$.

3 Smoothed analysis of the error bound

In this section, we perform a smoothed analysis of the error bound—as introduced earlier in Definition 1.3—in (two-player) zero-sum games. It is first instructive to point out why smoothed analysis is useful in the first place: the modulus $\kappa$ can be arbitrarily close to $0$ even when $n=m=3$ (that is, $3\times 3$ games); this is detrimental as the iteration complexity of algorithms such as OGDA grows as a polynomial in $1/\kappa$.

Proposition 3.1.

There exists a $3\times 3$ zero-sum game such that $\kappa$ per Definition 1.3 is arbitrarily close to $0$.

To see this, it is enough to consider the ill-conditioned diagonal matrix

\mathbf{A}=\begin{pmatrix}\gamma&0&0\\ 0&2\gamma&0\\ 0&0&1\end{pmatrix}, (4)

where $0<\gamma\ll 1$. The (unique) equilibrium of (4) reads $\bm{x}^{\star}=\bm{y}^{\star}=\frac{1}{3+2\gamma}(2,1,2\gamma)\in\Delta^{3}$. Now, considering $\bm{x}=(1,0,0)$ and $\bm{y}=(0,0,1)$, for the duality gap we have $\Phi(\bm{x},\bm{y})=\gamma$, while the distance of $(\bm{x},\bm{y})$ from the optimal solution $(\bm{x}^{\star},\bm{y}^{\star})$ is at least $\frac{3}{3+2\gamma}$. In turn, by Definition 1.3, this means that $\kappa\leq 2\gamma$. So, Proposition 3.1 follows by taking $\gamma\to 0$. (If we want to specify the game with a finite number $L$ of bits, Proposition 3.1 tells us that the modulus $\kappa$ can be exponentially small in $L$.)
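A quick numerical sanity check of this calculation (a sketch; the value of $\gamma$ is arbitrary):

```python
import numpy as np

gamma = 1e-3
A = np.diag([gamma, 2 * gamma, 1.0])
z_star = np.array([2.0, 1.0, 2 * gamma]) / (3 + 2 * gamma)    # x* = y* per the text
x, y = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])

gap = np.max(A.T @ x) - np.min(A @ y)                          # duality gap Phi(x, y)
dist = np.sqrt(np.sum((x - z_star) ** 2) + np.sum((y - z_star) ** 2))
print(gap, dist, gap / dist)   # gap = gamma, dist >= 3/(3+2*gamma), ratio <= 2*gamma
```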

Proposition 3.1 exposes one type of pathology that can decelerate gradient-based algorithms, which is evidently related to the poor spectral properties of the payoff matrix. This intuition is quite helpful when equilibria are fully supported—as is the case in (4)—but has to be significantly refined more broadly, as we formalize in the sequel.

To sidestep such pathological examples, we thus turn to the smoothed analysis framework of Definition 1.1.

3.1 Overview

The most natural approach to analyze the error bound in the smoothed complexity model is to rely on an existing (worst-case) analysis proving that a positive $\kappa$ exists, and then attempt to refine that analysis. Yet, at least based on the prior results we are aware of, that turns out to be challenging. As an example, let us consider the recent analysis of Wei et al. (2021). As we explain in more detail in Section C.3, Wei et al. (2021) relate the modulus $\kappa$ of the error bound to the (inverse of the) norm of a solution to a certain feasible linear program; the existence of a legitimate $\kappa>0$ then follows readily from feasibility. Now, this reduction seems quite promising: Renegar (1994) has shown that the norm of a solution to a linear program can be bounded in terms of its condition number (the distance to infeasibility in our case), and Dunagan et al. (2011) later proved that the condition number of linear programs is polynomial in the smoothed complexity model. Nevertheless, there are some difficulties in materializing that argument. First, the induced linear program involves terms depending on both the payoff matrix and the geometry of the constraints (the probability simplex in our case). Consequently, the analysis of Dunagan et al. (2011) does not carry over since randomness is only injected into the payoff matrix. The second and more important obstacle is that the induced linear program depends on the optimal solution, which in turn depends on the randomness of the payoff matrix; this significantly entangles the underlying distribution. As there are exponentially many possible configurations, we cannot afford to argue about each one separately and then apply the union bound. This difficulty is in fact known to be the crux in performing smoothed analysis (Spielman and Teng, 2004). (This is not a concern in the unconstrained setting, where ${\mathcal{X}}={\mathbb{R}}^{n}$ and ${\mathcal{Y}}={\mathbb{R}}^{m}$, in which a polynomial smoothed complexity follows readily from existing results relating the convergence of OGDA or EGDA to the condition number of the payoff matrix $\mathbf{A}$ (e.g., Mokhtari et al., 2020; Li et al., 2023; Azizian et al., 2020), which in turn is well-known to be polynomial in the smoothed complexity model (Spielman and Teng, 2004).)

To address those challenges, we provide a new characterization of the error bound in terms of some natural quantities of the underlying game (Theorem 3.6), which in some sense capture the difficulty of the problem. We are then able to use a technique due to Spielman and Teng (2004), exposed in Section 3.3, to bound the probability that each of the involved quantities is close to 0 (Propositions 3.10, 3.8 and 3.9), even though the underlying distribution is quite convoluted. The resulting analysis follows the one given by Spielman and Teng (2003) in the context of termination of linear programs, but still has to account for a number of structural differences.

Our argument is structured as follows. First, in Section 3.2, we relate the modulus $\kappa$ to some natural quantities capturing key geometric features of the problem. Section 3.3 then proceeds by analyzing those quantities in the smoothed analysis framework.

3.2 Characterization of the error bound

Our first goal is to characterize the error bound in terms of certain natural quantities, which will then enable us to provide polynomial error bounds in the smoothed complexity model. Our only assumption here is that the zero-sum game is non-degenerate, in the sense of Definition 3.2 below; this can always be met with the addition of an arbitrarily small amount of noise (Lemma C.1). As such, our characterization here has an interest beyond the smoothed analysis framework, casting the error bound in terms of more interpretable game-theoretic quantities; for example, a concrete implication is given in Section 4.

Let us denote by $v$ the value of game (1), that is,

v=\min_{\bm{x}\in{\mathcal{X}}}\max_{\bm{y}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}\rangle=\max_{\bm{y}\in{\mathcal{Y}}}\min_{\bm{x}\in{\mathcal{X}}}\langle\bm{x},\mathbf{A}\bm{y}\rangle,

which is a consequence of the minimax theorem (von Neumann, 1928). We are now ready to state the formal definition of a non-degenerate game.

Definition 3.2 (Non-degenerate game).

A zero-sum game described with a payoff matrix $\mathbf{A}$ and value $v$ is said to be non-degenerate if it admits a unique equilibrium $(\bm{x}^{\star},\bm{y}^{\star})\in{\mathcal{Z}}$, and $\bm{x}^{\star}$ and $\bm{y}^{\star}$ make tight exactly $n$ of the inequalities $\{\bm{x}_{i}\geq 0\}_{i\in[n]}\cup\{\langle\bm{x},\mathbf{A}_{:,j}\rangle\leq v\}_{j\in[m]}$ and $m$ of the inequalities $\{\bm{y}_{j}\geq 0\}_{j\in[m]}\cup\{\langle\bm{y},\mathbf{A}_{i,:}\rangle\geq v\}_{i\in[n]}$, respectively.
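For intuition, non-degeneracy can be probed numerically on small instances: compute an equilibrium with an LP solver and count the tight inequalities. The sketch below is only an illustration; the LP formulation and the tolerance are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Return (x_star, y_star, v) for min_x max_y <x, A y> via two LPs."""
    n, m = A.shape
    # Player x: minimize v subject to A^T x <= v*1, 1^T x = 1, x >= 0.
    res_x = linprog(c=np.r_[np.zeros(n), 1.0],
                    A_ub=np.c_[A.T, -np.ones(m)], b_ub=np.zeros(m),
                    A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
                    bounds=[(0, None)] * n + [(None, None)], method="highs")
    # Player y: maximize w subject to A y >= w*1, 1^T y = 1, y >= 0.
    res_y = linprog(c=np.r_[np.zeros(m), -1.0],
                    A_ub=np.c_[-A, np.ones(n)], b_ub=np.zeros(n),
                    A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],
                    bounds=[(0, None)] * m + [(None, None)], method="highs")
    return res_x.x[:n], res_y.x[:m], res_x.x[n]

def count_tight(A, x, y, v, tol=1e-7):
    """Number of tight inequalities for each player, as in Definition 3.2."""
    tight_x = np.sum(x <= tol) + np.sum(np.abs(A.T @ x - v) <= tol)
    tight_y = np.sum(y <= tol) + np.sum(np.abs(A @ y - v) <= tol)
    return tight_x, tight_y

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x, y, v = solve_zero_sum(A)
print(count_tight(A, x, y, v))   # (n, m) for a non-degenerate game
```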

In the sequel, we will make constant use of the fact that the set of degenerate games has measure zero under the law induced by Definition 1.1 (Lemma C.1).

In this context, we let $B(\bm{x}^{\star})\coloneqq\{i\in[n]:\bm{x}^{\star}_{i}>0\}$ denote the support of $\bm{x}^{\star}$ (corresponding to Player $x$), and similarly $N(\bm{y}^{\star})\coloneqq\{j\in[m]:\bm{y}^{\star}_{j}>0\}$ for the support of Player $y$. The strict complementarity theorem (Ye, 2011) tells us that $B$ indexes exactly the set of tight inequalities $\{\langle\bm{y},\mathbf{A}_{i,:}\rangle\geq v\}_{i\in[n]}$, and symmetrically, $N$ indexes exactly the set of tight inequalities $\{\langle\bm{x},\mathbf{A}_{:,j}\rangle\leq v\}_{j\in[m]}$. In particular, this implies that $|B|=|N|$ with probability $1$. It will also be convenient to define $\bar{B}\coloneqq[n]\setminus B$ and $\bar{N}\coloneqq[m]\setminus N$.

Now, at a high level, one can split solving a zero-sum game into two subproblems: i) identifying the support of the equilibrium, and ii) solving the induced linear system to specify the exact probabilities within the support. It will be helpful to have that viewpoint in mind in the upcoming analysis, and in particular in the proof of Theorem 3.6. Roughly speaking, thinking of $\kappa$ as a measure of the problem’s difficulty, we will relate $\kappa$ to i) the difficulty of identifying the support of the equilibrium, and ii) the difficulty of solving the induced linear system. To be clear, those two subproblems are only helpful for the purpose of the analysis, and they are certainly intertwined when using algorithms such as OGDA.

Staying on the latter task, we will make use of a certain transformation so as to eliminate one of the redundant variables. Namely, for any $\widehat{\bm{x}}_{B}\in\Delta(B)$ and $\widehat{\bm{y}}_{N}\in\Delta(N)$, let us select a fixed pair of coordinates $(i,j)\in B\times N$ (for example, the ones with the smallest index). Using the fact that $\langle\widehat{\bm{x}}_{B},\bm{1}\rangle=1$ and $\langle\widehat{\bm{y}}_{N},\bm{1}\rangle=1$, we can eliminate $\widehat{\bm{x}}_{i}$ and $\widehat{\bm{y}}_{j}$ by writing

\langle\widehat{\bm{x}}_{B},\mathbf{A}_{B,N}\widehat{\bm{y}}_{N}\rangle=\langle\widetilde{\bm{x}},\mathbf{Q}\widetilde{\bm{y}}\rangle-\langle\widetilde{\bm{x}},\bm{c}\rangle-\langle\widetilde{\bm{y}},\bm{b}\rangle+d, (5)

where $\widetilde{\bm{x}}\in{\mathbb{R}}_{\geq 0}^{\widetilde{B}},\widetilde{\bm{y}}\in{\mathbb{R}}_{\geq 0}^{\widetilde{N}}$ (for $\widetilde{B}\coloneqq B\setminus\{i\}$ and $\widetilde{N}\coloneqq N\setminus\{j\}$) coincide with $\widehat{\bm{x}}_{B}$ and $\widehat{\bm{y}}_{N}$ on all coordinates in $\widetilde{B}$ and $\widetilde{N}$, respectively, and $\mathbf{A}^{\flat}_{B,N}=\mathbf{T}(\mathbf{Q}^{\flat},\bm{b},\bm{c},d)$ for a (non-singular) linear transformation $\mathbf{T}\in{\mathbb{R}}^{(BN)\times(BN)}$. (We spell out the exact definition of $\mathbf{T}$ later in Section C.1, as it is not important for our purposes here; it follows by simply writing $\widehat{\bm{x}}_{i}=1-\langle\widetilde{\bm{x}},\bm{1}\rangle$ and $\widehat{\bm{y}}_{j}=1-\langle\widetilde{\bm{y}},\bm{1}\rangle$.) The point of transformation (5) is that, by eliminating one of the redundant variables, there is a convenient characterization of the equilibrium (C.3); namely, $\mathbf{Q}\widetilde{\bm{y}}^{\star}=\bm{c}$ and $\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star}=\bm{b}$.
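To make the elimination concrete, the following sketch instantiates one candidate for $(\mathbf{Q},\bm{b},\bm{c},d)$, obtained by substituting $\widehat{\bm{x}}_{i}=1-\langle\widetilde{\bm{x}},\bm{1}\rangle$ and $\widehat{\bm{y}}_{j}=1-\langle\widetilde{\bm{y}},\bm{1}\rangle$, and checks identity (5) numerically on random simplex points. The exact transformation $\mathbf{T}$ of Section C.1 may be parameterized differently; this is only our reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N = 4, 4                                  # |B| = |N|; stands in for the restriction A_{B,N}
A = rng.standard_normal((B, N))
i, j = 0, 0                                  # eliminated coordinates (smallest indices)

# Candidate transformation: Q_kl = A_kl - A_il - A_kj + A_ij, c_k = A_ij - A_kj,
# b_l = A_ij - A_il, d = A_ij.
Q = A[1:, 1:] - A[i, 1:][None, :] - A[1:, j][:, None] + A[i, j]
c = A[i, j] - A[1:, j]
b = A[i, j] - A[i, 1:]
d = A[i, j]

# Check (5) on a random pair of simplex points.
x_hat = rng.dirichlet(np.ones(B))
y_hat = rng.dirichlet(np.ones(N))
x_t, y_t = x_hat[1:], y_hat[1:]
lhs = x_hat @ A @ y_hat
rhs = x_t @ Q @ y_t - x_t @ c - y_t @ b + d
print(abs(lhs - rhs))   # ~1e-16
```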

We are now ready to introduce the key quantities on which our characterization relies. It turns out that those are analogous to the ones considered by Spielman and Teng (2003) in the context of analyzing the termination of linear programs; this is not coincidental, as our analysis was deliberately designed to parallel theirs.

Definition 3.3.

Let $\mathbf{A}$ be the payoff matrix of a non-degenerate game, $(\bm{x}^{\star},\bm{y}^{\star})\in{\mathcal{Z}}$ the unique equilibrium, and $B\subseteq[n],N\subseteq[m]$ the support of $\bm{x}^{\star}$ and $\bm{y}^{\star}$ respectively. We introduce the following quantities.

  1. $\alpha_{P}(\mathbf{A})\coloneqq\min_{i\in B}\bm{x}^{\star}_{i}$ and $\alpha_{D}(\mathbf{A})\coloneqq\min_{j\in N}\bm{y}^{\star}_{j}$;

  2. $\beta_{P}(\mathbf{A})\coloneqq\min_{j\in\bar{N}}(v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle)$ and $\beta_{D}(\mathbf{A})\coloneqq\min_{i\in\bar{B}}(\langle\mathbf{A}_{i,N},\bm{y}^{\star}_{N}\rangle-v)$; and

  3. $\gamma_{P}(\mathbf{A})\coloneqq\min_{j\in\widetilde{N}}\mathsf{dist}(\mathbf{Q}_{:,j},\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j}))$ and $\gamma_{D}(\mathbf{A})\coloneqq\min_{i\in\widetilde{B}}\mathsf{dist}(\mathbf{Q}_{i,:},\mathsf{span}(\mathbf{Q}_{\widetilde{B}-i,:}))$, where we use the shorthand notation $\widetilde{B}-i\coloneqq\widetilde{B}\setminus\{i\}$ (and $\widetilde{N}-j\coloneqq\widetilde{N}\setminus\{j\}$), and $\mathbf{Q}=\mathbf{Q}(\mathbf{A})$ is defined in (5).

(Above, we adopt the convention that if a minimization problem is with respect to an empty set, the minimum is to be evaluated as $1$.)
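For intuition, the quantities of Definition 3.3 can be evaluated numerically once an (approximate) equilibrium is available. The sketch below is an illustration only: it assumes the candidate $\mathbf{Q}$ from the earlier sketch following (5), uses our own tolerance choices, and computes $\gamma_{P}$ via least-squares residuals.

```python
import numpy as np

def def_3_3_quantities(A, x_star, y_star, v, tol=1e-9):
    """Evaluate alpha_P, alpha_D, beta_P, beta_D, gamma_P of Definition 3.3 (illustrative sketch)."""
    B = np.flatnonzero(x_star > tol)                      # support of x*
    N = np.flatnonzero(y_star > tol)                      # support of y*
    B_bar = np.setdiff1d(np.arange(A.shape[0]), B)
    N_bar = np.setdiff1d(np.arange(A.shape[1]), N)

    alpha_P, alpha_D = x_star[B].min(), y_star[N].min()
    beta_P = (v - A[np.ix_(B, N_bar)].T @ x_star[B]).min() if len(N_bar) else 1.0
    beta_D = (A[np.ix_(B_bar, N)] @ y_star[N] - v).min() if len(B_bar) else 1.0

    # Candidate Q (cf. the sketch following (5)), eliminating the first supported coordinates.
    i, j = B[0], N[0]
    Bt, Nt = B[1:], N[1:]
    Q = A[np.ix_(Bt, Nt)] - A[i, Nt][None, :] - A[np.ix_(Bt, [j])] + A[i, j]

    def dist_to_span(M, col):
        rest = np.delete(M, col, axis=1)
        resid = M[:, col] - rest @ np.linalg.lstsq(rest, M[:, col], rcond=None)[0]
        return np.linalg.norm(resid)

    # Empty-set convention of Definition 3.3, applied crudely to tiny supports.
    gamma_P = min(dist_to_span(Q, k) for k in range(Q.shape[1])) if Q.shape[1] > 1 else 1.0
    return alpha_P, alpha_D, beta_P, beta_D, gamma_P
```

Combined with the LP sketch following Definition 3.2, this gives a crude way to probe the bound of Theorem 3.6 on small instances.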

Item 3 above will enable us to control the norm of solutions to any linear system induced by $\mathbf{Q}$, as we explain in the sequel. Our proof will actually rely on a slightly different matrix, which we call $\bar{\mathbf{Q}}$; the lemma below relates the geometry of $\bar{\mathbf{Q}}$ to $\mathbf{Q}$, and reassures us that the condition number of $\bar{\mathbf{Q}}$ cannot be far from that of $\mathbf{Q}$ so long as $1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}\geq\alpha_{D}(\mathbf{A})$ (by Item 1) is not too close to $0$. (A symmetric statement holds when focusing on Player $y$.)

Lemma 3.4.

Let $\bm{c}=\mathbf{Q}\widetilde{\bm{y}}^{\star}=\sum_{j\in\widetilde{N}}\widetilde{\bm{y}}^{\star}_{j}\mathbf{Q}_{:,j}$, and suppose that $\bar{\mathbf{Q}}\in{\mathbb{R}}^{\widetilde{B}\times\widetilde{N}}$ is such that its $j$th column is equal to $\mathbf{Q}_{:,j}-\bm{c}$. Then,

\min_{j\in\widetilde{N}}\mathsf{dist}(\mathbf{Q}_{:,j},\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j}))\leq\left(1+\frac{|\widetilde{N}|}{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}}\right)\min_{j\in\widetilde{N}}\mathsf{dist}(\bar{\mathbf{Q}}_{:,j},\mathsf{span}(\bar{\mathbf{Q}}_{:,\widetilde{N}-j})).

Next, we recall a fairly standard bound relating the magnitude of a solution to a linear system $\widetilde{\bm{x}}=\mathbf{M}\bm{p}$ with the smallest singular value of a full-rank matrix $\mathbf{M}$.

Lemma 3.5.

Let $\mathbf{M}\in{\mathbb{R}}^{d\times d}$ be a full-rank matrix. For any $\widetilde{\bm{x}}\in{\mathbb{R}}^{d}$ there is $\bm{p}\in{\mathbb{R}}^{d}$ with $\|\bm{p}\|\leq\frac{1}{\sigma_{\mathrm{min}}(\mathbf{M})}\|\widetilde{\bm{x}}\|$ such that

\widetilde{\bm{x}}=\mathbf{M}\bm{p}=\sum_{j=1}^{d}\bm{p}_{j}\mathbf{M}_{:,j}.
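For completeness, the standard one-line argument behind Lemma 3.5: since $\mathbf{M}$ is full rank, one may take $\bm{p}\coloneqq\mathbf{M}^{-1}\widetilde{\bm{x}}$, whence

\|\bm{p}\|=\|\mathbf{M}^{-1}\widetilde{\bm{x}}\|\leq\|\mathbf{M}^{-1}\|\,\|\widetilde{\bm{x}}\|=\frac{1}{\sigma_{\mathrm{min}}(\mathbf{M})}\|\widetilde{\bm{x}}\|.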

Moreover, to connect Lemma 3.5 with $\gamma_{P}(\mathbf{A})$, we observe that the smallest singular value can also be lower bounded in terms of the smallest distance of a column from the linear space spanned by the rest of the columns—which now matches the expression of Item 3 we saw earlier. In particular, we will make use of the so-called negative second moment identity (Tao et al., 2010) (Proposition C.4), which implies that

\sigma_{\mathrm{min}}(\bar{\mathbf{Q}})\geq\sqrt{\frac{1}{\sum_{j\in\widetilde{N}}\mathsf{dist}^{-2}(\bar{\mathbf{Q}}_{:,j},\mathsf{span}(\bar{\mathbf{Q}}_{:,\widetilde{N}-j}))}}\geq\frac{1}{\sqrt{|\widetilde{N}|}}\min_{j\in\widetilde{N}}\mathsf{dist}(\bar{\mathbf{Q}}_{:,j},\mathsf{span}(\bar{\mathbf{Q}}_{:,\widetilde{N}-j})). (6)

Proposition C.4 also implies that $\gamma_{D}(\mathbf{A})\geq\frac{1}{\sqrt{|\widetilde{B}|}}\gamma_{P}(\mathbf{A})$, and so it will suffice to lower bound $\gamma_{P}(\mathbf{A})$ in the sequel. We are now ready to proceed with the main result of this subsection. Below, we use the notation “$\gtrsim$” to suppress lower-order terms and absolute constants.

Theorem 3.6.

Let $\mathbf{A}$ be a non-degenerate payoff matrix, and suppose that $(\alpha_{P}(\mathbf{A}),\alpha_{D}(\mathbf{A}))$, $(\beta_{P}(\mathbf{A}),\beta_{D}(\mathbf{A}))$ and $(\gamma_{P}(\mathbf{A}),\gamma_{D}(\mathbf{A}))$ are as in Definition 3.3. Then, the error bound (Definition 1.3) is satisfied for any sufficiently small modulus

\kappa\gtrsim\frac{1}{\|\mathbf{A}^{\flat}\|_{\infty}}\frac{1}{\min(n,m)^{3}}\min\left\{(\alpha_{D}(\mathbf{A}))^{2}\beta_{D}(\mathbf{A})\gamma_{P}(\mathbf{A}),(\alpha_{P}(\mathbf{A}))^{2}\beta_{P}(\mathbf{A})\gamma_{D}(\mathbf{A})\right\}.

It is enough to explain how to lower bound $\kappa>0$ such that $\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v\geq\kappa\|\bm{x}-\Pi_{\mathcal{X}^{\star}}(\bm{x})\|=\kappa\|\bm{x}-\bm{x}^{\star}\|$ for any $\bm{x}\in{\mathcal{X}}$. In a nutshell, our argument is divided based on the magnitude $\lambda\coloneqq\|\bm{x}_{B}\|$, which can be thought of as a measure of closeness from the support of the equilibrium. When $\lambda\ll 1$, which means that $\bm{x}$ is still far from the support of the equilibrium, $\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v$ is governed by $\beta_{D}(\mathbf{A})$. In the contrary case, our basic strategy revolves around showing that the error bound can be treated as in the unconstrained case, which would then relate the modulus $\kappa$ to the smallest singular value of the underlying matrix (essentially by Lemma 3.5)—and subsequently to $\gamma_{P}(\mathbf{A})$ due to (6). Indeed, this turns out to be possible by working with the matrix $\bar{\mathbf{Q}}$, as defined earlier in Lemma 3.4. We defer the precise argument to Section C.1.

3.3 Smoothed analysis

Having established Theorem 3.6, our next step is to show that each of the quantities introduced in Definition 3.3 is unlikely to be too close to $0$ in the smoothed complexity model, which would then imply Theorem 1.4. The main difficulty lies in the fact that each configuration that may arise depends on the support of the equilibrium, which in turn depends on the underlying randomization of $\mathbf{A}$, thereby significantly complicating the underlying distribution. Further, one cannot afford to argue about each configuration separately and then apply the union bound as there are too many possible configurations. To tackle this challenge, we follow the approach put forward by Spielman and Teng (2003).

In particular, given that all quantities of interest in Theorem 3.6 depend on the support of the equilibrium, it is natural to proceed by partitioning the probability space over all possible supports, and then bound the worst possible one—that is, the one maximizing the probability we want to minimize. In doing so, the challenge is that one has to condition on the equilibrium having a given support (formally justified by Proposition C.5). To argue about the induced probability density function upon such a conditioning, it is convenient to perform a change of variables from $\mathbf{A}$ to a new set of variables that now contains the equilibrium $(\bm{x}^{\star},\bm{y}^{\star})$ (Lemma C.6). The basic idea here is that since the event we condition on concerns the equilibrium, it is helpful to have that equilibrium be part of our set of variables. The induced probability density function is now quite complicated, but can still be analyzed using the following lemma.

Lemma 3.7 (Spielman and Teng, 2003).

Let $\rho$ be the probability density function of a random variable $X$. If there exist $\delta>0$ and $c\in(0,1]$ such that

0\leq t\leq t^{\prime}\leq\delta\implies\frac{\rho(t^{\prime})}{\rho(t)}\geq c, (7)

then

\operatorname{\mathbb{P}}[X\leq\epsilon\mid X\geq 0]\leq\frac{\epsilon}{c\delta}.

In words, random variables whose density is smooth—in the sense of (7)—are unlikely to be too close to 0. Gaussian random variables certainly have that property (Lemma C.8), but it is not confined to the Gaussian law; the analysis of Spielman and Teng (2003)—and subsequently our result—is not tailored to the Gaussian case.
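As a toy illustration (our own example): a Gaussian with mean $\mu>0$ has a density that is non-decreasing on $[0,\mu]$, so (7) holds with $c=1$ and $\delta=\mu$, and Lemma 3.7 predicts $\operatorname{\mathbb{P}}[X\leq\epsilon\mid X\geq 0]\leq\epsilon/\mu$. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, eps = 0.5, 1.0, 0.01
samples = mu + sigma * rng.standard_normal(2_000_000)
nonneg = samples[samples >= 0]
empirical = np.mean(nonneg <= eps)
print(empirical, eps / mu)   # empirical conditional probability vs. the bound eps/(c*delta)
```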

We are now ready to state our main results in the smoothed complexity model; the proofs are deferred to Section C.2. We commence with $\beta_{P}(\mathbf{A})$, which is the easiest to analyze. In particular, the following result is a consequence of an anti-concentration bound with respect to a conditional Gaussian random variable (Lemma C.7).

Proposition 3.8.

Let $\beta_{P}(\mathbf{A})$ be defined as in Item 2. For any $\epsilon\geq 0$,

\underset{\mathbf{A}}{\operatorname{\mathbb{P}}}\left[\beta_{P}(\mathbf{A})\leq\frac{\epsilon}{5\|\mathbf{A}^{\flat}\|_{\infty}}\right]\leq\epsilon\frac{e\min(n,m)^{2}}{\sigma^{2}}.

The analysis of $\gamma_{P}(\mathbf{A})$ is more challenging, and makes crucial use of Lemma 3.7. As we alluded to earlier, a key step is to change variables from $\mathbf{A}_{B,N}$ to $(\mathbf{Q},\bm{b},\bm{c},\cdot)$—in accordance with (5)—and then to $(\mathbf{Q},\bm{x}^{\star},\bm{y}^{\star},\cdot)$ based on $\mathbf{Q}\widetilde{\bm{y}}^{\star}=\bm{c}$, $\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star}=\bm{b}$. It is important to note that $\mathbf{Q}$ no longer contains independent random variables even though $\mathbf{A}_{B,N}$ does (by Definition 1.1); this stems from the presence of a redundant variable in $\bm{x}^{\star}_{B}$ (since $\langle\bm{x}^{\star}_{B},\bm{1}\rangle=1$). Nevertheless, we can still overcome this issue using Lemma 3.7, leading to the following bound.

Proposition 3.9.

Let $\gamma_{P}(\mathbf{A})$ be defined as in Item 3. For any $\epsilon\geq 0$,

\underset{\mathbf{A}}{\operatorname{\mathbb{P}}}\left[\gamma_{P}(\mathbf{A})\leq\frac{\epsilon}{4\max_{j\in\widetilde{N}}\|\mathbf{Q}_{:,j}\|+20\|\mathbf{A}^{\flat}\|_{\infty}+3}\right]\leq\epsilon\frac{4e\min(n,m)^{3}}{\sigma^{2}}.

Similar reasoning, albeit with some further complications, provides a bound for $\alpha_{P}(\mathbf{A})$, which is given below.

Proposition 3.10.

Let $\alpha_{P}(\mathbf{A})$ be defined as in Item 1. For any $\epsilon\geq 0$,

\underset{\mathbf{A}}{\operatorname{\mathbb{P}}}\left[\alpha_{P}(\mathbf{A})\leq\frac{\epsilon}{25(\|\mathbf{A}^{\flat}\|_{\infty}+1)^{2}}\right]\leq\epsilon\frac{8e^{2}mn\min(n,m)}{\sigma^{2}}.

Armed with Propositions 3.10, 3.8 and 3.9 and Theorem 3.6, we can establish Theorem 1.2 by suitably leveraging existing results, as we formalize in Section C.3.

4 Parameterized results for perturbation-stable games

Another important implication of our characterization in Theorem 3.6 is that it enables connecting the convergence rate of gradient-based algorithms to natural and interpretable game-theoretic quantities. In particular, here we highlight a connection with perturbation-stable games, in the following formal sense.

Definition 4.1 (Perturbation-stable games).

Let $\mathbf{A}$ be the payoff matrix of a non-degenerate game. We say that the game is $\delta$-support-stable, with $\delta>0$, if for any $\mathbf{A}^{\prime}$ with $\|\mathbf{A}-\mathbf{A}^{\prime}\|\leq\delta$ it holds that $\mathbf{A}^{\prime}$ is a non-degenerate game whose equilibrium has the same support as $\mathbf{A}$.

Perhaps the simplest example of a support-stable game with a favorable parameter $\delta>0$ arises when $\mathbf{A}$ is the $2\times 2$ identity matrix. Indeed, as long as the perturbation parameter $\delta$ remains below a certain absolute constant, the perturbed game still admits a unique full-support equilibrium. To see this, suppose for the sake of contradiction that the perturbed game has an equilibrium such that Player $x$ plays one of the two actions with probability $1$. Player $y$ would then obtain a utility of at least $1-O(\delta)$. But the value of the original game was $1/2$, which in turn implies that the value of the perturbed game is $1/2\pm\Theta(\delta)$; for a sufficiently small $\delta$ this leads to a contradiction. Similar reasoning applies with respect to Player $y$. (The previous argument carries over more broadly to diagonally dominant $2\times 2$ payoff matrices.)
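A quick numerical check of this example (a sketch; it uses the standard closed form for the fully-mixed equilibrium of a $2\times 2$ zero-sum game, which is not part of the text above):

```python
import numpy as np

def mixed_equilibrium_2x2(A):
    """Fully-mixed equilibrium candidate of a 2x2 zero-sum game (x minimizes, y maximizes)."""
    den = A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]
    x = np.array([A[1, 1] - A[1, 0], A[0, 0] - A[0, 1]]) / den
    y = np.array([A[1, 1] - A[0, 1], A[0, 0] - A[1, 0]]) / den
    return x, y

rng = np.random.default_rng(0)
delta = 0.05
for _ in range(1000):
    E = rng.standard_normal((2, 2))
    E *= delta / np.linalg.norm(E, 2)          # spectral norm exactly delta
    A = np.eye(2) + E
    x, y = mixed_equilibrium_2x2(A)
    gap = np.max(A.T @ x) - np.min(A @ y)      # duality gap of the candidate
    assert x.min() > 0.35 and y.min() > 0.35 and abs(gap) < 1e-12
print("all perturbed games kept a full-support equilibrium")
```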

As we have highlighted already, games with perturbation-stable equilibria—albeit under different notions of stability—have already received a lot of attention in the literature (Balcan and Braverman, 2017; Awasthi et al., 2010) (cf. Cohen (1986)), and are part of a broader trend in the analysis of algorithms beyond the worst case (for further background, we refer to the excellent book edited by Roughgarden (2021)). Our goal here is to make the following natural connection.

Theorem 4.2.

Any $\delta$-support-stable game (per Definition 4.1) satisfies the error bound for any sufficiently small modulus

\kappa\geq\poly\left(\frac{1}{n},\frac{1}{m},\delta\right).

By virtue of our discussion in Section C.3, Theorem 4.2 immediately implies Corollary 1.6. Indeed, we observe that all parameters involved in Theorem 3.6 can be lower bounded in terms of the stability parameter of Definition 4.1, as we formalize in Section C.4.

5 Conclusions and future research

In conclusion, we performed the first smoothed analysis with respect to a number of well-studied gradient-based algorithms in zero-sum games. In particular, we showed that OGDA, EGDA and IterSmooth all enjoy polynomial smoothed complexity, meaning that their iteration complexity grows as a polynomial in the dimensions of the game, $1/\sigma$, and $\mathsf{log}(1/\epsilon)$; for OMWU, our analysis reveals a significant improvement over the worst-case bound due to Wei et al. (2021), but it still remains superpolynomial. We also made a connection between the rate of convergence of the above algorithms and a natural perturbation-stability property of the equilibrium, which is interesting beyond the model of smoothed complexity.

A number of interesting avenues for future research remain open. First, is it the case that OMWU has polynomial smoothed complexity or is there an inherent separation with the other algorithms we studied? Answering this question in the positive would necessitate significantly improving the worst-case analysis of OMWU due to Wei et al. (2021) (cf. Cai et al. (2024) for a recent development concerning the last-iterate convergence of OMWU). Beyond OMWU, our results could also prove useful for establishing polynomial bounds for other natural dynamics in the smoothed analysis framework. Moreover, our characterization of the error bound in Theorem 3.6 assumes that the game is non-degenerate. This is an innocuous assumption in the smoothed complexity model, as it holds with probability $1$, but nevertheless it would be interesting to generalize it to any game. Doing so could shed some light on whether Theorem 4.2 holds with respect to other, perhaps more natural notions of perturbation stability beyond Definition 4.1. It would also be interesting to investigate other models of smoothed complexity that account for dependencies between the entries of the payoff matrix (Bhaskara et al., 2024). Moreover, our focus has been on zero-sum games under simplex constraints, but we suspect that more general positive results should be attainable under polyhedral constraint sets; perhaps the most notable such candidate is the class of extensive-form games (Romanovskii, 1962; von Stengel, 1996). Even beyond (two-player) zero-sum games, Theorem 1.2 could apply to (multi-player) polymatrix zero-sum games (Cai et al., 2016). It is less clear whether the model of smoothed complexity can be informative when it comes to convergence to coarse correlated equilibria in multi-player games.

Acknowledgments

We are grateful to the anonymous reviewers at NeurIPS for their helpful feedback. The first author is indebted to Ioannis Panageas for many insightful discussions. This material is based on work supported by the Vannevar Bush Faculty Fellowship ONR N00014-23-1-2876, National Science Foundation grants RI-2312342 and RI-1901403, ARO award W911NF2210266, and NIH award A240108S001.

References

  • Abe et al. (2023) Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Kentaro Toyoshima, and Atsushi Iwasaki. Last-iterate convergence with full and noisy feedback in two-player zero-sum games. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
  • Adler (2013) Ilan Adler. The equivalence of linear programs and zero-sum games. Int. J. Game Theory, 42(1):165–177, 2013.
  • Alacaoglu and Malitsky (2022) Ahmet Alacaoglu and Yura Malitsky. Stochastic variance reduction for variational inequality methods. In Conference on Learning Theory (COLT), 2022.
  • Antonakopoulos et al. (2021) Kimon Antonakopoulos, Elena Veronica Belmega, and Panayotis Mertikopoulos. Adaptive extra-gradient methods for min-max optimization and games. In International Conference on Learning Representations (ICLR), 2021.
  • Applegate et al. (2023) David L. Applegate, Oliver Hinder, Haihao Lu, and Miles Lubin. Faster first-order primal-dual methods for linear programming using restarts and sharpness. Mathematical Programming, 201(1):133–184, 2023.
  • Awasthi et al. (2010) Pranjal Awasthi, Maria-Florina Balcan, Avrim Blum, Or Sheffet, and Santosh S. Vempala. On nash-equilibria of approximation-stable games. In International Symposium on Algorithmic Game Theory (SAGT), 2010.
  • Azizian et al. (2020) Waïss Azizian, Damien Scieur, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. Accelerating smooth games by manipulating spectral shapes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
  • Balcan and Braverman (2017) Maria-Florina Balcan and Mark Braverman. Nash equilibria in perturbation-stable games. Theory Comput., 13(1):1–31, 2017.
  • Bhaskara et al. (2024) Aditya Bhaskara, Eric Evert, Vaidehi Srinivas, and Aravindan Vijayaraghavan. New tools for smoothed analysis: Least singular value bounds for random matrices with dependent entries. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2024.
  • Bläser and Manthey (2015) Markus Bläser and Bodo Manthey. Smoothed complexity theory. ACM Trans. Comput. Theory, 7(2):6:1–6:21, 2015.
  • Blum and Dunagan (2002) Avrim Blum and John Dunagan. Smoothed analysis of the perceptron algorithm for linear programming. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.
  • Boodaghians et al. (2020) Shant Boodaghians, Joshua Brakensiek, Samuel B. Hopkins, and Aviad Rubinstein. Smoothed complexity of 2-player nash equilibria. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2020.
  • Bowling et al. (2015) Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, 2015.
  • Brooks and Reny (2023) Benjamin Brooks and Philip J. Reny. A canonical game—75 years in the making—showing the equivalence of matrix games and linear programming. Economic Theory Bulletin, 2023.
  • Brown and Sandholm (2018) Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
  • Brown and Sandholm (2019) Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In Conference on Artificial Intelligence (AAAI), 2019.
  • Buriol et al. (2011) Luciana S. Buriol, Marcus Ritt, Félix Carvalho Rodrigues, and Guido Schäfer. On the smoothed price of anarchy of the traffic assignment problem. In Workshop on Algorithmic Approaches for Transportation Modeling, Optimization, and Systems (ATMOS), 2011.
  • Cai et al. (2016) Yang Cai, Ozan Candogan, Constantinos Daskalakis, and Christos H. Papadimitriou. Zero-sum polymatrix games: A generalization of minmax. Mathematics of Operations Research, 41(2):648–655, 2016.
  • Cai et al. (2022) Yang Cai, Argyris Oikonomou, and Weiqiang Zheng. Finite-time last-iterate convergence for learning in multi-player games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • Cai et al. (2024) Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and Weiqiang Zheng. Fast last-iterate convergence of learning in games requires forgetful algorithms. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
  • Carmon et al. (2019) Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Carmon et al. (2020) Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Coordinate methods for matrix games. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2020.
  • Carmon et al. (2024) Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. A whole new ball game: A primal accelerated method for matrix games and minimizing the maximum of smooth functions. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2024.
  • Chen et al. (2009) Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player Nash equilibria. Journal of the ACM, 2009.
  • Chen et al. (2020) Xi Chen, Chenghao Guo, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Mihalis Yannakakis, and Xinzhi Zhang. Smoothed complexity of local max-cut and binary max-csp. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2020.
  • Chen et al. (2024) Xi Chen, Chenghao Guo, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Mihalis Yannakakis. Smoothed complexity of SWAP in local graph partitioning. Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2024.
  • Chiang et al. (2012) Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory (COLT), 2012.
  • Christ and Yannakakis (2023) Miranda Christ and Mihalis Yannakakis. The smoothed complexity of policy iteration for markov decision processes. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2023.
  • Clarkson et al. (2012) Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning. Journal of the ACM, 59(5):23:1–23:49, 2012.
  • Cohen (1986) Joel E. Cohen. Perturbation theory of completely mixed matrix games. Linear Algebra and its Applications, 79:153–162, 1986.
  • Cohen et al. (2017) Johanne Cohen, Amélie Héliou, and Panayotis Mertikopoulos. Hedging under uncertainty: Regret minimization meets exponentially fast convergence. In International Symposium on Algorithmic Game Theory (SAGT), 2017.
  • Cohen et al. (2021) Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. Journal of the ACM, 68(1):3:1–3:39, 2021.
  • Cunha et al. (2022) Leonardo Cunha, Gauthier Gidel, Fabian Pedregosa, Damien Scieur, and Courtney Paquette. Only tails matter: Average-case universality and robustness in the convex regime. In International Conference on Machine Learning (ICML), 2022.
  • Dantzig (1951) George Dantzig. A proof of the equivalence of the programming problem and the game problem. In Tjalling Koopmans, editor, Activity Analysis of Production and Allocation, pages 330–335. John Wiley & Sons, 1951.
  • Daskalakis and Panageas (2019) Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In Innovations in Theoretical Computer Science Conference (ITCS), 2019.
  • Daskalakis et al. (2015) Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327–348, 2015.
  • Daskalakis et al. (2024) Constantinos Daskalakis, Noah Golowich, Nika Haghtalab, and Abhishek Shetty. Smooth Nash equilibria: Algorithms and complexity. In Innovations in Theoretical Computer Science Conference (ITCS), 2024.
  • Dontchev and Rockafellar (2009) Asen L Dontchev and R Tyrrell Rockafellar. Implicit functions and solution mappings: A view from variational analysis, volume 616. Springer, 2009.
  • Dunagan et al. (2011) John Dunagan, Daniel A. Spielman, and Shang-Hua Teng. Smoothed analysis of condition numbers and complexity implications for linear programming. Mathematical Programming, 126(2):315–350, 2011.
  • Edelman (1993) Alan Edelman. Eigenvalue roulette and random test matrices. Linear Algebra for Large Scale and Real-Time Applications, pages 365–368, 1993.
  • Etessami and Yannakakis (2007) Kousha Etessami and Mihalis Yannakakis. On the complexity of Nash equilibria and other fixed points (extended abstract). In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2007.
  • Farina and Sandholm (2022) Gabriele Farina and Tuomas Sandholm. Fast payoff matrix sparsification techniques for structured extensive-form games. In Conference on Artificial Intelligence (AAAI), 2022.
  • Farina et al. (2021) Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Faster game solving via predictive blackwell approachability: Connecting regret matching and mirror descent. In Conference on Artificial Intelligence (AAAI), 2021.
  • Fercoq (2023) Olivier Fercoq. Quadratic error bound of the smoothed gap and the restarted averaged primal-dual hybrid gradient, 2023.
  • Gatti et al. (2013) Nicola Gatti, Marco Rocco, and Tuomas Sandholm. Strong Nash equilibrium is in smoothed P. In Conference on Artificial Intelligence (AAAI), 2013. Late-breaking paper track.
  • Giannakopoulos (2023) Yiannis Giannakopoulos. A smoothed FPTAS for equilibria in congestion games. CoRR, abs/2306.10600, 2023.
  • Giannakopoulos et al. (2022) Yiannis Giannakopoulos, Alexander Grosz, and Themistoklis Melissourgos. On the smoothed complexity of combinatorial local search. CoRR, abs/2211.07547, 2022.
  • Giannou et al. (2021) Angeliki Giannou, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Panayotis Mertikopoulos. On the rate of convergence of regularized learning in games: From bandits and uncertainty to optimism and beyond. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2021.
  • Gilpin et al. (2012) Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with 𝒪(ln(1/ϵ))\mathcal{O}(\mathrm{ln}(1/\epsilon)) convergence for ϵ\epsilon-equilibrium in two-person zero-sum games. Mathematical Programming, 133(1–2):279–298, 2012.
  • Golowich et al. (2020a) Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2020a.
  • Golowich et al. (2020b) Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman E. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. In Conference on Learning Theory (COLT), 2020b.
  • Gorbunov et al. (2022) Eduard Gorbunov, Adrien Taylor, and Gauthier Gidel. Last-iterate convergence of optimistic gradient method for monotone variational inequalities. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • Grigoriadis and Khachiyan (1995) Michael D. Grigoriadis and Leonid G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995.
  • Haghtalab et al. (2022) Nika Haghtalab, Michael I. Jordan, and Eric Zhao. On-demand sampling: Learning optimally from multiple distributions. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • Haghtalab et al. (2023) Nika Haghtalab, Michael I. Jordan, and Eric Zhao. A unifying perspective on multi-calibration: Unleashing game dynamics for multi-objective learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2023.
  • Hsieh et al. (2019) Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the convergence of single-call stochastic extra-gradient methods. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Huiberts et al. (2023) Sophie Huiberts, Yin Tat Lee, and Xinzhi Zhang. Upper and lower bounds on the smoothed complexity of the simplex method. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2023.
  • Korpelevich (1976) Galina M Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
  • Lee et al. (2021) Chung-Wei Lee, Christian Kroer, and Haipeng Luo. Last-iterate convergence in extensive-form games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2021.
  • Li et al. (2023) Chris Junchi Li, Huizhuo Yuan, Gauthier Gidel, Quanquan Gu, and Michael I. Jordan. Nesterov meets optimism: Rate-optimal separable minimax optimization. In International Conference on Machine Learning (ICML), 2023.
  • Mahdavinia et al. (2022) Pouria Mahdavinia, Yuyang Deng, Haochuan Li, and Mehrdad Mahdavi. Tight analysis of extra-gradient and optimistic gradient methods for nonconvex minimax problems. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • Maiti et al. (2023) Arnab Maiti, Kevin G. Jamieson, and Lillian J. Ratliff. Instance-dependent sample complexity bounds for zero-sum matrix games. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
  • Makarychev and Makarychev (2021) Konstantin Makarychev and Yury Makarychev. Perturbation Resilience, page 95–119. Cambridge University Press, 2021.
  • Mertikopoulos et al. (2018) Panayotis Mertikopoulos, Christos H. Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018.
  • Mertikopoulos et al. (2019) Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations (ICLR), 2019.
  • Mokhtari et al. (2020) Aryan Mokhtari, Asuman E. Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
  • Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
  • Nesterov (2005) Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103, 2005.
  • Paquette et al. (2023) Courtney Paquette, Bart van Merriënboer, Elliot Paquette, and Fabian Pedregosa. Halting time is predictable for large models: A universality property and average-case analysis. Foundations of Computational Mathematics, 23(2):597–673, 2023.
  • Perolat et al. (2022) Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, Shayegan Omidshafiei, Edward Lockhart, Laurent Sifre, Nathalie Beauguerlange, Remi Munos, David Silver, Satinder Singh, Demis Hassabis, and Karl Tuyls. Mastering the game of stratego with model-free multiagent reinforcement learning. Science, 378(6623):990–996, 2022.
  • Popov (1980) L.D. Popov. A modification to the Arrow-Hurwicz method for search of saddle-points. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845–848, 1980.
  • Rakhlin and Sridharan (2013) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory (COLT), pages 993–1019, 2013.
  • Renegar (1995) James Renegar. Incorporating condition measures into the complexity theory of linear programming. SIAM Journal on Optimization, 5(3):506–524, 1995.
  • Renegar (1994) James Renegar. Some perturbation theory for linear programming. Mathematical Programming, 65:73–91, 1994.
  • Rockafellar (2015) Ralph Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.
  • Romanovskii (1962) I. Romanovskii. Reduction of a game with complete memory to a matrix game. Soviet Mathematics, 3, 1962.
  • Roughgarden (2021) Tim Roughgarden. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2021.
  • Rubinstein (2016) Aviad Rubinstein. Settling the complexity of computing approximate two-player Nash equilibria. In Irit Dinur, editor, Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), 2016.
  • Scieur and Pedregosa (2020) Damien Scieur and Fabian Pedregosa. Universal average-case optimality of polyak momentum. In International Conference on Machine Learning (ICML), 2020.
  • Song et al. (2023) Zhuoqing Song, Jason D. Lee, and Zhuoran Yang. Can we find nash equilibria at a linear rate in markov games? In International Conference on Learning Representations (ICLR), 2023.
  • Spielman and Teng (2003) Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of termination of linear programming algorithms. Mathematical Programming, 97(1–2):375–404, 2003.
  • Spielman and Teng (2004) Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.
  • Spielman and Teng (2009) Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis: an attempt to explain the behavior of algorithms in practice. Communications of the ACM, 52(10):76–84, 2009.
  • Syrgkanis et al. (2015) Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, 2015.
  • Tang et al. (2023) Xiaohang Tang, Le Cong Dinh, Stephen Marcus McAleer, and Yaodong Yang. Regret-minimizing double oracle for extensive-form games. In International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, 2023.
  • Tao (2023) Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Society, 2023.
  • Tao et al. (2010) Terence Tao, Van Vu, and Manjunath Krishnapur. Random matrices: Universality of ESDs and the circular law. The Annals of Probability, 38(5):2023 – 2065, 2010.
  • Tseng (1995) Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1):237–252, 1995.
  • van Damme (1991) Eric van Damme. Stability and perfection of Nash equilibria, volume 339. Springer, 1991.
  • van den Brand et al. (2021) Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Minimum cost flows, mdps, and 1\ell_{1}-regression in nearly linear time for dense instances. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2021.
  • Vankov et al. (2023) Daniil Vankov, Angelia Nedić, and Lalitha Sankar. Last iterate convergence of Popov method for non-monotone stochastic variational inequalities, 2023.
  • von Neumann (1928) John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295–320, 1928.
  • von Neumann and Morgenstern (1947) John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1947.
  • von Stengel (1996) Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.
  • von Stengel (2023) Bernhard von Stengel. Zero-sum games and linear programming duality. Mathematics of Operations Research, 2023.
  • Wei et al. (2021) Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations (ICLR), 2021.
  • Ye (2011) Yinyu Ye. Interior point algorithms: theory and analysis. John Wiley & Sons, 2011.
  • Zarifis et al. (2024) Nikos Zarifis, Puqian Wang, Ilias Diakonikolas, and Jelena Diakonikolas. Robustly learning single-index models via alignment sharpness. In International Conference on Machine Learning (ICML), 2024.
  • Zhang and Sandholm (2020) Brian Hu Zhang and Tuomas Sandholm. Sparsified linear programming for zero-sum equilibrium finding. In International Conference on Machine Learning (ICML), 2020.
  • Zinkevich et al. (2007) Martin Zinkevich, Michael Bowling, Michael Johanson, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.

Appendix A Further related work

Besides the pioneering work of Spielman and Teng (2004), which revolved around the simplex algorithm, other prominent algorithms for solving linear programs have also been investigated through the lens of smoothed complexity. Blum and Dunagan (2002) showed that perceptron, a popular algorithm in machine learning, also enjoys a polynomial smoothed complexity (with high probability) for solving linear programming feasibility problems, which can also capture general linear programs via a binary search procedure. Further, Dunagan et al. (2011) performed a smoothed analysis of interior-point methods by relying on an earlier characterization due to Renegar (1995).

Beyond linear programming and (two-player) zero-sum games, there has been considerable interest in understanding the smoothed complexity of Nash equilibria in general-sum games, but the outlook that has emerged from this endeavor is rather bleak (Chen et al., 2009; Boodaghians et al., 2020; Rubinstein, 2016). On a more positive note, Daskalakis et al. (2024) recently considered a more permissive solution concept they refer to as a smooth Nash equilibrium; the basic idea of their relaxation is that instead of considering best-response deviations, they restrict to deviations that do not assign too much probability mass to any pure strategy, as controlled by a certain parameter. For a certain regime of that parameter, they obtained positive results, bypassing the intractability of the usual Nash equilibrium. Considering smooth Nash equilibria could also be fruitful in the context of zero-sum games. In particular, we surmise that, if one is content with convergence to smooth Nash equilibria, the error bound could exhibit more favorable properties. Smoothed analysis has also been applied to more structured classes of games, such as congestion or potential games (Giannakopoulos, 2023; Giannakopoulos et al., 2022; Chen et al., 2020), as well as other important problems in game theory (Gatti et al., 2013; Buriol et al., 2011). Other notable developments in a broader context were covered in an older survey by Spielman and Teng (2009); for more recent developments, we point to, for example, Christ and Yannakakis (2023); Chen et al. (2024); Huiberts et al. (2023), and the many references therein.

Moreover, average-case analysis has also been a popular topic in the optimization literature (Cunha et al., 2022; Paquette et al., 2023; Scieur and Pedregosa, 2020), and so it is worth relating our results to that line of work. In particular, let us focus on the recent work of Cunha et al. (2022). First, that paper targets a certain class of convex quadratic problems, whereas we examine zero-sum games. They also operate under a different perturbation model, deriving a parametrization based on the concentration of the eigenvalues of a certain matrix. Further, without strong convexity, Cunha et al. (2022) establish a complexity scaling with \poly(1/ϵ)\poly(1/\epsilon), while here we target the 𝗅𝗈𝗀(1/ϵ)\mathsf{log}(1/\epsilon) regime. We finally remark that the techniques employed are also quite different. In particular, Cunha et al. (2022, Problem 2.1) posit that the optimal solution does not depend on the underlying randomization. In contrast, as we have already highlighted, the fact that the equilibrium is a function of the randomization constitutes the main technical crux in our setting. At the same time, Cunha et al. (2022) encountered several challenges not present in our setting, so overall those results are complementary.

Beyond smoothed complexity, understanding the last-iterate convergence of gradient-based methods such as OGDA and EGDA has received tremendous interest in the literature, especially in recent years; e.g., (Golowich et al., 2020a; Cai et al., 2022; Gorbunov et al., 2022; Vankov et al., 2023; Golowich et al., 2020b; Mahdavinia et al., 2022; Antonakopoulos et al., 2021; Mertikopoulos et al., 2019; Abe et al., 2023). It is worth noting that linear convergence has also been documented for the more challenging class of extensive-form games (Lee et al., 2021), as well as Markov games (Song et al., 2023). Nevertheless, there are lower bounds precluding linear convergence beyond affine variational inequalities (Golowich et al., 2020a; Wei et al., 2021). We also refer to the works of Cohen et al. (2017) and Giannou et al. (2021) for further characterizations of the convergence rate of no-regret dynamics in multi-player games.

Contrary to the above line of work, which focuses on last-iterate convergence, the most common approach to solving zero-sum games revolves around regret minimization whereby optimality guarantees concern the average strategies. Learning in such settings has been a popular research topic as it captures many central and natural problems; two notable recent applications are learning from multiple distributions (Haghtalab et al., 2022) and multi-calibration (Haghtalab et al., 2023). Yet, there are at least three limitations of the no-regret framework worth highlighting here. The first one, which has been stressed extensively already, is that the number of iterations must grow at least as Ω(1/ϵ)\Omega(1/\epsilon) when one insists on taking (uniform) averages (Daskalakis et al., 2015). The second and more nuanced caveat is that the no-regret framework does not provide instance-based guarantees based on natural game-theoretic parameters of the problem (see, for example, the discussion of Maiti et al. (2023)). Building on earlier work (Wei et al., 2021; Tseng, 1995), some of our results here attempt to address this shortcoming by coming up with a more interpretable parameterization of the iteration complexity of algorithms such as OGDA. The final limitation is that, convergence to the set of equilibria notwithstanding, no-regret guarantees provide no information regarding properties of the equilibrium reached. Although not an issue in non-degenerate zero-sum games, equilibrium selection still remains a central problem. Earlier results (Wei et al., 2021; Tseng, 1995) provide an interesting characterization for the last iterate of OGDA and EGDA by showing that the limit point is the projection of the initial point to the set of equilibria.

Finally, it is worth pointing out the best available theoretical guarantees for solving zero-sum games. Assuming that each entry of 𝐀\mathbf{A} has absolute value bounded by 11, (1) can be solved in O~(max{n,m}ω)\tilde{O}(\max\{n,m\}^{\omega}) (Cohen et al., 2021) or O~(nm+min{n,m}5/2)\tilde{O}(nm+\min\{n,m\}^{5/2}) (van den Brand et al., 2021). Here, ω\omega is the exponent of matrix multiplication and O~\tilde{O} suppresses polylogarithmic factors in nn and mm. The complexity we obtain for algorithms such as OGDA is not competitive even though we work in the more benign smoothed complexity model; we reiterate that we did not attempt to optimize the polynomial factors in terms of nn and mm, and those can almost certainly be improved. On the other hand, there are two main aspects in which algorithms such as OGDA are more appealing in terms of their scalability: the per-iteration complexity and the memory requirements. An algorithm such as OGDA requires a single matrix-vector product in each iteration, which can be implemented in linear time for sparse matrices, and has a limited memory footprint. In contrast, implementing interior-point methods in large games can be prohibitive.
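To make the per-iteration scalability point concrete, here is a minimal sketch (our own illustration; scipy's sparse matrices are simply a convenient stand-in for a structured payoff matrix) of the single sparse matrix-vector product that an OGDA step requires, apart from the simplex projections.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A large but very sparse payoff matrix: storage and per-iteration work scale
# with the number of nonzero entries rather than with n * m.
A = sparse.random(10_000, 10_000, density=1e-4, format="csr", random_state=0)
y = np.full(10_000, 1e-4)          # a strategy on the simplex (uniform here)

grad_x = A @ y                     # the matrix-vector product behind one OGDA step
print(A.nnz, grad_x.shape)
```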

Appendix B Preliminaries

In this section, we introduce some further background on smoothed complexity and define the algorithms cited earlier (Items 1, 2, 3 and 4).

Further notation

For a random variable XX, we denote by 𝔼[X]{\mathbb{E}}[X] its expectation and by 𝕍[X]\mathbb{V}[X] its variance, under the assumption that both are finite. For a sequence of random variables X1,,XdX_{1},\dots,X_{d} and scalars α1,,αd\alpha_{1},\dots,\alpha_{d}\in{\mathbb{R}}, linearity of expectation yields that 𝔼[α1X1++αdXd]=α1𝔼[X1]++αd𝔼[Xd]{\mathbb{E}}[\alpha_{1}X_{1}+\dots+\alpha_{d}X_{d}]=\alpha_{1}{\mathbb{E}}[X_{1}]+\dots+\alpha_{d}{\mathbb{E}}[X_{d}]. Assuming independence, it also holds that 𝕍[α1X1++αdXd]=(α1)2𝕍[X1]++(αd)2𝕍[Xd]\mathbb{V}[\alpha_{1}X_{1}+\dots+\alpha_{d}X_{d}]=(\alpha_{1})^{2}\mathbb{V}[X_{1}]+\dots+(\alpha_{d})^{2}\mathbb{V}[X_{d}]. We will also use the fact that a linear combination of independent Gaussian random variables is also Gaussian. More broadly, linear combinations can be understood through a convolution in the space of probability density functions, which means that smoothness (in the sense of Lemma C.7) is preserved in a certain regime.
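As a quick numerical sanity check of the two identities above, the following minimal sketch (the coefficients, variances, and sample size are illustrative choices of ours) compares the empirical variance of a linear combination of independent Gaussians against the closed-form expression.

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.5, -1.3, 2.0])
sigmas = np.array([1.0, 0.7, 2.5])   # standard deviations of X_1, X_2, X_3

# Independent samples of (X_1, X_2, X_3) and their linear combination.
samples = rng.normal(0.0, sigmas, size=(1_000_000, 3))
combo = samples @ alphas

print(combo.mean())                                # ~ 0, by linearity of expectation
print(combo.var(), np.sum(alphas**2 * sigmas**2))  # both ~ 26.08, by independence
```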

B.1 Smoothed complexity

To fully specify Definition 1.1, we first recall that a (univariate) Gaussian random variable with zero mean and variance σ2\sigma^{2} admits a probability density function of the form

μ:t1σ2πexp(t22σ2).\mu:t\mapsto\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{t^{2}}{2\sigma^{2}}\right).

The law of such a Gaussian random variable will be denoted by 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}). In the original work of Spielman and Teng (2004), smoothed complexity was defined as the expected running time (or some other cost function) of some algorithm over the perturbed input. More precisely, let 𝒜{\mathcal{A}} be an algorithm whose inputs can be expressed as vectors in d{\mathbb{R}}^{d}, and let T𝒜()T_{{\mathcal{A}}}({\mathcal{I}}) be the running time of algorithm 𝒜{\mathcal{A}} on input d{\mathcal{I}}\in{\mathbb{R}}^{d}. Then, the smoothed complexity of 𝒜{\mathcal{A}} is

𝒞𝒜(d,σ)maxd𝔼𝒈𝒩(𝟎d,σ2𝐈d×d)[T𝒜(+𝒈)].\mathcal{C}_{{\mathcal{A}}}(d,\sigma)\coloneqq\max_{{\mathcal{I}}\in{\mathbb{R}}^{d}}{\mathbb{E}}_{\bm{g}\sim\mathcal{N}(\bm{0}_{d},\sigma^{2}\mathbf{I}_{d\times d})}[T_{{\mathcal{A}}}({\mathcal{I}}+\|{\mathcal{I}}\|\bm{g})].

As pointed out by Spielman and Teng (2003), one does not need to limit smoothed analysis to measure the expected running time, and high probability guarantees are also quite natural; see, for example, the smoothed analysis of the perceptron algorithm due to Blum and Dunagan (2002). Our main result also provides a guarantee with high probability; it is not clear whether the expected running time can also be bounded by \poly(n,m,1/σ)\poly(n,m,1/\sigma), which is left for future work.
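To make the perturbation model concrete, here is a minimal sketch (our own illustration, not part of the paper) of drawing a smoothed instance of a payoff matrix: the adversarial input is perturbed entrywise by independent Gaussians whose scale is relative to the magnitude of the instance, mirroring the term $\mathcal{I}+\|\mathcal{I}\|\bm{g}$ inside the expectation above; treating the matrix as a vector in ${\mathbb{R}}^{nm}$, the norm below is the Frobenius norm.

```python
import numpy as np

def smoothed_instance(A, sigma, rng):
    """Return A + ||A|| * G with G having i.i.d. N(0, sigma^2) entries."""
    G = rng.normal(0.0, sigma, size=A.shape)
    return A + np.linalg.norm(A) * G

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(4, 3))             # an arbitrary "worst-case" instance
A_pert = smoothed_instance(A, sigma=0.01, rng=rng)  # the instance the algorithm actually sees
```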

B.2 Algorithms

Next, we specify the algorithms we consider in this work.

Optimistic gradient descent/ascent

Originally proposed by Popov (1980), optimistic gradient descent/ascent (OGDA)—and other variants thereof (Hsieh et al., 2019)—has been recently revived in the online learning literature, commencing from the pioneering works of Rakhlin and Sridharan (2013) and Chiang et al. (2012). If we denote for compactness $F(\bm{z})\coloneqq(\mathbf{A}\bm{y},-\mathbf{A}^{\top}\bm{x})$, OGDA can be expressed as follows for $t\in{\mathbb{N}}=\{1,2,\dots\}$.

\displaystyle\bm{z}^{(t)}\coloneqq\Pi_{{\mathcal{Z}}}\left(\widehat{\bm{z}}^{(t)}-\eta F(\bm{z}^{(t-1)})\right),
\displaystyle\widehat{\bm{z}}^{(t+1)}\coloneqq\Pi_{{\mathcal{Z}}}\left(\widehat{\bm{z}}^{(t)}-\eta F(\bm{z}^{(t)})\right).

Here, $\eta>0$ is the learning rate; $\Pi_{{\mathcal{Z}}}(\cdot)$ denotes the (Euclidean) projection operator onto the set ${\mathcal{Z}}\coloneqq{\mathcal{X}}\times{\mathcal{Y}}$; and $\bm{z}^{(0)}=\widehat{\bm{z}}^{(1)}\in{\mathcal{Z}}$ is the initialization. That is, players simultaneously update their strategies through optimistic gradient steps. Given that ${\mathcal{X}}$ and ${\mathcal{Y}}$ are probability simplexes, each projection can be computed exactly in nearly linear time. The key reference point for OGDA in affine variational inequalities is the work of Wei et al. (2021), who established linear convergence using the notion of metric subregularity (Definition C.9), which is strongly related to Definition 1.3; we discuss their approach later in Section C.3.
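For concreteness, here is a minimal numpy sketch of the OGDA recursion above on a random matrix game; the sorting-based simplex projection and the fixed step size are our own illustrative choices rather than a tuned implementation.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex (sorting-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def ogda(A, eta=0.1, T=2000):
    n, m = A.shape
    x = np.ones(n) / n                     # z^(0): uniform initialization
    y = np.ones(m) / m
    xh, yh = x.copy(), y.copy()            # zhat^(1) = z^(0)
    for _ in range(T):
        # z^(t) = Proj(zhat^(t) - eta * F(z^(t-1))), with F(z) = (A y, -A^T x)
        x_new = proj_simplex(xh - eta * (A @ y))
        y_new = proj_simplex(yh + eta * (A.T @ x))
        # zhat^(t+1) = Proj(zhat^(t) - eta * F(z^(t)))
        xh = proj_simplex(xh - eta * (A @ y_new))
        yh = proj_simplex(yh + eta * (A.T @ x_new))
        x, y = x_new, y_new
    return x, y

A = np.random.default_rng(0).normal(size=(5, 4))
x, y = ogda(A)
print((A.T @ x).max() - (A @ y).min())     # duality gap of the last iterate
```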

Optimistic multiplicative weights update

Deriving from the same class of online learning algorithms as OGDA, optimistic multiplicative weights (OMWU) is the incarnation of optimistic mirror descent with an entropic regularizer, namely

𝒙(t)𝒙(t1)exp(2η𝐀𝒚(t1)+η𝐀𝒚(t2)),\displaystyle\bm{x}^{(t)}\propto\bm{x}^{(t-1)}\circ\exp\left(-2\eta\mathbf{A}\bm{y}^{(t-1)}+\eta\mathbf{A}\bm{y}^{(t-2)}\right),
𝒚(t)𝒚(t1)exp(2η𝐀𝒙(t1)η𝐀𝒙(t2))\displaystyle\bm{y}^{(t)}\propto\bm{y}^{(t-1)}\circ\exp\left(2\eta\mathbf{A}^{\top}\bm{x}^{(t-1)}-\eta\mathbf{A}^{\top}\bm{x}^{(t-2)}\right)

for $t\in{\mathbb{N}}$. (OMWU is oftentimes expressed via the (optimistic) mirror descent viewpoint, but the form we provide here is easily seen to be equivalent.) Above, $\circ$ denotes the component-wise product; the exponential mapping $\exp(\cdot)$ is also to be applied component-wise; and $\bm{z}^{(-1)}\coloneqq\bm{z}^{(0)}\coloneqq(\frac{1}{n}\bm{1}_{n},\frac{1}{m}\bm{1}_{m})$. Daskalakis and Panageas (2019) first proved that OMWU exhibits asymptotic (last-iterate) convergence, and Wei et al. (2021) later established linear convergence.
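The multiplicative form above translates directly into code. The following minimal sketch (step size and horizon are illustrative choices of ours) keeps the previous pair of gradient vectors around and renormalizes after the component-wise exponential.

```python
import numpy as np

def omwu(A, eta=0.1, T=2000):
    n, m = A.shape
    x = np.ones(n) / n                       # z^(0) = z^(-1): uniform
    y = np.ones(m) / m
    gx_prev, gy_prev = A @ y, A.T @ x        # gradients at the (virtual) iterate t-2
    for _ in range(T):
        gx, gy = A @ y, A.T @ x              # gradients at iterate t-1
        x = x * np.exp(-2.0 * eta * gx + eta * gx_prev)   # minimizer: descent direction
        y = y * np.exp(2.0 * eta * gy - eta * gy_prev)    # maximizer: ascent direction
        x /= x.sum()                         # the proportionality in the update
        y /= y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y
```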

Remark B.1.

It is important to note here that the exponential map of OMWU can produce iterates whose representation requires an arbitrarily large number of bits. Nevertheless, it is not hard to show that the analysis of Wei et al. (2021) carries over when the iterates are truncated to a certain number of the most significant bits, and so we will not dwell further on this issue here.

Extra-gradient descent/ascent

The extra-gradient method of Korpelevich (1976) is quite similar to OGDA, namely

\displaystyle\widehat{\bm{z}}^{(t)}\coloneqq\Pi_{{\mathcal{Z}}}\left(\bm{z}^{(t)}-\eta F(\bm{z}^{(t)})\right),
\displaystyle\bm{z}^{(t+1)}\coloneqq\Pi_{{\mathcal{Z}}}\left(\bm{z}^{(t)}-\eta F(\widehat{\bm{z}}^{(t)})\right)

for $t\in{\mathbb{N}}$. One caveat is that, unlike OGDA, it requires two gradient evaluations per iteration $t$. EGDA is also less suited for use in an online environment: it requires more feedback than what is provided in the online learning setting, and in fact, even legitimate variants of EGDA can still incur substantial regret (Golowich et al., 2020a). Tseng (1995) first established that EGDA exhibits linear convergence for problems such as (1), discussed further in Section C.3.
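A matching sketch of the extra-gradient recursion is given below (again with an untuned step size); it reuses proj_simplex from the OGDA sketch above, and the two gradient evaluations per iteration are visible as the two pairs of matrix-vector products.

```python
import numpy as np

# Assumes proj_simplex(...) from the OGDA sketch above is in scope.

def egda(A, eta=0.1, T=2000):
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m
    for _ in range(T):
        # Extrapolation step: zhat^(t) = Proj(z^(t) - eta * F(z^(t)))
        xh = proj_simplex(x - eta * (A @ y))
        yh = proj_simplex(y + eta * (A.T @ x))
        # Update step: z^(t+1) = Proj(z^(t) - eta * F(zhat^(t)))
        x = proj_simplex(x - eta * (A @ yh))
        y = proj_simplex(y + eta * (A.T @ xh))
    return x, y
```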

Iterative smoothing

This algorithm is a refinement of Nesterov's classical smoothing technique (Nesterov, 2005) due to Gilpin et al. (2012). Let us first recall the vanilla version of Nesterov's method, which we refer to as $\texttt{Smoothing}(\mathbf{A},\bm{z}^{(0)},\epsilon)$:

  1. 1.

    Initialize ηϵD𝒵\eta\coloneqq\frac{\epsilon}{D_{{\mathcal{Z}}}} and 𝒛^(0)𝒛(0)\widehat{\bm{z}}^{(0)}\coloneqq\bm{z}^{(0)}, where D𝒵D_{{\mathcal{Z}}} is the 2\ell_{2} diameter of 𝒵{\mathcal{Z}}.

  2. 2.

    For t=0,1,t=0,1,\dots

    1. (a)

      𝒖(t)22+t𝒛^(t)+tt+2𝒛(t)\bm{u}^{(t)}\coloneqq\frac{2}{2+t}\widehat{\bm{z}}^{(t)}+\frac{t}{t+2}\bm{z}^{(t)}.

    2. (b)
      𝒛(t+1)argmin𝒛𝒵{Fη(𝒖(t)),𝒛𝒖(t)+L22η𝒛𝒖(t)2},\bm{z}^{(t+1)}\coloneqq\arg\min_{\bm{z}\in{\mathcal{Z}}}\left\{\langle\nabla F_{\eta}(\bm{u}^{(t)}),\bm{z}-\bm{u}^{(t)}\rangle+\frac{L^{2}}{2\eta}\|\bm{z}-\bm{u}^{(t)}\|^{2}\right\},

      where Fη(𝒛)max𝒛^𝒵{F(𝒛),𝒛𝒛^η2𝒛𝒛^2}F_{\eta}(\bm{z})\coloneqq\max_{\widehat{\bm{z}}\in{\mathcal{Z}}}\{\langle F(\bm{z}),\bm{z}-\widehat{\bm{z}}\rangle-\frac{\eta}{2}\|\bm{z}-\widehat{\bm{z}}\|^{2}\} and LL is a suitable matrix norm.

    3. (c)

      If Φ(𝒛(t+1))<ϵ\Phi(\bm{z}^{(t+1)})<\epsilon, return.

    4. (d)
      𝒛^(t+1)argmin𝒛^𝒵{τ=0tτ+12Fη(𝒖(τ)),𝒛^𝒖(τ)+L22η𝒛^𝒛(0)2}.\widehat{\bm{z}}^{(t+1)}\coloneqq\arg\min_{\widehat{\bm{z}}\in{\mathcal{Z}}}\left\{\sum_{\tau=0}^{t}\frac{\tau+1}{2}\langle\nabla F_{\eta}(\bm{u}^{(\tau)}),\widehat{\bm{z}}-\bm{u}^{(\tau)}\rangle+\frac{L^{2}}{2\eta}\|\widehat{\bm{z}}-\bm{z}^{(0)}\|^{2}\right\}.

In this context, $\texttt{IterSmooth}(\mathbf{A},\bm{z}^{(0)},\rho,\epsilon)$ is a simple refinement of Smoothing, which nonetheless attains linear convergence (Gilpin et al., 2012); a minimal code sketch of its outer loop is given after the listing below.

  1. 1.

Let $\epsilon^{(0)}=\Phi(\bm{z}^{(0)})$, the duality gap of the initial point.

  2. 2.

    For t=0,1,t=0,1,\dots

    1. (a)

      ϵ(t+1)ϵ(t)ρ\epsilon^{(t+1)}\coloneqq\frac{\epsilon^{(t)}}{\rho}.

    2. (b)

      𝒛(t+1)Smoothing(𝐀,𝒛(t),ϵ(t+1))\bm{z}^{(t+1)}\coloneqq\texttt{Smoothing}(\mathbf{A},\bm{z}^{(t)},\epsilon^{(t+1)}).

    3. (c)

      If Φ(𝒛(t+1))<ϵ\Phi(\bm{z}^{(t+1)})<\epsilon, return.
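At the level of control flow, the outer loop of IterSmooth only divides the target accuracy by $\rho$ and restarts Smoothing from the current iterate. The sketch below conveys exactly that and nothing more: smoothing and duality_gap are placeholders standing in for $\texttt{Smoothing}(\mathbf{A},\bm{z},\epsilon)$ and the duality gap $\Phi(\cdot)$, and $\epsilon^{(0)}$ is initialized to the gap of the starting point as described above.

```python
def iter_smooth(A, z0, rho, eps, smoothing, duality_gap):
    """Outer loop of IterSmooth; `smoothing` and `duality_gap` are supplied callables."""
    z = z0
    eps_t = duality_gap(A, z0)      # epsilon^(0): the duality gap of the initial point
    while True:
        eps_t = eps_t / rho         # epsilon^(t+1) = epsilon^(t) / rho
        z = smoothing(A, z, eps_t)  # warm-start Smoothing from the current iterate
        if duality_gap(A, z) < eps:
            return z
```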

Appendix C Omitted proofs

We dedicate this section to the proofs omitted earlier from the main body.

C.1 Proofs from Section 3.2

We first point out that degenerate games have measure zero (cf. Spielman and Teng (2003, Proposition 5.1)).

Lemma C.1.

For a Gaussian distributed payoff matrix 𝐀\mathbf{A} per Definition 1.1, the game is non-degenerate (Definition 3.2) with probability 11 (almost surely).

Indeed, the set of games with a non-unique equilibrium has measure zero (van Damme, 1991, Theorem 3.5.1). Regarding the characterization in terms of the number of tight inequalities of the corresponding (primal and dual) linear programs, gathered in Definition 3.2, we note that if $n+1$ of the inequalities were tight at $\bm{x}^{\star}$, that would induce a feasible linear system of $n$ equalities (by eliminating $v$) in $n-1$ variables (by eliminating one of the redundant variables); such degeneracies have measure zero, and there are only finitely many possible such degeneracies, leading to Lemma C.1. As a result, in the smoothed complexity model, we can safely assume that the game is non-degenerate.

Now, as we alluded to earlier, establishing Definition 1.3 reduces to showing that for any points 𝒙𝒳\bm{x}\in{\mathcal{X}} and 𝒚𝒴\bm{y}\in{\mathcal{Y}},

max𝒚𝒴𝒙,𝐀𝒚vκ𝒙Π𝒳(𝒙)=κ𝒙𝒙,\displaystyle\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v\geq\kappa\|\bm{x}-\Pi_{\mathcal{X}^{\star}}(\bm{x})\|=\kappa\|\bm{x}-\bm{x}^{\star}\|, (8)
vmin𝒙𝒳𝒙,𝐀𝒚κ𝒚Π𝒴(𝒚)=κ𝒚𝒚.\displaystyle v-\min_{\bm{x}^{\prime}\in{\mathcal{X}}}\langle\bm{x}^{\prime},\mathbf{A}\bm{y}\rangle\geq\kappa\|\bm{y}-\Pi_{\mathcal{Y}^{\star}}(\bm{y})\|=\kappa\|\bm{y}-\bm{y}^{\star}\|. (9)

(Definition 1.3 then indeed follows from the obvious fact 𝒙𝒙+𝒚𝒚𝒛𝒛\|\bm{x}-\bm{x}^{\star}\|+\|\bm{y}-\bm{y}^{\star}\|\geq\|\bm{z}-\bm{z}^{\star}\|.) Accordingly, our proof of Theorem 3.6 below will focus on lower bounding κ\kappa so that (8) holds, and (9) can then be treated similarly.

Before we proceed, let us make some observations regarding transformation (5) we saw earlier. First, one can understand the transformation 𝐀B,N=𝐓(𝐐,𝒃,𝒄,d)\mathbf{A}_{B,N}^{\flat}=\mathbf{T}(\mathbf{Q}^{\flat},\bm{b},\bm{c},d) through the equations

d=𝐀i,j;𝒃j=𝐀i,j+𝐀i,j;𝒄i=𝐀i,j+𝐀i,j; and 𝐐i,j=𝐀i,j𝐀i,j𝐀i,j+𝐀i,j\displaystyle d=\mathbf{A}_{i,j};\bm{b}_{j^{\prime}}=-\mathbf{A}_{i,j^{\prime}}+\mathbf{A}_{i,j};\bm{c}_{i^{\prime}}=-\mathbf{A}_{i^{\prime},j}+\mathbf{A}_{i,j};\text{ and }\mathbf{Q}_{i^{\prime},j^{\prime}}=\mathbf{A}_{i^{\prime},j^{\prime}}-\mathbf{A}_{i,j^{\prime}}-\mathbf{A}_{i^{\prime},j}+\mathbf{A}_{i,j} (10)

for all $(i^{\prime},j^{\prime})\in\widetilde{B}\times\widetilde{N}$. This can easily be derived from (5) by using the fact that $\widehat{\bm{x}}_{B}=(\widetilde{\bm{x}},1-\bm{1}^{\top}\widetilde{\bm{x}})$ and $\widehat{\bm{y}}_{N}=(\widetilde{\bm{y}},1-\bm{1}^{\top}\widetilde{\bm{y}})$. From (10), we see that there is a permutation of the rows of $\mathbf{T}$ under which the matrix becomes upper triangular, with every diagonal entry being either $1$ or $-1$. This implies that $|\det(\mathbf{T})|=1$. With a slight abuse of notation, we will write $\mathbf{T}_{i,j}$ (as opposed to $\mathbf{T}_{(i,j),:}$) to access the $(i,j)$ row of $\mathbf{T}$, so that $\mathbf{A}_{i,j}=\langle\mathbf{T}_{i,j},(\mathbf{Q}^{\flat},\bm{b},\bm{c},d)\rangle$. From (10), we also see that $\mathbf{T}_{i,j}$ contains at most $4$ non-zero entries. In turn, this implies that $\|\mathbf{T}_{i,j}\|\leq 2$ and $\|\mathbf{T}_{i,j}\|_{1}\leq 4$. We gather the above observations in the claim below, which will be used in the sequel.

Claim C.2.

For the (linear) transformation 𝐓(BN)×(BN)\mathbf{T}\in{\mathbb{R}}^{(BN)\times(BN)} given in (10), it holds that |det(𝐓)|=1|\det(\mathbf{T})|=1. Further, 𝐓i,j2\|\mathbf{T}_{i,j}\|\leq 2 and 𝐓i,j14\|\mathbf{T}_{i,j}\|_{1}\leq 4 for all (i,j)B×N(i,j)\in B\times N.
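As a quick sanity check on (10) and C.2, the sketch below (the pivot entry and index sets are arbitrary illustrative choices of ours) builds the transformed data $(\mathbf{Q},\bm{b},\bm{c},d)$ from a payoff matrix and verifies that every entry of $\mathbf{A}$ is recovered as a $\pm 1$ combination of at most four transformed quantities, the structural fact behind $|\det(\mathbf{T})|=1$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
i, j = 4, 3                       # pivot entry; last row/column, chosen for concreteness
Bt, Nt = range(4), range(3)       # stand-ins for the index sets excluding the pivot

# The transformed data of (10).
d = A[i, j]
b = np.array([-A[i, jp] + A[i, j] for jp in Nt])
c = np.array([-A[ip, j] + A[i, j] for ip in Bt])
Q = np.array([[A[ip, jp] - A[i, jp] - A[ip, j] + A[i, j] for jp in Nt] for ip in Bt])

# Each A[ip, jp] is a +/-1 combination of at most four transformed quantities.
recovered = np.array([[Q[ip, jp] - b[jp] - c[ip] + d for jp in Nt] for ip in Bt])
assert np.allclose(recovered, A[:4, :3])
```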

The point of transformation (5) is that, as we claimed earlier, the spectral properties of matrix 𝐐\mathbf{Q} (as opposed to 𝐀B,N\mathbf{A}_{B,N}, which is a natural candidate) suffice to capture the difficulty of addressing the second subproblem identified in Section 3.2. In addition, there is a straightforward but convenient characterization of the equilibrium (𝒙,𝒚)(\bm{x}^{\star},\bm{y}^{\star}) in terms of the transformed game in (5), as stated below.

Claim C.3.

It holds that 𝐐𝐲~=𝐜\mathbf{Q}\widetilde{\bm{y}}^{\star}=\bm{c} and 𝐐𝐱~=𝐛\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star}=\bm{b}.

Proof.

It is clear that the vector $\mathbf{Q}\widetilde{\bm{y}}^{\star}-\bm{c}$ must have the same value in every coordinate since $\widetilde{\bm{x}}^{\star}$ is fully supported and a best response (by assumption). If that common value were positive, then $\widetilde{\bm{x}}^{\star}$ would not be a best response since Player $x$ could profit from removing all the probability mass (which is possible since $\sum_{i\in\widetilde{B}}\bm{x}^{\star}_{i}>0$). If it were negative, Player $x$ would profit from increasing its probability mass (which is possible since $\sum_{i\in\widetilde{B}}\bm{x}^{\star}_{i}<1$). Similar reasoning yields $\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star}=\bm{b}$. ∎

Having made the above observations, we next establish some lemmas claimed earlier in Section 3.2 which will be used for the proof of Theorem 3.6. First, we give the proof of Lemma 3.4.

See 3.4

Proof.

Let $\widetilde{N}\ni j^{\prime}\in\arg\min_{j\in\widetilde{N}}\mathsf{dist}(\bar{\mathbf{Q}}_{:,j},\mathsf{span}(\bar{\mathbf{Q}}_{:,\widetilde{N}-j}))$ (recall that $\bar{\mathbf{Q}}$ denotes the matrix whose columns are $\bar{\mathbf{Q}}_{:,j}=\mathbf{Q}_{:,j}-\bm{c}$ for $j\in\widetilde{N}$). By definition, there is $\bm{\rho}\in{\mathbb{R}}^{\widetilde{N}-j^{\prime}}$ and $\bm{r}\in{\mathbb{R}}^{\widetilde{N}}$ with $\|\bm{r}\|=1$ such that

\bar{\mathbf{Q}}_{:,j^{\prime}}\coloneqq-\sum_{j\in\widetilde{N}-j^{\prime}}\widetilde{\bm{y}}^{\star}_{j}\mathbf{Q}_{:,j}+(1-\bm{y}^{\star}_{j^{\prime}})\mathbf{Q}_{:,j^{\prime}}=\sum_{j\in\widetilde{N}-j^{\prime}}\bm{\rho}_{j}(\mathbf{Q}_{:,j}-\bm{c})+\epsilon\bm{r},

where $\epsilon\coloneqq\min_{j\in\widetilde{N}}\mathsf{dist}(\bar{\mathbf{Q}}_{:,j},\mathsf{span}(\bar{\mathbf{Q}}_{:,\widetilde{N}-j}))$. Rearranging, we have

𝐐:,j(1𝒚j+𝒚jjN~j𝝆j)ϕj+jN~j𝐐:,j(𝒚j𝝆j+𝒚jj′′N~j𝝆j′′)ϕj=ϵ𝒓.\mathbf{Q}_{:,j^{\prime}}\overbrace{\left(1-\bm{y}^{\star}_{j^{\prime}}+\bm{y}^{\star}_{j^{\prime}}\sum_{j\in\widetilde{N}-j^{\prime}}\bm{\rho}_{j}\right)}^{\phi_{j^{\prime}}}+\sum_{j\in\widetilde{N}-j^{\prime}}\mathbf{Q}_{:,j}\overbrace{\left(-\bm{y}^{\star}_{j}-\bm{\rho}_{j}+\bm{y}^{\star}_{j}\sum_{j^{\prime\prime}\in\widetilde{N}-j^{\prime}}\bm{\rho}_{j^{\prime\prime}}\right)}^{\phi_{j}}=\epsilon\bm{r}. (11)

Now, let us suppose that all coefficients above are such that $|\phi_{j}|\leq\epsilon^{\prime}\coloneqq\frac{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}}{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}+|\widetilde{N}|}$ for all $j\in\widetilde{N}$. Then, $\sum_{j\in\widetilde{N}}\phi_{j}=\pm|\widetilde{N}|\epsilon^{\prime}$ since $|\sum_{j\in\widetilde{N}}\phi_{j}|\leq\sum_{j\in\widetilde{N}}|\phi_{j}|\leq\epsilon^{\prime}|\widetilde{N}|$, where for convenience we use the notation $\sum_{j\in\widetilde{N}}\phi_{j}=\pm|\widetilde{N}|\epsilon^{\prime}\iff-|\widetilde{N}|\epsilon^{\prime}\leq\sum_{j\in\widetilde{N}}\phi_{j}\leq|\widetilde{N}|\epsilon^{\prime}$. Thus, by definition of $\phi_{j}$,

(1jN~𝒚j)(jN~j𝝆j)=(1jN~𝒚j)±ϵ|N~|.\left(1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}\right)\left(\sum_{j\in\widetilde{N}-j^{\prime}}\bm{\rho}_{j}\right)=\left(1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}\right)\pm\epsilon^{\prime}|\widetilde{N}|.

Since 0<1jN~𝒚j0<1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}, we have

(jN~j𝝆j)=1±ϵ|N~|1jN~𝒚j.\left(\sum_{j\in\widetilde{N}-j^{\prime}}\bm{\rho}_{j}\right)=1\pm\epsilon^{\prime}\frac{|\widetilde{N}|}{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}}.

Thus,

ϕj=1𝒚j+𝒚jjN~j𝝆j=1±ϵ|N~|1jN~𝒚j>ϵ\phi_{j^{\prime}}=1-\bm{y}^{\star}_{j^{\prime}}+\bm{y}^{\star}_{j^{\prime}}\sum_{j\in\widetilde{N}-j^{\prime}}\bm{\rho}_{j}=1\pm\epsilon^{\prime}\frac{|\widetilde{N}|}{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}}>\epsilon^{\prime}

since ϵ1jN~𝒚j1jN~𝒚j+|N~|\epsilon^{\prime}\leq\frac{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}}{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}+|\widetilde{N}|}. The last displayed inequality contradicts our earlier assumption that |ϕj|ϵ|\phi_{j^{\prime}}|\leq\epsilon^{\prime}. As a result, we conclude that at least one coefficient ϕj\phi_{j} has an absolute value at least ϵ\epsilon^{\prime}. Dividing (11) by that coefficient, we get

\min_{j\in\widetilde{N}}\mathsf{dist}(\mathbf{Q}_{:,j},\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j}))\leq\frac{\epsilon}{\epsilon^{\prime}}\leq\left(1+\frac{|\widetilde{N}|}{1-\sum_{j\in\widetilde{N}}\bm{y}^{\star}_{j}}\right)\min_{j\in\widetilde{N}}\mathsf{dist}(\bar{\mathbf{Q}}_{:,j},\mathsf{span}(\bar{\mathbf{Q}}_{:,\widetilde{N}-j})).

This completes the proof. ∎

We continue with the proof of Lemma 3.5.

See 3.5

Proof.

Let $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}$ be a singular value decomposition (SVD) of $\mathbf{Q}$, where $\mathbf{U}$ and $\mathbf{V}$ are orthonormal. Then, given that $\mathbf{Q}$ is invertible (by assumption),

\bm{p}=\mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{U}^{\top}\widetilde{\bm{x}},

where $\mathbf{\Sigma}^{-1}=\mathsf{diag}(\sigma_{\mathrm{min}}^{-1},\dots,\sigma_{\mathrm{max}}^{-1})$. (Here, $\sigma_{\mathrm{max}}$ and $\sigma_{\mathrm{min}}$ are the maximum and minimum singular values of $\mathbf{Q}$, respectively.) Thus, $\|\bm{p}\|\leq\|\mathbf{V}\|\|\mathbf{\Sigma}^{-1}\|\|\mathbf{U}^{\top}\|\|\widetilde{\bm{x}}\|\leq\frac{1}{\sigma_{\mathrm{min}}(\mathbf{Q})}\|\widetilde{\bm{x}}\|$, where we used the fact that the spectral norm of any orthonormal matrix is $1$ and the spectral norm of any diagonal matrix is its maximum entry in absolute value. ∎
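The bound in the proof above is easy to confirm numerically; in the minimal sketch below, a random invertible matrix of our own choosing stands in for the matrix in the lemma, and the solution of the linear system is never longer than $\|\widetilde{\bm{x}}\|/\sigma_{\mathrm{min}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 6))       # invertible with probability 1
x = rng.normal(size=6)

p = np.linalg.solve(Q, x)         # the unique solution of Q p = x
sigma_min = np.linalg.svd(Q, compute_uv=False).min()
assert np.linalg.norm(p) <= np.linalg.norm(x) / sigma_min + 1e-9
```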

We next state the negative second moment identity, which relates the singular values of a matrix to a certain geometric property of its rows and columns (namely, Item 3); see also Tao (2023) for further background.

Proposition C.4 (Negative second moment identity (Tao et al., 2010)).

Let 𝐌d×d\mathbf{M}\in{\mathbb{R}}^{d\times d} be an invertible matrix. Then,

r=1d1σr2(𝐌)=r=1d1𝖽𝗂𝗌𝗍2(𝐌r,:,Hr,:)=r=1d1𝖽𝗂𝗌𝗍2(𝐌:,r,H:,r),\sum_{r=1}^{d}\frac{1}{\sigma^{2}_{r}(\mathbf{M})}=\sum_{r=1}^{d}\frac{1}{\mathsf{dist}^{2}(\mathbf{M}_{r,:},H_{-r,:})}=\sum_{r=1}^{d}\frac{1}{\mathsf{dist}^{2}(\mathbf{M}_{:,r},H_{:,-r})}, (12)

where Hr,:𝗌𝗉𝖺𝗇(𝐌1,:,,𝐌r1,:,𝐌r+1,:,,𝐌d,:)H_{-r,:}\coloneqq\mathsf{span}(\mathbf{M}_{1,:},\dots,\mathbf{M}_{r-1,:},\mathbf{M}_{r+1,:},\dots,\mathbf{M}_{d,:}).

One can readily prove this identity by equivalently expressing the negative second moment tr((𝐌1)𝐌1)\operatorname{tr}((\mathbf{M}^{-1})^{\top}\mathbf{M}^{-1}) as either r=1dσr2(𝐌1)=r=1dσr2(𝐌)\sum_{r=1}^{d}\sigma^{2}_{r}(\mathbf{M}^{-1})=\sum_{r=1}^{d}\sigma^{-2}_{r}(\mathbf{M}) or r=1d𝐌:,r12\sum_{r=1}^{d}\|\mathbf{M}^{-1}_{:,r}\|^{2}, leading to the first identity in (12). The second one follows from the fact that the singular values of 𝐌\mathbf{M}^{\top} coincide with the singular values of 𝐌\mathbf{M}.
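Proposition C.4 is likewise straightforward to verify numerically; the sketch below (the dimension is arbitrary) computes each row distance via a least-squares projection onto the span of the remaining rows and compares against the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.normal(size=(d, d))

lhs = np.sum(1.0 / np.linalg.svd(M, compute_uv=False) ** 2)

rhs = 0.0
for r in range(d):
    others = np.delete(M, r, axis=0)                          # rows spanning H_{-r,:}
    coeffs, *_ = np.linalg.lstsq(others.T, M[r], rcond=None)  # projection coefficients
    dist = np.linalg.norm(M[r] - others.T @ coeffs)           # distance to the span
    rhs += 1.0 / dist ** 2

assert np.isclose(lhs, rhs)
```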

We are now ready to prove Theorem 3.6, restated below.

See 3.6

Proof.

We lower bound κ\kappa so that (8) holds; bound (9) will then be treated in a symmetric fashion, and Definition 1.3 will follow.

Let us fix any point $\bm{x}\in{\mathcal{X}}$. We can write $\bm{x}$ as $\lambda\widehat{\bm{x}}_{B}+(1-\lambda)\widehat{\bm{x}}_{\bar{B}}$ for some $\lambda\in[0,1]$ such that $\widehat{\bm{x}}_{B}\in{\mathcal{X}}$ and all coordinates of $\widehat{\bm{x}}_{B}$ in $\bar{B}$ are zero, and $\widehat{\bm{x}}_{\bar{B}}\in{\mathcal{X}}$ and all coordinates of $\widehat{\bm{x}}_{\bar{B}}$ in $B$ are zero. For notational convenience, we define

P(\mathbf{A})\coloneqq\frac{1}{2|N|\sqrt{|B|}}\sigma_{\mathrm{min}}(\bar{\mathbf{Q}})\left(1+\frac{1}{\alpha_{D}(\mathbf{A})}\right)^{-1}. (13)

We consider the following two cases.

Case I:

λP(𝐀)𝒙^B𝒙B4(1λ)𝐀\lambda P(\mathbf{A})\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|\geq 4(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty}. If 𝒙^B=𝒙B\widehat{\bm{x}}_{B}=\bm{x}^{\star}_{B}, it follows that 𝒙=𝒙\bm{x}=\bm{x}^{\star} (since λ=1\lambda=1), and the conclusion trivially follows. We can thus assume that 𝒙^B𝒙B\widehat{\bm{x}}_{B}\neq\bm{x}^{\star}_{B}. In this case, it follows that B~\widetilde{B}\neq\emptyset, and we proceed as follows.

\displaystyle\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v \geq\lambda\max_{j\in N}\langle\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle+(1-\lambda)\left(\langle\bm{x}_{\bar{B}},\mathbf{A}_{\bar{B},j}\rangle-v\right) (14)
λmaxjN𝒙^B𝒙B,𝐀B,j2(1λ)𝐀,\displaystyle\geq\lambda\max_{j\in N}\langle\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle-2(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty}, (15)

where (14) follows from the definition $\bm{x}\coloneqq\lambda\widehat{\bm{x}}_{B}+(1-\lambda)\widehat{\bm{x}}_{\bar{B}}$ and the fact that $v=\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle$ for all $j\in N$; and (15) uses the definition of $\|\mathbf{A}^{\flat}\|_{\infty}$ to lower bound the second term in (14). Continuing from (15), we can use the transformation defined in (5) to get

maxjN𝒙^B𝒙B,𝐀B,j=maxjN𝒙~𝒙~,𝐐:,j𝒄,\max_{j\in N}\langle\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle=\max_{j\in N}\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},\mathbf{Q}_{:,j}-\bm{c}\rangle, (16)

where, with an abuse of notation, the convention above is that $\mathbf{Q}_{:,j}=\bm{0}$ if $j\notin\widetilde{N}$. For convenience, let us define $\chi_{j}\coloneqq\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},\mathbf{Q}_{:,j}-\bm{c}\rangle$ for all $j\in N$. Our goal is to lower bound $\max_{j\in N}\chi_{j}$. To that end, we first observe that, by the fact that $\mathbf{Q}\widetilde{\bm{y}}^{\star}=\bm{c}$ (Claim C.3),

0=𝒙~𝒙~,𝐐𝒚~𝒄\displaystyle 0=\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}-\bm{c}\rangle =jN~𝒚~j𝒙~𝒙~,𝐐:,j𝒙~𝒙~,𝒄\displaystyle=\sum_{j\in\widetilde{N}}\widetilde{\bm{y}}^{\star}_{j}\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},\mathbf{Q}_{:,j}\rangle-\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},\bm{c}\rangle
=jN~𝒚~j𝒙~𝒙~,𝐐:,j𝒄+(1jN~𝒚~j)𝒙~𝒙~,𝒄.\displaystyle=\sum_{j\in\widetilde{N}}\widetilde{\bm{y}}^{\star}_{j}\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},\mathbf{Q}_{:,j}-\bm{c}\rangle+\left(1-\sum_{j\in\widetilde{N}}\widetilde{\bm{y}}^{\star}_{j}\right)\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},-\bm{c}\rangle.

In other words,

jN𝒚jχj=0,\sum_{j\in N}\bm{y}^{\star}_{j}\chi_{j}=0,

which in turn implies that

jNmax(0,χj)jN𝒚jmax(0,χj)\displaystyle\sum_{j\in N}\max(0,\chi_{j})\geq\sum_{j\in N}\bm{y}^{\star}_{j}\max(0,\chi_{j}) =jN𝒚jmin(0,χj)\displaystyle=-\sum_{j\in N}\bm{y}^{\star}_{j}\min(0,\chi_{j})
αD(𝐀)jNmin(0,χj),\displaystyle\geq-\alpha_{D}(\mathbf{A})\sum_{j\in N}\min(0,\chi_{j}), (17)

where we made use of the obvious identity t=max(0,t)+min(0,t)t=\max(0,t)+\min(0,t) for all tt\in{\mathbb{R}}, as well as the definition of αD(𝐀)\alpha_{D}(\mathbf{A}) (Item 1). We let 𝒑N~\bm{p}\in{\mathbb{R}}^{\widetilde{N}} be the (unique) solution to the linear system

\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}=\bar{\mathbf{Q}}\bm{p}=\sum_{j\in\widetilde{N}}(\mathbf{Q}_{:,j}-\bm{c})\bm{p}_{j},

and $\bm{p}_{j}=0$ for $j\in N\setminus\widetilde{N}$. By Lemma 3.5, we know that $\|\bm{p}\|\leq(\sigma_{\mathrm{min}}(\bar{\mathbf{Q}}))^{-1}\|\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}\|$. Then, we have

jNχj𝒑j=jN~χj𝒑j=𝒙~𝒙~,jN~(𝐐:,j𝒄)𝒑j=𝒙~𝒙~2.\sum_{j\in N}\chi_{j}\bm{p}_{j}=\sum_{j\in\widetilde{N}}\chi_{j}\bm{p}_{j}=\left\langle\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star},\sum_{j\in\widetilde{N}}(\mathbf{Q}_{:,j}-\bm{c})\bm{p}_{j}\right\rangle=\|\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}\|^{2}. (18)

Moreover,

jNχj𝒑j\displaystyle\sum_{j\in N}\chi_{j}\bm{p}_{j} =jN𝒑jmax(0,χj)+jN𝒑jmin(0,χj)\displaystyle=\sum_{j\in N}\bm{p}_{j}\max(0,\chi_{j})+\sum_{j\in N}\bm{p}_{j}\min(0,\chi_{j})
jNmax(0,𝒑j)max(0,χj)+jNmin(0,𝒑j)min(0,χj)\displaystyle\leq\sum_{j\in N}\max(0,\bm{p}_{j})\max(0,\chi_{j})+\sum_{j\in N}\min(0,\bm{p}_{j})\min(0,\chi_{j}) (19)
𝒑jNmax(0,χj)𝒑jNmin(0,χj)\displaystyle\leq\|\bm{p}\|_{\infty}\sum_{j\in N}\max(0,\chi_{j})-\|\bm{p}\|_{\infty}\sum_{j\in N}\min(0,\chi_{j}) (20)
𝒑(1+1αD(𝐀))jNmax(0,χj)\displaystyle\leq\|\bm{p}\|_{\infty}\left(1+\frac{1}{\alpha_{D}(\mathbf{A})}\right)\sum_{j\in N}\max(0,\chi_{j}) (21)
\displaystyle\leq\frac{1}{\sigma_{\mathrm{min}}(\bar{\mathbf{Q}})}\left(1+\frac{1}{\alpha_{D}(\mathbf{A})}\right)|N|\max_{j\in N}\chi_{j}\|\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}\|, (22)

where (19) follows from the fact that $\bm{p}_{j}\max(0,\chi_{j})\leq\max(0,\bm{p}_{j})\max(0,\chi_{j})$ (by nonnegativity of $\max(0,\chi_{j})$) and $\bm{p}_{j}\min(0,\chi_{j})\leq\min(0,\bm{p}_{j})\min(0,\chi_{j})$ (by nonpositivity of $\min(0,\chi_{j})$); (20) uses that $\min(0,\bm{p}_{j})\geq-|\bm{p}_{j}|\geq-\|\bm{p}\|_{\infty}$, which gives $\min(0,\bm{p}_{j})\min(0,\chi_{j})\leq-\|\bm{p}\|_{\infty}\min(0,\chi_{j})$; (21) follows from (17); and (22) uses the bound $\|\bm{p}\|_{2}\leq(\sigma_{\mathrm{min}}(\bar{\mathbf{Q}}))^{-1}\|\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}\|$ (Lemma 3.5). Combining (18) and (22),

\displaystyle\max_{j\in N}\langle\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle \geq\frac{1}{|N|}\sigma_{\mathrm{min}}(\bar{\mathbf{Q}})\left(1+\frac{1}{\alpha_{D}(\mathbf{A})}\right)^{-1}\|\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}\| (23)
\displaystyle\geq\frac{1}{2|N|\sqrt{|B|}}\sigma_{\mathrm{min}}(\bar{\mathbf{Q}})\left(1+\frac{1}{\alpha_{D}(\mathbf{A})}\right)^{-1}\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|, (24)

where (23) uses the definition of χj\chi_{j} and the assumption that 𝒙~𝒙~\widetilde{\bm{x}}\neq\widetilde{\bm{x}}^{\star} (equivalently, 𝒙B𝒙^B\bm{x}^{\star}_{B}\neq\widehat{\bm{x}}_{B}), and (24) follows from the bound

𝒙^B𝒙B𝒙^B𝒙B1iB~|𝒙i𝒙i|+|iB~(𝒙i𝒙i)|2𝒙~𝒙~12|B|𝒙~𝒙~.\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|\leq\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|_{1}\leq\sum_{i\in\widetilde{B}}\left|\bm{x}_{i}-\bm{x}^{\star}_{i}\right|+\left|\sum_{i\in\widetilde{B}}(\bm{x}_{i}-\bm{x}^{\star}_{i})\right|\leq 2\|\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}\|_{1}\leq 2\sqrt{|B|}\|\widetilde{\bm{x}}-\widetilde{\bm{x}}^{\star}\|.

Returning to (15), we have

\displaystyle\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v \geq\lambda\frac{1}{2|N|\sqrt{|B|}}\sigma_{\mathrm{min}}(\bar{\mathbf{Q}})\left(1+\frac{1}{\alpha_{D}(\mathbf{A})}\right)^{-1}\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|-2(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty}
=λP(𝐀)𝒙^B𝒙B2(1λ)𝐀,\displaystyle=\lambda P(\mathbf{A})\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|-2(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty}, (25)

where the equality above follows from the definition of P(𝐀)P(\mathbf{A}) in (13). Next, we bound

\displaystyle\|\bm{x}-\bm{x}^{\star}\|^{2} =\|\lambda\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|^{2}+(1-\lambda)^{2}\|\widehat{\bm{x}}_{\bar{B}}\|^{2}
\displaystyle=\|\lambda(\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B})-(1-\lambda)\bm{x}^{\star}_{B}\|^{2}+(1-\lambda)^{2}\|\widehat{\bm{x}}_{\bar{B}}\|^{2}
\displaystyle\leq 2\lambda^{2}\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|^{2}+2(1-\lambda)^{2}\|\bm{x}^{\star}_{B}\|^{2}+(1-\lambda)^{2}\|\widehat{\bm{x}}_{\bar{B}}\|^{2} (26)
2λ2𝒙^B𝒙B2+3(1λ)2,\displaystyle\leq 2\lambda^{2}\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|^{2}+3(1-\lambda)^{2}, (27)

where (26) uses the triangle inequality with respect to $\|\cdot\|$ along with the inequality $(t_{1}+t_{2})^{2}\leq 2t_{1}^{2}+2t_{2}^{2}$, and (27) uses that $\|\bm{x}^{\star}_{B}\|,\|\widehat{\bm{x}}_{\bar{B}}\|\leq 1$. Since we are assuming that $\lambda P(\mathbf{A})\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|\geq 4(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty}$, (27) in turn implies that

𝒙𝒙2\displaystyle\|\bm{x}-\bm{x}^{\star}\|^{2} 2λ2𝒙^B𝒙B2+λ2(P(𝐀)𝐀)2𝒙^B𝒙B2\displaystyle\leq 2\lambda^{2}\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|^{2}+\lambda^{2}\left(\frac{P(\mathbf{A})}{\|\mathbf{A}^{\flat}\|_{\infty}}\right)^{2}\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|^{2}
=λ2(2+(P(𝐀)𝐀)2)𝒙^B𝒙B2.\displaystyle=\lambda^{2}\left(2+\left(\frac{P(\mathbf{A})}{\|\mathbf{A}^{\flat}\|_{\infty}}\right)^{2}\right)\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|^{2}. (28)

Combining (25) and (28) with the assumption that λP(𝐀)𝒙^B𝒙B4(1λ)𝐀\lambda P(\mathbf{A})\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|\geq 4(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty},

max𝒚𝒴𝒙,𝐀𝒚v\displaystyle\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v λ2P(𝐀)𝒙^B𝒙B\displaystyle\geq\frac{\lambda}{2}P(\mathbf{A})\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|
12P(𝐀)(2+(P(𝐀)𝐀)2)2𝒙𝒙κ(𝐀)𝒙𝒙.\displaystyle\geq\frac{1}{2}P(\mathbf{A})\left(2+\left(\frac{P(\mathbf{A})}{\|\mathbf{A}^{\flat}\|_{\infty}}\right)^{2}\right)^{-2}\|\bm{x}-\bm{x}^{\star}\|\geq\kappa(\mathbf{A})\|\bm{x}-\bm{x}^{\star}\|.

It is easy to see that P(𝐀)/𝐀P(\mathbf{A})/\|\mathbf{A}^{\flat}\|_{\infty} is upper bounded by an absolute constant, and so we have

\displaystyle\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v\gtrsim P(\mathbf{A})\|\bm{x}-\bm{x}^{\star}\|=\frac{1}{2|N|\sqrt{|B|}}\sigma_{\mathrm{min}}(\overline{\mathbf{Q}})\left(1+\frac{1}{\alpha_{D}(\mathbf{A})}\right)^{-1}\|\bm{x}-\bm{x}^{\star}\|
1|B|3(αD(𝐀))2γP(𝐀)𝒙𝒙.\displaystyle\gtrsim\frac{1}{|B|^{3}}(\alpha_{D}(\mathbf{A}))^{2}\gamma_{P}(\mathbf{A})\|\bm{x}-\bm{x}^{\star}\|.

Above, the last bound uses the fact that

\displaystyle\sigma_{\mathrm{min}}(\overline{\mathbf{Q}})\geq\frac{1}{\sqrt{|\widetilde{B}|}}\min_{j\in\widetilde{N}}\mathsf{dist}(\overline{\mathbf{Q}}_{:,j},\mathsf{span}(\overline{\mathbf{Q}}_{:,\widetilde{N}-j}))
1|B~|3/2minjN~𝖽𝗂𝗌𝗍(𝐐:,j,𝗌𝗉𝖺𝗇(𝐐:,N~j))αD(𝐀)=1|B~|3/2γP(𝐀)αD(𝐀),\displaystyle\geq\frac{1}{|\widetilde{B}|^{3/2}}\min_{j\in\widetilde{N}}\mathsf{dist}(\mathbf{Q}_{:,j},\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j}))\alpha_{D}(\mathbf{A})=\frac{1}{|\widetilde{B}|^{3/2}}\gamma_{P}(\mathbf{A})\alpha_{D}(\mathbf{A}),

where the first inequality uses (6), while the second one is a consequence of Lemma 3.4.

Case II:

\lambda P(\mathbf{A})\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|<4(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty}. This case can only arise when \bar{B}\neq\emptyset (for otherwise \lambda=1). Then, we bound

max𝒚𝒴𝒙,𝐀𝒚v\displaystyle\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v 𝒙,𝐀𝒚v\displaystyle\geq\langle\bm{x},\mathbf{A}\bm{y}^{\star}\rangle-v
\displaystyle\geq\lambda\langle\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B},\mathbf{A}_{B,N}\bm{y}^{\star}_{N}\rangle+(1-\lambda)\langle\widehat{\bm{x}}_{\bar{B}},\mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}-v\bm{1}\rangle
(1λ)βD(𝐀),\displaystyle\geq(1-\lambda)\beta_{D}(\mathbf{A}), (29)

by definition of βD(𝐀)\beta_{D}(\mathbf{A}) (Item 2) and the fact that 𝒙^B𝒙B,𝐀B,N𝒚N=v𝒙^B𝒙B,𝟏=0\langle\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B},\mathbf{A}_{B,N}\bm{y}^{\star}_{N}\rangle=v\langle\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B},\bm{1}\rangle=0. Moreover, by (27) together with the assumption that λP(𝐀)𝒙^B𝒙B<4(1λ)𝐀\lambda P(\mathbf{A})\|\widehat{\bm{x}}_{B}-\bm{x}^{\star}_{B}\|<4(1-\lambda)\|\mathbf{A}^{\flat}\|_{\infty},

𝒙𝒙232(𝐀P(𝐀))2(1λ)2+3(1λ)2=(32(𝐀P(𝐀))2+3)(1λ)2.\|\bm{x}-\bm{x}^{\star}\|^{2}\leq 32\left(\frac{\|\mathbf{A}^{\flat}\|_{\infty}}{P(\mathbf{A})}\right)^{2}(1-\lambda)^{2}+3(1-\lambda)^{2}=\left(32\left(\frac{\|\mathbf{A}^{\flat}\|_{\infty}}{P(\mathbf{A})}\right)^{2}+3\right)(1-\lambda)^{2}.

Combining with (29) yields

\displaystyle\max_{\bm{y}^{\prime}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-v\geq\left(32\left(\frac{\|\mathbf{A}^{\flat}\|_{\infty}}{P(\mathbf{A})}\right)^{2}+3\right)^{-1/2}\beta_{D}(\mathbf{A})\|\bm{x}-\bm{x}^{\star}\|
P(𝐀)𝐀βD(𝐀)𝒙𝒙\displaystyle\gtrsim\frac{P(\mathbf{A})}{\|\mathbf{A}^{\flat}\|_{\infty}}\beta_{D}(\mathbf{A})\|\bm{x}-\bm{x}^{\star}\|
1𝐀1|B|3αD(𝐀)2βD(𝐀)γP(𝐀)𝒙𝒙.\displaystyle\gtrsim\frac{1}{\|\mathbf{A}^{\flat}\|_{\infty}}\frac{1}{|B|^{3}}\alpha_{D}(\mathbf{A})^{2}\beta_{D}(\mathbf{A})\gamma_{P}(\mathbf{A})\|\bm{x}-\bm{x}^{\star}\|.

C.2 Proofs from Section 3.3

We continue with the proofs from Section 3.3. As we have noted already, given that all quantities of interest in Definition 3.3 depend on the support of the equilibrium, it is natural to proceed by partitioning the probability space over all possible such configurations. To do so, we will use the following simple fact (Spielman and Teng, 2003, Proposition 8.1).

Proposition C.5 (Spielman and Teng, 2003).

Let XX and YY be random variables distributed according to an integrable density function. For any event (X,Y){\mathcal{E}}(X,Y),

X,Y[(X,Y)]maxyX,Y[(X,Y)Y=y]maxYX,Y[(X,Y)Y].\underset{X,Y}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(X,Y)]\leq\max_{y}\underset{X,Y}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(X,Y)\mid Y=y]\eqqcolon\max_{Y}\underset{X,Y}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(X,Y)\mid Y].
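For completeness, a quick way to see this identity is via the law of total probability; the following is a sketch under the stated integrability assumption:

\underset{X,Y}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(X,Y)]=\int\underset{X,Y}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(X,Y)\mid Y=y]\,\mu_{Y}(y)\,dy\leq\max_{y}\underset{X,Y}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(X,Y)\mid Y=y],

since the marginal density \mu_{Y} integrates to 1.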

In our application, we want to condition on the event that BB is the support of 𝒙\bm{x}^{\star} and NN is the support of 𝒚\bm{y}^{\star}. For convenience, we let 𝖳𝗒𝗉𝖾B,N(𝐀)\mathsf{Type}_{B,N}(\mathbf{A}) denote the indicator random variable representing whether BB and NN indeed index the positive coordinates of the equilibrium; that is, 𝖳𝗒𝗉𝖾B,N(𝐀)𝟙{B={i[n]:𝒙i(𝐀)>0}N={j[m]:𝒚j(𝐀)>0}}\mathsf{Type}_{B,N}(\mathbf{A})\coloneqq\mathbbm{1}\{B=\{i\in[n]:\bm{x}^{\star}_{i}(\mathbf{A})>0\}\land N=\{j\in[m]:\bm{y}^{\star}_{j}(\mathbf{A})>0\}\}. Unlike general linear programs, which can be infeasible or unbounded, the linear program induced by a zero-sum game is guaranteed to be primal and dual feasible, no matter the perturbation (under Definition 1.1). We will thus only have to condition on events in which BB and NN are both nonempty. To be able to control the probability density function upon conditioning on 𝖳𝗒𝗉𝖾B,N(𝐀)\mathsf{Type}_{B,N}(\mathbf{A}), it will be convenient to perform a certain change of variables, which is described next.

Change of variables

Let us denote by \mathbf{A}_{\overline{B,N}} the entries of \mathbf{A} excluding those in \mathbf{A}_{B,N}. We first perform a change of variables from \mathbf{A}_{\overline{B,N}},\mathbf{A}_{B,N} to \mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{c},\bm{b},d, which uses the linear transformation \mathbf{T} associated with (5). With this new set of variables at hand, we can conveniently express \mathbf{Q}\widetilde{\bm{y}}^{\star}=\bm{c} and \mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star}=\bm{b} (C.3). Accordingly, we next perform a change of variables from \mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{c},\bm{b},d to \mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{x}^{\star},\bm{y}^{\star},v. When performing these changes of variables, one has to account for the transformed probability density function. This can be understood as follows. The probability of an event {\mathcal{E}}(\mathbf{A}) can be expressed as

𝐀(𝐀)μ𝐀(𝐀)𝑑𝐀.\int_{\mathbf{A}}{\mathcal{E}}(\mathbf{A})\mu_{\mathbf{A}}(\mathbf{A})d\mathbf{A}.

The integral above can be cast in terms of a new set of variables 𝐁\mathbf{B} by computing the corresponding Jacobian, assuming that it is non-singular. We will make use of this fact in the sequel. The following lemma gathers some of the above observations regarding the change of variables.
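Concretely, for a smooth and invertible reparameterization \mathbf{A}=\mathbf{A}(\mathbf{B}) (stated here generically; the specific maps are the two transformations described above), the identity reads

\int_{\mathbf{A}}{\mathcal{E}}(\mathbf{A})\mu_{\mathbf{A}}(\mathbf{A})d\mathbf{A}=\int_{\mathbf{B}}{\mathcal{E}}(\mathbf{A}(\mathbf{B}))\mu_{\mathbf{A}}(\mathbf{A}(\mathbf{B}))\left|\det\left(\frac{\partial\mathbf{A}}{\partial\mathbf{B}}\right)\right|d\mathbf{B},

so that the induced density on the new variables \mathbf{B} is \mu_{\mathbf{A}}(\mathbf{A}(\mathbf{B}))\left|\det\left(\partial\mathbf{A}/\partial\mathbf{B}\right)\right|.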

Lemma C.6 (Change of variables).

Let (𝐀){\mathcal{E}}(\mathbf{A}) be any event that depends on the randomness of 𝐀\mathbf{A}. Then,

𝐀[(𝐀)]\displaystyle\underset{\mathbf{A}}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(\mathbf{A})] maxB,N𝐀[(𝐀)𝖳𝗒𝗉𝖾B,N(𝐀)]\displaystyle\leq\max_{B,N}\underset{\mathbf{A}}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(\mathbf{A})\mid\mathsf{Type}_{B,N}(\mathbf{A})]
\displaystyle=\max_{B,N}\underset{\mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{x}^{\star},\bm{y}^{\star},v}{\operatorname{\mathbb{P}}}[{\mathcal{E}}(\mathbf{A})\mid\mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}\geq v\bm{1}\textrm{ and }\mathbf{A}^{\top}_{\bar{N},B}\bm{x}^{\star}_{B}\leq v\bm{1}].

Indeed, the first inequality above is a consequence of Proposition C.5. The equality then follows from noting that, when

𝒄=𝐐𝒚~,𝒃=𝐐𝒙~,v=d𝒙~,𝐐𝒚~𝐀B,N𝒚=v𝟏,𝐀N,B𝒙=v𝟏,\bm{c}=\mathbf{Q}\widetilde{\bm{y}}^{\star},\bm{b}=\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star},v=d-\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle\iff\mathbf{A}_{B,N}\bm{y}^{\star}=v\bm{1},\mathbf{A}^{\top}_{N,B}\bm{x}^{\star}=v\bm{1},

the event \mathsf{Type}_{B,N}(\mathbf{A}) can be equivalently expressed as \mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}\geq v\bm{1} and \mathbf{A}^{\top}_{\bar{N},B}\bm{x}^{\star}_{B}\leq v\bm{1}.

We first bound the probability that \beta_{P}(\mathbf{A})\coloneqq\min_{j\in\bar{N}}(v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle) is close to 0; the proof for \beta_{D}(\mathbf{A}) is then symmetric. The key ingredient is the following anti-concentration lemma pertaining to a conditional Gaussian distribution (Spielman and Teng, 2003, Lemma 8.3).

Lemma C.7 (Spielman and Teng, 2003).

Let gg be a Gaussian random variable of variance σ2\sigma^{2} and mean of absolute value at most 11. For ϵ0\epsilon\geq 0, τ1\tau\geq 1 and tτt\leq\tau,

[gt+ϵgt]ϵτσ2eϵ(τ+3)σ2.\operatorname{\mathbb{P}}[g\leq t+\epsilon\mid g\geq t]\leq\frac{\epsilon\tau}{\sigma^{2}}e^{\frac{\epsilon(\tau+3)}{\sigma^{2}}}.

See 3.8

Proof.

By Lemma C.6, it suffices to bound

\max_{B,N}\underset{\mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{x}^{\star},\bm{y}^{\star},v}{\operatorname{\mathbb{P}}}[\beta_{P}(\mathbf{A})\leq\epsilon^{\prime}\mid\mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}\geq v\bm{1}\textrm{ and }\mathbf{A}^{\top}_{\bar{N},B}\bm{x}^{\star}_{B}\leq v\bm{1}].

By Proposition C.5, it suffices to prove that for all B,N,\mathbf{A}_{\bar{B},N},\mathbf{A}_{\bar{B},\bar{N}},\mathbf{Q},\bm{x}^{\star},\bm{y}^{\star},v satisfying \mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}\geq v\bm{1},

\displaystyle\underset{\mathbf{A}_{B,\bar{N}}}{\operatorname{\mathbb{P}}}[\exists j\in\bar{N}:v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle\leq\epsilon^{\prime}\mid\forall j\in\bar{N}:v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle\geq 0] (30)
\displaystyle\leq\sum_{j\in\bar{N}}\underset{\mathbf{A}_{B,j}}{\operatorname{\mathbb{P}}}[v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle\leq\epsilon^{\prime}\mid\forall j\in\bar{N}:v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle\geq 0] (31)
\displaystyle\leq\sum_{j\in\bar{N}}\underset{\mathbf{A}_{B,j}}{\operatorname{\mathbb{P}}}[v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle\leq\epsilon^{\prime}\mid v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle\geq 0] (32)
\displaystyle=\sum_{j\in\bar{N}}\underset{\bm{g}_{j}}{\operatorname{\mathbb{P}}}[\bm{g}_{j}\leq\epsilon^{\prime}-v\mid\bm{g}_{j}\geq-v]. (33)

where in (30) the distribution of \mathbf{A}_{B,\bar{N}} after conditioning on \mathbf{A}_{\bar{B},N},\mathbf{A}_{\bar{B},\bar{N}}, \mathbf{Q}, \bm{x}^{\star}, \bm{y}^{\star}, v remains the same, which is a consequence of independence per Definition 1.1; (31) is an application of the union bound; (32) uses the fact that the events \{v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle\geq 0\}_{j\in\bar{N}} are mutually independent, as each depends on a different column of \mathbf{A}_{B,\bar{N}}; and (33) defines \bm{g}_{j}\coloneqq-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle, which is a Gaussian random variable with expectation |{\mathbb{E}}[\bm{g}_{j}]|\leq\max_{i\in B}|\mathbf{A}_{i,j}| and variance \mathbb{V}[\bm{g}_{j}]=\sum_{i\in B}(\bm{x}^{\star}_{i})^{2}\mathbb{V}[\mathbf{A}_{i,j}]=\sigma^{2}\sum_{i\in B}(\bm{x}^{\star}_{i})^{2} (by independence). In particular, by Cauchy-Schwarz, \mathbb{V}[\bm{g}_{j}]\geq\frac{1}{|B|}\sigma^{2}. Further, by Lemma C.7 (for \tau=\max(1,|v|/|{\mathbb{E}}[\bm{g}_{j}]|)), we have

𝒈j[𝒈jϵv𝒈jv]\displaystyle\underset{\bm{g}_{j}}{\operatorname{\mathbb{P}}}[\bm{g}_{j}\leq\epsilon^{\prime}-v\mid\bm{g}_{j}\geq-v] ϵmax(|v|,|𝔼[𝒈j]|)𝕍[𝒈j]eϵmax(4|𝔼[𝒈j]|,3|𝔼[𝒈j]|+|v|)𝕍[𝒈j]\displaystyle\leq\epsilon^{\prime}\frac{\max(|v|,|{\mathbb{E}}[\bm{g}_{j}]|)}{\mathbb{V}[\bm{g}_{j}]}e^{\epsilon^{\prime}\frac{\max(4|{\mathbb{E}}[\bm{g}_{j}]|,3|{\mathbb{E}}[\bm{g}_{j}]|+|v|)}{\mathbb{V}[\bm{g}_{j}]}}
ϵmin(n,m)max(|v|,|𝔼[𝒈j]|)σ2eϵmin(n,m)max(4|𝔼[𝒈j]|,3|𝔼[𝒈j]|+|v|)σ2\displaystyle\leq\epsilon^{\prime}\frac{\min(n,m)\max(|v|,|{\mathbb{E}}[\bm{g}_{j}]|)}{\sigma^{2}}e^{\epsilon^{\prime}\frac{\min(n,m)\max(4|{\mathbb{E}}[\bm{g}_{j}]|,3|{\mathbb{E}}[\bm{g}_{j}]|+|v|)}{\sigma^{2}}}

for any \epsilon^{\prime}\geq 0 and j\in\bar{N}, where we note that we applied Lemma C.7 to \bm{g}_{j}/|{\mathbb{E}}[\bm{g}_{j}]| (since the absolute value of the mean has to be at most 1), which has variance \mathbb{V}[\bm{g}_{j}]/({\mathbb{E}}[\bm{g}_{j}])^{2}. So, setting \epsilon\coloneqq\epsilon^{\prime}(|v|+4|{\mathbb{E}}[\bm{g}_{j}]|),

𝒈j[𝒈jϵ|v|+4maxiB|𝐀i,j|v𝒈jv]\displaystyle\underset{\bm{g}_{j}}{\operatorname{\mathbb{P}}}\left[\bm{g}_{j}\leq\frac{\epsilon}{|v|+4\max_{i\in B}|\mathbf{A}_{i,j}|}-v\mid\bm{g}_{j}\geq-v\right] 𝒈j[𝒈jϵ|v|+4|𝔼[𝒈j]|v𝒈jv]\displaystyle\leq\underset{\bm{g}_{j}}{\operatorname{\mathbb{P}}}\left[\bm{g}_{j}\leq\frac{\epsilon}{|v|+4|{\mathbb{E}}[\bm{g}_{j}]|}-v\mid\bm{g}_{j}\geq-v\right]
ϵmin(n,m)σ2eϵmin(n,m)σ2.\displaystyle\leq\epsilon\frac{\min(n,m)}{\sigma^{2}}e^{\epsilon\frac{\min(n,m)}{\sigma^{2}}}. (34)

Now, when ϵmin(n,m)σ2>1\epsilon\frac{\min(n,m)}{\sigma^{2}}>1 the proposition is vacuously true, while in the contrary case the claim follows from (34) and (33). ∎

Next, we proceed with the bound on \gamma_{P}(\mathbf{A}). The key ingredient is the observation that a random variable with a slowly changing density function cannot be too concentrated on any interval (Lemma 3.7, due to Spielman and Teng (2003, Lemma 8.2); we restate it below for convenience). Gaussian random variables have this property, as pointed out by Spielman and Teng (2003, Lemma 8.1).

Lemma C.8 (Spielman and Teng, 2003).

Let μ\mu be the probability density function of a Gaussian random variable in d{\mathbb{R}}^{d} of variance σ2\sigma^{2} centered at a point of norm at most 11. If 𝖽𝗂𝗌𝗍(𝐫,𝐫)ϵ1\mathsf{dist}(\bm{r},\bm{r}^{\prime})\leq\epsilon\leq 1, then

μ(𝒓)μ(𝒓)eϵ(𝒓+2)σ2.\frac{\mu(\bm{r}^{\prime})}{\mu(\bm{r})}\geq e^{-\frac{\epsilon(\|\bm{r}\|+2)}{\sigma^{2}}}.
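For intuition, this bound can be verified directly; the following is a standard computation included for completeness, writing \bm{c} for the center of the Gaussian (so that \|\bm{c}\|\leq 1):

\frac{\mu(\bm{r}^{\prime})}{\mu(\bm{r})}=e^{\frac{\|\bm{r}-\bm{c}\|^{2}-\|\bm{r}^{\prime}-\bm{c}\|^{2}}{2\sigma^{2}}}\geq e^{-\frac{\|\bm{r}^{\prime}-\bm{r}\|(\|\bm{r}-\bm{c}\|+\|\bm{r}^{\prime}-\bm{c}\|)}{2\sigma^{2}}}\geq e^{-\frac{\epsilon(2\|\bm{r}\|+2+\epsilon)}{2\sigma^{2}}}\geq e^{-\frac{\epsilon(\|\bm{r}\|+2)}{\sigma^{2}}},

where the second step uses the triangle inequality, and the last two steps use \|\bm{c}\|\leq 1 together with \mathsf{dist}(\bm{r},\bm{r}^{\prime})\leq\epsilon\leq 1.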

See 3.7

See 3.9

Proof.

Let \mu_{\mathbf{A}}(\mathbf{A}) be the probability density function of \mathbf{A}, which, by independence, can be expressed as \prod_{i\in[n],j\in[m]}\mu_{\mathbf{A}_{i,j}}, where \mu_{\mathbf{A}_{i,j}} is the density of a Gaussian random variable. We first perform a change of variables from \mathbf{A}_{\overline{B,N}},\mathbf{A}_{B,N} to \mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{b},\bm{c},d, in accordance with (5); this can be understood through the (non-singular; C.2) linear transformation \mathbf{A}^{\flat}_{B,N}=\mathbf{T}(\mathbf{Q}^{\flat},\bm{b},\bm{c},d). To express the density in the new variables, we first note that the Jacobian of the change of variables is |\det(\mathbf{T})|=1 (C.2), and so the density on \mathbf{Q},\bm{b},\bm{c},d can be expressed as \mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\bm{b},\bm{c},d))\mu_{\mathbf{A}_{\overline{B,N}}}(\mathbf{A}_{\overline{B,N}}).

Next, we perform a change of variables from \mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{b},\bm{c},d to \mathbf{A}_{\overline{B,N}},\mathbf{Q},\widetilde{\bm{x}}^{\star},\widetilde{\bm{y}}^{\star},v according to the transformations \mathbf{Q}\widetilde{\bm{y}}^{\star}=\bm{c}; \mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star}=\bm{b}; and v=d-\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle. It is easy to see that the Jacobian of the change of variables is

\left|\det\left(\frac{\partial(\mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{b},\bm{c},d)}{\partial(\mathbf{A}_{\overline{B,N}},\mathbf{Q},\widetilde{\bm{x}}^{\star},\widetilde{\bm{y}}^{\star},v)}\right)\right|=\left|\det\left(\frac{\partial(\bm{b},\bm{c},d)}{\partial(\widetilde{\bm{x}}^{\star},\widetilde{\bm{y}}^{\star},v)}\right)\right|=\det(\mathbf{Q})^{2}.
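Indeed, viewing (\bm{b},\bm{c},d) as a function of (\widetilde{\bm{x}}^{\star},\widetilde{\bm{y}}^{\star},v) with \mathbf{Q} held fixed, the Jacobian is block lower triangular,

\frac{\partial(\bm{b},\bm{c},d)}{\partial(\widetilde{\bm{x}}^{\star},\widetilde{\bm{y}}^{\star},v)}=\begin{pmatrix}\mathbf{Q}^{\top}&\mathbf{0}&\bm{0}\\\mathbf{0}&\mathbf{Q}&\bm{0}\\(\mathbf{Q}\widetilde{\bm{y}}^{\star})^{\top}&(\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star})^{\top}&1\end{pmatrix},

whose determinant is \det(\mathbf{Q}^{\top})\det(\mathbf{Q})=\det(\mathbf{Q})^{2}.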

So, the density on \mathbf{A}_{\overline{B,N}},\mathbf{Q},\widetilde{\bm{x}}^{\star},\widetilde{\bm{y}}^{\star},v reads

\mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle))\mu_{\mathbf{A}_{\overline{B,N}}}(\mathbf{A}_{\overline{B,N}})\det(\mathbf{Q})^{2}.

By Lemma C.6, it suffices to upper bound

\max_{B,N}\underset{\mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{x}^{\star},\bm{y}^{\star},v}{\operatorname{\mathbb{P}}}[\gamma_{P}(\mathbf{A})\leq\epsilon\mid\mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}\geq v\bm{1}\textrm{ and }\mathbf{A}^{\top}_{\bar{N},B}\bm{x}^{\star}_{B}\leq v\bm{1}].

Further, by Proposition C.5, it is in turn enough to bound {\operatorname{\mathbb{P}}}_{\mathbf{Q}}[\gamma_{P}(\mathbf{A})\leq\epsilon] for all B, N (for the non-trivial case where \widetilde{B},\widetilde{N}\neq\emptyset), \mathbf{A}_{\overline{B,N}}, \widetilde{\bm{x}}^{\star}, \widetilde{\bm{y}}^{\star}, v such that \mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}\geq v\bm{1} and \mathbf{A}^{\top}_{\bar{N},B}\bm{x}^{\star}_{B}\leq v\bm{1}, where the induced distribution on \mathbf{Q} is

μ𝐀B,N(𝐓(𝐐,𝐐𝒙~,𝐐𝒚~,v+𝒙~,𝐐𝒚~))det(𝐐)2.\mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle))\det(\mathbf{Q})^{2}.

We will prove that for any jN~j\in\widetilde{N} and 𝐐:,N~j\mathbf{Q}_{:,\widetilde{N}-j},

𝐐:,j[𝖽𝗂𝗌𝗍(𝐐:,j,𝗌𝗉𝖺𝗇(𝐐:,N~j))ϵ4𝐐:,j+4|v|+4𝐐:,N~j+3]ϵ4emin(n,m)2σ2,\underset{\mathbf{Q}_{:,j}}{\operatorname{\mathbb{P}}}\left[\mathsf{dist}(\mathbf{Q}_{:,j},\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j}))\leq\frac{\epsilon}{4\|\mathbf{Q}_{:,j}\|+4|v|+4\|\mathbf{Q}^{\flat}_{:,\widetilde{N}-j}\|_{\infty}+3}\right]\leq\epsilon\frac{4e\min(n,m)^{2}}{\sigma^{2}}, (35)

and then apply a union bound over jN~j\in\widetilde{N}. Having fixed 𝐐:,N~j\mathbf{Q}_{:,\widetilde{N}-j}, we can express 𝐐:,j\mathbf{Q}_{:,j} as 𝒒+t𝒒\bm{q}^{\parallel}+t\bm{q}^{\perp}, where B~𝒒𝗌𝗉𝖺𝗇(𝐐:,N~j){\mathbb{R}}^{\widetilde{B}}\ni\bm{q}^{\parallel}\in\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j}) and B~𝒒{\mathbb{R}}^{\widetilde{B}}\ni\bm{q}^{\perp} is the unit vector orthogonal to 𝗌𝗉𝖺𝗇(𝐐:,N~j)\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j}). Then, |t|=𝖽𝗂𝗌𝗍(𝐐:,j,𝗌𝗉𝖺𝗇(𝐐:,N~j))|t|=\mathsf{dist}(\mathbf{Q}_{:,j},\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j})) and |det(𝐐)|=tC(𝐐:,N~j)|\det(\mathbf{Q})|=tC(\mathbf{Q}_{:,\widetilde{N}-j}), where C(𝐐:,N~j)C(\mathbf{Q}_{:,\widetilde{N}-j}) does not depend on 𝐐:,j\mathbf{Q}_{:,j} (this can be obtained by expressing the determinant using the formula for parallelepipeds). By symmetry, we can prove (35) by bounding the probability that tt is at most ϵ\epsilon given that tt is at least 0. We can thus focus on proving

max𝒒𝗌𝗉𝖺𝗇(𝐐:,N~j)𝑡[tϵt0]ϵ4emin(n,m)2(4𝒒+4|v|+4𝐐:,N~j+3)σ2,\max_{\bm{q}^{\parallel}\in\mathsf{span}(\mathbf{Q}_{:,\widetilde{N}-j})}\underset{t}{\operatorname{\mathbb{P}}}[t\leq\epsilon\mid t\geq 0]\leq\epsilon\frac{4e\min(n,m)^{2}(4\|\bm{q}^{\parallel}\|_{\infty}+4|v|+4\|\mathbf{Q}^{\flat}_{:,\widetilde{N}-j}\|_{\infty}+3)}{\sigma^{2}}, (36)

and then (35) follows from the fact that 𝐐:,j𝒒\|\mathbf{Q}_{:,j}\|\geq\|\bm{q}^{\parallel}\|. Now, the induced distribution on tt is proportional to

ρ(t)t2(i,j)B×Nμ𝐀i,j(𝐓i,j,𝒓i,j(t))\rho(t)\coloneqq t^{2}\prod_{(i,j)\in B\times N}\mu_{\mathbf{A}_{i,j}}\left(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle\right)

for 𝒓i,j(t)\bm{r}_{i,j}(t) defined as

(𝒒+t𝒒,𝐐:,N~j,𝐐N~j,:𝒙~,𝒙~,𝒒+t𝒒,𝐐:,N~j𝒚~N~j+𝒚~j(𝒒+t𝒒),\displaystyle(\bm{q}^{\parallel}+t\bm{q}^{\perp},\mathbf{Q}^{\flat}_{:,\widetilde{N}-j},\mathbf{Q}_{\widetilde{N}-j,:}^{\top}\widetilde{\bm{x}}^{\star},\langle\widetilde{\bm{x}}^{\star},\bm{q}^{\parallel}+t\bm{q}^{\perp}\rangle,\mathbf{Q}_{:,\widetilde{N}-j}\widetilde{\bm{y}}^{\star}_{\widetilde{N}-j}+\widetilde{\bm{y}}^{\star}_{j}(\bm{q}^{\parallel}+t\bm{q}^{\perp}),
v+𝒙~,𝐐:,N~j𝒚~N~j+𝒚~j𝒙~,𝒒+t𝒒).\displaystyle v+\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}_{:,\widetilde{N}-j}\widetilde{\bm{y}}^{\star}_{\widetilde{N}-j}\rangle+\widetilde{\bm{y}}^{\star}_{j}\langle\widetilde{\bm{x}}^{\star},\bm{q}^{\parallel}+t\bm{q}^{\perp}\rangle).

We now want to apply Lemma 3.7. To that end, we have

|𝐓i,j,𝒓i,j(t)𝒓i,j(t)|2\displaystyle|\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)-\bm{r}_{i,j}(t^{\prime})\rangle|^{2} 𝐓i,j2𝒓i,j(t)𝒓i,j(t)2\displaystyle\leq\|\mathbf{T}_{i,j}\|^{2}\|\bm{r}_{i,j}(t)-\bm{r}_{i,j}(t^{\prime})\|^{2}
4(tt)2(𝒒,𝒙~,𝒒,𝒚~j𝒒,𝒚~j𝒙~,𝒒)2\displaystyle\leq 4(t-t^{\prime})^{2}\|(\bm{q}^{\perp},\langle\widetilde{\bm{x}}^{\star},\bm{q}^{\perp}\rangle,\widetilde{\bm{y}}^{\star}_{j}\bm{q}^{\perp},\widetilde{\bm{y}}^{\star}_{j}\langle\widetilde{\bm{x}}^{\star},\bm{q}^{\perp}\rangle)\|^{2} (37)
16(tt)2,\displaystyle\leq 16(t-t^{\prime})^{2}, (38)

where (37) follows from the fact that 𝐓i,j22\|\mathbf{T}_{i,j}\|_{2}\leq 2 (C.2), and (38) follows from noting that 𝒒,𝒙~,𝒚~1\|\bm{q}^{\perp}\|,\|\widetilde{\bm{x}}^{\star}\|,\|\widetilde{\bm{y}}^{\star}\|\leq 1. Moreover, again by C.2,

|𝐓i,j,𝒓i,j(t)|𝐓i,j1𝒓i,j(t)4(𝒒+|v|+𝐐:,N~j+t).|\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle|\leq\|\mathbf{T}_{i,j}\|_{1}\|\bm{r}_{i,j}(t)\|_{\infty}\leq 4(\|\bm{q}^{\parallel}\|_{\infty}+|v|+\|\mathbf{Q}_{:,\widetilde{N}-j}^{\flat}\|_{\infty}+t).

Let 0ttδ140\leq t\leq t^{\prime}\leq\delta\leq\frac{1}{4} for δ=σ24|B||N|(4𝒒+4|v|+4𝐐:,N~j+3)\delta=\frac{\sigma^{2}}{4|B||N|(4\|\bm{q}^{\parallel}\|+4|v|+4\|\mathbf{Q}^{\flat}_{:,\widetilde{N}-j}\|_{\infty}+3)}. Lemma C.8 then implies that

μ𝐀i,j(𝐓i,j,𝒓i,j(t))μ𝐀i,j(𝐓i,j,𝒓i,j(t))e1|B||N|.\frac{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t^{\prime})\rangle)}{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle)}\geq e^{-\frac{1}{|B||N|}}.

Thus,

ρ(t)ρ(t)(tt)2(i,j)B×Nμ𝐀i,j(𝐓i,j,𝒓i,j(t))μ𝐀i,j(𝐓i,j,𝒓i,j(t))e1.\frac{\rho(t^{\prime})}{\rho(t)}\geq\left(\frac{t^{\prime}}{t}\right)^{2}\prod_{(i,j)\in B\times N}\frac{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t^{\prime})\rangle)}{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle)}\geq e^{-1}.

We conclude that (36) can be obtained from Lemma 3.7, and the theorem follows. ∎

Finally, we bound the probability that αP(𝐀)\alpha_{P}(\mathbf{A}) (Item 1) is close to 0; αD(𝐀)\alpha_{D}(\mathbf{A}) can be bounded in a similar fashion.

See 3.10

Proof.

By Lemma C.6, it suffices to bound

\max_{B,N}\underset{\mathbf{A}_{\overline{B,N}},\mathbf{Q},\bm{x}^{\star},\bm{y}^{\star},v}{\operatorname{\mathbb{P}}}[\alpha_{P}(\mathbf{A})\leq\epsilon\mid\mathbf{A}_{\bar{B},N}\bm{y}^{\star}_{N}\geq v\bm{1}\textrm{ and }\mathbf{A}^{\top}_{\bar{N},B}\bm{x}^{\star}_{B}\leq v\bm{1}],

where we recall that the induced probability density function on \mathbf{A}_{\overline{B,N}}, \mathbf{Q}, \bm{x}^{\star}, \bm{y}^{\star}, v reads

\mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle))\mu_{\mathbf{A}_{B,\bar{N}}}(\mathbf{A}_{B,\bar{N}})\mu_{\mathbf{A}_{\bar{B},N}}(\mathbf{A}_{\bar{B},N})\mu_{\mathbf{A}_{\bar{B},\bar{N}}}(\mathbf{A}_{\bar{B},\bar{N}})\det(\mathbf{Q})^{2}.

We consider the non-trivial case where \widetilde{B},\widetilde{N}\neq\emptyset. We will perform a further change of variables. Namely, let \bm{a}\coloneqq\mathbf{A}^{\top}_{\bar{N},i} for i\in B\setminus\widetilde{B}. We map \mathbf{A}_{B,\bar{N}} to \overline{\mathbf{A}}_{\widetilde{B},\bar{N}}\coloneqq\mathbf{A}_{\widetilde{B},\bar{N}}-\bm{1}\bm{a}^{\top}, \bm{a}, so that \mathbf{A}^{\top}_{\bar{N},B}\bm{x}^{\star}_{B}\leq v\bm{1} can be equivalently expressed as \overline{\mathbf{A}}^{\top}_{\bar{N},\widetilde{B}}\widetilde{\bm{x}}^{\star}\leq v\bm{1}-\bm{a}. The induced density function is now proportional to

\mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle))\mu_{\bm{a}}(\bm{a})\mu_{\mathbf{A}_{\widetilde{B},\bar{N}}}(\overline{\mathbf{A}}_{\widetilde{B},\bar{N}}+\bm{1}\bm{a}^{\top})\nu(\cdot),

where \nu(\cdot) does not depend on \widetilde{\bm{x}}^{\star} and \bm{a}. By Proposition C.5, it is enough to show that for any B,N,\overline{\mathbf{A}}_{\widetilde{B},\bar{N}},\mathbf{A}_{\bar{B},N},\mathbf{A}_{\bar{B},\bar{N}},\mathbf{Q},\bm{y}^{\star},v satisfying \mathbf{A}_{\bar{B},N}\bm{y}^{\star}\geq v\bm{1},

\displaystyle\underset{\widetilde{\bm{x}}^{\star},\bm{a}}{\operatorname{\mathbb{P}}}\left[\alpha_{P}\leq\frac{\epsilon}{\max((\|\mathbf{Q}^{\flat}\|_{\infty}+1)^{2},(1+\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty})(5\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}+|v|+4))}\mid\overline{\mathbf{A}}^{\top}_{\bar{N},\widetilde{B}}\widetilde{\bm{x}}^{\star}\leq v\bm{1}-\bm{a}\right]
ϵ8e2mnmin(n,m)σ2,\displaystyle\leq\epsilon\frac{8e^{2}mn\min(n,m)}{\sigma^{2}}, (39)

where the induced distribution on 𝒙~\widetilde{\bm{x}}^{\star} and 𝒂\bm{a} is proportional to

\mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle))\mu_{\bm{a}}(\bm{a})\mu_{\mathbf{A}_{\widetilde{B},\bar{N}}}(\overline{\mathbf{A}}_{\widetilde{B},\bar{N}}+\bm{1}\bm{a}^{\top}). (40)

We see that \widetilde{\bm{x}}^{\star} is independent of \bm{a} and that \{\bm{a}_{j}\}_{j\in\bar{N}} are mutually independent. Thus, conditioning on the event \overline{\mathbf{A}}^{\top}_{\bar{N},\widetilde{B}}\widetilde{\bm{x}}^{\star}\leq v\bm{1}-\bm{a}, the induced distribution on \widetilde{\bm{x}}^{\star} is proportional to

\mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star},\mathbf{Q}\widetilde{\bm{y}}^{\star}\rangle))\prod_{j\in\bar{N}}\operatorname{\mathbb{P}}_{\bm{a}_{j}}[\langle\overline{\mathbf{A}}_{\widetilde{B},j},\widetilde{\bm{x}}^{\star}\rangle\leq v-\bm{a}_{j}].

We can prove (39) by showing that for any fixed iB~i\in\widetilde{B} and 𝒙~B~i\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i},

\displaystyle\underset{\widetilde{\bm{x}}^{\star}_{i}}{\operatorname{\mathbb{P}}}\left[\widetilde{\bm{x}}^{\star}_{i}\leq\frac{\epsilon}{\max((\|\mathbf{Q}^{\flat}\|_{\infty}+1)^{2},(1+\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty})(5\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}+|v|+4))}\mid\overline{\mathbf{A}}^{\top}_{\bar{N},\widetilde{B}}\widetilde{\bm{x}}^{\star}\leq v\bm{1}-\bm{a}\right]
ϵ8e2mmin(n,m)σ2,\displaystyle\leq\epsilon\frac{8e^{2}m\min(n,m)}{\sigma^{2}}, (41)

and then applying the union bound over all iB~i\in\widetilde{B}. Having fixed 𝒙~B~i\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}, the induced density on 𝒙~i\widetilde{\bm{x}}^{\star}_{i}, say ρ(t)\rho(t), is proportional to ρ1(t)ρ2(t)\rho_{1}(t)\cdot\rho_{2}(t), where

ρ1(t)μ𝐀B,N(𝐓(𝐐,𝐐:,B~i𝒙~B~i+t𝐐:,i,𝐐𝒚~,v+𝒙~B~i,𝐐B~i,:𝒚~+t𝐐i,:,𝒚~))\rho_{1}(t)\coloneqq\mu_{\mathbf{A}_{B,N}}(\mathbf{T}(\mathbf{Q}^{\flat},\mathbf{Q}_{:,\widetilde{B}-i}^{\top}\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}+t\mathbf{Q}^{\top}_{:,i},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i},\mathbf{Q}_{\widetilde{B}-i,:}\widetilde{\bm{y}}^{\star}\rangle+t\langle\mathbf{Q}_{i,:},\widetilde{\bm{y}}^{\star}\rangle))

and

\rho_{2}(t)\coloneqq\prod_{j\in\bar{N}}\operatorname{\mathbb{P}}_{\bm{a}_{j}}[\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t\leq v-\bm{a}_{j}].

We will first apply Lemma 3.7 to bound ρ1(t)/ρ1(t)\rho_{1}(t^{\prime})/\rho_{1}(t) for 0ttδ10\leq t\leq t^{\prime}\leq\delta\leq 1 and a sufficiently small δ\delta. We define

𝒓i,j(t)(𝐐,𝐐:,B~i𝒙~B~i+t𝐐:,i,𝐐𝒚~,v+𝒙~B~i,𝐐B~i,:𝒚~+t𝐐i,:,𝒚~),\bm{r}_{i,j}(t)\coloneqq(\mathbf{Q}^{\flat},\mathbf{Q}_{:,\widetilde{B}-i}^{\top}\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}+t\mathbf{Q}^{\top}_{:,i},\mathbf{Q}\widetilde{\bm{y}}^{\star},v+\langle\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i},\mathbf{Q}_{\widetilde{B}-i,:}\widetilde{\bm{y}}^{\star}\rangle+t\langle\mathbf{Q}_{i,:},\widetilde{\bm{y}}^{\star}\rangle),

so that ρ1(t)=(i,j)B×Nμ𝐀i,j(𝐓i,j,𝒓i,j(t))\rho_{1}(t)=\prod_{(i,j)\in B\times N}\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle). Then, we have

|𝐓i,j,𝒓i,j(t)𝒓i,j(t)|4|tt|𝐐,\displaystyle|\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)-\bm{r}_{i,j}(t^{\prime})\rangle|\leq 4|t-t^{\prime}|\|\mathbf{Q}^{\flat}\|_{\infty},

where we used C.2. Further,

|𝐓i,j,𝒓i,j(t)|(t+1)𝐐,|\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle|\leq(t+1)\|\mathbf{Q}^{\flat}\|_{\infty},

and so Lemma C.8 implies that for δ14𝐐\delta\leq\frac{1}{4\|\mathbf{Q}^{\flat}\|_{\infty}},

μ𝐀i,j(𝐓i,j,𝒓i,j(t))μ𝐀i,j(𝐓i,j,𝒓i,j(t))e8δ𝐐(𝐐+1)σ2.\frac{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t^{\prime})\rangle)}{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle)}\geq e^{-\frac{8\delta\|\mathbf{Q}^{\flat}\|_{\infty}(\|\mathbf{Q}^{\flat}\|_{\infty}+1)}{\sigma^{2}}}.

As a result, for δσ28|B||N|𝐐(𝐐+1)\delta\leq\frac{\sigma^{2}}{8|B||N|\|\mathbf{Q}^{\flat}\|_{\infty}(\|\mathbf{Q}^{\flat}\|_{\infty}+1)},

ρ1(t)ρ1(t)=(i,j)B×Nμ𝐀i,j(𝐓i,j,𝒓i,j(t))μ𝐀i,j(𝐓i,j,𝒓i,j(t))e1.\frac{\rho_{1}(t^{\prime})}{\rho_{1}(t)}=\prod_{(i,j)\in B\times N}\frac{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t^{\prime})\rangle)}{\mu_{\mathbf{A}_{i,j}}(\langle\mathbf{T}_{i,j},\bm{r}_{i,j}(t)\rangle)}\geq e^{-1}.

Next, we focus on lower bounding \rho_{2}(t^{\prime})/\rho_{2}(t). From (40), it is not hard to see that \bm{a}_{j} is a Gaussian random variable with expectation |{\mathbb{E}}[\bm{a}_{j}]|\leq 1+\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty} and variance \mathbb{V}[\bm{a}_{j}]\geq\frac{\sigma^{2}}{\min(n,m)}. Also,

\displaystyle\frac{\rho_{2}(t^{\prime})}{\rho_{2}(t)}=\prod_{j\in\bar{N}}\frac{\operatorname{\mathbb{P}}_{\bm{a}_{j}}[\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t^{\prime}\leq v-\bm{a}_{j}]}{\operatorname{\mathbb{P}}_{\bm{a}_{j}}[\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t\leq v-\bm{a}_{j}]}
\displaystyle\geq\prod_{j\in\bar{N}}\operatorname{\mathbb{P}}_{\bm{a}_{j}}[\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t^{\prime}\leq v-\bm{a}_{j}\mid\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t\leq v-\bm{a}_{j}].

By Lemma C.7 (for \tau=(2\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}+|v|+1)/(1+\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty})),

\displaystyle\operatorname{\mathbb{P}}_{\bm{a}_{j}}[\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t^{\prime}\leq v-\bm{a}_{j}\mid\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t\leq v-\bm{a}_{j}]
\displaystyle\geq 1-\delta\frac{\min(n,m)\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}(2\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}+|v|+1)}{\sigma^{2}}e^{\delta\frac{\min(n,m)\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}(5\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}+|v|+4)}{\sigma^{2}}}.

Thus, for \delta\leq\frac{1}{2em}\frac{\sigma^{2}}{\min(n,m)\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}(5\|\overline{\mathbf{A}}^{\flat}_{\widetilde{B},\bar{N}}\|_{\infty}+|v|+4)},

\operatorname{\mathbb{P}}_{\bm{a}_{j}}[\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t^{\prime}\leq v-\bm{a}_{j}\mid\langle\overline{\mathbf{A}}_{\widetilde{B}-i,j},\widetilde{\bm{x}}^{\star}_{\widetilde{B}-i}\rangle+\overline{\mathbf{A}}_{i,j}t\leq v-\bm{a}_{j}]\geq 1-\frac{1}{2m},

which in turn implies that

\frac{\rho_{2}(t^{\prime})}{\rho_{2}(t)}\geq\left(1-\frac{1}{2m}\right)^{|\bar{N}|}\geq e^{-1}.

We conclude that ρ(t)ρ(t)e2\frac{\rho(t^{\prime})}{\rho(t)}\geq e^{-2}, and (41) follows from Lemma 3.7 by lower bounding the value of δ\delta. This completes the proof. ∎

Armed with Propositions 3.10, 3.8 and 3.9, Theorem 1.4 can be obtained from Theorem 3.6, in conjunction with a union bound and the fact that 𝐀\poly(n,m)\|\mathbf{A}^{\flat}\|_{\infty}\leq\poly(n,m) with high probability (by Gaussian concentration).

C.3 Proof of Theorem 1.2

Having established Theorem 1.4, here we explain how existing results imply Theorem 1.2. We first focus on OGDA. We also take the opportunity to explain in more detail how Wei et al. (2021) established Definition 1.3, which was sketched earlier in Section 3.1. Our treatment of the rest of the algorithms will be more brief.

Metric subregularity

A central ingredient in the approach of Wei et al. (2021) is what they refer to as saddle-point metric subregularity, stated below as Definition C.9. For the sake of generality, we give the definition for a general objective function f:𝒳×𝒴(𝒙,𝒚)f(𝒙,𝒚)f:{\mathcal{X}}\times{\mathcal{Y}}\ni(\bm{x},\bm{y})\mapsto f(\bm{x},\bm{y}), assumed to be continuously differentiable; (1) corresponds to the bilinear case f(𝒙,𝒚)=𝒙,𝐀𝒚f(\bm{x},\bm{y})=\langle\bm{x},\mathbf{A}\bm{y}\rangle. We use again the notation F(𝒛)(𝒙f(𝒙,𝒚),𝒚f(𝒙,𝒚))F(\bm{z})\coloneqq(\nabla_{\bm{x}}f(\bm{x},\bm{y}),-\nabla_{\bm{y}}f(\bm{x},\bm{y})), where n+m𝒛(𝒙,𝒚){\mathbb{R}}^{n+m}\ni\bm{z}\coloneqq(\bm{x},\bm{y}). We also let L>0L\in{\mathbb{R}}_{>0} be a Lipschitz continuity parameter for FF with respect to \|\cdot\|, so that F(𝒛)F(𝒛)L𝒛𝒛\|F(\bm{z})-F(\bm{z}^{\prime})\|\leq L\|\bm{z}-\bm{z}^{\prime}\|; in the context of (1), one can always take L𝐀L\coloneqq\|\mathbf{A}\|.

Definition C.9 (Metric subregularity for saddle-point problems (Wei et al., 2021)).

A saddle-point problem satisfies metric subregularity if there exists a problem-dependent parameter κ>0\kappa^{\prime}\in{\mathbb{R}}_{>0} such that for any 𝒛𝒵\bm{z}\in{\mathcal{Z}} and 𝒛Π𝒵(𝒛)\bm{z}^{\star}\coloneqq\Pi_{\mathcal{Z}^{\star}}(\bm{z}),

sup𝒛𝒵F(𝒛),𝒛𝒛𝒛𝒛κ𝒛𝒛.\sup_{\bm{z}^{\prime}\in{\mathcal{Z}}}\frac{\langle F(\bm{z}),\bm{z}-\bm{z}^{\prime}\rangle}{\|\bm{z}-\bm{z}^{\prime}\|}\geq\kappa^{\prime}\|\bm{z}-\bm{z}^{\star}\|. (42)

The nomenclature of Definition C.9 can be justified by the fact that (42) is equivalent to a common type of metric subregularity (Wei et al., 2021, Appendix F); for more background, we refer to Dontchev and Rockafellar (2009). We further remark that Wei et al. (2021) introduced (42) in a more general form by allowing an exponent β0\beta\in{\mathbb{R}}_{\geq 0} in the right-hand side, but that additional flexibility is not relevant for our purposes.666Wei et al. (2021) impose (42) only for points 𝒛𝒵𝒵\bm{z}\in{\mathcal{Z}}\setminus\mathcal{Z}^{\star}, which is easily seen to be equivalent.

Now, there is an obvious connection between Definition 1.3 and Definition C.9 in bilinear problems with bounded domain; namely, we have

sup𝒛𝒵F(𝒛),𝒛𝒛𝒛𝒛12Φ(𝒛),\sup_{\bm{z}^{\prime}\in{\mathcal{Z}}}\frac{\langle F(\bm{z}),\bm{z}-\bm{z}^{\prime}\rangle}{\|\bm{z}-\bm{z}^{\prime}\|}\geq\frac{1}{2}\Phi(\bm{z}),

where we used the fact that \langle F(\bm{z}),\bm{z}\rangle=0 and \|\bm{z}-\bm{z}^{\prime}\|\leq D_{{\mathcal{Z}}}=2. So, Definition 1.3 with respect to parameter \kappa implies Definition C.9 with parameter \kappa^{\prime}\coloneqq\kappa/2.
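For completeness, the displayed inequality can be verified directly from the definitions of F and \Phi in the bilinear case: since F(\bm{z})=(\mathbf{A}\bm{y},-\mathbf{A}^{\top}\bm{x}),

\sup_{\bm{z}^{\prime}\in{\mathcal{Z}}}\langle F(\bm{z}),\bm{z}-\bm{z}^{\prime}\rangle=\langle F(\bm{z}),\bm{z}\rangle+\max_{\bm{y}^{\prime}\in\Delta^{m}}\langle\bm{x},\mathbf{A}\bm{y}^{\prime}\rangle-\min_{\bm{x}^{\prime}\in\Delta^{n}}\langle\bm{x}^{\prime},\mathbf{A}\bm{y}\rangle=0+\Phi(\bm{x},\bm{y}),

where the supremum decomposes because the two blocks of \bm{z}^{\prime} are unconstrained by each other; dividing by \|\bm{z}-\bm{z}^{\prime}\|\leq D_{{\mathcal{Z}}}=2 then yields the factor of 1/2.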

Linear convergence of OGDA

Under metric subregularity, in the sense of Definition C.9, Wei et al. (2021) were able to establish that OGDA converges to the set 𝒵\mathcal{Z}^{\star} at a linear rate:

Theorem C.10 (Wei et al., 2021).

Consider a saddle-point problem (1) satisfying metric subregularity with respect to some κ>0\kappa^{\prime}\in{\mathbb{R}}_{>0}. For any η18L\eta\leq\frac{1}{8L}, the iterates (𝐳(τ))1τt(\bm{z}^{(\tau)})_{1\leq\tau\leq t} of OGDA satisfy

𝖽𝗂𝗌𝗍(𝒛(t),𝒵)8(1+16η2(κ)281)t/2𝖽𝗂𝗌𝗍(𝒛^(1),𝒵).\mathsf{dist}(\bm{z}^{(t)},\mathcal{Z}^{\star})\leq 8\left(1+\frac{16\eta^{2}(\kappa^{\prime})^{2}}{81}\right)^{-t/2}\mathsf{dist}(\widehat{\bm{z}}^{(1)},\mathcal{Z}^{\star}). (43)

As a result, Theorem C.10 implies that OGDA guarantees 𝖽𝗂𝗌𝗍(𝒛(t),𝒵)ϵ\mathsf{dist}(\bm{z}^{(t)},\mathcal{Z}^{\star})\leq\epsilon so long as

t2𝗅𝗈𝗀(8D𝒵ϵ)𝗅𝗈𝗀(1+(κ)2324𝐀2).t\geq 2\left\lceil\frac{\mathsf{log}\left(\frac{8D_{{\mathcal{Z}}}}{\epsilon}\right)}{\mathsf{log}\left(1+\frac{(\kappa^{\prime})^{2}}{324\|\mathbf{A}\|^{2}}\right)}\right\rceil. (44)

In conjunction with Theorem 3.6 and Propositions 3.8, 3.9 and 3.10, this immediately implies that OGDA has polynomial smoothed complexity with high probability, as claimed earlier in Theorem 1.2.
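To make the guarantee above concrete, the following is a minimal numerical sketch (in Python, assuming only numpy) of one standard formulation of the OGDA update, z^{(t+1)}=\Pi_{{\mathcal{Z}}}(z^{(t)}-\eta(2F(z^{(t)})-F(z^{(t-1)}))), applied to a random payoff matrix; the helper names project_simplex, duality_gap, and ogda, as well as this particular update variant, are illustrative choices rather than the exact scheme analyzed by Wei et al. (2021).

import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex (standard sort-based method).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def duality_gap(A, x, y):
    # Phi(x, y) = max_{y'} <x, A y'> - min_{x'} <x', A y>, as in (2).
    return np.max(A.T @ x) - np.min(A @ y)

def ogda(A, eta, iters):
    # One common formulation of optimistic gradient descent/ascent (OGDA):
    # z^(t+1) = Proj(z^(t) - eta * (2 F(z^(t)) - F(z^(t-1)))), with F(z) = (A y, -A^T x).
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(iters):
        gx, gy = A @ y, A.T @ x
        x = project_simplex(x - eta * (2 * gx - gx_prev))   # x is the minimizer
        y = project_simplex(y + eta * (2 * gy - gy_prev))   # y is the maximizer
        gx_prev, gy_prev = gx, gy
    return x, y

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))             # random (smoothed-style) payoff matrix
eta = 1.0 / (8.0 * np.linalg.norm(A, 2))      # step size eta <= 1/(8L), with L = ||A||
x, y = ogda(A, eta, iters=5000)
print("duality gap:", duality_gap(A, x, y))

The step size above simply instantiates the condition \eta\leq\frac{1}{8L} of Theorem C.10 with L=\|\mathbf{A}\|.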

Before we proceed, it is instructive to explain how Wei et al. (2021) treated the error bound in bilinear problems where 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}} are polyhedral sets. As we explained earlier, it is enough to show that for any 𝒙𝒳\bm{x}\in{\mathcal{X}} and 𝒚𝒴\bm{y}\in{\mathcal{Y}},

max𝒚𝒴𝒙𝐀𝒚vκ𝒙Π𝒳(𝒙),\displaystyle\max_{\bm{y}\in{\mathcal{Y}}}\bm{x}^{\top}\mathbf{A}\bm{y}-v\geq\kappa\|\bm{x}-\Pi_{\mathcal{X}^{\star}}(\bm{x})\|,
vmin𝒙𝒳𝒙𝐀𝒚κ𝒚Π𝒴(𝒚).\displaystyle v-\min_{\bm{x}\in{\mathcal{X}}}\bm{x}^{\top}\mathbf{A}\bm{y}\geq\kappa\|\bm{y}-\Pi_{\mathcal{Y}^{\star}}(\bm{y})\|.

We focus on the first inequality, which is with respect to Player x. We let {\mathcal{X}}\coloneqq\{\bm{x}\in{\mathbb{R}}^{n}:\bm{c}_{i}^{\top}\bm{x}\leq b_{i}\quad\forall i\in[\ell_{x}]\}, where \ell_{x} denotes the number of constraints describing {\mathcal{X}}. We also let \bm{o}_{j}\coloneqq\mathbf{A}\bm{y}_{j}, where \bm{y}_{j} denotes the jth vertex of {\mathcal{Y}}; for simplicity, we will denote by k_{y}\in{\mathbb{N}} the number of vertices of {\mathcal{Y}}. We consider a fixed \bm{x}\in{\mathcal{X}}\setminus\mathcal{X}^{\star} and \bm{x}^{\star}=\Pi_{\mathcal{X}^{\star}}(\bm{x}).

It is easy to see that the set of optimal strategies for Player xx, 𝒳{𝒙𝒳:max𝒚𝒴𝒙,𝐀𝒚v}\mathcal{X}^{\star}\coloneqq\{\bm{x}\in{\mathcal{X}}:\max_{\bm{y}\in{\mathcal{Y}}}\langle\bm{x},\mathbf{A}\bm{y}\rangle\leq v\}, can be expressed as

𝒳{𝒙n:𝒄i𝒙bi,𝒐j𝒙v(i,j)[x]×[ky]}.\mathcal{X}^{\star}\coloneqq\left\{\bm{x}\in{\mathbb{R}}^{n}:\bm{c}_{i}^{\top}\bm{x}\leq b_{i},\bm{o}_{j}^{\top}\bm{x}\leq v\quad\forall(i,j)\in[\ell_{x}]\times[k_{y}]\right\}.

Indeed, any point \bm{y}\in{\mathcal{Y}} is a convex combination of the vertices of {\mathcal{Y}}, and the converse direction is also obvious. A feasibility constraint i\in[\ell_{x}] is said to be tight if \bm{c}_{i}^{\top}\bm{x}^{\star}=b_{i}; similarly, an optimality constraint j\in[k_{y}] is tight if \bm{o}_{j}^{\top}\bm{x}^{\star}=v. We let L_{x}=L_{x}(\bm{x}^{\star})\subseteq[\ell_{x}] be the set of tight feasibility constraints and K_{y}=K_{y}(\bm{x}^{\star})\subseteq[k_{y}] be the set of tight optimality constraints. We can assume without loss of generality that L_{x},K_{y}\neq\emptyset. It is well-known (e.g., Rockafellar, 2015) that the normal cone of \mathcal{X}^{\star} at \bm{x}^{\star} can be expressed as

N_{\bm{x}^{\star}}\coloneqq\left\{\sum_{i\in L_{x}}p_{i}\bm{c}_{i}+\sum_{j\in K_{y}}q_{j}\bm{o}_{j}:(\bm{p},\bm{q})\in{\mathbb{R}}^{L_{x}}_{\geq 0}\times{\mathbb{R}}^{K_{y}}_{\geq 0}\right\}.

Wei et al. (2021) also define M𝒙N𝒙M_{\bm{x}^{\star}}\subseteq N_{\bm{x}^{\star}} as

N_{\bm{x}^{\star}}\cap\left\{\bm{x}\in{\mathbb{R}}^{n}:\bm{c}_{i}^{\top}\bm{x}\leq 0\quad\forall i\in L_{x}\right\}.

Now, the main parameter of interest that relates to Definition 1.3 in the analysis of Wei et al. (2021) stems from the following quantity.

Definition C.11.

We let C\in{\mathbb{R}}_{>0} be defined as the infimum over all values in (0,\infty) such that

\left\{\sum_{i\in L_{x}}p_{i}\bm{c}_{i}+\sum_{j\in K_{y}}q_{j}\bm{o}_{j}:0\leq p_{i},q_{j}\leq C\right\}\supseteq M_{\bm{x}^{\star}}\cap{\mathcal{B}}_{\infty}, (45)

where n{\mathcal{B}}_{\infty}\subseteq{\mathbb{R}}^{n} is the set of points with \ell_{\infty} norm upper bounded by 11.

By definition of M𝒙M_{\bm{x}^{\star}}, it is evident that there always exists a finite problem-dependent parameter C>0C\in{\mathbb{R}}_{>0} such that Definition C.11 is satisfied. It is then not hard to show that

\max_{\bm{y}\in{\mathcal{Y}}}\bm{x}^{\top}\mathbf{A}\bm{y}-v\geq\frac{1}{C|K_{y}|}\|\bm{x}-\Pi_{\mathcal{X}^{\star}}(\bm{x})\|.

Assuming that the number of vertices is polynomial in the dimensions,777In fact, by virtue of Carathéodory’s theorem, one can refine Definition C.11 so that this holds even when the number of vertices is exponential in the dimensions. Namely, a point 𝒗M𝒙\bm{v}\in M_{\bm{x}^{\star}}\cap{\mathcal{B}}_{\infty} can be written as the conical combination of at most nn of the vectors describing the cone in (45), thereby maintaining feasibility. This observation can be used to refine the (worst-case) analysis of Wei et al. (2021) to, for example, extensive-form games wherein the number of vertices is typically exponential in the dimensions. this shows that Definition C.11 essentially captures the complexity of satisfying Definition 1.3. As we explained earlier in Section 3.1, one challenge is that the constraint matrix of the linear program induced by Definition C.11 depends both on the payoff matrix 𝐀\mathbf{A} as well as the set of constraints. It is thus unclear how to use existing results in the model of smoothed complexity (Dunagan et al., 2011) to bound CC. The second and more important challenge revolves around the fact that Definition C.11 depends solely on the tight constraints of the optimal solution, which in turn depends on the randomness of 𝐀\mathbf{A}. Under our characterization, the latter challenge was addressed earlier in Section 3.3.
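To illustrate the kind of linear program alluded to above, the following sketch (in Python, assuming numpy and scipy, neither of which is used in the paper itself) computes, for a single target point \bm{v}, the smallest bound t such that \bm{v} is a conical combination of given generators with all coefficients in [0,t]; the parameter C of Definition C.11 is the supremum of this value over \bm{v}\in M_{\bm{x}^{\star}}\cap{\mathcal{B}}_{\infty}, which this sketch does not attempt to compute, and the function name is hypothetical.

import numpy as np
from scipy.optimize import linprog

def min_coefficient_bound(generators, v):
    # For a single target point v, find the smallest t such that
    # v = sum_k coeff_k * generators[k] with 0 <= coeff_k <= t
    # (an LP in the spirit of Definition C.11, for this one point).
    G = np.asarray(generators, dtype=float).T        # columns are the generators c_i, o_j
    n, k = G.shape
    c = np.zeros(k + 1)
    c[-1] = 1.0                                      # minimize the bound t (last variable)
    A_eq = np.hstack([G, np.zeros((n, 1))])          # G @ coeffs = v; t plays no role here
    A_ub = np.hstack([np.eye(k), -np.ones((k, 1))])  # coeff_k - t <= 0 for every coefficient
    b_ub = np.zeros(k)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.asarray(v, dtype=float),
                  bounds=[(0, None)] * (k + 1), method="highs")
    return res.x[-1] if res.success else np.inf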

Continuing for OMWU, we again rely on the analysis of Wei et al. (2021), which relates the rate of convergence of OMWU to three quantities. The first one (Wei et al., 2021, Definition 3) is similar to Definition C.9, but with the difference that the maximization is now constrained to be over points whose support is a subset of the support of the equilibrium; namely,

κxmin𝒙𝒳{𝒙}max𝒚𝒱(𝒴)𝒙𝒙,𝐀𝒚𝒙𝒙1,\kappa_{x}\coloneqq\min_{\bm{x}\in{\mathcal{X}}\setminus\{\bm{x}^{\star}\}}\max_{\bm{y}\in\mathcal{V}^{\star}({\mathcal{Y}})}\frac{\langle\bm{x}-\bm{x}^{\star},\mathbf{A}\bm{y}\rangle}{\|\bm{x}-\bm{x}^{\star}\|_{1}}, (46)

where 𝒱(𝒴){𝒚Δm:𝗌𝗎𝗉𝗉(𝒚)𝗌𝗎𝗉𝗉(𝒚)}\mathcal{V}^{\star}({\mathcal{Y}})\coloneqq\{\bm{y}\in\Delta^{m}:\mathsf{supp}(\bm{y})\subseteq\mathsf{supp}(\bm{y}^{\star})\}. A symmetric definition is to be considered with respect to Player yy. To connect this to (8), we note that, when 𝒚𝒱(𝒴)\bm{y}\in\mathcal{V}^{\star}({\mathcal{Y}}), 𝒙,𝐀𝒚=v\langle\bm{x}^{\star},\mathbf{A}\bm{y}\rangle=v. We are thus left to lower bound max𝒚𝒙,𝐀𝒚v\max_{\bm{y}}\langle\bm{x},\mathbf{A}\bm{y}\rangle-v in terms of 𝒙𝒙1\|\bm{x}-\bm{x}^{\star}\|_{1}, but under the constraint that 𝒚𝒱(𝒴)\bm{y}\in\mathcal{V}^{\star}({\mathcal{Y}}). An inspection of our proof of Theorem 3.6 (and in particular the proof of (8)) reveals that its conclusion holds even when the maximization is subject to the above constraint, and so our analysis immediately lower bounds (46) as well. The second quantity introduced by Wei et al. (2021, Definition 2) corresponds exactly to Item 2, which was bounded in Proposition 3.8. The third quantity (Wei et al., 2021, Definition 4) is where the exponential overhead is introduced. Namely, the iteration complexity of OMWU in their analysis depends on exp(min(αP(𝐀),αD(𝐀))1)\exp\left(\min(\alpha_{P}(\mathbf{A}),\alpha_{D}(\mathbf{A}))^{-1}\right), where we recall the definition in Item 1.888More specifically, the proof of Wei et al. (2021, Theorem 3) upper bounds the Kullback-Leibler divergence KL(𝒛(t),𝒛)\text{KL}(\bm{z}^{(t)},\bm{z}^{\star}) by a quantity that is at least as large as (1+15η2C232)t\left(1+\frac{15\eta^{2}C_{2}}{32}\right)^{-t}, where C2exp(min(αP(𝐀),αD(𝐀))1)C_{2}\leq\exp\left(\min(\alpha_{P}(\mathbf{A}),\alpha_{D}(\mathbf{A}))^{-1}\right). Thus, to guarantee KL(𝒛(t),𝒛)ϵ\text{KL}(\bm{z}^{(t)},\bm{z}^{\star})\leq\epsilon using the analysis of Wei et al. (2021) one needs at least 𝗅𝗈𝗀(1/ϵ)/𝗅𝗈𝗀(1+15η2C232)\mathsf{log}(1/\epsilon)/\mathsf{log}\left(1+\frac{15\eta^{2}C_{2}}{32}\right) iterations. When C21C_{2}\ll 1, this grows with 1/C2exp(min(αP(𝐀),αD(𝐀)))1/C_{2}\geq\exp\left(\min(\alpha_{P}(\mathbf{A}),\alpha_{D}(\mathbf{A}))\right). Unfortunately, for any game, it holds that αP(𝐀)1/n\alpha_{P}(\mathbf{A})\leq 1/n and αD(𝐀)1/m\alpha_{D}(\mathbf{A})\leq 1/m, and so even if the geometry of the problem is favorable, the obtained bound is exponential. (The reason the above quantity is crucial in their analysis is because it lower bounds the probability of playing any action through the trajectory of OMWU.) Nevertheless, using Proposition 3.10, our analysis provides instead a bound of exp(\poly(n,m,1/σ))\exp(\poly(n,m,1/\sigma)) with high probability, which is still a major improvement over the worst-case bound of Wei et al. (2021), which can be doubly exponential in the number of bits LL describing the game—one can easily make sure that αP(𝐀)1/2L\alpha_{P}(\mathbf{A})\approx 1/2^{L} (Proposition 3.1).
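For reference, here is a minimal sketch (in Python, assuming numpy) of one common formulation of the OMWU update, in which each player reweights its strategy using the extrapolated loss 2\ell^{(t)}-\ell^{(t-1)}; the formulation and names below are illustrative and should not be read as a verbatim transcription of the algorithm analyzed by Wei et al. (2021).

import numpy as np

def omwu(A, eta, iters):
    # One common formulation of Optimistic Multiplicative Weights Update (OMWU):
    # each player reweights by exp(-eta * (2 * current_loss - previous_loss)).
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m       # interior (uniform) initialization
    lx_prev, ly_prev = A @ y, -(A.T @ x)        # x's loss is A y, y's loss is -A^T x
    for _ in range(iters):
        lx, ly = A @ y, -(A.T @ x)              # losses at the current (simultaneous) iterate
        x = x * np.exp(-eta * (2 * lx - lx_prev)); x /= x.sum()
        y = y * np.exp(-eta * (2 * ly - ly_prev)); y /= y.sum()
        lx_prev, ly_prev = lx, ly
    return x, y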

Next, for EGDA, Tseng (1995) established linear convergence under the error bound

\mathsf{dist}(\bm{z},\mathcal{Z}^{\star})\leq\tau\|\bm{z}-\Pi_{{\mathcal{Z}}}(\bm{z}-\eta F(\bm{z}))\|

for some τ>0\tau>0 and a suitable η>0\eta>0 (Tseng, 1995, Corollary 3.3). It is easy to make the following connection.

Lemma C.12.

It holds that Φ(𝐳)2η𝐳Π𝒵(𝐳ηF(𝐳))\Phi(\bm{z})\leq\frac{2}{\eta}\|\bm{z}-\Pi_{{\mathcal{Z}}}(\bm{z}-\eta F(\bm{z}))\|.

Proof.

Indeed, by the first-order optimality condition for the optimization problem associated with

\bm{z}^{\prime}\coloneqq\Pi_{{\mathcal{Z}}}(\bm{z}-\eta F(\bm{z}))=\arg\min_{\widehat{\bm{z}}\in{\mathcal{Z}}}\left\{h(\widehat{\bm{z}})\coloneqq\|\widehat{\bm{z}}-(\bm{z}-\eta F(\bm{z}))\|^{2}\right\},

we get 𝒛^𝒛,h(𝒛)0\langle\widehat{\bm{z}}-\bm{z}^{\prime},\nabla h(\bm{z}^{\prime})\rangle\geq 0 for any 𝒛^𝒵\widehat{\bm{z}}\in{\mathcal{Z}}, or equivalently, min𝒛^𝒵𝒛^𝒛,𝒛𝒛+ηF(𝒛)0\min_{\widehat{\bm{z}}\in{\mathcal{Z}}}\langle\widehat{\bm{z}}-\bm{z}^{\prime},\bm{z}^{\prime}-\bm{z}+\eta F(\bm{z})\rangle\geq 0. Observing that min𝒛^𝒵𝒛^,F(𝒛)=Φ(𝒛)\min_{\widehat{\bm{z}}\in{\mathcal{Z}}}\langle\widehat{\bm{z}},F(\bm{z})\rangle=-\Phi(\bm{z}) and bounding

𝒛𝒛,𝒛^𝒛𝒛𝒛𝒛^𝒛D𝒵𝒛𝒛=2𝒛Π𝒵(𝒛ηF(𝒛))\langle\bm{z}-\bm{z}^{\prime},\widehat{\bm{z}}-\bm{z}^{\prime}\rangle\geq-\|\bm{z}-\bm{z}^{\prime}\|\|\widehat{\bm{z}}-\bm{z}^{\prime}\|\geq-D_{{\mathcal{Z}}}\|\bm{z}-\bm{z}^{\prime}\|=-2\|\bm{z}-\Pi_{{\mathcal{Z}}}(\bm{z}-\eta F(\bm{z}))\|

leads to the claim. ∎

It can thus be shown that Definition 1.3 is again sufficient to dictate the rate of convergence of EGDA.
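For completeness, a minimal sketch (in Python, assuming numpy) of the extra-gradient update for (1) is given below; it takes the project_simplex helper from the OGDA sketch above as an argument, and it is one standard variant of the method rather than a verbatim transcription of Tseng (1995).

import numpy as np

def egda(A, eta, iters, project_simplex):
    # Extra-gradient descent/ascent (EGDA): take an extrapolation (half) step at
    # the current iterate, then update using the operator evaluated at the half iterate.
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        xh = project_simplex(x - eta * (A @ y))     # half step for x (descent)
        yh = project_simplex(y + eta * (A.T @ x))   # half step for y (ascent)
        x = project_simplex(x - eta * (A @ yh))     # full step uses F at the half iterate
        y = project_simplex(y + eta * (A.T @ xh))
    return x, y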

Finally, for IterSmooth, Gilpin et al. (2012) introduced a “condition measure” of the payoff matrix 𝐀\mathbf{A}, which in fact corresponds precisely to Definition 1.3. Thus, Theorem 1.2 with respect to IterSmooth follows readily from (Gilpin et al., 2012, Theorem 2).

C.4 Proof of Theorem 4.2

Finally, we conclude with the proof of Theorem 4.2, which is restated below.

See 4.2

Proof of Theorem 4.2.

We treat each parameter separately.

  • Let us start from \beta_{P}(\mathbf{A}) (Item 2). We let j^{\prime}\in\arg\min_{j\in\bar{N}}(v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j}\rangle), where we assume that \bar{N}\neq\emptyset. We consider a perturbed matrix \mathbf{A}^{\prime} such that

    𝐀i,j={𝐀i,jβP(𝐀)if iB,j=j,𝐀i,jotherwise.\mathbf{A}^{\prime}_{i,j}=\begin{cases}\mathbf{A}_{i,j}-\beta_{P}(\mathbf{A})&\text{if }i\in B,j=j^{\prime},\\ \mathbf{A}_{i,j}&\text{otherwise}.\end{cases}

    Then, the game described by 𝐀\mathbf{A}^{\prime} cannot be non-degenerate with the same support as 𝐀\mathbf{A}. Indeed, in the contrary case it would follow that the (unique) equilibrium (𝒙B,𝒚N)(\bm{x}^{\star}_{B},\bm{y}^{\star}_{N}) remains the same since 𝐀B,N=𝐀B,N\mathbf{A}^{\prime}_{B,N}=\mathbf{A}_{B,N}. But then, v𝒙B,𝐀B,j=v𝒙B,𝐀B,jβP(𝐀)=0v-\langle\bm{x}^{\star}_{B},\mathbf{A}^{\prime}_{B,j^{\prime}}\rangle=v-\langle\bm{x}^{\star}_{B},\mathbf{A}_{B,j^{\prime}}\rangle-\beta_{P}(\mathbf{A})=0, by definition of jj^{\prime} and βP(𝐀)\beta_{P}(\mathbf{A}), which is a contradiction. Further, 𝐀𝐀=βP(𝐀)\|\mathbf{A}-\mathbf{A}^{\prime}\|=\beta_{P}(\mathbf{A}). In turn, this implies that δβP(𝐀)\delta\leq\beta_{P}(\mathbf{A}). Similar reasoning yields that δβD(𝐀)\delta\leq\beta_{D}(\mathbf{A}).

  • Continuing for \gamma_{P}(\mathbf{A}) (Item 3), we assume that \widetilde{B},\widetilde{N}\neq\emptyset. We let \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top} be a singular value decomposition (SVD) of \mathbf{Q}. Then, perturbing \mathbf{Q} by subtracting \mathbf{U}\mathsf{diag}(0,0,\dots,\sigma_{\mathrm{min}}(\mathbf{Q}))\mathbf{V}^{\top} yields a singular matrix \mathbf{Q}^{\prime}, which cannot be the case if the perturbed game is non-degenerate with the same support. This perturbation can be cast in terms of \mathbf{A}_{B,N}^{\prime} through transformation \mathbf{T} in (5). This lower bounds \sigma_{\mathrm{min}}(\mathbf{Q}) in terms of \delta, and Proposition C.4 can in turn lower bound \gamma_{P}(\mathbf{A}) in terms of \sigma_{\mathrm{min}}(\mathbf{Q}).

  • Finally, we treat αP(𝐀)\alpha_{P}(\mathbf{A}) (Item 1). The non-trivial case is again when B~,N~\widetilde{B},\widetilde{N}\neq\emptyset. Let iargminiB(𝒙i)i^{\prime}\in\arg\min_{i\in B}(\bm{x}^{\star}_{i}). If iB~i^{\prime}\in\widetilde{B}, we define

    B~𝒙~i={0if i=i,𝒙iotherwise.{\mathbb{R}}^{\widetilde{B}}\ni\widetilde{\bm{x}}_{i}^{\prime}=\begin{cases}0&\text{if }i=i^{\prime},\\ \bm{x}^{\star}_{i}&\text{otherwise}.\end{cases}

    We know that \mathbf{Q}^{\top}\widetilde{\bm{x}}^{\star}=\bm{b}. We then consider the perturbed vector \bm{b}^{\prime}\coloneqq\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\prime}. If the perturbed game were non-degenerate with the same support, it would follow that (\widetilde{\bm{x}}^{\prime},\cdot) is the unique equilibrium, which is a contradiction since \widetilde{\bm{x}}^{\prime}_{i^{\prime}}=0. Further, the norm of the perturbation \|\bm{b}-\bm{b}^{\prime}\| is upper bounded in terms of \alpha_{P}(\mathbf{A}), which can again be expressed in terms of \mathbf{A}_{B,N} through transformation (5). Similarly, if i^{\prime}\notin\widetilde{B}, we define

    B~𝒙~i=𝒙i+αP(𝐀)|B~|,{\mathbb{R}}^{\widetilde{B}}\ni\widetilde{\bm{x}}_{i}^{\prime}=\bm{x}^{\star}_{i}+\frac{\alpha_{P}(\mathbf{A})}{|\widetilde{B}|},

    and we consider the perturbed vector \bm{b}^{\prime}\coloneqq\mathbf{Q}^{\top}\widetilde{\bm{x}}^{\prime}. If the perturbed game were non-degenerate with the same support, it would follow that (\widetilde{\bm{x}}^{\prime},\cdot) is the unique equilibrium, which is a contradiction since \sum_{i\in\widetilde{B}}\widetilde{\bm{x}}_{i}^{\prime}=\sum_{i\in\widetilde{B}}\bm{x}^{\star}_{i}+\alpha_{P}(\mathbf{A})=1, which would force the coordinates of B\setminus\widetilde{B} to be zero. The norm of the perturbation is again upper bounded in terms of \alpha_{P}(\mathbf{A}). Overall, we have shown that \delta\leq\alpha_{P}(\mathbf{A})\poly(n,m). Similar reasoning applies with respect to \alpha_{D}(\mathbf{A}). This completes the proof. ∎