
Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality

Dimitris Bertsimas (dbertsim@mit.edu)
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Ryan Cory-Wright (ryancw@mit.edu)
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Jean Pauphilet (jpauphilet@london.edu)
London Business School, London, UK
Abstract

Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than p=100s of variables. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting k=5 covariates from p=300 variables, and provides small bound gaps at a larger scale. We also propose a convex relaxation and greedy rounding scheme that provides bound gaps of 1-2% in practice within minutes for p=100s or hours for p=1,000s and is therefore a viable alternative to the exact method at scale. Using real-world financial and medical datasets, we illustrate our approach’s ability to derive interpretable principal components tractably at scale.

Keywords: Sparse PCA, Sparse Eigenvalues, Semidefinite Optimization

1 Introduction

In the era of big data, interpretable methods for compressing a high-dimensional dataset into a lower dimensional set which shares the same essential characteristics are imperative. Since the work of Hotelling (1933), principal component analysis (PCA) has been one of the most popular approaches for completing this task. Formally, given centered data \bm{A}\in\mathbb{R}^{n\times p} and its normalized empirical covariance matrix \bm{\Sigma}:=\frac{\bm{A}^{\top}\bm{A}}{n-1}\in\mathbb{R}^{p\times p}, PCA selects one or more leading eigenvectors of \bm{\Sigma} and subsequently projects \bm{A} onto these eigenvectors. This can be achieved in O(p^{3}) time by taking a singular value decomposition \bm{\Sigma}=\bm{U}\bm{\Lambda}\bm{U}^{\top}.
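To make the preceding computation concrete, the following minimal Julia sketch (an illustration only; the function and variable names are ours) forms the normalized covariance matrix and projects the data onto its r leading eigenvectors:

```julia
# A minimal sketch of (dense) PCA as described above: form the normalized empirical
# covariance matrix and project the (centered) data onto its r leading eigenvectors.
using LinearAlgebra, Statistics

function dense_pca(A::AbstractMatrix, r::Int)
    Ac = A .- mean(A, dims = 1)        # centering is a no-op if A is already centered
    Σ = Ac' * Ac / (size(A, 1) - 1)    # normalized empirical covariance, p × p
    _, U = eigen(Symmetric(Σ))         # eigenvalues (and eigenvectors) in ascending order
    V = U[:, end:-1:end-r+1]           # the r leading eigenvectors
    return Ac * V                      # scores: projection of the data onto the PCs
end
```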

A common criticism of PCA is that the columns of \bm{U} are not interpretable, since each eigenvector is a linear combination of all p original features. (A second criticism of PCA is that, as set up here, it uses the sample correlation or covariance matrix. This is a drawback, because sample covariance matrices are poorly conditioned estimators which over-disperse the sample eigenvalues, particularly in high-dimensional settings. In practice, this can be rectified by, e.g., using a shrinkage estimator (see, e.g., Ledoit and Wolf, 2004). We do not do so here for simplicity, but we recommend doing so when using the techniques developed in this paper in practice.) The lack of interpretability causes difficulties because:

  • In medical applications such as cancer detection, PCs generated during exploratory data analysis need to supply interpretable modes of variation (Hsu et al., 2014).

  • In scientific applications such as protein folding, each original co-ordinate axis has a physical interpretation, and the reduced set of co-ordinate axes should too.

  • In finance applications such as investing capital across index funds, each non-zero entry in each eigenvector used to reduce the feature space incurs a transaction cost.

  • If p\gg n, PCA suffers from a curse of dimensionality and becomes physically meaningless (Amini and Wainwright, 2008).

One common method for obtaining interpretable principal components is to stipulate that they are sparse, i.e., maximize variance while containing at most k non-zero entries. This approach leads to the following non-convex mixed-integer quadratically constrained problem (see d’Aspremont et al., 2005):

max𝒙p𝒙𝚺𝒙s.t.𝒙𝒙=1,𝒙0k,\displaystyle\max_{\bm{x}\in\mathbb{R}^{p}}\ \bm{x}^{\top}\bm{\Sigma}\bm{x}\quad\text{s.t.}\quad\bm{x}^{\top}\bm{x}=1,\ ||\bm{x}||_{0}\leq k, (1)

where the constraint ||\bm{x}||_{0}\leq k forces the variance to be explained by at most k of the original features.
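For intuition, note that once the support of \bm{x} is fixed, Problem (1) reduces to a leading eigenvalue computation on the corresponding principal submatrix of \bm{\Sigma}. The following hedged Julia sketch (illustrative only, and tractable only for small p; names are ours) makes this explicit by enumerating all supports:

```julia
# A brute-force sketch of Problem (1) for small p: for each size-k support, the best
# unit vector is the leading eigenvector of the corresponding principal submatrix of Σ.
using LinearAlgebra, Combinatorics

function sparse_pca_bruteforce(Σ::AbstractMatrix, k::Int)
    p = size(Σ, 1)
    best_val, best_support = -Inf, Int[]
    for S in combinations(1:p, k)              # all (p choose k) candidate supports
        val = eigmax(Symmetric(Σ[S, S]))       # λ_max of the k × k principal submatrix
        if val > best_val
            best_val, best_support = val, collect(S)
        end
    end
    return best_val, best_support
end
```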

1.1 Background and Literature Review

Owing to sparse PCA’s fundamental importance in a variety of applications including best subset selection (d’Aspremont et al., 2008), natural language processing (Zhang et al., 2012), compressed sensing (Candes and Tao, 2007), and clustering (Luss and d’Aspremont, 2010), three distinct classes of methods for addressing Problem (1) have arisen. Namely, (a) heuristic methods which obtain high-quality sparse PCs in an efficient fashion but do not supply guarantees on the quality of the solution, (b) convex relaxations which obtain certifiably near-optimal solutions by solving a convex relaxation and rounding, and (c) exact methods which obtain certifiably optimal solutions, albeit in exponential time.

Heuristic Approaches:

The importance of identifying a small number of interpretable principal components has been well-documented in the literature since the work of Hotelling (1933) (see also Jeffers, 1967), giving rise to many distinct heuristic approaches for obtaining high-quality solutions to Problem (1). Two interesting such approaches are to rotate dense principal components to promote sparsity (Kaiser, 1958; Richman, 1986; Jolliffe, 1995), or to apply an \ell_{1} penalty term as a convex surrogate for the cardinality constraint (Jolliffe et al., 2003; Zou et al., 2006). Unfortunately, the former approach does not provide performance guarantees, while the latter approach still results in a non-convex optimization problem.

More recently, motivated by the need to rapidly obtain high-quality sparse principal components at scale, a wide variety of first-order heuristic methods have emerged. The first such modern heuristic was developed by Journée et al. (2010), and involves combining the power method with thresholding and re-normalization steps. By pursuing similar ideas, several related methods have since been developed (see Witten et al., 2009; Hein and Bühler, 2010; Richtárik et al., 2020; Luss and Teboulle, 2013; Yuan and Zhang, 2013). Unfortunately, while these methods are often very effective in practice, they sometimes badly fail to recover an optimal sparse principal component, and a practitioner using a heuristic method typically has no way of knowing when this has occurred. Indeed, Berk and Bertsimas (2019) recently compared 7 heuristic methods, including most of those reviewed here, on 14 instances of sparse PCA, and found that none of the heuristic methods successfully recovered an optimal solution in all 14 cases (i.e., no heuristic was right all the time).

Convex Relaxations:

Motivated by the shortcomings of heuristic approaches on high-dimensional datasets, and the successful application of semi-definite optimization in obtaining high-quality approximation bounds in other applications (see Goemans and Williamson, 1995; Wolkowicz et al., 2012), a variety of convex relaxations have been proposed for sparse PCA. The first such convex relaxation was proposed by d’Aspremont et al. (2005), who reformulated sparse PCA as the rank-constrained mixed-integer semidefinite optimization problem (MISDO):

max𝑿𝟎𝚺,𝑿s.t.tr(𝑿)=1,𝑿0k2,Rank(𝑿)=1,\displaystyle\max_{\bm{X}\succeq\bm{0}}\quad\langle\bm{\Sigma},\bm{X}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \|\bm{X}\|_{0}\leq k^{2},\ \mathrm{Rank}(\bm{X})=1, (2)

where \bm{X} models the outer product \bm{x}\bm{x}^{\top}. Note that, for a rank-one matrix \bm{X}, the constraint \|\bm{X}\|_{0}\leq k^{2} in (2) is equivalent to the constraint \|\bm{x}\|_{0}\leq k in (1), since a vector \bm{x} is k-sparse if and only if its outer product \bm{x}\bm{x}^{\top} is k^{2}-sparse. After performing this reformulation, d’Aspremont et al. (2005) relaxed both the cardinality and rank constraints and instead solved

max𝑿𝟎𝚺,𝑿s.t.tr(𝑿)=1,𝑿1k,\displaystyle\max_{\bm{X}\succeq\bm{0}}\quad\langle\bm{\Sigma},\bm{X}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \|\bm{X}\|_{1}\leq k, (3)

which supplies a valid upper bound on Problem (1)’s objective.
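For concreteness, a hedged JuMP sketch of relaxation (3) follows (the function name and the optimizer argument are ours; any SDP-capable conic solver, e.g., Mosek, can be passed in):

```julia
# A hedged sketch of the d'Aspremont et al. (2005) relaxation (3): maximize <Σ, X>
# over PSD matrices with unit trace and an ℓ1-norm bound of k.
using JuMP, LinearAlgebra

function daspremont_relaxation(Σ::AbstractMatrix, k::Int, optimizer)
    p = size(Σ, 1)
    model = Model(optimizer)
    @variable(model, X[1:p, 1:p], PSD)
    @variable(model, U[1:p, 1:p] >= 0)     # U_{ij} models |X_{ij}|
    @constraint(model, tr(X) == 1)
    @constraint(model, U .>= X)
    @constraint(model, U .>= -X)
    @constraint(model, sum(U) <= k)        # the ℓ1 constraint ||X||_1 <= k
    @objective(model, Max, dot(Σ, X))
    optimize!(model)
    return objective_value(model)          # a valid upper bound on Problem (1)
end
```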

The semidefinite approach has since been refined in a number of follow-up works. Among others, d’Aspremont et al. (2008), building upon the work of Ben-Tal and Nemirovski (2002), proposed a different semidefinite relaxation which supplies a sufficient condition for optimality via the primal-dual KKT conditions, and d’Aspremont et al. (2014) analyzed the quality of the semidefinite relaxation in order to obtain high-quality approximation bounds. A common theme in these approaches is that they require solving large-scale semidefinite optimization problems. This presents difficulties for practitioners because state-of-the-art implementations of interior point methods such as Mosek require O(p^{6}) memory to solve Problem (3), and therefore currently cannot solve instances of Problem (3) with p\geq 300 (see Bertsimas and Cory-Wright, 2020, for a recent comparison). Techniques other than interior point methods, e.g., the ADMM or augmented Lagrangian methods reviewed in Majumdar et al. (2019), could also be used to solve Problem (3), although they tend to require more runtime than IPMs to obtain a solution of similar accuracy and to be numerically unstable at problem sizes where IPMs run out of memory (Majumdar et al., 2019).

A number of works have also studied the statistical estimation properties of Problem (3) by assuming an underlying probabilistic model. Among others, Amini and Wainwright (2008) have demonstrated the asymptotic consistency of Problem (3) under a spiked covariance model once the number of samples used to generate the covariance matrix exceeds a certain threshold; see Vu and Lei (2012); Berthet and Rigollet (2013); Wang et al. (2016) for further results in this direction, and Miolane (2018) for a recent survey.

In a complementary direction, Dey et al. (2018) have recently questioned the modeling paradigm of lifting \bm{x} to a higher dimensional space, by instead considering the following (tighter) relaxation of sparse PCA in the original problem space:

\max_{\bm{x}\in\mathbb{R}^{p}}\quad\bm{x}^{\top}\bm{\Sigma}\bm{x}\quad\text{s.t.}\quad\|\bm{x}\|_{2}=1,\ \|\bm{x}\|_{1}\leq\sqrt{k}. (4)

Interestingly, Problem (4)’s relaxation provides a \left(1+\sqrt{\tfrac{k}{k+1}}\right)^{2}-factor approximation bound on Problem (1)’s objective, while Problem (3)’s upper bound may be exponentially larger in the worst case (Amini and Wainwright, 2008). This additional tightness, however, comes at a price: Problem (4) is NP-hard to solve—indeed, providing a constant-factor guarantee on sparse PCA is NP-hard (Magdon-Ismail, 2017)—and thus (4) is best formulated as a MIO, while Problem (3) can be solved in polynomial time.

More recently, by building on the work of Kim and Kojima (2001), Bertsimas and Cory-Wright (2020) introduced a second-order cone relaxation of (2) which scales to p=1,000s, and matches the semidefinite bound after imposing a small number of cuts. Moreover, it typically supplies bound gaps of less than 5\%. However, it does not supply an exact certificate of optimality, which is often desirable, for instance in medical applications.

A fundamental drawback of existing convex relaxation techniques is that they are not coupled with rounding schemes for obtaining high-quality feasible solutions. This is problematic, because optimizers are typically interested in obtaining high-quality solutions, rather than certificates. In this paper, we take a step in this direction, by deriving new convex relaxations that naturally give rise to greedy and random rounding schemes. The fundamental point of difference between our relaxations and existing relaxations is that we derive our relaxations by rewriting sparse PCA as a MISDO and dropping an integrality constraint, rather than using more ad-hoc techniques.

Exact Methods:

Motivated by the successful application of mixed-integer optimization for solving statistical learning problems such as best subset selection (Bertsimas and Van Parys, 2020) and sparse classification (Bertsimas et al., 2017), several exact methods for solving sparse PCA to certifiable optimality have been proposed. The first branch-and-bound algorithm for solving Problem (1) was proposed by Moghaddam et al. (2006), by applying norm equivalence relations to obtain valid bounds. However, Moghaddam et al. (2006) did not couple their approach with high-quality initial solutions and tractable bounds to prune partial solutions. Consequently, they could not scale their approach beyond p=40.

A more sophisticated branch-and-bound scheme was recently proposed by Berk and Bertsimas (2019), which couples tighter Gershgorin Circle Theorem bounds (Horn and Johnson, 1990, Chapter 6) with a fast heuristic due to Yuan and Zhang (2013) to solve problems up to p=250. However, their method cannot scale beyond p=100s, because the bounds obtained are too weak to avoid enumerating a sizeable portion of the tree.

Recently, the authors developed a framework for reformulating convex mixed-integer optimization problems with logical constraints (see Bertsimas et al., 2019), and demonstrated that this framework allows a number of problems of practical relevance to be solved to certifiable optimality via a cutting-plane method. In this paper, we build upon this work by reformulating Problem (1) as a convex mixed-integer semidefinite optimization problem, and leverage this reformulation to design a cutting-plane method which solves sparse PCA to certifiable optimality. A key feature of our approach is that we need not solve any semidefinite subproblems. Rather, we use concepts from SDO to design a semidefinite-free approach which uses simple linear algebra techniques.

Concurrently with our initial submission, Li and Xie (2020) also reformulated sparse PCA as an MISDO, and proposed valid inequalities for strengthening their formulation and local search algorithms for obtaining high-quality solutions at scale. Our work differs in the following two ways. First, we propose strengthening the MISDO formulation using the Gershgorin circle theorem and demonstrate that this allows our MISDO formulation to scale to problems with p=100s of features, while they do not, to our knowledge, solve any MISDOs to certifiable optimality where p>13. Second, we develop tractable second-order cone relaxations and greedy rounding schemes which allow practitioners to obtain certifiably near-optimal sparse principal components even in the presence of p=1,000s of features. More remarkable than the differences between the works, however, are the similarities: more than 15 years after d’Aspremont et al. (2005)’s landmark paper first appeared, both works proposed reformulating sparse PCA as an MISDO less than a week apart. In our view, this demonstrates that the ideas contained in both works transcend sparse PCA, and can perhaps be applied to other problems in the optimization literature which have not yet been formulated as MISDOs.

1.2 Contributions and Structure

The main contributions of the paper are twofold. First, we reformulate sparse PCA exactly as a mixed-integer semidefinite optimization problem; a reformulation which is, to the best of our knowledge, novel. Second, we leverage this MISDO formulation to design efficient algorithms for solving non-convex mixed-integer quadratic optimization problems, such as sparse PCA, to certifiable optimality or within 1-2\% of optimality in practice at a larger scale than existing state-of-the-art methods. The structure and detailed contributions of the paper are as follows:

  • In Section 2.1, we reformulate Problem (1) as a mixed-integer SDO. We propose a cutting-plane method which solves it to certifiable optimality in Section 2.2. Our algorithm decomposes the problem into a purely binary master problem and a semidefinite separation problem. Interestingly, we show in Section 2.3 that the separation problems can be solved efficiently via a leading eigenvalue computation and do not require any SDO solver. Finally, the Gershgorin circle theorem has been empirically successful for deriving upper bounds on the objective value of (1) (Berk and Bertsimas, 2019). We theoretically analyze the quality of such bounds in Section 2.4 and show in Section 2.5 that tighter bounds derived from Brauer’s ovals of Cassini can also be imposed via mixed-integer second-order cone constraints.

  • In Section 3, we analyze the semidefinite reformulation’s convex relaxation, and introduce a greedy rounding scheme (Section 3.1) which supplies high-quality solutions to Problem (1) in polynomial time, together with a sub-optimality gap (see numerical experiments in Section 4). To further improve the quality of the rounded solution and the optimality gap, we introduce strengthening inequalities (Section 3.2). While solving the strengthened formulation exactly would result in an intractable MISDO problem, solving its relaxation and rounding the solution is an efficient strategy for returning high-quality solutions with a numerical certificate of near-optimality.

  • In Section 4, we apply the cutting-plane and random rounding methods to derive optimal and near optimal sparse principal components for problems in the UCI dataset. We also compare our method’s performance against the method of Berk and Bertsimas (2019), and find that our exact cutting-plane method performs comparably, while our relax+round approach successfully scales to problems an order of magnitude larger and often returns solutions which outperform the exact method at sizes which the exact method cannot currently scale to. A key feature of our numerical success is that we sidestep the computational difficulties in solving SDOs at scale by proposing semidefinite-free methods for solving the convex relaxations, i.e., solving second-order cone relaxations.

Notation:

We let nonbold face characters such as b denote scalars, lowercase bold faced characters such as \bm{x} denote vectors, uppercase bold faced characters such as \bm{X} denote matrices, and calligraphic uppercase characters such as \mathcal{Z} denote sets. We let [p] denote the set of running indices \{1,...,p\}. We let \mathbf{e} denote a vector of all 1’s, \bm{0} denote a vector of all 0’s, and \mathbb{I} denote the identity matrix, with dimension implied by the context.

We also use an assortment of matrix operators. We let \langle\cdot,\cdot\rangle denote the Euclidean inner product between two matrices, \|\cdot\|_{F} denote the Frobenius norm of a matrix, \|\cdot\|_{\sigma} denote the spectral norm of a matrix, \|\cdot\|_{*} denote the nuclear norm of a matrix, \bm{X}^{\dagger} denote the Moore-Penrose pseudoinverse of a matrix \bm{X}, and S_{+}^{p} denote the p\times p positive semidefinite cone; see Horn and Johnson (1990) for a general theory of matrix operators.

2 An Exact Mixed-Integer Semidefinite Optimization Algorithm

In Section 2.1, we reformulate Problem (1) as a convex mixed-integer semidefinite optimization problem. From this formulation, we propose an outer-approximation scheme (Section 2.2) which, as we show in Section 2.3, does not require solving any semidefinite problems. We improve the convergence of the algorithm by deriving high-quality upper bounds on Problem (1)’s objective value in Sections 2.4 and 2.5.

2.1 A Mixed-Integer Semidefinite Reformulation

Starting from the rank-constrained SDO formulation (2), we introduce binary variables z_{i} to model whether X_{i,j} is non-zero, via the logical constraint X_{i,j}=0 if z_{i}=0; note that we need not require that X_{i,j}=0 if z_{j}=0, since \bm{X} is a symmetric matrix. By enforcing the logical constraint via -M_{i,j}z_{i}\leq X_{i,j}\leq M_{i,j}z_{i} for sufficiently large M_{i,j}>0, Problem (2) becomes

max𝒛{0,1}p:𝒆𝒛kmax𝑿S+p\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad 𝚺,𝑿\displaystyle\langle\bm{\Sigma},\bm{X}\rangle
s.t. tr(𝑿)=1,Mi,jziXi,jMi,jzii,j[p],Rank(𝑿)=1.\displaystyle\mathrm{tr}(\bm{X})=1,\ -M_{i,j}z_{i}\leq X_{i,j}\leq M_{i,j}z_{i}\ \forall i,j\in[p],\ \mathrm{Rank}(\bm{X})=1.

To obtain a MISDO reformulation, we omit the rank constraint. In general, omitting a rank constraint generates a relaxation and induces some loss of optimality. Remarkably, the omission is without loss of optimality in this case. Indeed, the objective is linear (hence convex) and therefore some rank-one extreme matrix \bm{X} is optimal. We formalize this observation in the following theorem; note that a similar result—although in the context of computing Restricted Isometry constants and with a different proof—exists (Gally and Pfetsch, 2016):

Theorem 1

Problem (1) attains the same optimal objective value as the problem:

\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad \langle\bm{\Sigma},\bm{X}\rangle (5)
\text{s.t.}\quad \mathrm{tr}(\bm{X})=1,
\quad\quad |X_{i,j}|\leq M_{i,j}z_{i}\quad\forall i,j\in[p],
\quad\quad \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i}\quad\forall i\in[p],

where M_{i,i}=1, and M_{i,j}=\frac{1}{2} if j\neq i.

Remark 2

Observe that if k\leq\sqrt{n} and we set M_{i,j}=1\ \forall i,j\in[p] in Problem (5) and omit the valid inequality \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i}, then the optimal value of the continuous relaxation is trivially \lambda_{\max}(\bm{\Sigma}). Indeed, letting \bm{x} be a leading eigenvector of the unconstrained problem (where \|\bm{x}\|_{2}=1), we can set z_{i}=|x_{i}|\geq|x_{i}||x_{j}|, where the inequality holds since \|\bm{x}\|_{2}=1, and X_{i,j}=x_{i}x_{j}, meaning (a) \sum_{i}z_{i}=\|\bm{x}\|_{1}\leq k\leq\sqrt{n} by norm equivalence and (b) |X_{i,j}|\leq z_{i}, and thus (\bm{X},\bm{z}) solves this continuous relaxation. Therefore, setting M_{i,j}=\frac{1}{2} if j\neq i and/or imposing the valid inequality \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i} is necessary for obtaining non-trivial relaxations whenever k is small.

Proof  It suffices to demonstrate that for any feasible solution to (1) we can construct a feasible solution to (5) with an equal or greater payoff, and vice versa.

  • Let \bm{x}\in\mathbb{R}^{p} be a feasible solution to (1). Then, since \|\bm{x}\|_{1}\leq\sqrt{k}, (\bm{X}:=\bm{x}\bm{x}^{\top},\bm{z}) is a feasible solution to (5) with equal cost, where z_{i}=1 if |x_{i}|>0, and z_{i}=0 otherwise.

  • Let (\bm{X},\bm{z}) be a feasible solution to Problem (5), and let \bm{X}=\sum_{i=1}^{p}\sigma_{i}\bm{x}_{i}\bm{x}_{i}^{\top} be a Cholesky decomposition of \bm{X}, where \bm{e}^{\top}\bm{\sigma}=1,\ \bm{\sigma}\geq\bm{0}, and \|\bm{x}_{i}\|_{2}=1\ \forall i\in[p]. Observe that \|\bm{x}_{i}\|_{0}\leq k\ \forall i\in[p], since we can perform the Cholesky decomposition on the submatrix of \bm{X} induced by \bm{z}, and “pad” out the remaining entries of each \bm{x}_{i} with 0s to obtain the decomposition of \bm{X}. Therefore, let us set \hat{\bm{x}}:=\arg\max_{i}[\bm{x}_{i}^{\top}\bm{\Sigma}\bm{x}_{i}]. Then, \hat{\bm{x}} is a feasible solution to (1) with an equal or greater payoff.

Finally, we let M_{i,i}=1 and M_{i,j}=\frac{1}{2} if i\neq j, as the 2\times 2 minors imply X_{i,j}^{2}\leq X_{i,i}X_{j,j}\leq\frac{1}{4} whenever i\neq j (c.f. Gally and Pfetsch, 2016, Lemma 1).
Theorem 1 reformulates Problem (1) as a mixed-integer SDO. Therefore, we can solve Problem (5) using general branch-and-cut techniques for semidefinite optimization problems (see Gally et al., 2018; Kobayashi and Takano, 2020). However, this approach is not scalable, as it involves solving a large number of semidefinite subproblems, and the community does not know how to efficiently warm-start interior point methods (IPMs) for SDOs.

Alternatively, we propose a saddle-point reformulation of Problem (5) which avoids the computational difficulty of solving a large number of SDOs by exploiting problem structure, as we will show in Section 2.3. The following result reformulates Problem (5) as a max-min saddle-point problem amenable to outer-approximation:

Theorem 3

Problem (5) attains the same optimal value as the following problem:

\max_{\bm{z}\in\{0,1\}^{p}:\ \bm{e}^{\top}\bm{z}\leq k}\quad f(\bm{z}) (6)
\text{where}\quad f(\bm{z}):=\min_{\lambda\in\mathbb{R},\ \bm{\alpha}\in\mathbb{R}^{p\times p},\ \bm{\beta}\in\mathbb{R}^{p}}\quad \lambda+\sum_{i=1}^{p}z_{i}\left(\sum_{j=1}^{p}M_{i,j}\max(0,|\alpha_{i,j}|-\beta_{i})+\sqrt{k}\beta_{i}\right) (7)
\text{s.t.}\quad \lambda\mathbb{I}+\bm{\alpha}\succeq\bm{\Sigma}.
Remark 4

The above theorem demonstrates that f(\bm{z}) is concave in \bm{z}, by rewriting it as the infimum of functions which are linear in \bm{z} (Boyd and Vandenberghe, 2004).

Proof  Let us introduce auxiliary variables Ui,jU_{i,j} to model the absolute value of Xi,jX_{i,j} and rewrite the inner optimization problem of (5) as

f(\bm{z}):=\max_{\bm{X}\succeq\bm{0},\ \bm{U}}\quad \langle\bm{\Sigma},\bm{X}\rangle (8)
\text{s.t.}\quad \mathrm{tr}(\bm{X})=1, \quad [\lambda]
\quad\quad U_{i,j}\leq M_{i,j}z_{i}\quad \forall i,j\in[p], \quad [\sigma_{i,j}]
\quad\quad |X_{i,j}|\leq U_{i,j}\quad \forall i,j\in[p], \quad [\alpha_{i,j}]
\quad\quad \sum_{j=1}^{p}U_{i,j}\leq\sqrt{k}z_{i}\quad \forall i\in[p], \quad [\beta_{i}]

where we associate dual constraint multipliers with primal constraints in square brackets. For \bm{z} such that \bm{e}^{\top}\bm{z}\geq 1, the maximization problem induced by f(\bm{z}) satisfies Slater’s condition (see, e.g., Boyd and Vandenberghe, 2004, Chapter 5.2.3), so strong duality applies and leads to

f(𝒛)=minλ𝝈,𝜶,𝜷𝟎\displaystyle f(\bm{z})=\min_{\begin{subarray}{c}\lambda\\ \bm{\sigma},\bm{\alpha},\bm{\beta}\geq\bm{0}\end{subarray}}\quad λ+i,jσi,jMi,jzi+i=1pβikzi\displaystyle\lambda+\sum_{i,j}\sigma_{i,j}M_{i,j}z_{i}+\sum_{i=1}^{p}\beta_{i}\sqrt{k}z_{i}
s.t. λ𝕀+𝜶𝚺,|αi,j|σi,j+βi.\displaystyle\lambda\mathbb{I}+\bm{\alpha}\succeq\bm{\Sigma},|\alpha_{i,j}|\leq\sigma_{i,j}+\beta_{i}.

We eliminate 𝝈\bm{\sigma} from the dual problem above by optimizing over σi,j\sigma_{i,j} and setting σi,j=max(0,|αi,j|βi)\sigma^{\star}_{i,j}=\max(0,|\alpha_{i,j}|-\beta_{i}).

Note that for 𝒛=𝟎\bm{z}=\bm{0}, the primal subproblem is infeasible and the dual subproblem has objective -\infty, but this can safely be ignored since 𝒛=𝟎\bm{z}=\bm{0} is certainly suboptimal.  

2.2 A Cutting-Plane Method

Theorem 3 shows that evaluating f(\hat{\bm{z}}) yields the globally valid overestimator:

f(\bm{z})\leq f(\hat{\bm{z}})+\bm{g}_{\hat{\bm{z}}}^{\top}(\bm{z}-\hat{\bm{z}}),

where \bm{g}_{\hat{\bm{z}}} is a supergradient of f at \hat{\bm{z}}, at no additional cost. In particular, we have

g_{\hat{\bm{z}},i}=\sum_{j=1}^{p}M_{i,j}\max\left(0,|\alpha_{i,j}^{\star}(\hat{\bm{z}})|-\beta_{i}^{\star}(\hat{\bm{z}})\right)+\sqrt{k}\beta_{i}^{\star}(\hat{\bm{z}}),

where \bm{\alpha}^{\star}(\hat{\bm{z}}), \bm{\beta}^{\star}(\hat{\bm{z}}) constitute an optimal choice of (\bm{\alpha},\bm{\beta}) for a fixed \hat{\bm{z}}. This observation leads to an efficient strategy for maximizing f(\bm{z}): iteratively maximizing and refining a piecewise linear upper estimator of f(\bm{z}). This strategy is called outer-approximation (OA), and was originally proposed by Duran and Grossmann (1986). OA works by iteratively constructing estimators of the following form at each iteration t:

f^{t}(\bm{z})=\min_{1\leq i\leq t}\left\{f(\bm{z}_{i})+\bm{g}_{\bm{z}_{i}}^{\top}(\bm{z}-\bm{z}_{i})\right\}. (9)

After constructing each overestimator, we maximize f^{t}(\bm{z}) over \{0,1\}^{p} to obtain \bm{z}_{t}, and evaluate f(\cdot) and its supergradient at \bm{z}_{t}. This procedure yields a non-increasing sequence of overestimators \{f^{t}(\bm{z}_{t})\}_{t=1}^{T} which converge to the optimal value of f(\bm{z}) within a finite number of iterations T\leq{p\choose 1}+\ldots+{p\choose k}, since \{0,1\}^{p} is a finite set and OA never visits a point twice. Additionally, we can avoid solving a different MILO at each OA iteration by integrating the entire algorithm within a single branch-and-bound tree, as proposed by Quesada and Grossmann (1992), using lazy constraint callbacks. Lazy constraint callbacks are now standard components of modern MILO solvers such as Gurobi or CPLEX and substantially speed up OA. We formalize this procedure in Algorithm 1; note that \partial f(\bm{z}_{t+1}) denotes the set of supergradients of f at \bm{z}_{t+1}.

Algorithm 1 An outer-approximation method for Problem (1)
Input: Initial solution \bm{z}_{1}
t\leftarrow 1
repeat
  Compute \bm{z}_{t+1},\theta_{t+1} solution of
    \max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k,\ \theta}\ \theta\quad\text{s.t.}\quad\theta\leq f(\bm{z}_{i})+\bm{g}_{\bm{z}_{i}}^{\top}(\bm{z}-\bm{z}_{i})\ \forall i\in[t]
  Compute f(\bm{z}_{t+1}) and \bm{g}_{\bm{z}_{t+1}}\in\partial f(\bm{z}_{t+1}) by solving (7)
  t\leftarrow t+1
until \theta_{t}-f(\bm{z}_{t})\leq\varepsilon
return \bm{z}_{t}
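To make the procedure concrete, the following hedged Julia/JuMP sketch implements the basic multi-tree variant of Algorithm 1 (the single-tree lazy-callback implementation used in practice follows the same logic). The oracle `f_and_grad`, the function name, and the optimizer argument are ours; a suitable oracle is derived in Section 2.3:

```julia
# A hedged sketch of the outer-approximation loop in Algorithm 1. `f_and_grad(z)` is
# assumed to return f(z) and a supergradient; `optimizer` is a MILO solver (e.g., Gurobi).
using JuMP, LinearAlgebra

function outer_approximation(f_and_grad, p::Int, k::Int, z1::Vector{Bool}, optimizer;
                             ε = 1e-3, maxiter = 10_000)
    model = Model(optimizer)
    @variable(model, z[1:p], Bin)
    @variable(model, θ)
    @constraint(model, sum(z) <= k)
    @objective(model, Max, θ)
    zt = z1
    ft, gt = f_and_grad(zt)
    for _ in 1:maxiter
        # Add the overestimator cut θ ≤ f(z_t) + g_t'(z - z_t) and re-solve the master.
        @constraint(model, θ <= ft + dot(gt, z .- zt))
        optimize!(model)
        zt = round.(Bool, value.(z))
        ft, gt = f_and_grad(zt)
        value(θ) - ft <= ε && break    # master bound meets incumbent value: stop
    end
    return zt, ft   # for brevity we return the last iterate; a full implementation tracks the best incumbent
end
```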

2.3 A Semidefinite-free Subproblem Strategy

Our derivation and analysis of Algorithm 1 indicate that we can solve Problem (1) to certifiable optimality by solving a (potentially large) number of semidefinite subproblems (7), which might be prohibitive in practice. Therefore, we now derive a computationally efficient subproblem strategy which crucially does not require solving any semidefinite programs. Formally, we have the following result:

Theorem 5

For any \bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k, optimal dual variables in (7) are

\lambda=\lambda_{\max}\left(\bm{\Sigma}_{1,1}\right),\quad \hat{\bm{\alpha}}=\begin{pmatrix}\hat{\bm{\alpha}}_{1,1}&\hat{\bm{\alpha}}_{1,2}\\ \hat{\bm{\alpha}}_{1,2}^{\top}&\hat{\bm{\alpha}}_{2,2}\end{pmatrix}=\begin{pmatrix}\bm{0}&\bm{0}\\ \bm{0}&\bm{\Sigma}_{2,2}-\lambda\mathbb{I}+\bm{\Sigma}_{1,2}^{\top}\left(\lambda\mathbb{I}-\bm{\Sigma}_{1,1}\right)^{\dagger}\bm{\Sigma}_{1,2}\end{pmatrix}, (10)
\beta_{i}=(1-z_{i})\begin{pmatrix}|\hat{\alpha}_{i,1}|,|\hat{\alpha}_{i,2}|,\ldots,|\hat{\alpha}_{i,i}|,|\hat{\alpha}_{i,i}|,\ldots,|\hat{\alpha}_{i,p}|\end{pmatrix}_{[\lceil\sqrt{k}\,\rceil]}\quad\forall i\in[p], (11)

where \lambda_{\max}(\cdot) denotes the leading eigenvalue of a matrix, \hat{\bm{\alpha}}=\begin{pmatrix}\hat{\bm{\alpha}}_{1,1}&\hat{\bm{\alpha}}_{1,2}\\ \hat{\bm{\alpha}}_{1,2}^{\top}&\hat{\bm{\alpha}}_{2,2}\end{pmatrix} is a permutation such that \hat{\bm{\alpha}}_{1,1} (resp. \hat{\bm{\alpha}}_{2,2}) denotes the entries of \hat{\bm{\alpha}} where z_{i}=z_{j}=1 (resp. z_{i}=z_{j}=0), \bm{\Sigma} is partitioned similarly, and (\bm{x})_{[k]} denotes the kth largest element of \bm{x}.

Remark 6

By Theorem 5, Problem (7) can be solved by computing the leading eigenvalue of \bm{\Sigma}_{1,1} and solving a linear system. This justifies our claim that we need not solve any SDOs in our algorithmic strategy.

Proof  We appeal to strong duality and complementary slackness. Observe that, for any \bm{z}\in\{0,1\}^{p}, f(\bm{z}) is the optimal value of a maximization problem over a closed convex compact set. Therefore, there exists some optimal primal solution \bm{X}^{\star} without loss of generality. Moreover, since the primal has non-empty relative interior with respect to the non-affine constraints, it satisfies the Slater constraint qualification and strong duality holds (see, e.g., Boyd and Vandenberghe, 2004, Chapter 5.2.3). Therefore, by complementary slackness (see, e.g., Boyd and Vandenberghe, 2004, Chapter 5.5.2), there must exist some dual-optimal solution (\lambda,\hat{\bm{\alpha}},\bm{\beta}) which obeys complementarity with \bm{X}^{\star}. Moreover, |X_{i,j}|\leq M_{i,j} is implied by \mathrm{tr}(\bm{X})=1,\ \bm{X}\succeq\bm{0}, while \sum_{j=1}^{p}|X_{i,j}|\leq z_{i}\sqrt{k} is implied by |X_{i,j}|\leq M_{i,j}z_{i} and \bm{e}^{\top}\bm{z}\leq k. Therefore, by complementary slackness, we can take the constraints |X_{i,j}|\leq M_{i,j}z_{i} and \sum_{j=1}^{p}|X_{i,j}|\leq z_{i}\sqrt{k} to be inactive when z_{i}=1 without loss of generality, which implies that \hat{\alpha}_{i,j}^{\star},\beta_{i}^{\star}=0 if z_{i}=1 in some dual-optimal solution. Moreover, we also have \hat{\alpha}_{i,j}^{\star}=0 if z_{j}=1, since \hat{\bm{\alpha}} obeys the dual feasibility constraint \lambda\mathbb{I}+\hat{\bm{\alpha}}\succeq\bm{\Sigma}, and therefore is itself symmetric.

Next, observe that, by strong duality, \lambda=\lambda_{\max}(\bm{\Sigma}_{1,1}) in this dual-optimal solution, since \bm{\alpha} only takes non-zero values if z_{i}=z_{j}=0 and does not contribute to the objective, and \bm{\beta} is similar.

Next, observe that, by strong duality and complementary slackness, any dual feasible (\lambda,\hat{\bm{\alpha}},\bm{\beta}) satisfying the above conditions is dual-optimal. Therefore, we need to find an \hat{\bm{\alpha}}_{2,2} such that

\begin{pmatrix}\lambda\mathbb{I}-\bm{\Sigma}_{1,1}&-\bm{\Sigma}_{1,2}\\ -\bm{\Sigma}_{2,1}&\lambda\mathbb{I}+\hat{\bm{\alpha}}_{2,2}-\bm{\Sigma}_{2,2}\end{pmatrix}\succeq\bm{0}.

By the generalized Schur complement lemma (see Boyd et al., 1994, Equation 2.41), this is PSD if and only if

  1. \lambda\mathbb{I}-\bm{\Sigma}_{1,1}\succeq\bm{0},

  2. \left(\mathbb{I}-(\lambda\mathbb{I}-\bm{\Sigma}_{1,1})(\lambda\mathbb{I}-\bm{\Sigma}_{1,1})^{\dagger}\right)\bm{\Sigma}_{1,2}=\bm{0}, and

  3. \lambda\mathbb{I}+\hat{\bm{\alpha}}_{2,2}-\bm{\Sigma}_{2,2}\succeq\bm{\Sigma}_{1,2}^{\top}\left(\lambda\mathbb{I}-\bm{\Sigma}_{1,1}\right)^{\dagger}\bm{\Sigma}_{1,2}.

The first two conditions hold because, as argued above, λ\lambda is optimal and therefore feasible, and the conditions are independent of 𝜶^2,2\hat{\bm{\alpha}}_{2,2}. Therefore, it suffices to pick 𝜶^2,2\hat{\bm{\alpha}}_{2,2} in order that the third condition holds. We achieve this by setting 𝜶^2,2\hat{\bm{\alpha}}_{2,2} so the PSD constraint in condition (3) holds with equality.

Finally, let us optimize for \bm{\beta} to obtain stronger cuts (when z_{i}=0 we can pick any feasible \beta_{i}, but optimizing to set \partial f(\bm{z})_{i} to be as small as possible gives stronger cuts). This is equivalent to solving the following univariate minimization problem for each \beta_{i}:

\min_{\beta_{i}}\left(\sum_{j=1}^{p}M_{i,j}\max(0,|\alpha_{i,j}|-\beta_{i})+\sqrt{k}\beta_{i}\right).

Moreover, it is a standard result from max-k optimization (see, e.g., Zakeri et al., 2014; Todd, 2018) that this is achieved by setting \beta_{i} to be the \lceil\sqrt{k}\,\rceil largest element of \{\alpha_{i,j}\}_{j\in[p]}\cup\{\alpha_{i,i}\} in absolute magnitude, where we include \alpha_{i,i} twice since M_{i,i}=1 while M_{i,j}=1/2 if j\neq i.
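The following hedged Julia sketch (names ours) implements this subproblem strategy: it computes f(\bm{z}) via a leading eigenvalue, recovers \hat{\bm{\alpha}} and \bm{\beta} from (10)-(11) using a pseudoinverse, and assembles the supergradient used in Algorithm 1's cuts:

```julia
# A hedged sketch of the semidefinite-free subproblem strategy of Theorem 5: returns
# f(z) = λ_max(Σ_{1,1}) and a supergradient assembled from the dual variables (10)-(11).
using LinearAlgebra

function evaluate_fz(Σ::AbstractMatrix, z::AbstractVector{Bool}, k::Int)
    p = size(Σ, 1)
    S, C = findall(z), findall(.!z)                 # support (z_i = 1) and complement
    λ = isempty(S) ? 0.0 : eigmax(Symmetric(Σ[S, S]))
    α = zeros(p, p)
    if !isempty(C) && !isempty(S)
        α[C, C] = Σ[C, C] - λ * I +
                  Σ[S, C]' * pinv(λ * I - Σ[S, S]) * Σ[S, C]
    elseif !isempty(C)
        α[C, C] = Σ[C, C] - λ * I
    end
    # β_i is the ⌈√k⌉-th largest |α_{i,j}|, with the diagonal entry counted twice.
    β = zeros(p)
    for i in C
        vals = sort(vcat(abs.(α[i, :]), abs(α[i, i])), rev = true)
        β[i] = vals[min(ceil(Int, sqrt(k)), length(vals))]
    end
    # Supergradient: g_i = Σ_j M_{ij} max(0, |α_{ij}| - β_i) + √k β_i, with M_{ii} = 1.
    M = fill(0.5, p, p); M[diagind(M)] .= 1.0
    g = [sum(M[i, j] * max(0.0, abs(α[i, j]) - β[i]) for j in 1:p) + sqrt(k) * β[i]
         for i in 1:p]
    return λ, g                                     # f(z) and a supergradient
end
```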

2.4 Strengthening the Master Problem via the Gershgorin Circle Theorem

To accelerate Algorithm 1, we strengthen the master problem by imposing bounds derived from the Gershgorin circle theorem. Formally, we have the following result, which can be deduced from (Horn and Johnson, 1990, Theorem 6.1.1):

Theorem 7

For any vector \bm{z}\in\{0,1\}^{p} we have the following upper bound on f(\bm{z}):

f(\bm{z})\leq\max_{j\in[p]:z_{j}=1}\ \sum_{i\in[p]}z_{i}|\Sigma_{i,j}|. (12)

Observe that this bound cannot be used to directly strengthen Algorithm 1’s master problem, since the bound is not convex in \bm{z}. Nonetheless, it can be successfully applied if we (a) impose a big-M assumption on Problem (1)’s optimal objective and (b) introduce p additional binary variables \bm{s}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{s}=1 which model whether the ith Gershgorin disc is active; recall that each eigenvalue is contained in the union of the discs. Formally, we impose the following valid inequalities in the master problem:

\exists\bm{s}\in\{0,1\}^{p}:\quad \theta\leq\sum_{i\in[p]}z_{i}|\Sigma_{i,j}|+M(1-s_{j})\ \forall j\in[p],\quad \bm{e}^{\top}\bm{s}=1,\quad \bm{s}\leq\bm{z}, (13)

where \theta is the epigraph variable maximized in the master problem stated in Algorithm 1, and M is an upper bound on the sum of the k largest absolute entries in any column of \bm{\Sigma}. Note that we set \bm{s}\leq\bm{z} since, if z_{i}=0, the ith column of \bm{\Sigma} does not feature in the relevant submatrix of \bm{\Sigma}. In the above inequalities, a valid M is given by any bound on the optimal objective. Since Theorem 7 supplies one such bound for any given \bm{z}, we can compute

M:=\max_{j\in[p]}\ \max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \sum_{i\in[p]}z_{i}|\Sigma_{i,j}|, (14)

which can be done in O(p^{2}) time.

To further improve Algorithm 1, we also make use of the Gershgorin circle theorem before generating each outer-approximation cut. Namely, at a given node in a branch-and-bound tree, there are indices i where z_{i} has been fixed to 1, indices i where z_{i} has been fixed to 0, and indices i where z_{i} has not yet been fixed. Accordingly, we compute the worst-case Gershgorin bound by taking the worst-case bound over each index j such that z_{j} has not yet been fixed to 0, i.e.,

\max_{j:z_{j}\neq 0}\left\{\max_{\bm{s}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{s}\leq k}\left\{\sum_{i\in[p]}s_{i}|\Sigma_{i,j}|\ \text{s.t.}\ s_{i}=0\ \text{if}\ z_{i}=0,\ s_{i}=1\ \text{if}\ z_{i}=1\right\}\right\}.

If this bound is larger than our incumbent solution, then we generate an outer-approximation cut; otherwise, the entire subtree rooted at this node does not contain an optimal solution, and we instruct the solver to avoid exploring this node via a callback.
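A hedged Julia sketch of this node-level bound follows (index sets and names are ours); with no indices fixed, the same computation also recovers the constant M in (14):

```julia
# A hedged sketch of the worst-case Gershgorin bound at a branch-and-bound node, given
# the indices fixed to one (S1) and to zero (S0); the remaining indices are free.
function node_gershgorin_bound(Σ::AbstractMatrix, k::Int, S1::Vector{Int}, S0::Vector{Int})
    p = size(Σ, 1)
    free = setdiff(1:p, union(S1, S0))
    best = -Inf
    for j in setdiff(1:p, S0)                       # columns not yet fixed out
        fixed = sum(abs.(Σ[S1, j]))                 # entries forced into the disc radius
        budget = k - length(S1)                     # how many free entries may still be added
        extra = sum(sort(abs.(Σ[free, j]), rev = true)[1:min(budget, length(free))])
        best = max(best, fixed + extra)
    end
    return best
end
```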

Our numerical results in Section 4 echo the empirical findings of Berk and Bertsimas (2019) and indicate that Algorithm 1 performs substantially better when the Gershgorin bound is supplied in the master problem. Therefore, it is interesting to theoretically investigate the tightness, or at least the quality, of Gershgorin’s bound. We supply some results in this direction in the following proposition:

Proposition 8

Suppose that \bm{\Sigma} is a scaled diagonally dominant matrix as defined by Boman et al. (2005), i.e., there exists some vector \bm{d}>\bm{0} such that

d_{i}\Sigma_{i,i}\geq\sum_{j\in[p]:j\neq i}d_{j}|\Sigma_{i,j}|\quad \forall i\in[p].

Then, letting \rho:=\max_{i,j\in[p]}\{\tfrac{d_{i}}{d_{j}}\}, the Gershgorin circle theorem provides a (1+\rho)-factor approximation, i.e.,

f(\bm{z})\leq\max_{j\in[p]}\left\{\sum_{i\in[p]}z_{i}|\Sigma_{i,j}|\right\}\leq(1+\rho)f(\bm{z})\quad\forall\bm{z}\in\{0,1\}^{p}. (15)
Remark 9

Observe that, for a fixed \bm{z}, the ratio \rho:=\max_{i,j\in[p]}\{\tfrac{d_{i}}{d_{j}}\} need only be computed over indices i,j such that z_{i}=z_{j}=1. Moreover, for a partially specified \bm{z}—which might arise at an intermediate node in a branch-and-bound tree generated by Algorithm 1—the ratio \rho need only be computed over indices i where z_{i} is unspecified or set to 1. This suggests that the quality of the Gershgorin bound improves upon branching.

Remark 10

In particular, if \bm{\Sigma}\in S^{p}_{+} is a diagonal matrix, then Equation (12)’s bound is tight, which follows from the fact that the spectrum of \bm{\Sigma} and the discs coincide if and only if \bm{\Sigma} is diagonal (see, e.g., Horn and Johnson, 1990, Chapter 6). Alternatively, if \bm{\Sigma} is a diagonally dominant matrix, then \rho=1 and the Gershgorin circle theorem provides a 2-factor approximation.

Proof  Scaled diagonally dominant matrices have scaled diagonally dominant principal minors—this is trivially true because

diΣi,ij[p]:jidj|Σi,j|i[p]diΣi,ij[p]:jidjzj|Σi,j|i[p]:zi=1d_{i}\Sigma_{i,i}\geq\sum_{j\in[p]:j\neq i}d_{j}|\Sigma_{i,j}|\ \forall i\in[p]\implies d_{i}\Sigma_{i,i}\geq\sum_{j\in[p]:j\neq i}d_{j}z_{j}|\Sigma_{i,j}|\ \forall i\in[p]:z_{i}=1

for the same vector 𝒅>𝟎\bm{d}>\bm{0} and therefore the following chain of inequalities holds

f(𝒛)\displaystyle f(\bm{z})\leq maxj[p]{i[p]zi|Σi,j|}=maxj[p]{zjΣj,j+i[p]:jizi|Σi,j|}\displaystyle\max_{j\in[p]}\{\sum_{i\in[p]}z_{i}|\Sigma_{i,j}|\}=\max_{j\in[p]}\{z_{j}\Sigma_{j,j}+\sum_{i\in[p]:j\neq i}z_{i}|\Sigma_{i,j}|\}
maxj[p]{zjΣj,j+i[p]:jiρdidjzi|Σi,j|}(1+ρ)maxj[p]{zjΣj,j}(1+ρ)f(𝒛)𝒛{0,1}p,\displaystyle\leq\max_{j\in[p]}\{z_{j}\Sigma_{j,j}+\sum_{i\in[p]:j\neq i}\rho\frac{d_{i}}{d_{j}}z_{i}|\Sigma_{i,j}|\}\leq(1+\rho)\max_{j\in[p]}\{z_{j}\Sigma_{j,j}\}\leq(1+\rho)f(\bm{z})\quad\forall\bm{z}\in\{0,1\}^{p},

where the second inequality follows because ρdidj\rho\geq\frac{d_{i}}{d_{j}}, the third inequality follows from the scaled diagonal dominance of the principal submatrices of 𝚺\bm{\Sigma}, and the fourth inequality holds because the leading eigenvalue of a PSD matrix is at least as large as each diagonal.  

To make clear the extent to which our numerical success depends upon Theorem 7, our results in Section 4 present implementations of Algorithm 1 both with and without the bound.

2.5 Beyond Gershgorin: Further Strengthening via Brauer’s Ovals of Cassini

Given the relevance of Gershgorin’s bound, we propose, in this section, a stronger, yet more expensive to implement, upper bound, based on a generalization of the Gershgorin circle theorem, namely Brauer’s ovals of Cassini.

First, we derive a new upper bound on f(\bm{z}) that is at least as strong as the one presented in Theorem 7 and often strictly stronger (Horn and Johnson, 1990, Chapter 6):

Theorem 11

For any vector \bm{z}\in\{0,1\}^{p}, we have the following upper bound on f(\bm{z}):

f(\bm{z})\leq\max_{i,j\in[p]:i>j,\ z_{i}=z_{j}=1}\left\{\frac{\Sigma_{i,i}+\Sigma_{j,j}}{2}+\frac{\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}(\bm{z})R_{j}(\bm{z})}}{2}\right\}, (16)

where R_{i}(\bm{z}):=\sum_{j\in[p]:j\neq i}z_{j}|\Sigma_{i,j}| is the absolute sum of off-diagonal entries in the ith column of the submatrix of \bm{\Sigma} induced by \bm{z}.

Proof  Let us first recall that, per Brauer (1952)’s original result, all eigenvalues of a matrix \bm{\Sigma}\in S^{p}_{+} are contained in the union of the following p(p-1)/2 ovals of Cassini:

\bigcup_{i\in[p],j\in[p]:i<j}\left\{\lambda\in\mathbb{R}_{+}:|\lambda-\Sigma_{i,i}||\lambda-\Sigma_{j,j}|\leq R_{i}R_{j}\right\},

where R_{i}:=\sum_{j\in[p]:j\neq i}|\Sigma_{i,j}| is the absolute sum of off-diagonal entries in the ith column of \bm{\Sigma}. Next, let us observe that, if \lambda is a dominant eigenvalue of a PSD matrix \bm{\Sigma}, then \lambda\geq\Sigma_{i,i}\ \forall i and, in the (i,j)th oval, the bound reduces to

\lambda^{2}-\lambda(\Sigma_{i,i}+\Sigma_{j,j})+\Sigma_{i,i}\Sigma_{j,j}-R_{i}R_{j}\leq 0, (17)

which, by the quadratic formula, implies the upper bound \frac{\Sigma_{i,i}+\Sigma_{j,j}}{2}+\frac{\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}R_{j}}}{2}. The result follows because if z_{i}=0 the ith row of \bm{\Sigma} cannot be used to bound f(\bm{z}).
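A hedged Julia sketch evaluating the bound (16) for a fixed support follows (names ours; this is an illustration only, not the mixed-integer formulation discussed below):

```julia
# A hedged sketch evaluating the Brauer (ovals of Cassini) bound (16) for a fixed
# support of size at least two; R(i) is the off-diagonal absolute row sum within the support.
function cassini_bound(Σ::AbstractMatrix, support::Vector{Int})
    R(i) = sum(abs.(Σ[i, support])) - abs(Σ[i, i])
    best = -Inf
    for i in support, j in support
        i > j || continue
        oval = (Σ[i, i] + Σ[j, j]) / 2 +
               sqrt((Σ[i, i] - Σ[j, j])^2 + 4 * R(i) * R(j)) / 2
        best = max(best, oval)
    end
    return best
end
```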

Theorem 11’s inequality can be enforced numerically via mixed-integer second-order cone constraints. Indeed, the square root term in (16) can be modeled using a second-order cone constraint, and the bilinear terms only involve binary variables and can be linearized. Completing the square in Equation (17), we find that (16) is equivalent to the following system of p(p-1)/2 mixed-integer second-order cone inequalities:

\left(\theta-\frac{1}{2}(\Sigma_{i,i}+\Sigma_{j,j})\right)^{2}\leq\sum_{s,t\in[p]:s\neq i,t\neq j}W_{s,t}|\Sigma_{i,s}\Sigma_{j,t}|-\frac{3}{4}\Sigma_{i,i}\Sigma_{j,j}+M(1-s_{i,j})\quad \forall i,j\in[p]:i<j,
\sum_{i,j\in[p]:i<j}s_{i,j}=1,\quad s_{i,j}\leq\min(z_{i},z_{j})\quad \forall i,j\in[p]:i<j,\quad s_{i,j}\in\{0,1\}\quad \forall i,j\in[p]:i<j,

where W_{i,j}=z_{i}z_{j} is a product of binary variables which can be modeled using, e.g., the McCormick inequalities \max(0,z_{i}+z_{j}-1)\leq W_{i,j}\leq\min(z_{i},z_{j}), and M is an upper bound on the right-hand side of the inequality for any i,j:i\neq j, which can be computed in O(p^{3}) time in much the same manner as the big-M constant was computed in the previous section. Note that we do not make use of these inequalities directly in our numerical experiments, due to their high computational cost. However, an interesting extension would be to introduce the binary variables dynamically, via branch-and-cut-and-price (Barnhart et al., 1998).

Since the bound derived from the ovals of Cassini (Theorem 11) is at least as strong as the Gershgorin circle bound (Theorem 7), it satisfies the same approximation guarantee (Proposition 8). In particular, it is tight when \bm{\Sigma} is diagonal and provides a 2-factor approximation for diagonally dominant matrices. In fact, we now prove a stronger result and demonstrate that Theorem 11 provides a 2-factor bound on f(\bm{z}) for doubly diagonally dominant matrices—a broader class of matrices than diagonally dominant matrices (see Li and Tsatsomeros, 1997, for a general theory):

Proposition 12

Let \bm{\Sigma}\in S^{p}_{+} be a doubly diagonally dominant matrix, i.e.,

\Sigma_{i,i}\Sigma_{j,j}\geq R_{i}R_{j}\quad \forall i,j\in[p]:i>j,

where R_{i}:=\sum_{j\in[p]:j\neq i}|\Sigma_{i,j}| is the sum of the absolute off-diagonal entries in the ith column of \bm{\Sigma}. Then, we have that

f(\bm{z})\leq\max_{i,j\in[p]:i>j,\ z_{i}=z_{j}=1}\left\{\frac{\Sigma_{i,i}+\Sigma_{j,j}}{2}+\frac{\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}(\bm{z})R_{j}(\bm{z})}}{2}\right\}\leq 2f(\bm{z}). (18)

Proof  Observe that if \Sigma_{i,i}\Sigma_{j,j}\geq R_{i}R_{j}, then

\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}R_{j}}\leq\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4\Sigma_{i,i}\Sigma_{j,j}}=\Sigma_{i,i}+\Sigma_{j,j}.

The result then follows in essentially the same fashion as Proposition 8.  

3 Convex Relaxations and Rounding Methods

For large-scale instances, high-quality solutions can be obtained by solving a convex relaxation of Problem (5) and rounding the optimal solution. In Section 3.1, we propose relaxing \bm{z}\in\{0,1\}^{p} in (5) to \bm{z}\in[0,1]^{p} and applying a greedy rounding scheme. We further tighten this relaxation using second-order cone constraints in Section 3.2.

3.1 A Boolean Relaxation and a Greedy Rounding Method

We first consider a Boolean relaxation of (5), which we obtain by relaxing \bm{z}\in\{0,1\}^{p} to \bm{z}\in[0,1]^{p}. (Note that we omit the \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i} constraints when we develop our convex relaxations, since they are essentially dominated by the \|\bm{X}\|_{1}\leq k constraint we introduce in the next section; we introduced these inequalities to improve our semidefinite-free subproblem strategy for the exact method.) This gives \max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}\ f(\bm{z}), i.e.,

\max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\succeq\bm{0}}\quad \langle\bm{\Sigma},\bm{X}\rangle\quad \text{s.t.}\quad \mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}z_{i}\ \forall i,j\in[p]. (19)

A useful strategy for obtaining a high-quality feasible solution is to solve (19) and set z_{i}=1 for the k indices corresponding to the largest z_{i}^{\star}’s in (19), as proposed (in randomized form, for general integer optimization problems) by Raghavan and Tompson (1987). We formalize this in Algorithm 2. We remark that rounding strategies for sparse PCA have previously been proposed (see Fountoulakis et al., 2017; Dey et al., 2017; Chowdhury et al., 2020); however, the idea of rounding \bm{z} and then optimizing for \bm{X} appears to be new.

Algorithm 2 A greedy rounding method for Problem (1)
Input: Covariance matrix \bm{\Sigma}, sparsity parameter k
Compute \bm{z}^{\star} solution of (19) or (20)
Construct \bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}=k such that z_{i}\geq z_{j} if z^{\star}_{i}\geq z^{\star}_{j}
Compute \bm{X} solution of
  \max_{\bm{X}\in S^{p}_{+}}\ \langle\bm{\Sigma},\bm{X}\rangle\quad \text{s.t.}\quad \mathrm{tr}(\bm{X})=1,\ X_{i,j}=0\ \text{if}\ z_{i}z_{j}=0\ \forall i,j\in[p]
return \bm{z},\bm{X}
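The following hedged Julia sketch (names ours) captures the rounding step: it keeps the k largest entries of the relaxed \bm{z}^{\star}, and the optimal \bm{X} over that support is \bm{x}\bm{x}^{\top} where \bm{x} is the leading eigenvector of the corresponding principal submatrix of \bm{\Sigma}:

```julia
# A hedged sketch of the greedy rounding in Algorithm 2: round z⋆ to the k largest
# entries, then recover the leading eigenvector of Σ's principal submatrix, padded with zeros.
using LinearAlgebra

function greedy_round(Σ::AbstractMatrix, zstar::AbstractVector, k::Int)
    support = sortperm(zstar, rev = true)[1:k]      # indices of the k largest z⋆_i
    λs, V = eigen(Symmetric(Σ[support, support]))   # eigenvalues in ascending order
    x = zeros(length(zstar))
    x[support] = V[:, end]                          # leading eigenvector on the support
    return x, λs[end]                               # k-sparse PC and its explained variance
end
```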
Remark 13

Our numerical results in Section 4 reveal that explicitly imposing a PSD constraint on \bm{X} in the relaxation (19)—or the ones derived in the following section—prevents our approximation algorithm from scaling to larger problem sizes than the exact Algorithm 1 can already solve. Therefore, to improve scalability, the semidefinite cone can be safely approximated via its second-order cone relaxation, X_{i,j}^{2}\leq X_{i,i}X_{j,j}\ \forall i,j\in[p], plus a small number of cuts of the form \langle\bm{X},\bm{x}_{t}\bm{x}_{t}^{\top}\rangle\geq 0, as presented in Bertsimas and Cory-Wright (2020).

Remark 14

Rather than relaxing and greedily rounding \bm{z}, one could consider a higher dimensional relax-and-round scheme where we let \bm{Z} model the outer product \bm{z}\bm{z}^{\top} via \bm{Z}\succeq\bm{z}\bm{z}^{\top}, \max(0,z_{i}+z_{j}-1)\leq Z_{i,j}\leq\min(z_{i},z_{j})\ \forall i,j\in[p], Z_{i,i}=z_{i}, and require that \sum_{i,j\in[p]}Z_{i,j}\leq k^{2}. Indeed, a natural “round” component of such a relax-and-round scheme is precisely Goemans-Williamson rounding (Goemans and Williamson, 1995; Bertsimas and Ye, 1998), which performs at least as well as greedy rounding in both theory and practice. Unfortunately, some preliminary numerical experiments indicated that Goemans-Williamson rounding is not actually much better than greedy rounding in practice, and is considerably more expensive to implement. Therefore, we defer the details of the Goemans-Williamson scheme to Appendix A, and do not consider it any further in this paper.

3.2 Valid Inequalities for Strengthening Convex Relaxations

We now propose valid inequalities which allow us to improve the quality of the convex relaxations discussed previously. Note that as convex relaxations and random rounding methods are two sides of the same coin (Barak et al., 2014), applying these valid inequalities also improves the quality of the randomly rounded solutions.

Theorem 15

Let 𝒫strong\mathcal{P}_{strong} denote the optimal objective value of the following problem:

\max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\ \langle\bm{\Sigma},\bm{X}\rangle\quad \text{s.t.}\quad \mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}z_{i}\ \forall i,j\in[p], (20)
\quad\quad\quad \sum_{j\in[p]}X_{i,j}^{2}\leq X_{i,i}z_{i}\ \forall i\in[p],\quad \|\bm{X}\|_{1}\leq k.

Then, (20) is a stronger relaxation than (19), i.e., the following inequalities hold:

\max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z})\geq\mathcal{P}_{strong}\geq\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z}).

Moreover, suppose an optimal solution to (20) is of rank one. Then, the relaxation is tight:

\mathcal{P}_{strong}=\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z}).

Proof  The first inequality \max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z})\geq\mathcal{P}_{strong} is trivial. The second inequality holds because \mathcal{P}_{strong} is indeed a valid relaxation of Problem (1). Indeed, \|\bm{X}\|_{1}\leq k follows from the cardinality and big-M constraints. The semidefinite constraint \bm{X}\succeq\bm{0} imposes second-order cone constraints on the 2\times 2 minors of \bm{X}, X_{i,j}^{2}\leq z_{i}X_{i,i}X_{j,j}, which can be aggregated into \sum_{j\in[p]}X_{i,j}^{2}\leq X_{i,i}z_{i} (see Bertsimas and Cory-Wright, 2020, for derivations).

Finally, suppose that an optimal solution to Problem (20) is of rank one, i.e., the optimal matrix \bm{X} can be decomposed as \bm{X}=\bm{x}\bm{x}^{\top}. Then, the SOCP inequalities imply that \sum_{j\in[p]}x_{i}^{2}x_{j}^{2}\leq x_{i}^{2}z_{i}. However, \sum_{j\in[p]}x_{j}^{2}=\mathrm{tr}(\bm{X})=1, which implies that x_{i}^{2}\leq x_{i}^{2}z_{i}, i.e., z_{i}=1 for any index i such that |x_{i}|>0. Since \bm{e}^{\top}\bm{z}\leq k, this implies that \|\bm{x}\|_{0}\leq k, i.e., \bm{X} also solves Problem (2).

As our numerical experiments will demonstrate, despite the simplicity of the rounding mechanism in Algorithm 2, relaxation (20) provides high-quality solutions to the original sparse PCA problem (1) without introducing any additional variables. We remark that other inequalities, including the second-order cone inequalities proposed in Li and Xie (2020, Lemma 2 (ii)), could further improve the convex relaxation; we leave their integration within our framework to future work.
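
To make the procedure concrete, the following is a minimal JuMP sketch of relaxation (20) followed by the greedy rounding step of Algorithm 2. It is an illustration rather than the implementation we provide on GitHub: it assumes Mosek is available and, for simplicity, it sets the big-M constants to M_{i,j} = 1/2 for i ≠ j and M_{i,i} = 1, which are valid (though not the tightest possible) choices whenever tr(X) = 1 and X ⪰ 0.

```julia
using JuMP, MosekTools, LinearAlgebra

# Sketch of relaxation (20) plus the greedy rounding step (Algorithm 2).
# M[i,j] are simple (not the tightest) valid big-M constants.
function relax_and_round(Σ::AbstractMatrix, k::Int)
    p = size(Σ, 1)
    M = [i == j ? 1.0 : 0.5 for i in 1:p, j in 1:p]
    model = Model(Mosek.Optimizer)
    @variable(model, X[1:p, 1:p], PSD)
    @variable(model, 0 <= z[1:p] <= 1)
    @constraint(model, sum(z) <= k)
    @constraint(model, sum(X[i, i] for i in 1:p) == 1)           # tr(X) = 1
    @constraint(model, [i = 1:p, j = 1:p],  X[i, j] <= M[i, j] * z[i])
    @constraint(model, [i = 1:p, j = 1:p], -X[i, j] <= M[i, j] * z[i])
    # Row-wise inequalities sum_j X[i,j]^2 <= X[i,i] * z[i], as rotated SOCs
    @constraint(model, [i = 1:p],
        [0.5 * X[i, i]; z[i]; X[i, :]] in RotatedSecondOrderCone())
    # ||X||_1 <= k via entrywise absolute-value variables
    @variable(model, U[1:p, 1:p] >= 0)
    @constraint(model, U .>=  X)
    @constraint(model, U .>= -X)
    @constraint(model, sum(U) <= k)
    @objective(model, Max, sum(Σ[i, j] * X[i, j] for i in 1:p, j in 1:p))
    optimize!(model)
    # Greedy rounding: keep the k largest z's; the restricted problem's value
    # is the largest eigenvalue of the corresponding principal submatrix.
    S = partialsortperm(value.(z), 1:k, rev = true)
    return objective_value(model), sort(S), eigmax(Symmetric(Σ[S, S]))
end
```

The first returned value is the relaxation bound and the last is the objective value of the greedily rounded support, so their relative difference is the bound gap we report in the sequel.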

4 Numerical Results

We now assess the numerical behavior of the algorithms proposed in Sections 2 and 3. To bridge the gap between theory and practice, we provide a Julia implementation of the convex relaxation and greedy rounding procedure on GitHub (https://github.com/ryancorywright/ScalableSPCA.jl). The code requires a conic solver such as Mosek and several open-source Julia packages to be installed.

4.1 Performance of Exact Methods

In this section, we apply Algorithm 1 to medium and large-scale sparse principal component analysis problems, with and without Gershgorin circle theorem bounds in the master problem. All experiments were implemented in Julia 1.3, using Gurobi 9.1 and JuMP.jl 0.21.6, and performed on a standard MacBook Pro laptop with a 2.9GHz 6-Core Intel i9 CPU and 16 GB DDR4 RAM. We compare our approach to the branch-and-bound algorithm developed by Berk and Bertsimas (2019) on the UCI pitprops, wine, miniboone, communities, arrhythmia and micromass datasets, both in terms of runtime and the number of nodes expanded; we refer to Berk and Bertsimas (2019); Bertsimas and Cory-Wright (2020) for descriptions of these datasets. (The solve times for their method, as reported here, differ from those reported in Berk and Bertsimas (2019) due to a small typo in their implementation: line 110 of their branchAndBound.jl code should read “if y[i]==-1 || y[i]==1”, not “if y[i]==-1”, in order to correctly compute the Gershgorin circle theorem bound; correcting this is necessary to ensure that we obtain correct results from their method.) Note that we normalized all datasets before running the methods (i.e., we compute the leading sparse principal components of correlation matrices). Additionally, we warm-start all methods with the solution of the method of Yuan and Zhang (2013) to maintain a fair comparison.

Table 1 reports the time for Algorithm 1 (with and without Gershgorin circle theorem bounds in the master problem) and the method of Berk and Bertsimas (2019) to identify the leading k-sparse principal component for k ∈ {5, 10, 20}, along with the number of nodes expanded and the number of outer-approximation cuts generated. We impose a relative optimality tolerance of 10^{-3} for all approaches, i.e., we terminate each method when (UB - LB)/UB ≤ 10^{-3}, where UB denotes the current objective bound and LB denotes the current incumbent objective value. Note that p denotes the dimensionality of the correlation matrix, and k ≤ p denotes the target sparsity.

Table 1: Runtime in seconds per approach. We impose a time limit of 600s. If a solver fails to converge, we report the relative bound gap at termination in brackets.
Dataset p k Alg. 1 Alg. 1 + Circle Theorem Method of B.+B.
Time(s) Nodes Cuts Time(s) Nodes Cuts Time(s) Nodes
Pitprops 13 5 0.30 1,608 1,176 0.06 38 8 1.49 22
10 0.14 414 387 0.02 18 21 0.02 14
Wine 13 5 0.57 2,313 1,646 0.02 46 11 0.04 34
10 0.17 376 311 0.03 54 58 0.02 12
Miniboone 50 5 0.01 0 11 0.01 0 3 0.04 2
10 0.01 0 16 0.02 0 3 0.04 2
20 0.03 0 26 0.01 0 3 1.30 5,480
Communities 101 5 (2.87%) 28,462 25,483 0.20 201 3 0.57 101
10 (13.3%) 37,479 36,251 0.34 406 39 0.94 1,298
20 (39.6%) 24,566 24,632 (12.1%) 42,120 37,383 (9.97%) 669,500
Arrhythmia 274 5 (18.1%) 22,771 20,722 6.07 135 1,233 4.17 1,469
10 (32.6%) 19,500 19,314 (2.92%) 15,510 6,977 (0.83%) 471,680
20 (74.4%) 33,773 12,374 (24.3%) 33,123 19,662 (18.45%) 311,400
Micromass 1300 5 (1.29%) 3,859 3,099 163.60 2,738 6 24.31 1,096
10 (10.6%) 3,366 3,369 241.86 3,233 121 362.4 36,690
20 (35.9%) 2,797 2,839 (35.9%) 2,676 2,115 (10.34%) 31,990

Our main findings from these experiments are as follows:

  • For smaller problems, the strength of Algorithm 1’s cuts allows it to outperform state-of-the-art methods such as the method of Berk and Bertsimas (2019). However, for larger problem sizes, the adaptive branching strategy of Berk and Bertsimas (2019) performs comparably to Algorithm 1. This suggests that the relative merits of both approaches are roughly even, and which method is preferable may depend on the problem data.

  • Generating outer-approximation cuts and deriving valid upper bounds from the Gershgorin circle theorem are both powerful ideas, but the greatest aggregate benefit appears to arise from combining the two bounds, rather than using either bound alone.

  • Once both k and p are sufficiently large (e.g., p > 300 and k > 10), no approach is able to solve the problem to provable optimality within 600s. This motivates our study of convex relaxations and randomized rounding methods in the next section.

4.2 Convex Relaxations and Randomized Rounding Methods

In this section, we apply Algorithm 2 to obtain high-quality convex relaxations and feasible solutions for the datasets studied in the previous subsection, and compare our relaxation to a different convex relaxation developed by d’Aspremont et al. (2008), in terms of the quality of the upper bound and of the resulting greedily rounded solutions. All experiments were implemented using the same specifications as in the previous section. The upper bound of d’Aspremont et al. (2008) which we compare against is given below. Strictly speaking, d’Aspremont et al. (2008) do not write down this formulation in their work; their bound involves dual variables which cannot be used directly to generate feasible solutions via greedy rounding. However, the fact that this bound and (d’Aspremont et al., 2008, Problem (8)) are dual to each other follows directly from strong semidefinite duality, and we therefore attribute the formulation to them:

max𝒛[0,1]p:𝒆𝒛kmax𝑿𝟎,𝑷i𝟎i[p]\displaystyle\max_{\begin{subarray}{c}\bm{z}\in[0,1]^{p}\end{subarray}:\bm{e}^{\top}\bm{z}\leq k}\>\max_{\bm{X}\succeq\bm{0},\bm{P}_{i}\succeq\bm{0}\ \forall i\in[p]} i[p]𝒂i𝒂i,𝑷is.t.tr(𝑿)=1,tr(𝑷i)=zi,𝑿𝑷ii[p],\displaystyle\sum_{i\in[p]}\ \langle\bm{a}_{i}\bm{a}_{i}^{\top},\bm{P}_{i}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \mathrm{tr}(\bm{P}_{i})=z_{i},\ \bm{X}\succeq\bm{P}_{i}\ \forall i\in[p], (21)

where Σ = ∑_{i=1}^{p} 𝒂_i 𝒂_i^⊤ is a Cholesky decomposition of Σ, and we obtain feasible solutions from this relaxation by greedily rounding an optimal 𝒛 à la Algorithm 2. To allow for a fair comparison, we also consider augmenting this formulation with the inequalities derived in Section 3.2, obtaining the following stronger, yet more expensive to solve, relaxation:

max𝒛[0,1]p:𝒆𝒛kmax𝑿𝟎,𝑷i𝟎i[p]\displaystyle\max_{\begin{subarray}{c}\bm{z}\in[0,1]^{p}\end{subarray}:\bm{e}^{\top}\bm{z}\leq k}\>\max_{\begin{subarray}{c}\bm{X}\succeq\bm{0},\\ \bm{P}_{i}\succeq\bm{0}\ \forall i\in[p]\end{subarray}} i[p]𝒂i𝒂i,𝑷is.t.tr(𝑿)=1,tr(𝑷i)=zi,𝑿𝑷ii[p],\displaystyle\sum_{i\in[p]}\ \langle\bm{a}_{i}\bm{a}_{i}^{\top},\bm{P}_{i}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \mathrm{tr}(\bm{P}_{i})=z_{i},\ \bm{X}\succeq\bm{P}_{i}\ \forall i\in[p], (22)
j[p]Xi,j2Xi,izi,𝑿1k.\displaystyle\sum_{j\in[p]}X_{i,j}^{2}\leq X_{i,i}z_{i},\|\bm{X}\|_{1}\leq k.
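
For comparison, an analogous JuMP sketch of relaxation (21) is below (again an illustration under the same assumptions, not the code we release, and assuming Σ is positive definite so that a Cholesky factorization exists). Its main structural difference is that it carries one semidefinite matrix P_i per feature in addition to X, which explains its larger memory footprint in the experiments that follow.

```julia
using JuMP, MosekTools, LinearAlgebra

# Sketch of relaxation (21): one PSD matrix P_i per feature, plus X.
function daspremont_relaxation(Σ::AbstractMatrix, k::Int)
    p = size(Σ, 1)
    a = cholesky(Symmetric(Σ)).L                 # Σ = Σ_i a[:, i] a[:, i]'
    model = Model(Mosek.Optimizer)
    @variable(model, 0 <= z[1:p] <= 1)
    @constraint(model, sum(z) <= k)
    @variable(model, X[1:p, 1:p], PSD)
    @constraint(model, sum(X[i, i] for i in 1:p) == 1)            # tr(X) = 1
    P = [@variable(model, [1:p, 1:p], PSD, base_name = "P_$i") for i in 1:p]
    for i in 1:p
        @constraint(model, sum(P[i][j, j] for j in 1:p) == z[i])  # tr(P_i) = z_i
        @constraint(model, X - P[i] in PSDCone())                 # X ⪰ P_i
    end
    @objective(model, Max, sum(dot(a[:, i], P[i] * a[:, i]) for i in 1:p))
    optimize!(model)
    return objective_value(model), value.(z)
end
```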

We first apply these relaxations on datasets where Algorithm 1 terminates, so that the optimal solution is known and can be compared against. We report the quality of both methods, with and without the additional inequalities discussed in Section 3.2, in Tables 2 and 3 respectively. For the instances of (21) or (22) where p > 13, we used SCS version 2.1.1 (with default parameters) instead of Mosek, since Mosek required more memory than was available in our computing environment; SCS takes an augmented Lagrangian approach which is less numerically stable but requires significantly less memory. That is, (21)’s formulation is too expensive to solve via IPMs on a laptop when p = 50.

Table 2: Quality of relaxation gap (upper bound vs. optimal solution, denoted R. gap), objective gap (rounded solution vs. optimal solution, denoted O. gap) and runtime in seconds per method.
Dataset p k Alg. 2 with (19) Alg. 2 with (21)
R. gap (%) O. gap (%) Time(s) R. gap (%) O. gap (%) Time(s)
Pitprops 13 5 23.8% 0.00% 0.02 23.8% 16.1% 0.46
10 1.10% 0.30% 0.03 1.10% 1.33% 0.46
Wine 13 5 36.8% 0.00% 0.02 36.8% 40.4% 0.433
10 2.43% 0.26% 0.03 2.43% 15.0% 0.463
Miniboone 50 5 781.3% 235.6% 7.37 781.2% 34.7% 1,191.0
10 340.6% 117.6% 7.50 340.6% 44.9% 1,102.6
20 120.3% 38.08% 6.25 120.3% 31.9% 1,140.2
Table 3: Quality of relaxation gap (upper bound vs. optimal solution, denoted R. gap), objective gap (rounded solution vs. optimal solution, denoted O. gap) and runtime in seconds per method, with additional inequalities from Section 3.2.
Dataset p k Alg. 2 with (20) Alg. 2 with (22)
R. gap (%) O. gap (%) Time(s) R. gap (%) O. gap (%) Time(s)
Pitprops 13 5 0.71% 0.00% 0.17 1.53% 0.00% 0.55
10 0.12% 0.00% 0.27 1.10% 0.00% 3.27
Wine 13 5 1.56% 0.00% 0.24 2.98% 15.03% 0.95
10 0.40% 0.00% 0.22 2.04% 0.00% 1.15
Miniboone 50 5 0.00% 0.00% 163.3 0.00% 0.01% 500.7
10 0.00% 0.00% 148.5 0.00% 0.02% 489.9
20 0.00% 0.00% 194.5 0.00% 0.00% 776.3

Observe that applying Algorithm 2 without the additional inequalities (Table 2) yields rather poor relaxations and randomly rounded solutions. However, by intersecting our relaxations with the additional inequalities from Section 3.2 (Table 3), we obtain extremely high-quality relaxations. Indeed, with the additional inequalities, Algorithm 2 using formulation (20) identifies the optimal solution in all instances (0% O. gap), and always supplies a bound gap of less than 2%. Moreover, in terms of obtaining high-quality solutions, the new inequalities allow Problem (20) to perform as well as or better than Problem (21), despite optimizing over one semidefinite matrix rather than p+1 semidefinite matrices. This suggests that Problem (20) should be considered as a viable, more scalable and more accurate alternative to existing SDO relaxations such as Problem (21). For this reason, we only consider Problem (20)’s formulation in the rest of the paper.

We remark, however, that the key drawback of these methods is that, as implemented in this section, they do not scale beyond the problem sizes that Algorithm 1 already solves successfully. This is a drawback because Algorithm 1 supplies an exact certificate of optimality, while these methods do not. In the following set of experiments, we therefore investigate numerical techniques to improve the scalability of Algorithm 2.

4.3 Scalable Dual Bounds and Randomized Rounding Methods

To improve the scalability of Algorithm 2, we relax the PSD constraint on 𝑿 in (19) and (20). With these enhancements, we demonstrate that Algorithm 2 can be successfully scaled to generate high-quality bounds for 1000s × 1000s matrices.

First, as discussed in Remark 13, we can replace the PSD constraint 𝑿 ⪰ 0 by requiring that the p(p-1)/2 two-by-two minors of 𝑿 are non-negative: X_{i,j}² ≤ X_{i,i} X_{j,j}. Second, we consider adding 20 linear inequalities of the form ⟨𝑿, 𝒙_t 𝒙_t^⊤⟩ ≥ 0, for some vectors 𝒙_t (see Bertsimas and Cory-Wright, 2020, for a discussion). Table 4 reports the performance of Algorithm 2 (with the relaxation (20)) under these two approximations of the positive semidefinite cone, “Minors” and “Minors + 20 inequalities” respectively. Note that we report the entire duality gap (i.e., we do not break the gap down into its relaxation and objective gap components) since, as reflected in Table 1, some of these instances are currently too large to solve to optimality.
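
The fragment below sketches this outer-approximation in JuMP: given a model containing a symmetric matrix variable X (declared without the PSD tag), it imposes the 2×2 minor inequalities and a batch of linear cuts ⟨X, x x^⊤⟩ ≥ 0, which are valid for any PSD matrix. The cut vectors are drawn at random here purely for illustration; this is a sketch of the idea rather than the exact cut-selection rule we use, for which we refer to Bertsimas and Cory-Wright (2020).

```julia
using JuMP, LinearAlgebra

# Sketch: outer-approximate the constraint X ⪰ 0 for a symmetric matrix
# variable X already attached to `model`.
function add_psd_outer_approximation!(model::Model, X, p::Int; num_cuts::Int = 20)
    @constraint(model, [i = 1:p], X[i, i] >= 0)
    # 2x2 minor inequalities X[i,j]^2 <= X[i,i] * X[j,j], as rotated SOCs
    @constraint(model, [i = 1:p, j = (i + 1):p],
        [0.5 * X[i, i]; X[j, j]; X[i, j]] in RotatedSecondOrderCone())
    # Linear cuts <X, x x'> >= 0, valid for any PSD matrix
    for _ in 1:num_cuts
        x = randn(p)
        @constraint(model, sum(x[i] * x[j] * X[i, j] for i in 1:p, j in 1:p) >= 0)
    end
    return model
end
```

In place of @variable(model, X[1:p, 1:p], PSD), one would declare @variable(model, X[1:p, 1:p], Symmetric) and then call this function.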

Table 4: Quality of bound gap (rounded solution vs. upper bound) and runtime in seconds of Algorithm 2 with (20), under outer-approximations of the PSD cone.
Dataset p k Minors Minors + 20 inequalities
Gap (%) Time(s) Gap (%) Time(s)
Pitprops 13 5 1.51% 0.02 0.72% 0.36
10 5.29% 0.02 1.12% 0.36
Wine 13 5 2.22% 0.02 1.59% 0.38
10 3.81% 0.02 1.50% 0.37
Miniboone 50 5 0.00% 0.11 0.00% 0.11
10 0.00% 0.12 0.00% 0.12
20 0.00% 0.39 0.00% 0.39
Communities 101 5 0.07% 0.67 0.07% 14.8
10 0.66% 0.68 0.66% 14.4
20 3.32% 1.84 2.23% 33.5
Arrhythmia 274 5 3.37% 27.2 1.39% 203.6
10 3.01% 25.6 1.33% 184.0
20 8.87% 21.8 4.48% 426.8
Micromass 1300 5 0.04% 239.4 0.01% 4,639
10 0.63% 232.6 0.32% 6,392
20 13.1% 983.5 5.88% 16,350

Observe that if we impose constraints on the 2×2 minors only, we obtain a solution certifiably within 13% of optimality in seconds (resp. minutes) for p=100s (resp. p=1000s). Adding the 20 linear inequalities, we obtain a solution within 6% of optimality in minutes (resp. hours) for p=100s (resp. p=1000s). Moreover, the bound gaps compare favorably to those of Algorithm 1 and the method of Berk and Bertsimas (2019) on instances which these methods could not solve to certifiable optimality. For instance, for the Arrhythmia dataset with k=20, we obtain a bound gap of less than 9% in 20s, while the method of Berk and Bertsimas (2019) obtains a bound gap of 18.45% in 600s. This illustrates the value of the proposed relax+round method on datasets which are currently too large to be optimized over exactly.

To conclude this section, we explore Algorithm 2’s ability to scale to even higher-dimensional datasets in a high-performance setting, by running the method on one Intel Xeon E5-2690 v4 2.6GHz CPU core using 600 GB RAM. Table 5 reports the method’s scalability and performance on the Wilshire 5000, Gisette, and Arcene UCI datasets. For the Gisette dataset, we report the method’s performance when we include the first 3,000 and 4,000 rows/columns (as well as all 5,000 rows/columns). Similarly, for the Arcene dataset we report the method’s performance when we include the first 6,000, 7,000 or 8,000 rows/columns. We do not report results for the Arcene dataset for p > 8,000, as computing them requires more memory than was available (i.e., > 600 GB RAM). We do not report the method’s performance when we impose linear inequalities for the PSD cone, as solving the relaxation without them is already rather time consuming. Moreover, we do not impose the 2×2 minor constraints to save memory, we do not impose |X_{i,j}| ≤ M_{i,j} z_i when p ≥ 4,000 to save even more memory, and we report the overall bound gap, as improving upon the randomly rounded solution is challenging in a high-dimensional setting.

Table 5: Quality of bound gap (rounded solution vs. upper bound) and runtime in seconds.
Dataset p k Algorithm 2 (SOC relax.) + Inequalities
Bound gap (%) Time(s)
Wilshire 5000 2130 5 0.38% 1,036
10 0.24% 1,014
20 0.36% 1,059
Gisette 3000 5 1.67% 2,249
10 35.81% 2,562
20 10.61% 3,424
Gisette 4000 5 1.55% 1,402
10 54.4% 1,203
20 11.84% 1,435
Gisette 5000 5 1.89% 2,169
10 2.22% 2,455
20 7.16% 2,190
Arcene 6000 5 0.01% 3,333
10 0.06% 3,616
20 0.14% 3,198
Arcene 7000 5 0.03% 4,160
10 0.05% 4,594
20 0.25% 4,730
Arcene 8000 5 0.02% 6,895
10 0.17% 8,479
20 0.21% 6,335

These results suggest that if we solve the SOC relaxation using a first-order method rather than an interior point method, our approach could successfully generate certifiably near-optimal PCs when p=10,000s, particularly if combined with a feature screening technique (see d’Aspremont et al., 2008; Atamtürk and Gomez, 2020).

4.4 Performance of Exact and Approximate Methods on Synthetic Data

We now compare the exact and approximate methods against existing state-of-the-art methods in a spiked covariance matrix setting. We use the experimental setup laid out in d’Aspremont et al. (2008, Section 7.1). We recover the leading principal component of a test matrix Σ ∈ S^p_+, where p = 150 and Σ = (1/n) 𝑼^⊤𝑼 + (σ/‖𝒗‖₂²) 𝒗𝒗^⊤ (this specification of the test matrix differs from that of d’Aspremont et al. (2008, Section 7.1), who write Σ = 𝑼^⊤𝑼 + σ𝒗𝒗^⊤; however, it agrees with their source code). Here, 𝑼 ∈ [0,1]^{150×150} is a noise matrix with i.i.d. standard uniform entries and 𝒗 ∈ ℝ^{150} is a vector of signals such that

vi={1,ifi50,1i50,if 51i100,0,otherwise,\displaystyle v_{i}=\begin{cases}1,&\text{if}\ i\leq 50,\\ \frac{1}{i-50},&\text{if}\ 51\leq i\leq 100,\\ 0,&\text{otherwise,}\end{cases} (23)

and σ=2\sigma=2 is the signal-to-noise ratio. The methods which we compare are:

  • Exact: Algorithm 1 with Gershgorin inequalities and a time limit of 600s.

  • Approximate: Algorithm 2 with Problem (20), the SOC outer-approximation of the PSD cone, no PSD cuts, and the additional SOC inequalities.

  • Greedy: as proposed by Moghaddam et al. (2006) and laid out in (d’Aspremont et al., 2008, Algorithm 1), start with a solution 𝒛 of cardinality 1 and iteratively augment this solution vector with the index which gives the maximum variance contribution. Note that d’Aspremont et al. (2008) found this method outperformed the three other methods (approximate greedy, thresholding and sorting) they considered in their work.

  • Truncated Power Method: as proposed by Yuan and Zhang (2013), alternate between applying the power method to the solution vector and truncating the vector to ensure that it is k-sparse (a minimal sketch of this alternation appears after this list). Note that Berk and Bertsimas (2019) found that this approach performed better than five other state-of-the-art methods across the real-world datasets studied in the previous section of this paper, and often matched the performance of the method of Berk and Bertsimas (2019); indeed, it functions as a warm-start for the latter method.

  • Sorting: sort the diagonal entries Σ_{i,i} by magnitude and set z_i = 1 for the k largest, as studied in d’Aspremont et al. (2008). This naive method serves as a benchmark for the value of optimization in the more sophisticated methods considered here.
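
For completeness, a minimal sketch of the truncated power method’s alternation referenced above follows; it is our simplified reading of Yuan and Zhang (2013), and the initialization and stopping rule below are placeholders.

```julia
using LinearAlgebra

# Sketch of the truncated power method: alternate a power-iteration step with
# hard-thresholding onto the k largest entries (in absolute value).
function truncated_power_method(Σ::AbstractMatrix, k::Int; iters::Int = 200)
    p = size(Σ, 1)
    x = normalize(ones(p))                             # simplistic initialization
    for _ in 1:iters
        y = Σ * x                                      # power step
        S = partialsortperm(abs.(y), 1:k, rev = true)  # keep the k largest entries
        x = zeros(p)
        x[S] = y[S]
        x = normalize(x)                               # project back to the unit sphere
    end
    return x                                           # k-sparse approximate leading PC
end
```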

Figure 1 depicts the ROC curve (true positive rate vs. false positive rate for recovering the support of 𝒗) over 20 synthetic random instances, as we vary k for each instance. We observe that among all methods, the sorting method is the least accurate, with a substantially larger false detection rate for a given true positive rate than the remaining methods (AUC=0.7028). The truncated power method and our exact method then offer a substantial improvement over sorting, with respective AUCs of 0.7482 and 0.7483 (the exact method would dominate the remaining methods if given an unlimited runtime budget; its poor performance here reflects its inability to find the true optimal solution within 600 seconds). The greedy method then offers a modest improvement over them (AUC=0.7561), and the approximate relax+round method is the most accurate (AUC=0.7593).

In addition to support recovery, Figure 2 reports average runtime (left panel) and average optimality gap (right panel) over the same instances. Observe that among all methods, only the exact and the approximate relax+round methods provide optimality gaps, i.e., numerical certificates of near optimality. On this metric, relax+round supplies average bound gaps of 1% or less on all instances, while the exact method typically supplies bound gaps of 30% or more. This comparison illustrates the tightness of the valid inequalities from Section 3.2 that we included in the relaxation. Moreover, the relax+round method converges in less than one minute on all instances. All told, the relax+round method is the best performing method overall, although if k is set to be sufficiently close to 0 or p all methods behave comparably. In particular, the relax+round method should be preferred over the exact method, even though the exact method performs better at smaller problem sizes.

Figure 1: ROC curve over 20 synthetic instances where p=150, k_true=100 is unspecified.
Figure 2: Average time to compute solution, and optimality gap, over 20 synthetic instances where p=150, k_true=100 is unspecified.

4.5 Summary and Guidelines From Experiments

In summary, our main findings from our numerical experiments are as follows:

  • For small or medium scale problems where p ≤ 100 or k ≤ 10, exact methods such as Algorithm 1 or the method of Berk and Bertsimas (2019) reliably obtain certifiably optimal or near-optimal solutions in a short amount of time, and should therefore be preferred over other methods. However, for larger-scale sparse PCA problems, exact methods currently do not scale as well as approximate or heuristic methods.

  • For larger-scale sparse PCA problems, our proposed combination of solving a second-order cone relaxation and rounding greedily reliably supplies certifiably near-optimal solutions in practice (if not in theory) in a relatively small amount of time. Moreover, it outperforms other state-of-the-art heuristics including the greedy method of Moghaddam et al. (2006); d’Aspremont et al. (2008) and the Truncated Power Method of Yuan and Zhang (2013). Accordingly, it should be considered as a reliable and more accurate alternative for problems where p=1000s.

  • In practice, for even larger-scale problem sizes, we recommend using a combination of these methods: a computationally cheaper method (with k set in the 1000s) as a feature screening method, to be followed by the approximate relax+round method (with k set in the 100s) and/or the exact method, if time permits.

5 Three Extensions and their Mixed-Integer Conic Formulations

We conclude by discussing three extensions of sparse PCA where our methodology applies.

5.1 Non-Negative Sparse PCA

One potential extension to this paper would be to develop a certifiably optimal algorithm for non-negative sparse PCA (see Zass and Shashua, 2007, for a discussion), i.e., develop a tractable reformulation of

max𝒙p\displaystyle\max_{\bm{x}\in\mathbb{R}^{p}}\quad 𝒙𝒙,𝚺s.t.𝒙𝒙=1,𝒙𝟎,𝒙0k.\displaystyle\langle\bm{x}\bm{x}^{\top},\bm{\Sigma}\rangle\ \text{s.t.}\ \bm{x}^{\top}\bm{x}=1,\bm{x}\geq\bm{0},\|\bm{x}\|_{0}\leq k.

Unfortunately, we cannot develop a MISDO reformulation of non-negative sparse PCA by applying Theorem 1 mutatis mutandis. Indeed, while we can still set 𝑿 = 𝒙𝒙^⊤ and relax the rank-one constraint, if we do so then, by the non-negativity of 𝒙, lifting 𝒙 yields:

max𝒛{0,1}p:𝒆𝒛kmax𝑿𝒞n\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in\mathcal{C}_{n}} 𝚺,𝑿\displaystyle\langle\bm{\Sigma},\bm{X}\rangle (24)
s.t. tr(𝑿)=1,Xi,j=0ifzi=0,Xi,j=0ifzj=0i,j[p].\displaystyle\mathrm{tr}(\bm{X})=1,\ X_{i,j}=0\ \text{if}\ z_{i}=0,\ X_{i,j}=0\ \text{if}\ z_{j}=0\ \forall i,j\in[p].

where 𝒞_n := {𝑿 : ∃ 𝑼 ≥ 0, 𝑿 = 𝑼^⊤𝑼} denotes the completely positive cone, which is NP-hard to separate over and cannot currently be optimized over tractably (Dong and Anstreicher, 2013). Nonetheless, we can develop relatively tractable mixed-integer conic upper and lower bounds for non-negative sparse PCA. Indeed, we can obtain a fairly tight upper bound by replacing the completely positive cone with the larger doubly non-negative cone 𝒟_n := {𝑿 ∈ S^p_+ : 𝑿 ≥ 0}, which is a high-quality outer-approximation of 𝒞_n and is exact when k ≤ 4 (Burer et al., 2009).

Unfortunately, this relaxation is not exact in general, since the extreme rays of the doubly non-negative cone are not necessarily rank-one when k ≥ 5 (Burer et al., 2009). Nonetheless, to obtain feasible solutions which supply lower bounds, we could inner-approximate the completely positive cone with the cone of non-negative scaled diagonally dominant matrices (see Ahmadi and Majumdar, 2019; Bostanabad et al., 2020).
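
As an illustration of the upper bound just described, the following JuMP sketch replaces the completely positive cone with the doubly non-negative cone for a fixed candidate support S; restricting to a fixed support is a simplification we make here to keep the example short, and Mosek is again assumed.

```julia
using JuMP, MosekTools

# Sketch: doubly non-negative (PSD and entrywise non-negative) upper bound on
# non-negative sparse PCA, restricted to a fixed candidate support S.
function dnn_upper_bound(Σ::AbstractMatrix, S::Vector{Int})
    q = length(S)
    model = Model(Mosek.Optimizer)
    @variable(model, X[1:q, 1:q], PSD)
    @constraint(model, X .>= 0)                          # completely positive cone relaxed to DNN
    @constraint(model, sum(X[i, i] for i in 1:q) == 1)   # tr(X) = 1
    @objective(model, Max, sum(Σ[S[i], S[j]] * X[i, j] for i in 1:q, j in 1:q))
    optimize!(model)
    return objective_value(model)
end
```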

5.2 Sparse PCA on Rectangular Matrices

A second extension would be to extend our methodology to the non-square case:

max𝒙m,𝒚n\displaystyle\max_{\bm{x}\in\mathbb{R}^{m},\bm{y}\in\mathbb{R}^{n}}\quad 𝒙𝑨𝒚s.t.𝒙2=1,𝒚2=1,𝒙0k,𝒚0k.\displaystyle\bm{x}^{\top}\bm{A}\bm{y}\ \text{s.t.}\ \|\bm{x}\|_{2}=1,\|\bm{y}\|_{2}=1,\|\bm{x}\|_{0}\leq k,\|\bm{y}\|_{0}\leq k. (25)

Observe that computing the spectral norm of a matrix 𝑨\bm{A} is equivalent to:

max𝑿n×m𝑨,𝑿s.t.(𝑼𝑿𝑿𝑽)𝟎,tr(𝑼)+tr(𝑽)=2,\displaystyle\max_{\bm{X}\in\mathbb{R}^{n\times m}}\quad\langle\bm{A},\bm{X}\rangle\ \text{s.t.}\ \begin{pmatrix}\bm{U}&\bm{X}\\ \bm{X}^{\top}&\bm{V}\end{pmatrix}\succeq\bm{0},\mathrm{tr}(\bm{U})+\mathrm{tr}(\bm{V})=2, (26)

where, in an optimal solution, 𝑼\bm{U} stands for 𝒙𝒙\bm{x}\bm{x}^{\top}, 𝑽\bm{V} stands for 𝒚𝒚\bm{y}\bm{y}^{\top} and 𝑿\bm{X} stands for 𝒙𝒚\bm{x}\bm{y}^{\top}—this can be seen by taking the dual of (Recht et al., 2010, Equation 2.4).

Therefore, by using the same argument as in the positive semidefinite case, we can rewrite sparse PCA on rectangular matrices as the following MISDO:

max𝒘{0,1}m,𝒛{0,1}nmax𝑿n×m\displaystyle\max_{\bm{w}\in\{0,1\}^{m},\bm{z}\in\{0,1\}^{n}}\max_{\bm{X}\in\mathbb{R}^{n\times m}} 𝑨,𝑿\displaystyle\langle\bm{A},\bm{X}\rangle (27)
s.t. (𝑼𝑿𝑿𝑽)𝟎,tr(𝑼)+tr(𝑽)=2,\displaystyle\ \begin{pmatrix}\bm{U}&\bm{X}\\ \bm{X}^{\top}&\bm{V}\end{pmatrix}\succeq\bm{0},\mathrm{tr}(\bm{U})+\mathrm{tr}(\bm{V})=2,
Ui,j=0ifwi=0i,j[m],\displaystyle U_{i,j}=0\ \text{if}\ w_{i}=0\ \forall i,j\in[m],
Vi,j=0ifzi=0i,j[n],𝒆𝒘k,𝒆𝒛k.\displaystyle V_{i,j}=0\ \text{if}\ z_{i}=0\ \forall i,j\in[n],\bm{e}^{\top}\bm{w}\leq k,\bm{e}^{\top}\bm{z}\leq k.

5.3 Sparse PCA with Multiple Principal Components

A third extension where our methodology is applicable is the problem of obtaining multiple principal components simultaneously, rather than deflating 𝚺\bm{\Sigma} after obtaining each principal component. As there are multiple definitions of this problem, we now discuss the extent to which our framework encompasses each case.

Common Support:

Perhaps the simplest extension of sparse PCA to a multi-component setting arises when all rr principal components have common support. By retaining the vector of binary variables 𝒛\bm{z} and employing the Ky-Fan theorem (c.f. Wolkowicz et al., 2012, Theorem 2.3.8) to cope with multiple principal components, we obtain the following formulation in much the same manner as previously:

max𝒛{0,1}p:𝒆𝒛kmax𝑿S+p\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad 𝑿,𝚺s.t. 0𝑿𝕀,tr(𝑿)=r,Xi,j=0ifzi=0i[p].\displaystyle\langle\bm{X},\bm{\Sigma}\rangle\ \text{s.t.}\ \bm{0}\preceq\bm{X}\preceq\mathbb{I},\ \mathrm{tr}(\bm{X})=r,\ X_{i,j}=0\ \text{if}\ z_{i}=0\ \forall i\in[p]. (28)

Notably, the logical constraint X_{i,j} = 0 if z_i = 0, which formed the basis of our subproblem strategy, still successfully models the sparsity constraint. This suggests that (a) one can derive an equivalent subproblem strategy under common support, and (b) a cutting-plane method for common support should scale as well as in the single-component case.

Disjoint Support:

In a sparse PCA problem with disjoint support (Vu and Lei, 2012), simultaneously computing the first r principal components is equivalent to solving:

max𝒛{0,1}p×r:𝒆𝒛tkt[r],𝒛𝒆𝒆max𝑾p×r\displaystyle\max_{\begin{subarray}{c}\bm{z}\in\{0,1\}^{p\times r}:\bm{e}^{\top}\bm{z}_{t}\leq k\ \forall t\in[r],\\ \bm{z}\bm{e}\leq\bm{e}\end{subarray}}\max_{\bm{W}\in\mathbb{R}^{p\times r}} 𝑾𝑾,𝚺\displaystyle\langle\bm{W}\bm{W}^{\top},\bm{\Sigma}\rangle (29)
𝑾𝑾=𝕀r,Wi,j=0ifzi,t=0i[p],t[r],\displaystyle\bm{W}^{\top}\bm{W}=\mathbb{I}_{r},\ W_{i,j}=0\ \text{if}\ z_{i,t}=0\ \forall i\in[p],t\in[r],

where zi,tz_{i,t} is a binary variable denoting whether feature ii is a member of the ttth principal component. By applying the technique used to derive Theorem 1 mutatis mutandis, and invoking the Ky-Fan theorem (c.f. Wolkowicz et al., 2012, Theorem 2.3.8) to cope with the rank-rr constraint, we obtain:

max𝒛{0,1}p:𝒆𝒛kmax𝑿Sp\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\max_{\bm{X}\in S^{p}} 𝑿,𝚺\displaystyle\langle\bm{X},\bm{\Sigma}\rangle (30)
𝟎𝑿𝕀,tr(𝑿)=r,Xi,j=0ifYi,j=0i[p],\displaystyle\bm{0}\preceq\bm{X}\preceq\mathbb{I},\ \mathrm{tr}(\bm{X})=r,\ X_{i,j}=0\ \text{if}\ Y_{i,j}=0\ \forall i\in[p],

where Yi,j=t=1rzi,tzj,tY_{i,j}=\sum_{t=1}^{r}z_{i,t}z_{j,t} is a binary matrix denoting whether features ii and jj are members of the same principal component; this problem can be addressed by a cutting-plane method in much the same manner as when r=1r=1.

Acknowledgments

We are grateful to the three anonymous referees and the associate editor for many valuable comments which improved the paper.

References

  • Ahmadi and Majumdar [2019] A. A. Ahmadi and A. Majumdar. DSOS and SDSOS optimization: More tractable alternatives to sum of squares and semidefinite optimization. SIAM Journal on Applied Algebra and Geometry, 3(2):193–230, 2019.
  • Amini and Wainwright [2008] A. A. Amini and M. J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In 2008 IEEE international symposium on information theory, pages 2454–2458. IEEE, 2008.
  • Atamtürk and Gomez [2020] A. Atamtürk and A. Gomez. Safe screening rules for l0-regression. Optimization Online, 2020.
  • Bao et al. [2011] X. Bao, N. V. Sahinidis, and M. Tawarmalani. Semidefinite relaxations for quadratically constrained quadratic programming: A review and comparisons. Mathematical Programming, 129(1):129, 2011.
  • Barak et al. [2014] B. Barak, J. A. Kelner, and D. Steurer. Rounding sum-of-squares relaxations. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 31–40, 2014.
  • Barnhart et al. [1998] C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. Savelsbergh, and P. H. Vance. Branch-and-price: Column generation for solving huge integer programs. Operations Research, 46(3):316–329, 1998.
  • Ben-Tal and Nemirovski [2002] A. Ben-Tal and A. Nemirovski. On tractable approximations of uncertain linear matrix inequalities affected by interval uncertainty. SIAM Journal on Optimization, 12(3):811–833, 2002.
  • Berk and Bertsimas [2019] L. Berk and D. Bertsimas. Certifiably optimal sparse principal component analysis. Mathematical Programming Computation, 11(3):381–420, 2019.
  • Berthet and Rigollet [2013] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in high dimension. Annals of Statistics, 41(4):1780–1815, 2013.
  • Bertsimas and Cory-Wright [2020] D. Bertsimas and R. Cory-Wright. On polyhedral and second-order-cone decompositions of semidefinite optimization problems. Operations Research Letters, 48(1):78–85, 2020.
  • Bertsimas and Van Parys [2020] D. Bertsimas and B. Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. Annals of Statistics, 48(1):300–323, 2020.
  • Bertsimas and Ye [1998] D. Bertsimas and Y. Ye. Semidefinite relaxations, multivariate normal distributions, and order statistics. In Handbook of Combinatorial Optimization, pages 1473–1491. Springer, 1998.
  • Bertsimas et al. [2017] D. Bertsimas, J. Pauphilet, and B. Van Parys. Sparse classification and phase transitions: A discrete optimization perspective. arXiv preprint arXiv:1710.01352, 2017.
  • Bertsimas et al. [2019] D. Bertsimas, R. Cory-Wright, and J. Pauphilet. A unified approach to mixed-integer optimization problems with logical constraints. arXiv preprint arXiv:1907.02109, 2019.
  • Boman et al. [2005] E. G. Boman, D. Chen, O. Parekh, and S. Toledo. On factor width and symmetric h-matrices. Linear Algebra and its Applications, 405:239–248, 2005.
  • Bostanabad et al. [2020] M. S. Bostanabad, J. Gouveia, and T. K. Pong. Inner approximating the completely positive cone via the cone of scaled diagonally dominant matrices. Journal on Global Optimization, 76:383–405, 2020.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, Cambridge, 2004.
  • Boyd et al. [1994] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in system and control theory, volume 15. Studies in Applied Mathematics, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1994.
  • Brauer [1952] A. Brauer. Limits for the characteristic roots of a matrix iv. Duke Mathematical Journal, 19:75–91, 1952.
  • Burer et al. [2009] S. Burer, K. M. Anstreicher, and M. Dür. The difference between 5×5 doubly nonnegative and completely positive matrices. Linear Algebra and its Applications, 431(9):1539–1552, 2009.
  • Candes and Tao [2007] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
  • Chan et al. [2016] S. O. Chan, D. Papailliopoulos, and A. Rubinstein. On the approximability of sparse PCA. In Conference on Learning Theory, pages 623–646, 2016.
  • Chen et al. [2019] Y. Chen, Y. Ye, and M. Wang. Approximation hardness for a class of sparse optimization problems. Journal of Machine Learning Research, 20(38):1–27, 2019.
  • Chowdhury et al. [2020] A. Chowdhury, P. Drineas, D. P. Woodruff, and S. Zhou. Approximation algorithms for sparse principal component analysis. arXiv preprint arXiv:2006.12748, 2020.
  • d’Aspremont et al. [2005] A. d’Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. Lanckriet. A direct formulation for sparse pca using semidefinite programming. In Advances in neural information processing systems, pages 41–48, 2005.
  • d’Aspremont et al. [2008] A. d’Aspremont, F. Bach, and L. E. Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9(Jul):1269–1294, 2008.
  • Dey et al. [2017] S. S. Dey, R. Mazumder, M. Molinaro, and G. Wang. Sparse principal component analysis and its l_1-relaxation. arXiv preprint arXiv:1712.00800, 2017.
  • Dey et al. [2018] S. S. Dey, R. Mazumder, and G. Wang. A convex integer programming approach for optimal sparse PCA. arXiv preprint arXiv:1810.09062, 2018.
  • Dong and Anstreicher [2013] H. Dong and K. Anstreicher. Separating doubly nonnegative and completely positive matrices. Mathematical Programming, 137(1-2):131–153, 2013.
  • Dong et al. [2015] H. Dong, K. Chen, and J. Linderoth. Regularization vs. relaxation: A conic optimization perspective of statistical variable selection. arXiv:1510.06083, 2015.
  • Duran and Grossmann [1986] M. A. Duran and I. E. Grossmann. An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Mathematical Programming, 36(3):307–339, 1986.
  • d’Aspremont and Boyd [2003] A. d’Aspremont and S. Boyd. Relaxations and randomized methods for nonconvex QCQPs. EE392o Class Notes, Stanford University, 1:1–16, 2003.
  • d’Aspremont et al. [2014] A. d’Aspremont, F. Bach, and L. El Ghaoui. Approximation bounds for sparse principal component analysis. Mathematical Programming, 148(1-2):89–110, 2014.
  • Fountoulakis et al. [2017] K. Fountoulakis, A. Kundu, E.-M. Kontopoulou, and P. Drineas. A randomized rounding algorithm for sparse PCA. ACM Transactions on Knowledge Discovery from Data (TKDD), 11(3):1–26, 2017.
  • Gally and Pfetsch [2016] T. Gally and M. E. Pfetsch. Computing restricted isometry constants via mixed-integer semidefinite programming. Optimization Online, 2016.
  • Gally et al. [2018] T. Gally, M. E. Pfetsch, and S. Ulbrich. A framework for solving mixed-integer semidefinite programs. Optimization Methods & Software, 33(3):594–632, 2018.
  • Goemans and Williamson [1995] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
  • Hein and Bühler [2010] M. Hein and T. Bühler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Advances in neural information processing systems, pages 847–855, 2010.
  • Horn and Johnson [1990] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 1990.
  • Hotelling [1933] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
  • Hsu et al. [2014] Y.-L. Hsu, P.-Y. Huang, and D.-T. Chen. Sparse principal component analysis in cancer research. Translational cancer research, 3(3):182, 2014.
  • Jeffers [1967] J. N. Jeffers. Two case studies in the application of principal component analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 16(3):225–236, 1967.
  • Jolliffe [1995] I. T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22(1):29–35, 1995.
  • Jolliffe et al. [2003] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the lasso. Journal of computational and Graphical Statistics, 12(3):531–547, 2003.
  • Journée et al. [2010] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(2), 2010.
  • Kaiser [1958] H. F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187–200, 1958.
  • Kim and Kojima [2001] S. Kim and M. Kojima. Second order cone programming relaxation of nonconvex quadratic optimization problems. Optimization methods and software, 15(3-4):201–224, 2001.
  • Kobayashi and Takano [2020] K. Kobayashi and Y. Takano. A branch-and-cut algorithm for solving mixed-integer semidefinite optimization problems. Computational Optimization & Applications, 75(2):493–513, 2020.
  • Ledoit and Wolf [2004] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of multivariate analysis, 88(2):365–411, 2004.
  • Li and Tsatsomeros [1997] B. Li and M. Tsatsomeros. Doubly diagonally dominant matrices. Linear Algebra and Its Applications, 261(1-3):221–235, 1997.
  • Li and Xie [2020] Y. Li and W. Xie. Exact and approximation algorithms for sparse PCA. arXiv preprint arXiv:2008.12438, 2020.
  • Luss and d’Aspremont [2010] R. Luss and A. d’Aspremont. Clustering and feature selection using sparse principal component analysis. Optimization & Engineering, 11(1):145–157, 2010.
  • Luss and Teboulle [2013] R. Luss and M. Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65–98, 2013.
  • Magdon-Ismail [2017] M. Magdon-Ismail. NP-hardness and inapproximability of sparse PCA. Information Processing Letters, 126:35–38, 2017.
  • Majumdar et al. [2019] A. Majumdar, G. Hall, and A. A. Ahmadi. A survey of recent scalability improvements for semidefinite programming with applications in machine learning, control, and robotics. arXiv preprint arXiv:1908.05209, 2019.
  • Miolane [2018] L. Miolane. Phase transitions in spiked matrix estimation: information-theoretic analysis. arXiv preprint arXiv:1806.04343, 2018.
  • Moghaddam et al. [2006] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse pca: Exact and greedy algorithms. In Advances in neural information processing systems, pages 915–922, 2006.
  • Quesada and Grossmann [1992] I. Quesada and I. E. Grossmann. An LP/NLP based branch and bound algorithm for convex MINLP optimization problems. Computers & Chemical Engineering, 16(10-11):937–947, 1992.
  • Raghavan and Tompson [1987] P. Raghavan and C. D. Tompson. Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374, 1987.
  • Raghavendra and Steurer [2010] P. Raghavendra and D. Steurer. Graph expansion and the unique games conjecture. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 755–764, 2010.
  • Recht et al. [2010] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
  • Richman [1986] M. B. Richman. Rotation of principal components. Journal of climatology, 6(3):293–335, 1986.
  • Richtárik et al. [2020] P. Richtárik, M. Jahani, S. D. Ahipaşaoğlu, and M. Takáč. Alternating maximization: Unifying framework for 8 sparse PCA formulations and efficient parallel codes. Optimization and Engineering, pages 1–27, 2020.
  • Todd [2018] M. J. Todd. On max-k-sums. Mathematical Programming, 171(1):489–517, 2018.
  • Vu and Lei [2012] V. Vu and J. Lei. Minimax rates of estimation for sparse PCA in high dimensions. In Artificial intelligence and statistics, pages 1278–1286, 2012.
  • Wang et al. [2016] T. Wang, Q. Berthet, and R. J. Samworth. Statistical and computational trade-offs in estimation of sparse principal components. Annals of Statistics, 44(5):1896–1930, 2016.
  • Witten et al. [2009] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
  • Wolkowicz et al. [2012] H. Wolkowicz, R. Saigal, and L. Vandenberghe. Handbook of semidefinite programming: theory, algorithms, and applications, volume 27. Springer Science & Business Media, 2012.
  • Yuan and Zhang [2013] X.-T. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899–925, 2013.
  • Zakeri et al. [2014] G. Zakeri, D. Craigie, A. Philpott, and M. Todd. Optimization of demand response through peak shaving. Oper. Res. Letters, 42(1):97–101, 2014.
  • Zass and Shashua [2007] R. Zass and A. Shashua. Nonnegative sparse PCA. In Advances in neural information processing systems, pages 1561–1568, 2007.
  • Zhang et al. [2012] Y. Zhang, A. d’Aspremont, and L. El Ghaoui. Sparse PCA: Convex relaxations, algorithms and applications. In Handbook on Semidefinite, Conic and Polynomial Optimization, pages 915–940. Springer, 2012.
  • Zou et al. [2006] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.

A A Doubly Non-Negative Relaxation and a Goemans-Williamson Rounding Scheme

The MISDO formulation (5) we derived in Section 2 features big-M constraints of the form |X_{i,j}| ≤ M_{i,j} z_i. We did not include the equally valid inequalities |X_{i,j}| ≤ M_{i,j} z_j, because they are redundant given that 𝑿 is symmetric. In fact, (5) is equivalent to

max𝒛{0,1}p:𝒆𝒛kmax𝑿S+p𝚺,𝑿 s.t. tr(𝑿)=1,|Xi,j|Mi,jzizj,i,j[p].\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad\langle\bm{\Sigma},\bm{X}\rangle\text{ s.t. }\quad\mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}z_{i}z_{j},\ \forall i,j\in[p]. (31)

The formulation above features products of binary variables z_i z_j. Therefore, unlike several other problems involving cardinality constraints such as compressed sensing, relaxations of sparse PCA benefit from invoking an optimization hierarchy (see d’Aspremont and Boyd, 2003, Section 2.4.1, for a counterexample specific to compressed sensing). In particular, let us model the outer product 𝒛𝒛^⊤ by introducing a matrix 𝒁 and imposing the semidefinite constraint 𝒁 ⪰ 𝒛𝒛^⊤. We tighten the formulation by requiring that Z_{i,i} = z_i and imposing the linear inequalities max(z_i + z_j - 1, 0) ≤ Z_{i,j} ≤ min(z_i, z_j). Hence, we obtain:

max𝒛[0,1]p:𝒆𝒛k,𝒁+p×pmax𝑿𝟎𝚺,𝑿 s.t.\displaystyle\max_{\begin{subarray}{c}\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k,\\ \bm{Z}\in\mathbb{R}^{p\times p}_{+}\end{subarray}}\>\max_{\bm{X}\succeq\bm{0}}\>\langle\bm{\Sigma},\bm{X}\rangle\text{ s.t. } tr(𝑿)=1,|Xi,j|Mi,jZi,j,𝑬,𝒁k2,Zi,i=zi,\displaystyle\mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}Z_{i,j},\ \langle\bm{E},\bm{Z}\rangle\leq k^{2},\>Z_{i,i}=z_{i}, (32)
max(zi+zj1,0)Zi,jmin(zi,zj),(1𝒛𝒛𝒁)𝟎.\displaystyle\max(z_{i}+z_{j}-1,0)\leq Z_{i,j}\leq\min(z_{i},z_{j}),\ \begin{pmatrix}1&\bm{z}^{\top}\\ \bm{z}&\bm{Z}\end{pmatrix}\succeq\bm{0}.

Problem (32) is a doubly non-negative relaxation, as we have intersected the Shor and RLT relaxations. This is noteworthy, because doubly non-negative relaxations dominate most other popular relaxations with O(p2)O(p^{2}) variables (Bao et al., 2011, Theorem 1).

Relaxation (32) is amenable to a Goemans-Williamson rounding scheme (Goemans and Williamson, 1995). Namely, let (𝒛⋆, 𝒁⋆) denote optimal choices of (𝒛, 𝒁) in Problem (32), let 𝒛̂ be a normally distributed random vector such that 𝒛̂ ∼ 𝒩(𝒛⋆, 𝒁⋆ − 𝒛⋆𝒛⋆⊤), and let 𝒛̄ be a rounding of this vector such that z̄_i = 1 for the k largest entries of 𝒛̂; this is, up to feasibility on 𝒛̂, equivalent to the hyperplane rounding scheme of Goemans and Williamson (1995) (see Bertsimas and Ye, 1998, for a proof). We formalize this procedure in Algorithm 3. As Algorithm 3 returns one of multiple possible 𝒛̄’s, a computationally useful strategy is to run the random rounding component several times and return the best solution.

Algorithm 3 A Goemans-Williamson rounding method for Problem (1)
Input: Covariance matrix 𝚺, sparsity parameter k
 Compute 𝒛,𝒁\bm{z}^{\star},\bm{Z}^{\star} solution of (32)
 Compute 𝒛^𝒩(𝒛,𝒁𝒛𝒛)\hat{\bm{z}}\sim\mathcal{N}(\bm{z}^{\star},\bm{Z}^{\star}-\bm{z}^{\star}\bm{z}^{\star\top})
 Construct 𝒛¯{0,1}p:𝒆𝒛¯=k\bar{\bm{z}}\in\{0,1\}^{p}:\bm{e}^{\top}\bar{\bm{z}}=k such that z¯iz¯j\bar{z}_{i}\geq\bar{z}_{j} if z^iz^j\hat{z}_{i}\geq\hat{z}_{j}.
 Compute 𝑿\bm{X} solution of
max𝑿S+p𝚺,𝑿s.t.tr(𝑿)=1,Xi,j=0ifz¯iz¯j=0i,j[p].\displaystyle\max_{\bm{X}\in S^{p}_{+}}\ \langle\bm{\Sigma},\bm{X}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,X_{i,j}=0\ \text{if}\ \bar{z}_{i}\bar{z}_{j}=0\ \forall i,j\in[p].
return  𝒛̄, 𝑿.
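
A minimal sketch of the random rounding step of Algorithm 3 is below. It assumes (z⋆, Z⋆) have already been obtained by solving (32) with a conic solver, and it uses the fact that the final semidefinite subproblem in Algorithm 3 reduces to a leading eigenvalue computation on the selected support.

```julia
using LinearAlgebra

# Sketch of the rounding step in Algorithm 3, assuming (zstar, Zstar) solve (32).
function goemans_williamson_round(Σ::AbstractMatrix, zstar::AbstractVector,
                                  Zstar::AbstractMatrix, k::Int; trials::Int = 100)
    p = length(zstar)
    C = Symmetric(Zstar - zstar * zstar')                  # covariance of the Gaussian sample
    E = eigen(C)
    L = E.vectors * Diagonal(sqrt.(max.(E.values, 0.0)))   # PSD square root (clip round-off)
    best_val, best_S = -Inf, Int[]
    for _ in 1:trials                                      # rerun the rounding, keep the best support
        zhat = zstar + L * randn(p)                        # zhat ~ N(zstar, Zstar - zstar*zstar')
        S = partialsortperm(zhat, 1:k, rev = true)         # set z̄ = 1 on the k largest entries of zhat
        val = eigmax(Symmetric(Σ[S, S]))                   # value of the restricted subproblem
        if val > best_val
            best_val, best_S = val, collect(S)
        end
    end
    return best_S, best_val
end
```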

A very interesting question is whether it is possible to produce a constant-factor guarantee on the quality of Algorithm 3’s rounding, as Goemans and Williamson (1995) successfully did for binary quadratic optimization. Unfortunately, despite our best efforts, this does not appear to be possible, as the quality of the rounding depends on the value of the optimal dual variables, which are hard to control in this setting. This should not be too surprising, for two distinct reasons. Namely, (a) sparse regression, which reduces to sparse PCA (see d’Aspremont et al., 2008, Section 6.1), is strongly NP-hard (Chen et al., 2019), and (b) sparse PCA is hard to approximate within a constant factor under the Small Set Expansion (SSE) hypothesis (Chan et al., 2016), meaning that producing a constant-factor guarantee would contradict the SSE hypothesis of Raghavendra and Steurer (2010).

We close this appendix by noting that a similar in spirit (although different in both derivation and implementation) combination of taking a semidefinite relaxation of 𝒛{0,1}p\bm{z}\in\{0,1\}^{p} and rounding à la Goemans-Williamson has been proposed for sparse regression problems (Dong et al., 2015).