
Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality

Dimitris Bertsimas (dbertsim@mit.edu)
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Ryan Cory-Wright (ryancw@mit.edu)
Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Jean Pauphilet (jpauphilet@london.edu)
London Business School, London, UK
Abstract

Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than p=100s of variables. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting k=5 covariates from p=300 variables, and provides small bound gaps at a larger scale. We also propose a convex relaxation and greedy rounding scheme that provides bound gaps of 1-2% in practice within minutes for p=100s or hours for p=1,000s and is therefore a viable alternative to the exact method at scale. Using real-world financial and medical datasets, we illustrate our approach’s ability to derive interpretable principal components tractably at scale.

Keywords: Sparse PCA, Sparse Eigenvalues, Semidefinite Optimization

1 Introduction

In the era of big data, interpretable methods for compressing a high-dimensional dataset into a lower dimensional set which shares the same essential characteristics are imperative. Since the work of Hotelling (1933), principal component analysis (PCA) has been one of the most popular approaches for completing this task. Formally, given centered data \bm{A}\in\mathbb{R}^{n\times p} and its normalized empirical covariance matrix \bm{\Sigma}:=\frac{\bm{A}^{\top}\bm{A}}{n-1}\in\mathbb{R}^{p\times p}, PCA selects one or more leading eigenvectors of \bm{\Sigma} and subsequently projects \bm{A} onto these eigenvectors. This can be achieved in O(p^{3}) time by taking a singular value decomposition \bm{\Sigma}=\bm{U}\bm{\Lambda}\bm{U}^{\top}.
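To make the preceding computation concrete, the following minimal Julia sketch (an illustration only; the function and variable names are ours) forms the normalized covariance matrix and projects the data onto its r leading eigenvectors:

```julia
# A minimal sketch of (dense) PCA as described above: form the normalized empirical
# covariance matrix and project the (centered) data onto its r leading eigenvectors.
using LinearAlgebra, Statistics

function dense_pca(A::AbstractMatrix, r::Int)
    Ac = A .- mean(A, dims = 1)        # centering is a no-op if A is already centered
    Σ = Ac' * Ac / (size(A, 1) - 1)    # normalized empirical covariance, p × p
    _, U = eigen(Symmetric(Σ))         # eigenvalues (and eigenvectors) in ascending order
    V = U[:, end:-1:end-r+1]           # the r leading eigenvectors
    return Ac * V                      # scores: projection of the data onto the PCs
end
```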

A common criticism of PCA is that the columns of \bm{U} are not interpretable, since each eigenvector is a linear combination of all p original features. (A second criticism of PCA is that, as set up here, it uses the sample correlation or covariance matrix. This is a drawback, because sample covariance matrices are poorly conditioned estimators which over-disperse the sample eigenvalues, particularly in high-dimensional settings. In practice, this can be rectified by, e.g., using a shrinkage estimator (see, e.g., Ledoit and Wolf, 2004). We do not do so here for simplicity, but we recommend doing so when using the techniques developed in this paper in practice.) The lack of interpretability causes difficulties because:

  • In medical applications such as cancer detection, PCs generated during exploratory data analysis need to supply interpretable modes of variation (Hsu et al., 2014).

  • In scientific applications such as protein folding, each original co-ordinate axis has a physical interpretation, and the reduced set of co-ordinate axes should too.

  • In finance applications such as investing capital across index funds, each non-zero entry in each eigenvector used to reduce the feature space incurs a transaction cost.

  • If p\gg n, PCA suffers from a curse of dimensionality and becomes physically meaningless (Amini and Wainwright, 2008).

One common method for obtaining interpretable principal components is to stipulate that they are sparse, i.e., maximize variance while containing at most k non-zero entries. This approach leads to the following non-convex mixed-integer quadratically constrained problem (see d’Aspremont et al., 2005):

max𝒙p𝒙𝚺𝒙s.t.𝒙𝒙=1,𝒙0k,\displaystyle\max_{\bm{x}\in\mathbb{R}^{p}}\ \bm{x}^{\top}\bm{\Sigma}\bm{x}\quad\text{s.t.}\quad\bm{x}^{\top}\bm{x}=1,\ ||\bm{x}||_{0}\leq k, (1)

where the constraint ||\bm{x}||_{0}\leq k forces the variance to be explained by at most k of the original features.
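For intuition, note that once the support of \bm{x} is fixed, Problem (1) reduces to a leading eigenvalue computation on the corresponding principal submatrix of \bm{\Sigma}. The following hedged Julia sketch (illustrative only, and tractable only for small p; names are ours) makes this explicit by enumerating all supports:

```julia
# A brute-force sketch of Problem (1) for small p: for each size-k support, the best
# unit vector is the leading eigenvector of the corresponding principal submatrix of Σ.
using LinearAlgebra, Combinatorics

function sparse_pca_bruteforce(Σ::AbstractMatrix, k::Int)
    p = size(Σ, 1)
    best_val, best_support = -Inf, Int[]
    for S in combinations(1:p, k)              # all (p choose k) candidate supports
        val = eigmax(Symmetric(Σ[S, S]))       # λ_max of the k × k principal submatrix
        if val > best_val
            best_val, best_support = val, collect(S)
        end
    end
    return best_val, best_support
end
```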

1.1 Background and Literature Review

Owing to sparse PCA’s fundamental importance in a variety of applications including best subset selection (d’Aspremont et al., 2008), natural language processing (Zhang et al., 2012), compressed sensing (Candes and Tao, 2007), and clustering (Luss and d’Aspremont, 2010), three distinct classes of methods for addressing Problem (1) have arisen. Namely, (a) heuristic methods which obtain high-quality sparse PCs in an efficient fashion but do not supply guarantees on the quality of the solution, (b) convex relaxations which obtain certifiably near-optimal solutions by solving a convex relaxation and rounding, and (c) exact methods which obtain certifiably optimal solutions, albeit in exponential time.

Heuristic Approaches:

The importance of identifying a small number of interpretable principal components has been well-documented in the literature since the work of Hotelling (1933) (see also Jeffers, 1967), giving rise to many distinct heuristic approaches for obtaining high-quality solutions to Problem (1). Two interesting such approaches are to rotate dense principal components to promote sparsity (Kaiser, 1958; Richman, 1986; Jolliffe, 1995), or to apply an \ell_{1} penalty term as a convex surrogate for the cardinality constraint (Jolliffe et al., 2003; Zou et al., 2006). Unfortunately, the former approach does not provide performance guarantees, while the latter approach still results in a non-convex optimization problem.

More recently, motivated by the need to rapidly obtain high-quality sparse principal components at scale, a wide variety of first-order heuristic methods have emerged. The first such modern heuristic was developed by Journée et al. (2010), and involves combining the power method with thresholding and re-normalization steps. By pursuing similar ideas, several related methods have since been developed (see Witten et al., 2009; Hein and Bühler, 2010; Richtárik et al., 2020; Luss and Teboulle, 2013; Yuan and Zhang, 2013). Unfortunately, while these methods are often very effective in practice, they sometimes badly fail to recover an optimal sparse principal component, and a practitioner using a heuristic method typically has no way of knowing when this has occurred. Indeed, Berk and Bertsimas (2019) recently compared 7 heuristic methods, including most of those reviewed here, on 14 instances of sparse PCA, and found that none of the heuristic methods successfully recovered an optimal solution in all 14 cases (i.e., no heuristic was right all the time).

Convex Relaxations:

Motivated by the shortcomings of heuristic approaches on high-dimensional datasets, and the successful application of semi-definite optimization in obtaining high-quality approximation bounds in other applications (see Goemans and Williamson, 1995; Wolkowicz et al., 2012), a variety of convex relaxations have been proposed for sparse PCA. The first such convex relaxation was proposed by d’Aspremont et al. (2005), who reformulated sparse PCA as the rank-constrained mixed-integer semidefinite optimization problem (MISDO):

max𝑿𝟎𝚺,𝑿s.t.tr(𝑿)=1,𝑿0k2,Rank(𝑿)=1,\displaystyle\max_{\bm{X}\succeq\bm{0}}\quad\langle\bm{\Sigma},\bm{X}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \|\bm{X}\|_{0}\leq k^{2},\ \mathrm{Rank}(\bm{X})=1, (2)

where \bm{X} models the outer product \bm{x}\bm{x}^{\top}. Note that, for a rank-one matrix \bm{X}, the constraint \|\bm{X}\|_{0}\leq k^{2} in (2) is equivalent to the constraint \|\bm{x}\|_{0}\leq k in (1), since a vector \bm{x} is k-sparse if and only if its outer product \bm{x}\bm{x}^{\top} is k^{2}-sparse. After performing this reformulation, d’Aspremont et al. (2005) relaxed both the cardinality and rank constraints and instead solved

max𝑿𝟎𝚺,𝑿s.t.tr(𝑿)=1,𝑿1k,\displaystyle\max_{\bm{X}\succeq\bm{0}}\quad\langle\bm{\Sigma},\bm{X}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \|\bm{X}\|_{1}\leq k, (3)

which supplies a valid upper bound on Problem (1)’s objective.
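For concreteness, a hedged JuMP sketch of relaxation (3) follows (the function name and the optimizer argument are ours; any SDP-capable conic solver, e.g., Mosek, can be passed in):

```julia
# A hedged sketch of the d'Aspremont et al. (2005) relaxation (3): maximize <Σ, X>
# over PSD matrices with unit trace and an ℓ1-norm bound of k.
using JuMP, LinearAlgebra

function daspremont_relaxation(Σ::AbstractMatrix, k::Int, optimizer)
    p = size(Σ, 1)
    model = Model(optimizer)
    @variable(model, X[1:p, 1:p], PSD)
    @variable(model, U[1:p, 1:p] >= 0)     # U_{ij} models |X_{ij}|
    @constraint(model, tr(X) == 1)
    @constraint(model, U .>= X)
    @constraint(model, U .>= -X)
    @constraint(model, sum(U) <= k)        # the ℓ1 constraint ||X||_1 <= k
    @objective(model, Max, dot(Σ, X))
    optimize!(model)
    return objective_value(model)          # a valid upper bound on Problem (1)
end
```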

The semidefinite approach has since been refined in a number of follow-up works. Among others, d’Aspremont et al. (2008), building upon the work of Ben-Tal and Nemirovski (2002), proposed a different semidefinite relaxation which supplies a sufficient condition for optimality via the primal-dual KKT conditions, and d’Aspremont et al. (2014) analyzed the quality of the semidefinite relaxation in order to obtain high-quality approximation bounds. A common theme in these approaches is that they require solving large-scale semidefinite optimization problems. This presents difficulties for practitioners because state-of-the-art implementations of interior point methods such as Mosek require O(p^{6}) memory to solve Problem (3), and therefore currently cannot solve instances of Problem (3) with p\geq 300 (see Bertsimas and Cory-Wright, 2020, for a recent comparison). Techniques other than interior point methods, e.g., the ADMM or augmented Lagrangian methods reviewed in Majumdar et al. (2019), could also be used to solve Problem (3), although they tend to require more runtime than IPMs to obtain a solution of similar accuracy and to be numerically unstable at problem sizes where IPMs run out of memory (Majumdar et al., 2019).

A number of works have also studied the statistical estimation properties of Problem (3) by assuming an underlying probabilistic model. Among others, Amini and Wainwright (2008) have demonstrated the asymptotic consistency of Problem (3) under a spiked covariance model once the number of samples used to generate the covariance matrix exceeds a certain threshold; see Vu and Lei (2012); Berthet and Rigollet (2013); Wang et al. (2016) for further results in this direction, and Miolane (2018) for a recent survey.

In a complementary direction, Dey et al. (2018) have recently questioned the modeling paradigm of lifting \bm{x} to a higher dimensional space, by instead considering the following (tighter) relaxation of sparse PCA in the original problem space:

\max_{\bm{x}\in\mathbb{R}^{p}}\quad\bm{x}^{\top}\bm{\Sigma}\bm{x}\quad\text{s.t.}\quad\|\bm{x}\|_{2}=1,\ \|\bm{x}\|_{1}\leq\sqrt{k}. (4)

Interestingly, Problem (4)’s relaxation provides a \left(1+\sqrt{\tfrac{k}{k+1}}\right)^{2}-factor approximation bound on Problem (1)’s objective, while Problem (3)’s upper bound may be exponentially larger in the worst case (Amini and Wainwright, 2008). This additional tightness, however, comes at a price: Problem (4) is NP-hard to solve—indeed, providing a constant-factor guarantee on sparse PCA is NP-hard (Magdon-Ismail, 2017)—and thus (4) is best formulated as a MIO, while Problem (3) can be solved in polynomial time.

More recently, by building on the work of Kim and Kojima (2001), Bertsimas and Cory-Wright (2020) introduced a second-order cone relaxation of (2) which scales to p=1,000s, and matches the semidefinite bound after imposing a small number of cuts. Moreover, it typically supplies bound gaps of less than 5\%. However, it does not supply an exact certificate of optimality, which is often desirable, for instance in medical applications.

A fundamental drawback of existing convex relaxation techniques is that they are not coupled with rounding schemes for obtaining high-quality feasible solutions. This is problematic, because optimizers are typically interested in obtaining high-quality solutions, rather than certificates. In this paper, we take a step in this direction, by deriving new convex relaxations that naturally give rise to greedy and random rounding schemes. The fundamental point of difference between our relaxations and existing relaxations is that we derive our relaxations by rewriting sparse PCA as a MISDO and dropping an integrality constraint, rather than using more ad-hoc techniques.

Exact Methods:

Motivated by the successful application of mixed-integer optimization for solving statistical learning problems such as best subset selection (Bertsimas and Van Parys, 2020) and sparse classification (Bertsimas et al., 2017), several exact methods for solving sparse PCA to certifiable optimality have been proposed. The first branch-and-bound algorithm for solving Problem (1) was proposed by Moghaddam et al. (2006), by applying norm equivalence relations to obtain valid bounds. However, Moghaddam et al. (2006) did not couple their approach with high-quality initial solutions and tractable bounds to prune partial solutions. Consequently, they could not scale their approach beyond p=40.

A more sophisticated branch-and-bound scheme was recently proposed by Berk and Bertsimas (2019), which couples tighter Gershgorin Circle Theorem bounds (Horn and Johnson, 1990, Chapter 6) with a fast heuristic due to Yuan and Zhang (2013) to solve problems up to p=250. However, their method cannot scale beyond p=100s, because the bounds obtained are too weak to avoid enumerating a sizeable portion of the tree.

Recently, the authors developed a framework for reformulating convex mixed-integer optimization problems with logical constraints (see Bertsimas et al., 2019), and demonstrated that this framework allows a number of problems of practical relevance to be solved to certifiable optimality via a cutting-plane method. In this paper, we build upon this work by reformulating Problem (1) as a convex mixed-integer semidefinite optimization problem, and leverage this reformulation to design a cutting-plane method which solves sparse PCA to certifiable optimality. A key feature of our approach is that we need not solve any semidefinite subproblems. Rather, we use concepts from SDO to design a semidefinite-free approach which uses simple linear algebra techniques.

Concurrently with our initial submission, Li and Xie (2020) also reformulated sparse PCA as an MISDO, and proposed valid inequalities for strengthening their formulation and local search algorithms for obtaining high-quality solutions at scale. Our work differs in the following two ways. First, we propose strengthening the MISDO formulation using the Gershgorin circle theorem and demonstrate that this allows our MISDO formulation to scale to problems with p=100s of features, while they do not, to our knowledge, solve any MISDOs to certifiable optimality where p>13. Second, we develop tractable second-order cone relaxations and greedy rounding schemes which allow practitioners to obtain certifiably near-optimal sparse principal components even in the presence of p=1,000s of features. More remarkable than the differences between the works, however, are the similarities: more than 15 years after d’Aspremont et al. (2005)’s landmark paper first appeared, both works proposed reformulating sparse PCA as an MISDO less than a week apart. In our view, this demonstrates that the ideas contained in both works transcend sparse PCA, and can perhaps be applied to other problems in the optimization literature which have not yet been formulated as MISDOs.

1.2 Contributions and Structure

The main contributions of the paper are twofold. First, we reformulate sparse PCA exactly as a mixed-integer semidefinite optimization problem; a reformulation which is, to the best of our knowledge, novel. Second, we leverage this MISDO formulation to design efficient algorithms for solving non-convex mixed-integer quadratic optimization problems, such as sparse PCA, to certifiable optimality or within 1-2\% of optimality in practice at a larger scale than existing state-of-the-art methods. The structure and detailed contributions of the paper are as follows:

  • In Section 2.1, we reformulate Problem (1) as a mixed-integer SDO. We propose a cutting-plane method which solves it to certifiable optimality in Section 2.2. Our algorithm decomposes the problem into a purely binary master problem and a semidefinite separation problem. Interestingly, we show in Section 2.3 that the separation problems can be solved efficiently via a leading eigenvalue computation and do not require any SDO solver. Finally, the Gershgorin circle theorem has been empirically successful for deriving upper bounds on the objective value of (1) (Berk and Bertsimas, 2019). We theoretically analyze the quality of such bounds in Section 2.4 and show in Section 2.5 that tighter bounds derived from Brauer’s ovals of Cassini can also be imposed via mixed-integer second-order cone constraints.

  • In Section 3, we analyze the semidefinite reformulation’s convex relaxation, and introduce a greedy rounding scheme (Section 3.1) which supplies high-quality solutions to Problem (1) in polynomial time, together with a sub-optimality gap (see numerical experiments in Section 4). To further improve the quality of the rounded solution and the optimality gap, we introduce strengthening inequalities (Section 3.2). While solving the strengthened formulation exactly would result in an intractable MISDO problem, solving its relaxation and rounding the solution is an efficient strategy for returning high-quality solutions with a numerical certificate of near-optimality.

  • In Section 4, we apply the cutting-plane and random rounding methods to derive optimal and near optimal sparse principal components for problems in the UCI dataset. We also compare our method’s performance against the method of Berk and Bertsimas (2019), and find that our exact cutting-plane method performs comparably, while our relax+round approach successfully scales to problems an order of magnitude larger and often returns solutions which outperform the exact method at sizes which the exact method cannot currently scale to. A key feature of our numerical success is that we sidestep the computational difficulties in solving SDOs at scale by proposing semidefinite-free methods for solving the convex relaxations, i.e., solving second-order cone relaxations.

Notation:

We let nonbold face characters such as b denote scalars, lowercase bold faced characters such as \bm{x} denote vectors, uppercase bold faced characters such as \bm{X} denote matrices, and calligraphic uppercase characters such as \mathcal{Z} denote sets. We let [p] denote the set of running indices \{1,...,p\}. We let \mathbf{e} denote a vector of all 1’s, \bm{0} denote a vector of all 0’s, and \mathbb{I} denote the identity matrix, with dimension implied by the context.

We also use an assortment of matrix operators. We let \langle\cdot,\cdot\rangle denote the Euclidean inner product between two matrices, \|\cdot\|_{F} denote the Frobenius norm of a matrix, \|\cdot\|_{\sigma} denote the spectral norm of a matrix, \|\cdot\|_{*} denote the nuclear norm of a matrix, \bm{X}^{\dagger} denote the Moore-Penrose pseudoinverse of a matrix \bm{X}, and S_{+}^{p} denote the p\times p positive semidefinite cone; see Horn and Johnson (1990) for a general theory of matrix operators.

2 An Exact Mixed-Integer Semidefinite Optimization Algorithm

In Section 2.1, we reformulate Problem (1) as a convex mixed-integer semidefinite optimization problem. From this formulation, we propose an outer-approximation scheme (Section 2.2) which, as we show in Section 2.3, does not require solving any semidefinite problems. We improve the convergence of the algorithm by deriving high-quality upper bounds on Problem (1)’s objective value in Sections 2.4 and 2.5.

2.1 A Mixed-Integer Semidefinite Reformulation

Starting from the rank-constrained SDO formulation (2), we introduce binary variables z_{i} to model whether X_{i,j} is non-zero, via the logical constraint X_{i,j}=0 if z_{i}=0; note that we need not require that X_{i,j}=0 if z_{j}=0, since \bm{X} is a symmetric matrix. By enforcing the logical constraint via -M_{i,j}z_{i}\leq X_{i,j}\leq M_{i,j}z_{i} for sufficiently large M_{i,j}>0, Problem (2) becomes

max𝒛{0,1}p:𝒆𝒛kmax𝑿S+p\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad 𝚺,𝑿\displaystyle\langle\bm{\Sigma},\bm{X}\rangle
s.t. tr(𝑿)=1,Mi,jziXi,jMi,jzii,j[p],Rank(𝑿)=1.\displaystyle\mathrm{tr}(\bm{X})=1,\ -M_{i,j}z_{i}\leq X_{i,j}\leq M_{i,j}z_{i}\ \forall i,j\in[p],\ \mathrm{Rank}(\bm{X})=1.

To obtain a MISDO reformulation, we omit the rank constraint. In general, omitting a rank constraint generates a relaxation and induces some loss of optimality. Remarkably, the omission is without loss of optimality in this case. Indeed, the objective is linear (hence convex) and therefore some rank-one extreme matrix \bm{X} is optimal. We formalize this observation in the following theorem; note that a similar result—although in the context of computing Restricted Isometry constants and with a different proof—exists (Gally and Pfetsch, 2016):

Theorem 1

Problem (1) attains the same optimal objective value as the problem:

\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad \langle\bm{\Sigma},\bm{X}\rangle (5)
\text{s.t.}\quad \mathrm{tr}(\bm{X})=1,
\quad\quad |X_{i,j}|\leq M_{i,j}z_{i}\quad\forall i,j\in[p],
\quad\quad \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i}\quad\forall i\in[p],

where M_{i,i}=1, and M_{i,j}=\frac{1}{2} if j\neq i.

Remark 2

Observe that if k\leq\sqrt{n} and we set M_{i,j}=1\ \forall i,j\in[p] in Problem (5) and omit the valid inequality \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i}, then the optimal value of the continuous relaxation is trivially \lambda_{\max}(\bm{\Sigma}). Indeed, letting \bm{x} be a leading eigenvector of the unconstrained problem (where \|\bm{x}\|_{2}=1), we can set z_{i}=|x_{i}|\geq|x_{i}||x_{j}|, where the inequality holds since \|\bm{x}\|_{2}=1, and X_{i,j}=x_{i}x_{j}, meaning (a) \sum_{i}z_{i}=\|\bm{x}\|_{1}\leq k\leq\sqrt{n} by norm equivalence and (b) |X_{i,j}|\leq z_{i}, and thus (\bm{X},\bm{z}) solves this continuous relaxation. Therefore, setting M_{i,j}=\frac{1}{2} if j\neq i and/or imposing the valid inequality \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i} is necessary for obtaining non-trivial relaxations whenever k is small.

Proof  It suffices to demonstrate that for any feasible solution to (1) we can construct a feasible solution to (5) with an equal or greater payoff, and vice versa.

  • Let \bm{x}\in\mathbb{R}^{p} be a feasible solution to (1). Then, since \|\bm{x}\|_{1}\leq\sqrt{k}, (\bm{X}:=\bm{x}\bm{x}^{\top},\bm{z}) is a feasible solution to (5) with equal cost, where z_{i}=1 if |x_{i}|>0, and z_{i}=0 otherwise.

  • Let (\bm{X},\bm{z}) be a feasible solution to Problem (5), and let \bm{X}=\sum_{i=1}^{p}\sigma_{i}\bm{x}_{i}\bm{x}_{i}^{\top} be a Cholesky decomposition of \bm{X}, where \bm{e}^{\top}\bm{\sigma}=1,\ \bm{\sigma}\geq\bm{0}, and \|\bm{x}_{i}\|_{2}=1\ \forall i\in[p]. Observe that \|\bm{x}_{i}\|_{0}\leq k\ \forall i\in[p], since we can perform the Cholesky decomposition on the submatrix of \bm{X} induced by \bm{z}, and “pad” out the remaining entries of each \bm{x}_{i} with 0s to obtain the decomposition of \bm{X}. Therefore, let us set \hat{\bm{x}}:=\arg\max_{i}[\bm{x}_{i}^{\top}\bm{\Sigma}\bm{x}_{i}]. Then, \hat{\bm{x}} is a feasible solution to (1) with an equal or greater payoff.

Finally, we let M_{i,i}=1 and M_{i,j}=\frac{1}{2} if i\neq j, as the 2\times 2 minors imply X_{i,j}^{2}\leq X_{i,i}X_{j,j}\leq\frac{1}{4} whenever i\neq j (c.f. Gally and Pfetsch, 2016, Lemma 1).
Theorem 1 reformulates Problem (1) as a mixed-integer SDO. Therefore, we can solve Problem (5) using general branch-and-cut techniques for semidefinite optimization problems (see Gally et al., 2018; Kobayashi and Takano, 2020). However, this approach is not scalable, as it involves solving a large number of semidefinite subproblems, and the community does not know how to efficiently warm-start interior point methods (IPMs) for SDOs.

Alternatively, we propose a saddle-point reformulation of Problem (5) which avoids the computational difficulty of solving a large number of SDOs by exploiting problem structure, as we will show in Section 2.3. The following result reformulates Problem (5) as a max-min saddle-point problem amenable to outer-approximation:

Theorem 3

Problem (5) attains the same optimal value as the following problem:

\max_{\bm{z}\in\{0,1\}^{p}:\ \bm{e}^{\top}\bm{z}\leq k}\quad f(\bm{z}) (6)
\text{where}\quad f(\bm{z}):=\min_{\lambda\in\mathbb{R},\ \bm{\alpha}\in\mathbb{R}^{p\times p},\ \bm{\beta}\in\mathbb{R}^{p}}\quad \lambda+\sum_{i=1}^{p}z_{i}\left(\sum_{j=1}^{p}M_{i,j}\max(0,|\alpha_{i,j}|-\beta_{i})+\sqrt{k}\beta_{i}\right) (7)
\text{s.t.}\quad \lambda\mathbb{I}+\bm{\alpha}\succeq\bm{\Sigma}.
Remark 4

The above theorem demonstrates that f(\bm{z}) is concave in \bm{z}, by rewriting it as the infimum of functions which are linear in \bm{z} (Boyd and Vandenberghe, 2004).

Proof  Let us introduce auxiliary variables Ui,jU_{i,j} to model the absolute value of Xi,jX_{i,j} and rewrite the inner optimization problem of (5) as

f(\bm{z}):=\max_{\bm{X}\succeq\bm{0},\ \bm{U}}\quad \langle\bm{\Sigma},\bm{X}\rangle (8)
\text{s.t.}\quad \mathrm{tr}(\bm{X})=1, \quad [\lambda]
\quad\quad U_{i,j}\leq M_{i,j}z_{i}\quad \forall i,j\in[p], \quad [\sigma_{i,j}]
\quad\quad |X_{i,j}|\leq U_{i,j}\quad \forall i,j\in[p], \quad [\alpha_{i,j}]
\quad\quad \sum_{j=1}^{p}U_{i,j}\leq\sqrt{k}z_{i}\quad \forall i\in[p], \quad [\beta_{i}]

where we associate dual constraint multipliers with primal constraints in square brackets. For \bm{z} such that \bm{e}^{\top}\bm{z}\geq 1, the maximization problem induced by f(\bm{z}) satisfies Slater’s condition (see, e.g., Boyd and Vandenberghe, 2004, Chapter 5.2.3), so strong duality applies and leads to

f(𝒛)=minλ𝝈,𝜶,𝜷𝟎\displaystyle f(\bm{z})=\min_{\begin{subarray}{c}\lambda\\ \bm{\sigma},\bm{\alpha},\bm{\beta}\geq\bm{0}\end{subarray}}\quad λ+i,jσi,jMi,jzi+i=1pβikzi\displaystyle\lambda+\sum_{i,j}\sigma_{i,j}M_{i,j}z_{i}+\sum_{i=1}^{p}\beta_{i}\sqrt{k}z_{i}
s.t. λ𝕀+𝜶𝚺,|αi,j|σi,j+βi.\displaystyle\lambda\mathbb{I}+\bm{\alpha}\succeq\bm{\Sigma},|\alpha_{i,j}|\leq\sigma_{i,j}+\beta_{i}.

We eliminate 𝝈\bm{\sigma} from the dual problem above by optimizing over σi,j\sigma_{i,j} and setting σi,j=max(0,|αi,j|βi)\sigma^{\star}_{i,j}=\max(0,|\alpha_{i,j}|-\beta_{i}).

Note that for 𝒛=𝟎\bm{z}=\bm{0}, the primal subproblem is infeasible and the dual subproblem has objective -\infty, but this can safely be ignored since 𝒛=𝟎\bm{z}=\bm{0} is certainly suboptimal.  

2.2 A Cutting-Plane Method

Theorem 3 shows that evaluating f(\hat{\bm{z}}) yields the globally valid overestimator:

f(\bm{z})\leq f(\hat{\bm{z}})+\bm{g}_{\hat{\bm{z}}}^{\top}(\bm{z}-\hat{\bm{z}}),

where \bm{g}_{\hat{\bm{z}}} is a supergradient of f at \hat{\bm{z}}, at no additional cost. In particular, we have

g_{\hat{\bm{z}},i}=\sum_{j=1}^{p}M_{i,j}\max\left(0,|\alpha_{i,j}^{\star}(\hat{\bm{z}})|-\beta_{i}^{\star}(\hat{\bm{z}})\right)+\sqrt{k}\beta_{i}^{\star}(\hat{\bm{z}}),

where \bm{\alpha}^{\star}(\hat{\bm{z}}), \bm{\beta}^{\star}(\hat{\bm{z}}) constitute an optimal choice of (\bm{\alpha},\bm{\beta}) for a fixed \hat{\bm{z}}. This observation leads to an efficient strategy for maximizing f(\bm{z}): iteratively maximizing and refining a piecewise linear upper estimator of f(\bm{z}). This strategy is called outer-approximation (OA), and was originally proposed by Duran and Grossmann (1986). OA works by iteratively constructing estimators of the following form at each iteration t:

f^{t}(\bm{z})=\min_{1\leq i\leq t}\left\{f(\bm{z}_{i})+\bm{g}_{\bm{z}_{i}}^{\top}(\bm{z}-\bm{z}_{i})\right\}. (9)

After constructing each overestimator, we maximize f^{t}(\bm{z}) over \{0,1\}^{p} to obtain \bm{z}_{t}, and evaluate f(\cdot) and its supergradient at \bm{z}_{t}. This procedure yields a non-increasing sequence of overestimators \{f^{t}(\bm{z}_{t})\}_{t=1}^{T} which converge to the optimal value of f(\bm{z}) within a finite number of iterations T\leq{p\choose 1}+\ldots+{p\choose k}, since \{0,1\}^{p} is a finite set and OA never visits a point twice. Additionally, we can avoid solving a different MILO at each OA iteration by integrating the entire algorithm within a single branch-and-bound tree, as proposed by Quesada and Grossmann (1992), using lazy constraint callbacks. Lazy constraint callbacks are now standard components of modern MILO solvers such as Gurobi or CPLEX and substantially speed up OA. We formalize this procedure in Algorithm 1; note that \partial f(\bm{z}_{t+1}) denotes the set of supergradients of f at \bm{z}_{t+1}.

Algorithm 1 An outer-approximation method for Problem (1)
Input: Initial solution \bm{z}_{1}
t\leftarrow 1
repeat
  Compute \bm{z}_{t+1},\theta_{t+1} solution of
    \max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k,\ \theta}\ \theta\quad\text{s.t.}\quad\theta\leq f(\bm{z}_{i})+\bm{g}_{\bm{z}_{i}}^{\top}(\bm{z}-\bm{z}_{i})\ \forall i\in[t]
  Compute f(\bm{z}_{t+1}) and \bm{g}_{\bm{z}_{t+1}}\in\partial f(\bm{z}_{t+1}) by solving (7)
  t\leftarrow t+1
until \theta_{t}-f(\bm{z}_{t})\leq\varepsilon
return \bm{z}_{t}
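To make the procedure concrete, the following hedged Julia/JuMP sketch implements the basic multi-tree variant of Algorithm 1 (the single-tree lazy-callback implementation used in practice follows the same logic). The oracle `f_and_grad`, the function name, and the optimizer argument are ours; a suitable oracle is derived in Section 2.3:

```julia
# A hedged sketch of the outer-approximation loop in Algorithm 1. `f_and_grad(z)` is
# assumed to return f(z) and a supergradient; `optimizer` is a MILO solver (e.g., Gurobi).
using JuMP, LinearAlgebra

function outer_approximation(f_and_grad, p::Int, k::Int, z1::Vector{Bool}, optimizer;
                             ε = 1e-3, maxiter = 10_000)
    model = Model(optimizer)
    @variable(model, z[1:p], Bin)
    @variable(model, θ)
    @constraint(model, sum(z) <= k)
    @objective(model, Max, θ)
    zt = z1
    ft, gt = f_and_grad(zt)
    for _ in 1:maxiter
        # Add the overestimator cut θ ≤ f(z_t) + g_t'(z - z_t) and re-solve the master.
        @constraint(model, θ <= ft + dot(gt, z .- zt))
        optimize!(model)
        zt = round.(Bool, value.(z))
        ft, gt = f_and_grad(zt)
        value(θ) - ft <= ε && break    # master bound meets incumbent value: stop
    end
    return zt, ft   # for brevity we return the last iterate; a full implementation tracks the best incumbent
end
```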

2.3 A Semidefinite-free Subproblem Strategy

Our derivation and analysis of Algorithm 1 indicate that we can solve Problem (1) to certifiable optimality by solving a (potentially large) number of semidefinite subproblems (7), which might be prohibitive in practice. Therefore, we now derive a computationally efficient subproblem strategy which crucially does not require solving any semidefinite programs. Formally, we have the following result:

Theorem 5

For any \bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k, optimal dual variables in (7) are

\lambda=\lambda_{\max}\left(\bm{\Sigma}_{1,1}\right),\quad \hat{\bm{\alpha}}=\begin{pmatrix}\hat{\bm{\alpha}}_{1,1}&\hat{\bm{\alpha}}_{1,2}\\ \hat{\bm{\alpha}}_{1,2}^{\top}&\hat{\bm{\alpha}}_{2,2}\end{pmatrix}=\begin{pmatrix}\bm{0}&\bm{0}\\ \bm{0}&\bm{\Sigma}_{2,2}-\lambda\mathbb{I}+\bm{\Sigma}_{1,2}^{\top}\left(\lambda\mathbb{I}-\bm{\Sigma}_{1,1}\right)^{\dagger}\bm{\Sigma}_{1,2}\end{pmatrix}, (10)
\beta_{i}=(1-z_{i})\begin{pmatrix}|\hat{\alpha}_{i,1}|,|\hat{\alpha}_{i,2}|,\ldots,|\hat{\alpha}_{i,i}|,|\hat{\alpha}_{i,i}|,\ldots,|\hat{\alpha}_{i,p}|\end{pmatrix}_{[\lceil\sqrt{k}\,\rceil]}\quad\forall i\in[p], (11)

where \lambda_{\max}(\cdot) denotes the leading eigenvalue of a matrix, \hat{\bm{\alpha}}=\begin{pmatrix}\hat{\bm{\alpha}}_{1,1}&\hat{\bm{\alpha}}_{1,2}\\ \hat{\bm{\alpha}}_{1,2}^{\top}&\hat{\bm{\alpha}}_{2,2}\end{pmatrix} is a permutation such that \hat{\bm{\alpha}}_{1,1} (resp. \hat{\bm{\alpha}}_{2,2}) denotes the entries of \hat{\bm{\alpha}} where z_{i}=z_{j}=1 (resp. z_{i}=z_{j}=0), \bm{\Sigma} is partitioned similarly, and (\bm{x})_{[k]} denotes the kth largest element of \bm{x}.

Remark 6

By Theorem 5, Problem (7) can be solved by computing the leading eigenvalue of \bm{\Sigma}_{1,1} and solving a linear system. This justifies our claim that we need not solve any SDOs in our algorithmic strategy.

Proof  We appeal to strong duality and complementary slackness. Observe that, for any \bm{z}\in\{0,1\}^{p}, f(\bm{z}) is the optimal value of a maximization problem over a closed convex compact set. Therefore, there exists some optimal primal solution \bm{X}^{\star} without loss of generality. Moreover, since the primal has non-empty relative interior with respect to the non-affine constraints, it satisfies the Slater constraint qualification and strong duality holds (see, e.g., Boyd and Vandenberghe, 2004, Chapter 5.2.3). Therefore, by complementary slackness (see, e.g., Boyd and Vandenberghe, 2004, Chapter 5.5.2), there must exist some dual-optimal solution (\lambda,\hat{\bm{\alpha}},\bm{\beta}) which obeys complementarity with \bm{X}^{\star}. Moreover, |X_{i,j}|\leq M_{i,j} is implied by \mathrm{tr}(\bm{X})=1,\ \bm{X}\succeq\bm{0}, while \sum_{j=1}^{p}|X_{i,j}|\leq z_{i}\sqrt{k} is implied by |X_{i,j}|\leq M_{i,j}z_{i} and \bm{e}^{\top}\bm{z}\leq k. Therefore, by complementary slackness, we can take the constraints |X_{i,j}|\leq M_{i,j}z_{i} and \sum_{j=1}^{p}|X_{i,j}|\leq z_{i}\sqrt{k} to be inactive when z_{i}=1 without loss of generality, which implies that \hat{\alpha}_{i,j}^{\star},\beta_{i}^{\star}=0 if z_{i}=1 in some dual-optimal solution. Moreover, we also have \hat{\alpha}_{i,j}^{\star}=0 if z_{j}=1, since \hat{\bm{\alpha}} obeys the dual feasibility constraint \lambda\mathbb{I}+\hat{\bm{\alpha}}\succeq\bm{\Sigma}, and therefore is itself symmetric.

Next, observe that, by strong duality, \lambda=\lambda_{\max}(\bm{\Sigma}_{1,1}) in this dual-optimal solution, since \bm{\alpha} only takes non-zero values if z_{i}=z_{j}=0 and does not contribute to the objective, and \bm{\beta} is similar.

Next, observe that, by strong duality and complementary slackness, any dual feasible (\lambda,\hat{\bm{\alpha}},\bm{\beta}) satisfying the above conditions is dual-optimal. Therefore, we need to find an \hat{\bm{\alpha}}_{2,2} such that

\begin{pmatrix}\lambda\mathbb{I}-\bm{\Sigma}_{1,1}&-\bm{\Sigma}_{1,2}\\ -\bm{\Sigma}_{2,1}&\lambda\mathbb{I}+\hat{\bm{\alpha}}_{2,2}-\bm{\Sigma}_{2,2}\end{pmatrix}\succeq\bm{0}.

By the generalized Schur complement lemma (see Boyd et al., 1994, Equation 2.41), this is PSD if and only if

  1. \lambda\mathbb{I}-\bm{\Sigma}_{1,1}\succeq\bm{0},

  2. \left(\mathbb{I}-(\lambda\mathbb{I}-\bm{\Sigma}_{1,1})(\lambda\mathbb{I}-\bm{\Sigma}_{1,1})^{\dagger}\right)\bm{\Sigma}_{1,2}=\bm{0}, and

  3. \lambda\mathbb{I}+\hat{\bm{\alpha}}_{2,2}-\bm{\Sigma}_{2,2}\succeq\bm{\Sigma}_{1,2}^{\top}\left(\lambda\mathbb{I}-\bm{\Sigma}_{1,1}\right)^{\dagger}\bm{\Sigma}_{1,2}.

The first two conditions hold because, as argued above, λ\lambda is optimal and therefore feasible, and the conditions are independent of 𝜶^2,2\hat{\bm{\alpha}}_{2,2}. Therefore, it suffices to pick 𝜶^2,2\hat{\bm{\alpha}}_{2,2} in order that the third condition holds. We achieve this by setting 𝜶^2,2\hat{\bm{\alpha}}_{2,2} so the PSD constraint in condition (3) holds with equality.

Finally, let us optimize for \bm{\beta} to obtain stronger cuts (when z_{i}=0 we can pick any feasible \beta_{i}, but optimizing to set \partial f(\bm{z})_{i} to be as small as possible gives stronger cuts). This is equivalent to solving the following univariate minimization problem for each \beta_{i}:

\min_{\beta_{i}}\left(\sum_{j=1}^{p}M_{i,j}\max(0,|\alpha_{i,j}|-\beta_{i})+\sqrt{k}\beta_{i}\right).

Moreover, it is a standard result from max-k optimization (see, e.g., Zakeri et al., 2014; Todd, 2018) that this is achieved by setting \beta_{i} to be the \lceil\sqrt{k}\,\rceil largest element of \{\alpha_{i,j}\}_{j\in[p]}\cup\{\alpha_{i,i}\} in absolute magnitude, where we include \alpha_{i,i} twice since M_{i,i}=1 while M_{i,j}=1/2 if j\neq i.
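The following hedged Julia sketch (names ours) implements this subproblem strategy: it computes f(\bm{z}) via a leading eigenvalue, recovers \hat{\bm{\alpha}} and \bm{\beta} from (10)-(11) using a pseudoinverse, and assembles the supergradient used in Algorithm 1's cuts:

```julia
# A hedged sketch of the semidefinite-free subproblem strategy of Theorem 5: returns
# f(z) = λ_max(Σ_{1,1}) and a supergradient assembled from the dual variables (10)-(11).
using LinearAlgebra

function evaluate_fz(Σ::AbstractMatrix, z::AbstractVector{Bool}, k::Int)
    p = size(Σ, 1)
    S, C = findall(z), findall(.!z)                 # support (z_i = 1) and complement
    λ = isempty(S) ? 0.0 : eigmax(Symmetric(Σ[S, S]))
    α = zeros(p, p)
    if !isempty(C) && !isempty(S)
        α[C, C] = Σ[C, C] - λ * I +
                  Σ[S, C]' * pinv(λ * I - Σ[S, S]) * Σ[S, C]
    elseif !isempty(C)
        α[C, C] = Σ[C, C] - λ * I
    end
    # β_i is the ⌈√k⌉-th largest |α_{i,j}|, with the diagonal entry counted twice.
    β = zeros(p)
    for i in C
        vals = sort(vcat(abs.(α[i, :]), abs(α[i, i])), rev = true)
        β[i] = vals[min(ceil(Int, sqrt(k)), length(vals))]
    end
    # Supergradient: g_i = Σ_j M_{ij} max(0, |α_{ij}| - β_i) + √k β_i, with M_{ii} = 1.
    M = fill(0.5, p, p); M[diagind(M)] .= 1.0
    g = [sum(M[i, j] * max(0.0, abs(α[i, j]) - β[i]) for j in 1:p) + sqrt(k) * β[i]
         for i in 1:p]
    return λ, g                                     # f(z) and a supergradient
end
```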

2.4 Strengthening the Master Problem via the Gershgorin Circle Theorem

To accelerate Algorithm 1, we strengthen the master problem by imposing bounds derived from the Gershgorin circle theorem. Formally, we have the following result, which can be deduced from (Horn and Johnson, 1990, Theorem 6.1.1):

Theorem 7

For any vector \bm{z}\in\{0,1\}^{p} we have the following upper bound on f(\bm{z}):

f(\bm{z})\leq\max_{j\in[p]:z_{j}=1}\ \sum_{i\in[p]}z_{i}|\Sigma_{i,j}|. (12)

Observe that this bound cannot be used to directly strengthen Algorithm 1’s master problem, since the bound is not convex in \bm{z}. Nonetheless, it can be successfully applied if we (a) impose a big-M assumption on Problem (1)’s optimal objective and (b) introduce p additional binary variables \bm{s}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{s}=1 which model whether the ith Gershgorin disc is active; recall that each eigenvalue is contained in the union of the discs. Formally, we impose the following valid inequalities in the master problem:

\exists\bm{s}\in\{0,1\}^{p}:\quad \theta\leq\sum_{i\in[p]}z_{i}|\Sigma_{i,j}|+M(1-s_{j})\ \forall j\in[p],\quad \bm{e}^{\top}\bm{s}=1,\quad \bm{s}\leq\bm{z}, (13)

where \theta is the epigraph variable maximized in the master problem stated in Algorithm 1, and M is an upper bound on the sum of the k largest absolute entries in any column of \bm{\Sigma}. Note that we set \bm{s}\leq\bm{z} since, if z_{i}=0, the ith column of \bm{\Sigma} does not feature in the relevant submatrix of \bm{\Sigma}. In the above inequalities, a valid M is given by any bound on the optimal objective. Since Theorem 7 supplies one such bound for any given \bm{z}, we can compute

M:=\max_{j\in[p]}\ \max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \sum_{i\in[p]}z_{i}|\Sigma_{i,j}|, (14)

which can be done in O(p^{2}) time.

To further improve Algorithm 1, we also make use of the Gershgorin circle theorem before generating each outer-approximation cut. Namely, at a given node in a branch-and-bound tree, there are indices i where z_{i} has been fixed to 1, indices i where z_{i} has been fixed to 0, and indices i where z_{i} has not yet been fixed. Accordingly, we compute the worst-case Gershgorin bound by taking the worst-case bound over each index j such that z_{j} has not yet been fixed to 0, i.e.,

\max_{j:z_{j}\neq 0}\left\{\max_{\bm{s}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{s}\leq k}\left\{\sum_{i\in[p]}s_{i}|\Sigma_{i,j}|\ \text{s.t.}\ s_{i}=0\ \text{if}\ z_{i}=0,\ s_{i}=1\ \text{if}\ z_{i}=1\right\}\right\}.

If this bound is larger than our incumbent solution, then we generate an outer-approximation cut; otherwise, the entire subtree rooted at this node does not contain an optimal solution, and we instruct the solver to avoid exploring this node via a callback.
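A hedged Julia sketch of this node-level bound follows (index sets and names are ours); with no indices fixed, the same computation also recovers the constant M in (14):

```julia
# A hedged sketch of the worst-case Gershgorin bound at a branch-and-bound node, given
# the indices fixed to one (S1) and to zero (S0); the remaining indices are free.
function node_gershgorin_bound(Σ::AbstractMatrix, k::Int, S1::Vector{Int}, S0::Vector{Int})
    p = size(Σ, 1)
    free = setdiff(1:p, union(S1, S0))
    best = -Inf
    for j in setdiff(1:p, S0)                       # columns not yet fixed out
        fixed = sum(abs.(Σ[S1, j]))                 # entries forced into the disc radius
        budget = k - length(S1)                     # how many free entries may still be added
        extra = sum(sort(abs.(Σ[free, j]), rev = true)[1:min(budget, length(free))])
        best = max(best, fixed + extra)
    end
    return best
end
```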

Our numerical results in Section 4 echo the empirical findings of Berk and Bertsimas (2019) and indicate that Algorithm 1 performs substantially better when the Gershgorin bound is supplied in the master problem. Therefore, it is interesting to theoretically investigate the tightness, or at least the quality, of Gershgorin’s bound. We supply some results in this direction in the following proposition:

Proposition 8

Suppose that \bm{\Sigma} is a scaled diagonally dominant matrix as defined by Boman et al. (2005), i.e., there exists some vector \bm{d}>\bm{0} such that

d_{i}\Sigma_{i,i}\geq\sum_{j\in[p]:j\neq i}d_{j}|\Sigma_{i,j}|\quad \forall i\in[p].

Then, letting \rho:=\max_{i,j\in[p]}\{\tfrac{d_{i}}{d_{j}}\}, the Gershgorin circle theorem provides a (1+\rho)-factor approximation, i.e.,

f(\bm{z})\leq\max_{j\in[p]}\left\{\sum_{i\in[p]}z_{i}|\Sigma_{i,j}|\right\}\leq(1+\rho)f(\bm{z})\quad\forall\bm{z}\in\{0,1\}^{p}. (15)
Remark 9

Observe that, for a fixed \bm{z}, the ratio \rho:=\max_{i,j\in[p]}\{\tfrac{d_{i}}{d_{j}}\} need only be computed over indices i,j such that z_{i}=z_{j}=1. Moreover, for a partially specified \bm{z}—which might arise at an intermediate node in a branch-and-bound tree generated by Algorithm 1—the ratio \rho need only be computed over indices i where z_{i} is unspecified or set to 1. This suggests that the quality of the Gershgorin bound improves upon branching.

Remark 10

In particular, if \bm{\Sigma}\in S^{p}_{+} is a diagonal matrix, then Equation (12)’s bound is tight, which follows from the fact that the spectrum of \bm{\Sigma} and the discs coincide if and only if \bm{\Sigma} is diagonal (see, e.g., Horn and Johnson, 1990, Chapter 6). Alternatively, if \bm{\Sigma} is a diagonally dominant matrix, then \rho=1 and the Gershgorin circle theorem provides a 2-factor approximation.

Proof  Scaled diagonally dominant matrices have scaled diagonally dominant principal minors—this is trivially true because

diΣi,ij[p]:jidj|Σi,j|i[p]diΣi,ij[p]:jidjzj|Σi,j|i[p]:zi=1d_{i}\Sigma_{i,i}\geq\sum_{j\in[p]:j\neq i}d_{j}|\Sigma_{i,j}|\ \forall i\in[p]\implies d_{i}\Sigma_{i,i}\geq\sum_{j\in[p]:j\neq i}d_{j}z_{j}|\Sigma_{i,j}|\ \forall i\in[p]:z_{i}=1

for the same vector 𝒅>𝟎\bm{d}>\bm{0} and therefore the following chain of inequalities holds

f(𝒛)\displaystyle f(\bm{z})\leq maxj[p]{i[p]zi|Σi,j|}=maxj[p]{zjΣj,j+i[p]:jizi|Σi,j|}\displaystyle\max_{j\in[p]}\{\sum_{i\in[p]}z_{i}|\Sigma_{i,j}|\}=\max_{j\in[p]}\{z_{j}\Sigma_{j,j}+\sum_{i\in[p]:j\neq i}z_{i}|\Sigma_{i,j}|\}
maxj[p]{zjΣj,j+i[p]:jiρdidjzi|Σi,j|}(1+ρ)maxj[p]{zjΣj,j}(1+ρ)f(𝒛)𝒛{0,1}p,\displaystyle\leq\max_{j\in[p]}\{z_{j}\Sigma_{j,j}+\sum_{i\in[p]:j\neq i}\rho\frac{d_{i}}{d_{j}}z_{i}|\Sigma_{i,j}|\}\leq(1+\rho)\max_{j\in[p]}\{z_{j}\Sigma_{j,j}\}\leq(1+\rho)f(\bm{z})\quad\forall\bm{z}\in\{0,1\}^{p},

where the second inequality follows because ρdidj\rho\geq\frac{d_{i}}{d_{j}}, the third inequality follows from the scaled diagonal dominance of the principal submatrices of 𝚺\bm{\Sigma}, and the fourth inequality holds because the leading eigenvalue of a PSD matrix is at least as large as each diagonal.  

To make clear the extent to which our numerical success depends upon Theorem 7, our results in Section 4 present implementations of Algorithm 1 both with and without the bound.

2.5 Beyond Gershgorin: Further Strengthening via Brauer’s Ovals of Cassini

Given the relevance of Gershgorin’s bound, we propose, in this section, a stronger, yet more expensive to implement, upper bound, based on a generalization of the Gershgorin circle theorem, namely Brauer’s ovals of Cassini.

First, we derive a new upper bound on f(\bm{z}) that is at least as strong as the one presented in Theorem 7 and often strictly stronger (Horn and Johnson, 1990, Chapter 6):

Theorem 11

For any vector \bm{z}\in\{0,1\}^{p}, we have the following upper bound on f(\bm{z}):

f(\bm{z})\leq\max_{i,j\in[p]:i>j,\ z_{i}=z_{j}=1}\left\{\frac{\Sigma_{i,i}+\Sigma_{j,j}}{2}+\frac{\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}(\bm{z})R_{j}(\bm{z})}}{2}\right\}, (16)

where R_{i}(\bm{z}):=\sum_{j\in[p]:j\neq i}z_{j}|\Sigma_{i,j}| is the absolute sum of off-diagonal entries in the ith column of the submatrix of \bm{\Sigma} induced by \bm{z}.

Proof  Let us first recall that, per Brauer (1952)’s original result, all eigenvalues of a matrix \bm{\Sigma}\in S^{p}_{+} are contained in the union of the following p(p-1)/2 ovals of Cassini:

\bigcup_{i\in[p],j\in[p]:i<j}\left\{\lambda\in\mathbb{R}_{+}:|\lambda-\Sigma_{i,i}||\lambda-\Sigma_{j,j}|\leq R_{i}R_{j}\right\},

where R_{i}:=\sum_{j\in[p]:j\neq i}|\Sigma_{i,j}| is the absolute sum of off-diagonal entries in the ith column of \bm{\Sigma}. Next, let us observe that, if \lambda is a dominant eigenvalue of a PSD matrix \bm{\Sigma}, then \lambda\geq\Sigma_{i,i}\ \forall i and, in the (i,j)th oval, the bound reduces to

\lambda^{2}-\lambda(\Sigma_{i,i}+\Sigma_{j,j})+\Sigma_{i,i}\Sigma_{j,j}-R_{i}R_{j}\leq 0, (17)

which, by the quadratic formula, implies the upper bound \frac{\Sigma_{i,i}+\Sigma_{j,j}}{2}+\frac{\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}R_{j}}}{2}. The result follows because if z_{i}=0 the ith row of \bm{\Sigma} cannot be used to bound f(\bm{z}).
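A hedged Julia sketch evaluating the bound (16) for a fixed support follows (names ours; this is an illustration only, not the mixed-integer formulation discussed below):

```julia
# A hedged sketch evaluating the Brauer (ovals of Cassini) bound (16) for a fixed
# support of size at least two; R(i) is the off-diagonal absolute row sum within the support.
function cassini_bound(Σ::AbstractMatrix, support::Vector{Int})
    R(i) = sum(abs.(Σ[i, support])) - abs(Σ[i, i])
    best = -Inf
    for i in support, j in support
        i > j || continue
        oval = (Σ[i, i] + Σ[j, j]) / 2 +
               sqrt((Σ[i, i] - Σ[j, j])^2 + 4 * R(i) * R(j)) / 2
        best = max(best, oval)
    end
    return best
end
```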

Theorem 11’s inequality can be enforced numerically via mixed-integer second-order cone constraints. Indeed, the square root term in (16) can be modeled using a second-order cone constraint, and the bilinear terms only involve binary variables and can be linearized. Completing the square in Equation (17), we find that (16) is equivalent to the following system of p(p-1)/2 mixed-integer second-order cone inequalities:

\left(\theta-\frac{1}{2}(\Sigma_{i,i}+\Sigma_{j,j})\right)^{2}\leq\sum_{s,t\in[p]:s\neq i,t\neq j}W_{s,t}|\Sigma_{i,s}\Sigma_{j,t}|-\frac{3}{4}\Sigma_{i,i}\Sigma_{j,j}+M(1-s_{i,j})\quad \forall i,j\in[p]:i<j,
\sum_{i,j\in[p]:i<j}s_{i,j}=1,\quad s_{i,j}\leq\min(z_{i},z_{j})\quad \forall i,j\in[p]:i<j,\quad s_{i,j}\in\{0,1\}\quad \forall i,j\in[p]:i<j,

where W_{i,j}=z_{i}z_{j} is a product of binary variables which can be modeled using, e.g., the McCormick inequalities \max(0,z_{i}+z_{j}-1)\leq W_{i,j}\leq\min(z_{i},z_{j}), and M is an upper bound on the right-hand side of the inequality for any i,j:i\neq j, which can be computed in O(p^{3}) time in much the same manner as the big-M constant was computed in the previous section. Note that we do not make use of these inequalities directly in our numerical experiments, due to their high computational cost. However, an interesting extension would be to introduce the binary variables dynamically, via branch-and-cut-and-price (Barnhart et al., 1998).

Since the bound derived from the ovals of Cassini (Theorem 11) is at least as strong as the Gershgorin circle bound (Theorem 7), it satisfies the same approximation guarantee (Proposition 8). In particular, it is tight when \bm{\Sigma} is diagonal and provides a 2-factor approximation for diagonally dominant matrices. In fact, we now prove a stronger result and demonstrate that Theorem 11 provides a 2-factor bound on f(\bm{z}) for doubly diagonally dominant matrices—a broader class of matrices than diagonally dominant matrices (see Li and Tsatsomeros, 1997, for a general theory):

Proposition 12

Let \bm{\Sigma}\in S^{p}_{+} be a doubly diagonally dominant matrix, i.e.,

\Sigma_{i,i}\Sigma_{j,j}\geq R_{i}R_{j}\quad \forall i,j\in[p]:i>j,

where R_{i}:=\sum_{j\in[p]:j\neq i}|\Sigma_{i,j}| is the sum of the absolute off-diagonal entries in the ith column of \bm{\Sigma}. Then, we have that

f(\bm{z})\leq\max_{i,j\in[p]:i>j,\ z_{i}=z_{j}=1}\left\{\frac{\Sigma_{i,i}+\Sigma_{j,j}}{2}+\frac{\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}(\bm{z})R_{j}(\bm{z})}}{2}\right\}\leq 2f(\bm{z}). (18)

Proof  Observe that if \Sigma_{i,i}\Sigma_{j,j}\geq R_{i}R_{j}, then

\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4R_{i}R_{j}}\leq\sqrt{(\Sigma_{i,i}-\Sigma_{j,j})^{2}+4\Sigma_{i,i}\Sigma_{j,j}}=\Sigma_{i,i}+\Sigma_{j,j}.

The result then follows in essentially the same fashion as Proposition 8.  

3 Convex Relaxations and Rounding Methods

For large-scale instances, high-quality solutions can be obtained by solving a convex relaxation of Problem (5) and rounding the optimal solution. In Section 3.1, we propose relaxing \bm{z}\in\{0,1\}^{p} in (5) to \bm{z}\in[0,1]^{p} and applying a greedy rounding scheme. We further tighten this relaxation using second-order cone constraints in Section 3.2.

3.1 A Boolean Relaxation and a Greedy Rounding Method

We first consider a Boolean relaxation of (5), which we obtain by relaxing \bm{z}\in\{0,1\}^{p} to \bm{z}\in[0,1]^{p}. (Note that we omit the \sum_{j=1}^{p}|X_{i,j}|\leq\sqrt{k}z_{i} constraints when we develop our convex relaxations, since they are essentially dominated by the \|\bm{X}\|_{1}\leq k constraint we introduce in the next section; we introduced these inequalities to improve our semidefinite-free subproblem strategy for the exact method.) This gives \max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}\ f(\bm{z}), i.e.,

\max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\succeq\bm{0}}\quad \langle\bm{\Sigma},\bm{X}\rangle\quad \text{s.t.}\quad \mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}z_{i}\ \forall i,j\in[p]. (19)

A useful strategy for obtaining a high-quality feasible solution is to solve (19) and set z_{i}=1 for the k indices corresponding to the largest z_{i}^{\star}’s in (19), as proposed (in randomized form, for general integer optimization problems) by Raghavan and Tompson (1987). We formalize this in Algorithm 2. We remark that rounding strategies for sparse PCA have previously been proposed (see Fountoulakis et al., 2017; Dey et al., 2017; Chowdhury et al., 2020); however, the idea of rounding \bm{z} and then optimizing for \bm{X} appears to be new.

Algorithm 2 A greedy rounding method for Problem (1)
Input: Covariance matrix \bm{\Sigma}, sparsity parameter k
Compute \bm{z}^{\star} solution of (19) or (20)
Construct \bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}=k such that z_{i}\geq z_{j} if z^{\star}_{i}\geq z^{\star}_{j}
Compute \bm{X} solution of
  \max_{\bm{X}\in S^{p}_{+}}\ \langle\bm{\Sigma},\bm{X}\rangle\quad \text{s.t.}\quad \mathrm{tr}(\bm{X})=1,\ X_{i,j}=0\ \text{if}\ z_{i}z_{j}=0\ \forall i,j\in[p]
return \bm{z},\bm{X}
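The following hedged Julia sketch (names ours) captures the rounding step: it keeps the k largest entries of the relaxed \bm{z}^{\star}, and the optimal \bm{X} over that support is \bm{x}\bm{x}^{\top} where \bm{x} is the leading eigenvector of the corresponding principal submatrix of \bm{\Sigma}:

```julia
# A hedged sketch of the greedy rounding in Algorithm 2: round z⋆ to the k largest
# entries, then recover the leading eigenvector of Σ's principal submatrix, padded with zeros.
using LinearAlgebra

function greedy_round(Σ::AbstractMatrix, zstar::AbstractVector, k::Int)
    support = sortperm(zstar, rev = true)[1:k]      # indices of the k largest z⋆_i
    λs, V = eigen(Symmetric(Σ[support, support]))   # eigenvalues in ascending order
    x = zeros(length(zstar))
    x[support] = V[:, end]                          # leading eigenvector on the support
    return x, λs[end]                               # k-sparse PC and its explained variance
end
```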
Remark 13

Our numerical results in Section 4 reveal that explicitly imposing a PSD constraint on \bm{X} in the relaxation (19)—or the ones derived in the following section—prevents our approximation algorithm from scaling to larger problem sizes than the exact Algorithm 1 can already solve. Therefore, to improve scalability, the semidefinite cone can be safely approximated via its second-order cone relaxation, X_{i,j}^{2}\leq X_{i,i}X_{j,j}\ \forall i,j\in[p], plus a small number of cuts of the form \langle\bm{X},\bm{x}_{t}\bm{x}_{t}^{\top}\rangle\geq 0, as presented in Bertsimas and Cory-Wright (2020).

Remark 14

Rather than relaxing and greedily rounding \bm{z}, one could consider a higher dimensional relax-and-round scheme where we let \bm{Z} model the outer product \bm{z}\bm{z}^{\top} via \bm{Z}\succeq\bm{z}\bm{z}^{\top}, \max(0,z_{i}+z_{j}-1)\leq Z_{i,j}\leq\min(z_{i},z_{j})\ \forall i,j\in[p], Z_{i,i}=z_{i}, and require that \sum_{i,j\in[p]}Z_{i,j}\leq k^{2}. Indeed, a natural “round” component of such a relax-and-round scheme is precisely Goemans-Williamson rounding (Goemans and Williamson, 1995; Bertsimas and Ye, 1998), which performs at least as well as greedy rounding in both theory and practice. Unfortunately, some preliminary numerical experiments indicated that Goemans-Williamson rounding is not actually much better than greedy rounding in practice, and is considerably more expensive to implement. Therefore, we defer the details of the Goemans-Williamson scheme to Appendix A, and do not consider it any further in this paper.

3.2 Valid Inequalities for Strengthening Convex Relaxations

We now propose valid inequalities which allow us to improve the quality of the convex relaxations discussed previously. Note that as convex relaxations and random rounding methods are two sides of the same coin (Barak et al., 2014), applying these valid inequalities also improves the quality of the randomly rounded solutions.

Theorem 15

Let 𝒫strong\mathcal{P}_{strong} denote the optimal objective value of the following problem:

\max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\ \langle\bm{\Sigma},\bm{X}\rangle\quad \text{s.t.}\quad \mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}z_{i}\ \forall i,j\in[p], (20)
\quad\quad\quad \sum_{j\in[p]}X_{i,j}^{2}\leq X_{i,i}z_{i}\ \forall i\in[p],\quad \|\bm{X}\|_{1}\leq k.

Then, (20) is a stronger relaxation than (19), i.e., the following inequalities hold:

\max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z})\geq\mathcal{P}_{strong}\geq\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z}).

Moreover, suppose an optimal solution to (20) is of rank one. Then, the relaxation is tight:

\mathcal{P}_{strong}=\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z}).

Proof  The first inequality \max_{\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k}f(\bm{z})\geq\mathcal{P}_{strong} is trivial. The second inequality holds because \mathcal{P}_{strong} is indeed a valid relaxation of Problem (1). Indeed, \|\bm{X}\|_{1}\leq k follows from the cardinality and big-M constraints. The semidefinite constraint \bm{X}\succeq\bm{0} imposes second-order cone constraints on the 2\times 2 minors of \bm{X}, X_{i,j}^{2}\leq z_{i}X_{i,i}X_{j,j}, which can be aggregated into \sum_{j\in[p]}X_{i,j}^{2}\leq X_{i,i}z_{i} (see Bertsimas and Cory-Wright, 2020, for derivations).

Finally, suppose that an optimal solution to Problem (20) is of rank one, i.e., the optimal matrix \bm{X} can be decomposed as \bm{X}=\bm{x}\bm{x}^{\top}. Then, the SOCP inequalities imply that \sum_{j\in[p]}x_{i}^{2}x_{j}^{2}\leq x_{i}^{2}z_{i}. However, \sum_{j\in[p]}x_{j}^{2}=\mathrm{tr}(\bm{X})=1, which implies that x_{i}^{2}\leq x_{i}^{2}z_{i}, i.e., z_{i}=1 for any index i such that |x_{i}|>0. Since \bm{e}^{\top}\bm{z}\leq k, this implies that \|\bm{x}\|_{0}\leq k, i.e., \bm{X} also solves Problem (2).

As our numerical experiments will demonstrate, despite the simplicity of the rounding mechanism in Algorithm 2, relaxation (20) provides high-quality solutions to the original sparse PCA problem (1) without introducing any additional variables. We remark that other inequalities, including the second-order cone inequalities proposed in Li and Xie (2020, Lemma 2 (ii)), could further improve the convex relaxation; we leave their integration within our framework to future work.
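
To make the procedure concrete, the following is a minimal JuMP sketch of relaxation (20) followed by the greedy rounding step of Algorithm 2. It is an illustration rather than the implementation we provide on GitHub: it assumes Mosek is available and, for simplicity, it sets the big-M constants to M_{i,j} = 1/2 for i ≠ j and M_{i,i} = 1, which are valid (though not the tightest possible) choices whenever tr(X) = 1 and X ⪰ 0.

```julia
using JuMP, MosekTools, LinearAlgebra

# Sketch of relaxation (20) plus the greedy rounding step (Algorithm 2).
# M[i,j] are simple (not the tightest) valid big-M constants.
function relax_and_round(Σ::AbstractMatrix, k::Int)
    p = size(Σ, 1)
    M = [i == j ? 1.0 : 0.5 for i in 1:p, j in 1:p]
    model = Model(Mosek.Optimizer)
    @variable(model, X[1:p, 1:p], PSD)
    @variable(model, 0 <= z[1:p] <= 1)
    @constraint(model, sum(z) <= k)
    @constraint(model, sum(X[i, i] for i in 1:p) == 1)           # tr(X) = 1
    @constraint(model, [i = 1:p, j = 1:p],  X[i, j] <= M[i, j] * z[i])
    @constraint(model, [i = 1:p, j = 1:p], -X[i, j] <= M[i, j] * z[i])
    # Row-wise inequalities sum_j X[i,j]^2 <= X[i,i] * z[i], as rotated SOCs
    @constraint(model, [i = 1:p],
        [0.5 * X[i, i]; z[i]; X[i, :]] in RotatedSecondOrderCone())
    # ||X||_1 <= k via entrywise absolute-value variables
    @variable(model, U[1:p, 1:p] >= 0)
    @constraint(model, U .>=  X)
    @constraint(model, U .>= -X)
    @constraint(model, sum(U) <= k)
    @objective(model, Max, sum(Σ[i, j] * X[i, j] for i in 1:p, j in 1:p))
    optimize!(model)
    # Greedy rounding: keep the k largest z's; the restricted problem's value
    # is the largest eigenvalue of the corresponding principal submatrix.
    S = partialsortperm(value.(z), 1:k, rev = true)
    return objective_value(model), sort(S), eigmax(Symmetric(Σ[S, S]))
end
```

The first returned value is the relaxation bound and the last is the objective value of the greedily rounded support, so their relative difference is the bound gap we report in the sequel.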

4 Numerical Results

We now assess the numerical behavior of the algorithms proposed in Sections 2 and 3. To bridge the gap between theory and practice, we provide a Julia implementation of the convex relaxation and greedy rounding procedure on GitHub (https://github.com/ryancorywright/ScalableSPCA.jl). The code requires a conic solver such as Mosek and several open-source Julia packages to be installed.

4.1 Performance of Exact Methods

In this section, we apply Algorithm 1 to medium and large-scale sparse principal component analysis problems, with and without Gershgorin circle theorem bounds in the master problem. All experiments were implemented in Julia 1.3, using Gurobi 9.1 and JuMP.jl 0.21.6, and performed on a standard MacBook Pro laptop with a 2.9GHz 6-Core Intel i9 CPU and 16 GB DDR4 RAM. We compare our approach to the branch-and-bound algorithm developed by Berk and Bertsimas (2019) on the UCI pitprops, wine, miniboone, communities, arrhythmia and micromass datasets, both in terms of runtime and the number of nodes expanded; we refer to Berk and Bertsimas (2019); Bertsimas and Cory-Wright (2020) for descriptions of these datasets. (The solve times for their method, as reported here, differ from those reported in Berk and Bertsimas (2019) due to a small typo in their implementation: line 110 of their branchAndBound.jl code should read “if y[i]==-1 || y[i]==1”, not “if y[i]==-1”, in order to correctly compute the Gershgorin circle theorem bound; correcting this is necessary to ensure that we obtain correct results from their method.) Note that we normalized all datasets before running the methods (i.e., we compute the leading sparse principal components of correlation matrices). Additionally, we warm-start all methods with the solution of the method of Yuan and Zhang (2013) to maintain a fair comparison.

Table 1 reports the time for Algorithm 1 (with and without Gershgorin circle theorem bounds in the master problem) and the method of Berk and Bertsimas (2019) to identify the leading k-sparse principal component for k ∈ {5, 10, 20}, along with the number of nodes expanded and the number of outer-approximation cuts generated. We impose a relative optimality tolerance of 10^{-3} for all approaches, i.e., we terminate each method when (UB - LB)/UB ≤ 10^{-3}, where UB denotes the current objective bound and LB denotes the current incumbent objective value. Note that p denotes the dimensionality of the correlation matrix, and k ≤ p denotes the target sparsity.

Table 1: Runtime in seconds per approach. We impose a time limit of 600s. If a solver fails to converge, we report the relative bound gap at termination in brackets.
Dataset p k Alg. 1 Alg. 1 + Circle Theorem Method of B.+B.
Time(s) Nodes Cuts Time(s) Nodes Cuts Time(s) Nodes
Pitprops 13 5 0.30 1,608 1,176 0.06 38 8 1.49 22
10 0.14 414 387 0.02 18 21 0.02 14
Wine 13 5 0.57 2,313 1,646 0.02 46 11 0.04 34
10 0.17 376 311 0.03 54 58 0.02 12
Miniboone 50 5 0.01 0 11 0.01 0 3 0.04 2
10 0.01 0 16 0.02 0 3 0.04 2
20 0.03 0 26 0.01 0 3 1.30 5,480
Communities 101 5 (2.87%) 28,462 25,483 0.20 201 3 0.57 101
10 (13.3%) 37,479 36,251 0.34 406 39 0.94 1,298
20 (39.6%) 24,566 24,632 (12.1%) 42,120 37,383 (9.97%) 669,500
Arrhythmia 274 5 (18.1%) 22,771 20,722 6.07 135 1,233 4.17 1,469
10 (32.6%) 19,500 19,314 (2.92%) 15,510 6,977 (0.83%) 471,680
20 (74.4%) 33,773 12,374 (24.3%) 33,123 19,662 (18.45%) 311,400
Micromass 1300 5 (1.29%) 3,859 3,099 163.60 2,738 6 24.31 1,096
10 (10.6%) 3,366 3,369 241.86 3,233 121 362.4 36,690
20 (35.9%) 2,797 2,839 (35.9%) 2,676 2,115 (10.34%) 31,990

Our main findings from these experiments are as follows:

  • For smaller problems, the strength of Algorithm 1’s cuts allows it to outperform state-of-the-art methods such as the method of Berk and Bertsimas (2019). However, for larger problem sizes, the adaptive branching strategy of Berk and Bertsimas (2019) performs comparably to Algorithm 1. This suggests that the relative merits of both approaches are roughly even, and which method is preferable may depend on the problem data.

  • Generating outer-approximation cuts and deriving valid upper bounds from the Gershgorin circle theorem are both powerful ideas, but the greatest aggregate benefit appears to arise from combining the two bounds, rather than using either bound alone.

  • Once both k and p are sufficiently large (e.g., p > 300 and k > 10), no approach is able to solve the problem to provable optimality within 600s. This motivates our study of convex relaxations and randomized rounding methods in the next section.

4.2 Convex Relaxations and Randomized Rounding Methods

In this section, we apply Algorithm 2 to obtain high-quality convex relaxations and feasible solutions for the datasets studied in the previous subsection, and compare our relaxation to a different convex relaxation developed by d’Aspremont et al. (2008), in terms of the quality of the upper bound and of the resulting greedily rounded solutions. All experiments were implemented using the same specifications as in the previous section. The upper bound of d’Aspremont et al. (2008) which we compare against is given below. Strictly speaking, d’Aspremont et al. (2008) do not write down this formulation in their work; their bound involves dual variables which cannot be used directly to generate feasible solutions via greedy rounding. However, the fact that this bound and (d’Aspremont et al., 2008, Problem (8)) are dual to each other follows directly from strong semidefinite duality, and we therefore attribute the formulation to them:

max𝒛[0,1]p:𝒆𝒛kmax𝑿𝟎,𝑷i𝟎i[p]\displaystyle\max_{\begin{subarray}{c}\bm{z}\in[0,1]^{p}\end{subarray}:\bm{e}^{\top}\bm{z}\leq k}\>\max_{\bm{X}\succeq\bm{0},\bm{P}_{i}\succeq\bm{0}\ \forall i\in[p]} i[p]𝒂i𝒂i,𝑷is.t.tr(𝑿)=1,tr(𝑷i)=zi,𝑿𝑷ii[p],\displaystyle\sum_{i\in[p]}\ \langle\bm{a}_{i}\bm{a}_{i}^{\top},\bm{P}_{i}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \mathrm{tr}(\bm{P}_{i})=z_{i},\ \bm{X}\succeq\bm{P}_{i}\ \forall i\in[p], (21)

where Σ = ∑_{i=1}^{p} 𝒂_i 𝒂_i^⊤ is a Cholesky decomposition of Σ, and we obtain feasible solutions from this relaxation by greedily rounding an optimal 𝒛 à la Algorithm 2. To allow for a fair comparison, we also consider augmenting this formulation with the inequalities derived in Section 3.2, obtaining the following stronger, yet more expensive to solve, relaxation:

max𝒛[0,1]p:𝒆𝒛kmax𝑿𝟎,𝑷i𝟎i[p]\displaystyle\max_{\begin{subarray}{c}\bm{z}\in[0,1]^{p}\end{subarray}:\bm{e}^{\top}\bm{z}\leq k}\>\max_{\begin{subarray}{c}\bm{X}\succeq\bm{0},\\ \bm{P}_{i}\succeq\bm{0}\ \forall i\in[p]\end{subarray}} i[p]𝒂i𝒂i,𝑷is.t.tr(𝑿)=1,tr(𝑷i)=zi,𝑿𝑷ii[p],\displaystyle\sum_{i\in[p]}\ \langle\bm{a}_{i}\bm{a}_{i}^{\top},\bm{P}_{i}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,\ \mathrm{tr}(\bm{P}_{i})=z_{i},\ \bm{X}\succeq\bm{P}_{i}\ \forall i\in[p], (22)
j[p]Xi,j2Xi,izi,𝑿1k.\displaystyle\sum_{j\in[p]}X_{i,j}^{2}\leq X_{i,i}z_{i},\|\bm{X}\|_{1}\leq k.
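
For comparison, an analogous JuMP sketch of relaxation (21) is below (again an illustration under the same assumptions, not the code we release, and assuming Σ is positive definite so that a Cholesky factorization exists). Its main structural difference is that it carries one semidefinite matrix P_i per feature in addition to X, which explains its larger memory footprint in the experiments that follow.

```julia
using JuMP, MosekTools, LinearAlgebra

# Sketch of relaxation (21): one PSD matrix P_i per feature, plus X.
function daspremont_relaxation(Σ::AbstractMatrix, k::Int)
    p = size(Σ, 1)
    a = cholesky(Symmetric(Σ)).L                 # Σ = Σ_i a[:, i] a[:, i]'
    model = Model(Mosek.Optimizer)
    @variable(model, 0 <= z[1:p] <= 1)
    @constraint(model, sum(z) <= k)
    @variable(model, X[1:p, 1:p], PSD)
    @constraint(model, sum(X[i, i] for i in 1:p) == 1)            # tr(X) = 1
    P = [@variable(model, [1:p, 1:p], PSD, base_name = "P_$i") for i in 1:p]
    for i in 1:p
        @constraint(model, sum(P[i][j, j] for j in 1:p) == z[i])  # tr(P_i) = z_i
        @constraint(model, X - P[i] in PSDCone())                 # X ⪰ P_i
    end
    @objective(model, Max, sum(dot(a[:, i], P[i] * a[:, i]) for i in 1:p))
    optimize!(model)
    return objective_value(model), value.(z)
end
```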

We first apply these relaxations on datasets where Algorithm 1 terminates, so that the optimal solution is known and can be compared against. We report the quality of both methods, with and without the additional inequalities discussed in Section 3.2, in Tables 2 and 3 respectively. For the instances of (21) or (22) where p > 13, we used SCS version 2.1.1 (with default parameters) instead of Mosek, since Mosek required more memory than was available in our computing environment; SCS takes an augmented Lagrangian approach which is less numerically stable but requires significantly less memory. That is, (21)’s formulation is too expensive to solve via IPMs on a laptop when p = 50.

Table 2: Quality of relaxation gap (upper bound vs. optimal solution, denoted R. gap), objective gap (rounded solution vs. optimal solution, denoted O. gap) and runtime in seconds per method.
Dataset p k Alg. 2 with (19) Alg. 2 with (21)
R. gap (%) O. gap (%) Time(s) R. gap (%) O. gap (%) Time(s)
Pitprops 13 5 23.8% 0.00% 0.02 23.8% 16.1% 0.46
10 1.10% 0.30% 0.03 1.10% 1.33% 0.46
Wine 13 5 36.8% 0.00% 0.02 36.8% 40.4% 0.433
10 2.43% 0.26% 0.03 2.43% 15.0% 0.463
Miniboone 50 5 781.3% 235.6% 7.37 781.2% 34.7% 1,191.0
10 340.6% 117.6% 7.50 340.6% 44.9% 1,102.6
20 120.3% 38.08% 6.25 120.3% 31.9% 1,140.2
Table 3: Quality of relaxation gap (upper bound vs. optimal solution, denoted R. gap), objective gap (rounded solution vs. optimal solution, denoted O. gap) and runtime in seconds per method, with additional inequalities from Section 3.2.
Dataset p k Alg. 2 with (20) Alg. 2 with (22)
R. gap (%) O. gap (%) Time(s) R. gap (%) O. gap (%) Time(s)
Pitprops 13 5 0.71% 0.00% 0.17 1.53% 0.00% 0.55
10 0.12% 0.00% 0.27 1.10% 0.00% 3.27
Wine 13 5 1.56% 0.00% 0.24 2.98% 15.03% 0.95
10 0.40% 0.00% 0.22 2.04% 0.00% 1.15
Miniboone 50 5 0.00% 0.00% 163.3 0.00% 0.01% 500.7
10 0.00% 0.00% 148.5 0.00% 0.02% 489.9
20 0.00% 0.00% 194.5 0.00% 0.00% 776.3

Observe that applying Algorithm 2 without the additional inequalities (Table 2) yields rather poor relaxations and randomly rounded solutions. However, by intersecting our relaxations with the additional inequalities from Section 3.2 (Table 3), we obtain extremely high-quality relaxations. Indeed, with the additional inequalities, Algorithm 2 using formulation (20) identifies the optimal solution in all instances (0% O. gap), and always supplies a bound gap of less than 2%. Moreover, in terms of obtaining high-quality solutions, the new inequalities allow Problem (20) to perform as well as or better than Problem (21), despite optimizing over one semidefinite matrix rather than p+1 semidefinite matrices. This suggests that Problem (20) should be considered as a viable, more scalable and more accurate alternative to existing SDO relaxations such as Problem (21). For this reason, we only consider Problem (20)’s formulation in the rest of the paper.

We remark, however, that the key drawback of these methods is that, as implemented in this section, they do not scale beyond the problem sizes that Algorithm 1 already solves successfully. This is a drawback because Algorithm 1 supplies an exact certificate of optimality, while these methods do not. In the following set of experiments, we therefore investigate numerical techniques to improve the scalability of Algorithm 2.

4.3 Scalable Dual Bounds and Randomized Rounding Methods

To improve the scalability of Algorithm 2, we relax the PSD constraint on 𝑿 in (19) and (20). With these enhancements, we demonstrate that Algorithm 2 can be successfully scaled to generate high-quality bounds for 1000s × 1000s matrices.

First, as discussed in Remark 13, we can replace the PSD constraint 𝑿 ⪰ 0 by requiring that the p(p-1)/2 two-by-two minors of 𝑿 are non-negative: X_{i,j}² ≤ X_{i,i} X_{j,j}. Second, we consider adding 20 linear inequalities of the form ⟨𝑿, 𝒙_t 𝒙_t^⊤⟩ ≥ 0, for some vectors 𝒙_t (see Bertsimas and Cory-Wright, 2020, for a discussion). Table 4 reports the performance of Algorithm 2 (with the relaxation (20)) under these two approximations of the positive semidefinite cone, “Minors” and “Minors + 20 inequalities” respectively. Note that we report the entire duality gap (i.e., we do not break the gap down into its relaxation and objective gap components) since, as reflected in Table 1, some of these instances are currently too large to solve to optimality.
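
The fragment below sketches this outer-approximation in JuMP: given a model containing a symmetric matrix variable X (declared without the PSD tag), it imposes the 2×2 minor inequalities and a batch of linear cuts ⟨X, x x^⊤⟩ ≥ 0, which are valid for any PSD matrix. The cut vectors are drawn at random here purely for illustration; this is a sketch of the idea rather than the exact cut-selection rule we use, for which we refer to Bertsimas and Cory-Wright (2020).

```julia
using JuMP, LinearAlgebra

# Sketch: outer-approximate the constraint X ⪰ 0 for a symmetric matrix
# variable X already attached to `model`.
function add_psd_outer_approximation!(model::Model, X, p::Int; num_cuts::Int = 20)
    @constraint(model, [i = 1:p], X[i, i] >= 0)
    # 2x2 minor inequalities X[i,j]^2 <= X[i,i] * X[j,j], as rotated SOCs
    @constraint(model, [i = 1:p, j = (i + 1):p],
        [0.5 * X[i, i]; X[j, j]; X[i, j]] in RotatedSecondOrderCone())
    # Linear cuts <X, x x'> >= 0, valid for any PSD matrix
    for _ in 1:num_cuts
        x = randn(p)
        @constraint(model, sum(x[i] * x[j] * X[i, j] for i in 1:p, j in 1:p) >= 0)
    end
    return model
end
```

In place of @variable(model, X[1:p, 1:p], PSD), one would declare @variable(model, X[1:p, 1:p], Symmetric) and then call this function.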

Table 4: Quality of bound gap (rounded solution vs. upper bound) and runtime in seconds of Algorithm 2 with (20), under outer-approximations of the PSD cone.
Dataset p k Minors Minors + 20 inequalities
Gap (%) Time(s) Gap (%) Time(s)
Pitprops 13 5 1.51% 0.02 0.72% 0.36
10 5.29% 0.02 1.12% 0.36
Wine 13 5 2.22% 0.02 1.59% 0.38
10 3.81% 0.02 1.50% 0.37
Miniboone 50 5 0.00% 0.11 0.00% 0.11
10 0.00% 0.12 0.00% 0.12
20 0.00% 0.39 0.00% 0.39
Communities 101 5 0.07% 0.67 0.07% 14.8
10 0.66% 0.68 0.66% 14.4
20 3.32% 1.84 2.23% 33.5
Arrhythmia 274 5 3.37% 27.2 1.39% 203.6
10 3.01% 25.6 1.33% 184.0
20 8.87% 21.8 4.48% 426.8
Micromass 1300 5 0.04% 239.4 0.01% 4,639
10 0.63% 232.6 0.32% 6,392
20 13.1% 983.5 5.88% 16,350

Observe that if we impose constraints on the 2×2 minors only, we obtain a solution certifiably within 13% of optimality in seconds (resp. minutes) for p=100s (resp. p=1000s). Adding the 20 linear inequalities, we obtain a solution within 6% of optimality in minutes (resp. hours) for p=100s (resp. p=1000s). Moreover, the bound gaps compare favorably to those of Algorithm 1 and the method of Berk and Bertsimas (2019) on instances which these methods could not solve to certifiable optimality. For instance, for the Arrhythmia dataset with k=20, we obtain a bound gap of less than 9% in 20s, while the method of Berk and Bertsimas (2019) obtains a bound gap of 18.45% in 600s. This illustrates the value of the proposed relax+round method on datasets which are currently too large to be optimized over exactly.

To conclude this section, we explore Algorithm 2’s ability to scale to even higher-dimensional datasets in a high-performance setting, by running the method on one Intel Xeon E5-2690 v4 2.6GHz CPU core using 600 GB RAM. Table 5 reports the method’s scalability and performance on the Wilshire 5000, Gisette, and Arcene UCI datasets. For the Gisette dataset, we report the method’s performance when we include the first 3,000 and 4,000 rows/columns (as well as all 5,000 rows/columns). Similarly, for the Arcene dataset we report the method’s performance when we include the first 6,000, 7,000 or 8,000 rows/columns. We do not report results for the Arcene dataset for p > 8,000, as computing them requires more memory than was available (i.e., > 600 GB RAM). We do not report the method’s performance when we impose linear inequalities for the PSD cone, as solving the relaxation without them is already rather time consuming. Moreover, we do not impose the 2×2 minor constraints to save memory, we do not impose |X_{i,j}| ≤ M_{i,j} z_i when p ≥ 4,000 to save even more memory, and we report the overall bound gap, as improving upon the randomly rounded solution is challenging in a high-dimensional setting.

Table 5: Quality of bound gap (rounded solution vs. upper bound) and runtime in seconds.
Dataset p k Algorithm 2 (SOC relax.) + Inequalities
Bound gap (%) Time(s)
Wilshire 5000 2130 5 0.38% 1,036
10 0.24% 1,014
20 0.36% 1,059
Gisette 3000 5 1.67% 2,249
10 35.81% 2,562
20 10.61% 3,424
Gisette 4000 5 1.55% 1,402
10 54.4% 1,203
20 11.84% 1,435
Gisette 5000 5 1.89% 2,169
10 2.22% 2,455
20 7.16% 2,190
Arcene 6000 5 0.01% 3,333
10 0.06% 3,616
20 0.14% 3,198
Arcene 7000 5 0.03% 4,160
10 0.05% 4,594
20 0.25% 4,730
Arcene 8000 5 0.02% 6,895
10 0.17% 8,479
20 0.21% 6,335

These results suggest that if we solve the SOC relaxation using a first-order method rather than an interior point method, our approach could successfully generate certifiably near-optimal PCs when p=10,000s, particularly if combined with a feature screening technique (see d’Aspremont et al., 2008; Atamtürk and Gomez, 2020).

4.4 Performance of Exact and Approximate Methods on Synthetic Data

We now compare the exact and approximate methods against existing state-of-the-art methods in a spiked covariance matrix setting. We use the experimental setup laid out in d’Aspremont et al. (2008, Section 7.1). We recover the leading principal component of a test matrix Σ ∈ S^p_+, where p = 150 and Σ = (1/n) 𝑼^⊤𝑼 + (σ/‖𝒗‖₂²) 𝒗𝒗^⊤ (this specification of the test matrix differs from that of d’Aspremont et al. (2008, Section 7.1), who write Σ = 𝑼^⊤𝑼 + σ𝒗𝒗^⊤; however, it agrees with their source code). Here, 𝑼 ∈ [0,1]^{150×150} is a noise matrix with i.i.d. standard uniform entries and 𝒗 ∈ ℝ^{150} is a vector of signals such that

vi={1,ifi50,1i50,if 51i100,0,otherwise,\displaystyle v_{i}=\begin{cases}1,&\text{if}\ i\leq 50,\\ \frac{1}{i-50},&\text{if}\ 51\leq i\leq 100,\\ 0,&\text{otherwise,}\end{cases} (23)

and σ=2\sigma=2 is the signal-to-noise ratio. The methods which we compare are:

  • Exact: Algorithm 1 with Gershgorin inequalities and a time limit of 600s.

  • Approximate: Algorithm 2 with Problem (20), the SOC outer-approximation of the PSD cone, no PSD cuts, and the additional SOC inequalities.

  • Greedy: as proposed by Moghaddam et al. (2006) and laid out in (d’Aspremont et al., 2008, Algorithm 1), start with a solution 𝒛 of cardinality 1 and iteratively augment this solution vector with the index which gives the maximum variance contribution. Note that d’Aspremont et al. (2008) found this method outperformed the three other methods (approximate greedy, thresholding and sorting) they considered in their work.

  • Truncated Power Method: as proposed by Yuan and Zhang (2013), alternate between applying the power method to the solution vector and truncating the vector to ensure that it is k-sparse (a minimal sketch of this alternation appears after this list). Note that Berk and Bertsimas (2019) found that this approach performed better than five other state-of-the-art methods across the real-world datasets studied in the previous section of this paper, and often matched the performance of the method of Berk and Bertsimas (2019); indeed, it functions as a warm-start for the latter method.

  • Sorting: sort the diagonal entries Σ_{i,i} by magnitude and set z_i = 1 for the k largest, as studied in d’Aspremont et al. (2008). This naive method serves as a benchmark for the value of optimization in the more sophisticated methods considered here.
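
For completeness, a minimal sketch of the truncated power method’s alternation referenced above follows; it is our simplified reading of Yuan and Zhang (2013), and the initialization and stopping rule below are placeholders.

```julia
using LinearAlgebra

# Sketch of the truncated power method: alternate a power-iteration step with
# hard-thresholding onto the k largest entries (in absolute value).
function truncated_power_method(Σ::AbstractMatrix, k::Int; iters::Int = 200)
    p = size(Σ, 1)
    x = normalize(ones(p))                             # simplistic initialization
    for _ in 1:iters
        y = Σ * x                                      # power step
        S = partialsortperm(abs.(y), 1:k, rev = true)  # keep the k largest entries
        x = zeros(p)
        x[S] = y[S]
        x = normalize(x)                               # project back to the unit sphere
    end
    return x                                           # k-sparse approximate leading PC
end
```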

Figure 1 depicts the ROC curve (true positive rate vs. false positive rate for recovering the support of 𝒗) over 20 synthetic random instances, as we vary k for each instance. We observe that among all methods, the sorting method is the least accurate, with a substantially larger false detection rate for a given true positive rate than the remaining methods (AUC=0.7028). The truncated power method and our exact method then offer a substantial improvement over sorting, with respective AUCs of 0.7482 and 0.7483 (the exact method would dominate the remaining methods if given an unlimited runtime budget; its poor performance here reflects its inability to find the true optimal solution within 600 seconds). The greedy method then offers a modest improvement over them (AUC=0.7561), and the approximate relax+round method is the most accurate (AUC=0.7593).

In addition to support recovery, Figure 2 reports average runtime (left panel) and average optimality gap (right panel) over the same instances. Observe that among all methods, only the exact and the approximate relax+round methods provide optimality gaps, i.e., numerical certificates of near optimality. On this metric, relax+round supplies average bound gaps of 1% or less on all instances, while the exact method typically supplies bound gaps of 30% or more. This comparison illustrates the tightness of the valid inequalities from Section 3.2 that we included in the relaxation. Moreover, the relax+round method converges in less than one minute on all instances. All told, the relax+round method is the best performing method overall, although if k is set to be sufficiently close to 0 or p all methods behave comparably. In particular, the relax+round method should be preferred over the exact method, even though the exact method performs better at smaller problem sizes.

Figure 1: ROC curve over 20 synthetic instances where p=150, k_true=100 is unspecified.
Figure 2: Average time to compute solution, and optimality gap, over 20 synthetic instances where p=150, k_true=100 is unspecified.

4.5 Summary and Guidelines From Experiments

In summary, our main findings from our numerical experiments are as follows:

  • For small or medium scale problems where p ≤ 100 or k ≤ 10, exact methods such as Algorithm 1 or the method of Berk and Bertsimas (2019) reliably obtain certifiably optimal or near-optimal solutions in a short amount of time, and should therefore be preferred over other methods. However, for larger-scale sparse PCA problems, exact methods currently do not scale as well as approximate or heuristic methods.

  • For larger-scale sparse PCA problems, our proposed combination of solving a second-order cone relaxation and rounding greedily reliably supplies certifiably near-optimal solutions in practice (if not in theory) in a relatively small amount of time. Moreover, it outperforms other state-of-the-art heuristics including the greedy method of Moghaddam et al. (2006); d’Aspremont et al. (2008) and the Truncated Power Method of Yuan and Zhang (2013). Accordingly, it should be considered as a reliable and more accurate alternative for problems where p=1000s.

  • In practice, for even larger-scale problem sizes, we recommend using a combination of these methods: a computationally cheaper method (with k set in the 1000s) as a feature screening method, to be followed by the approximate relax+round method (with k set in the 100s) and/or the exact method, if time permits.

5 Three Extensions and their Mixed-Integer Conic Formulations

We conclude by discussing three extensions of sparse PCA where our methodology applies.

5.1 Non-Negative Sparse PCA

One potential extension to this paper would be to develop a certifiably optimal algorithm for non-negative sparse PCA (see Zass and Shashua, 2007, for a discussion), i.e., develop a tractable reformulation of

max𝒙p\displaystyle\max_{\bm{x}\in\mathbb{R}^{p}}\quad 𝒙𝒙,𝚺s.t.𝒙𝒙=1,𝒙𝟎,𝒙0k.\displaystyle\langle\bm{x}\bm{x}^{\top},\bm{\Sigma}\rangle\ \text{s.t.}\ \bm{x}^{\top}\bm{x}=1,\bm{x}\geq\bm{0},\|\bm{x}\|_{0}\leq k.

Unfortunately, we cannot develop a MISDO reformulation of non-negative sparse PCA by applying Theorem 1 mutatis mutandis. Indeed, while we can still set 𝑿 = 𝒙𝒙^⊤ and relax the rank-one constraint, if we do so then, by the non-negativity of 𝒙, lifting 𝒙 yields:

max𝒛{0,1}p:𝒆𝒛kmax𝑿𝒞n\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in\mathcal{C}_{n}} 𝚺,𝑿\displaystyle\langle\bm{\Sigma},\bm{X}\rangle (24)
s.t. tr(𝑿)=1,Xi,j=0ifzi=0,Xi,j=0ifzj=0i,j[p].\displaystyle\mathrm{tr}(\bm{X})=1,\ X_{i,j}=0\ \text{if}\ z_{i}=0,\ X_{i,j}=0\ \text{if}\ z_{j}=0\ \forall i,j\in[p].

where 𝒞_n := {𝑿 : ∃ 𝑼 ≥ 0, 𝑿 = 𝑼^⊤𝑼} denotes the completely positive cone, which is NP-hard to separate over and cannot currently be optimized over tractably (Dong and Anstreicher, 2013). Nonetheless, we can develop relatively tractable mixed-integer conic upper and lower bounds for non-negative sparse PCA. Indeed, we can obtain a fairly tight upper bound by replacing the completely positive cone with the larger doubly non-negative cone 𝒟_n := {𝑿 ∈ S^p_+ : 𝑿 ≥ 0}, which is a high-quality outer-approximation of 𝒞_n and is exact when k ≤ 4 (Burer et al., 2009).

Unfortunately, this relaxation is not exact in general, since the extreme rays of the doubly non-negative cone are not necessarily rank-one when k ≥ 5 (Burer et al., 2009). Nonetheless, to obtain feasible solutions which supply lower bounds, we could inner-approximate the completely positive cone with the cone of non-negative scaled diagonally dominant matrices (see Ahmadi and Majumdar, 2019; Bostanabad et al., 2020).
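
As an illustration of the upper bound just described, the following JuMP sketch replaces the completely positive cone with the doubly non-negative cone for a fixed candidate support S; restricting to a fixed support is a simplification we make here to keep the example short, and Mosek is again assumed.

```julia
using JuMP, MosekTools

# Sketch: doubly non-negative (PSD and entrywise non-negative) upper bound on
# non-negative sparse PCA, restricted to a fixed candidate support S.
function dnn_upper_bound(Σ::AbstractMatrix, S::Vector{Int})
    q = length(S)
    model = Model(Mosek.Optimizer)
    @variable(model, X[1:q, 1:q], PSD)
    @constraint(model, X .>= 0)                          # completely positive cone relaxed to DNN
    @constraint(model, sum(X[i, i] for i in 1:q) == 1)   # tr(X) = 1
    @objective(model, Max, sum(Σ[S[i], S[j]] * X[i, j] for i in 1:q, j in 1:q))
    optimize!(model)
    return objective_value(model)
end
```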

5.2 Sparse PCA on Rectangular Matrices

A second extension would be to extend our methodology to the non-square case:

max𝒙m,𝒚n\displaystyle\max_{\bm{x}\in\mathbb{R}^{m},\bm{y}\in\mathbb{R}^{n}}\quad 𝒙𝑨𝒚s.t.𝒙2=1,𝒚2=1,𝒙0k,𝒚0k.\displaystyle\bm{x}^{\top}\bm{A}\bm{y}\ \text{s.t.}\ \|\bm{x}\|_{2}=1,\|\bm{y}\|_{2}=1,\|\bm{x}\|_{0}\leq k,\|\bm{y}\|_{0}\leq k. (25)

Observe that computing the spectral norm of a matrix 𝑨\bm{A} is equivalent to:

max𝑿n×m𝑨,𝑿s.t.(𝑼𝑿𝑿𝑽)𝟎,tr(𝑼)+tr(𝑽)=2,\displaystyle\max_{\bm{X}\in\mathbb{R}^{n\times m}}\quad\langle\bm{A},\bm{X}\rangle\ \text{s.t.}\ \begin{pmatrix}\bm{U}&\bm{X}\\ \bm{X}^{\top}&\bm{V}\end{pmatrix}\succeq\bm{0},\mathrm{tr}(\bm{U})+\mathrm{tr}(\bm{V})=2, (26)

where, in an optimal solution, 𝑼\bm{U} stands for 𝒙𝒙\bm{x}\bm{x}^{\top}, 𝑽\bm{V} stands for 𝒚𝒚\bm{y}\bm{y}^{\top} and 𝑿\bm{X} stands for 𝒙𝒚\bm{x}\bm{y}^{\top}—this can be seen by taking the dual of (Recht et al., 2010, Equation 2.4).

Therefore, by using the same argument as in the positive semidefinite case, we can rewrite sparse PCA on rectangular matrices as the following MISDO:

max𝒘{0,1}m,𝒛{0,1}nmax𝑿n×m\displaystyle\max_{\bm{w}\in\{0,1\}^{m},\bm{z}\in\{0,1\}^{n}}\max_{\bm{X}\in\mathbb{R}^{n\times m}} 𝑨,𝑿\displaystyle\langle\bm{A},\bm{X}\rangle (27)
s.t. (𝑼𝑿𝑿𝑽)𝟎,tr(𝑼)+tr(𝑽)=2,\displaystyle\ \begin{pmatrix}\bm{U}&\bm{X}\\ \bm{X}^{\top}&\bm{V}\end{pmatrix}\succeq\bm{0},\mathrm{tr}(\bm{U})+\mathrm{tr}(\bm{V})=2,
Ui,j=0ifwi=0i,j[m],\displaystyle U_{i,j}=0\ \text{if}\ w_{i}=0\ \forall i,j\in[m],
Vi,j=0ifzi=0i,j[n],𝒆𝒘k,𝒆𝒛k.\displaystyle V_{i,j}=0\ \text{if}\ z_{i}=0\ \forall i,j\in[n],\bm{e}^{\top}\bm{w}\leq k,\bm{e}^{\top}\bm{z}\leq k.

5.3 Sparse PCA with Multiple Principal Components

A third extension where our methodology is applicable is the problem of obtaining multiple principal components simultaneously, rather than deflating 𝚺\bm{\Sigma} after obtaining each principal component. As there are multiple definitions of this problem, we now discuss the extent to which our framework encompasses each case.

Common Support:

Perhaps the simplest extension of sparse PCA to a multi-component setting arises when all rr principal components have common support. By retaining the vector of binary variables 𝒛\bm{z} and employing the Ky-Fan theorem (c.f. Wolkowicz et al., 2012, Theorem 2.3.8) to cope with multiple principal components, we obtain the following formulation in much the same manner as previously:

max𝒛{0,1}p:𝒆𝒛kmax𝑿S+p\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad 𝑿,𝚺s.t. 0𝑿𝕀,tr(𝑿)=r,Xi,j=0ifzi=0i[p].\displaystyle\langle\bm{X},\bm{\Sigma}\rangle\ \text{s.t.}\ \bm{0}\preceq\bm{X}\preceq\mathbb{I},\ \mathrm{tr}(\bm{X})=r,\ X_{i,j}=0\ \text{if}\ z_{i}=0\ \forall i\in[p]. (28)

Notably, the logical constraint X_{i,j} = 0 if z_i = 0, which formed the basis of our subproblem strategy, still successfully models the sparsity constraint. This suggests that (a) one can derive an equivalent subproblem strategy under common support, and (b) a cutting-plane method for common support should scale as well as in the single-component case.

Disjoint Support:

In a sparse PCA problem with disjoint support (Vu and Lei, 2012), simultaneously computing the first r principal components is equivalent to solving:

max𝒛{0,1}p×r:𝒆𝒛tkt[r],𝒛𝒆𝒆max𝑾p×r\displaystyle\max_{\begin{subarray}{c}\bm{z}\in\{0,1\}^{p\times r}:\bm{e}^{\top}\bm{z}_{t}\leq k\ \forall t\in[r],\\ \bm{z}\bm{e}\leq\bm{e}\end{subarray}}\max_{\bm{W}\in\mathbb{R}^{p\times r}} 𝑾𝑾,𝚺\displaystyle\langle\bm{W}\bm{W}^{\top},\bm{\Sigma}\rangle (29)
𝑾𝑾=𝕀r,Wi,j=0ifzi,t=0i[p],t[r],\displaystyle\bm{W}^{\top}\bm{W}=\mathbb{I}_{r},\ W_{i,j}=0\ \text{if}\ z_{i,t}=0\ \forall i\in[p],t\in[r],

where zi,tz_{i,t} is a binary variable denoting whether feature ii is a member of the ttth principal component. By applying the technique used to derive Theorem 1 mutatis mutandis, and invoking the Ky-Fan theorem (c.f. Wolkowicz et al., 2012, Theorem 2.3.8) to cope with the rank-rr constraint, we obtain:

max𝒛{0,1}p:𝒆𝒛kmax𝑿Sp\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\max_{\bm{X}\in S^{p}} 𝑿,𝚺\displaystyle\langle\bm{X},\bm{\Sigma}\rangle (30)
𝟎𝑿𝕀,tr(𝑿)=r,Xi,j=0ifYi,j=0i[p],\displaystyle\bm{0}\preceq\bm{X}\preceq\mathbb{I},\ \mathrm{tr}(\bm{X})=r,\ X_{i,j}=0\ \text{if}\ Y_{i,j}=0\ \forall i\in[p],

where Yi,j=t=1rzi,tzj,tY_{i,j}=\sum_{t=1}^{r}z_{i,t}z_{j,t} is a binary matrix denoting whether features ii and jj are members of the same principal component; this problem can be addressed by a cutting-plane method in much the same manner as when r=1r=1.

Acknowledgments

We are grateful to the three anonymous referees and the associate editor for many valuable comments which improved the paper.

References

  • Ahmadi and Majumdar [2019] A. A. Ahmadi and A. Majumdar. DSOS and SDSOS optimization: More tractable alternatives to sum of squares and semidefinite optimization. SIAM Journal on Applied Algebra and Geometry, 3(2):193–230, 2019.
  • Amini and Wainwright [2008] A. A. Amini and M. J. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In 2008 IEEE international symposium on information theory, pages 2454–2458. IEEE, 2008.
  • Atamtürk and Gomez [2020] A. Atamtürk and A. Gomez. Safe screening rules for l0-regression. Optimization Online, 2020.
  • Bao et al. [2011] X. Bao, N. V. Sahinidis, and M. Tawarmalani. Semidefinite relaxations for quadratically constrained quadratic programming: A review and comparisons. Mathematical Programming, 129(1):129, 2011.
  • Barak et al. [2014] B. Barak, J. A. Kelner, and D. Steurer. Rounding sum-of-squares relaxations. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 31–40, 2014.
  • Barnhart et al. [1998] C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. Savelsbergh, and P. H. Vance. Branch-and-price: Column generation for solving huge integer programs. Operations Research, 46(3):316–329, 1998.
  • Ben-Tal and Nemirovski [2002] A. Ben-Tal and A. Nemirovski. On tractable approximations of uncertain linear matrix inequalities affected by interval uncertainty. SIAM Journal on Optimization, 12(3):811–833, 2002.
  • Berk and Bertsimas [2019] L. Berk and D. Bertsimas. Certifiably optimal sparse principal component analysis. Mathematical Programming Computation, 11(3):381–420, 2019.
  • Berthet and Rigollet [2013] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in high dimension. Annals of Statistics, 41(4):1780–1815, 2013.
  • Bertsimas and Cory-Wright [2020] D. Bertsimas and R. Cory-Wright. On polyhedral and second-order-cone decompositions of semidefinite optimization problems. Operations Research Letters, 48(1):78–85, 2020.
  • Bertsimas and Van Parys [2020] D. Bertsimas and B. Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. Annals of Statistics, 48(1):300–323, 2020.
  • Bertsimas and Ye [1998] D. Bertsimas and Y. Ye. Semidefinite relaxations, multivariate normal distributions, and order statistics. In Handbook of Combinatorial Optimization, pages 1473–1491. Springer, 1998.
  • Bertsimas et al. [2017] D. Bertsimas, J. Pauphilet, and B. Van Parys. Sparse classification and phase transitions: A discrete optimization perspective. arXiv preprint arXiv:1710.01352, 2017.
  • Bertsimas et al. [2019] D. Bertsimas, R. Cory-Wright, and J. Pauphilet. A unified approach to mixed-integer optimization problems with logical constraints. arXiv preprint arXiv:1907.02109, 2019.
  • Boman et al. [2005] E. G. Boman, D. Chen, O. Parekh, and S. Toledo. On factor width and symmetric h-matrices. Linear Algebra and its Applications, 405:239–248, 2005.
  • Bostanabad et al. [2020] M. S. Bostanabad, J. Gouveia, and T. K. Pong. Inner approximating the completely positive cone via the cone of scaled diagonally dominant matrices. Journal on Global Optimization, 76:383–405, 2020.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, Cambridge, 2004.
  • Boyd et al. [1994] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in system and control theory, volume 15. Studies in Applied Mathematics, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1994.
  • Brauer [1952] A. Brauer. Limits for the characteristic roots of a matrix iv. Duke Mathematical Journal, 19:75–91, 1952.
  • Burer et al. [2009] S. Burer, K. M. Anstreicher, and M. Dür. The difference between 5×5 doubly nonnegative and completely positive matrices. Linear Algebra and its Applications, 431(9):1539–1552, 2009.
  • Candes and Tao [2007] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
  • Chan et al. [2016] S. O. Chan, D. Papailliopoulos, and A. Rubinstein. On the approximability of sparse PCA. In Conference on Learning Theory, pages 623–646, 2016.
  • Chen et al. [2019] Y. Chen, Y. Ye, and M. Wang. Approximation hardness for a class of sparse optimization problems. Journal of Machine Learning Research, 20(38):1–27, 2019.
  • Chowdhury et al. [2020] A. Chowdhury, P. Drineas, D. P. Woodruff, and S. Zhou. Approximation algorithms for sparse principal component analysis. arXiv preprint arXiv:2006.12748, 2020.
  • d’Aspremont et al. [2005] A. d’Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. Lanckriet. A direct formulation for sparse pca using semidefinite programming. In Advances in neural information processing systems, pages 41–48, 2005.
  • d’Aspremont et al. [2008] A. d’Aspremont, F. Bach, and L. E. Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9(Jul):1269–1294, 2008.
  • Dey et al. [2017] S. S. Dey, R. Mazumder, M. Molinaro, and G. Wang. Sparse principal component analysis and its l_1-relaxation. arXiv preprint arXiv:1712.00800, 2017.
  • Dey et al. [2018] S. S. Dey, R. Mazumder, and G. Wang. A convex integer programming approach for optimal sparse PCA. arXiv preprint arXiv:1810.09062, 2018.
  • Dong and Anstreicher [2013] H. Dong and K. Anstreicher. Separating doubly nonnegative and completely positive matrices. Mathematical Programming, 137(1-2):131–153, 2013.
  • Dong et al. [2015] H. Dong, K. Chen, and J. Linderoth. Regularization vs. relaxation: A conic optimization perspective of statistical variable selection. arXiv:1510.06083, 2015.
  • Duran and Grossmann [1986] M. A. Duran and I. E. Grossmann. An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Mathematical Programming, 36(3):307–339, 1986.
  • d’Aspremont and Boyd [2003] A. d’Aspremont and S. Boyd. Relaxations and randomized methods for nonconvex QCQPs. EE392o Class Notes, Stanford University, 1:1–16, 2003.
  • d’Aspremont et al. [2014] A. d’Aspremont, F. Bach, and L. El Ghaoui. Approximation bounds for sparse principal component analysis. Mathematical Programming, 148(1-2):89–110, 2014.
  • Fountoulakis et al. [2017] K. Fountoulakis, A. Kundu, E.-M. Kontopoulou, and P. Drineas. A randomized rounding algorithm for sparse PCA. ACM Transactions on Knowledge Discovery from Data (TKDD), 11(3):1–26, 2017.
  • Gally and Pfetsch [2016] T. Gally and M. E. Pfetsch. Computing restricted isometry constants via mixed-integer semidefinite programming. Optimization Online, 2016.
  • Gally et al. [2018] T. Gally, M. E. Pfetsch, and S. Ulbrich. A framework for solving mixed-integer semidefinite programs. Optimization Methods & Software, 33(3):594–632, 2018.
  • Goemans and Williamson [1995] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
  • Hein and Bühler [2010] M. Hein and T. Bühler. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Advances in neural information processing systems, pages 847–855, 2010.
  • Horn and Johnson [1990] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 1990.
  • Hotelling [1933] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
  • Hsu et al. [2014] Y.-L. Hsu, P.-Y. Huang, and D.-T. Chen. Sparse principal component analysis in cancer research. Translational cancer research, 3(3):182, 2014.
  • Jeffers [1967] J. N. Jeffers. Two case studies in the application of principal component analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 16(3):225–236, 1967.
  • Jolliffe [1995] I. T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22(1):29–35, 1995.
  • Jolliffe et al. [2003] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the lasso. Journal of computational and Graphical Statistics, 12(3):531–547, 2003.
  • Journée et al. [2010] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(2), 2010.
  • Kaiser [1958] H. F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187–200, 1958.
  • Kim and Kojima [2001] S. Kim and M. Kojima. Second order cone programming relaxation of nonconvex quadratic optimization problems. Optimization methods and software, 15(3-4):201–224, 2001.
  • Kobayashi and Takano [2020] K. Kobayashi and Y. Takano. A branch-and-cut algorithm for solving mixed-integer semidefinite optimization problems. Computational Optimization & Applications, 75(2):493–513, 2020.
  • Ledoit and Wolf [2004] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of multivariate analysis, 88(2):365–411, 2004.
  • Li and Tsatsomeros [1997] B. Li and M. Tsatsomeros. Doubly diagonally dominant matrices. Linear Algebra and Its Applications, 261(1-3):221–235, 1997.
  • Li and Xie [2020] Y. Li and W. Xie. Exact and approximation algorithms for sparse PCA. arXiv preprint arXiv:2008.12438, 2020.
  • Luss and d’Aspremont [2010] R. Luss and A. d’Aspremont. Clustering and feature selection using sparse principal component analysis. Optimization & Engineering, 11(1):145–157, 2010.
  • Luss and Teboulle [2013] R. Luss and M. Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65–98, 2013.
  • Magdon-Ismail [2017] M. Magdon-Ismail. NP-hardness and inapproximability of sparse PCA. Information Processing Letters, 126:35–38, 2017.
  • Majumdar et al. [2019] A. Majumdar, G. Hall, and A. A. Ahmadi. A survey of recent scalability improvements for semidefinite programming with applications in machine learning, control, and robotics. arXiv preprint arXiv:1908.05209, 2019.
  • Miolane [2018] L. Miolane. Phase transitions in spiked matrix estimation: information-theoretic analysis. arXiv preprint arXiv:1806.04343, 2018.
  • Moghaddam et al. [2006] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse pca: Exact and greedy algorithms. In Advances in neural information processing systems, pages 915–922, 2006.
  • Quesada and Grossmann [1992] I. Quesada and I. E. Grossmann. An LP/NLP based branch and bound algorithm for convex MINLP optimization problems. Computers & Chemical Engineering, 16(10-11):937–947, 1992.
  • Raghavan and Tompson [1987] P. Raghavan and C. D. Tompson. Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374, 1987.
  • Raghavendra and Steurer [2010] P. Raghavendra and D. Steurer. Graph expansion and the unique games conjecture. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 755–764, 2010.
  • Recht et al. [2010] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
  • Richman [1986] M. B. Richman. Rotation of principal components. Journal of climatology, 6(3):293–335, 1986.
  • Richtárik et al. [2020] P. Richtárik, M. Jahani, S. D. Ahipaşaoğlu, and M. Takáč. Alternating maximization: Unifying framework for 8 sparse PCA formulations and efficient parallel codes. Optimization and Engineering, pages 1–27, 2020.
  • Todd [2018] M. J. Todd. On max-k-sums. Mathematical Programming, 171(1):489–517, 2018.
  • Vu and Lei [2012] V. Vu and J. Lei. Minimax rates of estimation for sparse PCA in high dimensions. In Artificial intelligence and statistics, pages 1278–1286, 2012.
  • Wang et al. [2016] T. Wang, Q. Berthet, and R. J. Samworth. Statistical and computational trade-offs in estimation of sparse principal components. Annals of Statistics, 44(5):1896–1930, 2016.
  • Witten et al. [2009] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
  • Wolkowicz et al. [2012] H. Wolkowicz, R. Saigal, and L. Vandenberghe. Handbook of semidefinite programming: theory, algorithms, and applications, volume 27. Springer Science & Business Media, 2012.
  • Yuan and Zhang [2013] X.-T. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899–925, 2013.
  • Zakeri et al. [2014] G. Zakeri, D. Craigie, A. Philpott, and M. Todd. Optimization of demand response through peak shaving. Oper. Res. Letters, 42(1):97–101, 2014.
  • Zass and Shashua [2007] R. Zass and A. Shashua. Nonnegative sparse PCA. In Advances in neural information processing systems, pages 1561–1568, 2007.
  • Zhang et al. [2012] Y. Zhang, A. d’Aspremont, and L. El Ghaoui. Sparse PCA: Convex relaxations, algorithms and applications. In Handbook on Semidefinite, Conic and Polynomial Optimization, pages 915–940. Springer, 2012.
  • Zou et al. [2006] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.

A A Doubly Non-Negative Relaxation and a Goemans-Williamson Rounding Scheme

The MISDO formulation (5) we derived in Section 2 features big-M constraints of the form |X_{i,j}| ≤ M_{i,j} z_i. We did not include the equally valid inequalities |X_{i,j}| ≤ M_{i,j} z_j, because they are redundant given that 𝑿 is symmetric. In fact, (5) is equivalent to

max𝒛{0,1}p:𝒆𝒛kmax𝑿S+p𝚺,𝑿 s.t. tr(𝑿)=1,|Xi,j|Mi,jzizj,i,j[p].\displaystyle\max_{\bm{z}\in\{0,1\}^{p}:\bm{e}^{\top}\bm{z}\leq k}\ \max_{\bm{X}\in S^{p}_{+}}\quad\langle\bm{\Sigma},\bm{X}\rangle\text{ s.t. }\quad\mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}z_{i}z_{j},\ \forall i,j\in[p]. (31)

The formulation above features products of binary variables z_i z_j. Therefore, unlike several other problems involving cardinality constraints such as compressed sensing, relaxations of sparse PCA benefit from invoking an optimization hierarchy (see d’Aspremont and Boyd, 2003, Section 2.4.1, for a counterexample specific to compressed sensing). In particular, let us model the outer product 𝒛𝒛^⊤ by introducing a matrix 𝒁 and imposing the semidefinite constraint 𝒁 ⪰ 𝒛𝒛^⊤. We tighten the formulation by requiring that Z_{i,i} = z_i and imposing the linear inequalities max(z_i + z_j - 1, 0) ≤ Z_{i,j} ≤ min(z_i, z_j). Hence, we obtain:

max𝒛[0,1]p:𝒆𝒛k,𝒁+p×pmax𝑿𝟎𝚺,𝑿 s.t.\displaystyle\max_{\begin{subarray}{c}\bm{z}\in[0,1]^{p}:\bm{e}^{\top}\bm{z}\leq k,\\ \bm{Z}\in\mathbb{R}^{p\times p}_{+}\end{subarray}}\>\max_{\bm{X}\succeq\bm{0}}\>\langle\bm{\Sigma},\bm{X}\rangle\text{ s.t. } tr(𝑿)=1,|Xi,j|Mi,jZi,j,𝑬,𝒁k2,Zi,i=zi,\displaystyle\mathrm{tr}(\bm{X})=1,\ |X_{i,j}|\leq M_{i,j}Z_{i,j},\ \langle\bm{E},\bm{Z}\rangle\leq k^{2},\>Z_{i,i}=z_{i}, (32)
max(zi+zj1,0)Zi,jmin(zi,zj),(1𝒛𝒛𝒁)𝟎.\displaystyle\max(z_{i}+z_{j}-1,0)\leq Z_{i,j}\leq\min(z_{i},z_{j}),\ \begin{pmatrix}1&\bm{z}^{\top}\\ \bm{z}&\bm{Z}\end{pmatrix}\succeq\bm{0}.

Problem (32) is a doubly non-negative relaxation, as we have intersected the Shor and RLT relaxations. This is noteworthy, because doubly non-negative relaxations dominate most other popular relaxations with O(p2)O(p^{2}) variables (Bao et al., 2011, Theorem 1).

Relaxation (32) is amenable to a Goemans-Williamson rounding scheme (Goemans and Williamson, 1995). Namely, let (𝒛⋆, 𝒁⋆) denote optimal choices of (𝒛, 𝒁) in Problem (32), let 𝒛̂ be a normally distributed random vector such that 𝒛̂ ∼ 𝒩(𝒛⋆, 𝒁⋆ − 𝒛⋆𝒛⋆⊤), and let 𝒛̄ be a rounding of this vector such that z̄_i = 1 for the k largest entries of 𝒛̂; this is, up to feasibility on 𝒛̂, equivalent to the hyperplane rounding scheme of Goemans and Williamson (1995) (see Bertsimas and Ye, 1998, for a proof). We formalize this procedure in Algorithm 3. As Algorithm 3 returns one of multiple possible 𝒛̄’s, a computationally useful strategy is to run the random rounding component several times and return the best solution.

Algorithm 3 A Goemans-Williamson rounding method for Problem (1)
Input: Covariance matrix 𝚺, sparsity parameter k
 Compute 𝒛,𝒁\bm{z}^{\star},\bm{Z}^{\star} solution of (32)
 Compute 𝒛^𝒩(𝒛,𝒁𝒛𝒛)\hat{\bm{z}}\sim\mathcal{N}(\bm{z}^{\star},\bm{Z}^{\star}-\bm{z}^{\star}\bm{z}^{\star\top})
 Construct 𝒛¯{0,1}p:𝒆𝒛¯=k\bar{\bm{z}}\in\{0,1\}^{p}:\bm{e}^{\top}\bar{\bm{z}}=k such that z¯iz¯j\bar{z}_{i}\geq\bar{z}_{j} if z^iz^j\hat{z}_{i}\geq\hat{z}_{j}.
 Compute 𝑿\bm{X} solution of
max𝑿S+p𝚺,𝑿s.t.tr(𝑿)=1,Xi,j=0ifz¯iz¯j=0i,j[p].\displaystyle\max_{\bm{X}\in S^{p}_{+}}\ \langle\bm{\Sigma},\bm{X}\rangle\ \text{s.t.}\ \mathrm{tr}(\bm{X})=1,X_{i,j}=0\ \text{if}\ \bar{z}_{i}\bar{z}_{j}=0\ \forall i,j\in[p].
return  𝒛̄, 𝑿.
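
A minimal sketch of the random rounding step of Algorithm 3 is below. It assumes (z⋆, Z⋆) have already been obtained by solving (32) with a conic solver, and it uses the fact that the final semidefinite subproblem in Algorithm 3 reduces to a leading eigenvalue computation on the selected support.

```julia
using LinearAlgebra

# Sketch of the rounding step in Algorithm 3, assuming (zstar, Zstar) solve (32).
function goemans_williamson_round(Σ::AbstractMatrix, zstar::AbstractVector,
                                  Zstar::AbstractMatrix, k::Int; trials::Int = 100)
    p = length(zstar)
    C = Symmetric(Zstar - zstar * zstar')                  # covariance of the Gaussian sample
    E = eigen(C)
    L = E.vectors * Diagonal(sqrt.(max.(E.values, 0.0)))   # PSD square root (clip round-off)
    best_val, best_S = -Inf, Int[]
    for _ in 1:trials                                      # rerun the rounding, keep the best support
        zhat = zstar + L * randn(p)                        # zhat ~ N(zstar, Zstar - zstar*zstar')
        S = partialsortperm(zhat, 1:k, rev = true)         # set z̄ = 1 on the k largest entries of zhat
        val = eigmax(Symmetric(Σ[S, S]))                   # value of the restricted subproblem
        if val > best_val
            best_val, best_S = val, collect(S)
        end
    end
    return best_S, best_val
end
```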

A very interesting question is whether it is possible to produce a constant-factor guarantee on the quality of Algorithm 3’s rounding, as Goemans and Williamson (1995) successfully did for binary quadratic optimization. Unfortunately, despite our best efforts, this does not appear to be possible, as the quality of the rounding depends on the value of the optimal dual variables, which are hard to control in this setting. This should not be too surprising, for two distinct reasons. Namely, (a) sparse regression, which reduces to sparse PCA (see d’Aspremont et al., 2008, Section 6.1), is strongly NP-hard (Chen et al., 2019), and (b) sparse PCA is hard to approximate within a constant factor under the Small Set Expansion (SSE) hypothesis (Chan et al., 2016), meaning that producing a constant-factor guarantee would contradict the SSE hypothesis of Raghavendra and Steurer (2010).

We close this appendix by noting that a similar in spirit (although different in both derivation and implementation) combination of taking a semidefinite relaxation of 𝒛{0,1}p\bm{z}\in\{0,1\}^{p} and rounding à la Goemans-Williamson has been proposed for sparse regression problems (Dong et al., 2015).