

Detection and Recovery of Hidden Submatrices

Marom Dadon    Wasim Huleihel    Tamir Bendory M. D., W. H., and T. B. are with the Department of Electrical Engineering-Systems at Tel Aviv university, Tel Aviv 6997801, Israel (e-mails: marom.dadon@gmail.com, wasimh@tauex.tau.ac.il, bendory@tauex.tau.ac.il ). W.H. is supported by ISF under Grant 1734/21. T.B. is supported in part by BSF under Grant 2020159, in part by NSF-BSF under Grant 2019752, and in part by ISF under Grant 1924/21.
Abstract

In this paper, we study the problems of detection and recovery of hidden submatrices with elevated means inside a large Gaussian random matrix. We consider two different structures for the planted submatrices. In the first model, the planted matrices are disjoint, and their row and column indices can be arbitrary. Inspired by scientific applications, the second model restricts the row and column indices to be consecutive. In the detection problem, under the null hypothesis, the observed matrix is a realization of independent and identically distributed standard normal entries. Under the alternative, there exists a set of hidden submatrices with elevated means inside the same standard normal matrix. Recovery refers to the task of locating the hidden submatrices. For both problems, and for both models, we characterize the statistical and computational barriers by deriving information-theoretic lower bounds, designing and analyzing algorithms matching those bounds, and proving computational lower bounds based on the low-degree polynomials conjecture. In particular, we show that the space of the model parameters (i.e., the number of planted submatrices, their dimensions, and the elevated mean) can be partitioned into three regions: the impossible regime, where all algorithms fail; the hard regime, where while detection or recovery are statistically possible, we give some evidence that polynomial-time algorithms do not exist; and finally the easy regime, where polynomial-time algorithms exist.

1 Introduction

This paper studies the detection and recovery problems of hidden submatrices inside a large Gaussian random matrix. In the detection problem, under the null hypothesis, the observed matrix is a realization of an independent and identically distributed random matrix with standard normal entries. Under the alternative, there exists a set of hidden submatrices with elevated means inside the same standard normal matrix. Our task is to design a statistical test (i.e., an algorithm) to decide which hypothesis is correct. The recovery task is the problem of locating the hidden submatrices. In this case, the devised algorithm estimates the location of the submatrices.

We consider two statistical models for the planted submatrices. In the first model, the planted matrices are disjoint, and their row and column indices can be arbitrary. The detection and recovery variants of this model are well-known as the submatrix detection and submatrix recovery (or localization) problems, respectively, and received significant attention in the last few years, e.g., [46, 37, 12, 11, 2, 33, 41, 48, 42, 44, 1, 9, 23, 22, 4, 5, 32], and references therein. Specifically, for the case of a single planted submatrix, the task is to detect the presence of a small k×kk\times k submatrix with entries sampled from a distribution 𝒫{\cal P} in an n×nn\times n matrix of samples from a distribution 𝒬{\cal Q}. In the special case where 𝒫{\cal P} and 𝒬{\cal Q} are Gaussians, the statistical and computational barriers, i.e., information-theoretic lower bounds, algorithms, and computational lower bounds, were studied in great detail and were characterized in [11, 41, 46, 37, 12, 42, 5]. When 𝒫{\cal P} and 𝒬{\cal Q} are Bernoulli random variables, the detection task is well-known as the planted dense subgraph problem, which has also been studied extensively in the literature, e.g., [11, 2, 48, 33, 4]. Most notably, for both the Gaussian and Bernoulli problems, it is well understood by now that there appears to be a statistical-computational gap between the minimum value of kk at which detection can be solved, and the minimum value of kk at which detection can be solved in polynomial time (i.e., with an efficient algorithm). The statistical and computational barriers to the recovery problem have also received significant attention in the literature, e.g., [23, 40, 20, 34, 35, 22, 4], covering several types of distributions, as well as single and (non-overlapping) multiple planted submatrices.

The submatrix model above, where the planted column and row indices are arbitrary, might be less realistic in certain scientific and engineering applications. Accordingly, we also analyze a second model that restricts the row and column indices to be consecutive. One important motivation for this model stems from single-particle cryo-electron microscopy (cryo-EM): a leading technology to elucidate the three-dimensional atomic structure of macromolecules, such as proteins [15, 39]. At the beginning of the algorithmic pipeline of cryo-EM, it is required to locate multiple particle images (tomographic projections of randomly oriented copies of the sought molecular structure) in a highly noisy, large image [43, 8]. This task is dubbed particle picking. While many particle picking algorithms were designed, e.g.,  [50, 27, 14, 24], this work can be seen as a first attempt to unveil the statistical and computational properties of this task that were not analyzed heretofore.

Main contributions.

To present our results, let us introduce a few notations. In our models, we have m disjoint k×k submatrices planted in an n×n matrix. We denote the observed matrix by 𝖷. As mentioned above, we deal with the Gaussian setting, where the entries of the planted submatrices are independent Gaussian random variables with mean λ>0 and unit variance, while the remaining entries of 𝖷 are independent Gaussian random variables with zero mean and unit variance. This falls under the general “signal+noise” model, in the sense that 𝖷=λ⋅𝖲+𝖹, where 𝖲 is the signal of interest, 𝖹 is a Gaussian noise matrix, and λ describes the signal-to-noise ratio (SNR) of the problem. As mentioned above, in this paper we consider two models for 𝖲: the first with arbitrary placement of the m planted submatrices, and the second with each of the m planted submatrices having consecutive row and column indices. We will refer to the detection/recovery problems under the first model as submatrix detection/recovery, and under the second as consecutive submatrix detection/recovery. Contrary to the consecutive submatrix detection and recovery problems, which were not studied in the literature, the non-consecutive submatrix detection and recovery problems have received significant attention; our contribution to this problem is the analysis of the detection of a multiple (possibly growing) number of planted submatrices, which seems to have been overlooked in the literature. As mentioned above, the recovery counterpart of multiple planted submatrices was studied in, e.g., [22].

For the submatrix detection, the consecutive submatrix detection, and the consecutive submatrix recovery problems, we study the computational and statistical boundaries and derive information-theoretic lower bounds, algorithmic upper bounds, and computational lower bounds. In particular, we show that the space of the model parameters (k,m,λ)(k,m,\lambda) can be partitioned into different disjoint regions: the impossible regime, where all algorithms fail; the hard regime, where while detection or recovery are statistically possible, we give some evidence that polynomial-time algorithms do not exist; and finally the easy regime, where polynomial-time algorithms exist. Table 1 summarizes the statistical and computational thresholds for the detection and recovery problems discussed above. We emphasize that the bounds in the second row of Table 1 (submatrix recovery), as well as the first row (submatrix detection) for m=1m=1, are known in the literature, as mentioned above.

Type | Impossible | Hard | Easy
\mathsf{SD} | \lambda\ll\frac{n}{mk^{2}}\wedge\frac{1}{\sqrt{k}} | \frac{n}{mk^{2}}\wedge\frac{1}{\sqrt{k}}\ll\lambda\ll 1\wedge\frac{n}{mk^{2}} | \lambda\gg 1\wedge\frac{n}{mk^{2}}
\mathsf{SR} | \lambda\ll\frac{1}{\sqrt{k}} | \frac{1}{\sqrt{k}}\ll\lambda\ll 1\wedge\frac{\sqrt{n}}{k} | \lambda\gg 1\wedge\frac{\sqrt{n}}{k}
\mathsf{CSD} | \lambda\ll\frac{1}{k} | \mathsf{NO} | \lambda\gg\frac{1}{k}
\mathsf{CSR} | \lambda\ll\frac{1}{\sqrt{k}} | \mathsf{NO} | \lambda\gg\frac{1}{\sqrt{k}}
Table 1: Statistical and computational thresholds for submatrix detection (𝖲𝖣), submatrix recovery (𝖲𝖱), consecutive submatrix detection (𝖢𝖲𝖣), and consecutive submatrix recovery (𝖢𝖲𝖱), up to poly-log factors. The bounds in the first row for the special case of m=1, as well as those in the second row, are known in the literature (e.g., [11, 42, 23, 22]).

Interestingly, while it is well known that the number of planted submatrices m does not play any significant role in the statistical and computational barriers of the submatrix recovery problem, this is not the case for the submatrix detection problem. Similarly to the submatrix recovery problem (and, of course, the single planted submatrix detection problem), the submatrix detection problem exhibits a statistical-computational gap. To provide evidence for this phenomenon, we follow a recent line of work [31, 28, 13, 21, 25] and show that the class of low-degree polynomials fails to solve the detection problem in this conjecturally hard regime. Furthermore, it can be seen that the consecutive submatrix detection and recovery problems are either impossible or easy to solve, namely, there is no hard regime. Here, for both the detection and recovery problems, the number of planted submatrices m does not play an inherent role. We note that there is a statistical gap between consecutive detection and recovery; the former is statistically easier. This is true as long as exact recovery is the performance criterion. We also analyze the correlated recovery (also known as weak recovery) criterion, where recovery is successful if only a fraction of the planted entries is recovered. For this weaker criterion, we show that recovery and detection are asymptotically equivalent.

Notation.

Given a distribution \mathbb{P}, let n\mathbb{P}^{\otimes n} denote the distribution of the nn-dimensional random vector (X1,X2,,Xn)(X_{1},X_{2},\dots,X_{n}), where the XiX_{i} are i.i.d. according to \mathbb{P}. Similarly, m×n\mathbb{P}^{\otimes m\times n} denotes the distribution on m×n\mathbb{R}^{m\times n} with i.i.d. entries distributed as \mathbb{P}. Given a finite or measurable set 𝒳\mathcal{X}, let Unif[𝒳]\text{Unif}[\mathcal{X}] denote the uniform distribution on 𝒳\mathcal{X}. The relation XYX\perp\!\!\!\perp Y means that the random variables XX and YY are statistically independent. The Hadamard and inner product between two n×nn\times n matrices 𝖠\mathsf{A} and 𝖡\mathsf{B} are denoted, respectively, by 𝖠𝖡\mathsf{A}\odot\mathsf{B} and 𝖠,𝖡=𝗍𝗋𝖺𝖼𝖾(𝖠T𝖡)\left\langle\mathsf{A},\mathsf{B}\right\rangle=\mathsf{trace}(\mathsf{A}^{T}\mathsf{B}). For xx\in\mathbb{R}, we define [x]+=max(x,0)[x]_{+}=\max(x,0). The nuclear norm of a symmetric matrix 𝖠\mathsf{A} is denoted by 𝖠\left\|\mathsf{A}\right\|_{\star}, and equals the summation of the absolute values of the eigenvalues of 𝖠\mathsf{A}.

Let 𝒩(μ,σ²) denote a normal random variable with mean μ and variance σ², where μ∈ℝ and σ∈ℝ_{≥0}. Let 𝒩(μ,Σ) denote a multivariate normal random vector with mean μ∈ℝ^d and covariance matrix Σ, where Σ is a d×d positive semidefinite matrix. Let Φ denote the cumulative distribution function of a standard normal random variable, i.e., \Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-t^{2}/2}\mathrm{d}t. For probability measures ℙ and ℚ, let d_{\mathsf{TV}}(\mathbb{P},\mathbb{Q})=\frac{1}{2}\int|\mathrm{d}\mathbb{P}-\mathrm{d}\mathbb{Q}|, \chi^{2}(\mathbb{P}||\mathbb{Q})=\int\frac{(\mathrm{d}\mathbb{P}-\mathrm{d}\mathbb{Q})^{2}}{\mathrm{d}\mathbb{Q}}, and d_{\mathsf{KL}}(\mathbb{P}||\mathbb{Q})=\mathbb{E}_{\mathbb{P}}\log\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}} denote the total variation distance, the χ²-divergence, and the Kullback-Leibler (KL) divergence, respectively. Let 𝖡𝖾𝗋𝗇(p) and 𝖡𝗂𝗇𝗈𝗆𝗂𝖺𝗅(n,p) denote the Bernoulli and Binomial distributions with parameters p and (n,p), respectively. We denote by 𝖧𝗒𝗉𝖾𝗋𝗀𝖾𝗈𝗆𝖾𝗍𝗋𝗂𝖼(n,k,m) the Hypergeometric distribution with parameters (n,k,m).

We use standard asymptotic notation. For two positive sequences {a_n} and {b_n}, we write a_n=O(b_n) if a_n≤C b_n for some absolute constant C and all n; a_n=Ω(b_n) if b_n=O(a_n); a_n=Θ(b_n) if a_n=O(b_n) and a_n=Ω(b_n); and a_n=o(b_n), or b_n=ω(a_n), if a_n/b_n→0 as n→∞. Finally, for a,b∈ℝ, we let a∨b≜max{a,b} and a∧b≜min{a,b}. Throughout the paper, 𝖢 refers to any constant independent of the parameters of the problem at hand, and will be reused for different constants. The notation ≪ refers to polynomially less than in n, namely, a_n≪b_n if liminf_{n→∞} log_n a_n < liminf_{n→∞} log_n b_n; e.g., n≪n², but n≪̸n log₂ n. For n∈ℕ, we let [n]={1,2,…,n}. For a subset S⊆ℝ, we let 𝟙{S} denote the indicator function of the set S.

2 Problem Formulation

In this section, we present our model and define the detection and recovery problems we investigate, starting with the former. For simplicity of notations, we denote 𝒬=𝒩(0,1){\cal Q}={\cal N}(0,1) and 𝒫=𝒩(λ,1){\cal P}={\cal N}(\lambda,1), for some λ>0\lambda>0, which can be interpreted as the signal-to-noise ratio (SNR) parameter of the underlying model.

2.1 The detection problem

Let (m,k,n) be three natural numbers satisfying m⋅k≤n. We emphasize that the values of m, k, and λ are allowed to be functions of n, the dimension of the observation. Let 𝒦_{k,m,n} denote the collection of all subsets of [n]×[n] that can be represented as a union of m pairwise disjoint sets of the form 𝖲_i×𝖳_i, where 𝖲_i and 𝖳_i are subsets of [n] of size k each; see Figure 1 for an illustration. Formally,

{\cal K}_{k,m,n}\triangleq\Big\{\mathsf{K}_{k,m}=\bigcup_{i=1}^{m}\mathsf{S}_{i}\times\mathsf{T}_{i}:\;\mathsf{S}_{i},\mathsf{T}_{i}\in{\cal C}_{k},\;\forall i\in[m],\;(\mathsf{S}_{i}\times\mathsf{T}_{i})\cap(\mathsf{S}_{j}\times\mathsf{T}_{j})=\emptyset,\;\forall i\neq j\in[m]\Big\}, \qquad(1)

where {\cal C}_{k}\triangleq\{\mathsf{S}\subset[n]:\;|\mathsf{S}|=k\}, namely, the set of all subsets of [n] of size k.

Figure 1: Illustration of the models considered in this paper: 𝒦_{k,m,n} of Definition 1 (left) and 𝒦_{k,m,n}^{𝖼𝗈𝗇} of Definition 2 (right), for k=4, m=2, and n=16.

We next formulate two different detection problems that we wish to investigate, starting with the following one, a generalization of the Gaussian planted clique problem (or, bi-clustering, see, e.g., [42]) to multiple hidden submatrices (or, clusters).

Definition 1 (Submatrix detection).

Let (𝒫,𝒬)({\cal P},{\cal Q}) be a pair of distributions over a measurable space (,)(\mathbb{R},\mathcal{B}). Let 𝖲𝖣(n,k,m,𝒫,𝒬)\mathsf{SD}(n,k,m,{\cal P},{\cal Q}) denote the hypothesis testing problem with observation 𝖷n×n\mathsf{X}\in\mathbb{R}^{n\times n} and hypotheses

{\cal H}_{0}:\mathsf{X}\sim{\cal Q}^{\otimes n\times n}\quad\quad\mathsf{vs.}\quad\quad{\cal H}_{1}:\mathsf{X}\sim{\cal D}(n,k,m,{\cal P},{\cal Q}), \qquad(2)

where 𝒟(n,k,m,𝒫,𝒬) is the distribution of matrices 𝖷 with entries 𝖷_{ij}∼𝒫 if (i,j)∈𝖪_{k,m} and 𝖷_{ij}∼𝒬 otherwise, where the entries are conditionally independent given 𝖪_{k,m}, and 𝖪_{k,m} is chosen uniformly at random from 𝒦_{k,m,n}.

To wit, under 0{\cal H}_{0} the elements of 𝖷\mathsf{X} are all distributed i.i.d. according to 𝒬{\cal Q}, while under 1{\cal H}_{1}, there are mm planted disjoint submatrices 𝖪k,m\mathsf{K}_{k,m} in 𝖷\mathsf{X} with entries distributed according to 𝒫{\cal P}, and the other entries (outside of 𝖪k,m\mathsf{K}_{k,m}) are distributed according to 𝒬{\cal Q}.
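For concreteness, the following sketch draws an observation under ℋ_1 of Definition 1 in the Gaussian setting 𝒫=𝒩(λ,1), 𝒬=𝒩(0,1). It is only an illustration and not part of our analysis: for simplicity it plants blocks whose row sets (and column sets) are pairwise disjoint, which guarantees the disjointness required in (1) but covers only a subset of 𝒦_{k,m,n}, so it does not reproduce the exact uniform prior of Definition 1; the function and variable names (sample_sd, lam, etc.) are ours.

```python
import numpy as np

def sample_sd(n, k, m, lam, rng=None):
    """Draw X under H1 of Definition 1 (simplified prior, see text above)."""
    rng = np.random.default_rng(rng)
    assert m * k <= n, "the planted blocks must fit in the matrix"
    # Pairwise-disjoint row sets S_1,...,S_m and column sets T_1,...,T_m.
    rows = rng.permutation(n)[: m * k].reshape(m, k)
    cols = rng.permutation(n)[: m * k].reshape(m, k)
    signal = np.zeros((n, n))
    for S, T in zip(rows, cols):
        signal[np.ix_(S, T)] = lam            # elevated mean on S_i x T_i
    X = signal + rng.standard_normal((n, n))  # i.i.d. N(0,1) noise everywhere
    return X, signal > 0                      # observation and planted mask
```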

Note that the column and row indices of the planted submatrices in (1) can appear anywhere; in particular, they are not necessarily consecutive. In some applications, however, we would like those submatrices to be defined by a set of consecutive rows and a set of consecutive columns (e.g., when those submatrices model images, as in cryo-EM). Accordingly, we consider the following set:

{\cal K}_{k,m,n}^{\mathsf{con}}\triangleq\Big\{\mathsf{K}_{k,m}=\bigcup_{i=1}^{m}\mathsf{S}_{i}\times\mathsf{T}_{i}:\;\mathsf{S}_{i},\mathsf{T}_{i}\in{\cal C}_{k}^{\mathsf{con}},\;\forall i\in[m],\;(\mathsf{S}_{i}\times\mathsf{T}_{i})\cap(\mathsf{S}_{j}\times\mathsf{T}_{j})=\emptyset,\;\forall i\neq j\in[m]\Big\}, \qquad(3)

where {\cal C}_{k}^{\mathsf{con}}\triangleq\{\mathsf{S}\subset[n]:\;|\mathsf{S}|=k,\;\mathsf{S}\text{ is consecutive}\}, namely, the set of all subsets of [n] of size k with consecutive elements. For example, for n=4, we have 𝒞_3^{𝖼𝗈𝗇}={{1,2,3},{2,3,4}}. The difference between 𝒦_{k,m,n} and 𝒦_{k,m,n}^{𝖼𝗈𝗇} is depicted in Figure 1; it is evident that the submatrices in 𝒦_{k,m,n} can appear anywhere, while those in 𝒦_{k,m,n}^{𝖼𝗈𝗇} are consecutive. Consider the following detection problem.

Definition 2 (Consecutive submatrix detection).

Let (𝒫,𝒬)({\cal P},{\cal Q}) be a pair of distributions over a measurable space (,)(\mathbb{R},\mathcal{B}). Let 𝖢𝖲𝖣(n,k,m,𝒫,𝒬)\mathsf{CSD}(n,k,m,{\cal P},{\cal Q}) denote the hypothesis testing problem with observation 𝖷n×n\mathsf{X}\in\mathbb{R}^{n\times n} and hypotheses

{\cal H}_{0}:\mathsf{X}\sim{\cal Q}^{\otimes n\times n}\quad\quad\mathsf{vs.}\quad\quad{\cal H}_{1}:\mathsf{X}\sim\widetilde{{\cal D}}(n,k,m,{\cal P},{\cal Q}), \qquad(4)

where 𝒟̃(n,k,m,𝒫,𝒬) is the distribution of matrices 𝖷 with entries 𝖷_{ij}∼𝒫 if (i,j)∈𝖪_{k,m} and 𝖷_{ij}∼𝒬 otherwise, where the entries are conditionally independent given 𝖪_{k,m}, and 𝖪_{k,m} is chosen uniformly at random from 𝒦_{k,m,n}^{𝖼𝗈𝗇}.

Observing 𝖷, a detection algorithm 𝒜_n for the problems above is tasked with outputting a decision in {0,1}. We define the risk of a detection algorithm 𝒜_n as the sum of its 𝖳𝗒𝗉𝖾-𝖨 and 𝖳𝗒𝗉𝖾-𝖨𝖨 error probabilities, namely,

\mathsf{R}({\cal A}_{n})=\mathbb{P}_{{\cal H}_{0}}({\cal A}_{n}(\mathsf{X})=1)+\mathbb{P}_{{\cal H}_{1}}({\cal A}_{n}(\mathsf{X})=0), \qquad(5)

where 0\mathbb{P}_{{\cal H}_{0}} and 1\mathbb{P}_{{\cal H}_{1}} denote the probability distributions under the null hypothesis and the alternative hypothesis, respectively. If 𝖱(𝒜n)0\mathsf{R}({\cal A}_{n})\to 0 as nn\to\infty, then we say that 𝒜n{\cal A}_{n} solves the detection problem. The algorithms we consider here are either unconstrained (and thus might be computationally expensive) or run in polynomial time (computationally efficient). Typically, unconstrained algorithms are considered in order to show that information-theoretic lower bounds are asymptotically tight. An algorithm that runs in polynomial time must run in 𝗉𝗈𝗅𝗒(n)\mathsf{poly}(n) time, where nn is the size of the input. As mentioned in the introduction, our goal is to derive necessary and sufficient conditions for when it is impossible and possible to detect the underlying submatrices, with and without computational constraints, for both the 𝖲𝖣\mathsf{SD} and 𝖢𝖲𝖣\mathsf{CSD} models.
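The risk in (5) of any candidate test can also be probed numerically by simple Monte Carlo simulation. The short helper below is our own sketch (the function and argument names are not from the paper); it estimates the sum of the 𝖳𝗒𝗉𝖾-𝖨 and 𝖳𝗒𝗉𝖾-𝖨𝖨 error rates of a given detector from independent draws under the two hypotheses.

```python
import numpy as np

def empirical_risk(detector, null_sampler, alt_sampler, trials=500, seed=0):
    """Monte Carlo estimate of the risk in (5).

    `detector` maps an n x n array to {0, 1}; `null_sampler(rng)` and
    `alt_sampler(rng)` return one observation under H0 and H1, respectively.
    """
    rng = np.random.default_rng(seed)
    type_one = np.mean([detector(null_sampler(rng)) == 1 for _ in range(trials)])
    type_two = np.mean([detector(alt_sampler(rng)) == 0 for _ in range(trials)])
    return type_one + type_two
```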

2.2 The recovery problem

Next, we consider the recovery variant of the problem in Definition 2. Note that the submatrix recovery problem that corresponds to the problem in Definition 1, where the entries of the submatrices are not necessarily consecutive, was investigated in [23]. In the recovery problem, we assume that the data follow the distribution under 1{\cal H}_{1} in Definition 2, and the inference task is to recover the location of the planted submatrices. This is the analog of the particle picking problem in cryo-EM that was introduced in Section 1. Consider the following definition.

Definition 3 (Consecutive submatrix recovery).

Let (𝒫,𝒬) be a pair of distributions over a measurable space (ℝ,ℬ). Assume that 𝖷∈ℝ^{n×n} is drawn from 𝒟̃(n,k,m,𝒫,𝒬), the distribution of matrices 𝖷 with entries 𝖷_{ij}∼𝒫 if (i,j)∈𝖪⋆ and 𝖷_{ij}∼𝒬 otherwise, where the entries are conditionally independent given 𝖪⋆∈𝒦_{k,m,n}^{𝖼𝗈𝗇}. The goal is to recover the hidden submatrices 𝖪⋆, up to a permutation of the submatrix indices, given the matrix 𝖷. We let 𝖢𝖲𝖱(n,k,m,𝒫,𝒬) denote this recovery problem.

Several metrics of reconstruction accuracy are possible; we will focus on two: the exact and correlated recovery criteria. Our estimation procedures produce a set 𝖪̂=𝖪̂(𝖷) that aims to estimate the underlying true submatrices 𝖪⋆. Consider the following definitions.

Definition 4 (Exact recovery).

We say that 𝖪^\hat{\mathsf{K}} achieves exact recovery of 𝖪\mathsf{K}^{\star}, if, as nn\to\infty, sup𝖪𝒦k,m,n𝖼𝗈𝗇(𝖪^𝖪)0\sup_{\mathsf{K}^{\star}\in{\cal K}_{k,m,n}^{\mathsf{con}}}\mathbb{P}(\hat{\mathsf{K}}\neq\mathsf{K}^{\star})\to 0.

Definition 5 (Correlated recovery).

The overlap of 𝖪⋆ and 𝖪̂ is defined as the expected size of their intersection (where we identify each set of index pairs with its n×n indicator matrix), i.e.,

\mathsf{overlap}(\mathsf{K}^{\star},\hat{\mathsf{K}})\triangleq\mathbb{E}\langle\mathsf{K}^{\star},\hat{\mathsf{K}}\rangle=\sum_{(i,j)\in[n]\times[n]}\mathbb{P}\big((i,j)\in\mathsf{K}^{\star}\cap\hat{\mathsf{K}}\big). \qquad(6)

We say that 𝖪^\hat{\mathsf{K}} achieves correlated recovery of 𝖪\mathsf{K}^{\star} if there exists a fixed constant ϵ>0\epsilon>0, such that limnsup𝖪𝒦k,m,n𝖼𝗈𝗇𝗈𝗏𝖾𝗋𝗅𝖺𝗉(𝖪,𝖪^)mk2ϵ\lim_{n\to\infty}\sup_{\mathsf{K}^{\star}\in{\cal K}_{k,m,n}^{\mathsf{con}}}\frac{\mathsf{overlap}(\mathsf{K}^{\star},\hat{\mathsf{K}})}{mk^{2}}\geq\epsilon.
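In simulations, the overlap in (6) can be estimated by replacing the expectation with an empirical average; a single-realization version is sketched below, where the planted and estimated sets are represented by boolean n×n masks (our own convention, not the paper's notation).

```python
import numpy as np

def overlap_fraction(K_star, K_hat):
    """Single-realization analogue of overlap(K*, K-hat) / (m k^2).

    K_star and K_hat are boolean n x n arrays marking planted and estimated
    entries; correlated recovery corresponds to this ratio staying bounded
    away from zero as n grows.
    """
    return np.logical_and(K_star, K_hat).sum() / K_star.sum()
```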

As in the detection problem, we consider both unconstrained and polynomial-time algorithms, and we aim to derive necessary and sufficient conditions for when it is impossible and possible to recover the underlying submatrices.

3 Main Results

In this section, we present our main results for the detection and recovery problems, starting with the former. For both problems, we derive the statistical and computational bounds for the two models we presented in the previous section.

3.1 The detection problem

Upper bounds.

We start by presenting our upper bounds. To that end, we propose three algorithms and analyze their performance. Define the statistics,

\mathsf{T}_{\mathsf{sum}}(\mathsf{X})\triangleq\sum_{i,j\in[n]}\mathsf{X}_{ij}, \qquad(7)
\mathsf{T}^{\mathsf{SD}}_{\mathsf{scan}}(\mathsf{X})\triangleq\max_{\mathsf{K}\in{\cal K}_{k,1,n}}\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}, \qquad(8)
\mathsf{T}^{\mathsf{CSD}}_{\mathsf{scan}}(\mathsf{X})\triangleq\max_{\mathsf{K}\in{\cal K}_{k,1,n}^{\mathsf{con}}}\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}. \qquad(9)

The statistic in (7) amounts to adding up all the elements of 𝖷, while the statistics in (8) and (9) enumerate all k×k submatrices of 𝖷 in 𝒦_{k,1,n} and 𝒦_{k,1,n}^{𝖼𝗈𝗇}, respectively, and return the maximal sum of entries over these submatrices. Fix δ>0. Then, our tests are defined as,

{\cal A}_{\mathsf{sum}}(\mathsf{X})\triangleq\mathbbm{1}\left\{\mathsf{T}_{\mathsf{sum}}(\mathsf{X})\geq\tau_{\mathsf{sum}}\right\}, \qquad(10)
{\cal A}_{\mathsf{scan}}^{\mathsf{SD}}(\mathsf{X})\triangleq\mathbbm{1}\left\{\mathsf{T}_{\mathsf{scan}}^{\mathsf{SD}}(\mathsf{X})\geq\tau^{\mathsf{SD}}_{\mathsf{scan}}\right\}, \qquad(11)
{\cal A}_{\mathsf{scan}}^{\mathsf{CSD}}(\mathsf{X})\triangleq\mathbbm{1}\left\{\mathsf{T}_{\mathsf{scan}}^{\mathsf{CSD}}(\mathsf{X})\geq\tau^{\mathsf{CSD}}_{\mathsf{scan}}\right\}, \qquad(12)

where the thresholds are given by \tau_{\mathsf{sum}}\triangleq\frac{mk^{2}\lambda}{2}, \tau^{\mathsf{SD}}_{\mathsf{scan}}\triangleq\sqrt{(4+\delta)k^{2}\log\binom{n}{k}}, and \tau^{\mathsf{CSD}}_{\mathsf{scan}}\triangleq\sqrt{(4+\delta)k^{2}\log{n}}, and correspond roughly to the average between the expected values of each of the statistics in (7)–(9) under the null and alternative hypotheses. It should be emphasized that the tests in (10)–(11) were proposed in, e.g., [37, 11, 42], for the single planted submatrix detection problem.

A few important remarks are in order. First, note that in the scan test we search for a single planted submatrix rather than m such submatrices. Second, the sum test exhibits polynomial computational complexity, of O(n²) operations, and is hence efficient. The scan test in (11), however, exhibits an exponential computational complexity, and is thus inefficient; indeed, the search space in (11) is of cardinality |𝒦_{k,1,n}|=\binom{n}{k}^{2}. On the other hand, the scan test 𝒜_{𝗌𝖼𝖺𝗇}^{𝖢𝖲𝖣} for the consecutive setting is efficient because |𝒦_{k,1,n}^{𝖼𝗈𝗇}|≤n². The following result provides sufficient conditions under which the risk of each of the above tests is asymptotically small.
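To illustrate the complexity claims above, the sketch below implements the sum statistic (7) and the consecutive scan statistic (9); the latter is computed in O(n²) time using a two-dimensional prefix sum, a standard implementation detail that is not spelled out in the paper. The non-consecutive scan (8) has no analogous shortcut, since it ranges over all \binom{n}{k}² index sets.

```python
import numpy as np

def t_sum(X):
    """The sum statistic (7)."""
    return X.sum()

def t_scan_csd(X, k):
    """The consecutive scan statistic (9), computed in O(n^2) time."""
    n = X.shape[0]
    C = np.zeros((n + 1, n + 1))
    C[1:, 1:] = X.cumsum(axis=0).cumsum(axis=1)   # C[i, j] = sum of X[:i, :j]
    # Sum of every contiguous k x k block, via inclusion-exclusion on C.
    blocks = C[k:, k:] - C[:-k, k:] - C[k:, :-k] + C[:-k, :-k]
    return blocks.max()

# The tests (10) and (12) then compare these statistics with the thresholds
# tau_sum = m * k**2 * lam / 2 and tau_scan_csd = ((4 + delta) * k**2 * np.log(n)) ** 0.5.
```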

Theorem 1 (Detection upper bounds).

Consider the detection problems in Definitions 1 and 2. Then, we have the following bounds:

  1. 1.

    (Efficient 𝖲𝖣\mathsf{SD}) There exists an efficient algorithm 𝒜𝗌𝗎𝗆{\cal A}_{\mathsf{sum}} in (10), such that if

    \lambda=\omega\left(\frac{n}{mk^{2}}\right), \qquad(13)

    then 𝖱(𝒜𝗌𝗎𝗆)0\mathsf{R}\left({\cal A}_{\mathsf{sum}}\right)\to 0, as nn\to\infty, for the problems in Definitions 1 and 2.

  2. 2.

    (Exhaustive 𝖲𝖣\mathsf{SD}) There exists an algorithm 𝒜𝗌𝖼𝖺𝗇𝖲𝖣{\cal A}_{\mathsf{scan}}^{\mathsf{SD}} in (11), such that if

    \lambda=\omega\left(\sqrt{\frac{\log\frac{n}{k}}{k}}\right), \qquad(14)

    then 𝖱(𝒜𝗌𝖼𝖺𝗇𝖲𝖣)0\mathsf{R}\left({\cal A}_{\mathsf{scan}}^{\mathsf{SD}}\right)\to 0, as nn\to\infty, for the problem in Definition 1.

  3. 3.

    (Efficient 𝖢𝖲𝖣\mathsf{CSD}) There exists an efficient algorithm 𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣{\cal A}_{\mathsf{scan}}^{\mathsf{CSD}} in (12), such that if

    \lambda=\omega\left(\frac{\sqrt{\log\frac{n}{k}}}{k}\right), \qquad(15)

    then 𝖱(𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣)0\mathsf{R}({\cal A}_{\mathsf{scan}}^{\mathsf{CSD}})\to 0, as nn\to\infty, for the problem in Definition 2.

As can be seen from Theorem 1, only the performance barrier of the sum test exhibits a dependency on m; the barrier of the scan test, for both 𝖲𝖣 and 𝖢𝖲𝖣, is inherently independent of m. This makes sense: when summing all the elements of 𝖷, the mean (the “signal”) under the alternative hypothesis grows with m, whereas the scan test searches for a single planted submatrix, so the number of planted submatrices does not play a role. One could argue that it might be beneficial to search for all m planted submatrices in the scan test; however, as we show below, this is not needed, and the bounds above are asymptotically tight.

Lower bounds.

To present our lower bounds, we first recall that the optimal testing error probability is determined by the total variation distance between the distributions under the null and the alternative hypotheses as follows (see, e.g., [47, Lemma 2.1]),

\min_{{\cal A}_{n}:\mathbb{R}^{n\times n}\to\left\{0,1\right\}}\left\{\mathbb{P}_{{\cal H}_{0}}({\cal A}_{n}(\mathsf{X})=1)+\mathbb{P}_{{\cal H}_{1}}({\cal A}_{n}(\mathsf{X})=0)\right\}=1-d_{\mathsf{TV}}(\mathbb{P}_{{\cal H}_{0}},\mathbb{P}_{{\cal H}_{1}}). \qquad(16)

The following result shows that under certain conditions the total variation between the null and alternative distributions is asymptotically small, and thus, there exists no test which can solve the above detection problems reliably.

Theorem 2 (Information-theoretic lower bounds).

We have the following results.

  1. 1.

    Consider the detection problem in Definition 1. If,

    \lambda=o\left(\frac{n}{mk^{2}}\wedge\frac{1}{\sqrt{k}}\right), \qquad(17)

    then d𝖳𝖵(0,1)=o(1)d_{\mathsf{TV}}(\mathbb{P}_{{\cal H}_{0}},\mathbb{P}_{{\cal H}_{1}})=o(1).

  2. 2.

    Consider the detection problem in Definition 2. If λ=o(k1)\lambda=o\left(k^{-1}\right), then d𝖳𝖵(0,1)=o(1)d_{\mathsf{TV}}(\mathbb{P}_{{\cal H}_{0}},\mathbb{P}_{{\cal H}_{1}})=o(1).

Theorem 2 above shows that our upper bounds in Theorem 1 are tight up to poly-log factors. Indeed, item 1 in Theorem 2 complements Items 1-2 in Theorem 1, for the 𝖲𝖣\mathsf{SD} problem, while item 2 in Theorem 2 complements Item 3 in Theorem 1, for the 𝖢𝖲𝖣\mathsf{CSD} problem. In the sequel, we illustrate our results using phase diagrams that show the tradeoff between kk and λ\lambda as a function of nn. One evident and important observation here is that the statistical limit for the 𝖢𝖲𝖣\mathsf{CSD} problem is attained using an efficient test. Thus, there is no statistical computational gap in the detection problem in Definition 2, and accordingly, it is either statistically impossible to solve the detection problem or it can be solved in polynomial time. This is not the case for the 𝖲𝖣\mathsf{SD} problem. Note that both the efficient sum and the exhaustive scan tests are needed to attain the information-theoretic lower bound (up to poly-log factors). As discussed above, however, here the scan test is not efficient. We next give evidence that, based on the low-degree polynomial conjecture, efficient algorithms that run in polynomial-time do not exist in the regime where the scan test succeeds while the sum test fails.

Computational lower bounds.

Note that the problem in Definition 1 exhibits a gap in terms of what can be achieved by the proposed polynomial-time algorithm and the computationally expensive scan test algorithm. In particular, it can be seen that in the regime where 1kλnmk2\frac{1}{\sqrt{k}}\ll\lambda\ll\frac{n}{mk^{2}}, while the problem can be solved by an exhaustive search using the scan test, we do not have a polynomial-time algorithm. Next, we give evidence that, in fact, an efficient algorithm does not exist in this region. To that end, we start with a brief introduction to the method of low-degree polynomials.

The premise of this method is to take low-degree multivariate polynomials in the entries of the observations as a proxy for efficiently-computable functions. The ideas below were first developed in a sequence of works in the sum-of-squares optimization literature [10, 28, 31, 30].

In the following, we follow the notations and definitions of [28, 18]. Any distribution ℙ_{ℋ_0} on Ω_n=ℝ^{n×n} induces an inner product on measurable functions f,g:Ω_n→ℝ, given by ⟨f,g⟩_{ℋ_0}=𝔼_{ℋ_0}[f(𝖷)g(𝖷)], and a norm ‖f‖_{ℋ_0}=⟨f,f⟩_{ℋ_0}^{1/2}. We let L²(ℙ_{ℋ_0}) denote the Hilbert space consisting of functions f for which ‖f‖_{ℋ_0}<∞, endowed with the above inner product and norm. In the computationally-unbounded case, the Neyman-Pearson lemma shows that the likelihood ratio test achieves the optimal tradeoff between 𝖳𝗒𝗉𝖾-𝖨 and 𝖳𝗒𝗉𝖾-𝖨𝖨 error probabilities. Furthermore, it is well known that the same test optimally distinguishes ℙ_{ℋ_0} from ℙ_{ℋ_1} in the L² sense. Specifically, denoting by 𝖫_n≜ℙ_{ℋ_1}/ℙ_{ℋ_0} the likelihood ratio, the second-moment method for contiguity (see, e.g., [18]) shows that if ‖𝖫_n‖²_{ℋ_0} remains bounded as n→∞, then ℙ_{ℋ_1} is contiguous to ℙ_{ℋ_0}. This implies that ℙ_{ℋ_1} and ℙ_{ℋ_0} are statistically indistinguishable, i.e., no test can have both 𝖳𝗒𝗉𝖾-𝖨 and 𝖳𝗒𝗉𝖾-𝖨𝖨 error probabilities tending to zero.

We now describe the low-degree method. The idea is to find the low-degree polynomial that best distinguishes ℙ_{ℋ_0} from ℙ_{ℋ_1} in the L² sense. To that end, we let 𝒱_{n,≤𝖣}⊂L²(ℙ_{ℋ_0}) denote the linear subspace of polynomials Ω_n→ℝ of degree at most 𝖣∈ℕ, and let 𝒫_{≤𝖣}:L²(ℙ_{ℋ_0})→𝒱_{n,≤𝖣} denote the corresponding orthogonal projection operator. Then, the 𝖣-low-degree likelihood ratio 𝖫_n^{≤𝖣} is the projection of the likelihood ratio 𝖫_n onto the span of coordinate-degree-𝖣 functions, where the projection is orthogonal with respect to the inner product ⟨·,·⟩_{ℋ_0}. As discussed above, the likelihood ratio optimally distinguishes ℙ_{ℋ_0} from ℙ_{ℋ_1} in the L² sense. The next lemma shows that, over the set of low-degree polynomials, the 𝖣-low-degree likelihood ratio exhibits the same property.

Lemma 1 (Optimality of 𝖫_n^{≤𝖣} [31, 30, 18]).

Consider the following optimization problem:

\max\;\mathbb{E}_{{\cal H}_{1}}f(\mathsf{X})\quad\mathrm{s.t.}\quad\mathbb{E}_{{\cal H}_{0}}f^{2}(\mathsf{X})=1,\;f\in{\cal V}_{n,\leq\mathsf{D}}. \qquad(18)

Then, the unique solution ff^{\star} for (18) is the 𝖣\mathsf{D}-low degree likelihood ratio f=𝖫n𝖣/𝖫n𝖣0f^{\star}=\mathsf{L}_{n}^{\leq\mathsf{D}}/\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|_{{\cal H}_{0}}, and the value of the optimization problem is 𝖫n𝖣0\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|_{{\cal H}_{0}}.

As was mentioned above, in the computationally-unbounded regime, an important property of the likelihood ratio is that if 𝖫n0\left\|\mathsf{L}_{n}\right\|_{{\cal H}_{0}} is bounded, then 0\mathbb{P}_{{\cal H}_{0}} and 1\mathbb{P}_{{\cal H}_{1}} are statistically indistinguishable. The following conjecture states that a computational analog of this property holds, with 𝖫n𝖣\mathsf{L}_{n}^{\leq\mathsf{D}} playing the role of the likelihood ratio. In fact, it also postulates that polynomials of degree logn\approx\log n are a proxy for polynomial-time algorithms. The conjecture below is based on [28, 31, 30], and [28, Conj. 2.2.4]. We give an informal statement of this conjecture, which appears in [18, Conj. 1.16]. For a precise statement, we refer the reader to [28, Conj. 2.2.4] and [18, Sec. 4].

Conjecture 1 (Low-degree conjecture, informal).

Given a sequence of probability measures 0\mathbb{P}_{{\cal H}_{0}} and 1\mathbb{P}_{{\cal H}_{1}}, if there exists ϵ>0\epsilon>0 and 𝖣=𝖣(n)(logn)1+ϵ\mathsf{D}=\mathsf{D}(n)\geq(\log n)^{1+\epsilon}, such that 𝖫n𝖣0\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|_{{\cal H}_{0}} remains bounded as nn\to\infty, then there is no polynomial-time algorithm that distinguishes 0\mathbb{P}_{{\cal H}_{0}} and 1\mathbb{P}_{{\cal H}_{1}}.

In the sequel, we will rely on Conjecture 1 to give evidence for the statistical-computational gap observed for the problem in Definition 1 in the regime where 1kλnmk2\frac{1}{\sqrt{k}}\ll\lambda\ll\frac{n}{mk^{2}}. At this point we would like to mention [28, Hypothesis 2.1.5], which states a more general form of Conjecture 1 in the sense that it postulates that degree-𝖣\mathsf{D} polynomials are a proxy for nO(D)n^{O(D)}-time algorithms. Note that if 𝖫n𝖣0=O(1)\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|_{{\cal H}_{0}}=O(1), then we expect detection in time 𝖳(n)=e𝖣(n)\mathsf{T}(n)=e^{\mathsf{D}(n)} to be impossible.
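To give a flavor of how ‖𝖫_n^{≤𝖣}‖_{ℋ_0} can be controlled, we recall one standard route for additive Gaussian models (see, e.g., [18]); we state it here only as an illustrative aside, and the proof of Theorem 3 below may be organized differently. Writing the alternative of Definition 1 as 𝖷=λ·𝟙_{𝖪}+𝖹, with 𝖪∼Unif[𝒦_{k,m,n}] and 𝖹 having i.i.d. 𝒩(0,1) entries, an expansion of 𝖫_n in (normalized) Hermite polynomials gives

\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|_{{\cal H}_{0}}^{2}=\sum_{d=0}^{\mathsf{D}}\frac{\lambda^{2d}}{d!}\,\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|^{d}\right],

where 𝖪 and 𝖪′ are independent copies of the planted set. Letting 𝖣→∞ recovers 𝔼[e^{λ²|𝖪∩𝖪′|}], which is exactly the χ²-type expression of Corollary 7 below (since χ²(𝒫||𝒬)=e^{λ²}−1 for the Gaussian pair), so the low-degree analysis amounts to controlling the low-order moments of |𝖪∩𝖪′| under the uniform prior.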

Theorem 3 (Computational lower bound).

Consider the detection problem in Definition 1. If λ is such that \frac{1}{\sqrt{k}}\ll\lambda\ll\frac{n}{mk^{2}}, then \left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|_{{\cal H}_{0}}=O(1), for any 𝖣=Ω(log n). On the other hand, if λ is such that \lambda\gg\frac{n}{mk^{2}}, then \left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|_{{\cal H}_{0}}=\omega(1).

Together with Conjecture 1, Theorem 3 implies that if we take degree-logn\log n polynomials as a proxy for all efficient algorithms, our calculations predict that an nO(logn)n^{O(\log n)} algorithm does not exist when 1kλnmk2\frac{1}{\sqrt{k}}\ll\lambda\ll\frac{n}{mk^{2}}. This is summarized in the following corollary.

Corollary 4.

Consider the detection problem in Definition 1, and assume that Conjecture 1 holds. An nO(logn)n^{O(\log n)} algorithm that achieves strong detection does not exist if λ\lambda is such that 1kλnmk2\frac{1}{\sqrt{k}}\ll\lambda\ll\frac{n}{mk^{2}}.

These predictions agree with the statistical-computational tradeoffs established in the previous subsections. A more explicit formula for the computational barrier, which exhibits a dependency on 𝖣 and λ, can be deduced from the proof of Theorem 3; to keep the exposition simple, we opted to present the less refined result above.

We note that numerical and theoretical evidence for the existence of computational-statistical gaps were observed in other statistical models that are also inspired by cryo-EM, including heterogeneous multi-reference alignment [7, 49] and sparse multi-reference alignment [16].

Phase diagrams.

Using Theorems 1–3, we are now in a position to draw the phase diagrams for our detection problems. Specifically, treating k and λ as polynomials in n, i.e., k=Θ(n^β) and λ=Θ(n^{-α}) for some α∈(0,1) and β∈(0,1), we obtain the phase diagram in Figure 2(a) for a fixed number of submatrices m=O(1). Specifically,

  1. 1.

    Computationally easy regime (blue region): there is a polynomial-time algorithm for the detection task when α<2β1\alpha<2\beta-1.

  2. 2.

    Computationally hard regime (red region): there is an inefficient algorithm for detection when α<β/2 and α>2β−1, but we conjecture the problem to be computationally hard (i.e., that no polynomial-time algorithm exists), in the sense that the class of low-degree polynomials fails in this region.

  3. 3.

    Statistically impossible regime: detection is statistically impossible when α>β2(2β1)\alpha>\frac{\beta}{2}\vee(2\beta-1).
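The three regimes listed above (and their analogues for a growing number of submatrices, discussed next) can be summarized in a few lines of code. This is merely a restatement of Table 1 in the parametrization k=Θ(n^β), λ=Θ(n^{-α}), m=Θ(n^γ), ignoring poly-log factors and boundary cases; the function name and the exponent γ for m are our own notation.

```python
def sd_regime(alpha, beta, gamma=0.0):
    """Regime of submatrix detection for lambda = n^-alpha, k = n^beta, m = n^gamma.

    Restates Table 1 / Figure 2 up to poly-log factors; boundaries are
    assigned arbitrarily.
    """
    easy_threshold = gamma + 2 * beta - 1           # lambda >> 1 ^ n / (m k^2)
    stat_threshold = max(beta / 2, easy_threshold)  # lambda >> (n/(m k^2)) ^ 1/sqrt(k)
    if alpha < easy_threshold:
        return "easy"
    if alpha < stat_threshold:
        return "hard"
    return "impossible"

# Examples matching Figure 2(a) (m = O(1), i.e., gamma = 0):
# sd_regime(0.4, 0.5) -> "impossible"; sd_regime(0.2, 0.5) -> "hard";
# sd_regime(0.2, 0.7) -> "easy".
```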

(a) m=O(1)
(b) m=Θ(n^{1/4})
Figure 2: Phase diagrams for submatrix detection as a function of k=Θ(n^β) and λ=Θ(n^{-α}), for m=O(1) and m=Θ(n^{1/4}).

When the number of submatrices grows with n, i.e., m=ω(1), we get different phase diagrams depending on its growth rate. For example, if m=Θ(n^{1/4}), we get Figure 2(b). Specifically,

  1. 1.

    Computationally easy regime (blue region): there is a polynomial-time algorithm for the detection task when α<2β34\alpha<2\beta-\frac{3}{4}.

  2. 2.

    Computationally hard regime (red region): there is an inefficient algorithm for detection when α<β/2 and α>2β−3/4, but we conjecture the problem to be computationally hard (i.e., that no polynomial-time algorithm exists), in the sense that the class of low-degree polynomials fails in this region.

  3. 3.

    Statistically impossible regime: detection is statistically impossible when α>β2(2β3/4)\alpha>\frac{\beta}{2}\vee(2\beta-3/4).

Finally, for the consecutive problem, we get the phase diagram in Figure 3, independently of the value of mm. Here, there are only two regions where the problem is either statistically impossible or easy to solve.

Figure 3: Phase diagram for consecutive submatrix detection, as a function of k=Θ(n^β) and λ=Θ(n^{-α}), for any m.

3.2 The recovery problem

Upper bounds.

We start by presenting our upper bounds for both exact and correlated types of recovery for the consecutive problem in Definition 3. To that end, we propose the following recovery algorithm. It can be shown that the maximum-likelihood (ML) estimator, minimizing the error probability, is given by (see Subsection 4.4 for a complete derivation),

\hat{\mathsf{K}}_{\mathsf{ML}}(\mathsf{X})=\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,m,n}^{\mathsf{con}}}\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}. \qquad(19)

The computational complexity of the exhaustive search in (19) is of order n^{2m}. Thus, for m=O(1), the ML estimator runs in polynomial time, and is thus efficient. However, if m=ω(1), then the exhaustive search is no longer efficient. Nonetheless, the following straightforward modification of (19) provably achieves the same asymptotic performance as the ML estimator above, while being computationally efficient.

Before we present this algorithm, we make a simplifying technical assumption on the possible set of planted submatrices, and then explain how this assumption can be removed. We assume that each pair of submatrices in the underlying planted set 𝖪⋆ is separated by at least k columns and at least k rows. Similar assumptions are frequently made when analyzing statistical models inspired by cryo-EM, see, for example, [6]. We will refer to the above as the separation assumption.

Our recovery algorithm works as follows: in step ℓ∈[m], we find the ML estimate of a single submatrix using,

\hat{\mathsf{K}}_{\ell}(\mathsf{X}^{(\ell)})=\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,1,n}^{\mathsf{con}}}\sum_{(i,j)\in\mathsf{K}}\mathsf{X}^{(\ell)}_{ij}, \qquad(20)

where 𝖷()\mathsf{X}^{(\ell)} is defined recursively as follows: 𝖷(1)𝖷\mathsf{X}^{(1)}\triangleq\mathsf{X}, and for 2\ell\geq 2,

\mathsf{X}^{(\ell)}_{ij}=\begin{cases}-\infty, & (i,j)\in\hat{\mathsf{K}}_{\ell-1},\\ \mathsf{X}^{(\ell-1)}_{ij}, & \text{otherwise}.\end{cases} \qquad(21)

To wit, in each step of the algorithm we “peel” the sets of indices (i.e., the submatrices) estimated in previous steps from the search space: the corresponding entries of 𝖷 are set to −∞, so that the sum in (20) cannot be maximized by previously chosen sets of indices. We denote by 𝖪̂_{𝗉𝖾𝖾𝗅}(𝖷)={𝖪̂_ℓ}_{ℓ=1}^{m} the output of the above algorithm.
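A compact implementation sketch of the peeling procedure (20)–(21) for the consecutive model is given below. It is our own illustration: block sums are computed with a two-dimensional prefix sum, a large negative constant stands in for the −∞ entries of (21) for numerical convenience, and the estimated submatrices are returned through their top-left corners.

```python
import numpy as np

def peel_recover(X, k, m):
    """Peeling estimator (20)-(21): return the top-left corners of m estimated
    consecutive k x k submatrices."""
    X = np.array(X, dtype=float)
    n = X.shape[0]
    corners = []
    for _ in range(m):
        # Block sums of all contiguous k x k submatrices via prefix sums.
        C = np.zeros((n + 1, n + 1))
        C[1:, 1:] = X.cumsum(axis=0).cumsum(axis=1)
        blocks = C[k:, k:] - C[:-k, k:] - C[k:, :-k] + C[:-k, :-k]
        i, j = np.unravel_index(np.argmax(blocks), blocks.shape)
        corners.append((i, j))
        X[i:i + k, j:j + k] = -1e6   # "peel" the estimated block, cf. (21)
    return corners
```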

Remark 1.

Without the assumption above, the fact that the peeling algorithm succeeds is not trivial. If, for example, the chosen planted matrices are such that they include a pair of adjacent matrices, then it could be the case that at some step of the peeling algorithm, the estimated set of indices corresponds to a certain submatrix of the union of those adjacent matrices. However, one can easily modify the peeling algorithm, drop the assumption above, and obtain the same statistical guarantees stated below. Indeed, consider the following modification to the peeling routine in Algorithm 1.

Algorithm 1 Modified Peeling

  1. Initialize 𝖿𝗅𝖺𝗀←0, ℓ←1, \mathscr{K}←∅, and 𝖠←𝟎_{n×n}.

  2. While 𝖿𝗅𝖺𝗀=0:

     (a) \hat{\mathsf{K}}_{\ell}(\mathsf{X})\leftarrow\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,1,n}^{\mathsf{con}}\setminus\mathscr{K}}\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}.

     (b) 𝖠_{ij}←1, for all (i,j)∈\hat{\mathsf{K}}_{\ell}(\mathsf{X}) (entries of 𝖠 set in previous iterations are kept).

     (c) \mathscr{K}\leftarrow\mathscr{K}\cup\{\hat{\mathsf{K}}_{\ell}(\mathsf{X})\}.

     (d) If ⟨𝐉,𝖠⟩=mk², where 𝐉 denotes the all-ones n×n matrix, set 𝖿𝗅𝖺𝗀←1.

     (e) Else, set ℓ←ℓ+1.

  3. Output 𝖠.

The key idea is as follows. In the first step, we find the consecutive k×k submatrix of 𝖷 with the maximum sum of entries, which we denote by 𝖪̂_1; this is exactly the first step of the peeling algorithm. In the second step, we again search for the consecutive k×k submatrix of 𝖷 with the maximum sum of entries, but, of course, remove 𝖪̂_1 from the search space. More generally, in the ℓ-th step, we again search for the consecutive k×k submatrix of 𝖷 with the maximum sum of entries, but remove the previously selected submatrices \mathscr{K}=\{\hat{\mathsf{K}}_{i}\}_{i=1}^{\ell-1} from the search space. We terminate this process once ∪_{i=1}^{ℓ}𝖪̂_i∈𝒦_{k,m,n}^{𝖼𝗈𝗇}, i.e., once the union of the estimated submatrices can be cast as a proper set of m planted submatrices. This can easily be checked by forming the matrix 𝖠 in Step 2(b) and checking the condition in Step 2(d). If the planted submatrices are not adjacent, then this will be the case (under the conditions in the theorem below) after ℓ=m steps, with high probability. Otherwise, if at least two planted submatrices are adjacent, then while ℓ might be larger than m, it is bounded by n², and it is guaranteed that such a union exists. Once we find such a union, it is easy to recover the set of m consecutive k×k submatrices from 𝖠.

We have the following result.

Theorem 5 (Recovery upper bounds).

Consider the recovery problem in Definition 3, and let 𝖢\mathsf{C} be a universal constant. Then, we have the following set of bounds:

  1. 1.

    (ML Exact Recovery) Consider the ML estimator in (19). If

    \liminf_{n\to\infty}\frac{\lambda}{\sqrt{\mathsf{C}k^{-1}\log n}}>1, \qquad(22)

    then exact recovery is possible.

  2. 2.

    (Peeling Exact Recovery) Consider the peeling estimator in (20), and assume that the separation assumption holds. Then, if

    \liminf_{n\to\infty}\frac{\lambda}{\sqrt{\mathsf{C}k^{-1}\log n}}>1, \qquad(23)

    then exact recovery is possible.

  3. 3.

    (Peeling Correlated Recovery) Consider the peeling estimator in (20), and assume that the separation assumption holds. If

    \liminf_{n\to\infty}\frac{\lambda}{\sqrt{\mathsf{C}k^{-2}\log n}}>1, \qquad(24)

    then correlated recovery is possible.

Lower bounds.

The following result shows that under certain conditions, exact and correlated recoveries are impossible.

Theorem 6 (Information-theoretic recovery lower bounds).

Consider the recovery problem in Definition 3. Then:

  1. 1.

    If λ<𝖢logmk\lambda<\mathsf{C}\sqrt{\frac{\log m}{k}}, exact recovery is impossible, i.e.,

    \inf_{\hat{\mathsf{K}}}\sup_{\mathsf{K}^{\star}\in{\cal K}_{k,m,n}^{\mathsf{con}}}\mathbb{P}\big[\hat{\mathsf{K}}(\mathsf{X})\neq\mathsf{K}^{\star}\big]>\frac{1}{2},

    where the infimum ranges over all measurable functions of the matrix 𝖷\mathsf{X}.

  2. 2.

    If λ=o(k1)\lambda=o(k^{-1}), correlated recovery is impossible, i.e., sup𝖪𝒦k,m,n𝖼𝗈𝗇𝗈𝗏𝖾𝗋𝗅𝖺𝗉(𝖪,𝖪^)=o(mk2)\sup_{\mathsf{K}^{\star}\in{\cal K}_{k,m,n}^{\mathsf{con}}}\mathsf{overlap}(\mathsf{K}^{\star},\hat{\mathsf{K}})=o(mk^{2}).

Thus, similarly to the detection problem, the consecutive recovery problem is either statistically impossible or easy to solve. The corresponding phase diagram for exact and correlated types of recoveries is given in Figure 4. Roughly speaking, exact recovery is possible if λ=ω(k1/2)\lambda=\omega(k^{-1/2}) and impossible if λ=o(k1/2)\lambda=o(k^{-1/2}). Correlated recovery is possible if λ=ω(k1)\lambda=\omega(k^{-1}) and impossible if λ=o(k1)\lambda=o(k^{-1}).

A few remarks are in order. First, note that there is a gap between detection and exact recovery; the barrier for λ\lambda for the former is at k1k^{-1}, while for the latter at k1/2k^{-1/2}. In the context of cryo-EM, this indicates a gap between the ability to detect the existence of particle images in the data set, and the ability to perform successful particle picking (exact recovery). Recently, new computational methods were devised to elucidate molecular structures without particle picking, thus bypassing the limit of exact recovery, allowing constructing structures in very low SNR environments, e.g.,  [6, 36, 38]. This in turn opens the door to recovering small molecular structures that induce low SNR [29]. Second, there is no gap between detection and correlated recovery, and these different tasks are asymptotically statistically the same. The same gap exists between correlated and exact recoveries, implying that exact recovery is strictly harder than correlated recovery.

Figure 4: Phase diagram for consecutive submatrix exact recovery (left) and correlated recovery (right), as a function of k=Θ(n^β) and λ=Θ(n^{-α}), for any m.

4 Proofs

4.1 Proof of Theorem 1

4.1.1 Sum test

Recall the sum test in (10), and let τmk2λ2\tau\triangleq\frac{mk^{2}\lambda}{2}. Let us analyze the corresponding error probability. On the one hand, under 0{\cal H}_{0}, it is clear that 𝖳𝗌𝗎𝗆(𝖷)𝒩(0,n2)\mathsf{T}_{\mathsf{sum}}(\mathsf{X})\sim{\cal N}(0,n^{2}). Thus,

0(𝒜𝗌𝗎𝗆(𝖷)=1)\displaystyle\mathbb{P}_{{\cal H}_{0}}\left({\cal A}_{\mathsf{sum}}(\mathsf{X})=1\right) =0(𝖳𝗌𝗎𝗆(𝖷)τ)\displaystyle=\mathbb{P}_{{\cal H}_{0}}\left(\mathsf{T}_{\mathsf{sum}}(\mathsf{X})\geq\tau\right)
=(𝒩(0,n2)τ)\displaystyle=\mathbb{P}({\cal N}(0,n^{2})\geq\tau) (25)
12exp(τ22n2).\displaystyle\leq\frac{1}{2}\exp\left(-\frac{\tau^{2}}{2n^{2}}\right).

On the other hand, under 1{\cal H}_{1}, 𝖳𝗌𝗎𝗆(𝖷)𝒩(mk2λ,n2)\mathsf{T}_{\mathsf{sum}}(\mathsf{X})\sim{\cal N}(mk^{2}\lambda,n^{2}). Thus,

1(𝒜𝗌𝗎𝗆(𝖷)=0)\displaystyle\mathbb{P}_{{\cal H}_{1}}\left({\cal A}_{\mathsf{sum}}(\mathsf{X})=0\right) =1(𝖳𝗌𝗎𝗆(𝖷)τ)\displaystyle=\mathbb{P}_{{\cal H}_{1}}\left(\mathsf{T}_{\mathsf{sum}}(\mathsf{X})\leq\tau\right)
=(𝒩(mk2λ,n2)τ)\displaystyle=\mathbb{P}({\cal N}(mk^{2}\lambda,n^{2})\leq\tau) (26)
12exp((τmk2λ)22n2).\displaystyle\leq\frac{1}{2}\exp\left(-\frac{(\tau-mk^{2}\lambda)^{2}}{2n^{2}}\right).

Substituting τ=mk2λ2\tau=\frac{mk^{2}\lambda}{2}, we obtain that

𝖱(𝒜𝗌𝗎𝗆)exp(m2k4λ28n2).\displaystyle\mathsf{R}\left({\cal A}_{\mathsf{sum}}\right)\leq\exp\left(-\frac{m^{2}k^{4}\lambda^{2}}{8n^{2}}\right). (27)

Thus, if mk2λn\frac{mk^{2}\lambda}{n}\to\infty, then 𝖱(𝒜𝗌𝗎𝗆)0\mathsf{R}\left({\cal A}_{\mathsf{sum}}\right)\to 0, as nn\to\infty. Note that the analysis above holds true for both detection problems in Definitions 1 and 2.
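As a quick numerical sanity check of the bound (27), one can simulate 𝖳_𝗌𝗎𝗆 directly from its Gaussian law under the two hypotheses, as derived above; the parameter values below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m, lam, trials = 100, 10, 2, 3.0, 5000
tau = m * k**2 * lam / 2                       # the threshold tau_sum

# Under H0: T_sum ~ N(0, n^2); under H1: T_sum ~ N(m k^2 lam, n^2).
t_null = n * rng.standard_normal(trials)
t_alt = m * k**2 * lam + n * rng.standard_normal(trials)
risk_hat = np.mean(t_null >= tau) + np.mean(t_alt < tau)

bound = np.exp(-m**2 * k**4 * lam**2 / (8 * n**2))
print(f"empirical risk ~ {risk_hat:.4f}, bound (27) = {bound:.4f}")  # risk falls below the bound
```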

4.1.2 Scan test

Recall the scan test 𝒜𝗌𝖼𝖺𝗇𝖲𝖣(𝖷){\cal A}_{\mathsf{scan}}^{\mathsf{SD}}(\mathsf{X}) in (11), and its consecutive version 𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣(𝖷){\cal A}_{\mathsf{scan}}^{\mathsf{CSD}}(\mathsf{X}). Let us start by analyzing the error probability associated with 𝒜𝗌𝖼𝖺𝗇𝖲𝖣{\cal A}^{\mathsf{SD}}_{\mathsf{scan}}. For simplicity of notation, we let τ(4+δ)k2log(nk)\tau\triangleq\sqrt{(4+\delta)k^{2}\log\binom{n}{k}}. On the one hand, under 0{\cal H}_{0}, we have

0(𝒜𝗌𝖼𝖺𝗇𝖲𝖣(𝖷)=1)\displaystyle\mathbb{P}_{{\cal H}_{0}}\left({\cal A}^{\mathsf{SD}}_{\mathsf{scan}}(\mathsf{X})=1\right) =0(𝖳𝗌𝖼𝖺𝗇𝖲𝖣(𝖷)τ)\displaystyle=\mathbb{P}_{{\cal H}_{0}}\left(\mathsf{T}^{\mathsf{SD}}_{\mathsf{scan}}(\mathsf{X})\geq\tau\right)
(nk)2(𝒩(0,k2)τ)\displaystyle\leq\binom{n}{k}^{2}\mathbb{P}({\cal N}(0,k^{2})\geq\tau) (28)
12exp(2log(nk)τ22k2).\displaystyle\leq\frac{1}{2}\exp\left(2\log\binom{n}{k}-\frac{\tau^{2}}{2k^{2}}\right).

On the other hand, under 1{\cal H}_{1}, we have

1(𝒜𝗌𝖼𝖺𝗇𝖲𝖣(𝖷)=0)\displaystyle\mathbb{P}_{{\cal H}_{1}}\left({\cal A}^{\mathsf{SD}}_{\mathsf{scan}}(\mathsf{X})=0\right) =1(𝖳𝗌𝖼𝖺𝗇𝖲𝖣(𝖷)τ)\displaystyle=\mathbb{P}_{{\cal H}_{1}}\left(\mathsf{T}^{\mathsf{SD}}_{\mathsf{scan}}(\mathsf{X})\leq\tau\right) (29)
(𝒩(k2λ,k2)τ)\displaystyle\leq\mathbb{P}({\cal N}(k^{2}\lambda,k^{2})\leq\tau) (30)
12exp((k2λτ)+22k2).\displaystyle\leq\frac{1}{2}\exp\left(-\frac{(k^{2}\lambda-\tau)_{+}^{2}}{2k^{2}}\right). (31)

Thus,

𝖱(𝒜𝗌𝖼𝖺𝗇𝖲𝖣)12exp(2log(nk)τ22k2)+exp((k2λτ)+22k2).\displaystyle\mathsf{R}\left({\cal A}^{\mathsf{SD}}_{\mathsf{scan}}\right)\leq\frac{1}{2}\exp\left(2\log\binom{n}{k}-\frac{\tau^{2}}{2k^{2}}\right)+\exp\left(-\frac{(k^{2}\lambda-\tau)_{+}^{2}}{2k^{2}}\right). (32)

Substituting τ=(4+δ)k2log(nk)\tau=\sqrt{(4+\delta)k^{2}\log\binom{n}{k}}, we get

𝖱(𝒜𝗌𝖼𝖺𝗇𝖲𝖣)12exp(δ2log(nk))+exp(k2(λτk2)+22),\displaystyle\mathsf{R}\left({\cal A}^{\mathsf{SD}}_{\mathsf{scan}}\right)\leq\frac{1}{2}\exp\left(-\frac{\delta}{2}\cdot\log\binom{n}{k}\right)+\exp\left(-k^{2}\frac{\left(\lambda-\frac{\tau}{k^{2}}\right)_{+}^{2}}{2}\right), (33)

and thus 𝖱(𝒜𝗌𝖼𝖺𝗇𝖲𝖣)0\mathsf{R}\left({\cal A}^{\mathsf{SD}}_{\mathsf{scan}}\right)\to 0, as nn\to\infty, provided that lim infnλ4k1lognk>1\liminf_{n\to\infty}\frac{\lambda}{\sqrt{4k^{-1}\log\frac{n}{k}}}>1, as claimed. Next, we analyze 𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣{\cal A}^{\mathsf{CSD}}_{\mathsf{scan}}. Let τ𝖼(4+δ)k2logn\tau_{\mathsf{c}}\triangleq\sqrt{(4+\delta)k^{2}\log{n}}. As above, we have

0(𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣(𝖷)=1)\displaystyle\mathbb{P}_{{\cal H}_{0}}\left({\cal A}^{\mathsf{CSD}}_{\mathsf{scan}}(\mathsf{X})=1\right) =0(𝖳𝗌𝖼𝖺𝗇𝖢𝖲𝖣(𝖷)τ𝖼)\displaystyle=\mathbb{P}_{{\cal H}_{0}}\left(\mathsf{T}^{\mathsf{CSD}}_{\mathsf{scan}}(\mathsf{X})\geq\tau_{\mathsf{c}}\right) (34)
n2(𝒩(0,k2)τ𝖼)\displaystyle\leq n^{2}\mathbb{P}({\cal N}(0,k^{2})\geq\tau_{\mathsf{c}}) (35)
12exp(2lognτ𝖼22k2).\displaystyle\leq\frac{1}{2}\exp\left(2\log{n}-\frac{\tau_{\mathsf{c}}^{2}}{2k^{2}}\right). (36)

On the other hand, under 1{\cal H}_{1}, the analysis remains unchanged:

1(𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣(𝖷)=0)\displaystyle\mathbb{P}_{{\cal H}_{1}}\left({\cal A}^{\mathsf{CSD}}_{\mathsf{scan}}(\mathsf{X})=0\right) =1(𝖳𝗌𝖼𝖺𝗇𝖢𝖲𝖣(𝖷)τ𝖼)\displaystyle=\mathbb{P}_{{\cal H}_{1}}\left(\mathsf{T}^{\mathsf{CSD}}_{\mathsf{scan}}(\mathsf{X})\leq\tau_{\mathsf{c}}\right) (37)
(𝒩(k2λ,k2)τ𝖼)\displaystyle\leq\mathbb{P}({\cal N}(k^{2}\lambda,k^{2})\leq\tau_{\mathsf{c}}) (38)
12exp((k2λτ𝖼)+22k2).\displaystyle\leq\frac{1}{2}\exp\left(-\frac{(k^{2}\lambda-\tau_{\mathsf{c}})_{+}^{2}}{2k^{2}}\right). (39)

Thus,

𝖱(𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣)12exp(2lognτ𝖼22k2)+exp((k2λτ𝖼)+22k2).\displaystyle\mathsf{R}\left({\cal A}^{\mathsf{CSD}}_{\mathsf{scan}}\right)\leq\frac{1}{2}\exp\left(2\log{n}-\frac{\tau_{\mathsf{c}}^{2}}{2k^{2}}\right)+\exp\left(-\frac{(k^{2}\lambda-\tau_{\mathsf{c}})_{+}^{2}}{2k^{2}}\right). (40)

Substituting τ𝖼=(4+δ)k2logn\tau_{\mathsf{c}}=\sqrt{(4+\delta)k^{2}\log{n}}, for δ>0\delta>0, we get

𝖱(𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣)12exp(δ2logn)+exp(k2(λτ𝖼k2)+22),\displaystyle\mathsf{R}\left({\cal A}^{\mathsf{CSD}}_{\mathsf{scan}}\right)\leq\frac{1}{2}\exp\left(-\frac{\delta}{2}\cdot\log{n}\right)+\exp\left(-k^{2}\frac{\left(\lambda-\frac{\tau_{\mathsf{c}}}{k^{2}}\right)_{+}^{2}}{2}\right), (41)

and thus 𝖱(𝒜𝗌𝖼𝖺𝗇𝖢𝖲𝖣)0\mathsf{R}\left({\cal A}^{\mathsf{CSD}}_{\mathsf{scan}}\right)\to 0, as nn\to\infty, provided that lim infnλ4k2logn>1\liminf_{n\to\infty}\frac{\lambda}{\sqrt{4k^{-2}\log n}}>1, as claimed.
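For the consecutive variant, the scan statistic can be computed exactly in O(n²) time with a two-dimensional prefix sum, since only (n−k+1)² windows need to be examined; the non-consecutive statistic 𝖳_𝗌𝖼𝖺𝗇^𝖲𝖣, in contrast, requires a search over all pairs of size-k row and column subsets. The Python sketch below is an illustration of 𝒜_𝗌𝖼𝖺𝗇^𝖢𝖲𝖣 with the threshold τ_𝖼 used above; it is not taken from the paper, and the default value of δ is arbitrary.

import numpy as np

def csd_scan_statistic(X, k):
    # largest sum of a k x k submatrix of X with consecutive row and column indices
    n = X.shape[0]
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = X.cumsum(axis=0).cumsum(axis=1)          # 2-D prefix sums
    # window sums via inclusion-exclusion, one per top-left corner
    S = P[k:, k:] - P[:-k, k:] - P[k:, :-k] + P[:-k, :-k]
    return S.max()

def csd_scan_test(X, k, delta=0.1):
    # A_scan^CSD: compare the statistic with tau_c = sqrt((4 + delta) * k^2 * log n)
    n = X.shape[0]
    tau_c = np.sqrt((4 + delta) * k**2 * np.log(n))
    return int(csd_scan_statistic(X, k) >= tau_c)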

4.2 Proof of Theorem 2

4.2.1 Submatrix detection

Recall that the optimal test 𝒜n{\cal A}_{n}^{\ast} that minimizes the risk is the likelihood ratio test defined as follows,

𝒜n(𝖷)𝟙{𝖫n(𝖷)1},\displaystyle{\cal A}_{n}^{\ast}\left(\mathsf{X}\right)\triangleq\mathbbm{1}\left\{\mathsf{L}_{n}\left(\mathsf{X}\right)\geq 1\right\}, (42)

where 𝖫n(𝖷)1(𝖷)0(𝖷)\mathsf{L}_{n}\left(\mathsf{X}\right)\triangleq\frac{\mathbb{P}_{{\cal H}_{1}}\left(\mathsf{X}\right)}{\mathbb{P}_{{\cal H}_{0}}\left(\mathsf{X}\right)}. The optimal risk, denoted by 𝖱=𝖱(𝒜n)\mathsf{R}^{\ast}=\mathsf{R}({\cal A}_{n}^{\ast}), can be lower bounded using the Cauchy–Schwarz inequality as follows,

𝖱\displaystyle\mathsf{R}^{\ast} =112𝔼0|𝖫n(𝖷)1|\displaystyle=1-\frac{1}{2}\mathbb{E}_{{\cal H}_{0}}\left|\mathsf{L}_{n}\left(\mathsf{X}\right)-1\right| (43)
112𝔼0[(𝖫n(𝖷)1)2]\displaystyle\geq 1-\frac{1}{2}\sqrt{\mathbb{E}_{{\cal H}_{0}}\left[\left(\mathsf{L}_{n}\left(\mathsf{X}\right)-1\right)^{2}\right]}
=112𝔼0[(𝖫n(𝖷))2]1.\displaystyle=1-\frac{1}{2}\sqrt{\mathbb{E}_{{\cal H}_{0}}\left[\left(\mathsf{L}_{n}\left(\mathsf{X}\right)\right)^{2}\right]-1}.

Thus, in order to lower bound the risk, we need to upper bound 𝔼0[(𝖫n(𝖷))2]\mathbb{E}_{{\cal H}_{0}}\left[\left(\mathsf{L}_{n}\left(\mathsf{X}\right)\right)^{2}\right]. Below, we provide an upper bound on this quantity that holds for any pair of distributions 𝒫{\cal P} and 𝒬{\cal Q}.

Corollary 7.

The following holds:

𝔼0[(𝖫n(𝖷))2]=𝔼𝖪𝖪[(1+χ2(𝒫||𝒬))|𝖪𝖪|]𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|],\displaystyle\mathbb{E}_{{\cal H}_{0}}\left[\left(\mathsf{L}_{n}\left(\mathsf{X}\right)\right)^{2}\right]=\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[(1+\chi^{2}({\cal P}||{\cal Q}))^{\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right]\leq\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right], (44)

where 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} are two independent copies drawn uniformly at random from 𝒦k,m,n{\cal K}_{k,m,n} (or, 𝒦¯k,m,n\bar{{\cal K}}_{k,m,n}), and

χ2(𝒫||𝒬)𝔼X𝒬[𝒫(X)𝒬(X)]21.\displaystyle\chi^{2}({\cal P}||{\cal Q})\triangleq\mathbb{E}_{X\sim{\cal Q}}\left[\frac{{\cal P}(X)}{{\cal Q}(X)}\right]^{2}-1. (45)
Proof of Corollary 7.

First, note that the likelihood can be written as follows:

𝖫n(𝖷)\displaystyle\mathsf{L}_{n}\left(\mathsf{X}\right) =1(𝖷)0(𝖷)=𝔼𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)((i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij)).\displaystyle=\frac{\mathbb{P}_{{\cal H}_{1}}(\mathsf{X})}{\mathbb{P}_{{\cal H}_{0}}(\mathsf{X})}=\mathbb{E}_{\mathsf{K}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left(\prod_{(i,j)\in\mathsf{K}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right). (46)

Now, note that the square of the right-hand side of (46) can be rewritten as:

[𝔼𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)((i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij))]2=𝔼𝖪𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)((i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij)(i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij)).\displaystyle\left[\mathbb{E}_{\mathsf{K}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left(\prod_{(i,j)\in\mathsf{K}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right)\right]^{2}=\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left(\prod_{(i,j)\in\mathsf{K}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\prod_{(i,j)\in\mathsf{K}^{\prime}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right). (47)

Therefore,

𝔼0[(𝖫n(𝖷))2]=𝔼0[𝔼𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)((i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij))]2\displaystyle\mathbb{E}_{{\cal H}_{0}}\left[\left(\mathsf{L}_{n}\left(\mathsf{X}\right)\right)^{2}\right]=\mathbb{E}_{{\cal H}_{0}}\left[\mathbb{E}_{\mathsf{K}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left(\prod_{(i,j)\in\mathsf{K}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right)\right]^{2} (48)
=𝔼0[𝔼𝖪𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)((i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij)(i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij))]\displaystyle\hskip 28.45274pt=\mathbb{E}_{{\cal H}_{0}}\left[\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left(\prod_{(i,j)\in\mathsf{K}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\prod_{(i,j)\in\mathsf{K}^{\prime}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right)\right] (49)
=𝔼𝖪𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)[𝔼0((i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij)(i,j)𝖪𝒫(𝖷ij)𝒬(𝖷ij))]\displaystyle\hskip 28.45274pt=\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left[\mathbb{E}_{{\cal H}_{0}}\left(\prod_{(i,j)\in\mathsf{K}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\prod_{(i,j)\in\mathsf{K}^{\prime}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right)\right] (50)
=𝔼𝖪𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)[𝔼0((i,j)𝖪𝖪𝖪𝖪𝒫(𝖷ij)𝒬(𝖷ij)(i,j)𝖪𝖪(𝒫(𝖷ij)𝒬(𝖷ij))2)]\displaystyle\hskip 28.45274pt=\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left[\mathbb{E}_{{\cal H}_{0}}\left(\prod_{(i,j)\in\mathsf{K\cup K^{\prime}\setminus K\cap K^{\prime}}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\prod_{(i,j)\in\mathsf{K\cap K^{\prime}}}\left(\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right)^{2}\right)\right] (51)
=𝔼𝖪𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)[(i,j)𝖪𝖪𝖪𝖪𝔼0[𝒫(𝖷ij)𝒬(𝖷ij)](i,j)𝖪𝖪𝔼0[𝒫(𝖷ij)𝒬(𝖷ij)]2]\displaystyle\hskip 28.45274pt=\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left[\prod_{(i,j)\in\mathsf{K\cup K^{\prime}\setminus K\cap K^{\prime}}}\mathbb{E}_{{\cal H}_{0}}\left[\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right]\prod_{(i,j)\in\mathsf{K\cap K^{\prime}}}\mathbb{E}_{{\cal H}_{0}}\left[\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right]^{2}\right] (52)
=(a)𝔼𝖪𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)[(𝔼0[𝒫(𝖷ij)𝒬(𝖷ij)]2)|𝖪𝖪|]\displaystyle\hskip 28.45274pt\stackrel{{\scriptstyle(a)}}{{=}}\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left[\left(\mathbb{E}_{{\cal H}_{0}}\left[\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}\right]^{2}\right)^{\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right] (53)
=𝔼𝖪𝖪𝖴𝗇𝗂𝖿(𝒦k,m,n)[(1+χ2(𝒫||𝒬))|𝖪𝖪|]\displaystyle\hskip 28.45274pt=\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}\sim\mathsf{Unif}({\cal K}_{k,m,n})}\left[\left(1+\chi^{2}({\cal P}||{\cal Q})\right)^{\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right] (54)
(b)𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|],\displaystyle\hskip 28.45274pt\stackrel{{\scriptstyle(b)}}{{\leq}}\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right], (55)

where (a)(a) is because 𝔼𝒬𝒫(𝖷ij)𝒬(𝖷ij)=1\mathbb{E}_{{\cal Q}}\frac{{\cal P}(\mathsf{X}_{ij})}{{\cal Q}(\mathsf{X}_{ij})}=1, and (b)(b) is because 1+xexp(x)1+x\leq\exp(x), for any xx\in\mathbb{R}. ∎

Based on Corollary 7, it suffices to upper bound 𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|]\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right]. Recall that 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} are decomposed as 𝖪==1m𝖲×𝖳\mathsf{K}=\bigcup_{\ell=1}^{m}\mathsf{S}_{\ell}\times\mathsf{T}_{\ell} and 𝖪==1m𝖲×𝖳\mathsf{K}^{\prime}=\bigcup_{\ell=1}^{m}\mathsf{S}^{\prime}_{\ell}\times\mathsf{T}^{\prime}_{\ell}. Thus, we note that the intersection of 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} can be rewritten as

|𝖪𝖪|\displaystyle\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right| =1=1m2=1m|(𝖲1𝖲2)×(𝖳1𝖳2)|\displaystyle=\sum_{\ell_{1}=1}^{m}\sum_{\ell_{2}=1}^{m}|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})\times(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})| (56)
=1=1m2=1m|(𝖲1𝖲2)||(𝖳1𝖳2)|.\displaystyle=\sum_{\ell_{1}=1}^{m}\sum_{\ell_{2}=1}^{m}|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})|\cdot|(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})|. (57)

For each 1,2[m]\ell_{1},\ell_{2}\in[m], define 𝖹1,2|(𝖲1𝖲2)|\mathsf{Z}_{\ell_{1},\ell_{2}}\triangleq|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})| and 𝖱1,2|(𝖳1𝖳2)|\mathsf{R}_{\ell_{1},\ell_{2}}\triangleq|(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})|. Note that the sequence of random variables {𝖹1,2}1,2\{\mathsf{Z}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}} is statistically independent of the sequence {𝖱1,2}1,2\{\mathsf{R}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}}. Next, it is easy to show that 𝖹1,2𝖧𝗒𝗉𝖾𝗋𝗀𝖾𝗈𝗆𝖾𝗍𝗋𝗂𝖼(n,k,k)\mathsf{Z}_{\ell_{1},\ell_{2}}\sim\mathsf{Hypergeometric}(n,k,k) and 𝖱1,2𝖧𝗒𝗉𝖾𝗋𝗀𝖾𝗈𝗆𝖾𝗍𝗋𝗂𝖼(n,k,k)\mathsf{R}_{\ell_{1},\ell_{2}}\sim\mathsf{Hypergeometric}(n,k,k), for any 1,2[m]\ell_{1},\ell_{2}\in[m]. Indeed, if we have an urn of nn balls among which kk balls are red, the random variable 𝖹1,2\mathsf{Z}_{\ell_{1},\ell_{2}} (and 𝖱1,2\mathsf{R}_{\ell_{1},\ell_{2}}) is exactly the number of red balls if we draw kk balls from the urn uniformly at random without replacement, which is the definition of a Hypergeometric random variable. While the random variables {𝖹1,2}1,2\{\mathsf{Z}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}} (and similarly {𝖱1,2}1,2\{\mathsf{R}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}}) are not independent, they are negatively associated. Thus,

𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|]1=1m2=1m𝔼[eχ2(𝒫||𝒬)𝖹1,2𝖱1,2]=[𝔼(eχ2(𝒫||𝒬)𝖹1,1𝖱1,1)]m2.\displaystyle\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right]\leq\prod_{\ell_{1}=1}^{m}\prod_{\ell_{2}=1}^{m}\mathbb{E}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{Z}_{\ell_{1},\ell_{2}}\mathsf{R}_{\ell_{1},\ell_{2}}}\right]=\left[\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{Z}_{1,1}\mathsf{R}_{1,1}}\right)\right]^{m^{2}}. (58)

Next, it is well-known that 𝖹1,1=𝖧𝗒𝗉𝖾𝗋𝗀𝖾𝗈𝗆𝖾𝗍𝗋𝗂𝖼(n,k,k)\mathsf{Z}_{1,1}=\mathsf{Hypergeometric}(n,k,k) (and similarly 𝖱1,1=𝖧𝗒𝗉𝖾𝗋𝗀𝖾𝗈𝗆𝖾𝗍𝗋𝗂𝖼(n,k,k)\mathsf{R}_{1,1}=\mathsf{Hypergeometric}(n,k,k)) is stochastically dominated by 𝖡𝖡𝗂𝗇𝗈𝗆𝗂𝖺𝗅(k,k/n)=i=1k𝖡𝖾𝗋𝗇(k/n)\mathsf{B}\sim\mathsf{Binomial}(k,k/n)=\sum_{i=1}^{k}\mathsf{Bern}(k/n). Thus,

𝔼(eχ2(𝒫||𝒬)𝖹1,1𝖱1,1)𝔼(eχ2(𝒫||𝒬)𝖡𝖡),\displaystyle\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{Z}_{1,1}\mathsf{R}_{1,1}}\right)\leq\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{B}\mathsf{B}^{\prime}}\right), (59)

where 𝖡\mathsf{B}^{\prime} is an independent copy of 𝖡\mathsf{B}. Thus,

𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|][𝔼(eχ2(𝒫||𝒬)𝖡𝖡)]m2.\displaystyle\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right]\leq\left[\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{B}\mathsf{B}^{\prime}}\right)\right]^{m^{2}}. (60)
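The only property of the hypergeometric overlaps used in (58)–(60) is that their moment generating function is dominated by that of a 𝖡𝗂𝗇𝗈𝗆𝗂𝖺𝗅(k,k/n) random variable, a classical consequence of sampling without replacement. The short Monte Carlo sanity check below (with arbitrary, purely illustrative parameters; the variable t plays the role of χ²(𝒫||𝒬)) is a sketch of this domination, not part of the proof.

import numpy as np

rng = np.random.default_rng(1)
n, k, t = 500, 20, 0.05      # illustrative values; t plays the role of chi^2(P||Q)
N = 200_000

# Z ~ Hypergeometric(n, k, k): overlap of two independent uniform k-subsets of [n]
Z = rng.hypergeometric(ngood=k, nbad=n - k, nsample=k, size=N)
# B ~ Binomial(k, k/n): the dominating variable used in (59)
B = rng.binomial(k, k / n, size=N)

# empirical MGFs; the first should not exceed the second (up to Monte Carlo error)
print(np.exp(t * Z).mean(), np.exp(t * B).mean())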

We show that, if χ2(𝒫||𝒬)\chi^{2}({\cal P}||{\cal Q}) satisfies the condition of Theorem 2, the term on the right-hand side of (60) is at most 1+δ1+\delta, for any δ>0\delta>0. We have

[𝔼(eχ2(𝒫||𝒬)𝖡𝖡)]m2=[𝔼(1+kn(eχ2(𝒫||𝒬)𝖡1))k]m2.\displaystyle\left[\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{B}\mathsf{B}^{\prime}}\right)\right]^{m^{2}}=\left[\mathbb{E}\left(1+\frac{k}{n}\left(e^{\chi^{2}({\cal P}||{\cal Q})\mathsf{B}}-1\right)\right)^{k}\right]^{m^{2}}. (61)

Next, note that 𝖡k\mathsf{B}\leq k and we also assume the following, for reasons that will become clear,

χ2(𝒫||𝒬)1k.\displaystyle\chi^{2}({\cal P}||{\cal Q})\leq\frac{1}{k}. (62)

Therefore, using the inequality ex1x+x2e^{x}-1\leq x+x^{2}, for x<1x<1, the following holds

[𝔼(eχ2(𝒫||𝒬)𝖡𝖡)]m2\displaystyle\left[\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{B}\mathsf{B}^{\prime}}\right)\right]^{m^{2}} [𝔼(1+kn(χ2(𝒫||𝒬)𝖡+χ4(𝒫||𝒬)𝖡2))k]m2\displaystyle\leq\left[\mathbb{E}\left(1+\frac{k}{n}\left(\chi^{2}({\cal P}||{\cal Q})\mathsf{B}+\chi^{4}({\cal P}||{\cal Q})\mathsf{B}^{2}\right)\right)^{k}\right]^{m^{2}} (63)
[𝔼(1+2knχ2(𝒫||𝒬)𝖡)k]m2\displaystyle\leq\left[\mathbb{E}\left(1+2\frac{k}{n}\chi^{2}({\cal P}||{\cal Q})\mathsf{B}\right)^{k}\right]^{m^{2}} (64)
[𝔼(e2k2nχ2(𝒫||𝒬)𝖡)]m2\displaystyle\leq\left[\mathbb{E}\left(e^{2\frac{k^{2}}{n}\chi^{2}({\cal P}||{\cal Q})\mathsf{B}}\right)\right]^{m^{2}} (65)
=[1+kn(e2k2nχ2(𝒫||𝒬)1)]km2.\displaystyle=\left[1+\frac{k}{n}\left(e^{2\frac{k^{2}}{n}\chi^{2}({\cal P}||{\cal Q})}-1\right)\right]^{km^{2}}. (66)

This is at most 1+δ1+\delta if

kn(e2k2nχ2(𝒫||𝒬)1)(1+δ)1km21.\displaystyle\frac{k}{n}\left(e^{2\frac{k^{2}}{n}\chi^{2}({\cal P}||{\cal Q})}-1\right)\leq(1+\delta)^{\frac{1}{km^{2}}}-1. (67)

Since (1+δ)1km21log(1+δ)/(km2)(1+\delta)^{\frac{1}{km^{2}}}-1\geq\log(1+\delta)/(km^{2}), this is implied by

χ2(𝒫||𝒬)n2k2log(1+nlog(1+δ)m2k2).\displaystyle\chi^{2}({\cal P}||{\cal Q})\leq\frac{n}{2k^{2}}\log\left(1+\frac{n\log(1+\delta)}{m^{2}k^{2}}\right). (68)

Putting everything together, we obtain that 𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|]1+δ\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right]\leq 1+\delta, if

χ2(𝒫||𝒬)\displaystyle\chi^{2}({\cal P}||{\cal Q}) min{1k,n2k2log(1+nlog(1+δ)m2k2)}\displaystyle\leq\min\left\{\frac{1}{k},\frac{n}{2k^{2}}\log\left(1+\frac{n\log(1+\delta)}{m^{2}k^{2}}\right)\right\} (69)
=min{1k,n2log(1+δ)2m2k4}.\displaystyle=\min\left\{\frac{1}{k},\frac{n^{2}\log(1+\delta)}{2m^{2}k^{4}}\right\}. (70)

Finally, note that in the Gaussian case, χ2(𝒩(λ,1)||𝒩(0,1))=12[exp(λ2)1]\chi^{2}({\cal N}(\lambda,1)||{\cal N}(0,1))=\frac{1}{2}\left[\exp\left(\lambda^{2}\right)-1\right]. Thus, for λ=o(1)\lambda=o(1), we have χ2(𝒩(λ,1)||𝒩(0,1))λ22\chi^{2}({\cal N}(\lambda,1)||{\cal N}(0,1))\to\frac{\lambda^{2}}{2}, which concludes the proof.

4.2.2 Consecutive submatrix detection

For the consecutive case, we note that by repeating the same steps as in the previous subsection, we have

𝔼0[(𝖫n(𝖷))2]𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|],\displaystyle\mathbb{E}_{{\cal H}_{0}}\left[\left(\mathsf{L}_{n}\left(\mathsf{X}\right)\right)^{2}\right]\leq\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right], (71)

where 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} are two independent copies drawn uniformly at random from 𝒦k,m,n𝖼𝗈𝗇{\cal K}_{k,m,n}^{\mathsf{con}}. The key distinction from the previous case lies in the distribution of |𝖪𝖪||\mathsf{K}\cap\mathsf{K}^{\prime}|. Recall that 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} are decomposed as 𝖪==1m𝖲×𝖳\mathsf{K}=\bigcup_{\ell=1}^{m}\mathsf{S}_{\ell}\times\mathsf{T}_{\ell} and 𝖪==1m𝖲×𝖳\mathsf{K}^{\prime}=\bigcup_{\ell=1}^{m}\mathsf{S}^{\prime}_{\ell}\times\mathsf{T}^{\prime}_{\ell}. Thus, we note that the intersection of 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} can be rewritten as

|𝖪𝖪|\displaystyle\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right| =1=1m2=1m|(𝖲1𝖲2)×(𝖳1𝖳2)|\displaystyle=\sum_{\ell_{1}=1}^{m}\sum_{\ell_{2}=1}^{m}|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})\times(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})| (72)
=1=1m2=1m|(𝖲1𝖲2)||(𝖳1𝖳2)|\displaystyle=\sum_{\ell_{1}=1}^{m}\sum_{\ell_{2}=1}^{m}|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})|\cdot|(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})| (73)
1=1m2=1m𝖹1,2.\displaystyle\triangleq\sum_{\ell_{1}=1}^{m}\sum_{\ell_{2}=1}^{m}\mathsf{Z}_{\ell_{1},\ell_{2}}. (74)

Note that for a given pair (1,2)(\ell_{1},\ell_{2}), we have

(|(𝖲1𝖲2)|=z)={n2k+1n,forz=02n,forz=1,2,,k11n,forz=k,\displaystyle\mathbb{P}(|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})|=z)=\begin{cases}\frac{n-2k+1}{n},\ &\mathrm{for}\ z=0\\ \frac{2}{n},\ &\mathrm{for}\ z=1,2,...,k-1\\ \frac{1}{n},\ &\mathrm{for}\ z=k,\end{cases} (75)

and the exact same distribution for |(𝖳1𝖳2)||(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})|. Thus, we may write 𝖹1,2=(d)𝖧𝖧\mathsf{Z}_{\ell_{1},\ell_{2}}\stackrel{{\scriptstyle(d)}}{{=}}\mathsf{H}\cdot\mathsf{H}^{\prime}, where 𝖧\mathsf{H} and 𝖧\mathsf{H}^{\prime} are statistically independent and follow the distribution given in (75). Thus, using the fact that the random variables {𝖹1,2}1,2\left\{\mathsf{Z}_{\ell_{1},\ell_{2}}\right\}_{\ell_{1},\ell_{2}} are negatively associated, we get,

𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|]1=1m2=1m𝔼[eχ2(𝒫||𝒬)𝖹1,2]=[𝔼(eχ2(𝒫||𝒬)𝖧𝖧)]m2.\displaystyle\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right]\leq\prod_{\ell_{1}=1}^{m}\prod_{\ell_{2}=1}^{m}\mathbb{E}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{Z}_{\ell_{1},\ell_{2}}}\right]=\left[\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{H}\cdot\mathsf{H}^{\prime}}\right)\right]^{m^{2}}. (76)

Now,

𝔼(eχ2(𝒫||𝒬)𝖧𝖧)\displaystyle\mathbb{E}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot\mathsf{H}\cdot\mathsf{H}^{\prime}}\right) =𝔼(n2k+1n+2ni=1k1eχ2(𝒫||𝒬)i𝖧+eχ2(𝒫||𝒬)k𝖧n)\displaystyle=\mathbb{E}\left(\frac{n-2k+1}{n}+\frac{2}{n}\sum_{i=1}^{k-1}e^{\chi^{2}({\cal P}||{\cal Q})\cdot i\mathsf{H}^{\prime}}+\frac{e^{\chi^{2}({\cal P}||{\cal Q})\cdot k\mathsf{H}^{\prime}}}{n}\right) (77)
𝔼(n2kn+2kneχ2(𝒫||𝒬)k𝖧)\displaystyle\leq\mathbb{E}\left(\frac{n-2k}{n}+\frac{2k}{n}e^{\chi^{2}({\cal P}||{\cal Q})\cdot k\mathsf{H}^{\prime}}\right) (78)
=n2kn+2kn(n2k+1n+2ni=1k1eχ2(𝒫||𝒬)ik+eχ2(𝒫||𝒬)k2n)\displaystyle=\frac{n-2k}{n}+\frac{2k}{n}\left(\frac{n-2k+1}{n}+\frac{2}{n}\sum_{i=1}^{k-1}e^{\chi^{2}({\cal P}||{\cal Q})\cdot ik}+\frac{e^{\chi^{2}({\cal P}||{\cal Q})\cdot k^{2}}}{n}\right) (79)
n2kn+2kn(n2kn+2kneχ2(𝒫||𝒬)k2)\displaystyle\leq\frac{n-2k}{n}+\frac{2k}{n}\left(\frac{n-2k}{n}+\frac{2k}{n}e^{\chi^{2}({\cal P}||{\cal Q})\cdot k^{2}}\right) (80)
=1+4k2n2(eχ2(𝒫||𝒬)k21).\displaystyle=1+\frac{4k^{2}}{n^{2}}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot k^{2}}-1\right). (81)

Therefore,

𝔼𝖪𝖪[eχ2(𝒫||𝒬)|𝖪𝖪|][1+4k2n2(eχ2(𝒫||𝒬)k21)]m2.\displaystyle\mathbb{E}_{\mathsf{K}\perp\!\!\!\perp\mathsf{K}^{\prime}}\left[e^{\chi^{2}({\cal P}||{\cal Q})\cdot\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|}\right]\leq\left[1+\frac{4k^{2}}{n^{2}}\left(e^{\chi^{2}({\cal P}||{\cal Q})\cdot k^{2}}-1\right)\right]^{m^{2}}. (82)

This is at most 1+δ1+\delta if,

4k2n2(eχ2(𝒫||𝒬)k21)(1+δ)1m21.\displaystyle\frac{4k^{2}}{n^{2}}\left(e^{\chi^{2}({\cal P}||{\cal Q})k^{2}}-1\right)\leq(1+\delta)^{\frac{1}{m^{2}}}-1. (83)

Since (1+δ)1m21log(1+δ)/(m2)(1+\delta)^{\frac{1}{m^{2}}}-1\geq\log(1+\delta)/(m^{2}), this is implied by

χ2(𝒫||𝒬)1k2log(1+n2log(1+δ)4k2m2).\displaystyle\chi^{2}({\cal P}||{\cal Q})\leq\frac{1}{k^{2}}\log\left(1+\frac{n^{2}\log(1+\delta)}{4k^{2}m^{2}}\right). (84)

Finally, note that since kmnkm\leq n, the logarithmic factor in (84) can be lower bounded by log(1+log(1+δ)/4)\log(1+\log(1+\delta)/4), which concludes the proof.

4.3 Proof of Theorem 3

In order to prove Theorem 3, we use the following result [18, Theorem 2.6].

Lemma 2.

Let 𝐒\mathbf{S} be an nn dimensional random vector drawn from some distribution 𝒟n{\cal D}_{n}, and let 𝐙\mathbf{Z} be an i.i.d. nn dimensional random vector with standard normal entries. Consider the detection problem:

0:𝐘=𝐙𝗏𝗌.1:𝐘=𝐒+𝐙.\displaystyle{\cal H}_{0}:\mathbf{Y}=\mathbf{Z}\quad\quad\mathsf{vs.}\quad\quad{\cal H}_{1}:\mathbf{Y}=\mathbf{S}+\mathbf{Z}. (85)

Then,

𝖫n𝖣20=𝔼𝐒𝐒[d=0𝖣1d!𝐒,𝐒d],\displaystyle\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|^{2}_{{\cal H}_{0}}=\mathbb{E}_{\mathbf{S}\perp\!\!\!\perp\mathbf{S}^{\prime}}\left[\sum_{d=0}^{\mathsf{D}}\frac{1}{d!}\left\langle\mathbf{S},\mathbf{S}^{\prime}\right\rangle^{d}\right], (86)

where 𝐒\mathbf{S} and 𝐒\mathbf{S}^{\prime} are drawn from 𝒟n{\cal D}_{n}, and 𝖫nD\mathsf{L}_{n}^{\leq D} is the 𝖣\mathsf{D}-low-degree likelihood ratio.

Our 𝖲𝖣\mathsf{SD} problem falls under the setting of Lemma 2. Specifically, let 𝖪𝖴𝗇𝗂𝖿[𝒦k,m,n]\mathsf{K}\sim\mathsf{Unif}\left[{\cal K}_{k,m,n}\right], and define 𝐒~\tilde{\mathbf{S}} to be an n×nn\times n matrix such that [𝐒~]ij=λ[\tilde{\mathbf{S}}]_{ij}=\lambda, if (i,j)𝖪(i,j)\in\mathsf{K}, and [𝐒~]ij=0[\tilde{\mathbf{S}}]_{ij}=0, otherwise. Also, we define 𝐒\mathbf{S} as the vectorized version of 𝐒~\tilde{\mathbf{S}}. Then, it is clear that our 𝖲𝖣\mathsf{SD} problem can be cast as the detection problem in Lemma 2, and thus,

𝖫n𝖣2\displaystyle\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|^{2} =𝔼𝐒𝐒[d=0𝖣1d!𝐒,𝐒d]\displaystyle=\mathbb{E}_{\mathbf{S}\perp\!\!\!\perp\mathbf{S}^{\prime}}\left[\sum_{d=0}^{\mathsf{D}}\frac{1}{d!}\left\langle\mathbf{S},\mathbf{S}^{\prime}\right\rangle^{d}\right] (87)
=d=0𝖣λ2dd!𝔼|𝖪𝖪|d,\displaystyle=\sum_{d=0}^{\mathsf{D}}\frac{\lambda^{2d}}{d!}\mathbb{E}\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|^{d}, (88)

where we have used the fact that 𝐒,𝐒=𝐒T𝐒=𝐒𝐒1=λ2|𝖪𝖪|\left\langle\mathbf{S},\mathbf{S}^{\prime}\right\rangle=\mathbf{S}^{T}\mathbf{S}^{\prime}=\left\|\mathbf{S}\odot\mathbf{S}^{\prime}\right\|_{1}=\lambda^{2}\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|, and 𝖪\mathsf{K}^{\prime} is an independent copy of 𝖪\mathsf{K}. Now, recall that 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} are decomposed as 𝖪==1m𝖲×𝖳\mathsf{K}=\bigcup_{\ell=1}^{m}\mathsf{S}_{\ell}\times\mathsf{T}_{\ell} and 𝖪==1m𝖲×𝖳\mathsf{K}^{\prime}=\bigcup_{\ell=1}^{m}\mathsf{S}^{\prime}_{\ell}\times\mathsf{T}^{\prime}_{\ell}. Thus, we note that the intersection of 𝖪\mathsf{K} and 𝖪\mathsf{K}^{\prime} can be rewritten as

|𝖪𝖪|\displaystyle\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right| =1=1m2=1m|(𝖲1𝖲2)×(𝖳1𝖳2)|\displaystyle=\sum_{\ell_{1}=1}^{m}\sum_{\ell_{2}=1}^{m}|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})\times(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})| (89)
=1=1m2=1m|(𝖲1𝖲2)||(𝖳1𝖳2)|.\displaystyle=\sum_{\ell_{1}=1}^{m}\sum_{\ell_{2}=1}^{m}|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})|\cdot|(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})|. (90)

For each 1,2[m]\ell_{1},\ell_{2}\in[m], define 𝖹1,2|(𝖲1𝖲2)|\mathsf{Z}_{\ell_{1},\ell_{2}}\triangleq|(\mathsf{S}_{\ell_{1}}\cap\mathsf{S}^{\prime}_{\ell_{2}})| and 𝖱1,2|(𝖳1𝖳2)|\mathsf{R}_{\ell_{1},\ell_{2}}\triangleq|(\mathsf{T}_{\ell_{1}}\cap\mathsf{T}^{\prime}_{\ell_{2}})|. Recall from the previous subsection that the sequence of random variables {𝖹1,2}1,2\{\mathsf{Z}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}} is statistically independent of the sequence {𝖱1,2}1,2\{\mathsf{R}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}}, and that 𝖹1,2𝖧𝗒𝗉𝖾𝗋𝗀𝖾𝗈𝗆𝖾𝗍𝗋𝗂𝖼(n,k,k)\mathsf{Z}_{\ell_{1},\ell_{2}}\sim\mathsf{Hypergeometric}(n,k,k) and 𝖱1,2𝖧𝗒𝗉𝖾𝗋𝗀𝖾𝗈𝗆𝖾𝗍𝗋𝗂𝖼(n,k,k)\mathsf{R}_{\ell_{1},\ell_{2}}\sim\mathsf{Hypergeometric}(n,k,k), for any 1,2[m]\ell_{1},\ell_{2}\in[m]. Furthermore, {𝖹1,2}1,2\{\mathsf{Z}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}} (and similarly {𝖱1,2}1,2\{\mathsf{R}_{\ell_{1},\ell_{2}}\}_{\ell_{1},\ell_{2}}) are negatively associated. Finally, recall that both 𝖹1,2\mathsf{Z}_{\ell_{1},\ell_{2}} and 𝖱1,2\mathsf{R}_{\ell_{1},\ell_{2}} are stochastically dominated by 𝖡𝗂𝗇𝗈𝗆𝗂𝖺𝗅(k,k/n)\mathsf{Binomial}(k,k/n). Thus, using [19] (see also [3, Theorem 1]), we have

𝔼|𝖪𝖪|d𝖡d2max{m2k4n2,(m2k4n2)d},\displaystyle\mathbb{E}\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|^{d}\leq\mathsf{B}_{d}^{2}\max\left\{\frac{m^{2}k^{4}}{n^{2}},\left(\frac{m^{2}k^{4}}{n^{2}}\right)^{d}\right\}, (91)

where 𝖡d\mathsf{B}_{d} is the ddth Bell number. Thus,

𝖫n𝖣2\displaystyle\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|^{2} 1+d=1𝖣λ2dd!𝖡d2max{m2k4n2,(m2k4n2)d}1+d=1𝖣𝖳d.\displaystyle\leq 1+\sum_{d=1}^{\mathsf{D}}\frac{\lambda^{2d}}{d!}\mathsf{B}_{d}^{2}\max\left\{\frac{m^{2}k^{4}}{n^{2}},\left(\frac{m^{2}k^{4}}{n^{2}}\right)^{d}\right\}\triangleq 1+\sum_{d=1}^{\mathsf{D}}\mathsf{T}_{d}. (92)

If m2k4n2<1\frac{m^{2}k^{4}}{n^{2}}<1, then for d=1𝖣𝖳d=O(1)\sum_{d=1}^{\mathsf{D}}\mathsf{T}_{d}=O(1) to hold, it clearly suffices that λ<1\lambda<1. On the other hand, if m2k4n2>1\frac{m^{2}k^{4}}{n^{2}}>1, then consider the ratio between successive terms:

𝖳d+1𝖳d=𝖡d+12(d+1)𝖡d2λ2m2k4n2.\displaystyle\frac{\mathsf{T}_{d+1}}{\mathsf{T}_{d}}=\frac{\mathsf{B}_{d+1}^{2}}{(d+1)\mathsf{B}_{d}^{2}}\lambda^{2}m^{2}\frac{k^{4}}{n^{2}}. (93)

Thus if λ\lambda is small enough, namely if

mk2λnd+12𝖡d𝖡d+1,\displaystyle\frac{mk^{2}\lambda}{n}\leq\frac{\sqrt{d+1}}{\sqrt{2}}\frac{\mathsf{B}_{d}}{\mathsf{B}_{d+1}}, (94)

then 𝖳d+1𝖳d1/2\frac{\mathsf{T}_{d+1}}{\mathsf{T}_{d}}\leq 1/2, for all 1d𝖣1\leq d\leq\mathsf{D}. In this case, by comparing with a geometric sum, we may bound 𝖫n𝖣2O(1)\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|^{2}\leq O(1). This concludes the proof.
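The geometric-sum argument above is easy to check numerically. The sketch below computes the Bell numbers via the Bell triangle and evaluates the terms 𝖳_d of (92) for illustrative parameter values (not taken from the paper) with m²k⁴/n² > 1 and mk²λ/n small; the terms decay rapidly, so the series is O(1).

from math import factorial

def bell_numbers(D):
    # Bell numbers B_0, ..., B_D via the Bell triangle
    B, row = [1], [1]
    for _ in range(D):
        new = [row[-1]]
        for x in row:
            new.append(new[-1] + x)
        row = new
        B.append(row[0])
    return B

# illustrative parameters with m^2 k^4 / n^2 > 1 and m k^2 lambda / n small
n, m, k, lam, D = 1_000, 5, 20, 0.01, 12
mu = (m**2) * (k**4) / (n**2)
B = bell_numbers(D)
T = [lam**(2 * d) / factorial(d) * B[d]**2 * max(mu, mu**d) for d in range(1, D + 1)]
print(["%.3g" % t for t in T])    # successive terms shrink geometrically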

To show that the analysis above is tight, note that

𝖫n𝖣2\displaystyle\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|^{2} =d=0𝖣λ2dd!𝔼|𝖪𝖪|d\displaystyle=\sum_{d=0}^{\mathsf{D}}\frac{\lambda^{2d}}{d!}\mathbb{E}\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right|^{d} (95)
λ2𝔼|𝖪𝖪|\displaystyle\geq\lambda^{2}\mathbb{E}\left|\mathsf{K}\cap\mathsf{K}^{\prime}\right| (96)
=λ2m2k4n2.\displaystyle=\lambda^{2}m^{2}\frac{k^{4}}{n^{2}}. (97)

Thus, if λ\lambda is large enough, namely if λ=ω(n/(mk2))\lambda=\omega(n/(mk^{2})), then 𝖫n𝖣2=ω(1)\left\|\mathsf{L}_{n}^{\leq\mathsf{D}}\right\|^{2}=\omega(1).

4.4 ML Estimator Derivation

The derivation below applies for both the case where 𝖪𝒦k,m,n\mathsf{K}\in{\cal K}_{k,m,n} and the consecutive case where 𝖪𝒦k,m,n𝖼𝗈𝗇\mathsf{K}\in{\cal K}_{k,m,n}^{\mathsf{con}}. Let 1|𝖪(𝖷|𝖪)\mathbb{P}_{{\cal H}_{1}|\mathsf{K}}(\mathsf{X}|\mathsf{K}) denote the conditional distribution of 𝖷\mathsf{X} given 𝖪\mathsf{K}. Recall that the ML estimate of 𝖪\mathsf{K} is given by

𝖪^𝖬𝖫(𝖷)=argmax𝖪𝒦k,m,nlog1|𝖪(𝖷|𝖪).\displaystyle\hat{\mathsf{K}}_{\mathsf{ML}}(\mathsf{X})=\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,m,n}}\log\mathbb{P}_{{\cal H}_{1}|\mathsf{K}}(\mathsf{X}|\mathsf{K}). (98)

Given 𝖪\mathsf{K}, the distribution of 𝖷\mathsf{X} under 1{\cal H}_{1} is given by,

\displaystyle\log\mathbb{P}_{{\cal H}_{1}|\mathsf{K}}(\mathsf{X}|\mathsf{K}) \displaystyle=-\frac{n^{2}}{2}\log(2\pi)-\frac{1}{2}\sum_{(i,j)\in\mathsf{K}}(\mathsf{X}_{ij}-\lambda)^{2}-\frac{1}{2}\sum_{(i,j)\not\in\mathsf{K}}\mathsf{X}^{2}_{ij} (99)
\displaystyle=-\frac{n^{2}}{2}\log(2\pi)-\frac{\lambda^{2}mk^{2}}{2}-\frac{1}{2}\sum_{(i,j)\in[n]^{2}}\mathsf{X}^{2}_{ij}+\lambda\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}. (100)

Noticing that only the last term at the r.h.s. of (100) depends on 𝖪\mathsf{K}, the ML estimator in (98) boils down to

𝖪^𝖬𝖫(𝖷)=argmax𝖪𝒦k,m,n(i,j)𝖪𝖷ij.\displaystyle\hat{\mathsf{K}}_{\mathsf{ML}}(\mathsf{X})=\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,m,n}}\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}. (101)

For the consecutive model, the ML estimator is given by (101), but with 𝒦k,m,n{\cal K}_{k,m,n} replaced by 𝒦k,m,n𝖼𝗈𝗇{\cal K}_{k,m,n}^{\mathsf{con}}. This problem amounts to finding mm disjoint k×kk\times k submatrices of 𝖷\mathsf{X}, with consecutive row and column indices, whose total sum of entries is maximal.

4.5 Proof of Theorem 5

4.5.1 Exact recovery using the ML estimator

In this subsection, we analyze the ML estimator. Recall that,

𝖪^𝖬𝖫(𝖷)=argmax𝖪𝒦k,m,n𝒮(𝖪),\displaystyle\hat{\mathsf{K}}_{\mathsf{ML}}(\mathsf{X})=\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,m,n}}{\cal S}(\mathsf{K}), (102)

where 𝒮(𝖪)(i,j)𝖪𝖷ij{\cal S}(\mathsf{K})\triangleq\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}. We next prove the conditions for which 𝖪^𝖬𝖫=𝖪\hat{\mathsf{K}}_{\mathsf{ML}}=\mathsf{K}^{\star}, with high probability, where 𝖪\mathsf{K}^{\star} are the mm planted submatrices. To prove the theorem, it suffices to show that 𝒮(𝖪)>𝒮(𝖪){\cal S}(\mathsf{K}^{\star})>{\cal S}(\mathsf{K}), for all feasible 𝖪\mathsf{K} with 𝖪𝖪\mathsf{K}\neq\mathsf{K}^{\star}. Let 𝖣(𝖪)𝒮(𝖪)𝒮(𝖪)\mathsf{D}(\mathsf{K})\triangleq{\cal S}(\mathsf{K}^{\star})-{\cal S}(\mathsf{K}). Note that

𝖣(𝖪)\displaystyle\mathsf{D}(\mathsf{K}) =(i,j)𝖪𝖷ij(i,j)𝖪𝖷ij\displaystyle=\sum_{(i,j)\in\mathsf{K}^{\star}}\mathsf{X}_{ij}-\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij} (103)
=(i,j)𝖪𝔼𝖷ij(i,j)𝖪𝔼𝖷ij+(i,j)𝖪[𝖷ij𝔼𝖷ij](i,j)𝖪[𝖷ij𝔼𝖷ij]\displaystyle=\sum_{(i,j)\in\mathsf{K}^{\star}}\mathbb{E}\mathsf{X}_{ij}-\sum_{(i,j)\in\mathsf{K}}\mathbb{E}\mathsf{X}_{ij}+\sum_{(i,j)\in\mathsf{K}^{\star}}[\mathsf{X}_{ij}-\mathbb{E}\mathsf{X}_{ij}]-\sum_{(i,j)\in\mathsf{K}}[\mathsf{X}_{ij}-\mathbb{E}\mathsf{X}_{ij}] (104)
=λ(mk2|𝖪𝖪|)+(i,j)𝖪𝖪[𝖷ijλ](i,j)𝖪𝖪𝖷ij\displaystyle=\lambda\cdot(mk^{2}-\left|\mathsf{K}^{\star}\cap\mathsf{K}\right|)+\sum_{(i,j)\in\mathsf{K}^{\star}\setminus\mathsf{K}}[\mathsf{X}_{ij}-\lambda]-\sum_{(i,j)\in\mathsf{K}\setminus\mathsf{K}^{\star}}\mathsf{X}_{ij} (105)
=λ(mk2|𝖪𝖪|)+𝖶1(𝖪)+𝖶2(𝖪),\displaystyle=\lambda\cdot(mk^{2}-\left|\mathsf{K}^{\star}\cap\mathsf{K}\right|)+\mathsf{W}_{1}(\mathsf{K})+\mathsf{W}_{2}(\mathsf{K}), (106)

where 𝖶1(𝖪)(i,j)𝖪𝖪[𝖷ijλ]\mathsf{W}_{1}(\mathsf{K})\triangleq\sum_{(i,j)\in\mathsf{K}^{\star}\setminus\mathsf{K}}[\mathsf{X}_{ij}-\lambda] and 𝖶2(𝖪)(i,j)𝖪𝖪𝖷ij\mathsf{W}_{2}(\mathsf{K})\triangleq-\sum_{(i,j)\in\mathsf{K}\setminus\mathsf{K}^{\star}}\mathsf{X}_{ij}. Because |𝖪|=|𝖪|=mk2|\mathsf{K}|=|\mathsf{K}^{\star}|=mk^{2}, we have |𝖪𝖪|=|𝖪𝖪|=mk2|𝖪𝖪||\mathsf{K}^{\star}\setminus\mathsf{K}|=|\mathsf{K}\setminus\mathsf{K}^{\star}|=mk^{2}-\left|\mathsf{K}^{\star}\cap\mathsf{K}\right|. Thus, both 𝖶1(𝖪)\mathsf{W}_{1}(\mathsf{K}) and 𝖶2(𝖪)\mathsf{W}_{2}(\mathsf{K}) are composed of the sum of mk2|𝖪𝖪|mk^{2}-\left|\mathsf{K}^{\star}\cap\mathsf{K}\right| i.i.d. centered Gaussian random variables with unit variance. Accordingly, for i=1,2i=1,2, and each fixed 𝖪\mathsf{K},

(𝖶i(𝖪)λ2(mk2|𝖪𝖪|))12exp[12λ2(mk2|𝖪𝖪|)],\displaystyle\mathbb{P}\left(\mathsf{W}_{i}(\mathsf{K})\leq-\frac{\lambda}{2}(mk^{2}-|\mathsf{K}^{\star}\cap\mathsf{K}|)\right)\leq\frac{1}{2}\exp\left[-\frac{1}{2}\lambda^{2}(mk^{2}-|\mathsf{K}^{\star}\cap\mathsf{K}|)\right], (107)

and therefore, by the union bound and (106),

(𝖣(𝖪)0)exp[12λ2(mk2|𝖪𝖪|)].\displaystyle\mathbb{P}\left(\mathsf{D}(\mathsf{K})\leq 0\right)\leq\exp\left[-\frac{1}{2}\lambda^{2}(mk^{2}-|\mathsf{K}^{\star}\cap\mathsf{K}|)\right]. (108)

Using (108) and the union bound once again, we get

(𝖪^𝖬𝖫(𝖷)𝖪)\displaystyle\mathbb{P}\left(\hat{\mathsf{K}}_{\mathsf{ML}}(\mathsf{X})\neq\mathsf{K}^{\star}\right) =[𝖪𝖪𝖣(𝖪)0]\displaystyle=\mathbb{P}\left[\bigcup_{\mathsf{K}\neq\mathsf{K}^{\star}}\mathsf{D}(\mathsf{K})\leq 0\right] (109)
𝖪𝖪(𝖣(𝖪)0)\displaystyle\leq\sum_{\mathsf{K}\neq\mathsf{K}^{\star}}\mathbb{P}\left(\mathsf{D}(\mathsf{K})\leq 0\right) (110)
𝖪𝖪exp[12λ2(mk2|𝖪𝖪|)]\displaystyle\leq\sum_{\mathsf{K}\neq\mathsf{K}^{\star}}\exp\left[-\frac{1}{2}\lambda^{2}(mk^{2}-|\mathsf{K}^{\star}\cap\mathsf{K}|)\right] (111)
==0mk2k|𝖪𝒦k,m,n𝖼𝗈𝗇:|𝖪𝖪|=|e12λ2(mk2),\displaystyle=\sum_{\ell=0}^{mk^{2}-k}\left|\mathsf{K}\in{\cal K}_{k,m,n}^{\mathsf{con}}:|\mathsf{K}^{\star}\cap\mathsf{K}|=\ell\right|e^{-\frac{1}{2}\lambda^{2}(mk^{2}-\ell)}, (112)

where the last equality follows from the fact that since 𝖪,𝖪𝒦k,m,n𝖼𝗈𝗇\mathsf{K}^{\star},\mathsf{K}\in{\cal K}_{k,m,n}^{\mathsf{con}} and 𝖪𝖪\mathsf{K}\neq\mathsf{K}^{\star}, we must have that |𝖪𝖪|mk2k|\mathsf{\mathsf{K}^{\star}}\cap\mathsf{K}|\leq mk^{2}-k. It can be shown that |𝖪𝒦k,m,n𝖼𝗈𝗇:|𝖪𝖪|=|C(mk2)2k2nC(mk2)k\left|\mathsf{K}\in{\cal K}_{k,m,n}^{\mathsf{con}}:|\mathsf{K}^{\star}\cap\mathsf{K}|=\ell\right|\leq C\frac{(mk^{2}-\ell)^{2}}{k^{2}}n^{\frac{C^{\prime}(mk^{2}-\ell)}{k}}, for some C,C>0C,C^{\prime}>0, see, e.g., [23, Lemma 7]. Then,

(𝖪^𝖬𝖫(𝖷)𝖪)\displaystyle\mathbb{P}\left(\hat{\mathsf{K}}_{\mathsf{ML}}(\mathsf{X})\neq\mathsf{K}^{\star}\right) C=0mk2k(mk2)2k2nC(mk2)ke12λ2(mk2)\displaystyle\leq C\sum_{\ell=0}^{mk^{2}-k}\frac{(mk^{2}-\ell)^{2}}{k^{2}}n^{\frac{C^{\prime}(mk^{2}-\ell)}{k}}e^{-\frac{1}{2}\lambda^{2}(mk^{2}-\ell)} (113)
=C=kmk22k2nCke12λ2\displaystyle=C\sum_{\ell=k}^{mk^{2}}\frac{\ell^{2}}{k^{2}}n^{\frac{C^{\prime}\ell}{k}}e^{-\frac{1}{2}\lambda^{2}\ell} (114)
C=kmk2n4nCke12λ2\displaystyle\leq C\sum_{\ell=k}^{mk^{2}}n^{4}n^{\frac{C^{\prime}\ell}{k}}e^{-\frac{1}{2}\lambda^{2}\ell} (115)
=Cn4=kmk2eCklogn12λ2\displaystyle=Cn^{4}\sum_{\ell=k}^{mk^{2}}e^{\frac{C^{\prime}\ell}{k}\log n-\frac{1}{2}\lambda^{2}\ell} (116)
=Cn4=kmk2e(12λ2Cklogn)\displaystyle=Cn^{4}\sum_{\ell=k}^{mk^{2}}e^{-\ell\cdot\left(\frac{1}{2}\lambda^{2}-\frac{C^{\prime}}{k}\log n\right)} (117)
(a)Cn4=kmk2e8klogn\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}Cn^{4}\sum_{\ell=k}^{mk^{2}}e^{-\frac{8\ell}{k}\log n} (118)
Cn4mk2n8=Cmk2n4,\displaystyle\leq\frac{Cn^{4}mk^{2}}{n^{8}}=C\frac{mk^{2}}{n^{4}}, (119)

where in (a) we have used the fact that λ2>(2C+16)lognk\lambda^{2}>\frac{(2C^{\prime}+16)\log n}{k}. Thus, we get that (𝖪^𝖬𝖫(𝖷)𝖪)\mathbb{P}(\hat{\mathsf{K}}_{\mathsf{ML}}(\mathsf{X})\neq\mathsf{K}^{\star}) converges to zero, as nn\to\infty.
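As a rough numerical illustration of the scaling λ² ≳ log n / k established above (with parameters and constants chosen arbitrarily here, and restricted to a single planted submatrix, m = 1, so that the maximization can be carried out exactly over all consecutive windows), one may estimate the exact-recovery probability by a small Monte Carlo sketch:

import numpy as np

def best_window(X, k):
    # arg-max of the k x k consecutive-window sum (the m = 1 ML estimator)
    n = X.shape[0]
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = X.cumsum(axis=0).cumsum(axis=1)
    S = P[k:, k:] - P[:-k, k:] - P[k:, :-k] + P[:-k, :-k]
    return tuple(int(v) for v in np.unravel_index(S.argmax(), S.shape))

rng = np.random.default_rng(2)
n, k, trials = 100, 10, 200
for c in (0.5, 2.0, 8.0):                      # lambda^2 = c * log(n) / k
    lam = np.sqrt(c * np.log(n) / k)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, n))
        i0, j0 = (int(v) for v in rng.integers(0, n - k + 1, size=2))
        X[i0:i0 + k, j0:j0 + k] += lam         # plant a single k x k block
        hits += best_window(X, k) == (i0, j0)
    print(c, hits / trials)                    # the exact-recovery rate increases with c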

4.5.2 Exact recovery using the peeling estimator

We analyze the first step of the peeling algorithm (which boils down to the ML estimator for a single planted submatrix), and the strategy to bound each of the other sequential steps is exactly the same. Recall that,

𝖪^1(𝖷)=argmax𝖪𝒦k,1,n𝖼𝗈𝗇𝒮(𝖪),\displaystyle\hat{\mathsf{K}}_{1}(\mathsf{X})=\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,1,n}^{\mathsf{con}}}{\cal S}(\mathsf{K}), (120)

where 𝒮(𝖪)(i,j)𝖪𝖷ij{\cal S}(\mathsf{K})\triangleq\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}. We next prove the conditions under which 𝖪^1(𝖷)=𝖪\hat{\mathsf{K}}_{1}(\mathsf{X})=\mathsf{K}^{\star}_{\ell}, with high probability, for some [m]\ell\in[m], where 𝖪==1m𝖪\mathsf{K}^{\star}=\cup_{\ell=1}^{m}\mathsf{K}_{\ell}^{\star} are the mm planted submatrices. To prove the theorem, it suffices to show that the probability that 𝒮(𝖪)>max[m]𝒮(𝖪){\cal S}(\mathsf{K})>\max_{\ell\in[m]}{\cal S}(\mathsf{K}^{\star}_{\ell}) is asymptotically small, for all feasible 𝖪\mathsf{K} with 𝖪𝖪\mathsf{K}\neq\mathsf{K}_{\ell}^{\star}, for [m]\ell\in[m]. Let 𝖣(𝖪)𝒮(𝖪)𝒮(𝖪)\mathsf{D}_{\ell}(\mathsf{K})\triangleq{\cal S}(\mathsf{K}^{\star}_{\ell})-{\cal S}(\mathsf{K}). Note that

𝖣(𝖪)\displaystyle\mathsf{D}_{\ell}(\mathsf{K}) =(i,j)𝖪𝖷ij(i,j)𝖪𝖷ij\displaystyle=\sum_{(i,j)\in\mathsf{K}_{\ell}^{\star}}\mathsf{X}_{ij}-\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij} (121)
=(i,j)𝖪𝔼𝖷ij(i,j)𝖪𝔼𝖷ij+(i,j)𝖪[𝖷ij𝔼𝖷ij](i,j)𝖪[𝖷ij𝔼𝖷ij]\displaystyle=\sum_{(i,j)\in\mathsf{K}_{\ell}^{\star}}\mathbb{E}\mathsf{X}_{ij}-\sum_{(i,j)\in\mathsf{K}}\mathbb{E}\mathsf{X}_{ij}+\sum_{(i,j)\in\mathsf{K}_{\ell}^{\star}}[\mathsf{X}_{ij}-\mathbb{E}\mathsf{X}_{ij}]-\sum_{(i,j)\in\mathsf{K}}[\mathsf{X}_{ij}-\mathbb{E}\mathsf{X}_{ij}] (122)
=λ(k2|𝖪𝖪|)+(i,j)𝖪𝖪[𝖷ijλ](i,j)𝖪𝖪[𝖷ij𝔼𝖷ij]\displaystyle=\lambda\cdot(k^{2}-\left|\mathsf{K}^{\star}\cap\mathsf{K}\right|)+\sum_{(i,j)\in\mathsf{K}_{\ell}^{\star}\setminus\mathsf{K}}[\mathsf{X}_{ij}-\lambda]-\sum_{(i,j)\in\mathsf{K}\setminus\mathsf{K}_{\ell}^{\star}}[\mathsf{X}_{ij}-\mathbb{E}\mathsf{X}_{ij}] (123)
=λ(k2|𝖪𝖪|)+𝖶1(𝖪)+𝖶2(𝖪),\displaystyle=\lambda\cdot(k^{2}-\left|\mathsf{K}^{\star}\cap\mathsf{K}\right|)+\mathsf{W}_{1}(\mathsf{K})+\mathsf{W}_{2}(\mathsf{K}), (124)

where 𝖶1(𝖪)(i,j)𝖪𝖪[𝖷ijλ]\mathsf{W}_{1}(\mathsf{K})\triangleq\sum_{(i,j)\in\mathsf{K}_{\ell}^{\star}\setminus\mathsf{K}}[\mathsf{X}_{ij}-\lambda] and 𝖶2(𝖪)(i,j)𝖪𝖪[𝖷ij𝔼𝖷ij]\mathsf{W}_{2}(\mathsf{K})\triangleq-\sum_{(i,j)\in\mathsf{K}\setminus\mathsf{K}_{\ell}^{\star}}[\mathsf{X}_{ij}-\mathbb{E}\mathsf{X}_{ij}]. Because |𝖪|=|𝖪|=k2|\mathsf{K}|=|\mathsf{K}_{\ell}^{\star}|=k^{2}, we have |𝖪𝖪|=|𝖪𝖪|=k2|𝖪𝖪||\mathsf{K}_{\ell}^{\star}\setminus\mathsf{K}|=|\mathsf{K}\setminus\mathsf{K}_{\ell}^{\star}|=k^{2}-\left|\mathsf{K}_{\ell}^{\star}\cap\mathsf{K}\right|. Thus, both 𝖶1(𝖪)\mathsf{W}_{1}(\mathsf{K}) and 𝖶2(𝖪)\mathsf{W}_{2}(\mathsf{K}) are composed of sum of k2|𝖪𝖪|k^{2}-\left|\mathsf{K}_{\ell}^{\star}\cap\mathsf{K}\right| i.i.d. centered Gaussian random variables with unit variance. Accordingly, for i=1,2i=1,2, and each fixed 𝖪\mathsf{K},

(𝖶i(𝖪)λ2(k2|𝖪𝖪|))\displaystyle\mathbb{P}\left(\mathsf{W}_{i}(\mathsf{K})\leq-\frac{\lambda}{2}(k^{2}-|\mathsf{K}^{\star}\cap\mathsf{K}|)\right) 12exp[λ28(k2|𝖪𝖪|)2k2|𝖪𝖪|].\displaystyle\leq\frac{1}{2}\exp\left[-\frac{\lambda^{2}}{8}\frac{(k^{2}-|\mathsf{K}^{\star}\cap\mathsf{K}|)^{2}}{k^{2}-\left|\mathsf{K}_{\ell}^{\star}\cap\mathsf{K}\right|}\right]. (125)

Therefore, by the union bound and (124),

(𝖣(𝖪)0)exp[λ28(k2|𝖪𝖪|)2k2|𝖪𝖪|].\displaystyle\mathbb{P}\left(\mathsf{D}_{\ell}(\mathsf{K})\leq 0\right)\leq\exp\left[-\frac{\lambda^{2}}{8}\frac{(k^{2}-|\mathsf{K}^{\star}\cap\mathsf{K}|)^{2}}{k^{2}-\left|\mathsf{K}_{\ell}^{\star}\cap\mathsf{K}\right|}\right]. (126)

Note that due to the separation assumption, it must be the case that either |𝖪𝖪|=|𝖪j𝖪|0|\mathsf{K}^{\star}\cap\mathsf{K}|=|\mathsf{K}_{j}^{\star}\cap\mathsf{K}|\neq 0, for some j[m]j\in[m], or |𝖪𝖪|=0|\mathsf{K}^{\star}\cap\mathsf{K}|=0. In the latter case, we have

(𝖣(𝖪)0)exp[λ2k28],\displaystyle\mathbb{P}\left(\mathsf{D}_{\ell}(\mathsf{K})\leq 0\right)\leq\exp\left[-\frac{\lambda^{2}k^{2}}{8}\right], (127)

while in the former there exists a unique j[m]j\in[m] such that

min[m](𝖣(𝖪)0)\displaystyle\min_{\ell\in[m]}\mathbb{P}\left(\mathsf{D}_{\ell}(\mathsf{K})\leq 0\right) min[m]exp[λ28(k2|𝖪j𝖪|)2k2|𝖪𝖪|]\displaystyle\leq\min_{\ell\in[m]}\exp\left[-\frac{\lambda^{2}}{8}\frac{(k^{2}-|\mathsf{K}_{j}^{\star}\cap\mathsf{K}|)^{2}}{k^{2}-\left|\mathsf{K}_{\ell}^{\star}\cap\mathsf{K}\right|}\right] (128)
exp[λ28(k2|𝖪j𝖪|)]\displaystyle\leq\exp\left[-\frac{\lambda^{2}}{8}(k^{2}-|\mathsf{K}_{j}^{\star}\cap\mathsf{K}|)\right] (129)
exp[λ2k8],\displaystyle\leq\exp\left[-\frac{\lambda^{2}k}{8}\right], (130)

where the third inequality holds because 𝖪j,𝖪𝒦k,1,n𝖼𝗈𝗇\mathsf{K}_{j}^{\star},\mathsf{K}\in{\cal K}_{k,1,n}^{\mathsf{con}} and 𝖪𝖪j\mathsf{K}\neq\mathsf{K}_{j}^{\star}, so that |𝖪𝗃𝖪|k2k|\mathsf{\mathsf{K}_{j}^{\star}}\cap\mathsf{K}|\leq k^{2}-k. Accordingly, using (126) and the union bound once again, we get

(𝖪^1(𝖷)𝖪for all [m])\displaystyle\mathbb{P}\left(\hat{\mathsf{K}}_{1}(\mathsf{X})\neq\mathsf{K}_{\ell}^{\star}\;\text{for all }\ell\in[m]\right) =[𝖪(𝖪1,,𝖪m){𝒮(𝖪)>max[m]𝒮(𝖪)}]\displaystyle=\mathbb{P}\left[\bigcup_{\mathsf{K}\neq(\mathsf{K}_{1}^{\star},\ldots,\mathsf{K}_{m}^{\star})}\left\{{\cal S}(\mathsf{K})>\max_{\ell\in[m]}{\cal S}(\mathsf{K}^{\star}_{\ell})\right\}\right] (131)
=[𝖪(𝖪1,,𝖪m){𝖣1(𝖪)0,,𝖣m(𝖪)0}]\displaystyle=\mathbb{P}\left[\bigcup_{\mathsf{K}\neq(\mathsf{K}_{1}^{\star},\ldots,\mathsf{K}_{m}^{\star})}\left\{\mathsf{D}_{1}(\mathsf{K})\leq 0,\ldots,\mathsf{D}_{m}(\mathsf{K})\leq 0\right\}\right] (132)
𝖪(𝖪1,,𝖪m)min[m](𝖣(𝖪)0)\displaystyle\leq\sum_{\mathsf{K}\neq(\mathsf{K}_{1}^{\star},\ldots,\mathsf{K}_{m}^{\star})}\min_{\ell\in[m]}\mathbb{P}\left(\mathsf{D}_{\ell}(\mathsf{K})\leq 0\right) (133)
𝖪(𝖪1,,𝖪m)exp[λ2k8]\displaystyle\leq\sum_{\mathsf{K}\neq(\mathsf{K}_{1}^{\star},\ldots,\mathsf{K}_{m}^{\star})}\exp\left[-\frac{\lambda^{2}k}{8}\right] (134)
n2e18λ2k,\displaystyle\leq n^{2}e^{-\frac{1}{8}\lambda^{2}k}, (135)

where the last inequality is because |𝒦k,1,n𝖼𝗈𝗇|n2|{\cal K}_{k,1,n}^{\mathsf{con}}|\leq n^{2}. Thus, we see that if λ2>(24+ϵ)lognk\lambda^{2}>\frac{(24+\epsilon)\log n}{k}, then (𝖪^1(𝖷)𝖪for all [m])n(1+ϵ/8)\mathbb{P}\left(\hat{\mathsf{K}}_{1}(\mathsf{X})\neq\mathsf{K}_{\ell}^{\star}\;\text{for all }\ell\in[m]\right)\leq n^{-(1+\epsilon/8)}. Using the same steps above, it is clear that (𝖪^(𝖷)𝖪)n(1+ϵ/8)\mathbb{P}(\hat{\mathsf{K}}_{\ell}(\mathsf{X})\neq\mathsf{K}_{\ell}^{\star})\leq n^{-(1+\epsilon/8)}, for any 2m2\leq\ell\leq m, provided that λ2>(24+ϵ)lognk\lambda^{2}>\frac{(24+\epsilon)\log n}{k}. Thus,

[𝖪^𝗉𝖾𝖾𝗅𝖪]=[=1m𝖪^𝖪]mn(1+ϵ/8)=nϵ/8,\displaystyle\mathbb{P}\left[\hat{\mathsf{K}}_{\mathsf{peel}}\neq\mathsf{K}^{\star}\right]=\mathbb{P}\left[\bigcup_{\ell=1}^{m}\hat{\mathsf{K}}_{\ell}\neq\mathsf{K}_{\ell}^{\star}\right]\leq\frac{m}{n^{(1+\epsilon/8)}}=n^{-\epsilon/8}, (136)

which converges to zero as nn\to\infty.
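A minimal sketch of the peeling estimator analyzed above is given next: in each of the m rounds, the best k×k consecutive window of the current matrix is selected, and its entries are then masked so that no overlapping window can be chosen later (for numerical convenience, a large negative constant replaces the −∞ assignment used in the analysis). The exact definition of the estimator in (20) may differ in minor details; this is only an illustration, and all parameter values below are arbitrary.

import numpy as np

def window_sums(X, k):
    # sums of all k x k consecutive windows via 2-D prefix sums
    n = X.shape[0]
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = X.cumsum(axis=0).cumsum(axis=1)
    return P[k:, k:] - P[:-k, k:] - P[k:, :-k] + P[:-k, :-k]

def peeling_estimator(X, m, k, mask_value=-1e9):
    # greedily extract m disjoint k x k consecutive windows
    Xc = X.astype(float).copy()
    corners = []
    for _ in range(m):
        S = window_sums(Xc, k)
        i, j = np.unravel_index(S.argmax(), S.shape)
        corners.append((int(i), int(j)))
        Xc[i:i + k, j:j + k] = mask_value    # mask the recovered block
    return corners

# illustrative usage with well-separated planted blocks
rng = np.random.default_rng(3)
n, m, k, lam = 120, 3, 10, 2.0
X = rng.standard_normal((n, n))
true_corners = [(0, 0), (40, 40), (80, 80)]
for (i, j) in true_corners:
    X[i:i + k, j:j + k] += lam
print(sorted(peeling_estimator(X, m, k)), true_corners)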

4.5.3 Correlated recovery using the peeling estimator

Our analysis for correlated recovery relies on standard arguments as in [17, 51]. Recall the peeling estimator in (20). Denote the planted submatrices by 𝖪=i=1m𝖳i×𝖲i𝒦𝖼𝗈𝗇k,m,n\mathsf{K}^{\star}=\bigcup_{i=1}^{m}\mathsf{T}^{\star}_{i}\times\mathsf{S}_{i}^{\star}\in{\cal K}^{\mathsf{con}}_{k,m,n}. We let 𝖪i𝖳i×𝖲i\mathsf{K}_{i}^{\star}\triangleq\mathsf{T}^{\star}_{i}\times\mathsf{S}_{i}^{\star}, for i[m]i\in[m], and fix ϵ>0\epsilon>0. Let us analyze the first step of the algorithm, i.e., 𝖪^1(𝖷(1))=𝖪^1(𝖷)\hat{\mathsf{K}}_{1}(\mathsf{X}^{(1)})=\hat{\mathsf{K}}_{1}(\mathsf{X}). Recall that

𝖪^1(𝖷(1))=argmax𝖪𝒦k,1,n𝖼𝗈𝗇𝒮1(𝖪),\displaystyle\hat{\mathsf{K}}_{1}(\mathsf{X}^{(1)})=\operatorname*{arg\,max}_{\mathsf{K}\in{\cal K}_{k,1,n}^{\mathsf{con}}}{\cal S}_{1}(\mathsf{K}), (137)

where we define 𝒮(𝖪)(i,j)𝖪𝖷ij(){\cal S}_{\ell}(\mathsf{K})\triangleq\sum_{(i,j)\in\mathsf{K}}\mathsf{X}_{ij}^{(\ell)}, for 𝖪𝒦k,1,n𝖼𝗈𝗇\mathsf{K}\in{\cal K}_{k,1,n}^{\mathsf{con}} and [m]\ell\in[m]. Under the planted model, 𝒮1(𝖪)𝒩(λ𝖪,𝖪,k2){\cal S}_{1}(\mathsf{K})\sim{\cal N}(\lambda\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle,k^{2}). Hence, the distribution of 𝒮1(𝖪){\cal S}_{1}(\mathsf{K}) depends on the size of the overlap of 𝖪\mathsf{K} with 𝖪\mathsf{K}^{\star}. To prove that reconstruction is possible, we compute in the planted model the probability that 𝒮1(𝖪)>max[m]𝒮1(𝖪){\cal S}_{1}(\mathsf{K})>\max_{\ell\in[m]}{\cal S}_{1}(\mathsf{K}_{\ell}^{\star}), given that 𝖪\mathsf{K} has overlap 𝖪,𝖪=ω\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle=\omega with the planted partition, and argue that this probability tends to zero whenever the overlap is small enough. For each [m]\ell\in[m], note that 𝒮1(𝖪)𝒩(λk2,k2){\cal S}_{1}(\mathsf{K}_{\ell}^{\star})\sim{\cal N}(\lambda k^{2},k^{2}), and thus Hoeffding’s inequality implies that 𝒮1(𝖪)>λk22k2logn{\cal S}_{1}(\mathsf{K}_{\ell}^{\star})>\lambda k^{2}-\sqrt{2k^{2}\log n}, with probability at least 1O(n1)1-O(n^{-1}). Taking the union bound over every 𝖪\mathsf{K} with overlap at most ω\omega gives

(max𝖪,𝖪ω𝒮1(𝖪)>λk22k2logn)\displaystyle\mathbb{P}\left(\max_{\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle\leq\omega}{\cal S}_{1}(\mathsf{K})>\lambda k^{2}-\sqrt{2k^{2}\log n}\right)\leq (138)
n2max𝖪,𝖪ωexp([λ(k2𝖪,𝖪)2k2logn]22k2)\displaystyle\hskip 22.76228ptn^{2}\cdot\max_{\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle\leq\omega}\exp\left(-\frac{\left[\lambda(k^{2}-\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle)-\sqrt{2k^{2}\log n}\right]^{2}}{2k^{2}}\right) (139)
=n2max𝖪,𝖪ωexp([λ2k2(k2𝖪,𝖪)logn]2)\displaystyle\hskip 22.76228pt=n^{2}\cdot\max_{\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle\leq\omega}\exp\left(-\left[\frac{\lambda}{\sqrt{2k^{2}}}(k^{2}-\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle)-\sqrt{\log n}\right]^{2}\right) (140)
exp(2logn[λ2k2(k2ω)logn]2).\displaystyle\hskip 22.76228pt\leq\exp\left(2\log n-\left[\frac{\lambda}{\sqrt{2k^{2}}}(k^{2}-\omega)-\sqrt{\log n}\right]^{2}\right). (141)

By the assumption that λ>𝖢lognk\lambda>\frac{\mathsf{C}\sqrt{\log n}}{k}, with 𝖢>2+2\mathsf{C}>2+\sqrt{2}, it follows that there exists a fixed constant ϵ>0\epsilon>0 such that (1ϵ)λ>𝖢lognk(1-\epsilon)\lambda>\frac{\mathsf{C}\sqrt{\log n}}{k}. Hence, setting ω=k2ϵ\omega=k^{2}\epsilon in the last displayed equation, we get

(max𝖪,𝖪ω𝒮1(𝖪)>λk22k2logn)\displaystyle\mathbb{P}\left(\max_{\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle\leq\omega}{\cal S}_{1}(\mathsf{K})>\lambda k^{2}-\sqrt{2k^{2}\log n}\right)
exp(2logn[λk22(1ϵ)logn]2)=eΩ(n),\displaystyle\hskip 22.76228pt\leq\exp\left(2\log n-\left[\frac{\lambda\sqrt{k^{2}}}{\sqrt{2}}(1-\epsilon)-\sqrt{\log n}\right]^{2}\right)=e^{-\Omega(n)}, (142)

and thus, with probability at least 1eΩ(n)1-e^{-\Omega(n)},

max𝖪,𝖪k2ϵ𝒮1(𝖪)<λk22k2logn.\displaystyle\max_{\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle\leq k^{2}\epsilon}{\cal S}_{1}(\mathsf{K})<\lambda k^{2}-\sqrt{2k^{2}\log n}. (143)

Consequently, we get that the maximum likelihood estimator 𝖪^1(𝖷(1))\hat{\mathsf{K}}_{1}(\mathsf{X}^{(1)}) in (20) satisfies 𝖪^1,𝖪k2ϵ\langle{\hat{\mathsf{K}}_{1},\mathsf{K}^{\star}}\rangle\geq k^{2}\epsilon with high probability. Finally, the separation assumption implies that there exists a unique j[m]j\in[m] such that 𝖪^1,𝖪=𝖪^1,𝖪jk2ϵ\langle{\hat{\mathsf{K}}_{1},\mathsf{K}^{\star}}\rangle=\langle{\hat{\mathsf{K}}_{1},\mathsf{K}_{j}^{\star}}\rangle\geq k^{2}\epsilon, and 𝖪^1,𝖪=0\langle{\hat{\mathsf{K}}_{1},\mathsf{K}_{\ell}^{\star}}\rangle=0, for j\ell\neq j. Then, in the second step of the peeling algorithm, we first compute 𝖷(2)\mathsf{X}^{(2)}, by setting [𝖷(2)]ij=[\mathsf{X}^{(2)}]_{ij}=-\infty, for any (i,j)𝖪^1(i,j)\in\hat{\mathsf{K}}_{1}, and [𝖷(2)]ij=𝖷ij(1)[\mathsf{X}^{(2)}]_{ij}=\mathsf{X}^{(1)}_{ij}, otherwise. Thus, it is clear that in the second step, 𝖪^2\hat{\mathsf{K}}_{2} cannot be attained by any set that intersects 𝖪^1\hat{\mathsf{K}}_{1}; indeed, 𝒮2(𝖪)={\cal S}_{2}(\mathsf{K})=-\infty, for any set 𝖪\mathsf{K} that intersects 𝖪^1\hat{\mathsf{K}}_{1}. Therefore, for the relevant sets in the maximization in 𝖪^2\hat{\mathsf{K}}_{2}, we again have 𝒮2(𝖪)𝒩(λ𝖪,𝖪,k2){\cal S}_{2}(\mathsf{K})\sim{\cal N}(\lambda\langle{\mathsf{K},\mathsf{K}^{\star}}\rangle,k^{2}). Accordingly, repeating the exact same arguments above, we obtain that 𝖪^2,𝖪jk2ϵ\langle{\hat{\mathsf{K}}_{2},\mathsf{K}_{j}^{\star}}\rangle\geq k^{2}\epsilon with high probability, for some j[m]j\in[m]. In the same way, we get that 𝖪^,𝖪k2ϵ\langle{\hat{\mathsf{K}}_{\ell},\mathsf{K}_{\ell}^{\star}}\rangle\geq k^{2}\epsilon with high probability, for any [m]\ell\in[m]. The union bound then implies that

[=1m{𝖪^,𝖪<k2ϵ}]m[𝖪^,𝖪<k2ϵ]meΩ(n)=o(1).\displaystyle\mathbb{P}\left[\bigcup_{\ell=1}^{m}\left\{\langle{\hat{\mathsf{K}}_{\ell},\mathsf{K}_{\ell}^{\star}}\rangle<k^{2}\epsilon\right\}\right]\leq m\cdot\mathbb{P}\left[\langle{\hat{\mathsf{K}}_{\ell},\mathsf{K}_{\ell}^{\star}}\rangle<k^{2}\epsilon\right]\leq me^{-\Omega(n)}=o(1). (144)

Thus, 𝖪^𝗉𝖾𝖾𝗅,𝖪mk2ϵ\langle{\hat{\mathsf{K}}_{\mathsf{peel}},\mathsf{K}^{\star}}\rangle\geq mk^{2}\epsilon with high probability, namely, 𝖪^𝗉𝖾𝖾𝗅\hat{\mathsf{K}}_{\mathsf{peel}} achieves correlated recovery.

4.6 Proof of Theorem 6

4.6.1 Exact recovery

We use an information theoretical argument via Fano’s inequality. Recall that 𝒦k,m,n𝖼𝗈𝗇{\cal K}_{k,m,n}^{\mathsf{con}} is the set of possible planted submatrices. Let 𝒦¯k,m,n\bar{{\cal K}}_{k,m,n} be a subset of 𝒦k,m,n𝖼𝗈𝗇{\cal K}_{k,m,n}^{\mathsf{con}}, which will be specified later on. Let ¯𝖷,𝖪\bar{\mathbb{P}}_{\mathsf{X},\mathsf{K}^{\star}} denote the joint distribution of the underlying location of the planted submatrices 𝖪\mathsf{K}^{\star} and 𝖷\mathsf{X}, when 𝖪\mathsf{K}^{\star} is drawn uniformly at random from 𝒦¯k,m,n\bar{{\cal K}}_{k,m,n}, and 𝖷\mathsf{X} is generated according to Definition 3. Let I(𝖷;𝖪)I(\mathsf{X};\mathsf{K}^{\star}) denote the mutual information between 𝖷\mathsf{X} and 𝖪\mathsf{K}^{\star}. Then, Fano’s inequality implies that,

inf𝖪^sup𝖪𝒦¯k,m,n[𝖪^𝖪]inf𝖪^¯[𝖪^𝖪]1I(𝖷;𝖪)+1log|𝒦¯k,m,n|.\displaystyle\inf_{\hat{\mathsf{K}}}\sup_{\mathsf{K}^{\star}\in\bar{{\cal K}}_{k,m,n}}\mathbb{P}\left[\hat{\mathsf{K}}\neq\mathsf{K}^{\star}\right]\geq\inf_{\hat{\mathsf{K}}}\bar{\mathbb{P}}\left[\hat{\mathsf{K}}\neq\mathsf{K}^{\star}\right]\geq 1-\frac{I(\mathsf{X};\mathsf{K}^{\star})+1}{\log|\bar{{\cal K}}_{k,m,n}|}. (145)

We construct 𝒦¯k,m,n\bar{{\cal K}}_{k,m,n} as follows. Let 𝖬αm\mathsf{M}\triangleq\alpha\cdot m, where α\alpha\in\mathbb{N} will be specified later on, and 𝒦¯k,m,n={𝖪}=0𝖬\bar{{\cal K}}_{k,m,n}=\left\{\mathsf{K}_{\ell}\right\}_{\ell=0}^{\mathsf{M}}, where:

  1. The base submatrix 𝖪0\mathsf{K}_{0} is defined as 𝖪0==1m𝖲0×𝖳0\mathsf{K}_{0}=\bigcup_{\ell=1}^{m}\mathsf{S}_{\ell}^{0}\times\mathsf{T}_{\ell}^{0}, with 𝖲0={(1)(k+α)+1,,(1)(k+α)+k}\mathsf{S}_{\ell}^{0}=\left\{(\ell-1)\cdot(k+\alpha)+1,\ldots,(\ell-1)\cdot(k+\alpha)+k\right\} and 𝖳0=[k]\mathsf{T}_{\ell}^{0}=[k], for [m]\ell\in[m]. Namely, every pair of consecutive submatrices among the mm submatrices in 𝖪0\mathsf{K}_{0} is α\alpha columns apart.

  2. For j=1,2,,mj=1,2,\ldots,m and i=1,2,,αi=1,2,\ldots,\alpha, we let 𝖪(j1)α+i\mathsf{K}_{(j-1)\alpha+i} be defined as 𝖪0\mathsf{K}_{0}, but with the column set 𝖲j0\mathsf{S}_{j}^{0} of the jjth submatrix shifted ii columns to the right.
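For concreteness, the packing {𝖪_0,…,𝖪_𝖬} above can also be generated programmatically. The short sketch below is only an illustration (it assumes 0-based indexing and the convention that 𝖲 indexes columns and 𝖳 indexes rows); it builds the 1+αm index sets and makes explicit that each 𝖪_ℓ differs from the base 𝖪_0 in at most 2kα entries, which is exactly what drives the KL computation that follows.

def fano_packing(n, m, k, alpha):
    # build K_0, ..., K_M (M = alpha * m) as lists of (rows, cols) interval pairs
    assert m * (k + alpha) <= n, "the shifted blocks must fit inside [n]"
    def block(col_start):
        return (range(0, k), range(col_start, col_start + k))
    K0 = [block(ell * (k + alpha)) for ell in range(m)]   # blocks alpha columns apart
    packing = [K0]
    for j in range(m):                                    # which block gets shifted
        for i in range(1, alpha + 1):                     # shift amount in columns
            K = list(K0)
            rows, cols = K[j]
            K[j] = (rows, range(cols.start + i, cols.stop + i))
            packing.append(K)
    return packing

packing = fano_packing(n=100, m=3, k=5, alpha=4)
print(len(packing))   # 1 + alpha * m = 13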

Let ¯i\bar{\mathbb{P}}_{i} denote the conditional distribution of 𝖷\mathsf{X} given 𝖪=𝖪i\mathsf{K}^{\star}=\mathsf{K}_{i}. Note that,

I(𝖷;𝖪)\displaystyle I(\mathsf{X};\mathsf{K}^{\star}) =d𝖪𝖫(¯𝖷,𝖪||¯𝖷¯𝖪)\displaystyle=d_{\mathsf{KL}}(\bar{\mathbb{P}}_{\mathsf{X},\mathsf{K}^{\star}}||\bar{\mathbb{P}}_{\mathsf{X}}\bar{\mathbb{P}}_{\mathsf{K}^{\star}}) (146)
=𝔼𝖪[d𝖪𝖫(¯𝖷|𝖪||¯𝖷)]\displaystyle=\mathbb{E}_{\mathsf{K}^{\star}}\left[d_{\mathsf{KL}}(\bar{\mathbb{P}}_{\mathsf{X}|\mathsf{K}^{\star}}||\bar{\mathbb{P}}_{\mathsf{X}})\right] (147)
=1𝖬+𝟣i=0𝖬d𝖪𝖫(¯i||¯𝖷)\displaystyle=\frac{1}{\mathsf{M+1}}\sum_{i=0}^{\mathsf{M}}d_{\mathsf{KL}}(\bar{\mathbb{P}}_{i}||\bar{\mathbb{P}}_{\mathsf{X}}) (148)
1(𝖬+𝟣)2i,j=0𝖬d𝖪𝖫(¯i||¯j),\displaystyle\leq\frac{1}{(\mathsf{M+1})^{2}}\sum_{i,j=0}^{\mathsf{M}}d_{\mathsf{KL}}(\bar{\mathbb{P}}_{i}||\bar{\mathbb{P}}_{j}), (149)

where the inequality follows from the fact that ¯𝖷=1𝖬+𝟣j=0𝖬¯j\bar{\mathbb{P}}_{\mathsf{X}}=\frac{1}{\mathsf{M+1}}\sum_{j=0}^{\mathsf{M}}\bar{\mathbb{P}}_{j}, and the convexity of KL divergence. Now, since each ¯i\bar{\mathbb{P}}_{i} is a product of n2n^{2} Gaussian distributions, we get

I(𝖷;𝖪)\displaystyle I(\mathsf{X};\mathsf{K}^{\star}) 1(𝖬+1)2i,j=0𝖬d𝖪𝖫(¯i||¯j)\displaystyle\leq\frac{1}{(\mathsf{M}+1)^{2}}\sum_{i,j=0}^{\mathsf{M}}d_{\mathsf{KL}}(\bar{\mathbb{P}}_{i}||\bar{\mathbb{P}}_{j}) (150)
=1(𝖬+1)j=0𝖬d𝖪𝖫(¯0||¯j)\displaystyle=\frac{1}{(\mathsf{M}+1)}\sum_{j=0}^{\mathsf{M}}d_{\mathsf{KL}}(\bar{\mathbb{P}}_{0}||\bar{\mathbb{P}}_{j}) (151)
=2km(𝖬+1)[d𝖪𝖫(𝒩(λ,1)||𝒩(0,1))+d𝖪𝖫(𝒩(0,1)||𝒩(λ,1))]j=0αj\displaystyle=\frac{2km}{(\mathsf{M}+1)}\left[d_{\mathsf{KL}}\left(\mathcal{N}(\lambda,1)||\mathcal{N}(0,1)\right)+d_{\mathsf{KL}}\left(\mathcal{N}(0,1)||\mathcal{N}(\lambda,1)\right)\right]\sum_{j=0}^{\alpha}j (152)
=2km(𝖬+1)α(1+α)2[d𝖪𝖫(𝒩(λ,1)||𝒩(0,1))+d𝖪𝖫(𝒩(0,1)||𝒩(λ,1))]\displaystyle=\frac{2km}{(\mathsf{M}+1)}\frac{\alpha(1+\alpha)}{2}\left[d_{\mathsf{KL}}\left(\mathcal{N}(\lambda,1)||\mathcal{N}(0,1)\right)+d_{\mathsf{KL}}\left(\mathcal{N}(0,1)||\mathcal{N}(\lambda,1)\right)\right] (153)
2k(1+α)2[d𝖪𝖫(𝒩(λ,1)||𝒩(0,1))+d𝖪𝖫(𝒩(0,1)||𝒩(λ,1))]\displaystyle\leq 2k\frac{(1+\alpha)}{2}\left[d_{\mathsf{KL}}\left(\mathcal{N}(\lambda,1)||\mathcal{N}(0,1)\right)+d_{\mathsf{KL}}\left(\mathcal{N}(0,1)||\mathcal{N}(\lambda,1)\right)\right] (154)
=(1+α)kλ2.\displaystyle=(1+\alpha)k\lambda^{2}. (155)

Thus, substituting the last inequality in (145), and using the fact that |𝒦¯k,m,n|=1+𝖬|\bar{{\cal K}}_{k,m,n}|=1+\mathsf{M}, we get that inf𝖪^sup𝖪𝒦¯k,m,n[𝖪^𝖪]>1/2\inf_{\hat{\mathsf{K}}}\sup_{\mathsf{K}^{\star}\in\bar{{\cal K}}_{k,m,n}}\mathbb{P}\left[\hat{\mathsf{K}}\neq\mathsf{K}^{\star}\right]>1/2, if

λ2<12log(1+𝖬)1(1+α)k=12log(1+αm)1(1+α)k.\displaystyle\lambda^{2}<\frac{\frac{1}{2}\log(1+\mathsf{M})-1}{(1+\alpha)k}=\frac{\frac{1}{2}\log{(1+\alpha m)}-1}{(1+\alpha)k}. (156)

Finally, it is clear that there exists α0\alpha_{0}\in\mathbb{N}, such that for any α>α0\alpha>\alpha_{0}, the minimax error probability is at least half, if λ2<C/k\lambda^{2}<C/k, for some constant C>0C>0, which concludes the proof.

4.6.2 Correlated recovery

The correlated recovery lower bound follows almost directly from the same arguments as in, e.g., [51, Subsection 3.1.3]. For completeness, we present here the main ideas in the proof. Note that the observations can be written as 𝖷=λ𝖬+𝖶\mathsf{X}=\lambda\mathsf{M}+\mathsf{W}, where 𝖶\mathsf{W} is an n×nn\times n i.i.d. matrix with zero mean and unit variance Gaussian entries, 𝖬\mathsf{M} is an n×nn\times n binary matrix such that 𝖬ij=1\mathsf{M}_{ij}=1 if (i,j)𝖪(i,j)\in\mathsf{K}, and 𝖬ij=0\mathsf{M}_{ij}=0, otherwise, and 𝖪\mathsf{K} is the planted set. Define 𝖠=βλ𝖬+𝖶\mathsf{A}=\beta\lambda\mathsf{M}+\mathsf{W}, where β[0,1]\beta\in[0,1]. The minimum mean-squared error estimator (MMSE) of 𝖬\mathsf{M} given 𝖠\mathsf{A} is 𝖬^𝖬𝖬𝖲𝖤=𝔼[𝖬|𝖠]\hat{\mathsf{M}}_{\mathsf{MMSE}}=\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right], and the rescaled minimum mean-squared error is 𝖬𝖬𝖲𝖤(β)=1mk2𝔼𝖬𝔼[𝖬|𝖠]2𝖥\mathsf{MMSE}(\beta)=\frac{1}{mk^{2}}\mathbb{E}\left\|\mathsf{M}-\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right]\right\|^{2}_{\mathsf{F}}. Note that under the conditions of Theorem 6, the proof of Theorem 2 shows that χ2(𝒫||𝒬)<C\chi^{2}({\cal P}||{\cal Q})<C, for some constant C>0C>0. Jensen’s inequality implies that the KL divergence between 𝒫{\cal P} and 𝒬{\cal Q} is also bounded; indeed,

d𝖪𝖫(𝒫||𝒬)log𝔼𝒫𝒫𝒬logC.\displaystyle d_{\mathsf{KL}}({\cal P}||{\cal Q})\leq\log\mathbb{E}_{{\cal P}}\frac{{\cal P}}{{\cal Q}}\leq\log C. (157)

The main idea in the proof is to show that bounded KL divergence implies that for all β[0,1]\beta\in[0,1], the MMSE tends to that of the trivial estimator 𝖬^=0\hat{\mathsf{M}}=0, i.e.,

limn𝖬𝖬𝖲𝖤(β)=limn1mk2𝔼𝖬2𝖥=λ2.\displaystyle\lim_{n\to\infty}\mathsf{MMSE}(\beta)=\lim_{n\to\infty}\frac{1}{mk^{2}}\mathbb{E}\left\|\mathsf{M}\right\|^{2}_{\mathsf{F}}=\lambda^{2}. (158)

Expanding the MMSE in the left-hand-side of (158), we get

limn1mk2𝔼[2𝖬,𝔼[𝖬|𝖠]+𝔼[𝖬|𝖠]𝖥2]=0,\displaystyle\lim_{n\to\infty}\frac{1}{mk^{2}}\mathbb{E}\left[-2\left\langle\mathsf{M},\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right]\right\rangle+\left\|\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right]\right\|_{\mathsf{F}}^{2}\right]=0, (159)

which by the tower property of conditional expectation implies that

limn1mk2𝔼𝔼[𝖬|𝖠]𝖥2=0.\displaystyle\lim_{n\to\infty}\frac{1}{mk^{2}}\mathbb{E}\left\|\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right]\right\|_{\mathsf{F}}^{2}=0. (160)

Thus, the optimal estimator converges to the trivial one. To prove (158), a straightforward calculation shows that the mutual information between 𝖠\mathsf{A} and 𝖬\mathsf{M} is given by I(β)=I(𝖬;𝖠)=d𝖪𝖫(𝒫||𝒬)+β4𝔼𝖬2𝖥I(\beta)=I(\mathsf{M};\mathsf{A})=-d_{\mathsf{KL}}({\cal P}||{\cal Q})+\frac{\beta}{4}\mathbb{E}\left\|\mathsf{M}\right\|^{2}_{\mathsf{F}}. Thus, under the conditions of Theorem 6,

limn1mk2I(β)=βλ24.\displaystyle\lim_{n\to\infty}\frac{1}{mk^{2}}I(\beta)=\frac{\beta\lambda^{2}}{4}. (161)

Then, using the above and the I-MMSE formula [26] it can be shown that (158) holds true (see, [51, eqns. (13)–(15)]). Next, for any estimator 𝖪^\hat{\mathsf{K}} of the planted set, we can define an estimator for 𝖬\mathsf{M} by 𝖬^ij=1\hat{\mathsf{M}}_{ij}=1 if (i,j)𝖪^(i,j)\in\hat{\mathsf{K}}, and 𝖬^ij=0\hat{\mathsf{M}}_{ij}=0, otherwise. Then, using the Cauchy-Schwarz inequality, we have

𝔼𝖬,𝖬^\displaystyle\mathbb{E}\langle{\mathsf{M},\hat{\mathsf{M}}}\rangle =𝔼𝔼[𝖬|𝖠],𝖬^\displaystyle=\mathbb{E}\langle{\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right],\hat{\mathsf{M}}}\rangle (162)
𝔼[𝔼[𝖬|𝖠]𝖥||𝖬^||𝖥]\displaystyle\leq\mathbb{E}\left[\left\|\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right]\right\|_{\mathsf{F}}||\hat{\mathsf{M}}||_{\mathsf{F}}\right] (163)
𝔼𝔼[𝖬|𝖠]𝖥2λmk2=o(mk2),\displaystyle\leq\sqrt{\mathbb{E}\left\|\mathbb{E}\left[\mathsf{M}|\mathsf{A}\right]\right\|_{\mathsf{F}}^{2}}\lambda\sqrt{mk^{2}}=o(mk^{2}), (164)

where the last transition follows from (160). Thus, (164) implies that for any estimator 𝖪^\hat{\mathsf{K}}, we have 𝔼𝖪,𝖪^=o(mk2)\mathbb{E}\langle{\mathsf{K},\hat{\mathsf{K}}}\rangle=o(mk^{2}), and thus correlated recovery of 𝖪\mathsf{K} is impossible.

5 Conclusions and future work

In this paper, we study the computational and statistical boundaries of the submatrix and consecutive submatrix detection and recovery problems. For both models, we derive asymptotically tight lower and upper bounds on the thresholds for detection and recovery. To that end, for each problem, we propose statistically optimal and efficient algorithms for detection and recovery and analyze their performance. Our statistical lower bounds are based on classical techniques from information theory. Finally, we use the framework of low-degree polynomials to provide evidence of the statistical-computational gap we observed in the submatrix detection problem.

There are several exciting directions for future work. First, it would be interesting to generalize our results to any pair of distributions ${\cal P}$ and ${\cal Q}$. While our information-theoretic lower bounds hold for general distributions, it remains to construct and analyze algorithms for this case, as well as to derive computational lower bounds. In this paper, we assume that the elements inside the planted submatrices are i.i.d.; however, it is of practical interest to relax this assumption and consider the case of dependent entries, e.g., Gaussians with a general covariance matrix. For example, this is the typical statistical model of cryo-EM data [8]. Finally, it would be interesting to prove a computational lower bound for the submatrix recovery problem using the recent framework of low-degree polynomials for recovery [45], as well as to provide other forms of evidence for the statistical-computational gaps in the submatrix detection problem with a growing number of planted submatrices, e.g., using average-case reductions (see, for example, [4]).

References

  • ACCD [10] Ery Arias-Castro, Emmanuel Candès, and Arnaud Durand. Detection of an anomalous cluster in a network. Annals of Statistics, 39, Jan. 2010.
  • ACV [14] Ery Arias-Castro and Nicolas Verzelen. Community detection in dense random networks. The Annals of Statistics, 42(3):940–969, 2014.
  • Ahl [22] Thomas D. Ahle. Sharp and simple bounds for the raw moments of the binomial and Poisson distributions. Statistics & Probability Letters, 182:109306, 2022.
  • BBH [18] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 48–166, 06–09 Jul 2018.
  • BBH [19] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Universality of computational lower bounds for submatrix detection. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pages 417–468, 25–28 Jun 2019.
  • BBL+ [18] Tamir Bendory, Nicolas Boumal, William Leeb, Eitan Levin, and Amit Singer. Toward single particle reconstruction without particle picking: Breaking the detection limit. arXiv preprint arXiv:1810.00226, 2018.
  • BBLS [18] Nicolas Boumal, Tamir Bendory, Roy R Lederman, and Amit Singer. Heterogeneous multireference alignment: A single pass approach. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2018.
  • BBS [20] Tamir Bendory, Alberto Bartesaghi, and Amit Singer. Single-particle cryo-electron microscopy: Mathematical theory, computational challenges, and opportunities. IEEE Signal Processing Magazine, 37(2):58–76, 2020.
  • BDN [17] Shankar Bhamidi, Partha Dey, and Andrew Nobel. Energy landscape for large average submatrix detection problems in Gaussian random matrices. Probability Theory and Related Fields, 168, 08 2017.
  • BHK+ [16] B. Barak, S. B. Hopkins, J. Kelner, P. Kothari, A. Moitra, and A. Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 428–437, 2016.
  • BI [13] Cristina Butucea and Yuri I Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix. Bernoulli, 19(5B):2652–2688, 2013.
  • BKR+ [11] Sivaraman Balakrishnan, Mladen Kolar, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Statistical and computational tradeoffs in biclustering. In NIPS 2011 workshop on computational trade-offs in statistical learning, volume 4, 2011.
  • BKW [20] Afonso S. Bandeira, Dmitriy Kunisky, and Alexander S. Wein. Computational Hardness of Certifying Bounds on Constrained PCA Problems. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020), volume 151, pages 78:1–78:29, 2020.
  • BMR+ [19] Tristan Bepler, Andrew Morin, Micah Rapp, Julia Brasch, Lawrence Shapiro, Alex J Noble, and Bonnie Berger. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nature Methods, 16(11):1153–1160, 2019.
  • BMS [15] Xiao-Chen Bai, Greg McMullan, and Sjors HW Scheres. How cryo-EM is revolutionizing structural biology. Trends in Biochemical Sciences, 40(1):49–57, 2015.
  • BMS [22] Tamir Bendory, Oscar Mickelin, and Amit Singer. Sparse multi-reference alignment: Sample complexity and computational hardness. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8977–8981. IEEE, 2022.
  • BMV+ [18] Jess Banks, Cristopher Moore, Roman Vershynin, Nicolas Verzelen, and Jiaming Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. IEEE Transactions on Information Theory, 2018.
  • BPW [18] Afonso S Bandeira, Amelia Perry, and Alexander S Wein. Notes on computational-to-statistical gaps: predictions using statistical physics. Portugaliae Mathematica, 75(2):159–186, 2018.
  • BT [10] Daniel Berend and Tamir Tassa. Efficient bounds on Bell numbers and on moments of sums of random variables. Probability and Mathematical Statistics, 30, 01 2010.
  • CC [18] Utkan Onur Candogan and Venkat Chandrasekaran. Finding planted subgraphs with few eigenvalues using the Schur–Horn relaxation. SIAM Journal on Optimization, 28(1):735–759, 2018.
  • CHK+ [20] Yeshwanth Cherapanamjeri, Samuel B. Hopkins, Tarun Kathuria, Prasad Raghavendra, and Nilesh Tripuraneni. Algorithms for heavy-tailed statistics: Regression, covariance estimation, and beyond. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, page 601–609, 2020.
  • CLR [17] Tony Cai, Tengyuan Liang, and Alexander Rakhlin. Computational and statistical boundaries for submatrix localization in a large noisy matrix. Annals of Statistics, 45(4):1403–1430, 08 2017.
  • CX [16] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. Journal of Machine Learning Research, 17(27):1–57, 2016.
  • ELS [20] Amitay Eldar, Boris Landa, and Yoel Shkolnisky. KLT picker: Particle picking using data-driven optimal templates. Journal of Structural Biology, 210(2):107473, 2020.
  • GJW [20] David Gamarnik, Aukosh Jagannath, and Alexander S. Wein. Low-degree hardness of random optimization problems. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 324–356, 2020.
  • GSV [05] Dongning Guo, S. Shamai, and S. Verdu. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, 2005.
  • HAS [18] Ayelet Heimowitz, Joakim Andén, and Amit Singer. APPLE picker: Automatic particle picking, a low-effort cryo-EM framework. Journal of Structural Biology, 204(2):215–227, 2018.
  • HB [18] Samuel B. Hopkins. Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, 2018.
  • Hen [95] Richard Henderson. The potential and limitations of neutrons, electrons and X-rays for atomic resolution microscopy of unstained biological molecules. Quarterly Reviews of Biophysics, 28(2):171–193, 1995.
  • HKP+ [17] Samuel B. Hopkins, Pravesh K. Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 720–731, 2017.
  • HS [17] S. B. Hopkins and D. Steurer. Efficient Bayesian estimation from few samples: Community detection and related problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 379–390, 2017.
  • Hul [22] Wasim Huleihel. Inferring hidden structures in random graphs. IEEE Transactions on Signal and Information Processing over Networks, 8:855–867, 2022.
  • HWX [15] Bruce Hajek, Yihong Wu, and Jiaming Xu. Computational lower bounds for community detection on random graphs. In Proceedings of The 28th Conference on Learning Theory, volume 40, pages 899–928, 03–06 Jul 2015.
  • HWX [16] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788–2797, 2016.
  • HWX [17] Bruce Hajek, Yihong Wu, and Jiaming Xu. Information limits for recovering a hidden community. IEEE Transactions on Information Theory, 63(8):4729–4745, 2017.
  • KB [22] Shay Kreymer and Tamir Bendory. Two-dimensional multi-target detection: An autocorrelation analysis approach. IEEE Transactions on Signal Processing, 70:835–849, 2022.
  • KBRS [11] Mladen Kolar, Sivaraman Balakrishnan, Alessandro Rinaldo, and Aarti Singh. Minimax localization of structural information in large noisy matrices. In Advances in Neural Information Processing Systems, pages 909–917, 2011.
  • KSB [23] Shay Kreymer, Amit Singer, and Tamir Bendory. A stochastic approximate expectation-maximization for structure determination directly from cryo-EM micrographs. arXiv preprint arXiv:2303.02157, 2023.
  • Lyu [19] Dmitry Lyumkis. Challenges and opportunities in cryo-EM single-particle analysis. Journal of Biological Chemistry, 294(13):5181–5197, 2019.
  • Mon [15] Andrea Montanari. Finding one community in a sparse graph. Journal of Statistical Physics, 161(2):273–299, 2015.
  • MRZ [15] Andrea Montanari, Daniel Reichman, and Ofer Zeitouni. On the limitation of spectral methods: From the Gaussian hidden clique problem to rank-one perturbations of Gaussian tensors. In Advances in Neural Information Processing Systems, pages 217–225, 2015.
  • MW [15] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. Annals of Statistics, 43(3):1089–1116, 2015.
  • Sin [18] Amit Singer. Mathematics for cryo-electron microscopy. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3995–4014. World Scientific, 2018.
  • SN [13] Xing Sun and Andrew Nobel. On the maximal size of large-average and ANOVA-fit submatrices in a Gaussian random matrix. Bernoulli, 19:275–294, 02 2013.
  • SW [22] Tselil Schramm and Alexander S. Wein. Computational barriers to estimation from low-degree polynomials. Annals of Statistics, 50, Sept. 2022.
  • SWP+ [09] Andrey A Shabalin, Victor J Weigman, Charles M Perou, Andrew B Nobel, et al. Finding large average submatrices in high dimensional data. The Annals of Applied Statistics, 3(3):985–1012, 2009.
  • Tsy [08] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
  • VAC [15] Nicolas Verzelen and Ery Arias-Castro. Community detection in sparse random networks. The Annals of Applied Probability, 25(6):3465–3510, 2015.
  • Wei [18] Alexander Spence Wein. Statistical estimation in the presence of group actions. PhD thesis, Massachusetts Institute of Technology, 2018.
  • WGL+ [16] Feng Wang, Huichao Gong, Gaochao Liu, Meijing Li, Chuangye Yan, Tian Xia, Xueming Li, and Jianyang Zeng. DeepPicker: A deep learning approach for fully automated particle picking in cryo-EM. Journal of Structural Biology, 195(3):325–336, 2016.
  • WX [20] Yihong Wu and Jiaming Xu. Statistical problems with planted structures: Information-theoretical and computational limits. In Miguel R. D. Rodrigues and Yonina C. Eldar, editors, Information-Theoretic Methods in Data Science. Cambridge University Press, Cambridge, 2020.