
Concentration and regularization of random graphs

Can M. Le, Elizaveta Levina and Roman Vershynin Department of Statistics, University of California, Davis, One Shields Ave, Davis, CA 95616, U.S.A. canle@ucdavis.edu Department of Statistics, University of Michigan, 1085 S. University Ave, Ann Arbor, MI 48109, U.S.A. elevina@umich.edu Department of Mathematics, University of Michigan, 530 Church St, Ann Arbor, MI 48109, U.S.A. romanv@umich.edu
(Date: September 18, 2025)
Abstract.

This paper studies how close random graphs are typically to their expectations. We interpret this question through the concentration of the adjacency and Laplacian matrices in the spectral norm. We study inhomogeneous Erdös-Rényi random graphs on $n$ vertices, where edges form independently and possibly with different probabilities $p_{ij}$. Sparse random graphs whose expected degrees are $o(\log n)$ fail to concentrate; the obstruction is caused by vertices with abnormally high and low degrees. We show that concentration can be restored if we regularize the degrees of such vertices, and one can do this in various ways. As an example, let us reweight or remove enough edges to make all degrees bounded above by $O(d)$ where $d=\max np_{ij}$. Then we show that the resulting adjacency matrix $A^{\prime}$ concentrates with the optimal rate: $\|A^{\prime}-\operatorname{\mathbb{E}}A\|=O(\sqrt{d})$. Similarly, if we make all degrees bounded below by $d$ by adding weight $d/n$ to all edges, then the resulting Laplacian concentrates with the optimal rate: $\|\mathcal{L}(A^{\prime})-\mathcal{L}(\operatorname{\mathbb{E}}A^{\prime})\|=O(1/\sqrt{d})$. Our approach is based on Grothendieck-Pietsch factorization, using which we construct a new decomposition of random graphs. We illustrate the concentration results with an application to the community detection problem in the analysis of networks.

E. L. is partially supported by NSF grants DMS-1159005 and DMS-1521551. R. V. is partially supported by NSF grant 1265782 and U.S. Air Force grant FA9550-14-1-0009. This work was done while C. L. was a Ph.D. student at the University of Michigan.

1. Introduction

Many classical and modern results in probability theory, starting from the Law of Large Numbers, can be expressed as concentration of random objects about their expectations. The objects studied most are sums of independent random variables, martingales, nice functions on product probability spaces and metric measure spaces. For a panoramic exposition of concentration phenomena in modern probability theory and related fields, the reader is referred to the books [25, 9].

This paper studies concentration properties of random graphs. The first step of such a study should be to decide how to interpret the statement that a random graph $G$ concentrates near its expectation. To do this, it will be useful to look at the graph $G$ through the lens of the matrices classically associated with $G$, namely the adjacency and Laplacian matrices.

Let us first build the theory for the adjacency matrix $A$; the Laplacian will be discussed in Section 1.5. We may say that $G$ concentrates about its expectation if $A$ is close to its expectation $\operatorname{\mathbb{E}}A$ in some natural matrix norm; we interpret the expectation of $G$ as the weighted graph with adjacency matrix $\operatorname{\mathbb{E}}A$. Various matrix norms could be of interest here. In this paper, we study concentration in the spectral norm. This automatically gives us a tight control of all eigenvalues and eigenvectors, according to Weyl's and Davis-Kahan perturbation inequalities (see [5, Sections III.2 and VII.3]).

Concentration of random graphs interpreted this way, and also of general random matrices, has been studied in several communities, in particular in random matrix theory, combinatorics and network science.

We will study random graphs generated from an inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$, where edges are formed independently with given probabilities $p_{ij}$, see [7]. This is a generalization of the classical Erdös-Rényi model $G(n,p)$ where all edge probabilities $p_{ij}$ equal $p$. Many popular graph models arise as special cases of $G(n,(p_{ij}))$, such as the stochastic block model, a benchmark model in the analysis of networks [22] discussed in Section 1.7, and random subgraphs of given graphs.

Often, the question of interest is estimating some features of the probability matrix $(p_{ij})$ from random graphs drawn from $G(n,(p_{ij}))$. Concentration of the adjacency and Laplacian matrices around their expectations, when it holds, guarantees that such features can be recovered. As an example of this use of our concentration results, we will show that if $(p_{ij})$ has a block structure, the blocks can be accurately estimated from a single realization of $G(n,(p_{ij}))$ even when the average vertex degree is bounded.

1.1. Dense graphs concentrate

The cleanest concentration results are available for the classical Erdös-Rényi model $G(n,p)$ in the dense regime. In terms of the expected degree $d=pn$, we have with high probability that

\|A-\operatorname{\mathbb{E}}A\| = 2\sqrt{d}\,(1+o(1)) \quad\text{if}\quad d\gg\log^{4}n,    (1.1)

see [16, 44, 28]. Since $\|\operatorname{\mathbb{E}}A\|=d$, we see that the typical deviation here behaves like the square root of the magnitude of expectation – just like in many other classical results of probability theory. In other words, dense random graphs concentrate well.

The lower bound on density in (1.1) can be essentially relaxed all the way down to $d=\Omega(\log n)$. Thus, with high probability we have

\|A-\operatorname{\mathbb{E}}A\| = O(\sqrt{d}) \quad\text{if}\quad d=\Omega(\log n).    (1.2)

This result was proved in [15] based on the method developed by J. Kahn and E. Szemeredi [17]. More generally, (1.2) holds for any inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ with maximal expected degree $d=\max_{i}\sum_{j}p_{ij}$. This generalization can be deduced from a recent result of A. S. Bandeira and R. van Handel [4, Corollary 3.6], while a weaker bound $O(\sqrt{d\log n})$ follows from concentration inequalities for sums of independent random matrices [35]. Alternatively, an argument in [15] can be used to prove (1.2) for a somewhat larger but still useful value

d = \max_{ij} np_{ij},    (1.3)

see [27, 12]. The same can be obtained by using Seginer’s bound on random matrices [20]. As we will see shortly, our paper provides an alternative and completely different approach to general concentration results like (1.2).

1.2. Sparse graphs do not concentrate

In the sparse regime, where the expected degree $d$ is bounded, concentration breaks down. According to [24], a random graph from $G(n,p)$ satisfies with high probability that

\|A\| = (1+o(1))\sqrt{d(A)} = (1+o(1))\sqrt{\frac{\log n}{\log\log n}} \quad\text{if}\quad d=O(1),    (1.4)

where $d(A)$ denotes the maximal degree of the graph (a random quantity). So in this regime we have $\|A\|\gg\|\operatorname{\mathbb{E}}A\|=d$, which shows that sparse random graphs do not concentrate.

What exactly makes the norm $\|A\|$ abnormally large in the sparse regime? The answer is the vertices with too high degrees. In the dense case where $d\gg\log n$, all vertices typically have approximately the same degrees $(1+o(1))d$. This no longer happens in the sparser regime $d\ll\log n$; the degrees do not cluster tightly about the same value anymore. There are vertices with too high degrees; they are captured by the second inequality in (1.4). Even a single high-degree vertex can blow up the norm of the adjacency matrix. Indeed, since the norm of $A$ is bounded below by the Euclidean norm of each of its rows, we have $\|A\|\geq\sqrt{d(A)}$.

1.3. Regularization enforces concentration

If high-degree vertices destroy concentration, can we "tame" these vertices? One proposal would be to remove these vertices from the graph altogether. U. Feige and E. Ofek [15] showed that this works for $G(n,p)$: the removal of the high-degree vertices enforces concentration. Indeed, if we drop all vertices with degrees, say, larger than $2d$, then the remaining part of the graph satisfies

\|A^{\prime}-\operatorname{\mathbb{E}}A^{\prime}\| = O(\sqrt{d})    (1.5)

with high probability, where $A^{\prime}$ denotes the adjacency matrix of the new graph. The argument in [15] is based on the method developed by J. Kahn and E. Szemeredi [17]. It extends to the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ with $d$ defined in (1.3), see [27, 12]. As we will see, our paper provides an alternative and completely different approach to such results.

Although the removal of high-degree vertices solves the concentration problem, such a solution is not ideal, since those vertices are in some sense the most important ones. In real-world networks, the vertices with highest degrees are "hubs" that hold the network together. Their removal would cause the network to break down into disconnected components, which leads to a considerable loss of structural information.

Would it be possible to regularize the graph in a more gentle way – instead of removing the high-degree vertices, reduce the weights of their edges just enough to keep the degrees bounded by $O(d)$? The main result of our paper states that this is true. Let us first state this result informally; Theorem 2.1 provides a more general and formal statement.

Theorem 1.1 (Concentration of regularized adjacency matrices).

Consider a random graph from the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$, and let $d=\max_{ij}np_{ij}$. For all high-degree vertices of the graph (say, those with degrees larger than $2d$), reduce the weights of the edges incident to them in an arbitrary way, but so that all degrees of the new (weighted) graph become bounded by $2d$. Then, with high probability, the adjacency matrix $A^{\prime}$ of the new graph concentrates:

\|A^{\prime}-\operatorname{\mathbb{E}}A\| = O(\sqrt{d}).

Moreover, instead of requiring that the degrees become bounded by $2d$, we can require that the $\ell_{2}$ norms of the rows of the new adjacency matrix become bounded by $\sqrt{2d}$.

1.4. Examples of graph regularization

The regularization procedure in Theorem 1.1 is very flexible. Depending on how one chooses the weights, one can obtain as special cases several results we summarized earlier, as well as some new ones.

  1. Do not do anything to the graph. In the dense regime where $d=\Omega(\log n)$, all degrees are already bounded by $2d$ with high probability. This means that the original graph satisfies $\|A-\operatorname{\mathbb{E}}A\|=O(\sqrt{d})$. Thus we recover the result of U. Feige and E. Ofek (1.2), which states that dense random graphs concentrate well.

  2. Remove all high-degree vertices. If we remove all vertices with degrees larger than $2d$, we recover another result of U. Feige and E. Ofek (1.5), which states that the removal of the high-degree vertices enforces concentration.

  3. Remove just enough edges from high-degree vertices. Instead of removing the high-degree vertices with all of their edges, we can remove just enough edges to make all degrees bounded by $2d$. This milder regularization still produces the concentration bound (1.5).

  4. Reduce the weight of edges proportionally to the excess of degrees. Instead of removing edges, we can reduce the weight of the existing edges, a procedure which better preserves the structure of the graph. For instance, we can assign weight $\sqrt{\lambda_{i}\lambda_{j}}$ to the edge between vertices $i$ and $j$, choosing $\lambda_{i}:=\min(2d/d_{i},1)$ where $d_{i}$ is the degree of vertex $i$. One can check that this makes the $\ell_{2}$ norms of all rows of the adjacency matrix bounded by $2d$. By Theorem 1.1, such a regularization procedure leads to the same concentration bound (1.5); a minimal code sketch of this reweighting is given below.

1.5. Concentration of Laplacian

So far, we have looked at random graphs through the lens of their adjacency matrices. A different matrix that captures the geometry of a graph is the (symmetric, normalized) Laplacian matrix, defined as

\mathcal{L}(A) = D^{-1/2}(D-A)D^{-1/2} = I - D^{-1/2}AD^{-1/2}.    (1.6)

Here $I$ is the identity matrix and $D=\operatorname{diag}(d_{i})$ is the diagonal matrix with degrees $d_{i}=\sum_{j=1}^{n}A_{ij}$ on the diagonal. The reader is referred to [13] for an introduction to graph Laplacians and their role in spectral graph theory. Here we mention just two basic facts: the spectrum of $\mathcal{L}(A)$ is a subset of $[0,2]$, and the smallest eigenvalue is always zero.

Concentration of Laplacians of random graphs has been studied in [35, 11, 39, 23, 18]. Just like the adjacency matrix, the Laplacian is known to concentrate in the dense regime where $d=\Omega(\log n)$, and it fails to concentrate in the sparse regime. However, the obstructions to concentration are opposite. For the adjacency matrices, as we mentioned, the trouble is caused by high-degree vertices. For the Laplacian, the problem lies with low-degree vertices. In particular, for $d=o(\log n)$ the graph is likely to have isolated vertices; they produce multiple zero eigenvalues of $\mathcal{L}(A)$, which are easily seen to destroy the concentration.

In analogy to our discussion of adjacency matrices, we can try to regularize the graph to "tame" the low-degree vertices in various ways, for example remove the low-degree vertices, connect them to some other vertices, artificially increase the degrees $d_{i}$ in the definition (1.6) of the Laplacian, and so on. Here we will focus on the following simple way of regularization proposed in [3] and analyzed in [23, 18]. Choose $\tau>0$ and add the same number $\tau/n$ to all entries of the adjacency matrix $A$, thereby replacing it with

A_{\tau} := A + (\tau/n){\textbf{1}}{\textbf{1}}^{\mathsf{T}}

in the definition (1.6) of the Laplacian. This regularization raises all degrees $d_{i}$ to $d_{i}+\tau$. If we choose $\tau\sim d$, the regularized graph does not have low-degree vertices anymore.
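
For concreteness, here is a minimal sketch of computing the regularized Laplacian $\mathcal{L}(A_{\tau})$; it is our own illustration, and the function name and the NumPy-based implementation are assumptions, not part of the paper.

    import numpy as np

    def regularized_laplacian(A, tau):
        """Symmetric normalized Laplacian of A_tau = A + (tau/n) * 1 1^T."""
        n = A.shape[0]
        A_tau = A + tau / n                  # add tau/n to every entry
        deg = A_tau.sum(axis=1)              # regularized degrees d_i + tau
        inv_sqrt = 1.0 / np.sqrt(deg)
        return np.eye(n) - inv_sqrt[:, None] * A_tau * inv_sqrt[None, :]

A common choice, used in Corollary 1.4 below, is to take tau equal to the average degree of the graph.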

The following consequence of Theorem 1.1 shows that such regularization indeed forces the Laplacian to concentrate. Here we state this result informally; Theorem 4.1 provides a more formal statement.

Theorem 1.2 (Concentration of the regularized Laplacian).

Consider a random graph from the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$, and let $d=\max_{ij}np_{ij}$. Choose a number $\tau\sim d$. Then, with high probability, the regularized Laplacian $\mathcal{L}(A_{\tau})$ concentrates:

\|\mathcal{L}(A_{\tau})-\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau})\| = O\Big(\frac{1}{\sqrt{d}}\Big).

We will deduce this result from Theorem 1.1 in Section 4. Theorem 1.2 is an improvement upon a bound in [18] that had an extra $\log d$ factor, and it was conjectured there that the logarithmic factor is not needed. Theorem 1.2 confirms this conjecture.

1.6. A numerical experiment

To conclude our discussion of various ways to regularize sparse graphs, let us illustrate the effect of regularization by a numerical experiment. Consider an inhomogeneous Erdös-Rényi graph with $n=1000$ vertices, $90\%$ of which have expected degree $7$ and $10\%$ of which have expected degree $35$. We then regularize the graph by reducing the weights of edges proportionally to the excess of degrees – just as we described in Section 1.4, item 4, except that we use the overall average degree (approximately $10$) instead of $d$ (which results in a more severe weight reduction suitable for our illustration purpose).

Figure 1 shows the histogram of the spectrum of $A$ (left) and $A^{\prime}$ (right). As we can see, the high-degree vertices lead to the long tails in the histogram of the eigenvalues, and regularization shrinks these tails toward the bulk.

Figure 1. Histogram of the spectrum of the adjacency matrix $A$ (left) and the regularized adjacency matrix $A^{\prime}$ (right) for a sparse random graph generated from the inhomogeneous Erdös-Rényi model with $n=1000$ vertices, $90\%$ of which have expected degree $7$ and $10\%$ of which have expected degree $35$.
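
A sketch of this experiment in Python is given below. The way the edge probabilities $p_{ij}$ are built from the expected degrees (a Chung-Lu type choice) and the exact reweighting threshold are our own assumptions for illustration; the paper does not specify these details.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    # Expected degrees: 7 for 90% of the vertices, 35 for the remaining 10%.
    theta = np.where(np.arange(n) < int(0.9 * n), 7.0, 35.0)
    P = np.outer(theta, theta) / theta.sum()       # p_ij, expected degree ~ theta_i
    A = np.triu((rng.random((n, n)) < P).astype(float), 1)
    A = A + A.T                                    # symmetric, no self-loops

    # Regularize: reduce edge weights proportionally to the excess of degrees,
    # using the overall average degree as the threshold (the constant is a guess).
    d_avg = A.sum() / n
    lam = np.minimum(d_avg / np.maximum(A.sum(axis=1), 1.0), 1.0)
    A_reg = A * np.sqrt(np.outer(lam, lam))

    eig_A = np.linalg.eigvalsh(A)
    eig_A_reg = np.linalg.eigvalsh(A_reg)
    # Histograms of eig_A and eig_A_reg reproduce the two panels of Figure 1.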

1.7. Application: community detection in networks

Concentration of random graphs has an important application to statistical analysis of networks, in particular to the problem of community detection. A common way of modeling communities in networks is the stochastic block model [22], which is a special case of the inhomogeneous Erdös-Rényi model considered in this paper. For the purpose of this example, we focus on the simplest version of the stochastic block model $G(n,\frac{a}{n},\frac{b}{n})$, also known as the balanced planted partition model, defined as follows. The set of vertices is divided into two subsets (communities) of size $n/2$ each. Edges between vertices are drawn independently with probability $a/n$ if they are in the same community and with probability $b/n$ otherwise.

The community detection problem is to recover the community labels of vertices from a single realization of the random graph model. A large literature exists on both the recovery algorithms and the theory establishing when recovery is possible [14, 33, 34, 32, 1, 29, 8]. There are methods that perform better than a random guess (i.e. the fraction of misclassified vertices is bounded away from $0.5$ as $n\to\infty$ with high probability) under the condition

(a-b)^{2} > 2(a+b),

and no method can perform better than a random guess if this condition is violated.

Moreover, strong consistency, or exact recovery (labeling all vertices correctly with high probability), is possible when the expected degree $(a+b)/2$ is of order $\log n$ or larger and $a$ and $b$ are sufficiently separated, see [32, 30, 6, 20, 10]. Weak consistency (the fraction of mislabeled vertices going to $0$ with high probability) is achievable if and only if

(a-b)^{2} > C_{n}(a+b) \quad\text{with}\quad C_{n}\rightarrow\infty,

see [32]. Many of these results hold in the non-asymptotic regime, for graphs of fixed size $n$. Thus, for any $\varepsilon>0$ there exists $C_{\varepsilon}$ (which only depends on $\varepsilon$) such that one can recover communities up to $\varepsilon n$ mislabeled vertices as long as

(a-b)^{2} > C_{\varepsilon}(a+b).

In particular, recovery of communities is possible even for very sparse graphs – those with bounded expected degrees. Several types of algorithms are known to succeed in this regime, including non-backtracking walks [33, 29, 8], spectral methods [12] and methods based on semidefinite programming [19, 31].

As an application of the new concentration results, we show that regularized spectral clustering [3, 23], one of the simplest and most popular algorithms for community detection, can recover communities in the sparse regime. In general, spectral clustering works by computing the leading eigenvectors of either the adjacency matrix or the Laplacian or their regularized versions, and running the $k$-means clustering algorithm on these eigenvectors to recover the node labels. In the simple case of the model $G(n,\frac{a}{n},\frac{b}{n})$, one can simply assign nodes to communities based on the signs (positive or negative) of the entries of the eigenvector $v_{2}(\mathcal{L}(A_{\tau}))$ corresponding to the second smallest eigenvalue of the regularized Laplacian matrix $\mathcal{L}(A_{\tau})$ (or the regularized adjacency matrix $A^{\prime}$).

Let us briefly explain how our concentration results validate regularized spectral clustering. If the concentration of random graphs holds and $\mathcal{L}(A_{\tau})$ is close to $\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau})$, then standard perturbation theory (the Davis-Kahan theorem below) shows that $v_{2}(\mathcal{L}(A_{\tau}))$ is close to $v_{2}(\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau}))$, and in particular, the signs of these two eigenvectors must agree on most vertices. An easy calculation shows that the signs of $v_{2}(\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau}))$ recover the communities exactly: this vector is a positive constant on one community and a negative constant on the other. Therefore, the signs of $v_{2}(\mathcal{L}(A_{\tau}))$ must recover the communities up to a small fraction of misclassified vertices.
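
The following sketch implements this regularized spectral clustering step for the model $G(n,\frac{a}{n},\frac{b}{n})$. It is our own illustration; the particular values of $n$, $a$, $b$ are chosen only for demonstration.

    import numpy as np

    def spectral_partition(A, tau):
        """Signs of the eigenvector of L(A_tau) for the second smallest eigenvalue."""
        n = A.shape[0]
        A_tau = A + tau / n
        deg = A_tau.sum(axis=1)
        L = np.eye(n) - A_tau / np.sqrt(np.outer(deg, deg))
        vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
        return np.sign(vecs[:, 1])

    rng = np.random.default_rng(1)
    n, a, b = 2000, 10.0, 2.0
    labels = np.repeat([1, -1], n // 2)
    P = np.where(np.equal.outer(labels, labels), a / n, b / n)
    A = np.triu((rng.random((n, n)) < P).astype(float), 1)
    A = A + A.T
    tau = A.sum() / n                          # average degree, as in Corollary 1.4
    guess = spectral_partition(A, tau)
    accuracy = max(np.mean(guess == labels), np.mean(guess == -labels))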

Before stating our result, let us quote a simple version of the Davis-Kahan perturbation theorem (see e.g. [5, Theorem VII.3.2]).

Theorem 1.3 (Davis-Kahan theorem).

Let $X,Y$ be symmetric matrices such that the second smallest eigenvalues of $X$ and $Y$ have multiplicity one and are of distance at least $\delta>0$ from the remaining eigenvalues of $X$ and $Y$. Denote by $x$ and $y$ the eigenvectors of $X$ and $Y$ corresponding to the second smallest eigenvalues of $X$ and $Y$, respectively. Then

\min_{\beta=\pm 1}\|x+\beta y\| \leq \frac{2\|X-Y\|}{\delta}.
Corollary 1.4 (Community detection in sparse graphs).

Let $\varepsilon>0$ and $r\geq 1$. Let $A$ be the adjacency matrix drawn from the stochastic block model $G(n,\frac{a}{n},\frac{b}{n})$. Assume that

(a-b)^{2} > C_{\varepsilon}(a+b)    (1.7)

where $C_{\varepsilon}=Cr^{4}\varepsilon^{-2}$ and $C$ is an appropriately large absolute constant. Choose $\tau$ to be the average degree of the graph, i.e. $\tau=(d_{1}+\cdots+d_{n})/n$ where $d_{i}$ are the vertex degrees. Then with probability at least $1-e^{-r}$, we have

\min_{\beta=\pm 1}\|v_{2}(\mathcal{L}(A_{\tau}))+\beta v_{2}(\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau}))\| \leq \varepsilon.

In particular, the signs of the entries of $v_{2}(\mathcal{L}(A_{\tau}))$ correctly estimate the partition into the two communities, up to at most $\varepsilon n$ misclassified vertices.

Proof.

We apply Theorem 1.3 with $X=\mathcal{L}(A_{\tau})$ and $Y=\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau})$. A simple calculation shows that the spectral gap $\delta$ defined in Theorem 1.3 is of the order $(a-b)/(a+b)$. The claim of Corollary 1.4 then follows from the Davis-Kahan Theorem 1.3, the Concentration Theorem 4.1 (which is a formal version of Theorem 1.2) and condition (1.7). ∎

1.8. Organization of the paper

In Section 2, we state a formal version of Theorem 1.1. We show there how to deduce this result from a new decomposition of random graphs, which we state as Theorem 2.6. We prove this decomposition theorem in Section 3. In Section 4, we state and prove a formal version of Theorem 1.2 about the concentration of the Laplacian. We conclude the paper with Section 5 where we propose some questions for further investigation.

Acknowledgement

The authors are grateful to Ramon van Handel for several insightful comments on the preliminary version of this paper.

2. Full version of Theorem 1.1, and reduction to a graph decomposition

In this section we state a more general and quantitative version of Theorem 1.1, and we reduce it to a new form of graph decomposition, which can be of interest on its own.

Theorem 2.1 (Concentration of regularized adjacency matrices).

Consider a random graph from the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$, and let $d=\max_{ij}np_{ij}$. For any $r\geq 1$, the following holds with probability at least $1-n^{-r}$. Consider any subset consisting of at most $10n/d$ vertices, and reduce the weights of the edges incident to those vertices in an arbitrary way. Let $d^{\prime}$ be the maximal degree of the resulting graph. Then the adjacency matrix $A^{\prime}$ of the new (weighted) graph satisfies

\|A^{\prime}-\operatorname{\mathbb{E}}A\| \leq Cr^{3/2}\big(\sqrt{d}+\sqrt{d^{\prime}}\big).

Moreover, the same bound holds for $d^{\prime}$ being the maximal $\ell_{2}$ norm of the rows of $A^{\prime}$.

In this result and in the rest of the paper, $C, C_{1}, C_{2}, \ldots$ denote absolute constants whose values may be different from line to line.

Remark 2.2 (Theorem 2.1 implies Theorem 1.1).

The subset of $10n/d$ vertices in Theorem 2.1 can be completely arbitrary. So let us choose the high-degree vertices, say those with degrees larger than $2d$. There are at most $10n/d$ such vertices with high probability; this follows by an easy calculation, and also from Lemma 3.5. Thus we immediately deduce Theorem 1.1.

Remark 2.3 (Tight upper bound).

If we do not reduce the weights of any edges and $d$ is bounded, then the upper bound in Theorem 2.1 is tight (up to a constant depending on $r$). This is because of (1.4), which states that the adjacency matrix does not concentrate in the sparse regime without regularization.

Remark 2.4 (Method to prove Theorem 2.1).

One may wonder if Theorem 2.1 can be proved by developing an $\epsilon$-net argument similar to the method of J. Kahn and E. Szemeredi [17] and its versions [2, 15, 27, 12]. Although we cannot rule out such a possibility, we do not see how this method could handle a general regularization. The reader familiar with the method can easily notice an obstacle: the contribution of the so-called light couples becomes hard to control when one changes, and even reduces, the individual entries of $A$ (the weights of edges).

We will develop an alternative and somewhat simpler approach, which will be able to handle a general regularization of random graphs. It sheds light on the specific structure of graphs that enables concentration. We are going to identify this structure through a graph decomposition in the next section. But let us pause briefly to mention the following useful reduction.

Remark 2.5 (Reduction to directed graphs).

Our arguments will be more convenient to carry out if the adjacency matrix $A$ has all independent entries. To be able to make this assumption, we can decompose $A$ into an upper-triangular and a lower-triangular part, both of which have independent entries. If we can show that each of these parts concentrates about its expectation, it would follow by the triangle inequality that $A$ concentrates about $\operatorname{\mathbb{E}}A$.

In other words, we may prove Theorem 2.1 for directed inhomogeneous Erdös-Rényi graphs, where edges between any vertices and in any direction appear independently with probabilities $p_{ij}$. In the rest of the argument, we will only work with such random directed graphs.

2.1. Graph decomposition

In this section, we reduce Theorem 2.1 to the following decomposition of inhomogeneous Erdös-Rényi directed random graphs. This decomposition may be of independent interest. Throughout the paper, we denote by $B_{\mathcal{N}}$ the matrix which coincides with a matrix $B$ on a subset of edges $\mathcal{N}\subset[n]\times[n]$ and has zero entries elsewhere.

Theorem 2.6 (Graph decomposition).

Consider a random directed graph from the inhomogeneous Erdös-Rényi model, and let $d$ be as in (1.3). For any $r\geq 1$, the following holds with probability at least $1-3n^{-r}$. One can decompose the set of edges $[n]\times[n]$ into three classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ so that the following properties are satisfied for the adjacency matrix $A$.

  • The graph concentrates on $\mathcal{N}$, namely $\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}}\|\leq Cr^{3/2}\sqrt{d}$.

  • Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones.

Moreover, $\mathcal{R}$ intersects at most $n/d$ columns, and $\mathcal{C}$ intersects at most $n/d$ rows of $[n]\times[n]$.

Figure 2 illustrates a possible decomposition Theorem 2.6 can provide. The edges in $\mathcal{N}$ form a big "core" where the graph concentrates well even without regularization. The edges in $\mathcal{R}$ and $\mathcal{C}$ can be thought of (at least heuristically) as those attached to high-degree vertices.

Figure 2. An example of graph decomposition in Theorem 2.6.

We will prove Theorem 2.6 in Section 3; let us pause to deduce Theorem 2.1 from it.

2.2. Deduction of Theorem 2.1

First, let us explain informally how the graph decomposition could lead to Theorem 2.1. The regularization of the graph does not destroy the properties of $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ in Theorem 2.6. Moreover, regularization creates a new property for us, allowing for a good control of the columns of $\mathcal{R}$ and rows of $\mathcal{C}$. Let us focus on $A_{\mathcal{R}}$ to be specific. The $\ell_{1}$ norms of all columns of this matrix are at most $d^{\prime}$, and the $\ell_{1}$ norms of all rows are $O(r)$ by Theorem 2.6. By a simple calculation which we will do in Lemma 2.7, this implies that $\|A_{\mathcal{R}}\|=O(\sqrt{rd^{\prime}})$. A similar bound can be proved for $\mathcal{C}$. Combining $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ together will lead to the error bound $O(r^{3/2}(\sqrt{d}+\sqrt{d^{\prime}}))$ in Theorem 2.1.

To make this argument rigorous, let us start with the simple calculation we just mentioned.

Lemma 2.7.

Consider a matrix $B$ in which each row has $\ell_{1}$ norm at most $a$, and each column has $\ell_{1}$ norm at most $b$. Then $\|B\|\leq\sqrt{ab}$.

Proof.

Let $x$ be a vector with $\|x\|_{2}=1$. Using the Cauchy-Schwarz inequality and the assumptions, we have

\|Bx\|_{2}^{2} = \sum_{i}\Big(\sum_{j}B_{ij}x_{j}\Big)^{2} \leq \sum_{i}\Big(\sum_{j}|B_{ij}|\sum_{j}|B_{ij}|x_{j}^{2}\Big) \leq \sum_{i}\Big(a\sum_{j}|B_{ij}|x_{j}^{2}\Big) = a\sum_{j}x_{j}^{2}\sum_{i}|B_{ij}| \leq a\sum_{j}x_{j}^{2}\,b = ab.

Since $x$ is arbitrary, this completes the proof. ∎
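
As a quick numerical sanity check of Lemma 2.7 (our own illustration, not part of the proof):

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.random((50, 80)) * (rng.random((50, 80)) < 0.1)   # sparse nonnegative matrix
    a = np.abs(B).sum(axis=1).max()    # largest l1 norm of a row
    b = np.abs(B).sum(axis=0).max()    # largest l1 norm of a column
    assert np.linalg.norm(B, 2) <= np.sqrt(a * b) + 1e-9      # ||B|| <= sqrt(ab)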

Remark 2.8 (Riesz-Thorin interpolation theorem implies Lemma 2.7).

Lemma 2.7 can also be deduced from the Riesz-Thorin interpolation theorem (see e.g. [40, Theorem 2.1]), since the maximal $\ell_{1}$ norm of the columns is the $\ell_{1}\rightarrow\ell_{1}$ operator norm, and the maximal $\ell_{1}$ norm of the rows is the $\ell_{\infty}\rightarrow\ell_{\infty}$ operator norm.

We are ready to formally deduce the main part of Theorem 2.1 from Theorem 2.6; we defer the “moreover” part to Section 3.6.

Proof of Theorem 2.1 (main part).

Fix a realization of the random graph that satisfies the conclusion of Theorem 2.6, and decompose the deviation $A^{\prime}-\operatorname{\mathbb{E}}A$ as follows:

A^{\prime}-\operatorname{\mathbb{E}}A = (A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{N}} + (A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{R}} + (A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{C}}.    (2.1)

We will bound the spectral norm of each of the three terms separately.

Step 1. Deviation on $\mathcal{N}$. Let us further decompose

(A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{N}} = (A-\operatorname{\mathbb{E}}A)_{\mathcal{N}} - (A-A^{\prime})_{\mathcal{N}}.    (2.2)

By Theorem 2.6, $\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}}\|\leq Cr^{3/2}\sqrt{d}$. To control the second term in (2.2), denote by $\mathcal{E}\subset[n]\times[n]$ the subset of edges that are reweighted in the regularization process. Since $A$ and $A^{\prime}$ are equal on $\mathcal{E}^{c}$, we have

\|(A-A^{\prime})_{\mathcal{N}}\| = \|(A-A^{\prime})_{\mathcal{N}\cap\mathcal{E}}\| \leq \|A_{\mathcal{N}\cap\mathcal{E}}\| \quad\text{(since $0\leq A-A^{\prime}\leq A$ entrywise)}
\leq \|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}\cap\mathcal{E}}\| + \|\operatorname{\mathbb{E}}A_{\mathcal{N}\cap\mathcal{E}}\| \quad\text{(by the triangle inequality).}    (2.3)

Further, a simple restriction property implies that

\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}\cap\mathcal{E}}\| \leq 2\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}}\|.    (2.4)

Indeed, restricting a matrix onto a product subset of $[n]\times[n]$ can only reduce its norm. Although the set of reweighted edges $\mathcal{E}$ is not a product subset, it can be decomposed into two product subsets:

\mathcal{E} = \big(I\times[n]\big) \cup \big(I^{c}\times I\big)    (2.5)

where $I$ is the subset of vertices incident to the edges in $\mathcal{E}$. Then (2.4) holds; the right-hand side of that inequality is bounded by $Cr^{3/2}\sqrt{d}$ by Theorem 2.6. Thus we handled the first term in (2.3).

To bound the second term in (2.3), we can use another restriction property, which states that the norm of a matrix with non-negative entries can only decrease under restriction onto any subset of $[n]\times[n]$ (whether a product subset or not). This yields

\|\operatorname{\mathbb{E}}A_{\mathcal{N}\cap\mathcal{E}}\| \leq \|\operatorname{\mathbb{E}}A_{\mathcal{E}}\| \leq \|\operatorname{\mathbb{E}}A_{I\times[n]}\| + \|\operatorname{\mathbb{E}}A_{I^{c}\times I}\|    (2.6)

where the second inequality follows by (2.5). By assumption, the matrix $\operatorname{\mathbb{E}}A_{I\times[n]}$ has $|I|\leq 10n/d$ rows and each of its entries is bounded by $d/n$. Hence the $\ell_{1}$ norm of all rows is bounded by $d$, and the $\ell_{1}$ norm of all columns is bounded by $10$. Lemma 2.7 implies that $\|\operatorname{\mathbb{E}}A_{I\times[n]}\|\leq\sqrt{10d}$. A similar bound holds for the second term of (2.6). This yields

\|\operatorname{\mathbb{E}}A_{\mathcal{N}\cap\mathcal{E}}\| \leq 5\sqrt{d},

so we handled the second term in (2.3). Recalling that the first term there is bounded by $Cr^{3/2}\sqrt{d}$, we conclude that $\|(A-A^{\prime})_{\mathcal{N}}\|\leq 2Cr^{3/2}\sqrt{d}$.

Returning to (2.2), we recall that the first term on the right-hand side is bounded by $Cr^{3/2}\sqrt{d}$, and we just bounded the second term by $2Cr^{3/2}\sqrt{d}$. Hence

\|(A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{N}}\| \leq 4Cr^{3/2}\sqrt{d}.

Step 2. Deviation on $\mathcal{R}$ and $\mathcal{C}$. By the triangle inequality, we have

\|(A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{R}}\| \leq \|A^{\prime}_{\mathcal{R}}\| + \|\operatorname{\mathbb{E}}A_{\mathcal{R}}\|.

Recall that $0\leq A_{\mathcal{R}}^{\prime}\leq A_{\mathcal{R}}$ entrywise. By Theorem 2.6, each of the rows of $A_{\mathcal{R}}$, and thus also of $A_{\mathcal{R}}^{\prime}$, has $\ell_{1}$ norm at most $32r$. Moreover, by definition of $d^{\prime}$, each of the columns of $A^{\prime}$, and thus also of $A_{\mathcal{R}}^{\prime}$, has $\ell_{1}$ norm at most $d^{\prime}$. Lemma 2.7 implies that $\|A^{\prime}_{\mathcal{R}}\|\leq\sqrt{32rd^{\prime}}$.

The matrix $\operatorname{\mathbb{E}}A_{\mathcal{R}}$ can be handled similarly. By Theorem 2.6, it has at most $n/d$ nonzero entries in each row, and all entries are bounded by $d/n$. Thus each row of $\operatorname{\mathbb{E}}A_{\mathcal{R}}$ has $\ell_{1}$ norm at most $1$, and each column has $\ell_{1}$ norm at most $d$. Lemma 2.7 implies that $\|\operatorname{\mathbb{E}}A_{\mathcal{R}}\|\leq\sqrt{d}$.

We showed that

\|(A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{R}}\| \leq \sqrt{32rd^{\prime}} + \sqrt{d}.

A similar bound holds for $\|(A^{\prime}-\operatorname{\mathbb{E}}A)_{\mathcal{C}}\|$. Combining the bounds on the deviation of $A^{\prime}-\operatorname{\mathbb{E}}A$ on $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ and putting them into (2.1), we conclude that

\|A^{\prime}-\operatorname{\mathbb{E}}A\| \leq 4Cr^{3/2}\sqrt{d} + 2\big(\sqrt{32rd^{\prime}}+\sqrt{d}\big).

Simplifying this inequality, we complete the proof of the main part of Theorem 2.1. ∎

3. Proof of Decomposition Theorem 2.6

3.1. Outline of the argument

We will construct the decomposition in Theorem 2.6 by an iterative procedure. The first and crucial step is to find a big block (in this paper, by block we mean a product set $I\times J$ with arbitrary index subsets $I,J\subset[n]$; these subsets are not required to be intervals of successive integers) $\mathcal{N}^{\prime}\subset[n]\times[n]$ of size at least $(n-n/d)\times n/2$ on which $A$ concentrates, i.e.

\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}^{\prime}}\| = O(\sqrt{d}).

To find such a block, we first establish concentration in the $\ell_{\infty}\to\ell_{2}$ norm; this can be done by standard probabilistic techniques. Next, we can automatically upgrade this to concentration in the spectral norm ($\ell_{2}\to\ell_{2}$) once we pass to an appropriate block $\mathcal{N}^{\prime}$. This can be done using a general result from functional analysis, which we call Grothendieck-Pietsch factorization.

Repeating this argument for the transpose, we obtain another block $\mathcal{N}^{\prime\prime}$ of size at least $n/2\times(n-n/d)$ where the graph concentrates as well. So the graph concentrates on $\mathcal{N}_{0}:=\mathcal{N}^{\prime}\cup\mathcal{N}^{\prime\prime}$. The "core" $\mathcal{N}_{0}$ will form the first part of the class $\mathcal{N}$ we are constructing.

It remains to control the graph on the complement of $\mathcal{N}_{0}$. That set of edges is quite small; it can be described as a union of a block $\mathcal{C}_{0}$ with $n/d$ rows, a block $\mathcal{R}_{0}$ with $n/d$ columns and an exceptional $n/2\times n/2$ block; see Figure 3(b) for illustration. We may consider $\mathcal{C}_{0}$ and $\mathcal{R}_{0}$ as the first parts of the future classes $\mathcal{C}$ and $\mathcal{R}$ we are constructing.

Indeed, since $\mathcal{C}_{0}$ has so few rows, the expected number of ones in each column of $\mathcal{C}_{0}$ is bounded by $1$. For simplicity, let us think that all columns of $\mathcal{C}_{0}$ have $O(1)$ ones as desired. (In the formal argument, we will add the bad columns to the exceptional block.) Of course, the block $\mathcal{R}_{0}$ can be handled similarly.

At this point, we decomposed $[n]\times[n]$ into $\mathcal{N}_{0}$, $\mathcal{R}_{0}$, $\mathcal{C}_{0}$ and an exceptional $n/2\times n/2$ block. Now we repeat the process for the exceptional block, constructing $\mathcal{N}_{1}$, $\mathcal{R}_{1}$, and $\mathcal{C}_{1}$ there, and so on. Figure 3(c) illustrates this process. At the end, we choose $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ to be the unions of the blocks $\mathcal{N}_{i}$, $\mathcal{R}_{i}$ and $\mathcal{C}_{i}$ respectively.

Figure 3. Constructing decomposition iteratively in the proof of Theorem 2.6: (a) The core. (b) After the first step. (c) Final decomposition.

Two precautions have to be taken in this argument. First, we need to make concentration on the core blocks $\mathcal{N}_{i}$ better at each step, so that the sum of those error bounds would not depend on the total number of steps. This can be done with little effort, with the help of the exponential decrease of the size of the blocks $\mathcal{N}_{i}$. Second, we have control of the sizes but not the locations of the exceptional blocks. Thus to be able to carry out the decomposition argument inside an exceptional block, we need to make the argument valid uniformly over all blocks of that size. This will require us to be delicate with the probabilistic arguments, so that we can take a union bound over such blocks.

3.2. Grothendieck-Pietsch factorization

As we mentioned in the previous section, our proof of Theorem 2.6 is based on Grothendieck-Pietsch factorization. This general and well known result in functional analysis [36, 37] has already been used in a similar probabilistic context, see [26, Proposition 15.11].

Grothendieck-Pietsch factorization compares two matrix norms, the $\ell_{2}\to\ell_{2}$ norm (which we call the spectral norm) and the $\ell_{\infty}\to\ell_{2}$ norm. For a $k\times m$ matrix $B$, these norms are defined as

\|B\| = \max_{\|x\|_{2}=1}\|Bx\|_{2}, \qquad \|B\|_{\infty\to 2} = \max_{\|x\|_{\infty}=1}\|Bx\|_{2} = \max_{x\in\{-1,1\}^{m}}\|Bx\|_{2}.

The $\ell_{\infty}\to\ell_{2}$ norm is usually easier to control, since the supremum is taken with respect to the discrete set $\{-1,1\}^{m}$, and any vector there has all coordinates of the same magnitude.

To compare the two norms, one can start with the obvious inequality

\frac{\|B\|_{\infty\to 2}}{\sqrt{m}} \leq \|B\| \leq \|B\|_{\infty\to 2}.

Both parts of this inequality are optimal, so there is an unavoidable slack between the upper and lower bounds. However, Grothendieck-Pietsch factorization allows us to tighten the inequality by changing $B$ slightly. The next two results offer two ways to change $B$ – introduce weights and pass to a sub-matrix.
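
For small matrices, $\|B\|_{\infty\to 2}$ can be computed by brute force over the $2^{m}$ sign vectors, which makes the comparison above easy to check numerically; the sketch below is our own illustration.

    import numpy as np
    from itertools import product

    def norm_inf_to_2(B):
        """||B||_{inf->2} = max over x in {-1,1}^m of ||Bx||_2 (brute force)."""
        m = B.shape[1]
        return max(np.linalg.norm(B @ np.array(x)) for x in product([-1, 1], repeat=m))

    rng = np.random.default_rng(3)
    B = rng.standard_normal((5, 8))
    n_inf2 = norm_inf_to_2(B)
    spec = np.linalg.norm(B, 2)                 # spectral norm
    m = B.shape[1]
    assert n_inf2 / np.sqrt(m) <= spec + 1e-9 and spec <= n_inf2 + 1e-9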

Theorem 3.1 (Grothendieck-Pietsch’s factorization, weighted version).

Let $B$ be a $k\times m$ real matrix. Then there exist positive weights $\mu_{j}$ with $\sum_{j=1}^{m}\mu_{j}=1$ such that

\|B\|_{\infty\rightarrow 2} \leq \|BD_{\mu}^{-1/2}\| \leq \sqrt{\pi/2}\,\|B\|_{\infty\rightarrow 2},    (3.1)

where $D_{\mu}=\operatorname{diag}(\mu_{j})$ denotes the $m\times m$ diagonal matrix with the weights $\mu_{j}$ on the diagonal.

This result is a known combination of the Little Grothendieck Theorem (see [41, Corollary 10.10] and [38]) and Pietsch Factorization (see [41, Theorem 9.2]). In an explicit form, a version of this result can be found e.g. in [26, Proposition 15.11]. The weights $\mu_{j}$ can be computed algorithmically, see [42].

The following related version of Grothendieck-Pietsch factorization can be especially useful in probabilistic contexts, see [26, Proposition 15.11]. Here and in the rest of the paper, we denote by $B_{I\times J}$ the sub-matrix of a matrix $B$ with rows indexed by a subset $I$ and columns indexed by a subset $J$.

Theorem 3.2 (Grothendieck-Pietsch factorization, sub-matrix version).

Let $B$ be a $k\times m$ real matrix and $\delta>0$. Then there exists $J\subseteq[m]$ with $|J|\geq(1-\delta)m$ such that

\|B_{[k]\times J}\| \leq \frac{2\|B\|_{\infty\rightarrow 2}}{\sqrt{\delta m}}.
Proof.

Consider the weights $\mu_{j}$ given by Theorem 3.1, and choose $J$ to consist of the indices $j$ satisfying $\mu_{j}\leq 1/(\delta m)$. Since $\sum_{j}\mu_{j}=1$, the set $J$ must contain at least $(1-\delta)m$ indices, as claimed. Furthermore, the diagonal entries of $(D_{\mu}^{-1/2})_{J\times J}$ are all bounded from below by $\sqrt{\delta m}$, which yields

\|(BD_{\mu}^{-1/2})_{[k]\times J}\| \geq \sqrt{\delta m}\,\|B_{[k]\times J}\|.

On the other hand, by (3.1) the left-hand side of this inequality is bounded by $2\|B\|_{\infty\rightarrow 2}$. Rearranging the terms, we complete the proof. ∎

3.3. Concentration on a big block

We are starting to work toward constructing the core part $\mathcal{N}$ in Theorem 2.6. In this section we will show how to find a big block on which the adjacency matrix $A$ concentrates. First we will establish concentration in the $\ell_{\infty}\to\ell_{2}$ norm, and then, using Grothendieck-Pietsch factorization, in the spectral norm.

The lemmas of this and the next section are best understood for $m=n$, $I=J=[n]$ and $\alpha=1$. In this case, we are working with the entire adjacency matrix and trying to make the first step in the iterative procedure. The further steps will require us to handle smaller blocks $I\times J$; the parameter $\alpha$ will then become smaller in order to achieve better concentration for smaller blocks.

Lemma 3.3 (Concentration in the $\ell_{\infty}\rightarrow\ell_{2}$ norm).

Let $1\leq m\leq n$ and $\alpha\geq m/n$. Then for $r\geq 1$ the following holds with probability at least $1-n^{-r}$. Consider a block $I\times J$ of size $m\times m$. Let $I^{\prime}$ be the set of indices of the rows of $A_{I\times J}$ that contain at most $\alpha d$ ones. Then

\|(A-\operatorname{\mathbb{E}}A)_{I^{\prime}\times J}\|_{\infty\rightarrow 2} \leq C\sqrt{\alpha dmr\log(en/m)}.    (3.2)
Proof.

By definition,

\|(A-\operatorname{\mathbb{E}}A)_{I^{\prime}\times J}\|_{\infty\to 2}^{2} = \max_{x\in\{-1,1\}^{m}}\sum_{i\in I^{\prime}}\Big(\sum_{j\in J}(A_{ij}-\operatorname{\mathbb{E}}A_{ij})x_{j}\Big)^{2} = \max_{x\in\{-1,1\}^{m}}\sum_{i\in I}\big(X_{i}\xi_{i}\big)^{2}    (3.3)

where we denoted

X_{i} := \sum_{j\in J}(A_{ij}-\operatorname{\mathbb{E}}A_{ij})x_{j}, \qquad \xi_{i} := {\textbf{1}}_{\left\{\sum_{j\in J}A_{ij}\leq\alpha d\right\}}.

Let us first fix a block $I\times J$ and a vector $x\in\{-1,1\}^{m}$. Let us analyze the independent random variables $X_{i}\xi_{i}$. Since $|X_{i}|\leq\sum_{j\in J}|A_{ij}-\operatorname{\mathbb{E}}A_{ij}|\leq\sum_{j\in J}A_{ij}$, it follows by the definition of $\xi_{i}$ that

|X_{i}\xi_{i}| \leq \alpha d.    (3.4)

A more useful bound will follow from Bernstein's inequality. Indeed, $X_{i}$ is a sum of $m$ independent random variables with zero means and variances at most $d/n$. By Bernstein's inequality, for any $t>0$ we have

\mathbb{P}\left\{|X_{i}\xi_{i}|>tm\right\} \leq \mathbb{P}\left\{|X_{i}|>tm\right\} \leq 2\exp\left(\frac{-mt^{2}/2}{d/n+t/3}\right), \quad t\geq 0.    (3.5)

For $tm\leq\alpha d$, this can be further bounded by $2\exp(-m^{2}t^{2}/4\alpha d)$, once we use the assumption $\alpha\geq m/n$. For $tm>\alpha d$, the left-hand side of (3.5) is automatically zero by (3.4). Therefore

\mathbb{P}\left\{|X_{i}\xi_{i}|>tm\right\} \leq 2\exp\left(\frac{-m^{2}t^{2}}{4\alpha d}\right), \qquad t\geq 0.    (3.6)

We are now ready to bound the right-hand side of (3.3). By (3.6), the random variable $X_{i}\xi_{i}$ is sub-gaussian (for definitions and basic facts about sub-gaussian random variables, see e.g. [43]) with sub-gaussian norm at most $\sqrt{\alpha d}$. It follows that $(X_{i}\xi_{i})^{2}$ is sub-exponential with sub-exponential norm at most $2\alpha d$. Using Bernstein's inequality for sub-exponential random variables (see Corollary 5.17 in [43]), we have

\mathbb{P}\left\{\sum_{i\in I}\big(X_{i}\xi_{i}\big)^{2}>\varepsilon m\alpha d\right\} \leq 2\exp\left[-c\min\left(\varepsilon^{2},\varepsilon\right)m\right], \qquad \varepsilon\geq 0.    (3.7)

Choosing $\varepsilon:=(10/c)r\log(en/m)$, we bound this probability by $(en/m)^{-5rm}$.

Summarizing, we have proved that for fixed $I,J\subseteq[n]$ and $x\in\{-1,1\}^{m}$, with probability at least $1-(en/m)^{-5rm}$, the following holds:

\sum_{i\in I}\big(X_{i}\xi_{i}\big)^{2} \leq (10/c)\,r\log(en/m)\cdot m\alpha d.    (3.8)

Taking a union bound over all possibilities of $m,I,J,x$ and using (3.3), (3.8), we see that the conclusion of the lemma holds with probability at least

1-\sum_{m=1}^{n}2^{m}\binom{n}{m}^{2}\left(\frac{en}{m}\right)^{-5rm} \geq 1-n^{-r}.

The proof is complete. ∎

Applying Lemma 3.3 followed by Grothendieck-Pietsch factorization (Theorem 3.2), we obtain the following.

Lemma 3.4 (Concentration in spectral norm).

Let $1\leq m\leq n$ and $\alpha\geq m/n$. Then for $r\geq 1$ the following holds with probability at least $1-n^{-r}$. Consider a block $I\times J$ of size $m\times m$. Let $I^{\prime}$ be the set of indices of the rows of $A_{I\times J}$ that contain at most $\alpha d$ ones. Then one can find a subset $J^{\prime}\subseteq J$ of at least $3m/4$ columns such that

\|(A-\operatorname{\mathbb{E}}A)_{I^{\prime}\times J^{\prime}}\| \leq C\sqrt{\alpha dr\log(en/m)}.    (3.9)

3.4. Restricted degrees

The two simple lemmas of this section will help us to handle the part of the adjacency matrix outside the core block constructed in Lemma 3.4. First, we show that almost all rows have at most $O(\alpha d)$ ones, and thus are included in the core block.

Lemma 3.5 (Degrees of subgraphs).

Let $1\leq m\leq n$ and $\alpha\geq\sqrt{m/n}$. Then for $r\geq 1$ the following holds with probability at least $1-n^{-r}$. Consider a block $I\times J$ of size $m\times m$. Then all but $m/\alpha d$ rows of $A_{I\times J}$ have at most $8r\alpha d$ ones.

Proof.

Fix a block $I\times J$, and denote by $d_{i}$ the number of ones in the $i$-th row of $A_{I\times J}$. Then $\operatorname{\mathbb{E}}d_{i}\leq md/n$ by the assumption. Using Chernoff's inequality, we obtain

\mathbb{P}\left\{d_{i}>8r\alpha d\right\} \leq \Big(\frac{8r\alpha d}{emd/n}\Big)^{-8r\alpha d} \leq \Big(\frac{2\alpha n}{m}\Big)^{-8r\alpha d} =: p.

Let $S$ be the number of rows $i$ with $d_{i}>8r\alpha d$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies

\mathbb{P}\left\{S>m/\alpha d\right\} \leq (ep\alpha d)^{m/\alpha d} \leq p^{m/2\alpha d} = \Big(\frac{2\alpha n}{m}\Big)^{-4rm}.

The second inequality here holds since $e\alpha d\leq p^{-1/2}$. (To see this, notice that the definition of $p$ and the assumption on $\alpha$ imply that $p^{-1/2}=(2\alpha n/m)^{4r\alpha d}\geq 2^{4r\alpha d}$.)

It remains to take a union bound over all blocks $I\times J$. We obtain that the conclusion of the lemma holds with probability at least

1-\sum_{m=1}^{n}\binom{n}{m}^{2}\Big(\frac{2\alpha n}{m}\Big)^{-4rm} \geq 1-n^{-r}.

In the last inequality we used the assumption that $\alpha\geq\sqrt{m/n}$. The proof is complete. ∎

Next, we handle the block of rows that do have too many ones. We show that most columns of this block have $O(1)$ ones.

Lemma 3.6 (More on degrees of subgraphs).

Let $1\leq m\leq n$ and $\alpha\geq\sqrt{m/n}$. Then for $r\geq 1$ the following holds with probability at least $1-n^{-r}$. Consider a block $I\times J$ of size $k\times m$ with some $k\leq m/\alpha d$. Then all but $m/4$ columns of $A_{I\times J}$ have at most $32r$ ones.

Proof.

Fix $I$ and $J$, and denote by $d_{j}$ the number of ones in the $j$-th column of $A_{I\times J}$. Then $\operatorname{\mathbb{E}}d_{j}\leq kd/n\leq m/\alpha n$ by assumption. Using Chernoff's inequality, we have

\mathbb{P}\left\{d_{j}>32r\right\} \leq \Big(\frac{32r}{em/\alpha n}\Big)^{-32r} \leq \Big(\frac{10\alpha n}{m}\Big)^{-32r} =: p.

Let $S$ be the number of columns $j$ with $d_{j}>32r$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies

\mathbb{P}\left\{S>m/4\right\} \leq (4ep)^{m/4} \leq p^{m/6} \leq \Big(\frac{10\alpha n}{m}\Big)^{-5rm}.

The second inequality here holds since $4e\leq p^{-1/3}$, which in turn follows by the assumption on $\alpha$.

It remains to take a union bound over all blocks $I\times J$. It is enough to consider the blocks with the largest possible number of rows, thus with $k=\lceil m/\alpha d\rceil$. We obtain that the conclusion of the lemma holds with probability at least

1-\sum_{m=1}^{n}\binom{n}{m}\binom{n}{\lceil m/\alpha d\rceil}\Big(\frac{10\alpha n}{m}\Big)^{-5rm} \geq 1-n^{-r}.

In the last inequality we used the assumption that $\alpha\geq\sqrt{m/n}$. The proof is complete. ∎

3.5. Iterative decomposition: proof of Theorem 2.6

Finally, we combine the tools we developed so far, and we construct an iterative decomposition of the adjacency matrix in the way we outlined in Section 3.1. Let us set up one step of this procedure, where we consider an $m\times m$ block and decompose almost all of it (everything except an $m/2\times m/2$ block) into classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ satisfying the conclusion of Theorem 2.6. Once we can do this, we repeat the procedure for the $m/2\times m/2$ block, etc.

Lemma 3.7 (Decomposition of a block).

Let $1\leq m\leq n$ and $\alpha\geq\sqrt{m/n}$. Then for $r\geq 1$ the following holds with probability at least $1-3n^{-r}$. Consider a block $I\times J$ of size $m\times m$. Then there exists an exceptional sub-block $I_{1}\times J_{1}$ with dimensions at most $m/2\times m/2$ such that the remaining part of the block, that is $(I\times J)\setminus(I_{1}\times J_{1})$, can be decomposed into three classes $\mathcal{N}$, $\mathcal{R}\subset(I\setminus I_{1})\times J$ and $\mathcal{C}\subset I\times(J\setminus J_{1})$ so that the following holds.

  • The graph concentrates on $\mathcal{N}$, namely $\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}}\|\leq Cr^{3/2}\sqrt{\alpha d\log(en/m)}$.

  • Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones.

Moreover, $\mathcal{R}$ intersects at most $n/\alpha d$ columns and $\mathcal{C}$ intersects at most $n/\alpha d$ rows of $I\times J$.

After a permutation of rows and columns, a decomposition of the block stated in Lemma 3.7 can be visualized in Figure 4(c).

Figure 4. Construction of a block decomposition in Lemma 3.7: (a) Initial step. (b) Repeat for transpose. (c) Final decomposition.
Proof.

Since we are going to use Lemmas 3.4, 3.5 and 3.6, let us fix a realization of the random graph that satisfies the conclusions of those three lemmas.

By Lemma 3.5, all but $m/\alpha d$ rows of $A_{I\times J}$ have at most $8r\alpha d$ ones; let us denote by $I^{\prime}$ the set of indices of those rows with at most $8r\alpha d$ ones. Then we can use Lemma 3.4 for the block $I^{\prime}\times J$ and with $\alpha$ replaced by $8r\alpha$; the choice of $I^{\prime}$ ensures that all rows have small numbers of ones, as required in that lemma. To control the rows outside $I^{\prime}$, we may use Lemma 3.6 for $(I\setminus I^{\prime})\times J$; as we already noted, this block has at most $m/\alpha d$ rows, as required in that lemma. Intersecting the good sets of columns produced by those two lemmas, we obtain a set of at most $m/2$ exceptional columns $J_{1}\subset J$ such that the following holds.

  • On the block $\mathcal{N}_{1}:=I^{\prime}\times(J\setminus J_{1})$, we have $\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}_{1}}\|\leq Cr^{3/2}\sqrt{\alpha d\log(en/m)}$.

  • For the block $\mathcal{C}:=(I\setminus I^{\prime})\times(J\setminus J_{1})$, all columns of $A_{\mathcal{C}}$ have at most $32r$ ones.

Figure 4(a) illustrates the decomposition of the block $I\times J$ into the set of exceptional columns indexed by $J_{1}$ and the good sets $\mathcal{N}_{1}$ and $\mathcal{C}$.

To finish the proof, we apply the above argument to the transpose $A^{\mathsf{T}}$ on the block $J\times I$. To be precise, we start with the set $J^{\prime}\subset J$ of all but $m/\alpha d$ small columns of $A_{I\times J}$ (those with at most $8r\alpha d$ ones); then we obtain an exceptional set $I_{1}\subset I$ of at most $m/2$ rows; and finally we conclude that $A$ concentrates on the block $\mathcal{N}_{2}:=(I\setminus I_{1})\times J^{\prime}$ and has small rows on the block $\mathcal{R}:=(I\setminus I_{1})\times(J\setminus J^{\prime})$. Figure 4(b) illustrates this decomposition.

It only remains to combine the decompositions for $A$ and $A^{\mathsf{T}}$; Figure 4(c) illustrates a result of the combination. Once we define $\mathcal{N}:=\mathcal{N}_{1}\cup\mathcal{N}_{2}$, it becomes clear that $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ have the required properties. (It may happen that an entry ends up in more than one of the classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$; in such cases, we split the tie arbitrarily.) ∎

Proof of Theorem 2.6.

Let us fix a realization of the random graph that satisfies the conclusion of Lemma 3.7. Applying that lemma for $m=n$ and with $\alpha=1$, we decompose the set of edges $[n]\times[n]$ into three classes $\mathcal{N}_{0}$, $\mathcal{C}_{0}$ and $\mathcal{R}_{0}$ plus an $n/2\times n/2$ exceptional block $I_{1}\times J_{1}$. Apply Lemma 3.7 again, this time for the block $I_{1}\times J_{1}$, for $m=n/2$ and with $\alpha=\sqrt{1/2}$. We decompose $I_{1}\times J_{1}$ into $\mathcal{N}_{1}$, $\mathcal{C}_{1}$ and $\mathcal{R}_{1}$ plus an $n/4\times n/4$ exceptional block $I_{2}\times J_{2}$.

Repeat this process for \alpha=\sqrt{m/n}, where m is the running size of the block; we halve this size at each step, and so we have \alpha_{i}\leq 2^{-i/2}. Figure 3(c) illustrates a decomposition that we may obtain this way. In a finite number of steps (actually, in O(\log n) steps) the exceptional block becomes empty, and the process terminates. At that point we have decomposed the set of edges [n]\times[n] into \mathcal{N}, \mathcal{R} and \mathcal{C}, defined as the union of \mathcal{N}_{i}, \mathcal{C}_{i} and \mathcal{R}_{i} respectively, which we obtained at each step. It is clear that \mathcal{R} and \mathcal{C} satisfy the required properties.

It remains to bound the deviation of A on \mathcal{N}. By construction, \mathcal{N}_{i} satisfies

\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}_{i}}\|\leq Cr^{3/2}\sqrt{\alpha_{i}d\log(e/\alpha_{i}^{2})}.

Thus, by the triangle inequality we have

\|(A-\operatorname{\mathbb{E}}A)_{\mathcal{N}}\|\leq\sum_{i\geq 0}Cr^{3/2}\sqrt{\alpha_{i}d\log(e/\alpha_{i}^{2})}\leq C^{\prime}r^{3/2}\sqrt{d}.

In the second inequality we used that \alpha_{i}\leq 2^{-i/2}, so the terms decay like 2^{-i/4} up to a logarithmic factor and the series converges. The proof of Theorem 2.6 is complete. ∎

3.6. Replacing the degrees by the \ell_{2} norms in Theorem 2.1

Let us now prove the “moreover” part of Theorem 2.1, where d^{\prime} is the maximal squared \ell_{2} norm of the rows and columns of the regularized adjacency matrix A^{\prime}. This is clearly a stronger statement than the main part of the theorem. Indeed, since all entries of A^{\prime} are bounded in absolute value by 1, each degree, being the \ell_{1} norm of a row, is bounded below by the squared \ell_{2} norm.

This strengthening is in fact easy to check. To do so, note that the definition of d^{\prime} was used only once in the proof of Theorem 2.1, namely in Step 2 where we bounded the norms of A^{\prime}_{\mathcal{R}} and A^{\prime}_{\mathcal{C}}. Thus, to obtain the strengthening, it is enough to replace the application of Lemma 2.7 there by the following lemma.

Lemma 3.8.

Consider a matrix B with entries in [0,1]. Suppose each row of B has at most a non-zero entries, and each column has \ell_{2} norm at most \sqrt{b}. Then \|B\|\leq\sqrt{ab}.

Proof.

To prove the claim, let x be a vector with \|x\|_{2}=1. Using the Cauchy-Schwarz inequality and the assumptions, we have

\|Bx\|_{2}^{2}=\sum_{j}\Big(\sum_{i}B_{ij}x_{i}\Big)^{2}\leq\sum_{j}\Big(\sum_{i:\,B_{ij}\neq 0}B_{ij}^{2}\sum_{i:\,B_{ij}\neq 0}x_{i}^{2}\Big)
\leq\sum_{j}\Big(b\sum_{i:\,B_{ij}\neq 0}x_{i}^{2}\Big)=b\sum_{i}x_{i}^{2}\sum_{j:\,B_{ij}\neq 0}1\leq b\sum_{i}x_{i}^{2}\,a=ab.

Since x is arbitrary, this completes the proof. ∎
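The bound of Lemma 3.8 is easy to test numerically. The following Python sketch is our own illustration, not part of the paper's argument; the matrix generator and parameter values are arbitrary choices. It draws a random 0/1 matrix with a fixed number of ones per row and compares its spectral norm with \sqrt{ab}, where b is the largest squared \ell_{2} norm of a column.

```python
# Numerical sanity check of Lemma 3.8 (illustration only, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def random_row_sparse(n, a):
    """Random 0/1 matrix with exactly `a` ones in every row."""
    B = np.zeros((n, n))
    for i in range(n):
        B[i, rng.choice(n, size=a, replace=False)] = 1.0
    return B

n, a = 500, 10
B = random_row_sparse(n, a)
b = (B ** 2).sum(axis=0).max()          # largest squared column ell_2 norm
spectral_norm = np.linalg.norm(B, ord=2)
print(f"||B|| = {spectral_norm:.2f},  sqrt(a*b) = {np.sqrt(a * b):.2f}")
assert spectral_norm <= np.sqrt(a * b) + 1e-8   # the bound of Lemma 3.8
```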

4. Concentration of the regularized Laplacian

In this section, we state the following formal version of Theorem 1.2, and we deduce it from concentration of adjacency matrices (Theorem 2.1).

Theorem 4.1 (Concentration of regularized Laplacians).

Consider a random graph from the inhomogeneous Erdös-Rényi model, and let d be as in (1.3). Choose a number \tau>0. Then, for any r\geq 1, with probability at least 1-e^{-r} one has

\|\mathcal{L}(A_{\tau})-\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau})\|\leq\frac{Cr^{2}}{\sqrt{\tau}}\Big(1+\frac{d}{\tau}\Big)^{5/2}.
Proof.

Two sources contribute to the deviation of the Laplacian: the deviation of the adjacency matrix and the deviation of the degrees. Let us separate and bound them individually.

Step 1. Decomposing the deviation. Let us denote \bar{A}:=\operatorname{\mathbb{E}}A for simplicity; then

E:=\mathcal{L}(A_{\tau})-\mathcal{L}(\bar{A}_{\tau})=D_{\tau}^{-1/2}A_{\tau}D_{\tau}^{-1/2}-\bar{D}_{\tau}^{-1/2}\bar{A}_{\tau}\bar{D}_{\tau}^{-1/2}.

Here D_{\tau}=\operatorname{diag}(d_{i}+\tau) and \bar{D}_{\tau}=\operatorname{diag}(\bar{d}_{i}+\tau) are the diagonal matrices with the degrees of A_{\tau} and \bar{A}_{\tau} on the diagonal, respectively. Using the fact that A_{\tau}-\bar{A}_{\tau}=A-\bar{A}, we can represent the deviation as

E=S+T\quad\text{where}\quad S=D_{\tau}^{-1/2}(A-\bar{A})D_{\tau}^{-1/2},\quad T=D_{\tau}^{-1/2}\bar{A}_{\tau}D_{\tau}^{-1/2}-\bar{D}_{\tau}^{-1/2}\bar{A}_{\tau}\bar{D}_{\tau}^{-1/2}.

Let us bound S and T separately.
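Before bounding S and T, it may help to see the decomposition of Step 1 in code. The following Python sketch is our own illustration; it assumes the regularization A_\tau=A+(\tau/n)\mathbf{1}\mathbf{1}^{\mathsf{T}} and the normalization \mathcal{L}(M)=D^{-1/2}MD^{-1/2} appearing in the displays above, and uses a homogeneous Erdös-Rényi graph as a stand-in for the general model. It verifies that the deviation splits exactly as E=S+T.

```python
# Check the algebraic identity E = S + T from Step 1 (illustration only).
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 300, 5.0, 5.0
p = d / n

Abar = np.full((n, n), p)                      # stand-in for the expected adjacency matrix
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric 0/1 adjacency matrix

A_tau, Abar_tau = A + tau / n, Abar + tau / n
Dinv = np.diag(1 / np.sqrt(A_tau.sum(axis=1)))         # D_tau^{-1/2}
Dbar_inv = np.diag(1 / np.sqrt(Abar_tau.sum(axis=1)))  # \bar D_tau^{-1/2}

E = Dinv @ A_tau @ Dinv - Dbar_inv @ Abar_tau @ Dbar_inv
S = Dinv @ (A - Abar) @ Dinv
T = Dinv @ Abar_tau @ Dinv - Dbar_inv @ Abar_tau @ Dbar_inv
print(np.linalg.norm(E - (S + T)))             # ~1e-16: E = S + T holds exactly
```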

Step 2. Bounding S. Let us introduce a diagonal matrix \Delta that should be easier to work with than D_{\tau}. Set \Delta_{ii}=1 if d_{i}\leq 8rd and \Delta_{ii}=d_{i}/\tau+1 otherwise. Then the entries of \tau\Delta are upper bounded by the corresponding entries of D_{\tau}, and so

\tau\|S\|\leq\|\Delta^{-1/2}(A-\bar{A})\Delta^{-1/2}\|.

Next, by the triangle inequality,

\tau\|S\|\leq\|\Delta^{-1/2}A\Delta^{-1/2}-\bar{A}\|+\|\bar{A}-\Delta^{-1/2}\bar{A}\Delta^{-1/2}\|=:R_{1}+R_{2}. (4.1)

In order to bound R_{1}, we use Theorem 2.1 to show that A^{\prime}:=\Delta^{-1/2}A\Delta^{-1/2} concentrates around \bar{A}. This should be possible because A^{\prime} is obtained from A by reducing the degrees that are bigger than 8rd. To apply the “moreover” part of Theorem 2.1, let us check the magnitude of the \ell_{2} norms of the rows A^{\prime}_{i} of A^{\prime}:

\|A^{\prime}_{i}\|_{2}^{2}=\sum_{j=1}^{n}\frac{A_{ij}}{\Delta_{ii}\Delta_{jj}}\leq\frac{d_{i}}{\Delta_{ii}}\leq\max(8rd,\,\tau).

Here in the first inequality we used that \Delta_{jj}\geq 1 and \sum_{j}A_{ij}=d_{i}; the second inequality follows by definition of \Delta_{ii}. Applying Theorem 2.1, we obtain with probability 1-n^{-r} that

R_{1}=\|A^{\prime}-\bar{A}\|\leq C_{1}r^{2}(\sqrt{d}+\sqrt{\tau}).
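The reweighting by \Delta used in this step is simple to implement. The following Python sketch is our own illustration (again with a homogeneous Erdös-Rényi graph standing in for the general model and arbitrary parameter values); it builds \Delta exactly as defined above and checks the row bound \|A^{\prime}_{i}\|_{2}^{2}\leq\max(8rd,\tau) that was just verified.

```python
# Degree reweighting A' = Delta^{-1/2} A Delta^{-1/2} from Step 2 (illustration only).
import numpy as np

rng = np.random.default_rng(2)
n, d, tau, r = 1000, 5.0, 5.0, 1.0
p = d / n

A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1); A = A + A.T                 # adjacency matrix
deg = A.sum(axis=1)                            # degrees d_i

# Delta_ii = 1 if d_i <= 8rd, and d_i/tau + 1 otherwise
Delta = np.where(deg <= 8 * r * d, 1.0, deg / tau + 1.0)
A_prime = A / np.sqrt(np.outer(Delta, Delta))  # Delta^{-1/2} A Delta^{-1/2}

row_sq_norms = (A_prime ** 2).sum(axis=1)
print(row_sq_norms.max(), max(8 * r * d, tau))
assert row_sq_norms.max() <= max(8 * r * d, tau) + 1e-8
```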

To bound R_{2}, we note that by construction of \Delta, the matrices \bar{A} and \Delta^{-1/2}\bar{A}\Delta^{-1/2} coincide on the block I\times I, where I is the set of vertices satisfying d_{i}\leq 8rd. This block is very large – indeed, Lemma 3.5 implies that |I^{c}|\leq n/d with probability 1-n^{-r}. Outside this block, i.e. on the small blocks I^{c}\times[n] and [n]\times I^{c}, the entries of \bar{A}-\Delta^{-1/2}\bar{A}\Delta^{-1/2} are bounded by the corresponding entries of \bar{A}, which are all bounded by d/n. Thus, using Lemma 2.7, we have

R_{2}\leq\|\bar{A}_{I^{c}\times[n]}\|+\|\bar{A}_{[n]\times I^{c}}\|\leq 2\sqrt{d}.

Substituting the bounds for R_{1} and R_{2} into (4.1), we conclude that

\|S\|\leq\frac{C_{2}r^{2}}{\tau}(\sqrt{d}+\sqrt{\tau})

with probability at least 1-2n^{-r}.

Step 3. Bounding T. Bounding the spectral norm by the Hilbert-Schmidt norm, we get

\|T\|^{2}\leq\|T\|_{\mathrm{HS}}^{2}=\sum_{i,j=1}^{n}T_{ij}^{2},\quad\text{where}\quad T_{ij}=(\bar{A}_{ij}+\tau/n)\Big[1/\sqrt{\delta_{ij}}-1/\sqrt{\bar{\delta}_{ij}}\Big]

and \delta_{ij}=(d_{i}+\tau)(d_{j}+\tau) and \bar{\delta}_{ij}=(\bar{d}_{i}+\tau)(\bar{d}_{j}+\tau). To bound T_{ij}, we note that

0\leq\bar{A}_{ij}+\tau/n\leq\frac{d+\tau}{n}\quad\text{and}\quad\big|1/\sqrt{\delta_{ij}}-1/\sqrt{\bar{\delta}_{ij}}\big|=\left|\frac{\delta_{ij}-\bar{\delta}_{ij}}{\delta_{ij}\sqrt{\bar{\delta}_{ij}}+\bar{\delta}_{ij}\sqrt{\delta_{ij}}}\right|\leq\frac{|\delta_{ij}-\bar{\delta}_{ij}|}{2\tau^{3}}.

Recalling the definition of \delta_{ij} and \bar{\delta}_{ij} and adding and subtracting (d_{i}+\tau)(\bar{d}_{j}+\tau), we have

\delta_{ij}-\bar{\delta}_{ij}=(d_{i}+\tau)(d_{j}-\bar{d}_{j})+(\bar{d}_{j}+\tau)(d_{i}-\bar{d}_{i}).

So, using the inequality (a+b)^{2}\leq 2(a^{2}+b^{2}) and bounding \bar{d}_{j}+\tau by d+\tau, we obtain

\|T\|^{2}\leq\frac{(d+\tau)^{2}}{n^{2}\tau^{6}}\Big[\sum_{i=1}^{n}(d_{i}+\tau)^{2}\sum_{j=1}^{n}(d_{j}-\bar{d}_{j})^{2}+n(d+\tau)^{2}\sum_{i=1}^{n}(d_{i}-\bar{d}_{i})^{2}\Big]. (4.2)

We claim that

\sum_{j=1}^{n}(d_{j}-\bar{d}_{j})^{2}\leq C_{3}r^{2}nd\quad\text{with probability }1-e^{-2r}. (4.3)

Indeed, since the variance of each d_{i} is bounded by d, the expectation of the sum in (4.3) is bounded by nd. To upgrade the variance bound to an exponential deviation bound, one can use one of several standard methods. For example, Bernstein’s inequality implies that X_{i}:=d_{i}-\bar{d}_{i} satisfies \mathbb{P}\left\{|X_{i}|>C_{4}t\sqrt{d}\right\}\leq e^{-t} for all t\geq 1. This means that the random variable X_{i}^{2} belongs to the Orlicz space L_{\psi_{1/2}} and has norm \|X_{i}^{2}\|_{\psi_{1/2}}\leq C_{3}d, see [26]. By the triangle inequality, we conclude that \|\sum_{i=1}^{n}X_{i}^{2}\|_{\psi_{1/2}}\leq C_{3}nd, which in turn implies (4.3).
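The claim (4.3) is also easy to check by simulation. The following Python sketch is our own illustration, using a homogeneous Erdös-Rényi graph with edge probability d/n; it compares the squared deviation of the degrees with nd, which is the order predicted by the variance calculation above.

```python
# Empirical check of (4.3): sum_j (d_j - dbar_j)^2 is of order n*d (illustration only).
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 10.0
p = d / n

A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1); A = A + A.T                 # adjacency matrix without self-loops
deg = A.sum(axis=1)
deg_bar = p * (n - 1)                          # expected degree

dev = np.sum((deg - deg_bar) ** 2)
print(f"sum_j (d_j - dbar_j)^2 = {dev:.0f},  n*d = {n * d:.0f}")
```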

Furthermore, (4.3) implies

\sum_{i=1}^{n}(d_{i}+\tau)^{2}\leq 2\sum_{i=1}^{n}(d_{i}-\bar{d}_{i})^{2}+2\sum_{i=1}^{n}(\bar{d}_{i}+\tau)^{2}\leq 2C_{3}r^{2}nd+2n(d+\tau)^{2}\leq C_{5}r^{2}n(d+\tau)^{2}.

Substituting this bound and (4.3) into (4.2) we conclude that

\|T\|^{2}\leq\frac{(d+\tau)^{2}}{n^{2}\tau^{6}}\cdot C_{3}r^{2}nd\,\Big[C_{5}r^{2}n(d+\tau)^{2}+n(d+\tau)^{2}\Big]\leq\frac{C_{6}r^{4}}{\tau}\Big(1+\frac{d}{\tau}\Big)^{5}.

It remains to substitute the bounds for S and T into the inequality \|E\|\leq\|S\|+\|T\|, and simplify the expression. The resulting bound holds with probability at least 1-n^{-r}-n^{-r}-e^{-2r}\geq 1-e^{-r}, as claimed. ∎
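To illustrate Theorem 4.1 numerically, the following Python sketch (our own illustration, for a homogeneous Erdös-Rényi graph with expected degree below \log n and with the natural choice \tau=d) computes the deviation \|\mathcal{L}(A_{\tau})-\mathcal{L}(\operatorname{\mathbb{E}}A_{\tau})\| and compares it with 1/\sqrt{d}; by the theorem, the former should stay within a constant multiple of the latter even though the unregularized graph typically contains isolated vertices.

```python
# Empirical illustration of Theorem 4.1 with tau = d (illustration only).
import numpy as np

rng = np.random.default_rng(4)

def laplacian(M):
    Dinv = np.diag(1.0 / np.sqrt(M.sum(axis=1)))
    return Dinv @ M @ Dinv

n, d = 2000, 3.0            # sparse regime: d < log(n), so isolated vertices appear
tau = d
p = d / n

A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1); A = A + A.T
Abar = np.full((n, n), p)   # stand-in for E A (diagonal treated as p for simplicity)

A_tau, Abar_tau = A + tau / n, Abar + tau / n
dev = np.linalg.norm(laplacian(A_tau) - laplacian(Abar_tau), ord=2)
print(f"||L(A_tau) - L(E A_tau)|| = {dev:.3f},  1/sqrt(d) = {1 / np.sqrt(d):.3f}")
```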

5. Further questions

5.1. Optimal regularization?

The main point of our paper was that regularization helps sparse graphs to concentrate. We have discussed several kinds of regularization in Section 1.4 and mentioned some more in Section 1.4. We found that any meaningful regularization works, as long as it reduces degrees that are too high and increases degrees that are too low. Is there an optimal way to regularize a graph? Designing the best “preprocessing” of sparse graphs for spectral algorithms is especially interesting from the applied perspective, i.e. for real-world networks.

On the theoretical level, can regularization of sparse graphs produce the same optimal bound 2\sqrt{d}(1+o(1)) that we saw for dense graphs in (1.1)? In other words, an ideal regularization should bring all parasitic outliers of the spectrum into the bulk. If so, this would lead to a potentially simple spectral clustering algorithm for community detection in networks that matches the theoretical lower bounds. Algorithms with optimal rates exist for this problem [33, 29], but their analysis is very technical.

5.2. How exactly does concentration depend on regularization?

It would be interesting to determine how exactly the concentration of the Laplacian depends on the regularization parameter \tau. The dependence in Theorem 4.1 is not optimal, and we have not made efforts to improve it. Although it is natural to choose \tau\sim d as in Theorem 1.2, choosing \tau\gg d could also be useful [23]. Choosing \tau\ll d may be interesting as well, for then \mathcal{L}(\operatorname{\mathbb{E}}A_{\tau})\approx\mathcal{L}(\operatorname{\mathbb{E}}A) and we obtain the concentration of \mathcal{L}(A_{\tau}) around the Laplacian of the expectation of the original (rather than regularized) matrix \operatorname{\mathbb{E}}A.

5.3. Average expected degree?

Both concentration results of this paper, Theorems 1.1 and 1.2, depend on d=\max_{ij}np_{ij}. Would it be possible to reduce d to the maximal expected degree d_{ave}=\max_{i}\sum_{j}p_{ij}?

5.4. From random graphs to random matrices?

Adjacency matrices of random graphs are particular examples of random matrices. Does the phenomenon we described, namely that regularization leads to concentration, apply to general random matrices? Guided by Theorem 1.1, we might expect the following for a broader class of random matrices B with mean zero independent entries. First, the only reason the spectral norm of B is too large (and that it is determined by outliers of the spectrum) could be the existence of a large row or column. Furthermore, it might be possible to reduce the norm of B (and thus bring the outliers into the bulk of the spectrum) by regularizing in some way the rows and columns that are too large. For related questions in random matrix theory, see the recent work [4, 21].

References

  • [1] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2016.
  • [2] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM J. Comput., (26):1733–1748, 1997.
  • [3] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097–2122, 2013.
  • [4] A. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Annals of Probability, to appear, 2014.
  • [5] R. Bhatia. Matrix Analysis. Springer-Verlag New York, 1996.
  • [6] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Natl. Acad. Sci. USA, 106:21068–21073, 2009.
  • [7] B. Bollobas, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms, 31:3–122, 2007.
  • [8] C. Bordenave, M. Lelarge, and L. Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. arXiv:1501.06087, 2015.
  • [9] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, 2013.
  • [10] T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015.
  • [11] K. Chaudhuri, F. Chung, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research Workshop and Conference Proceedings, 23:35.1 – 35.23, 2012.
  • [12] P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in sparse graphs: a spectral algorithm with optimal rate of recovery. arXiv:1501.05021, 2015.
  • [13] F. R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997.
  • [14] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84:066106, 2011.
  • [15] U. Feige and E. Ofek. Spectral techniques applied to sparse random graphs. Random Structures and Algorithms, 27(2):251–275, 2005.
  • [16] Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233–241, 1981.
  • [17] J. Friedman, J. Kahn, and E. Szemerédi. On the second eigenvalue in random regular graphs. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 587–598, 1989.
  • [18] C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving optimal misclassification proportion in stochastic block model. arXiv:1505.03772, 2015.
  • [19] O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck’s inequality. Probability Theory and Related Fields, to appear, 2014.
  • [20] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. arXiv:1412.6156, 2014.
  • [21] R. van Handel. On the spectral norm of inhomogeneous random matrices. arXiv:1502.05003, 2015.
  • [22] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: first steps. Social Networks, 5(2):109–137, 1983.
  • [23] A. Joseph and B. Yu. Impact of regularization on spectral clustering. Ann. Statist., 44(4):1765–1791, 2016.
  • [24] M. Krivelevich and B. Sudakov. The largest eigenvalue of sparse random graphs. Combinatorics, Probability and Computing, 12:61–72, 2003.
  • [25] M. Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. Amer. Math. Society, 2001.
  • [26] M. Ledoux and M. Talagrand. Probability in Banach spaces: Isoperimetry and processes. Springer-Verlag, Berlin, 1991.
  • [27] J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. Ann. Statist., 43(1):215–237, 2015.
  • [28] L. Lu and X. Peng. Spectra of edge-independent random graphs. The electronic journal of combinatorics, 20(4), 2013.
  • [29] L. Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC ’14, pages 694–703, 2014.
  • [30] F. McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science (FOCS), pages 529–537, 2001.
  • [31] A. Montanari and S. Sen. Semidefinite programs on sparse random graphs and their application to community detection. arXiv:1504.05910, 2015.
  • [32] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for binary symmetric block models. arXiv:1407.1591, 2014.
  • [33] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. arXiv:1311.4115, 2014.
  • [34] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 2014.
  • [35] R. Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. arXiv:0911.0600, 2010.
  • [36] A. Pietsch. Operator Ideals. North-Holland Amsterdam, 1978.
  • [37] G. Pisier. Factorization of linear operators and geometry of Banach spaces. Number 60 in CBMS Regional Conference Series in Mathematics. AMS, Providence, 1986.
  • [38] G. Pisier. Grothendieck’s theorem, past and present. Bulletin (New Series) of the American Mathematical Society, 49(2):237–323, 2012.
  • [39] T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120–3128, 2013.
  • [40] E. M. Stein and R. Shakarchi. Functional Analysis: Introduction to Further Topics in Analysis. Princeton University Press, 2011.
  • [41] N. Tomczak-Jaegermann. Banach-Mazur distances and finite-dimensional operator ideals. John Wiley & Sons, Inc., New York, 1989.
  • [42] J. A. Tropp. Column subset selection, matrix factorization, and eigenvalue optimization. Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 978–986, 2009.
  • [43] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed sensing: theory and applications. Cambridge University Press. Submitted.
  • [44] V. Vu. Spectral norm of random matrices. Combinatorica, 27(6):721–736, 2007.