
Local correlation clustering

Francesco Bonchi, David García-Soriano (Yahoo Labs, Barcelona, Spain; {bonchi, davidgs}@yahoo-inc.com) and Konstantin Kutzkov (IT University of Copenhagen, Copenhagen, Denmark; konk@itu.dk)
Abstract

Correlation clustering is perhaps the most natural formulation of clustering. Given $n$ objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters.

Despite its theoretical appeal, the practical relevance of correlation clustering remains largely unexplored. This is mainly due to the fact that correlation clustering requires the $\Theta(n^2)$ pairwise similarities as input. For large datasets these are infeasible to compute, or even to store.

In this paper we initiate the investigation into local algorithms for correlation clustering, laying the theoretical foundations for clustering “big data”. In local correlation clustering we are given the identifier of a single object and we want to return the cluster to which it belongs in some globally consistent near-optimal clustering, using a small number of similarity queries.

Local algorithms for correlation clustering open the door to sublinear-time algorithms, which are particularly useful when the similarity between items is costly to compute, as is often the case in many practical application domains. They also imply (i) distributed and streaming clustering algorithms, (ii) constant-time estimators and testers for cluster edit distance, and (iii) property-preserving parallel reconstruction algorithms for clusterability.

Specifically, we devise a local clustering algorithm attaining a $(3,\varepsilon)$-approximation (a solution with cost at most $3\cdot\mathrm{OPT}+\varepsilon n^2$, where $\mathrm{OPT}$ is the optimal cost). Its running time is $O(1/\varepsilon^2)$, independently of the dataset size. If desired, an explicit approximate clustering of all $n$ objects can be produced in time $O(n/\varepsilon)$ (which is provably optimal). We also provide a fully additive $(1,\varepsilon)$-approximation with local query complexity $\operatorname{poly}(1/\varepsilon)$ and time complexity $2^{\operatorname{poly}(1/\varepsilon)}$. The explicit clustering can be found in time $n\cdot\operatorname{poly}(1/\varepsilon)+2^{\operatorname{poly}(1/\varepsilon)}$. The latter yields the fastest polynomial-time approximation scheme for correlation clustering known to date.

1 Introduction

In correlation clustering (sometimes called clustering with qualitative information or cluster editing) we are given a set $V$ of $n$ objects and a pairwise similarity function $\operatorname{sim}:V\times V\rightarrow[0,1]$, and the goal is to cluster the items in such a way that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. Assuming that cluster identifiers are represented by natural numbers, a clustering $\operatorname{c\ell}$ is a function $\operatorname{c\ell}:V\rightarrow\mathbb{N}$. Correlation clustering aims at minimizing the cost:

$$\sum_{\substack{(x,y)\in V\times V,\\ \operatorname{c\ell}(x)=\operatorname{c\ell}(y)}}\bigl(1-\operatorname{sim}(x,y)\bigr)\;+\sum_{\substack{(x,y)\in V\times V,\\ \operatorname{c\ell}(x)\neq\operatorname{c\ell}(y)}}\operatorname{sim}(x,y).\qquad(1)$$

The intuition underlying the above problem definition is that if two objects $x$ and $y$ are assigned to the same cluster we should pay the amount of their dissimilarity $(1-\operatorname{sim}(x,y))$, while if they are assigned to different clusters we should pay the amount of their similarity $\operatorname{sim}(x,y)$.
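As a concrete illustration, the following minimal Python sketch evaluates the cost of Equation (1) for a given clustering. It is our own illustration: the arguments sim (a pairwise similarity matrix) and labels (the cluster map $\operatorname{c\ell}$) are hypothetical names, not part of the paper.

def correlation_clustering_cost(sim, labels):
  """Cost of Eq. (1), summed over ordered pairs (x, y); self-pairs are skipped,
  which matches Eq. (1) whenever sim(x, x) = 1."""
  n = len(labels)
  cost = 0.0
  for x in range(n):
    for y in range(n):
      if x == y:
        continue
      if labels[x] == labels[y]:
        cost += 1.0 - sim[x][y]    # same cluster: pay the dissimilarity
      else:
        cost += sim[x][y]          # different clusters: pay the similarity
  return cost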

In the most widely studied setting, the similarity function is binary, i.e., $\operatorname{sim}:V\times V\rightarrow\{0,1\}$. This setting can be viewed very conveniently through graph-theoretic lenses: the $n$ items correspond to the vertices of a similarity graph $G$, which is a complete undirected graph with edges labelled "+" or "-". An edge causes a disagreement (of cost 1) between the similarity graph and a clustering when it is a "+" edge connecting vertices in different clusters, or a "-" edge connecting vertices within the same cluster. If we were given a cluster graph [37] (or clusterable graph), i.e., a graph whose set of positive edges is the union of vertex-disjoint cliques, we would be able to produce a perfect (i.e., cost 0) clustering simply by computing the connected components of the positive graph. However, similarities will generally be inconsistent with one another, so incurring a certain cost is unavoidable. Correlation clustering aims at minimizing this cost. The problem can be viewed as an agnostic learning problem, where we try to approximate the adjacency function of $G$ by the hypothesis class of cluster graphs; alternatively, it is the task of finding the equivalence relation that most closely resembles a given symmetric relation $R$.

Correlation clustering provides a general framework in which one only needs to define a suitable similarity function. This makes it particularly appealing for the task of clustering structured objects, where the similarity function is domain-specific and does not rely on an ad hoc specification of some suitable metric such as the Euclidean distance of vectors. Thanks to this generality, the technique is applicable to a multitude of problems in different domains, including duplicate detection and similarity joins [27, 17], biology [11], image segmentation [30] and social networks [12].

Another key feature of correlation clustering is that it does not require a prespecified number of clusters; instead, it automatically finds the optimal number.

Despite its appeal, correlation clustering has so far been mainly of theoretical interest. This is due to its scaling behavior with the size of the input data: given $n$ items to be clustered, building the complete similarity graph $G$ requires $\Theta(n^2)$ similarity computations. For large $n$, the similarity graph $G$ may be infeasible to construct, or even to store. This is the main bottleneck of correlation clustering and the reason why its practical relevance remains largely unexplored.

The high-level contribution of our work is to overcome the main drawback of correlation clustering, making it scalable. We achieve this by designing algorithms that can construct a clustering in a local and distributed manner.

The input of a local clustering algorithm is the identifier of one of the $n$ objects to be clustered, along with a short random seed. After making a small number of oracle similarity queries (probes into the pairwise similarity matrix), a local algorithm outputs the cluster to which the object belongs in some globally consistent near-optimal clustering.

1.1 A model for local correlation clustering

In the following we focus on the binary case; we discuss the non-binary case together with other extensions in Section LABEL:sec:extensions. We work with the adjacency matrix model, which assumes oracle access to the input graph $G$. Namely, given $x,y\in V(G)$, we can ask whether $\{x,y\}$ is a positive edge of $G$; each query is charged unit cost. By explicitly finding a clustering we mean storing $\operatorname{c\ell}(v)$ for every $v\in V(G)$. In this explicit model a running time of $\Omega(n)$ is necessary, since all $n$ cluster labels must be specified. An algorithm with complexity $O(n)$ for (approximate) correlation clustering is already a significant improvement over the complexity of most current solutions, but we take a step further and ask whether the dependence on $n$ may be avoided altogether by producing implicit representations of the cluster mapping.

It is for this reason that we define local clustering as follows. Let us fix, for each finite graph $G$, a collection $\mathcal{C}^G$ of "high quality" clusterings for $G$.

Definition 1.1 (Local clustering algorithm)

Let $t\in\mathbb{N}$. A clustering algorithm $\mathcal{A}$ for $\mathcal{G}^n$ is said to be local with time (resp., query) complexity $t$ if, having oracle access to any graph $G$ and taking as input $|V(G)|$ and a vertex $v\in V(G)$, $\mathcal{A}$ returns a cluster label $\mathcal{A}^G(v)$ in time $t$ (resp., with $t$ queries).

Algorithm $\mathcal{A}$ implicitly defines a clustering, described by the cluster label function $\operatorname{c\ell}(v)=\mathcal{A}^G(v)$, where the same sequence $r$ of random bits is used by $\mathcal{A}$ to calculate $\mathcal{A}^G(v)$ for each $v$. The success probability of $\mathcal{A}$ is the infimum (over all graphs $G$) of the probability (over $r$) that the clustering implicitly defined by $\mathcal{A}^G$ belongs to $\mathcal{C}^G$.

Note that $t$ does not depend on $n=|V(G)|$: this means that the cluster label of each vertex can be computed in constant time independently of the others. On the other hand, $t$ could have a (hopefully mild) dependence on the desired quality of the clustering produced (which defines the set $\mathcal{C}^G$ for a given $G$) and on the success probability of $\mathcal{A}$. Finally, it is important to note that, in order to define a unique "global" clustering across different vertices, the same sequence $r$ of random coin flips must be used.
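The role of the shared randomness can be made concrete with a minimal Python sketch (our own illustration, with hypothetical names): every local call derives its coin flips from the same seed, so each call reconstructs the identical random sample and hence makes choices that are consistent with every other call, exactly as if one global clustering had been computed.

import random

def shared_sample(n, k, seed):
  """All local calls draw their randomness from the same seed, so each call
  reconstructs the identical sample (and hence the identical pivot choices)."""
  rng = random.Random(seed)
  return rng.sample(range(n), k)

# Two independent invocations, e.g. on different machines, agree:
assert shared_sample(1000, 5, seed=42) == shared_sample(1000, 5, seed=42)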

Sometimes we also allow local algorithms with preprocessing $p$, meaning (when $p$ denotes time complexity) that $\mathcal{A}$ is allowed to perform computations and queries using total time $p$ before reading the input vertex $v$. This preprocessing computation/query set is common to all vertices and may only depend on the outcome of $\mathcal{A}$'s internal coin tosses and the edges probed.

1.2 Contributions and practical implications

We focus on approximation algorithms for local correlation clustering with sublinear time and query complexity. Since any multiplicative approximation needs to make $\Omega(n^2)$ queries (Section LABEL:sec:lb), we need less stringent requirements. (We remark that in a different model that uses neighborhood oracles [4], it is possible to bypass the $\Omega(n^2)$ lower bound for multiplicative approximations that holds for edge queries. In fact, from our analysis we can derive the first sublinear-time constant-factor approximation algorithm for this case; see Section LABEL:sec:extensions.) One way is to allow an additional $\varepsilon$-fraction of edges to be violated, compared to the optimal clustering of cost $\mathrm{OPT}$. Following Parnas and Ron [33], we study $(c,\varepsilon)$-approximations: solutions with at most $c\cdot\mathrm{OPT}+\varepsilon\cdot n^2$ disagreements. These solutions form the set $\mathcal{C}^G$ of "high-quality" clusterings in Definition LABEL:def:local. Here $c$ is a small constant and $\varepsilon\in(0,1)$ is an accuracy parameter specified by the user. Essentially, $\varepsilon$ handles the trade-off between the desired accuracy and the running time: the larger $\varepsilon$, the faster the algorithm, but also the further from $\mathrm{OPT}$.

While we provide the formal statement of our results in Section LABEL:sec:results, here we highlight the main message of this paper: there exist efficient local clustering algorithms with good approximation guarantees. Namely, in time $t=\operatorname{poly}(1/\varepsilon)$ it is possible to obtain $(O(1),\varepsilon)$-approximations locally. (Typically we think of $\varepsilon$ as a user-defined constant.) This yields many practical contributions as by-products:

  • Explicit clustering in time $O(n)$. Given that $\operatorname{c\ell}(v)$ can be computed in time $t$ for each $v\in V(G)$, one can produce an explicit clustering in time $O(n\cdot t)$. Since $t=O(1)$, this is linear in the number of vertices (not edges) of the graph. More generally, the complexity of finding the clusters of a subset $S\subseteq V$ of vertices requested by the user is proportional to the size of this subset.

  • Distributed algorithms. We can assign vertices to different processors and compute their cluster labels in parallel, provided that the same random seed is passed along to all processors.

  • Streaming algorithms. Similarly, local clustering algorithms can cluster graphs in the streaming setting, where edges arrive in arbitrary order. In this case the sublinear behaviour is lost because we still need to process every edge. However, the memory footprint of the algorithm can be brought down from $\Omega(n^2)$ to $O(n)$ (the semi-streaming model [18]). Indeed, note that given a fixed random seed, for every vertex $v$ the set of all possible queries $Q_v$ that can be made during the computation of $\operatorname{c\ell}(v)$ has size at most $2^t$ (this bound can in fact be reduced to $t$ for the non-adaptive algorithms we devise). This set can be computed before any edge arrives. From then on it suffices to keep in memory the edges $(v,w)$ where $w\in Q_v$, and there are $n\cdot 2^t=O(n)$ of them. In fact, the running time of the local-based algorithm will be dominated by the time it takes to discard the unneeded edges.

  • Cluster edit distance estimators and testers. We can estimate the degree of clusterability of the input data in constant time by sampling pairs of vertices and using the local clustering algorithm to see how many of them disagree with the input graph (see the sketch after this list). We believe this can be an important primitive to develop new algorithms. Moreover, estimators for cluster edit distance give (tolerant) testers for the property of being clusterable, thereby allowing us to quickly detect data instances where any attempt to obtain a good clustering is bound to fail.

  • Local clustering reconstruction. Queries of the form "are $x,y$ in the same cluster?" can be answered in constant time without having to partition the whole graph: simply compute $\operatorname{c\ell}(x)$ and $\operatorname{c\ell}(y)$, and check for equality. This means that we can "correct" our input graph $G$ (a "corrupted" version of a clusterable graph) so that the modified graph we output is close to the input and satisfies the property of being clusterable. This fits the paradigm of local property-preserving data reconstruction of [3] and [35].
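The estimator mentioned above can be sketched as follows in Python (a minimal sketch of ours, not the paper's procedure; local_label and edge_oracle are hypothetical stand-ins for the local clustering algorithm and the similarity oracle). It samples random vertex pairs and counts how often the implied co-clustering decision disagrees with the edge label.

import random

def estimate_fractional_cost(n, local_label, edge_oracle, pairs=1000, seed=0):
  """Estimate the fraction of vertex pairs on which the clustering implicitly
  defined by `local_label` disagrees with the input graph."""
  rng = random.Random(seed)
  disagreements = 0
  for _ in range(pairs):
    x = rng.randrange(n)
    y = rng.randrange(n)
    while y == x:                           # resample to get a distinct pair
      y = rng.randrange(n)
    same_cluster = local_label(x) == local_label(y)
    positive_edge = edge_oracle(x, y)
    if same_cluster != positive_edge:       # "+" across clusters or "-" inside
      disagreements += 1
  return disagreements / pairs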

To the best of our knowledge, this is the first work on local algorithms for correlation clustering.

2 Background and related work

Correlation clustering. Minimizing disagreements is the same as maximizing agreements for exact algorithms, but the two tasks differ with regard to approximation. Following [25], we refer to these two problems as MaxAgree and MinDisagree, while MaxAgree[k] and MinDisagree[k] refer to the variants of the problem with a bound $k$ on the number of clusters. Not surprisingly, MaxAgree and MinDisagree are NP-complete [10, 37]; the same holds for their bounded counterparts, provided that $k\geq 2$. Therefore approximate solutions are of interest. For MaxAgree, there is a (randomized) PTAS: the first such result was due to Bansal et al. [10] and ran in time $n^2\exp(O(1/\varepsilon))$, later improved to $n\cdot 2^{\operatorname{poly}(1/\varepsilon)}$ by Giotis and Guruswami [25]. The latter also presented a PTAS for MaxAgree[k] that runs in time $n\cdot k^{O(\varepsilon^{-3}\log(k/\varepsilon))}$. In contrast, MinDisagree is APX-hard [14], so we do not expect a PTAS. Nevertheless, there are constant-factor approximation algorithms [10, 14, 2]. The best factor ($2.5$) was given by Ailon et al. [2], who also present a simple, elegant algorithm, called QuickCluster, that achieves a slightly weaker expected approximation ratio of $3$ (see Section LABEL:sec:main). For MinDisagree[k], PTASs appeared in [25] and [29]. There is also work on correlation clustering on incomplete graphs [10, 14, 41, 25, 17].

Sublinear clustering algorithms. Sublinear clustering algorithms for geometric data sets are known [5, 32, 9, 15, 16]. Many of these find implicit representations of the clustering they output. There is a natural implicit representation for most of these problems, e.g., the set of $k$ cluster centers. By contrast, in correlation clustering there may be no clear way to define a clustering for the whole graph based on a small set of vertices. The only sublinear-time algorithm known for correlation clustering is the aforementioned result of [25]; it runs in time $O(n)$, but the multiplicative constant hidden in the notation has an exponential dependence on the approximation parameter.

The literature on active clustering also contains algorithms with sublinear query complexity (see, e.g., [28]); many of them are heuristic or do not apply to correlation clustering. Ailon et al. [1] obtain algorithms for MinDisagree[k] with sublinear query complexity, but the running time of their solutions is exponential in $n$.

Local algorithms. The following notion of locality is used in the distributed computing literature. Each vertex of a sparse graph is assigned a processor, and each processor can compute a certain function in a constant number of rounds by passing messages to its neighbours (see Suomela’s survey [40]). Our algorithms are also local in this sense.

Recently, Rubinfeld et al. [34] introduced a model that encompasses notions from several algorithmic subfields, such as locally decodable codes, local reconstruction and local distributed computation. Our definition fits into their framework: it corresponds to query-oblivious, parallelizable, strongly local algorithms that compute a cluster label function in constant time.

Finally, we point out the work of Spielman and Teng [39] pertaining to local clustering algorithms. In their work, an algorithm is "local" if, given a vertex $v$, it can output $v$'s cluster in time nearly linear in the cluster's size. Our local clustering algorithms also have this ability (assuming, as they do, that for each vertex we are given a list of its neighbours), although the results are not comparable because [39] attempts to minimize the cluster's conductance.

Testing and estimating clusterability. Our methods can also be used for quickly testing clusterability of a given input graph $G$, which is related to the task of estimating the cluster edit distance, i.e., the minimum number of edge label swaps (from "+" to "-" and vice versa) needed to transform $G$ into a cluster graph. Note that this corresponds to the optimal cost of correlation clustering for the given input $G$. Clusterability is a hereditary graph property (closed under removal and renaming of vertices), hence it can be tested with one-sided error using a constant number of queries by the powerful result of Alon and Shapira [8]. Combined with the work of Fischer and Newman [20], this also yields estimators for cluster edit distance that run in time independent of the graph size. Unfortunately, the query complexity of the algorithm given by these results would be a tower exponential of height $\operatorname{poly}(1/\varepsilon)$, where $\varepsilon$ is the approximation parameter.

Approximation algorithms for MIN-2-CSP problems [7] also give estimators for cluster edit distance. However, they provide no way of computing each variable assignment in constant time. Moreover, they use time $\sim n^2$ to calculate all assignments, and hence do not lend themselves to sublinear-time clustering algorithms.

3 Statement of results

All our graphs are undirected and simple. For a vertex $v$, $\Gamma^+(v)$ is the set of positive edges incident with $v$; $\Gamma^-(v)$ is defined similarly. We extend this notation to sets of vertices in the obvious manner. The distance between two graphs $G=(V,E)$ and $G'=(V,E')$ is $|E\oplus E'|$. Their fractional distance is their distance divided by $n^2$ (note this lies in the interval $[0,1/2)$). Two graphs are $\varepsilon$-close to each other if their distance is at most $\varepsilon n^2$. A $k$-clusterable graph is a union of at most $k$ vertex-disjoint cliques. A graph is clusterable if it is $k$-clusterable for some $k$.

The following folklore lemma says that approximate $k$-clustering algorithms yield approximate clustering algorithms with an unbounded number of clusters:

Lemma 3.1

If $G$ is clusterable, then it is $\varepsilon$-close to $(1+1/\varepsilon)$-clusterable.

Proof.  Take the optimal clustering for $G$. Let $B$ be the set of vertices in clusters of size $<\varepsilon n$. Now re-cluster the elements of $B$ arbitrarily into clusters of size $\lfloor\varepsilon n\rfloor$ (except possibly one). This introduces at most $\varepsilon n\cdot|B|\leq\varepsilon n^2$ additional errors. All but one of the clusters of the resulting clustering have size $\geq\lfloor\varepsilon n\rfloor$, hence it has at most $1+1/\varepsilon$ clusters.
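The re-clustering step in this proof can be sketched in a few lines of Python (our own illustration; clusters are given as lists of vertices):

import math

def reduce_cluster_count(clusters, n, eps):
  """Re-cluster vertices from clusters of size < eps*n into chunks of size
  floor(eps*n); afterwards all but at most one cluster has size >= floor(eps*n),
  so the number of clusters is at most roughly 1 + 1/eps."""
  size = max(1, math.floor(eps * n))
  big = [list(c) for c in clusters if len(c) >= eps * n]
  small_vertices = [v for c in clusters if len(c) < eps * n for v in c]
  chunks = [small_vertices[i:i + size] for i in range(0, len(small_vertices), size)]
  return big + chunks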

Corollary 3.2

Any $(1,\varepsilon/2)$-approximation to the optimal $(1+2/\varepsilon)$-clustering is also a $(1,\varepsilon)$-approximation to the optimal clustering.

Proof.  Immediate from the triangle inequality for graph distances.      

We are now ready to summarize our results. All our algorithms are (necessarily) randomized, and succeed with probability no less than $2/3$ (which can be amplified). Our first result concerns the standard setting where the clusters of all vertices need to be explicitly computed. We present a $(4,\varepsilon)$-approximation that runs in time $O(n/\varepsilon)$; compare the $\Omega(n^2)$ complexity of most other clustering methods. (We can also produce an expected $(3,\varepsilon)$-approximation. Because we insist on algorithms that work with constant success probability, we talk about $(4,\varepsilon)$-approximations, where the constant $4$ could be replaced with any number $\geq 3$.) Our algorithm is optimal up to constant factors.

Theorem 3.3

Given $\varepsilon\in(0,1)$, a $(4,\varepsilon)$-approximate clustering for MinDisagree can be found in time $O(n/\varepsilon)$. Moreover, finding an $(O(1),\varepsilon)$-approximation with constant success probability requires $\Omega(n/\varepsilon)$ queries.

In other words, with a "budget" of $q$ queries we can obtain a $(4,O(n/q))$-approximation. In fact, the upper bound of Theorem LABEL:explicit3 can be derived from our next result. It states that the same approximation can be implicitly constructed in constant time, regardless of the size of the graph.

Theorem 3.4

Given $\varepsilon\in(0,1)$, a $(4,\varepsilon)$-approximate clustering for MinDisagree can be found locally in time $O(1/\varepsilon^2)$, or in time $O(1/\varepsilon)$ after preprocessing that uses $O(1/\varepsilon^2)$ non-adaptive queries and time. Moreover, finding an $(O(1),\varepsilon)$-approximation with constant success probability requires $\Omega(1/\varepsilon)$ adaptive queries.

As a corollary we obtain a partially tolerant tester of clusterability. We stress that the tester is efficient both in terms of query complexity and time complexity, unlike many results in property testing.

Corollary 3.5

There is a non-adaptive, two-sided error tester which accepts graphs that are $\varepsilon/5$-close to clusterable and rejects graphs that are $\varepsilon$-far from clusterable. It runs in time $O(1/\varepsilon^2)$.

So far these results do not allow us to obtain clusterings that are arbitrarily close to the optimal one. To overcome this issue, we also show (using different techniques) that a purely additive approximation can still be found with $\operatorname{poly}(1/\varepsilon)$ queries, but with an exponentially larger running time.

Theorem 3.6

Given $\varepsilon\in(0,1)$, there is a local clustering algorithm that achieves a $(1,\varepsilon)$-approximation to the cost of the optimal clustering. Its local time complexity is $\operatorname{poly}(1/\varepsilon)$ after preprocessing that uses $\operatorname{poly}(1/\varepsilon)$ queries and $2^{\operatorname{poly}(1/\varepsilon)}$ time.

For the explicit versions we obtain the following.

Corollary 3.7

There is a $(1,\varepsilon)$-approximate clustering algorithm for MinDisagree (and hence MaxAgree too) that runs in time $n\cdot\operatorname{poly}(1/\varepsilon)+2^{\operatorname{poly}(1/\varepsilon)}$. In particular, there is a PTAS for MaxAgree with the same running time.

The "in particular" part follows from the observation that the optimum value for MaxAgree[2] is $\Omega(n^2)$ (see, e.g., [25, Theorem 3.1]). The best PTAS in the literature [25] ran in time $n\cdot 2^{\Omega(\varepsilon^{-3}\log^2(1/\varepsilon))}$. In our result, the dominating term (depending on $n$) has an exponentially smaller multiplicative constant (polynomial in $1/\varepsilon$), and then we have an additive term exponential in $1/\varepsilon$ (and independent of $n$). As for lower bounds, observe that the $\Omega(n/\varepsilon)$ bound from Theorem LABEL:explicit3 still applies, while the presence of a term of the form $2^{(1/\varepsilon)^{\Omega(1)}}$ for very small $\varepsilon$ seems hard to avoid due to the NP-completeness of the problems, as an optimal solution can be found upon setting $\varepsilon=1/n^2$.

These results are established via the study of the corresponding problems with a prespecified number $k$ of clusters; such algorithms yield additive approximations to the general case upon setting $k=O(1/\varepsilon)$, in view of Lemma LABEL:fromkton. For fixed $k$, the bounds for our algorithms have the same form after replacing $\varepsilon$ by $k/\varepsilon$ (see Section LABEL:main_dense). For example, we get a PTAS for MaxAgree[k] in time $n\cdot\operatorname{poly}(k/\varepsilon)+2^{\operatorname{poly}(k/\varepsilon)}$.

Corollary 3.8

For any $0<\varepsilon_1<\varepsilon_2$, there is a non-adaptive, one-sided error tester which accepts graphs that are $\varepsilon_1$-close to clusterable and rejects graphs that are $\varepsilon_2$-far from clusterable. It has query complexity $\operatorname{poly}(1/\varepsilon)$ and runs in time $2^{\operatorname{poly}(1/\varepsilon)}$, where $\varepsilon=\varepsilon_2-\varepsilon_1$.

Techniques and roadmap. Our first local algorithm (Theorem LABEL:main_local) is inspired by the QuickCluster algorithm of Ailon et al. [2], which resembles the greedy procedure for finding maximal independent sets. The main idea behind making a local version is to define the clusters "in reverse". We find a small set $P$ of "cluster centers" or "pivots" by looking at a small induced subgraph, and then we show a simple rule to define an extended clustering for the whole graph in terms of the adjacencies of each particular vertex with $P$. As it turns out, such a $P$ can be obtained by a procedure that finds a constant-sized "almost-dominating" set of vertices that are within distance two of most other vertices in the graph, in such a way that we can combine the expected 3-approximation guarantee of [2] with an additive error term. The algorithm and its analysis are given in Section LABEL:sec:main.

The second local algorithm (Theorem LABEL:main_dense) borrows ideas from the PTAS for dense MaxCut of Frieze and Kannan [24] and uses low-rank approximations to the adjacency matrix of the graph. (Interestingly, while such approximations have been known for a long time, their implications for correlation clustering have been overlooked.) Notably, implicit descriptions of these approximations are locally computable in constant time (polynomial in the inverse of the approximation parameter). We show that in order to look for near-optimal clusterings, we can restrict the search to clusterings that “respect” a sufficiently fine weakly regular partition of the graph. Then we argue that this can be used to implicitly define a good approximate clustering: to cluster a given vertex, we first determine its piece in a regular partition, and then we look at which cluster contains this piece in the best coarsening of the partition. The details are in Section LABEL:sec:additive.

The lower bounds, proven in Section LABEL:sec:lb, are applications of Yao's lemma [42]. Broadly speaking, we give the candidate algorithm a perfect clustering of most vertices of the graph into $t=O(1/\varepsilon)$ clusters of equal size, and for each of the remaining vertices a "secret" cluster is chosen at random among these $t$. The optimal clustering of the resulting graph has fractional cost $\varepsilon/c$ for some constant $c>1$. We then ask the algorithm to find clusters for the remaining vertices, and show that it must make $\Omega(n/\varepsilon)$ adaptive queries if it is to output a clustering with fractional cost no larger than $\varepsilon$.

Finally, in Section LABEL:sec:extensions we discuss several extensions, including the case of non-binary similarity measures.

4 $(3,\varepsilon)$-approximations

First we describe the QuickCluster algorithm of Ailon et al. [2]. It selects a random pivot, creates a cluster with it and its positive neighborhood, removes the cluster, and iterates on the remaining induced subgraph. Essentially it finds a maximal independent set in the positive graph. When the graph is clusterable, it makes no errors. In [2], the authors show that the expected cost of the clustering found is at most three times the optimum.
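For concreteness, here is a minimal Python sketch of this pivot-based procedure, under the simplifying assumption that the positive graph is given as an adjacency-set dictionary adj (a hypothetical interface of ours; the paper's setting instead accesses edges through an oracle):

import random

def quick_cluster(vertices, adj, seed=0):
  """Pivot-based clustering in the spirit of QuickCluster [2]: pick a random
  pivot, cluster it with its remaining positive neighbours, and repeat."""
  rng = random.Random(seed)
  remaining = set(vertices)
  clusters = []
  while remaining:
    pivot = rng.choice(sorted(remaining))          # random pivot among remaining vertices
    cluster = {pivot} | (adj[pivot] & remaining)   # pivot plus its positive neighbours
    clusters.append(cluster)
    remaining -= cluster                           # iterate on the induced subgraph
  return clusters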

Note that determining the positive neighborhood $\Gamma^+(v)$ of a pivot $v$ takes $n-1$ queries to the adjacency oracle. The algorithm's worst-case complexity is $\Theta(n^2)$: consider the graph with no positive edges. In fact, its time and query complexity is $O(nc)$, where $c$ is the average number of clusters found. This suggests attempting to partition the data into a small number of clusters to minimize query complexity.

We know from Lemma LABEL:fromkton that any clustering can be $\varepsilon$-approximated by a clustering with pieces of size $\Omega(\varepsilon n)$. So an idea would be to modify QuickCluster so that most clusters output are sufficiently large. Fortunately, QuickCluster tends to do just that on average, provided that the graph of positive edges is sufficiently dense, because the expected size of the next cluster found is precisely one plus the average degree of the remaining graph. Once the graph becomes too sparse, a low-cost clustering of the remaining vertices can be found without even looking at the edges, for example by putting each of them into its own cluster. (Another possibility that works is to cluster all remaining vertices into clusters of size $\varepsilon n$, eliminating the need for singleton clusters.)

Another advantage of finding a small number of clusters is locality. Let $P=P_1,\ldots,P_t$ denote the first $t$ elements of the sequence of pivots found by QuickCluster. Let us pick an arbitrary vertex $v$ contained in the neighbourhood of $\{P_1,\ldots,P_t\}$; all other vertices can be safely ignored because, as we shall see, they usually will be incident to few edges (for suitably chosen $t$). Then the pivot of $v$'s cluster is the first element of $P$ that is a positive neighbour of $v$; therefore it can be determined in time $O(t)$, assuming we are given the pivot sequence $P_1,\ldots,P_t$.

function LocalCluster($v,\varepsilon$)
  $P\leftarrow$ FindGoodPivots($\varepsilon$)   ▷ This is the preprocessing stage and can be taken outside
  return FindCluster($v,P$)

function FindCluster($v,P$)
  if $v\notin\Gamma^+(P)$ then return $v$   ▷ Cluster $v$ by itself
  else $i\leftarrow\min\{j\mid v\in\Gamma^+(P_j)\}$; return $P_i$   ▷ Find first positive neighbour in $P$

function FindGoodPivots($\varepsilon$)
  for $i\in[16]$ do
    $P^i\leftarrow$ FindPivots($\varepsilon/12$)
    $\tilde{d}^i\leftarrow$ estimate of the cost of $P^i$ with $O(1/\varepsilon)$ local clustering calls (see Appendix LABEL:sec:main)
  $j\leftarrow\arg\min\{\tilde{d}^i\mid i\in[16]\}$
  return $P^j$

function FindPivots($\varepsilon$)
  Let $Q\subseteq V$ be a random sample without replacement of size $\min\bigl(n,\frac{1}{2\varepsilon}\bigr)$
  return IndependentSet($Q$)

function IndependentSet($Q$)
  $P\leftarrow[\,]$ (empty sequence)
  for $v\in Q$ do
    if FindCluster($v,P$) $=v$ then append $v$ to $P$
  return $P$

Algorithm 1: LocalCluster

Therefore we propose the scheme whose pseudocode is given in Algorithm 1 (the analysis is presented in the next section). Assuming we know a good sequence $P$, an implicit clustering is defined deterministically in the way described above: two vertices $v$ and $v'$ belong to the same cluster if and only if FindCluster($v,P$) = FindCluster($v',P$). Similarly to QuickCluster, we can find a set of pivots by finding an independent set of vertices in the graph; to keep it small we restrict the search to an induced subgraph of size $O(1/\varepsilon)$. This is done by FindPivots, which can be seen as a "preprocessing stage" for the local clustering algorithm FindCluster. In the next section the following key lemma will be shown.
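The following minimal Python sketch mirrors Algorithm 1 under simplifying assumptions of ours: the positive-edge oracle is a hypothetical function oracle(u, v), the cost estimation inside FindGoodPivots is omitted here (it appears in a later sketch), and the sample size is illustrative rather than the one used in the analysis.

import random

def find_pivots(vertices, oracle, eps, seed):
  """Sample O(1/eps) vertices and greedily keep an independent set as pivots."""
  rng = random.Random(seed)                       # shared seed: same pivots for every caller
  size = min(len(vertices), int(1 / (2 * eps)))
  sample = rng.sample(vertices, size)
  pivots = []
  for v in sample:
    if find_cluster(v, pivots, oracle) == v:      # v has no positive neighbour among pivots
      pivots.append(v)
  return pivots

def find_cluster(v, pivots, oracle):
  """Label v by its first positive neighbour among the pivots, else by itself."""
  for p in pivots:
    if p == v or oracle(p, v):
      return p
  return v

def local_cluster(v, vertices, oracle, eps, seed=0):
  """Local clustering call: the number of oracle queries depends on eps, not on |V|."""
  pivots = find_pivots(vertices, oracle, eps, seed)
  return find_cluster(v, pivots, oracle)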

Lemma 4.1

Let $\varepsilon\in(0,1)$ and $r,s>1$. The expected cost of the clustering determined by FindPivots($\varepsilon$) is at most $3\cdot\mathrm{OPT}+\varepsilon n^2$, and the probability that it exceeds $3r\cdot\mathrm{OPT}+s\cdot\varepsilon n^2$ is less than $\frac{1}{r}+\frac{1}{s}$.

For example, setting $r=4/3$, $s=8$ we see that with probability at least $1/8$, the clustering determined by FindPivots($\varepsilon/4$) is a $(4,\varepsilon)$-approximation to the optimal one. Although this low bound on the success probability may be overly pessimistic, we can amplify it in order to obtain better theoretical guarantees. To do this with confidence $2/3$ we try several samples $Q$ and estimate the cost of the associated local clusterings by sampling random edges.
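A minimal sketch of this amplification step, assuming the hypothetical helpers find_pivots and find_cluster from the sketch above and an edge oracle: run several independent pivot samples, estimate the fractional cost of each candidate from random pairs, and keep the cheapest. The constants are illustrative, not the ones from the analysis.

import random

def find_good_pivots(vertices, oracle, eps, trials=16, pairs=200, seed=0):
  """Try several pivot samples and keep the one whose estimated cost is smallest."""
  rng = random.Random(seed)
  best_pivots, best_cost = None, float("inf")
  for t in range(trials):
    pivots = find_pivots(vertices, oracle, eps, seed=seed + 1 + t)   # fresh sample per trial
    disagreements = 0
    for _ in range(pairs):                                           # estimate cost on random pairs
      x, y = rng.sample(vertices, 2)
      same = find_cluster(x, pivots, oracle) == find_cluster(y, pivots, oracle)
      if same != oracle(x, y):
        disagreements += 1
    cost = disagreements / pairs
    if cost < best_cost:
      best_pivots, best_cost = pivots, cost
  return best_pivots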

Lemma 4.2

Let $d$ denote the fractional cost of the optimal clustering. With probability at least $5/6$, FindGoodPivots($\varepsilon$) returns a pivot set $P^i$ with fractional cost at most $4d+\varepsilon$. Its running time is $O(1/\varepsilon^2)$.

Finally, to obtain a purely local clustering algorithm with no preprocessing we need each vertex to find the good set of pivots by itself, as in LocalCluster($v,\varepsilon$). Note that it is crucial here that all vertices have access to the same source of randomness, so that the same set of pivots is found on each separate call. In practical implementations this means introducing an additional short parameter, for instance a common random seed.

From Lemmas LABEL:lem:eight and LABEL:lem:cost we can easily deduce two of our main results.

Corollary 4.3 (Upper bound of Theorem LABEL:main_local)

LocalCluster($v,\varepsilon$) is a local clustering algorithm for MinDisagree achieving a $(4,\varepsilon)$-approximation to the optimal clustering with probability $2/3$. The preprocessing runs in time $O(\min(n/\varepsilon,1/\varepsilon^2))$, and the clustering time per vertex is $O(1/\varepsilon)$.

Corollary 4.4 (Upper bound of Theorem LABEL:explicit3)

An explicit clustering attaining a $(4,\varepsilon)$-approximation can be found with probability $2/3$ in time $O(n/\varepsilon)$.

By a very similar argument we can produce an expected $(3,\varepsilon)$-approximate clustering in time $O(n/\varepsilon)$. A time-efficient property tester for clusterability (Corollary LABEL:cor:prop_test) is also a simple consequence of the above.

4.1 Analysis of the local algorithm

We prove the approximation guarantees of the algorithm (Lemma LABEL:lem:eight) by comparing it to the clustering found by QuickCluster, which is known to achieve an expected 3-approximation [10]. In this section we consider the input graph $G=G^+$ with negative edges removed, so $\Gamma(v)=\Gamma^+(v)$ for all $v\in G$.

The following is a straightforward consequence of the multiplicative Chernoff bounds:

Lemma 4.5

Let $c>1$, $\varepsilon,\delta\in(0,1)$, $m\in\mathbb{N}^+$ and $n=\frac{3c}{(c-1)^2}\cdot\frac{\log(m/\delta)}{\varepsilon}$. Suppose $\{X^i_j\mid i\in[m],\ j\in[n]\}$ are random variables in $[0,1]$ such that for all $i\in[m]$, the variables $X^i_1,\ldots,X^i_n$ are independent with common mean $\mu_i$. Define $\tilde{\mu}_i=\frac{1}{n}\sum_{j=1}^n X^i_j$. Then with probability at least $1-\delta$, the following holds:

  • If $\min_{i\in[m]}\mu_i\leq\frac{1}{c}\cdot\varepsilon$, then $\min_{i\in[m]}\tilde{\mu}_i\leq\varepsilon$.

  • For all $i\in[m]$: if $\mu_i>c\cdot\varepsilon$, then $\tilde{\mu}_i>\varepsilon$.

Corollary 4.6

Let $d,\delta\in(0,1)$ and $c>1$. Let $C_1,\ldots,C_m$ be $m$ clusterings such that at least one of them has fractional cost $\leq d$. Then with probability $1-\delta$ we can select $i\in[m]$ such that $C_i$ has fractional cost at most $c\cdot d$, using a total of $\frac{3c}{(c-1)^2}\cdot\frac{m\log(m/\delta)}{\varepsilon}$ edge queries to $C_1,\ldots,C_m$.

Proof of Lemma LABEL:lem:cost.  Use Lemma LABEL:lem:eight and Corollary LABEL:cor:aggr.      

Proof of Corollary LABEL:explicit4.  Call FindGoodPivots($\varepsilon$) once to obtain a good pivot sequence $P$ with probability $2/3$ in time $O(\min(n,\varepsilon^{-2}))\leq O(n)$. Then run FindCluster($v,P$) sequentially for each vertex $v$ in order to determine its cluster label $l$, appending $v$ to the list of vertices in the cluster labelled $l$. Finally, output the resulting clusters. The whole process runs in time $O(n)+O(n/\varepsilon)=O(n/\varepsilon)$.

A partial clustering of $V$ is a clustering of a subset $W$ of $V$. Its partial cost is the number of disagreements on edges that have at least one endpoint in $W$.

Now consider a clustering $\mathcal{C}$ of $V$ into $C_1,\ldots,C_m\subseteq V$. For $S\subseteq[m]$, the $S$th partial subclustering of $\mathcal{C}$ is the partition of $V_S=\bigcup_{i\in S}C_i$ into $\{C_i\}_{i\in S}$. Clearly the cost of a clustering upper bounds the partial cost of any of its partial subclusterings.

Lemma 4.7

Let $P_1,\ldots,P_t$ denote the sequence of pivots found by FindPivots($\varepsilon$). The expected number of edge violations involving vertices within distance $\leq 2$ from $P_1,\ldots,P_t$ is at most $3\cdot\mathrm{OPT}$.

Proof.  To simplify the analysis, in the proof of this lemma we modify QuickCluster and FindPivots slightly so that they run deterministically, provided that a random permutation $\pi$ of the vertex set $V$ is chosen in advance. Concretely, we consider a deterministic version of QuickCluster, denoted QuickCluster$^\pi$, that uses pivot set IndependentSet($Q$), where $Q$ lists all vertices of $V$ in ascending order of $\pi$. Similarly, deterministic FindPivots$^\pi$ takes for $Q$ the set of the first $O(1/\varepsilon)$ elements of $V$ in increasing order of $\pi$. Clearly, running FindPivots$^\pi$ on a random permutation $\pi$ is the same as running the original FindPivots, and likewise for QuickCluster.

Observe that the sequence $P_1,\ldots,P_{t(\pi)}$ of pivots returned by FindPivots$^\pi$ is a prefix of the sequence of pivots returned by QuickCluster$^\pi$. Therefore the first $t(\pi)$ clusters are the same as well, i.e., $P_1,\ldots,P_{t(\pi)}$ define a partial subclustering of the one found by QuickCluster$^\pi$. Hence the partial cost of the subclustering determined by FindPivots$^\pi$ is in expectation at most $3\cdot\mathrm{OPT}$. This is equivalent to the statement of the lemma.

Next we show that FindPivots returns a small "almost-dominating" set of vertices, in the sense quantified in the following result.

Theorem 4.8

Let $G=(V,E)$ be a graph and $Q$ be an ordered sample of $r$ independent vertices chosen uniformly with replacement from $V$. Let $P=$ IndependentSet($Q$). Then the expected number of edges of $G$ not incident with an element of $P\cup\Gamma(P)$ is less than $\frac{n^2}{2r}$.

Observe that an existential result for an almost-dominating set $P$ is easy to establish by picking pivots in order of decreasing degree in the residual graph. However, doing so would invalidate the approximation guarantees of QuickCluster we are relying on. We defer the proof of Theorem LABEL:lem:indep. Assuming the result, we are ready to prove Lemma LABEL:lem:eight.
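As a sanity check of this bound, one could run a small simulation along the following lines (a sketch of ours, not part of the paper's analysis; the random graph model and the parameters are arbitrary choices):

import random

def uncovered_edges(adj, sample):
  """Greedy independent set over `sample`, then count edges not touching P ∪ Γ(P)."""
  pivots = []
  for v in sample:
    if all(p != v and v not in adj[p] for p in pivots):
      pivots.append(v)
  covered = set(pivots) | {u for p in pivots for u in adj[p]}
  return sum(1 for u in adj for w in adj[u] if u < w and u not in covered and w not in covered)

n, r = 300, 20
rng = random.Random(1)
adj = {u: set() for u in range(n)}
for u in range(n):                              # an arbitrary sparse random positive graph
  for w in range(u + 1, n):
    if rng.random() < 0.02:
      adj[u].add(w); adj[w].add(u)
sample = [rng.randrange(n) for _ in range(r)]   # with replacement, as in Theorem 4.8
print(uncovered_edges(adj, sample), "vs bound", n * n / (2 * r))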

Proof of Lemma LABEL:lem:eight.  Lemma LABEL:lem:prefix says that the pivots found define a partial clustering with expected cost bounded by $3\cdot\mathrm{OPT}$. Let $Q$ be the random sample used by FindPivots($\varepsilon$). Theorem LABEL:lem:indep is stated for sampling with replacement, but this implies the same result for sampling without replacement, so its conclusion still holds. Combining the two results and setting $r=1/(2\varepsilon)$, we see that the set of disagreements in the clustering produced can be written as the union of two disjoint random sets $A,B\subseteq V\times V$ with $\mathbb{E}[|A|]\leq 3\cdot\mathrm{OPT}$ and $\mathbb{E}[|B|]\leq\varepsilon n^2$. Thus the total cost is $|A\cup B|=|A|+|B|$. By linearity of expectation, the expected cost is $\mathbb{E}[|A|+|B|]\leq 3\cdot\mathrm{OPT}+\varepsilon n^2$, and by applying Markov's inequality to the non-negative variables $|A|$ and $|B|$ separately we conclude that

$$\Pr\bigl[(|A|>r\,\mathbb{E}[|A|])\text{ or }(|B|>s\,\mathbb{E}[|B|])\bigr]<\frac{1}{r}+\frac{1}{s}.$$

     
The rest of this section is devoted to the proof of Theorem LABEL:lem:indep. For any non-empty graph $G$ and pivot $v\in V(G)$, let $N_v(G)$ denote the subgraph of $G$ resulting from removing all edges incident to $\{v\}\cup\Gamma(v)$ (keeping all vertices). Define a random sequence $G_0,G_1,\ldots$ of graphs by $G_0=G$ and $G_{i+1}=N_{v_{i+1}}(G_i)$, where $v_1,v_2,\ldots$ are chosen independently at random from $V(G_0)$ (note that sometimes $G_{i+1}=G_i$).

Lemma 4.9

Let $G_i$ have average degree $\tilde{d}$. When going from $G_i$ to $G_{i+1}$, the number of edges decreases in expectation by at least $\binom{\tilde{d}+1}{2}$, and the number of degree-0 vertices increases in expectation by at least $\tilde{d}+1$.

Proof.  Let $V=V(G_0)$, $E=E^+(G_i)$, and let $d_u=|\Gamma(u)|$ denote the positive degree of $u\in V$ in $G_i$. The claim on the number of degree-0 vertices is easy, so we prove the claim on the number of edges. Consider an edge $\{u,v\}\in E$. It is deleted if the chosen pivot is an element of $\Gamma(u)\cup\Gamma(v)$ (which contains $u$ and $v$); let $X_{uv}$ be the 0-1 random variable associated with this event. It occurs with probability

$$\mathbb{E}[X_{uv}]=\frac{|\Gamma(u)\cup\Gamma(v)|}{n}\geq\frac{1+\max(d_u,d_v)}{n}\geq\frac{1}{n}+\frac{d_u+d_v}{2n}.$$

Let $D=\sum_{u<v,\ \{u,v\}\in E}X_{uv}$ be the number of edges deleted. By linearity of expectation, its average is

$$\mathbb{E}[D]=\sum_{\substack{u<v\\ \{u,v\}\in E}}\mathbb{E}[X_{uv}]\geq\frac{1}{2}\sum_{\substack{u,v\\ \{u,v\}\in E}}\left(\frac{1}{n}+\frac{d_u+d_v}{2n}\right)=\frac{\tilde{d}}{2}+\frac{1}{4n}\sum_{\substack{u,v\\ \{u,v\}\in E}}(d_u+d_v).$$

Now we compute

$$\frac{1}{4n}\sum_{\substack{u,v\\ \{u,v\}\in E}}(d_u+d_v)=\frac{1}{2n}\sum_{\substack{u,v\\ \{u,v\}\in E}}d_u=\frac{1}{2n}\sum_u d_u^2=\frac{1}{2}\operatorname*{\mathbb{E}}_u[d_u^2]\geq\frac{1}{2}\bigl(\operatorname*{\mathbb{E}}_u[d_u]\bigr)^2\geq\frac{1}{2}\tilde{d}^2,$$

hence $\mathbb{E}[D]\geq\frac{\tilde{d}}{2}+\frac{\tilde{d}^2}{2}=\binom{\tilde{d}+1}{2}$.

Now let $\widetilde{V}(G)=\{v\in V(G)\mid\deg(v)>0\}$ and define the "actual size" $S(G)$ of a graph by $S(G)=2\cdot|E(G)|+|\widetilde{V}(G)|=\sum_{v\in\widetilde{V}(G)}(1+\deg(v))$. Let $n=|V(G_0)|$ and define $\alpha_i\in[0,1]$ by $\alpha_i=\frac{S(G_i)}{n^2}$.

Lemma 4.10

For all $i\geq 1$ the following inequalities hold:

$$\mathbb{E}[\alpha_i\mid v_1,\ldots,v_{i-1}]\leq\alpha_{i-1}(1-\alpha_{i-1}),\qquad(2)$$

$$\mathbb{E}[\alpha_i]\leq\mathbb{E}[\alpha_{i-1}]\bigl(1-\mathbb{E}[\alpha_{i-1}]\bigr),\qquad(3)$$

$$\mathbb{E}[\alpha_i]<\frac{1}{i+1}.\qquad(4)$$

Proof.  Inequality (LABEL:eq:cond_alpha_i) is a restatement of Lemma LABEL:lem:del_edges. Inequality (LABEL:eq:alpha_i) follows from Jensen's inequality: since $\mathbb{E}[\alpha_i]=\mathbb{E}\bigl[\mathbb{E}[\alpha_i\mid v_1,\ldots,v_{i-1}]\bigr]\leq\mathbb{E}[\alpha_{i-1}(1-\alpha_{i-1})]$ and the function $g$ mapping $x$ to $g(x)=x(1-x)$ is concave, we have $\mathbb{E}[\alpha_i]\leq\mathbb{E}[g(\alpha_{i-1})]\leq g(\mathbb{E}[\alpha_{i-1}])=\mathbb{E}[\alpha_{i-1}](1-\mathbb{E}[\alpha_{i-1}])$.

Finally, we prove $\mathbb{E}[\alpha_i]<1/(i+1)$ for all $i\geq 1$. We know that $\mathbb{E}[\alpha_1]\leq\max_{x\in[0,1]}g(x)=g\bigl(\frac{1}{2}\bigr)=\frac{1}{4}$, so the claim follows by induction on $i$, as $g$ is increasing on $[0,1/2]$ and $g\bigl(\frac{1}{i}\bigr)=\frac{1}{i}-\frac{1}{i^2}\leq\frac{1}{i}-\frac{1}{i(i+1)}=\frac{1}{i+1}$.

Remark 4.1

With a finer analysis (Appendix LABEL:sec:sharper), Equation (LABEL:eq:alpha_i2) can be strengthened to $\mathbb{E}[\alpha_i]\leq\frac{1}{\frac{1}{\widehat{a}}+i+\Omega(\ln(\widehat{a}\cdot i))}$, where $\widehat{a}=\min(\alpha_0,1-\alpha_0)$. (This does not affect the asymptotics.)

Proof of Theorem LABEL:lem:indep.  Note that after sampling $r$ vertices $v_1,\ldots,v_r$ with replacement from $G_0$, the subgraph of $G_0$ resulting from removing all edges incident to $\bigcup_{i=1}^r\{v_i\}\cup\Gamma(v_i)$ is distributed according to $G_r$. Using Equation (LABEL:eq:alpha_i2), we bound $\mathbb{E}[|E(G_r)|]\leq\frac{n^2}{2}\mathbb{E}[\alpha_r]<\frac{n^2}{2r}$.

5 Fully additive approximations

Here we study $(1,\varepsilon)$-approximations. By Lemma LABEL:fromkton and its corollary, it is enough to consider $k$-clusterings for $k=O(1/\varepsilon)$.

5.1 The regularity lemma.

One of the cornerstone results in graph theory is the regularity lemma of Szemerédi, which has found a myriad of applications in combinatorics, number theory and theoretical computer science [31]. It asserts that every graph $G$ can be approximated by a small collection of random bipartite graphs; in fact, from $G$ we can construct a small "reduced" weighted graph $\widetilde{G}$ of constant size which inherits many properties of $G$. If we select an approximation parameter $\varepsilon$, it gives us an equitable partition of the vertex set $V$ of $G$ into a constant number $m=m(\varepsilon)$ of classes $C_1,\ldots,C_m$ such that the following holds: for any two large enough $A,B\subseteq V$, the number of edges between $A$ and $B$ can be estimated by thinking of $G$ as a random graph where the probability of an edge between $v\in C_i$ and $w\in C_j$ is $d(C_i,C_j)=|E(C_i,C_j)|/(|C_i||C_j|)$. (The precise notion of approximation we need will be explained later.) Moreover, it is possible to choose a minimum partition size $m_{\min}$; often $m_{\min}$ is chosen so that "internal" edges among vertices from the same class are few enough to be ignored.

The original result was existential, but algorithms to construct a regular partition are known [6, 23, 19] which run in time polynomial in $|V|$ (for constant $\varepsilon$). This naturally suggests trying to use the partition classes in order to obtain an approximation of the optimal clustering. Nevertheless, to the best of our knowledge, the only prior attempts to exploit the regularity lemma for clustering are the papers of Speroto and Pelillo [38] and of Sárközy, Song, Szemerédi and Trivedi [36]. They use the constructive versions of the lemma to find the reduced graph $\widetilde{G}$, and apply standard clustering algorithms to $\widetilde{G}$. Since the partition size $m$ required by the lemma is a $1/\varepsilon^5$-level iterated exponential of $m_{\min}$ (and this kind of growth rate is necessary [26]), they propose heuristics to avoid this tower-exponential behaviour. However, the running time of their algorithms is at least $n^\omega$, where $\omega\in[2,2.373)$ is the exponent of matrix multiplication. Moreover, no theoretical bounds are provided on the quality of the clustering found by working with the reduced graph, even if no heuristics were applied.

To address these issues, we opt to use a weaker variant of the regularity lemma due to Frieze and Kannan [22, 21]. It has better quantitative parameters and gives an implicit description of the partition, which opens the door for local clustering.

5.2 Cut decompositions of matrices

The idea of Frieze and Kannan is to take any real matrix $A$ with row set $\mathcal{R}$ and column set $\mathcal{S}$ and bounded entries, and approximate it by a low-rank matrix of a certain form. (The case of interest for us is when $A$ is the adjacency matrix of a graph.) Let $m=|\mathcal{R}|$ and $n=|\mathcal{S}|$. Given $R\subseteq\mathcal{R}$, $S\subseteq\mathcal{S}$ and a real density $d$, the cut matrix $D=CUT(R,S,d)$ is the rank-1 matrix defined by $D_{ij}=d$ if $(i,j)\in R\times S$, and $0$ otherwise. We identify $R$ and $S$ with column indicator vectors of length $m$ and $n$, respectively. Then we can write $CUT(R,S,d)=d\cdot RS^t$. We define $A(R,S)=R^tAS=\sum_{i\in R}\sum_{j\in S}A_{ij}$. A cut decomposition of a matrix $A$ with relative error $\varepsilon$ is a set of cut matrices $\{D_i\}_{i\in[s]}$, where $D_i=CUT(R_i,S_i,d_i)$, such that for all $R\subseteq\mathcal{R}$, $S\subseteq\mathcal{S}$,

$$\Bigl|A(R,S)-\sum_{i\in[s]}D_i(R,S)\Bigr|\leq\varepsilon\cdot\lVert R\rVert_2\lVert S\rVert_2\cdot\sqrt{mn}.$$

Such a decomposition is said to have width $s$ and coefficient length $\sqrt{d_1^2+\ldots+d_s^2}$.
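To make the notation concrete, the following small Python/numpy sketch (our own; the example matrix and subsets are arbitrary) builds a cut matrix $CUT(R,S,d)$ from indicator vectors and evaluates $A(R,S)=R^tAS$.

import numpy as np

def cut_matrix(R, S, d):
  """CUT(R, S, d) = d * R S^t, where R and S are 0/1 column indicator vectors."""
  return d * np.outer(R, S)

def block_sum(A, R, S):
  """A(R, S) = R^t A S = sum of the entries of A indexed by R x S."""
  return R @ A @ S

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(6, 6)).astype(float)   # arbitrary 0/1 matrix
R = np.array([1, 1, 0, 0, 1, 0], dtype=float)       # indicator of a row subset
S = np.array([0, 1, 1, 0, 0, 1], dtype=float)       # indicator of a column subset
d = block_sum(A, R, S) / (R.sum() * S.sum())        # density of the R x S block
D = cut_matrix(R, S, d)
print(block_sum(A, R, S), block_sum(D, R, S))       # a single cut matrix matches A(R, S) exactly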

Theorem 5.1 (Cut decompositions [22, Th. 1])

Suppose $A$ is an $\mathcal{R}\times\mathcal{S}$ matrix with entries in $[-1,1]$ and $\varepsilon,\delta\in(0,1)$ are reals. Then in time $\operatorname{poly}(1/\varepsilon)/\delta$ we can, with probability $1-\delta$, implicitly find a cut decomposition of width $\operatorname{poly}(1/\varepsilon)$, relative error at most $\varepsilon$, and coefficient length at most 6.

Regarding the meaning of "implicit". By implicitly finding a cut decomposition $B=\sum_{i\in[s]}CUT(R_i,S_i,d_i)$ of a matrix $A$ in time $t$, we mean that for any given pair $(x,y)\in\mathcal{R}\times\mathcal{S}$, we can compute all of the following in time $t$ by making queries to $A$:

  • the rational values $d_1,\ldots,d_s$;

  • the indicator functions $\mathbb{I}[x\in R_i]$ and $\mathbb{I}[y\in S_i]$, for all $i\in[s]$;

  • the value of the entry $B_{x,y}$.

In Appendix LABEL:app:fk we give a sketch of how Frieze and Kannan achieve this. We also observe that their algorithm is non-adaptive.

Specifying a maximum cut-set size. Suppose we start with arbitrary equitable partitions of the row set and column set of $A$ into $t$ pieces. We can then find cut decompositions of the $t^2$ submatrices induced by the partition, and combine them into a cut decomposition of the original matrix that satisfies $|S_i|\leq m/t$, $|T_i|\leq n/t$; the reader may verify that this preserves the bound on the relative error. This process can only increase the query and time complexities by an $O(t^2)$ factor (cf. [22, Section 5.1]).

Application to adjacency matrices. Suppose $A$ is the adjacency matrix of an unweighted graph $G=(V,E)$, and identify the sets $\mathcal{R}$ and $\mathcal{S}$ with $V$. Then $|\mathcal{R}|=|\mathcal{S}|=|V|=n$. Let $E(R,S)$ denote the number of edges between $R\subseteq V$ and $S\subseteq V$. Then $A(R,S)=E(R,S)+E(R\cap S,R\cap S)$ and the conclusion of Theorem LABEL:thm:fk can be written as follows: for all $R\subseteq\mathcal{R}$, $S\subseteq\mathcal{S}$,

$$E(R,S)+E(R\cap S,R\cap S)=\Bigl(\sum_{i\in[s]}d_i\cdot|R\cap R_i|\cdot|S\cap S_i|\Bigr)\pm\varepsilon n\sqrt{|R||S|}.\qquad(5)$$

The last term can be bounded by $\varepsilon n^2$. While the standard regularity lemma supplies a much stronger notion of approximation, this bound suffices for certain applications.

Weakly regular partitions. A weakly $\varepsilon$-pseudo-regular partition of $V$ is a partition of $V$ into classes $V^1,\ldots,V^\ell$ such that for all disjoint $R,S\subseteq V$, $\bigl|E(R,S)-\sum_{i,j\in[\ell]}d(V^i,V^j)\cdot|R\cap V^i|\cdot|S\cap V^j|\bigr|\leq\varepsilon n^2$, where $d(V^i,V^j)=\frac{E(V^i,V^j)}{|V^i||V^j|}$. If, in addition, the partition is equitable, it is said to be weakly $\varepsilon$-regular.

Given a cut decomposition of a graph with relative error $\varepsilon$ and width $s$, we get a weakly $2\varepsilon$-pseudo-regular partition of size $\ell\leq 2^{2s}$ by taking the classes of the Venn diagram of $R_1,S_1,\ldots,R_s,S_s$ with universe $V$. So we can enforce the condition that the sets $R_1,S_1,R_2,S_2,\ldots$ partition the vertex set of $G$, at the price of an exponential increase in the number of such sets. Furthermore, any weakly $\varepsilon$-pseudo-regular partition of size $\ell$ may be refined to obtain a weakly $3\varepsilon$-regular partition of slightly larger size; see [22, Section 5.1].

Often the weak regularity lemma is stated in terms of weakly regular partitions as above, but the formulation of Theorem LABEL:thm:fk is stronger in that it allows us to estimate the number of edges between two sets in time $\operatorname{poly}(1/\varepsilon)$, provided that we know the sizes of their intersections with all $R_i,S_i$, even though the weakly regular partition has size $\ell=2^{\operatorname{poly}(1/\varepsilon)}$.
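The Venn-diagram classes above admit a compact local representation: a vertex's class is determined by the $2s$ indicator bits $\mathbb{I}[v\in R_i]$, $\mathbb{I}[v\in S_i]$. A minimal Python sketch, with a hypothetical membership interface in_R(i, v), in_S(i, v):

def venn_class(v, in_R, in_S, s):
  """Index of v's class in the Venn diagram of R_1, S_1, ..., R_s, S_s:
  a 2s-bit signature, so there are at most 2^(2s) classes."""
  signature = 0
  for i in range(s):
    signature = (signature << 1) | (1 if in_R(i, v) else 0)   # bit for v in R_i
    signature = (signature << 1) | (1 if in_S(i, v) else 0)   # bit for v in S_i
  return signature

Two vertices fall in the same class exactly when their signatures coincide, which is how a local algorithm can determine a vertex's piece of the partition without materializing the partition itself.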

5.3 Near-optimal clusterings and the local algorithm

Intuitively, two vertices v,wv,w in the same class of a regular partition have roughly the same number of connections with vertices outside. Hence for any given clustering of the remaining nodes, the cost of placing vv into any one of the clusters is roughly the same as the cost of placing ww there, suggesting they belong together in an optimal clustering (if we can afford to ignore the cost due to internal edges in the regular partition). In other words, a regular partition can be “coarsened” into a good clustering; the best one can be found by considering all possible combinations of assigning partition classes to clusters and estimating the cost of each resulting clustering.

We can make this argument rigorous by using bounds derived from the weak regularity lemma to approximate the cost of the optimal clustering by a certain quadratic program. If we ignore the terms with a single variable squared, the optimum of this program does not change by much as long as the partition is sufficiently fine. Then one can argue that the modified program attains its optimum for an assignment of variables which can be interpreted as a clustering that puts everything from the same regular partition into the same cluster.

Lemma 5.2

Let A be the adjacency matrix of a graph G=(V,E) and k\in{\mathbb{N}}. Let \{CUT(R_{i},S_{i},d_{i})\}_{i\in[s]} be a cut decomposition of A with relative error \frac{\varepsilon}{2k} and with |R_{i}|,|S_{i}|\leq\frac{\varepsilon n}{8k} for all i\in[s]. Denote by C^{*} the optimal k-clustering, and by C the optimal k-clustering into classes that belong to the \sigma-algebra generated by \bigcup_{i\in[s]}\{R_{i},S_{i}\} over V. Then \operatorname{cost}(C)-\varepsilon n^{2}\leq\operatorname{cost}(C^{*})\leq\operatorname{cost}(C).

Proof.  We use Equation (LABEL:eq:weak_reg) to introduce an “idealized” cost function ideal\operatorname{ideal} satisfying the following for any clustering XX:

  1. \lvert\operatorname{cost}(X)-\operatorname{ideal}(X)\rvert\leq\frac{\varepsilon n^{2}}{2}; and

  2. \operatorname{ideal}(C)\leq\operatorname{ideal}(X)+\frac{\varepsilon n^{2}}{2}.

Taken together, these two properties imply the result.

For each k-clustering X into X_{1},\ldots,X_{k}, define

\operatorname{ideal}(X)=-\frac{n}{2}+\sum_{j\in[k]}\Bigg[\sum_{i\in[s]}\Big(\frac{1-d_{i}}{2}\Big)|X_{j}\cap R_{i}||X_{j}\cap S_{i}|\Bigg]+\sum_{\begin{subarray}{c}j,j^{\prime}\in[k]\\ j\neq j^{\prime}\end{subarray}}\Bigg[\sum_{i\in[s]}d_{i}|X_{j}\cap R_{i}||X_{j^{\prime}}\cap S_{i}|\Bigg]. (6)

For any j,j^{\prime}\in[k], j\neq j^{\prime}, using Equation (LABEL:eq:weak_reg) it holds that

E(X_{j},X_{j^{\prime}})=\Big[\sum_{i\in[s]}d_{i}|X_{j}\cap R_{i}||X_{j^{\prime}}\cap S_{i}|\Big]\pm\frac{\varepsilon n}{2k}\sqrt{|X_{j}||X_{j^{\prime}}|}.

Similarly,

E(X_{j},X_{j})=\frac{1}{2}\Big[\sum_{i\in[s]}d_{i}|X_{j}\cap R_{i}||X_{j}\cap S_{i}|\Big]\pm\frac{\varepsilon n}{2k}\sqrt{|X_{j}||X_{j}|}.

Therefore

\operatorname{cost}(X)=\sum_{j\in[k]}\Big[\binom{|X_{j}|}{2}-E(X_{j},X_{j})\Big]+\sum_{\begin{subarray}{c}j,j^{\prime}\in[k]\\ j\neq j^{\prime}\end{subarray}}E(X_{j},X_{j^{\prime}})
=-\frac{n}{2}+\sum_{j\in[k]}\Big[\frac{1}{2}|X_{j}|^{2}-E(X_{j},X_{j})\Big]+\sum_{\begin{subarray}{c}j,j^{\prime}\in[k]\\ j\neq j^{\prime}\end{subarray}}E(X_{j},X_{j^{\prime}})
\leq\operatorname{ideal}(X)+\frac{\varepsilon n}{2k}\cdot\sum_{j,j^{\prime}\in[k]}\sqrt{|X_{j}||X_{j^{\prime}}|}
=\operatorname{ideal}(X)+\frac{\varepsilon n}{2k}\cdot\Big(\sum_{j\in[k]}\sqrt{|X_{j}|}\Big)^{2}
\leq\operatorname{ideal}(X)+\frac{\varepsilon n}{2k}\cdot k\,\Big(\sum_{j\in[k]}|X_{j}|\Big)
=\operatorname{ideal}(X)+\frac{\varepsilon n^{2}}{2},

where the last inequality is by Cauchy-Schwarz.

It remains to be shown that \operatorname{ideal}(X)\geq\operatorname{ideal}(C)-\frac{\varepsilon n^{2}}{2}; in other words, that there is an almost-optimal k-clustering under the \operatorname{ideal} cost function whose pieces are unions of the pieces V^{1},\ldots,V^{\ell} of the Venn diagram of R_{1},S_{1},\ldots,R_{s},S_{s}. To see this, write

R_{i}=\bigcup_{t\,:\,V^{t}\subseteq R_{i}}V^{t},\quad S_{i}=\bigcup_{t^{\prime}\,:\,V^{t^{\prime}}\subseteq S_{i}}V^{t^{\prime}}.

Then

|X_{j}\cap R_{i}|=\sum_{t\,:\,V^{t}\subseteq R_{i}}|X_{j}\cap V^{t}|,\quad|X_{j}\cap S_{i}|=\sum_{t^{\prime}\,:\,V^{t^{\prime}}\subseteq S_{i}}|X_{j}\cap V^{t^{\prime}}|.

Therefore \operatorname{ideal}(X)+\frac{n}{2} is a quadratic form on the k\ell intersection sizes |X_{j}\cap V^{t}|, j\in[k], t\in[\ell]:

\operatorname{ideal}(X)+\frac{n}{2}=\sum_{\begin{subarray}{c}j,j^{\prime}\in[k],\ i\in[s]\\ t,t^{\prime}\in[\ell]:\ V^{t}\subseteq R_{i},\,V^{t^{\prime}}\subseteq S_{i}\end{subarray}}\lambda_{j,j^{\prime}}^{i}\,|X_{j}\cap V^{t}||X_{j^{\prime}}\cap V^{t^{\prime}}|,

where \lambda_{j,j}^{i}=(1-d_{i})/2 and \lambda_{j,j^{\prime}}^{i}=d_{i} when j\neq j^{\prime}.

Now remove from this expression the terms where t=tt=t^{\prime}. Among these, the terms where jjj\neq j^{\prime} evaluate to zero because XjX_{j} and XjX_{j^{\prime}} are disjoint. Each of the terms where t=tt=t^{\prime} and j=jj=j^{\prime} has absolute value at most

|\lambda_{j,j}^{i}|\,|X_{j}\cap V^{t}|^{2}\leq 4|X_{j}||V^{t}|\leq\frac{\varepsilon n}{2k}|X_{j}|,

since |λj,ji|=|1di2|4|\lambda_{j,j}^{i}|=|\frac{1-d_{i}}{2}|\leq 4 from the bound on the coefficient length, and |Vt|εn/(8k)|V^{t}|\leq\varepsilon n/(8k). Therefore the term removal changes the value of the ideal\operatorname{ideal} cost function by at most εn2/2\varepsilon n^{2}/2.

For (t,j)[]×[k](t,j)\in[\ell]\times[k], let αjt=|XjVt|\alpha_{j}^{t}=|X_{j}\cap V^{t}|. Let

\kappa_{j,j^{\prime}}^{t,t^{\prime}}=\mathbb{I}\,[t\neq t^{\prime}]\cdot\sum_{i\in[s]}\lambda_{j,j^{\prime}}^{i}\,\mathbb{I}\,[V^{t}\subseteq R_{i}\wedge V^{t^{\prime}}\subseteq S_{i}].

Then we have seen that

\operatorname{ideal}(X)+\frac{\varepsilon n^{2}}{2}\geq-\frac{n}{2}+\sum_{\begin{subarray}{c}t,t^{\prime}\in[\ell]\\ j,j^{\prime}\in[k]\end{subarray}}\kappa_{j,j^{\prime}}^{t,t^{\prime}}\,\alpha_{j}^{t}\alpha_{j^{\prime}}^{t^{\prime}},

and κj,jt,t=0\kappa_{j,j^{\prime}}^{t,t}=0. Hence finding the optimal kk-clustering under the idealized cost function can be reduced, up to an additive error of ε2n2\frac{\varepsilon}{2}n^{2}, to solving the following integer quadratic program:

minimize\quad -\frac{n}{2}+\sum_{\begin{subarray}{c}t,t^{\prime}\in[\ell]\\ j,j^{\prime}\in[k]\end{subarray}}\kappa_{j,j^{\prime}}^{t,t^{\prime}}\,\alpha_{j}^{t}\alpha_{j^{\prime}}^{t^{\prime}} (7)
subject to\quad \sum_{j\in[k]}\alpha_{j}^{t}=|V^{t}|\quad\forall t\in[\ell],
\qquad\qquad\ \alpha_{j}^{t}\geq 0,\ \alpha_{j}^{t}\in{\mathbb{N}}\quad\forall t\in[\ell],\ j\in[k].

The reason is that any feasible solution for {αjt}\{\alpha_{j}^{t}\} gives a clustering by assigning α1t\alpha_{1}^{t} arbitrary elements of VtV^{t} to the first cluster, another α2t\alpha_{2}^{t} elements of VtV^{t} to the second cluster, and so on.

Because \kappa_{j,j^{\prime}}^{t,t}=0, there is an optimal solution to (LABEL:eq:qp) in which, for each t\in[\ell], exactly one \alpha_{j}^{t} equals |V^{t}| and the rest are zero. Indeed, fix \alpha_{j}^{t^{\prime}} for all t^{\prime}\neq t and all j in a solution (which corresponds to fixing a k-clustering of V\setminus V^{t}). Then the objective function becomes a linear combination of \alpha_{1}^{t},\ldots,\alpha_{k}^{t}, plus a constant term. Therefore it is minimized by picking the cluster j\in[k] with the smallest coefficient and setting \alpha_{j}^{t}=|V^{t}|.
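For intuition, the exchange step in this argument can be phrased as a small local search that repeatedly moves a whole atom V^{t} to its currently cheapest cluster; since \kappa_{j,j^{\prime}}^{t,t}=0, the objective of (7) is linear in how a single atom is split, so such moves never increase it. The following Python sketch is ours and purely illustrative (the inputs `kappa`, keyed as (t, t', j, j'), and `atom_sizes` are hypothetical); it is not part of the algorithm's analysis.

def coarsen_to_block_solution(atom_sizes, kappa, k):
    ell = len(atom_sizes)
    assign = [0] * ell            # start with every atom placed in cluster 0
    changed = True
    while changed:
        changed = False
        for t in range(ell):
            if atom_sizes[t] == 0:
                continue
            # With the other atoms fixed, the objective is linear in how V^t is split,
            # so it is minimized by placing V^t entirely in one cluster.
            def coeff(j):
                return sum(kappa.get((t, tp, j, assign[tp]), 0.0) * atom_sizes[tp] +
                           kappa.get((tp, t, assign[tp], j), 0.0) * atom_sizes[tp]
                           for tp in range(ell) if tp != t)
            best = min(range(k), key=coeff)
            if coeff(best) < coeff(assign[t]):
                assign[t], changed = best, True
    return assign                 # assign[t] = the cluster receiving all of V^t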
We now sketch our second local algorithm.

Proof of Theorem LABEL:main_dense.  For any k\in{\mathbb{N}} and \varepsilon\in(0,1), we show a local algorithm that achieves a (1,\varepsilon)-approximation to the optimal k-clustering in time {\operatorname{poly}}(k/\varepsilon), after a preprocessing stage that uses {\operatorname{poly}}(k/\varepsilon) queries and 2^{{\operatorname{poly}}(k/\varepsilon)} time. Theorem LABEL:main_dense then follows by setting k=O(1/\varepsilon).

First compute a cut decomposition of A that satisfies the conditions of Lemma LABEL:lem:opt_cl. By Theorem LABEL:thm:fk, it can be computed implicitly in {\operatorname{poly}}(k/\varepsilon) time. Let V^{1},\ldots,V^{\ell} be the atoms of the \sigma-algebra, where \ell=2^{2s} and s={\operatorname{poly}}(k/\varepsilon). Observe that they can also be defined implicitly: given x\in V we can compute in {\operatorname{poly}}(s) time a 2s-bit label that determines the unique V^{t} to which x belongs, namely the values of the 2s indicator functions \mathbb{I}\,[x\in R_{i}],\mathbb{I}\,[x\in S_{i}].

Next we proceed to the more expensive preprocessing part. Consider a clustering all of whose classes are unions of V^{1},\ldots,V^{\ell}. Any such clustering is defined by a mapping g:[\ell]\to[k] that, for every i\in[\ell], identifies the cluster to which all elements of V^{i} belong. We can try all the \ell^{k}=2^{{\operatorname{poly}}(k/\varepsilon)} possibilities for g, and for each of them estimate the cost of the associated clustering to within \varepsilon/(2k) with high enough success probability by sampling. (We omit the details.) If we select the best of them then, by Lemma LABEL:lem:opt_cl, it will have cost within \varepsilon n^{2} of the optimal one.

Now we have a “best” mapping g from [\ell] to [k] that, for every i\in[\ell], tells us the cluster of the elements of V^{i}. Finally, note that for any x\in V, the appropriate i\in[\ell] such that x\in V^{i} can be determined in time {\operatorname{poly}}(s)={\operatorname{poly}}(k/\varepsilon), and then we can get a cluster label for x in time {\operatorname{poly}}(k/\varepsilon) by computing g(i).
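A minimal sketch of this local query step, with hypothetical oracle names: `in_R` and `in_S` stand for the implicit membership tests of Theorem LABEL:thm:fk, and `g` for the mapping found during preprocessing, stored as a table keyed by the 2s-bit label.

def local_cluster_label(x, s, in_R, in_S, g):
    # Build the 2s-bit label of the atom V^t containing x.
    bits = tuple(in_R(i, x) for i in range(s)) + tuple(in_S(i, x) for i in range(s))
    # Return the cluster index in [k] assigned to the whole atom during preprocessing.
    return g[bits]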

6 Lower bounds

We show that our algorithm from Section LABEL:sec:main is optimal up to constant factors by proving a matching lower bound for obtaining (O(1),ε)(O(1),\varepsilon)-approximations. For simplicity we consider expected approximations at first; later we prove that combining upper and lower bounds for expected approximations leads to lower bounds for finding bounded approximations with high confidence.

Theorem 6.1

Let c1c\geq 1, ε(1/n,1/(100c))\varepsilon\in(1/n,1/(100c)). Finding an expected (c,ε)(c,\varepsilon)-approximation to the best clustering with probability 1/21/2 requires n4000εc2\frac{n}{4000\varepsilon c^{2}} queries to the similarity matrix.

In addition, any local clustering algorithm achieving this approximation has query complexity Ω(1/(εc2))\Omega(1/(\varepsilon c^{2})). (This remains true even if we allow preprocessing, as long as its running time is bounded by a function of ε\varepsilon and not nn.)

Proof.  The first part implies the second, because any q(\varepsilon)-query local clustering algorithm with preprocessing cost p(\varepsilon) can be turned into an explicit clustering algorithm making n\cdot q(\varepsilon)+p(\varepsilon) queries. Given a lower bound of n\cdot l(\varepsilon) on the complexity of finding (c,\varepsilon)-approximate clusterings for large enough n, we get n\cdot q(\varepsilon)+p(\varepsilon)\geq n\cdot l(\varepsilon) for all large enough n, which implies q(\varepsilon)\geq l(\varepsilon) since \lim_{n\to\infty}p(\varepsilon)/n=0. So it suffices to prove the first claim.

By Yao’s minimax principle, it is enough to produce a distribution 𝒢{\mathcal{G}} over graphs with the following properties:

  • the expected cost of the optimal clustering of G\sim{\mathcal{G}} is \operatorname*{\mathbb{E}}[\func{OPT}(G)]\leq\frac{\varepsilon n^{2}}{c};

  • for any deterministic algorithm making at most n/(4000\varepsilon c^{2}) queries, the expected cost (over G) of the clustering produced exceeds 2\varepsilon n^{2}\geq c\cdot\operatorname*{\mathbb{E}}[\func{OPT}(G)]+\varepsilon n^{2}.

Let α=14c\alpha=\frac{1}{4c}, k=132cεk=\frac{1}{32c\varepsilon} and l=k2εn3n4000c2εl=\frac{k^{2}\varepsilon n}{3}\geq\frac{n}{4000c^{2}\varepsilon}. We can assume that cc, kk and αn/k\alpha n/k are integral (here we use the fact that ε>1/n\varepsilon>1/n). Let A={1,,(1α)n}A=\{1,\ldots,(1-\alpha)n\} and B={(1α)n+1,,n}B=\{(1-\alpha)n+1,\ldots,n\}. Consider the following distribution 𝒢\cal G of graphs: partition the vertices of AA into exactly kk equal-sized clusters C1,,CkC_{1},\ldots,C_{k}. The set of positive edges will be the union of the cliques defined by C1,,CkC_{1},\ldots,C_{k}, plus an edge joining each vertex vBv\in B to a randomly chosen element rvAr_{v}\in A. Define the natural clustering of a graph G𝒢G\in{\mathcal{G}} by the classes Ci=Ci{vBrvCi}C_{i}^{\prime}=C_{i}\cup\{v\in B\mid r_{v}\in C_{i}\} (i[k]i\in[k]). This clustering will have a few disagreements because of the negative edges between different vertices v,wBv,w\in B with rv=rwr_{v}=r_{w}. The cost of the optimal clustering of GG is bounded by that of the natural clustering NN, hence

\operatorname*{\mathbb{E}}[\func{OPT}]\leq\operatorname*{\mathbb{E}}[\operatorname{cost}(N)]=\frac{\binom{\alpha n}{2}}{k}\leq\frac{\alpha^{2}n^{2}}{2k}=\frac{\varepsilon}{c}n^{2}.
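For concreteness, a graph from the hard distribution {\mathcal{G}} can be sampled as in the following sketch (ours, for illustration only; it assumes the divisibility conditions of the proof and returns the set of positive edges):

import random

def sample_hard_instance(n, c, eps):
    alpha = 1.0 / (4 * c)
    k = max(1, round(1 / (32 * c * eps)))
    size_A = round((1 - alpha) * n)
    clique_size = size_A // k          # assumes divisibility, as in the proof
    edges = set()
    for i in range(k):                 # positive cliques C_1, ..., C_k inside A
        block = range(i * clique_size, (i + 1) * clique_size)
        edges.update((u, v) for u in block for v in block if u < v)
    for v in range(size_A, n):         # each v in B gets one random positive edge to A
        r_v = random.randrange(size_A)
        edges.add((min(v, r_v), max(v, r_v)))
    return edges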

We have to show that any algorithm making ll queries to graphs drawn from 𝒢{\mathcal{G}} produces a clustering with expected cost larger than 2εn22\varepsilon n^{2}. This inequality holds provided that the output clustering CC and the natural clustering NN are at least 3ε3\varepsilon-far apart. Indeed, reasoning about expected distances, NN is ε/c\varepsilon/c-close to GG, therefore any clustering that is 2ε2\varepsilon-close to GG is also 2ε+ε/c3ε2\varepsilon+\varepsilon/c\leq 3\varepsilon-close to NN from the triangle inequality.

Since all graphs in {\mathcal{G}} induce the same subgraphs on A and B separately, we can assume without loss of generality that the algorithm queries only edges between A and B. Let us analyze the distance between the natural clustering and the clustering found by the algorithm. For v\in B, let Q_{v} denote the set of queries it makes from v to A and put q_{v}=|Q_{v}|. Clearly we can assume q_{v}\leq k-1. The total number of queries made is q=\sum_{v\in B}q_{v}.

Since r_{v} is independent of all edges from [n]-\{v\} to [n]-\{v\}, even after conditioning on the responses to all queries not involving v, the probability that all of v's queries are answered negatively is \Pr[r_{v}\notin Q_{v}]=1-q_{v}/k. When this happens, the probability that r_{v} coincides with the algorithm's choice is at most \frac{1}{k-q_{v}}.

All in all we have that the probability that the algorithm puts vv into the same cluster as rvr_{v} is bounded by 1kqv+qvk\frac{1}{k-q_{v}}+\frac{q_{v}}{k}. Let us associate a 0-1 random variable ava_{v} with this event and put R=vBavR=\sum_{v\in B}a_{v}. Consequently,

\operatorname*{\mathbb{E}}[R]\leq\sum_{v\in B}\left(\frac{1}{k-q_{v}}+\frac{q_{v}}{k}\right)=\frac{q}{k}+\sum_{v\in B}\frac{1}{k-q_{v}}.

We will see below (Lemma LABEL:kmqi) that the last term can be bounded by 2(m+q)/k2(m+q)/k, where m=|B|=αnm=|B|=\alpha n. Therefore 𝔼[R]3q+2mk\operatorname*{\mathbb{E}}[R]\leq\frac{3q+2m}{k}.

Now note that any vertex with av=0a_{v}=0 introduces 2(nm)/kn/k2(n-m)/k\geq n/k new differences with the natural clustering. Thus the expected number of differences is at least

\left(m-\frac{3q+2m}{k}\right)\frac{n}{k}=m\left(1-\frac{2}{k}\right)\frac{n}{k}-\frac{3qn}{k^{2}}\geq\frac{\alpha n^{2}}{2k}-\frac{3qn}{k^{2}}\geq 4\varepsilon n^{2}-\frac{3qn}{k^{2}}\geq 3\varepsilon n^{2},

because qlq\leq l.

     

Lemma 6.2

Let q_{1},\ldots,q_{m}\in[0,k-1] with \sum_{i=1}^{m}q_{i}=q. Then

\sum_{i=1}^{m}\frac{1}{k-q_{i}}\leq\frac{2(m+q)}{k}.

Proof.  Let \gamma=\frac{q}{m+q}. Define the sets

A=\{i\in[m]\mid q_{i}\geq\gamma k\}\quad\text{and}\quad B=\{i\in[m]\mid q_{i}<\gamma k\}.

Observe that |A|\leq\frac{q}{\gamma k}=\frac{m+q}{k}. Then

\sum_{i=1}^{m}\frac{1}{k-q_{i}}\leq|A|+\frac{|B|}{(1-\gamma)k}\leq|A|+\frac{m}{(1-\gamma)k}\leq\frac{2(m+q)}{k}.

     

Finally, we argue that similar bounds hold for algorithms that obtain good approximation with high success probability.

Lemma 6.3

Suppose 𝒜{\mathcal{A}} finds a (c,ε)(c,\varepsilon)-approximate clustering with success probability 1/21/2 using qq queries, and  {\mathcal{B}} finds an expected (c,rε)(c,r\cdot\varepsilon)-approximate clustering using qq queries. Then there is an algorithm 𝒞{\mathcal{C}} that finds an expected (c,2ε+log(2r)exp(2ε2q))\big{(}c,2\varepsilon+\log(2r)\cdot\exp(-2\varepsilon^{2}q)\big{)}-approximation using 2qlog(2r)2q\cdot\log(2r) queries.

Proof.  Algorithm 𝒞{\mathcal{C}} does the following:

  1. Let t\leftarrow\log r.

  2. Run t independent instantiations of {\mathcal{A}} to find clusterings C_{\mathcal{A}}^{1},\ldots,C_{\mathcal{A}}^{t} with qt queries.

  3. Run {\mathcal{B}} independently to find an expected (c,r\cdot\varepsilon)-approximate clustering C with q queries.

  4. Estimate the quality of these t+1 clusterings using q random samples for each of them.

  5. Return the clustering with the smallest estimated error.

The query complexity bound of 𝒞{\mathcal{C}} is as stated. When one of the t+1t+1 clusterings found is (c,ε)(c,\varepsilon)-approximate, the probability that we fail to return a (c,2ε)(c,2\varepsilon)-approximation is at most p=exp(2ε2q)(t+1)p=\exp(-2\varepsilon^{2}q)\cdot(t+1). In this case we bound the error of the clustering output by 11. So the contribution to the expected approximation due to this kind of failure is at most (0,p)(0,p). We assume from now on that this is not the case.

The probability that none of C_{\mathcal{A}}^{1},\ldots,C_{\mathcal{A}}^{t} is a (c,\varepsilon)-approximation is at most 2^{-t}\leq\frac{1}{r}; whenever at least one of them is, we output a (c,2\varepsilon)-approximation. On the other hand, with probability at most \frac{1}{r}, we fall back on a clustering that in expectation is a (c,r\cdot\varepsilon)-approximation. Therefore, the output is an expected (c,2\varepsilon)+\frac{1}{r}(c,r)+(0,p)-approximation.
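The combination used in the proof can be summarized by the following sketch (illustrative names; `run_A`, `run_B` and `estimate_cost` are assumed black boxes, and the base of the logarithm is taken to be 2):

import math

def combine(run_A, run_B, estimate_cost, q, r):
    t = max(1, math.ceil(math.log2(r)))
    candidates = [run_A() for _ in range(t)]     # t independent runs of algorithm A
    candidates.append(run_B())                   # one run of algorithm B as a fallback
    # Return the clustering whose estimated cost (from q sampled pairs) is smallest.
    return min(candidates, key=lambda cl: estimate_cost(cl, num_samples=q))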

Corollary 6.4

Let ε>107/n\varepsilon>10^{7}/n. Finding a (c,ε)(c,\varepsilon)-approximate clustering with confidence 1/21/2 requires q=n2106c2εq=\frac{n}{2\cdot 10^{6}\cdot c^{2}\varepsilon} queries.

Proof.  We may assume c3c\geq 3. Take the algorithm {\mathcal{B}} from Corollary LABEL:explicit4 and plug it into Lemma LABEL:conv_expected. This gives an expected approximation of (max(c,3),2ε+25exp(2ε2q))(c,3ε)(\max(c,3),2\varepsilon+25\exp(-2\varepsilon^{2}q))\leq(c,3\varepsilon) using 50q50q queries. The result now follows from Lemma LABEL:lb_neps.      

7 Extensions

Non-binary similarity function. In Section 1 we introduced correlation clustering in its most general form, with a pairwise similarity function \operatorname{sim}:V\times V\rightarrow[0,1], while the case we have studied so far is that of a binary similarity function \operatorname{sim}:V\times V\rightarrow\{0,1\}. The general case can be reduced to the binary one by “rounding the graph”, i.e., by replacing each non-binary similarity score with either 0 or 1, whichever is closest (breaking ties arbitrarily): Bansal et al. [10, Thm. 23] showed that if {\mathcal{A}} is an algorithm that produces a clustering on a graph G with 0,1-edges with approximation ratio \rho, then running {\mathcal{A}} on the rounding of G achieves an approximation ratio of 2\rho+1. Therefore our algorithms also provide (O(1),\varepsilon)-approximations for correlation clustering in the more general weighted case.

Neighborhood oracles. If, given vv, we can obtain a linked list of the positive neighbours of vv (in time linear in its length), then it is possible to obtain a multiplicative (O(1),0)(O(1),0)-approximation in time O(n3/2)O(n^{3/2}), which is sublinear. Indeed, Ailon and Liberty [4] argue that with a neighborhood oracle, QuickCluster runs in time O(n+OPT)O(n+OPT); if OPTn3/2OPT\leq n^{3/2} this is O(n3/2)O(n^{3/2}). On the other hand, if we set ε=n1/2\varepsilon=n^{-1/2} in our algorithm, we obtain in time O(n3/2)O(n^{3/2}) a (O(1),n1/2)(O(1),n^{-1/2})-approximation, which is also a (O(1),0)(O(1),0)-approximation when OPTn3/2OPT\geq n^{3/2}. So we can run QuickCluster for O(n3/2)O(n^{3/2}) steps and output the result; if it doesn’t finish, we run our algorithm with ε=n1/2\varepsilon=n^{-1/2}.
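A hedged sketch of this combination follows; the names `quick_cluster`, `local_epsilon_cluster`, `graph.num_vertices()` and the step-budget mechanism are ours, standing in for QuickCluster with a neighborhood oracle and for our local algorithm, respectively.

class StepBudgetExceeded(Exception):
    """Raised by the (hypothetical) quick_cluster when it exceeds its step budget."""

def constant_factor_cluster(graph, quick_cluster, local_epsilon_cluster):
    n = graph.num_vertices()
    budget = int(n ** 1.5)                 # O(n^{3/2}) step budget
    try:
        # QuickCluster finishes within the budget whenever OPT <= n^{3/2}.
        return quick_cluster(graph, step_budget=budget)
    except StepBudgetExceeded:
        # Otherwise OPT is large, so an additive n^{3/2} error is also a
        # constant-factor multiplicative error.
        return local_epsilon_cluster(graph, eps=n ** -0.5)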

Distributed/streaming clustering. In Section LABEL:sec:intro we mentioned that there are general transformations from local clustering algorithms into distributed/streaming algorithms. For our local algorithm from Section LABEL:sec:main we can do the following. Suppose that to each processor PP is assigned a subset APA_{P} of the pairs V×VV\times V, so that PP can compute (or has information about) whether there is a positive edge between xx and yy for the pairs (x,y)AP(x,y)\in A_{P}. (The assignment of vertex pairs to processors can be arbitrary, as long as they partition V×VV\times V.) Then each processor selects the same random vertex subset SVS\subseteq V of size O(1/ε)O(1/\varepsilon), and discards (or does not query/compute) the edges not incident with SS among those it can see (APA_{P}). After this, each processor outputs, for each vv, the pairs (v,w)(v,w) in APA_{P} (note that for each vv, there are only O(1/ε)O(1/\varepsilon) different such pairs). With this information the pivot set TT is the subset of SS with no neighbour smaller than itself (in some random order), and then the label of vv’s cluster is the first element of TT adjacent to vv. This can be computed easily in another round.

Note that the sum of the memory used by all processors is O(n/ε)O(n/\varepsilon), so for constant ε\varepsilon we also get a (semi-)streaming algorithm that makes one pass over the data, with edges arriving in arbitrary order. In two passes we can reduce memory usage to O(n+1/ε2)O(n+1/\varepsilon^{2}): first store the adjacency matrix of the subgraph induced by the random sample SS, and compute the set TT of pivots. In the second pass, keep an integer for each vv that indicates the first element of TT that has appeared as a neighbour of vv in the edges seen so far. At the end, this integer will be vv’s cluster.
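The two-pass variant can be sketched as follows (illustrative code, ours; it assumes vertices are numbered 0,\ldots,n-1, that the edge stream can be replayed once, and that “first element of T” refers to the random order on the sample):

def two_pass_streaming_cluster(n, edge_stream, sample_S, rank):
    S = set(sample_S)
    # Pass 1: store only the adjacency among sampled vertices (O(1/eps^2) memory).
    adj = {v: set() for v in S}
    for (u, v) in edge_stream():
        if u in S and v in S:
            adj[u].add(v)
            adj[v].add(u)
    # Pivot set T: sampled vertices with no neighbour of smaller rank inside S.
    pivots = {v for v in S if all(rank[w] > rank[v] for w in adj[v])}
    # Pass 2: one integer per vertex, the smallest-rank pivot seen among its neighbours.
    best = [None] * n
    for (u, v) in edge_stream():
        for a, b in ((u, v), (v, u)):
            if b in pivots and (best[a] is None or rank[b] < rank[best[a]]):
                best[a] = b
    # Vertices with no pivot neighbour become singletons labeled by themselves.
    return [best[v] if best[v] is not None else v for v in range(n)]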

Variants of correlation clustering. The second algorithm (based on cut matrices) can easily be extended to chromatic correlation clustering [12], and to bi-clustering (co-clustering) [13].

8 Concluding remarks

This paper initiates the investigation into local correlation clustering, devising algorithms with sublinear time and query complexity. The tradeoff between the running time of our algorithms and the quality of the solution found is close to optimal. Moreover, our solutions are amenable to simple implementations and they can also be naturally adapted to the distributed and streaming settings in order to improve their latency or memory usage.

The notion of local clustering introduced in this paper opens an interesting line of work, which might lead to various contributions in more applied scenarios. For instance, the ability of local algorithms to quickly estimate (among other things) the cost of the best clustering can provide a powerful primitive for decision-making, around which to build new data analysis frameworks. The streaming capabilities of the algorithms may also prove useful in clustering large-scale evolving graphs: this might be applied to detect communities in on-line social networks.

Another intriguing question is whether one can devise other graph-querying models that allow for improved theoretical results while being reasonable from a practical viewpoint. The O(n^{3/2})-time constant-factor approximation algorithm using neighborhood oracles that we discussed in Section LABEL:sec:extensions suggests that this may be a fruitful direction to pursue in further research. The question seems particularly relevant for applying local techniques to very sparse graphs.

References

  • [1] N. Ailon, R. Begleiter, and E. Ezra. Active learning using smooth relative regret approximations with applications. In Proc. of 25th COLT, pages 19.1–19.20, 2012.
  • [2] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: Ranking and clustering. Journal of the ACM, 55(5), 2008.
  • [3] N. Ailon, B. Chazelle, S. Comandur, and D. Liu. Property-preserving data reconstruction. Algorithmica, 51(2):160–182, 2008.
  • [4] N. Ailon and E. Liberty. Correlation clustering revisited: The “true” cost of error minimization problems. In Proc. of 36th ICALP, pages 24–36, 2009.
  • [5] N. Alon, S. Dar, M. Parnas, and D. Ron. Testing of clustering. SIAM Journal on Discrete Mathematics, 16(3):393–417, 2003.
  • [6] N. Alon, R. A. Duke, H. Lefmann, V. Rödl, and R. Yuster. The algorithmic aspects of the regularity lemma. Journal of Algorithms, 16(1):80–109, 1994.
  • [7] N. Alon, W. Fernández de la Vega, R. Kannan, and M. Karpinski. Random sampling and approximation of MAX-CSPs. Journal of Computer and System Sciences, 67(2):212–243, 2003.
  • [8] N. Alon and A. Shapira. A characterization of the (natural) graph properties testable with one-sided error. SIAM Journal on Computing, 37:1703–1727, 2008.
  • [9] M.-F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In Proc. of 40th STOC, pages 671–680, 2008.
  • [10] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.
  • [11] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281–297, 1999.
  • [12] F. Bonchi, A. Gionis, F. Gullo, and A. Ukkonen. Chromatic correlation clustering. In Proc. of 18th KDD, pages 1321–1329, 2012.
  • [13] S. Busygin, O. A. Prokopyev, and P. M. Pardalos. Biclustering in data mining. Computers & Operations Research, 35(9):2964–2987, 2008.
  • [14] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005.
  • [15] A. Czumaj and C. Sohler. Sublinear-time approximation algorithms for clustering via random sampling. Random Structures and Algorithms, 30(1–2):226–256, 2007.
  • [16] A. Czumaj and C. Sohler. Small space representations for metric min-sum k-clustering and their applications. Theoretical Computer Science, 46(3):416–442, 2010.
  • [17] E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2-3):172–187, 2006.
  • [18] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207–216, 2005.
  • [19] E. Fischer, A. Matsliah, and A. Shapira. Approximate hypergraph partitioning and applications. SIAM Journal on Computing, 39:3155–3185, 2010.
  • [20] E. Fischer and I. Newman. Testing versus estimation of graph properties. SIAM Journal on Computing, 37(2):482–501, 2007.
  • [21] A. M. Frieze and R. Kannan. The regularity lemma and approximation schemes for dense problems. In Proc. of 37th FOCS, pages 12–20, 1996.
  • [22] A. M. Frieze and R. Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175–220, 1999.
  • [23] A. M. Frieze and R. Kannan. A simple algorithm for constructing Szemerédi’s regularity partition. Electronic Journal of Combinatorics, 6, 1999.
  • [24] A. M. Frieze, R. Kannan, and S. Vempala. Fast Monte Carlo algorithms for finding low-rank approximations. Journal of the ACM, 51(6):1025–1041, 2004.
  • [25] I. Giotis and V. Guruswami. Correlation clustering with a fixed number of clusters. Theory of Computing, 2(13):249–266, 2006.
  • [26] T. Gowers. Lower bounds of tower type for Szemerédi’s uniformity lemma. Geometric and Functional Analysis, 7(2):322–337, 1997.
  • [27] O. Hassanzadeh, F. Chiang, R. J. Miller, and H. C. Lee. Framework for evaluating clustering algorithms in duplicate detection. PVLDB, 2(1):1282–1293, 2009.
  • [28] T. Hofmann and J. M. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems 10 (NIPS), 1997.
  • [29] M. Karpinski and W. Schudy. Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems. In Proc. of 41st STOC, pages 313–322, 2009.
  • [30] S. Kim, S. Nowozin, P. Kohli, and C. D. Yoo. Higher-order correlation clustering for image segmentation. In NIPS, pages 1530–1538, 2011.
  • [31] J. Komlós, A. Shokoufandeh, M. Simonovits, and E. Szemerédi. The regularity lemma and its applications in graph theory. In Proc. of 19th STACS, pages 84–112, 2000.
  • [32] N. Mishra, D. Oblinger, and L. Pitt. Sublinear time approximate clustering. In Proc. of 12th SODA, pages 439–447, 2001.
  • [33] M. Parnas and D. Ron. Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theoretical Computer Science, 381(1-3):183–196, 2007.
  • [34] R. Rubinfeld, G. Tamir, S. Vardi, and N. Xie. Fast local computation algorithms. In Proc. of 2nd ITCS, pages 223–238, 2011.
  • [35] M. E. Saks and C. Seshadhri. Local monotonicity reconstruction. SIAM Journal on Computing, 39(7):2897–2926, 2010.
  • [36] G. N. Sárközy, F. Song, E. Szemerédi, and S. Trivedi. A practical regularity partitioning algorithm and its applications in clustering. Computing Research Repository, abs/1209.6540, 2012.
  • [37] R. Shamir, R. Sharan, and D. Tsur. Cluster graph modification problems. Discrete Applied Mathematics, 144(1-2):173–182, 2004.
  • [38] A. Sperotto and M. Pelillo. Szemerédi’s regularity lemma and its applications to pairwise clustering and segmentation. In Proc. of 6th Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 13–27, 2007.
  • [39] D. A. Spielman and S.-H. Teng. A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM Journal on Computing, 42(1):1–26, 2013.
  • [40] J. Suomela. Survey of local algorithms. ACM Computing Surveys, 45(2):24:1–24:40, Mar. 2013.
  • [41] C. Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proc. of 15th SODA, pages 526–527, 2004.
  • [42] A. C.-C. Yao. Probabilistic computations: Toward a unified measure of complexity. In Proc. of 18th FOCS, pages 222–227, 1977.

Appendix A A sharper bound for Lemma 4.10

Lemma A.1

Let a_{0}\in(0,1) and define a sequence by a_{i+1}=a_{i}(1-a_{i}) for i\geq 0. Then for all j\geq 1,

a_{j}\leq\frac{1}{\frac{1}{\widehat{a}_{0}}+j+\ln(\widehat{a}_{0}\,j)-o_{j}(1)},

where \widehat{a}_{0}=\min(a_{0},1-a_{0}).

Proof.  Since replacing a_{0} with 1-a_{0} does not affect the terms a_{j} for j\geq 1, we can assume a_{0}=\widehat{a}_{0}\leq 1/2, in which case the result holds also for j=0. Set m_{i}=\frac{1}{a_{i}} for all i\geq 0. Then m_{0}=\frac{1}{a_{0}}\geq 2 and

m_{i+1}-m_{i}=\frac{1}{\frac{1}{m_{i}}(1-\frac{1}{m_{i}})}-m_{i}=\frac{m_{i}^{2}}{m_{i}-1}-m_{i}=1+\frac{1}{m_{i}-1}.

In other words, for all i\geq 1 we have

m_{i}=m_{0}+i+\sum_{j=0}^{i-1}\frac{1}{m_{j}-1}.

Thus m_{i}\geq m_{1}\geq 4, m_{i}\leq m_{0}+\frac{4}{3}i\leq m_{0}+2i+1, and

m_{i}\geq m_{0}+i+\sum_{j=1}^{i-1}\frac{1}{m_{0}+2j}=m_{0}+i+\frac{1}{2}\sum_{j=1}^{i-1}\frac{1}{\frac{m_{0}}{2}+j}.

Since by the integral test

\ln\left(\frac{b+1}{a}\right)\leq\sum_{k=a}^{b}\frac{1}{k}\leq\ln\left(\frac{b}{a-1}\right),

we have

\sum_{j=1}^{i-1}\frac{1}{\frac{m_{0}}{2}+j}\geq\ln\left(1+\frac{2(i-1)}{m_{0}+2}\right)\geq\ln\left(1+\frac{i-1}{m_{0}}\right),

so

m_{i}\geq m_{0}+i+\frac{1}{2}\ln\left(1+\widehat{a}_{0}\cdot(i-1)\right),

as we wished to show.      
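As a quick numerical sanity check (not part of the proof), the identity m_{i+1}-m_{i}=1+\frac{1}{m_{i}-1} and the roughly 1/j decay of a_{j} can be verified as follows:

def check(a0=0.3, steps=10):
    a, m = a0, 1.0 / a0
    for i in range(steps):
        a_next = a * (1 - a)
        m_next = 1.0 / a_next
        # identity from the proof: m_{i+1} = m_i + 1 + 1/(m_i - 1)
        assert abs(m_next - (m + 1 + 1 / (m - 1))) < 1e-9
        a, m = a_next, m_next
    print(f"a_{steps} = {a:.6f}, compare 1/(1/a0 + {steps}) = {1/(1/a0 + steps):.6f}")

check()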

Appendix B Finding cut decompositions implicitly

We give here an overview of Frieze and Kannan's method [22]. In essence, the process works with the submatrix induced by certain randomly chosen subsets U,V of size {\operatorname{poly}}(1/\varepsilon) and defines \mathbb{I}\,[x\in R_{i}] and \mathbb{I}\,[y\in S_{i}] in terms of the adjacencies (matrix entries) of x and y with U and V.

We start with the following simple exponential-time algorithm for finding cut decompositions. Suppose we have found cut matrices D0,,Di1D_{0},\ldots,D_{i-1} and we want to find DiD_{i}. Let Wi=Aj<iDjW_{i}=A-\sum_{j<i}D_{j} be the residual matrix. While there exist sets Ri,SiR_{i}^{\prime},S_{i}^{\prime} with

|W_{i}(R_{i}^{\prime},S_{i}^{\prime})|\geq\varepsilon\sqrt{|R_{i}^{\prime}||S_{i}^{\prime}|}\sqrt{mn}, (8)

let R_{i}=R_{i}^{\prime}, S_{i}=S_{i}^{\prime}, d_{i}=W_{i}(R_{i},S_{i})/(|R_{i}||S_{i}|), and add D_{i}=CUT(R_{i},S_{i},d_{i}) to the decomposition. An easy computation shows that the squared Frobenius norm of the residual matrix decreases by W_{i}(R_{i}^{\prime},S_{i}^{\prime})^{2}/(|R_{i}^{\prime}||S_{i}^{\prime}|), i.e., by at least an \varepsilon^{2} fraction of \lVert{A}\rVert_{F}^{2}\leq mn. Therefore this process cannot go on for more than 1/\varepsilon^{2} steps. This proves the existence of cut decompositions, albeit without an efficient construction.

How can this procedure be made to run in time independent of the matrix size? We have some slack here: we may replace \varepsilon with a polynomial in \varepsilon of larger exponent. Frieze and Kannan pick a row set R_{i}\subseteq{\mathcal{R}} and then use a sampling-based procedure to construct a column set S_{i}\subseteq{\mathcal{S}} such that the R_{i}\times S_{i} submatrix is sufficiently dense. Provided that the entries in the matrix W_{i} remain bounded and inequality (LABEL:eq:weight) holds for some R_{i}^{\prime},S_{i}^{\prime}, they are able to find R_{i}\subseteq{\mathcal{R}}, S_{i}\subseteq{\mathcal{S}} such that

|W_{i}(R_{i},S_{i})|\geq{\operatorname{poly}}(\varepsilon)\cdot mn,

which implies a {\operatorname{poly}}(\varepsilon)-fractional decrease in the squared Frobenius norm of the residual matrix. They show that, with probability at least {\operatorname{poly}}(\varepsilon), we can take for R_{i} the set of all x\in{\mathcal{R}} with W_{i}(x,v)\cdot\nu\geq\nu^{2} for some randomly chosen v\in{\mathcal{S}} and \nu\in[-1,1], and for S_{i} the set of all y\in{\mathcal{S}} with W_{i}(R_{i},y)\cdot\nu\geq 0.

We need to deal with how to represent the sets Ri,SiR_{i},S_{i} used in the decomposition in an implicit manner. We will write down a predicate that, given i[s]i\in[s] and xx\in{\mathcal{R}} (resp., y𝒮y\in{\mathcal{S}}), tells us whether xRix\in R_{i} (resp., ySiy\in S_{i}) and can be evaluated in time poly(1/ε){\operatorname{poly}}(1/\varepsilon) by making queries to AA. Although the size of RiR_{i}\subseteq{\mathcal{R}} may be linear in mm, its definition makes it possible to check for membership in RiR_{i} with one query to WiW_{i}. The set SiS_{i}, for its part, does not admit such a quick membership test, so Frieze and Kannan work with an approximation achieved by replacing RiR_{i} with a poly(1/ε){\operatorname{poly}}(1/\varepsilon)-sized portion thereof in the definition of SiS_{i}. With the new definition, membership in RiR_{i} and SiS_{i} can be computed in time poly(1/ε){\operatorname{poly}}(1/\varepsilon), as we shall see. Also, the density di=Wi(Ri,Si)/(|Ri||Si|)d_{i}=W_{i}(R_{i},S_{i})/(|R_{i}||S_{i}|) can be estimated to within ±ε2mn/16\pm\varepsilon^{2}mn/16 accuracy by sampling with poly(1/ε){\operatorname{poly}}(1/\varepsilon) queries to WiW_{i}.

Summarizing, we can build a cut decomposition in the following way. Let s={\operatorname{poly}}(1/\varepsilon). At each stage i, i=0,1,\ldots,s-1, the first i cut matrices CUT(R_{j},S_{j},d_{j}), j<i, are implicitly known. Given the previous i cut matrices, the residual matrix W_{i} is given by

W_{i}(x,y)=A(x,y)-\sum_{j<i}d_{j}\cdot\mathbb{I}\,[x\in R_{j}]\cdot\mathbb{I}\,[y\in S_{j}];

extend the notation to sets in the obvious manner.

The set R_{i} is defined in terms of a random element v_{i}\in{\mathcal{S}} and a random real \nu_{i}\in[-1,1] by

R_{i}=\{x\in{\mathcal{R}}\mid W_{i}(x,v_{i})\cdot\nu_{i}\geq\nu_{i}^{2}\}.

The set SiS_{i} is defined in terms of RiR_{i} and a random sample UiU_{i}\subseteq{\mathcal{R}} of size poly(1/ε){\operatorname{poly}}(1/\varepsilon) by

S_{i}=\{y\in{\mathcal{S}}\mid W_{i}(U_{i}\cap R_{i},y)\cdot\nu_{i}\geq 0\}.

Finally, the density did_{i} is defined in terms of another random sample Zi×𝒮Z_{i}\subseteq{\mathcal{R}}\times{\mathcal{S}} of size poly(1/ε){\operatorname{poly}}(1/\varepsilon) by

d_{i}=\operatorname*{\mathbb{E}}_{(x,y)\in Z_{i}}[W_{i}(x,y)\mid(x,y)\in R_{i}\times S_{i}].

Let U=\bigcup_{i}(U_{i}\cup\Pi_{1}(Z_{i})) and V=\bigcup_{i}(\{v_{i}\}\cup\Pi_{2}(Z_{i})). We need to compute W_{i}(u,v), \mathbb{I}\,[u\in R_{j}], \mathbb{I}\,[v\in S_{j}] and d_{j} for all (u,v)\in U\times V and all j\leq i. This can be done in time {\operatorname{poly}}(s/\varepsilon) using dynamic programming and the formulas above. This allows us to compute W_{i}(x,y) for any x,y in time {\operatorname{poly}}(s/\varepsilon)={\operatorname{poly}}(1/\varepsilon).
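Following the formulas above, the implicit membership tests can be evaluated recursively, as in the following sketch (ours; `A` is the query oracle, and `v`, `nu`, `U`, `Z` hold the random choices v_{i},\nu_{i},U_{i},Z_{i}). Memoizing the {\operatorname{poly}}(1/\varepsilon)-sized tables over U\times V as described above gives the claimed running time; this sketch illustrates only the logic of the definitions, not that optimization.

from functools import lru_cache

class ImplicitFK:
    def __init__(self, A, v, nu, U, Z):
        self.A, self.v, self.nu, self.U, self.Z = A, v, nu, U, Z

    def W(self, i, x, y):
        # residual matrix W_i(x, y), defined via the previous cut matrices
        return self.A(x, y) - sum(self.d(j) * self.in_R(j, x) * self.in_S(j, y)
                                  for j in range(i))

    @lru_cache(maxsize=None)
    def in_R(self, i, x):
        # indicator I[x in R_i], from the random column v_i and threshold nu_i
        return self.W(i, x, self.v[i]) * self.nu[i] >= self.nu[i] ** 2

    @lru_cache(maxsize=None)
    def in_S(self, i, y):
        # indicator I[y in S_i], using only the poly(1/eps)-sized sample U_i of R_i
        total = sum(self.W(i, u, y) for u in self.U[i] if self.in_R(i, u))
        return total * self.nu[i] >= 0

    @lru_cache(maxsize=None)
    def d(self, i):
        # estimated density d_i: average of W_i over the sampled pairs in R_i x S_i
        inside = [self.W(i, x, y) for (x, y) in self.Z[i]
                  if self.in_R(i, x) and self.in_S(i, y)]
        return sum(inside) / len(inside) if inside else 0.0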