
Improved Hierarchical Clustering on Massive Datasets with Broad Guarantees

MohammadTaghi Hajiaghayi
Department of Computer Science
University of Maryland, College Park
hajiaghayi@gmail.com
Marina Knittel
Department of Computer Science
University of Maryland, College Park
mknittel@umd.edu
Abstract

Hierarchical clustering is a stronger extension of one of today's most influential unsupervised learning methods: clustering. The goal of this method is to create a hierarchy of clusters, thus constructing cluster evolutionary history and simultaneously finding clusterings at all resolutions. We propose four traits of interest for hierarchical clustering algorithms: (1) empirical performance, (2) theoretical guarantees, (3) cluster balance, and (4) scalability. While a number of algorithms are designed to achieve one or two of these traits at a time, none achieves all four.

Inspired by Bateni et al.'s scalable and empirically successful Affinity Clustering [NeurIPS 2017], we introduce Affinity Clustering's successor, Matching Affinity Clustering. Like its predecessor, Matching Affinity Clustering maintains strong empirical performance and uses Massively Parallel Communication as its distributed model. Designed to maintain provably balanced clusters, our algorithm achieves good, constant factor approximations for Moseley and Wang's revenue and Cohen-Addad et al.'s value. We show Affinity Clustering cannot approximate either function. Along the way, we also introduce an efficient k-sized maximum matching algorithm in the MPC model.

1 Introduction

Clustering is one of the most prominent methods to provide structure, in this case clusters, to unlabeled data. It requires a single parameter k for the number of clusters. Hierarchical clustering elaborates on this structure by adding a hierarchy of clusters contained within superclusters. This problem is unparameterized, and takes in data as a graph whose edge weights represent the similarity or dissimilarity between data points. A hierarchical clustering algorithm outputs a tree T, whose leaves represent the input data, internal nodes represent the merging of data and clusters into clusters and superclusters, and root represents the cluster of all data.

Obviously it is more computationally intensive to find T as opposed to a flat clustering. However, having access to such a structure provides two main advantages: (1) it allows a user to observe the data at different levels of granularity, effectively querying the structure for clusterings of size k without recomputation, and (2) it constructs a history of data relationships that can yield additional perspectives. The latter is most readily applied to phylogenetics, where dendrograms depict the evolutionary history of genes and species (Kraskov et al., 2003). Hierarchical clustering in general has been used in a number of other unsupervised applications. In this paper, we explore four important qualities of a strong and efficient hierarchical clustering algorithm:

  1. Theoretical guarantees. Previously, analysis of hierarchical clustering algorithms has relied on experimental evaluation. While this is one indicator of success, it cannot ensure performance across a large range of datasets. Researchers combat this by considering optimization functions to evaluate broader guarantees (Charikar et al., 2004; Lin et al., 2006). One function that has received significant attention recently (Charikar et al., 2018; Cohen-Addad et al., 2018) is the hierarchical clustering cost function proposed by Dasgupta (2016). This function is simple and intuitive; however, Charikar & Chatziafratis (2017) showed that it is likely not constant-factor approximable. To overcome this, we examine its dual, revenue, proposed by Moseley & Wang (2017), which considers a graph with similarity-based edge weights. For dissimilarity-based edge weights, we look to Cohen-Addad et al. (2018)'s value, another cost-inspired function. We are interested in constant factor approximations for these functions.

  2. Empirical performance. As theoretical guarantees are often only intuitive proxies for broader evaluation, it is still important to evaluate the empirical performance of algorithms on specific, real datasets. Currently, Bateni et al. (2017)'s Affinity Clustering remains the state-of-the-art for scalable hierarchical clustering algorithms with strong empirical results. With Affinity Clustering as an inspirational baseline for our algorithm, we strive to preserve and, hopefully, extend Affinity Clustering's empirical success.

  3. Balance. One downside of algorithms like Affinity Clustering is that they are prone to creating extremely unbalanced clusters. There are a number of natural clustering problems where balanced clusters are preferable or more accurate for the problem, for example, clustering a population into genders. Some more specific applications include image collection clustering, where balanced clusters can make the database more easily navigable (Dengel et al., 2011), and wireless sensor networks, where balancing clusters of sensor nodes ensures no cluster head gets overloaded (Amgoth & Jana, 2014). Here, we define balance as the minimum ratio between cluster sizes.

  4. Scalability. Most current approximations for revenue are serial and do not ensure performance at scale. We achieve scalability through distributed computation. Clustering itself, as well as many other big data problems, has been a topic of interest in the distributed community in recent years (Chitnis et al., 2015, 2016; Ghaffari et al., 2019). In particular, hierarchical clustering has been studied by Jin et al. (2013, 2015), but only Bateni et al. (2017) have attempted to ensure theoretical guarantees, through the introduction of a Steiner-based cost metric. However, they provide little motivation for its use. Therefore, we are interested in evaluating distributed algorithms with respect to more well-founded optimization functions like revenue and value.

    For our distributed model, we look to Massively Parallel Communication (MPC), which was used to design Affinity Clustering. MPC is a restrictive, theoretical abstraction of MapReduce: a popular programming framework famous for its ease of use, fault tolerance, and scalability (Dean & Ghemawat, 2008a). In the MPC model, individual machines carry only a fraction of the data and execute individual computations in rounds. At the end of each round, machines send limited messages to each other. Complexities of interest are the number of rounds and the individual machine space. This framework has been used in the analysis for many large-scale problems in big data, including clustering (Im et al., 2017; Ludwig, 2015; Ghaffari et al., 2019). It is a natural selection for this work.

1.1 Related Work

There exist algorithms that can achieve up to two of these qualities at a time. Affinity Clustering, notably, exhibits good empirical performance and scalability using MPC. While Bateni et al. (2017) describe some minor theoretical guarantees for Affinity Clustering, we believe that proving an algorithm's ability to optimize for revenue and value is a stronger and more well-founded result due to their popularity and relation to Dasgupta's cost function. A simple random divisive algorithm proposed by Charikar & Chatziafratis (2017) was shown to achieve a 1/3 expected approximation for revenue and can be efficiently implemented using MPC. However, it is notably nondeterministic, and we show that it does not exhibit good empirical performance. Similarly, balanced partitioning may achieve balanced clusters, but it is unclear whether it is scalable, and it has not been shown to achieve strong theoretical guarantees.

For both revenue and value, Average Linkage achieves near-state-of-the-art 1/3- and 2/3-approximations respectively (Moseley & Wang, 2017; Cohen-Addad et al., 2018). Charikar et al. (2019) marginally improve these factors to 1/3+\epsilon for revenue and 2/3+\epsilon for value through semi-definite programming (SDP, a non-distributable method). However, since value and revenue both strove to characterize Average Linkage's optimization goal, and Average Linkage was only marginally beaten by an SDP, we do not expect to surpass Average Linkage in the restrictive distributed context.

After the completion of our work, a new algorithm was introduced that achieves an impressive 0.585 approximation for revenue (Alon et al., 2020). While this result was unknown during the course of this work, it sets a new standard to strive for in future work.

1.2 Our contributions

In this work, we propose a new algorithm, Matching Affinity Clustering, for distributed hierarchical clustering. Inspired by Affinity Clustering’s reliance on the minimum spanning tree in order to greedily merge clusters (Bateni et al., 2017), Matching Affinity Clustering merges clusters based on iterative matchings. It notably generalizes to both the edge weight similarity and dissimilarity contexts, and achieves all four desired qualities.

In Section 4, we theoretically motivate Matching Affinity Clustering by proving it achieves a good approximation for both revenue and value (the latter depending on the existence of an MPC minimum matching algorithm), nearing the bounds achieved by Average Linkage:

Theorem 1.

In the revenue context (where edge weights are data similarity), with \widetilde{O}(n) machine space, Matching Affinity Clustering achieves (whp):

  • a (1/3-\epsilon)-approximation for revenue in O(\log(n)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds when n=2^{N},

  • and a (1/9-\epsilon)-approximation for revenue in O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds in general.

Theorem 2.

Assume there exists an MPC algorithm that achieves an \alpha-approximation for minimum weight k-sized matching whp in O(f(n)) rounds and \widetilde{O}(n) machine space. In the value context (where edge weights are data distances) and in O(f(n)\log(n)) rounds with \widetilde{O}(n) machine space, Matching Affinity Clustering achieves (whp):

  • a \frac{2}{3}\alpha-approximation for value when n=2^{N},

  • and a \frac{1}{3}\alpha-approximation for value in general.

Furthermore, in Theorem 3, we prove that Affinity Clustering can give no guarantees with respect to revenue or value. The discussion and proof of this theorem can be found in the Appendix.

Theorem 3.

Affinity Clustering cannot achieve better than an O(1/n)-factor approximation for revenue or value.

We also present an efficient and near-optimal MPC algorithm for k-sized maximum matching in Theorem 4 in Section 3. This is used by Matching Affinity Clustering.

To evaluate the empirical performance of our algorithm, we run Bateni et al. (2017)'s experiments used for Affinity Clustering on small-scale datasets in Section 6. We find Matching Affinity Clustering performs competitively with respect to state-of-the-art algorithms. On filtered, balanced data, we find that Matching Affinity Clustering consistently outperforms other algorithms by a small but clear margin. This implies Matching Affinity Clustering may be more useful on balanced datasets than Affinity Clustering.

To confirm the balance of our algorithm, we prove that Matching Affinity Clustering achieves perfectly balanced clusters on datasets of size 2^{N}, and otherwise guarantees near balance (a cluster size ratio of at most 2); see Lemma 1. This was also confirmed in our empirical evaluation in Section 6.

Finally, we show in Section 4 that Matching Affinity Clustering is highly scalable because it was designed in the same MPC framework as Affinity Clustering. We provide similar complexity guarantees to Affinity Clustering.

Matching Affinity Clustering is ultimately a nice, simply motivated successor to Affinity Clustering that achieves all four desired qualities: empirical performance, theoretical guarantees, balance, and scalability. No other algorithm that we know of does this.

2 Background

In this section, we describe basic notation, hierarchical clustering cost functions, and Massively Parallel Communication (MPC).

2.1 Preliminaries

The standard hierarchical clustering problem takes in a set of data represented as a graph where weights on edges measure similarity or dissimilarity between data. In this paper, edge weights are denoted w_{G}(u,v) for a graph G and may be similarities or dissimilarities, as specified.

One of the simplest serial hierarchical clustering algorithms is Average Linkage, which provides a good approximation for Moseley and Wang's dual (Moseley & Wang, 2017). It starts with each vertex as its own cluster, and builds the next cluster by merging the two clusters A and B that maximize the average similarity between points in the clusters:

\frac{1}{|A|\cdot|B|}\sum_{u\in A,v\in B}w_{G}(u,v).
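To make this merge criterion concrete, here is a minimal serial sketch (our own illustration, not the paper's code) that computes the average linkage between two clusters and selects the pair to merge; the helper names average_linkage and best_merge are ours, and missing pairs are treated as similarity zero.

import itertools

def average_linkage(w, A, B):
    # Average similarity between clusters A and B; w maps frozenset({u, v})
    # to a similarity, and missing pairs count as similarity 0.
    total = sum(w.get(frozenset((u, v)), 0.0) for u in A for v in B)
    return total / (len(A) * len(B))

def best_merge(w, clusters):
    # Return the pair of clusters with maximum average linkage.
    return max(itertools.combinations(clusters, 2),
               key=lambda pair: average_linkage(w, *pair))

# Tiny example on 4 points: {0,1} and {2,3} are the mutually similar pairs.
w = {frozenset((0, 1)): 1.0, frozenset((2, 3)): 1.0, frozenset((0, 2)): 0.1}
clusters = [frozenset([i]) for i in range(4)]
A, B = best_merge(w, clusters)
print(sorted(A | B))  # [0, 1]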

Finally, we note that many of our results hold with high probability (whp), as is true for many MPC algorithms. This means they hold with probability 1-1/\Omega(n), where the denominator is generally exponential in n. The probabilistic aspects of the algorithm come from our use of Ghaffari et al. (2018)'s maximum matching algorithm, which finds a (1+\epsilon)-approximate maximum matching with high probability. As we do not introduce any probabilistic aspect ourselves, this will not be discussed in depth in our proofs, but will be stated in the theorems and lemmas.

Figure 1: An example hierarchical tree T. i and j are data points, or leaves in the tree. Then T(i\lor j), the subtree in black, is rooted at their least common ancestor. The red vertices are the leaves of this subtree, and the blue are the non-leaves.

2.2 Optimization functions

Consider some hierarchical tree T. We write i\lor j for the least common ancestor of leaves i and j. The subtree rooted at an interior vertex v is T[v]; therefore, the subtree representing the smallest cluster that contains both i and j is T[i\lor j]. Let \operatorname{leaves}(T[v]) be the set of leaves in T[v], and \operatorname{non-leaves}(T[v]) be the set of leaves of T that are not in T[v]. Now we can describe Dasgupta's function.

Definition 1 (Dasgupta (2016)).

Dasgupta's cost function of tree T on graph G with similarity-based edge weights w_{G} is a minimization function:

\operatorname{cost}_{G}(T)=\sum_{i,j\in V(G)}w_{G}(i,j)|\operatorname{leaves}(T[i\lor j])|.

To minimize each edge's contribution, we want |\operatorname{leaves}(T[i\lor j])| to be small for heavy edges. This ensures that heavy edges are merged earlier in the tree. To calculate the cost, it is easier to break it down into a series of merge costs, one for each internal node of T. The merge cost counts the cost that accrues due to the merge at that node, so that we can keep track of the cost throughout the construction of T. It is defined as:

Definition 2 (Moseley & Wang (2017)).

The merge cost of a node in T which merges disjoint clusters A and B is:

\operatorname{merge-cost}_{G}(A,B)=|B|\sum_{a\in A,c\in G\setminus(A\cup B)}w_{G}(a,c)+|A|\sum_{b\in B,c\in G\setminus(A\cup B)}w_{G}(b,c).

This breaks down the cost of a hierarchy tree into a series of merge costs. Consider some edge (i,j). At each merge containing exclusively i or j, this edge contributes w_{G}(i,j) times the size of the other cluster. In the hierarchical tree, this counts how many vertices accrue during merges along the paths from i and j to i\lor j. However, this does not account for the leaves i and j themselves, so we need to add w_{G}(i,j) twice more on top of the merge costs. This means we can derive the total cost from the merge costs as: \operatorname{cost}_{G}(T)=2\sum_{i,j\in V(G)}w_{G}(i,j)+\sum_{\text{merges }A,B}\operatorname{merge-cost}_{G}(A,B).

Next, we consider Moseley and Wang's dual to Dasgupta's function (Moseley & Wang, 2017).

Definition 3 (Moseley & Wang (2017)).

The revenue of tree T on graph G with similarity-based edge weights is a maximization function:

\operatorname{rev}_{G}(T)=\sum_{i,j\in V(G)}w_{G}(i,j)|\operatorname{non-leaves}(T[i\lor j])|.

We can, in a similar fashion to Dasgupta’s cost function, break revenue down into a series of merge revenues.

Definition 4 (Moseley & Wang (2017)).

The merge revenue of a node in T which merges disjoint clusters A and B is:

\operatorname{merge-rev}_{G}(A,B)=(n-|A|-|B|)\sum_{a\in A,b\in B}w_{G}(a,b).

Note that for some i and j, w_{G}(i,j) is contributed exactly once, when i and j merge at i\lor j, and n-|A|-|B| is the number of non-leaves at that step. Therefore \operatorname{rev}_{G}(T)=\sum_{\text{merges }A,B}\operatorname{merge-rev}_{G}(A,B). In addition, note that the contribution of each pair i,j, scaled by w_{G}(i,j), is the number of leaves of T[i\lor j] for cost and the number of non-leaves of T[i\lor j] for revenue. Therefore the contribution of each edge to revenue is n minus its contribution to cost, scaled by w_{G}(i,j). In other words: \operatorname{rev}_{G}(T)=n\sum_{i,j\in V(G)}w_{G}(i,j)-\operatorname{cost}_{G}(T).
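As a concrete sanity check of these identities (our own illustration, assuming a fixed four-leaf tree ((0,1),(2,3)) and a small hand-picked weight function), the following snippet computes cost and revenue directly from the LCA definitions and verifies both the merge-cost decomposition and the duality between revenue and cost.

from itertools import combinations

# Similarity weights on 4 points; pairs not listed have weight 0.
w = {(0, 1): 3.0, (2, 3): 2.0, (0, 2): 1.0, (1, 3): 0.5}
def wt(i, j):
    return w.get((min(i, j), max(i, j)), 0.0)

n = 4
# Fixed hierarchy ((0,1),(2,3)): the leaf set under each pair's LCA.
def lca_leaves(i, j):
    for cluster in ({0, 1}, {2, 3}, {0, 1, 2, 3}):
        if i in cluster and j in cluster:
            return cluster

cost = sum(wt(i, j) * len(lca_leaves(i, j)) for i, j in combinations(range(n), 2))
rev = sum(wt(i, j) * (n - len(lca_leaves(i, j))) for i, j in combinations(range(n), 2))

# The three merges of the same tree, bottom-up.
merges = [({0}, {1}), ({2}, {3}), ({0, 1}, {2, 3})]
def merge_cost(A, B):
    out = set(range(n)) - A - B
    return (len(B) * sum(wt(a, c) for a in A for c in out)
            + len(A) * sum(wt(b, c) for b in B for c in out))
def merge_rev(A, B):
    return (n - len(A) - len(B)) * sum(wt(a, b) for a in A for b in B)

total_w = sum(wt(i, j) for i, j in combinations(range(n), 2))
assert cost == 2 * total_w + sum(merge_cost(A, B) for A, B in merges)
assert rev == sum(merge_rev(A, B) for A, B in merges)
assert rev == n * total_w - cost
print(cost, rev)  # 16.0 10.0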

While cost is a popular and well-founded metric, Charikar & Chatziafratis (2017) found that it is not constant factor approximable under the Small Set Expansion Hypothesis. On the other hand, Moseley & Wang (2017) proved that revenue is, and Average Linkage achieves a 1/3-approximation. This makes it a more practical function to work with.

Our other function of interest is Cohen-Addad et al. (2018)’s value function. This was introduced as a Dasgupta-inspired optimization function where edge weights represent distances. It looks exactly like cost, except it is now a maximization function because it is in the distance context.

Definition 5 (Cohen-Addad et al. (2018)).

The value of tree T on graph G with dissimilarity-based edge weights w_{G} is a maximization function:

\operatorname{val}_{G}(T)=\sum_{i,j\in V(G)}w_{G}(i,j)|\operatorname{leaves}(T[i\lor j])|.

Like revenue, value is constant factor approximable. In fact, the best approximation (other than an SDP) for value is Average Linkage's 2/3-approximation (Cohen-Addad et al., 2018). To our knowledge, there are no distributable approximations for value.

2.3 Massively Parallel Communication (MPC)

Massively Parallel Communication (MPC) is a model of distributed computation used in programmatic frameworks like MapReduce (Dean & Ghemawat, 2008b), Hadoop (White, 2009), and Spark (Zaharia et al., 2010). MPC consists of “rounds” of computation, where parts of the input are distributed across machines with limited memory, computation is done locally on each machine, and then the machines send limited messages to each other. The primary complexities of interest are machine space, which should be \widetilde{O}(n), and the number of rounds. Many MPC algorithms are extremely efficient. For instance, Affinity Clustering in some cases requires only constantly many rounds, and otherwise may use up to O(\log^{2}n) rounds (Bateni et al., 2017).

3 Finding a k-sized maximum matching

The algorithm we introduce in Section 4 requires the use of a (1-\epsilon)-approximation for the maximum k-sized (or smaller) matching, where k>n/2. For this we use Ghaffari et al. (2018)'s (1-\epsilon)-approximation for maximum matching in MPC, which runs in O(\log\log(n)\cdot(1/\epsilon)^{1/\epsilon}) rounds with O(n/polylog(n)) space. Inspired by the results of Hassin et al. (1997), we provide a distributed reduction from k-matching to matching. To do this, we add n-2k vertices and edges of weight Q (which is found with a binary search) between the new and original vertices, and run the matching algorithm. The algorithm is given below, and the proof can be found in the Appendix of the full paper.

Theorem 4.

There exists an MPC algorithm for k-sized maximum matching with nonnegative edge weights and max edge weight W for k>n/2 that achieves a (1-\epsilon)-approximation whp in O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{1/\epsilon}) rounds and O(n/polylog(n)) machine space.

1:  Let U be a set of dummy vertices with |U|=n-2k;
2:  Let \delta be the minimum of \epsilon and the value satisfying \epsilon=(c+1)\delta-\delta^{2} for k\leq cn;
3:  V^{\prime}\leftarrow V\cup U;
 // Constructing the transformed graph
4:  E^{\prime}\leftarrow E\cup\{(u,v):u\in U,v\in V\};
5:  while Binary search of Q\in[nW] for the min Q that results in |M|\leq k and \frac{1}{k(1-\delta)}w(M)\leq Q do
6:     w^{\prime}(u,v)=Q for all u\in U,v\in V;
7:     M\leftarrow\textsc{Match}(V^{\prime},E^{\prime});
 // Ghaffari et al. (2018)'s matching algorithm
8:     M\leftarrow M\setminus\{(u,v):(u,v)\in M,u\in U,v\in V\};
 // Remove edges not in G
9:  end while
Algorithm 1 Approximate k-Sized Matching
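For intuition, the following is a small serial simulation of this reduction (our own sketch, not the MPC implementation): networkx's exact max_weight_matching stands in for Ghaffari et al. (2018)'s distributed subroutine, Q is found by a simple linear scan rather than a binary search, and we assume k \leq n/2 so that the n-2k dummy vertices are well defined. The function name k_sized_matching is ours.

import networkx as nx

def k_sized_matching(G, k):
    # Max-weight matching of G with at most k edges (assumes k <= n/2).
    n = G.number_of_nodes()
    W = max((d["weight"] for _, _, d in G.edges(data=True)), default=0)
    for Q in range(0, n * W + 1):  # stand-in for the binary search over Q
        H = G.copy()
        dummies = [("dummy", i) for i in range(n - 2 * k)]
        for d in dummies:  # connect each dummy vertex to every real vertex
            for v in G.nodes:
                H.add_edge(d, v, weight=Q)
        M = nx.max_weight_matching(H)  # stand-in for the MPC matching algorithm
        M_G = {(u, v) for u, v in M if u in G and v in G}  # drop dummy edges
        if len(M_G) <= k:
            return M_G
    return set()

# Path on 6 vertices with edge weights 1..5; ask for a matching of at most 2 edges.
G = nx.path_graph(6)
for i, (u, v) in enumerate(G.edges):
    G[u][v]["weight"] = i + 1
print(k_sized_matching(G, 2))  # the two heaviest disjoint edges, (2,3) and (4,5)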

4 Bounds on a matching-based hierarchical clustering algorithm

We now introduce our main algorithm, Matching Affinity Clustering. For revenue, we show it achieves a \left(\frac{1}{3}-\epsilon\right)-approximation for graphs with 2^{N} vertices, and a \left(\frac{1}{9}-\epsilon\right)-approximation in general. Similarly, for value, we show it achieves a \frac{2}{3}\alpha-approximation for graphs with 2^{N} vertices, and a \frac{1}{3}\alpha-approximation in general, given an \alpha-approximation algorithm for minimum weighted k-sized matching in MPC.

4.1 Matching Affinity Clustering

Figure 2: An example of the first iteration of Matching Affinity Clustering. We start by doing a 6-sized matching on the current graph. We then duplicate unmatched vertices and merge to create the next cluster graph with 2^{N} vertices. In subsequent iterations, matchings are perfect. Edge weights are the Average Linkage between clusters (non-edges are zero).

Matching Affinity Clustering is defined in Algorithm 2. Its predecessor, Affinity Clustering, uses the MST to select edges to merge across, which sometimes causes imbalanced clusters. This is one reason why it cannot achieve a good approximation for revenue or value (Theorem 3). We fix this by instead using iterated maximum matchings (for similarity edge weights) and minimum perfect matchings (for dissimilarity edge weights). This ensures that on n=2^{N} vertices for some N, clusters will always be perfectly balanced. Otherwise, we will show how to achieve at least relative balance.

The algorithm starts with each of the n vertices as its own cluster. Let 2^{N} be the smallest power of two such that 2^{N}\geq n. It finds a maximum (resp. minimum) matching of size k=2n-2^{N} (line 8; that is, it matches 2n-2^{N} vertices with n-2^{N-1} edges) and merges the matched vertices (line 12, Figure 2). Note that if n=2^{N}, then 2n-2^{N}=n and the first step is a perfect matching. After this step, we have 2^{N-1} clusters. We then transform the graph into a graph of clusters with edge weights equal to the Average Linkage between clusters (lines 17-21). We find a maximum (resp. minimum) perfect matching of clusters in this new graph (line 10), then iterate.

1:  Input: A graph G with weight function w:E(G)\to\mathbb{Z}^{+}.
2:  n\leftarrow|V|
3:  N\leftarrow the integer such that 2^{N-1}<n\leq 2^{N}
4:  \mathcal{C}\leftarrow G;
 // Current clustering graph, see Definition 6
5:  while n>1 do
6:     Yield \mathcal{C};
 // Output each level of the hierarchy
7:     if First iteration then
8:         M\leftarrow\textsc{kMatch}(\mathcal{C},2n-2^{N});
 // Alg. 1
9:     else
10:        M\leftarrow\textsc{Match}(\mathcal{C});
 // Ghaffari et al. (2018)
11:     end if
12:     V\leftarrow\{v=(i,j):(i,j)\in M\};
13:     E\leftarrow V\times V;
14:     w\leftarrow\emptyset;
15:     n\leftarrow|V|;
16:     Allocate each C_{j}\in V to a machine;
17:     for Every machine m_{j} on C_{j} that merged A_{j},B_{j}\in V(\mathcal{C}) do
18:        for Every other C_{k}\in V that merged A_{k},B_{k}\in V(\mathcal{C}) do
19:           w(C_{j},C_{k})\leftarrow\frac{1}{4}(w_{\mathcal{C}}(A_{j},A_{k})+w_{\mathcal{C}}(A_{j},B_{k})+w_{\mathcal{C}}(B_{j},A_{k})+w_{\mathcal{C}}(B_{j},B_{k}))
20:        end for
21:     end for
22:     \mathcal{C}\leftarrow(V,E,w)
23:  end while
Algorithm 2 Matching Affinity Clustering
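The following serial reference sketch (our own, for the similarity setting with n=2^{N} so the k-sized first step can be skipped) mirrors the loop above: at each level it builds the clustering graph with average-linkage edge weights and pairs clusters with a maximum-weight perfect matching, again using networkx as a stand-in for the MPC matching subroutine. The names average_linkage and matching_affinity_clustering are ours.

import networkx as nx

def average_linkage(w, A, B):
    return sum(w.get(frozenset((a, b)), 0.0) for a in A for b in B) / (len(A) * len(B))

def matching_affinity_clustering(points, w):
    # Returns one clustering per level of the hierarchy; assumes |points| = 2^N.
    clusters = [frozenset([p]) for p in points]
    levels = [clusters]
    while len(clusters) > 1:
        H = nx.Graph()
        H.add_nodes_from(range(len(clusters)))
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                H.add_edge(i, j, weight=average_linkage(w, clusters[i], clusters[j]))
        # Maximum-weight perfect matching of the clustering graph.
        M = nx.max_weight_matching(H, maxcardinality=True)
        clusters = [clusters[i] | clusters[j] for i, j in M]
        levels.append(clusters)
    return levels

# 4 points where {0,1} and {2,3} are the natural bottom-level clusters.
w = {frozenset((0, 1)): 1.0, frozenset((2, 3)): 1.0, frozenset((0, 2)): 0.1}
for level in matching_affinity_clustering(range(4), w):
    print([sorted(c) for c in level])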

4.2 Revenue approximation

Now, we evaluate the efficiency and approximation factor of Matching Affinity Clustering with respect to revenue. In this section, edge weights represent the similarity between points. Proofs are in the Appendix in the full version of the paper. Ultimately, we will show the following.

Theorem 1.

In the revenue context (where edge weights are data similarity), with \widetilde{O}(n) machine space, Matching Affinity Clustering achieves (whp):

  • a (1/3-\epsilon)-approximation for revenue in O(\log(n)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds when n=2^{N},

  • and a (1/9-\epsilon)-approximation for revenue in O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds in general.

This will be a significant motivation for Matching Affinity Clustering's theoretical strength. As stated previously, one of the goals of Matching Affinity Clustering is to keep the cluster sizes balanced at each level. However, in the first step, note that Matching Affinity Clustering creates n-2^{N-1} clusters of size 2, and the rest of the vertices form singleton clusters. Therefore, to use this benefit of Matching Affinity Clustering, we need to ensure that cluster sizes will never deviate too much.

Lemma 1.

After the first round of merges, Matching Affinity Clustering maintains a cluster balance (i.e., the minimum ratio between cluster sizes) of 1/2 whp.

After every matching, the algorithm creates a new graph with vertices representing clusters and edges representing the average linkage between clusters. We will call this a clustering graph.

Definition 6.

A clustering graph \mathcal{C}(G,C) for graph G and clustering C=\{C_{1},\ldots,C_{k}\} of G is a complete graph with vertex set V=C. Its edge weights are the average linkage between clusters. Specifically, for vertices v_{C_{i}} and v_{C_{j}} in \mathcal{C}(G,C) corresponding to clusters C_{i} and C_{j} where i\neq j, the weight of the edge between these vertices is:

w_{\mathcal{C}(G,C)}(v_{C_{i}},v_{C_{j}})=\frac{1}{|C_{i}|\cdot|C_{j}|}\sum_{u\in C_{i},w\in C_{j}}w_{G}(u,w).

The fact that the edge weights in the clustering graph are the average linkage between clusters highlights the similarity between Matching Affinity Clustering and Average Linkage. Essentially, we try to optimize for average linkage at each step, but instead of merging two clusters, we merge many pairs of clusters at once with a maximum matching.

Since Matching Affinity Clustering computes this graph, we must show how to efficiently transform the clustering graph at the ith level, \mathcal{C}(G,C^{i}) with clustering C^{i}, into the clustering graph at the (i+1)th level, \mathcal{C}(G,C^{i+1}) with clustering C^{i+1}.

Lemma 2.

Given \mathcal{C}(G,C^{i}) and C^{i+1}, where every cluster in C^{i+1} is composed of two subclusters in C^{i}, \mathcal{C}(G,C^{i+1}) can be computed in the MPC model with \widetilde{O}(n) machine space and one round.
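A quick numerical check of the update behind Lemma 2 (our own illustration, under the equal-cluster-size assumption that holds when n=2^{N}): averaging the four old average-linkage weights, as in line 19 of Algorithm 2, reproduces the average linkage of Definition 6 recomputed from scratch on the merged clusters.

import random

def avg_linkage(w, A, B):
    return sum(w[(min(a, b), max(a, b))] for a in A for b in B) / (len(A) * len(B))

random.seed(0)
w = {(i, j): random.random() for i in range(8) for j in range(8) if i < j}

Aj, Bj, Ak, Bk = [0, 1], [2, 3], [4, 5], [6, 7]  # four equal-size subclusters
line19 = 0.25 * (avg_linkage(w, Aj, Ak) + avg_linkage(w, Aj, Bk)
                 + avg_linkage(w, Bj, Ak) + avg_linkage(w, Bj, Bk))
direct = avg_linkage(w, Aj + Bj, Ak + Bk)  # Definition 6, recomputed from scratch
assert abs(line19 - direct) < 1e-12
print(line19, direct)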

Lemma 2 will eventually be used in our proof of the efficiency claims of Theorem 1. For now, we return our attention to the approximation factor. Our approximation proof observes the total merge cost and revenue across all merges on a single level of the hierarchy. For concision, we introduce the following notation to describe cost and revenue over a single clustering.

Definition 7.

The clustering revenue based on some superclustering C^{\prime} of C on graph G is the sum of the merge revenues of combining clusters in C to create clusters in C^{\prime}. It is denoted by \operatorname{clustering-rev}_{G}(C,C^{\prime}).

Definition 8.

The clustering cost based on some superclustering C^{\prime} of C on graph G is the sum of the merge costs of combining clusters in C to create clusters in C^{\prime}. It is denoted by \operatorname{clustering-cost}_{G}(C,C^{\prime}).

In order to prove an approximation for revenue, we want to compare each clustering revenue and cost. First, we must show that Matching Affinity Clustering has a large clustering revenue at any level.

Lemma 3.

Let clusterings C^{i} and C^{i+1} be the ith and (i+1)th level clusterings found by Matching Affinity Clustering, where C^{0}=V. Let p be the indicator that is 1 if n is not a power of 2. Then the clustering revenue of Matching Affinity Clustering at the ith level is at least (whp):

\operatorname{clustering-rev}_{G}(C^{i},C^{i+1})\geq 2^{3i-2p+1}\left(2^{n-i-1}-1\right)\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}).

Now we address clustering cost. This time, we must show an upper bound for the clustering cost at the ith level in terms of the clustering revenue at the ith level. Let M_{i} be the matching Matching Affinity Clustering uses to merge C^{i} into C^{i+1}. Then M_{i} is a (1-\epsilon)-approximation of the optimum M_{i}^{*}.

Lemma 4.

Let C^{i} and C^{i+1} be the ith and (i+1)th level clusterings found by Matching Affinity Clustering, where the ith step uses matching M_{i}\geq(1-\epsilon)M^{*} for maximum matching M^{*} and C^{0}=V. Then the clustering cost of Matching Affinity Clustering at the ith level is at most (whp):

\operatorname{clustering-cost}_{G}(C^{i},C^{i+1})\leq\frac{2^{2p+1}}{1-\epsilon}\operatorname{clustering-rev}_{G}(C^{i},C^{i+1}).

Now we are ready to prove the approximation factor for Matching Affinity Clustering. We combine Lemma 4 with properties of revenue from Section 2.2 to obtain an expression for revenue in terms of (n-2) times the sum of weights in the graph. We use this as a bound for the optimal revenue.

Lemma 5.

Matching Affinity Clustering obtains a (1/3-\epsilon)-approximation for revenue on graphs of size 2^{N}, and a (1/9-\epsilon)-approximation on general graphs whp.

Finally, the round complexity is limited by the iterations and calls to the matching algorithm. The space complexity is determined by the clustering graph construction.

Lemma 6.

Matching Affinity Clustering uses \widetilde{O}(n) space per machine and runs in O(\log(n)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds on graphs of size 2^{N}, and O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds in general.

Lemmas 5 and 6 are sufficient to prove Theorem 1. Our algorithm achieves an approximation for revenue efficiently in the MPC model. In addition, the algorithm creates a desirably near-balanced hierarchical clustering tree.

We now prove the approximation bound tightness for Matching Affinity Clustering when |V|=2^{N}. Recently, Charikar et al. (2019) proved by counterexample that Average Linkage achieves at best a (1/3+o(1))-approximation on certain graphs. We find that Matching Affinity Clustering acts the same as Average Linkage on these graphs, and so has at best a (1/3+o(1))-approximation.

Theorem 5.

There is a graph G on which Matching Affinity Clustering achieves no better than a (1/3+o(1))-approximation of the optimal revenue.

4.3 Value approximation

Now we consider Matching Affinity Clustering when edge weights represent distances instead of similarities. In this context, instead of running a k-sized maximum matching and then iterative general maximum matchings, we run a k-sized minimum matching and then iterative general minimum perfect matchings. Therefore, this algorithm is dependent on the existence of a k-sized minimum matching algorithm in MPC. Due to its similarity to other classical problems with (1+\epsilon) solutions in MPC (Ghaffari et al., 2018; Behnezhad et al., 2019), we conjecture:

Conjecture 1.

There exists an MPC algorithm that achieves a (1+\epsilon)-approximation for minimum weight k-sized matching whp that uses \widetilde{O}(n) machine space.

Given such an algorithm, we can show that Matching Affinity Clustering approximates value.

Theorem 6.

Assume there exists an MPC algorithm that achieves an \alpha-approximation for minimum weight k-sized matching whp in O(f(n)) rounds and \widetilde{O}(n) machine space. In the value context (where edge weights are data distances) and in O(f(n)\log(n)) rounds with \widetilde{O}(n) machine space, Matching Affinity Clustering achieves (whp):

  • a \frac{2}{3}\alpha-approximation for value when n=2^{N},

  • and a \frac{1}{3}\alpha-approximation for value in general.

The proof of this result is quite similar to the proof of the 2/3-approximation of Average Linkage by Cohen-Addad et al. (2018). Instead of focusing on single merges, however, we observe the entire set of merges across a clustering layer in our hierarchy. Then we can make the same argument about the value across an entire level of the hierarchy, and use the cluster balance from Lemma 1 to achieve our result.

If Conjecture 1 holds, then the approximation factors become 2/3-\epsilon and 1/3-\epsilon respectively. We see a similar pattern as in the revenue result, where the algorithm nears the state-of-the-art 2/3-approximation achieved by Average Linkage (Cohen-Addad et al., 2018) on datasets of size n=2^{N}, and still achieves a constant factor in general. Finally, we can additionally show that the former approximation is tight. See the construction and proofs in the Appendix.

Theorem 7.

There is a graph G on which Matching Affinity Clustering achieves no better than a (2/3+o(1))-approximation of the optimal value.

4.4 Round comparison to Affinity Clustering

In this section, we only consider Matching Affinity Clustering in the similarity edge weight context. The round complexities of Matching Affinity Clustering and regular Affinity Clustering depend on graph qualities, and in certain cases one outperforms the other. On dense graphs with n^{1+c} edges for constant c, Bateni et al. (2017) showed that Affinity Clustering runs in \lceil\log(c/\epsilon)\rceil+1 rounds. On sparse graphs, it runs in O(\log^{2}n) rounds, and it runs in O(\log n) rounds when given access to a distributed hash table. We saw that Matching Affinity Clustering runs in O(\log(n)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds on graphs of size 2^{N}, and O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) in general for max edge weight W.

There are two situations where our algorithm outperforms Affinity Clustering. First, if the graph is sparse and the number of vertices is 2^{N}, then our algorithm runs in O(\log(n)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds, and Affinity Clustering runs in O(\log^{2}(n)) rounds. Otherwise, if the graph is sparse, Matching Affinity Clustering performs better as long as the largest edge weight is W=o\left(\frac{\exp(\log^{2}(n)/\log\log(n))}{n}\right). This bound grows faster than any constant. If W is large, Affinity Clustering is slightly more efficient. Finally, if the graph is dense, Affinity Clustering achieves an impressive constant round complexity, and is therefore more efficient. In any case, Matching Affinity Clustering is an efficient and highly scalable algorithm.

5 Affinity Clustering approximation bounds

In this section and the following sections, we provide proofs for the theorems and lemmas introduced in this paper, organized to mirror the corresponding sections of the paper itself.

We start by proving Theorem 3. Bateni et al. (2017) were in part motivated by the lack of theoretical guarantees for distributed hierarchical clustering algorithms. Thus, they introduced Affinity Clustering, based on Borůvka (1926)'s algorithm for parallel MST. In every parallel round of Borůvka's algorithm, each connected component (starting with disconnected vertices) selects the lowest-weight outgoing edge and adds it to the solution, eventually creating an MST. Affinity Clustering forms a cluster from each component. Note that Affinity Clustering was evaluated on a graph with weights representing dissimilarities between vertices, as opposed to our representation where weights are similarities. It is easy to verify that Affinity Clustering functions equivalently using the maximum spanning tree in our representation. Bateni et al. (2017) theoretically validate their algorithm by defining a cost function based on the cost of the minimum Steiner tree for each cluster in the hierarchy; however, they do not motivate this metric. Therefore, it is more interesting to evaluate it in terms of revenue and value. We ultimately show:

Theorem 3.

Affinity Clustering cannot achieve better than an O(1/n)-factor approximation for revenue or value.

We will split this into two cases, one for each objective function. We start with revenue. First, note that when Affinity Clustering merges clusters in common connected components, it creates one supercluster (i.e., a cluster of clusters) for all clusters in that component. Therefore, it may merge many clusters at once. A brief counterexample showing why such a hierarchy can fail is when the max spanning tree is a star: all vertices are merged into one cluster in a single round for a revenue of zero, which is not approximately optimal. To evaluate this algorithm, we must consider all possible ways Affinity Clustering might decide to resolve edges on the max spanning tree of the input graph. We propose a graph family that shows Affinity Clustering cannot achieve a good revenue approximation. The hierarchy we use for comparison is one that Matching Affinity Clustering would find, not including the k-matching step. We prove the following lemma.

Lemma 7.

There exists a family of graphs on which Affinity Clustering cannot achieve better than an O(1/n)-factor approximation for revenue.

We now move on to value.

Lemma 8.

There exists a family of graphs on which Affinity Clustering cannot achieve better than an O(1/n)-factor approximation for value.

Finally, we simply combine the results of Lemma 7 and Lemma 8 to prove Theorem 3.

6 Experiments

We now empirically validate these results to further motivate Matching Affinity Clustering. The algorithm is implemented as a sequence of maximum or minimum perfect matchings, and the testing software is provided in the supplementary material. The software as well as the five UCI datasets (Dua & Graff, 2017), ranging between 150 and 5620 data points, are exactly the same as those used for the small-scale evaluation of Affinity Clustering (Bateni et al., 2017). The data is represented by a vector of features. Similarity-based edge weights are the cosine similarity between vectors, and dissimilarity-based edge weights are the L2 norm. Most data and algorithms are deterministic and thus have consistent outcomes, but for any randomness, we run the experiment 50 times and take the average. Just like the evaluation of Affinity Clustering, our evaluation runs each hierarchical clustering algorithm on k-clustering problems, descending the hierarchy until a k-clustering is found. This is compared to the ground truth clustering for the dataset.

We evaluate performance using the Rand index, which was designed by Rand (1971) to be similar to accuracy in the unsupervised context. This is an established and commonly used metric for evaluating clustering algorithms and was used in the evaluation of Affinity Clustering.

Definition 9 (Rand (1971)).

Given a set V=\{v_{1},\ldots,v_{n}\} of n points and two clusterings X=\{X_{1},\ldots,X_{r}\} and Y=\{Y_{1},\ldots,Y_{s}\} of V, we define:

  • a: the number of pairs in V that are in the same cluster in X and in the same cluster in Y.

  • b: the number of pairs in V that are in different clusters in X and in different clusters in Y.

Figure 3: Rand Index and cluster balance on raw and filtered (randomly pruned for balance and n=2^{N}) UCI datasets. Panels: (a) Rand Index on raw data; (b) Rand Index on filtered data; (c) cluster balance on raw data; (d) cluster balance on filtered data. Legend (bars, left to right): Max Matching Affinity Clustering is blue, Min Matching Affinity Clustering is orange, Affinity Clustering is green.

The Rand index r(X,Y) is (a+b)/\binom{n}{2}. Given the ground truth clustering T of a dataset, we define the Rand index score of a clustering X to be r(X,T).
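For completeness, a minimal implementation of Definition 9 (our own helper, not the evaluation code shipped with the paper):

from itertools import combinations
from math import comb

def rand_index(X, Y):
    # X, Y: dicts mapping each point to its cluster label in the two clusterings.
    points = list(X)
    a = b = 0
    for u, v in combinations(points, 2):
        same_x, same_y = X[u] == X[v], Y[u] == Y[v]
        a += same_x and same_y               # together in both clusterings
        b += (not same_x) and (not same_y)   # separated in both clusterings
    return (a + b) / comb(len(points), 2)

truth = {0: "A", 1: "A", 2: "B", 3: "B"}
found = {0: "A", 1: "A", 2: "A", 3: "B"}
print(rand_index(found, truth))  # 3 of 6 pairs agree -> 0.5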

6.1 Comparison with Affinity Clustering

In addition, we are interested in evaluating the balance between cluster sizes in the clusterings, which indicates how well our algorithms handle balanced data. We use the cluster size ratio of a clustering, which was also reported by Bateni et al. (2017). For a clustering X=\{X_{1},\ldots,X_{r}\}, the size ratio is \min_{i,j\in[r]}|X_{i}|/|X_{j}|.
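The cluster size ratio reduces to the ratio between the smallest and largest cluster sizes; a short helper (our own) makes this explicit:

def size_ratio(clustering):
    # Ratio of smallest to largest cluster size; 1.0 means perfectly balanced.
    sizes = [len(c) for c in clustering]
    return min(sizes) / max(sizes)

print(size_ratio([{0, 1}, {2, 3}]))     # 1.0  (balanced)
print(size_ratio([{0}, {1, 2, 3, 4}]))  # 0.25 (imbalanced)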

In Figure 3(a), we see the Rand indices of Max Matching Affinity Clustering (i.e., in the similarity context), Min Matching Affinity Clustering (i.e., in the distance context), and Affinity Clustering. A Rand index is between 0 and 1, where higher scores indicate the clustering is more similar to the ground truth. Matching Affinity Clustering performs similarly to state-of-the-art algorithms like Affinity Clustering on all data except the Soy-Bean dataset. A full evaluation against other algorithms (see the Appendix) illustrates that Matching Affinity Clustering outperforms algorithms like Random Clustering and Average Linkage.

Figure 3(b) depicts the same information but on a slightly modified dataset. Here, we randomly remove data until (1) the dataset is of size 2^{N}, and (2) the ground truth clusters are balanced. We did this 50 times and took the average results. This is motivated by Matching Affinity Clustering's stronger theoretical guarantees on datasets of size 2^{N} with ensured cluster balance. As expected, Matching Affinity Clustering performs consistently better than Affinity Clustering on filtered data, albeit by a small margin in many cases. This shows that, experimentally, Matching Affinity Clustering performs better on balanced datasets of size 2^{N}.

Finally, Figures 3(c) and 3(d) depict the cluster size ratios on the raw and filtered data respectively. In theory, at every level in the hierarchy of Matching Affinity Clustering, no cluster can be smaller than half the size of another (Lemma 1). However, in our evaluation, we are comparing a single k-clustering, which may not precisely correspond to a level in the hierarchy. In this case, we take some clusters from the first level with fewer than k clusters and the last level with more than k clusters. Therefore, since cluster sizes double at each level, the lower bound for the cluster size ratio is now 1/4. This is reflected in Figure 3(c), where Matching Affinity Clustering stays consistently above this minimum, and often exceeds it by quite a bit. On the filtered data (Figure 3(d)), Matching Affinity Clustering maintains perfect balance in every instance, whereas Affinity Clustering performs much worse. Thus, Matching Affinity Clustering has proven empirically successful on small datasets.

6.2 Comparison with other algorithms

Here we include more complete visualizations of the performance of all tested algorithms. As in the main body of the text, we report the Rand index and cluster size ratio on raw and filtered data. These tests were run in the same way as those in the main body; we simply present results for other common algorithms for completeness. The results are presented in Figure 4 in the Appendix.

Most of these results are as expected and simply reproduce the results from Bateni et al. (2017). However, we add one more algorithm: random clustering. Again, this is the clustering that recursively and randomly partitions the data into a hierarchy. In our experiments, random clustering had surprisingly good cluster balance ratios (see Figure 4(d)). In fact, on raw data, it on average had more balanced clusters than Matching Affinity Clustering on three of the datasets.

There are three main reasons why Matching Affinity Clustering is still clearly more empirically successful than random clustering. First, notice that on filtered data in Figure 4(e), Matching Affinity Clustering has more balanced clusters than random clustering by a very wide margin. Second, it is clear in Figures 4(b) and 4(c) that Matching Affinity Clustering consistently and significantly outperforms random clustering. Third, random clustering is nondeterministic, whereas Matching Affinity Clustering's guarantees hold with high probability. Matching Affinity Clustering's theoretical strengths and empirical performance are therefore much stronger assurances than those of random clustering. Thus, while an argument can be made that random clustering seems to empirically balance clusters well, Matching Affinity Clustering still does better in a number of respects, and so is a more useful algorithm.

7 Conclusion

Matching Affinity Clustering is the first hierarchical clustering algorithm to simultaneously achieve our four desirable traits. (1) Theoretically, it guarantees state-of-the-art approximations for revenue and value (given an approximation for MPC minimum perfect matching) when n=2^{N}, and good approximations for revenue and value in general. Affinity Clustering cannot approximate either function. (2) Compared to Affinity Clustering, our algorithm achieves similar empirical success on general datasets and performs even better when datasets are balanced and of size 2^{N}. (3) Clusters are theoretically and empirically balanced. (4) It is scalable.

These attributes were proved through theoretical analysis and small-scale evaluation. While we were unable to perform the same large-scale tests as Bateni et al. (2017), our methods still establish several advantages to the proposed approach. Matching Affinity Clustering simultaneously attains stronger broad theoretical guarantees, scalability through distribution, and small-scale empirical success. Therefore, we believe that Matching Affinity Clustering holds significant value over its predecessor as well as other state-of-the-art hierarchical clustering algorithms, particularly with its niche capability on balanced datasets.

References

  • Alon et al. (2020) Alon, N., Azar, Y., and Vainstein, D. Hierarchical clustering: A 0.585 revenue approximation. In Conference on Learning Theory, COLT, Proceedings of Machine Learning Research, pp.  153–162, 2020.
  • Amgoth & Jana (2014) Amgoth, T. and Jana, P. K. Energy efficient and load balanced clustering algorithms for wireless sensor networks. IJICT, 6(3/4):272–291, 2014.
  • Bateni et al. (2017) Bateni, M., Behnezhad, S., Derakhshan, M., Hajiaghayi, M., Kiveris, R., Lattanzi, S., and Mirrokni, V. S. Affinity clustering: Hierarchical clustering at scale. In Guyon et al. (2017), pp.  6867–6877.
  • Behnezhad et al. (2019) Behnezhad, S., Hajiaghayi, M., and Harris, D. G. Exponentially faster massively parallel maximal matching. In 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, USA, November 9-12, 2019, pp. 1637–1649, 2019.
  • Borůvka (1926) Borůvka, O. O jistém problému minimálním. Práce Moravské přírodovědecké společnosti. Mor. přírodovědecká společnost, 1926.
  • Charikar & Chatziafratis (2017) Charikar, M. and Chatziafratis, V. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Klein, P. N. (ed.), Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pp.  841–854. SIAM, 2017.
  • Charikar et al. (2004) Charikar, M., Chekuri, C., Feder, T., and Motwani, R. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417–1440, 2004.
  • Charikar et al. (2018) Charikar, M., Chatziafratis, V., Niazadeh, R., and Yaroslavtsev, G. Hierarchical clustering for euclidean data. CoRR, abs/1812.10582, 2018.
  • Charikar et al. (2019) Charikar, M., Chatziafratis, V., and Niazadeh, R. Hierarchical clustering better than average-linkage. In Chan, T. M. (ed.), Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pp.  2291–2304. SIAM, 2019.
  • Chitnis et al. (2016) Chitnis, R., Cormode, G., Esfandiari, H., Hajiaghayi, M., McGregor, A., Monemizadeh, M., and Vorotnikova, S. Kernelization via sampling with applications to finding matchings and related problems in dynamic graph streams. In Krauthgamer, R. (ed.), Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pp.  1326–1344. SIAM, 2016.
  • Chitnis et al. (2015) Chitnis, R. H., Cormode, G., Hajiaghayi, M. T., and Monemizadeh, M. Parameterized streaming: Maximal matching and vertex cover. In Indyk, P. (ed.), Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pp.  1234–1251. SIAM, 2015.
  • Cohen-Addad et al. (2018) Cohen-Addad, V., Kanade, V., Mallmann-Trenn, F., and Mathieu, C. Hierarchical clustering: Objective functions and algorithms. In Czumaj, A. (ed.), Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp.  378–397. SIAM, 2018.
  • Dasgupta (2016) Dasgupta, S. A cost function for similarity-based hierarchical clustering. In Wichs, D. and Mansour, Y. (eds.), Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pp.  118–127. ACM, 2016.
  • Dean & Ghemawat (2008a) Dean, J. and Ghemawat, S. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008a.
  • Dean & Ghemawat (2008b) Dean, J. and Ghemawat, S. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008b.
  • Dengel et al. (2011) Dengel, A., Althoff, T., and Ulges, A. Balanced clustering for content-based image browsing. In German Computer Science Society, Informatiktage, 03 2011.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Ghaffari et al. (2018) Ghaffari, M., Gouleakis, T., Konrad, C., Mitrovic, S., and Rubinfeld, R. Improved massively parallel computation algorithms for mis, matching, and vertex cover. In Newport, C. and Keidar, I. (eds.), Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC 2018, Egham, United Kingdom, July 23-27, 2018, pp.  129–138. ACM, 2018.
  • Ghaffari et al. (2019) Ghaffari, M., Lattanzi, S., and Mitrovic, S. Improved parallel algorithms for density-based network clustering. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 2201–2210, 2019.
  • Guyon et al. (2017) Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.). Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017.
  • Hassin et al. (1997) Hassin, R., Rubinstein, S., and Tamir, A. Approximation algorithms for maximum dispersion. Oper. Res. Lett., 21(3):133–137, 1997.
  • Im et al. (2017) Im, S., Moseley, B., and Sun, X. Efficient massively parallel methods for dynamic programming. In Hatami, H., McKenzie, P., and King, V. (eds.), Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pp.  798–811. ACM, 2017.
  • Jin et al. (2013) Jin, C., Patwary, M. M. A., Hendrix, W., Agrawal, A., Liao, W.-k., and Choudhary, A. Disc: A distributed single-linkage hierarchical clustering algorithm using mapreduce. International Workshop on Data Intensive Computing in the Clouds (DataCloud), 11 2013.
  • Jin et al. (2015) Jin, C., Liu, R., Chen, Z., Hendrix, W., Agrawal, A., and Choudhary, A. N. A scalable hierarchical clustering algorithm using spark. In First IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2015, Redwood City, CA, USA, March 30 - April 2, 2015, pp.  418–426. IEEE Computer Society, 2015.
  • Kraskov et al. (2003) Kraskov, A., Stögbauer, H., Andrzejak, R. G., and Grassberger, P. Hierarchical clustering using mutual information. CoRR, q-bio.QM/0311037, 2003.
  • Lin et al. (2006) Lin, G., Nagarajan, C., Rajaraman, R., and Williamson, D. P. A general approach for incremental approximation and hierarchical clustering. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, pp.  1147–1156. ACM Press, 2006.
  • Ludwig (2015) Ludwig, S. A. Mapreduce-based fuzzy c-means clustering algorithm: implementation and scalability. Int. J. Machine Learning & Cybernetics, 6(6):923–934, 2015.
  • Moseley & Wang (2017) Moseley, B. and Wang, J. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Guyon et al. (2017), pp.  3097–3106.
  • Rand (1971) Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971. ISSN 01621459.
  • White (2009) White, T. Hadoop - The Definitive Guide: MapReduce for the Cloud. O’Reilly, 2009. ISBN 978-0-596-52197-4.
  • Zaharia et al. (2010) Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. In Nahum, E. M. and Xu, D. (eds.), 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’10, Boston, MA, USA, June 22, 2010. USENIX Association, 2010.

Appendix A Distributed maximum k-sized matching

In this section, we prove our results for distributed k-sized matching and additionally provide the pseudocode.

Theorem 4.

There exists an MPC algorithm for k-sized maximum matching with nonnegative edge weights and max edge weight W for k>n/2 that achieves a (1-\epsilon)-approximation whp in O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{1/\epsilon}) rounds and O(n/polylog(n)) machine space.

Proof.

Let us define Q by a binary search on values 1 through nW to find the minimum Q that satisfies a halting condition: that the resulting M the algorithm finds satisfies |M|=k and \frac{1}{k(1-\delta)}w(M)=Q (line 5), where W is the largest weight in G. First we transform the graph. Create a vertex set U of n-2k dummy vertices, add them to our vertex set, and connect them to all vertices in G with edge weights Q (lines 1 to 4). Then we run Ghaffari et al. (2018)'s algorithm (line 7) on this new graph with error \delta being the minimum of the value satisfying \epsilon=(c+1)\delta-\delta^{2} and \epsilon itself, where k\leq cn. Find the portion of this matching in G, and use this to check our halting condition.

We must start by showing that if $M_{G,k}$ is a $(1-\epsilon)$-approximate $k$-sized matching in $G$, then when $Q = \frac{1}{k(1-\epsilon)}w(M_{G,k})$, our algorithm finds a $k$-sized matching of $(1-\epsilon)$-approximate weight. Assume there is such a matching, and consider the iteration in which our algorithm selects this value of $Q$. Consider the matching the algorithm finds in the transformed graph, and assume for contradiction that some $u \in U$ is not matched to any $v \in V$ while every edge of the matching that lies in $G$ has weight greater than $Q$. Because $u$ has no edges to other vertices of $U$, it is not matched at all. Since $u$ is connected to every $v \in V$ by an edge of positive weight $Q$, the only way $u$ can be unmatched is if every $v \in V$ is matched. Since $k \leq n/2$, this forces at least $k$ edges of the matching to lie inside $G$. Thus the weight of the algorithm's matching restricted to $G$, $M_{\mathcal{A},G}$, is bounded below by $k$ edges of weight greater than $Q$.

w(M_{\mathcal{A},G}) > kQ = \frac{1}{1-\epsilon}w(M_{G,k}).

But $\textsc{OPT}_{G,k}$, the optimal $k$-sized matching in $G$, must weigh at least as much as this, so $(1-\epsilon)w(\textsc{OPT}_{G,k}) > w(M_{G,k})$, contradicting the assumption that $M_{G,k}$ is a $(1-\epsilon)$-approximation. Otherwise, if some $u \in U$ is not matched to any $v \in V$ while some edge of the matching inside $G$ has weight at most $Q$, removing that edge and pairing one of its endpoints with $u$ can only improve the matching. Thus we can add a post-processing step to ensure that all $n-2k$ new vertices are matched without decreasing the weight of the matching.

Thus our algorithm matches all $n-2k$ vertices of $U$ to $n-2k$ vertices of $V$, and the portion of the matching on the remaining $2k$ vertices of $V$ has size at most $k$. Hence the algorithm outputs a matching of size at most $k$ in $G$, and this selection of $Q$ makes the halting condition true.

Let M𝒜M_{\mathcal{A}} be the matching our algorithm finds in the transformed graph. Then the total weight is,

w(M_{\mathcal{A}}) = (n-2k)Q + w(M_{\mathcal{A},G}).

By the same argument as before, but without the $(1-\epsilon)$ factor, there is an optimal matching in the transformed graph that matches all vertices of $U$ to vertices of $V$. Thus the expression for $\textsc{OPT}$ is analogous, where $\textsc{OPT}_{G,k}$ denotes the optimal $k$-sized matching in $G$.

w(\textsc{OPT}) = (n-2k)Q + w(\textsc{OPT}_{G,k}).

We know M𝒜M_{\mathcal{A}} is a (1ϵ)(1-\epsilon)-approximation for OPT, so we can combine these two equations.

(n-2k)Q + w(M_{\mathcal{A},G}) \geq (1-\epsilon)(n-2k)Q + (1-\epsilon)w(\textsc{OPT}_{G,k}).

We are interested in the portion of the solution in GG, or M𝒜,GM_{\mathcal{A},G}.

w(M_{\mathcal{A},G}) \geq -\epsilon(n-2k)Q + (1-\epsilon)w(\textsc{OPT}_{G,k}).

Recall that $Q = \frac{1}{k(1-\epsilon)}w(M_{G,k})$, and $M_{G,k}$ is a $k$-sized matching in $G$, so $w(M_{G,k}) \leq w(\textsc{OPT}_{G,k})$. We can apply this to our inequality and simplify.

w(M_{\mathcal{A},G}) \geq -\epsilon(n-2k)\frac{1}{k(1-\epsilon)}w(\textsc{OPT}_{G,k}) + (1-\epsilon)w(\textsc{OPT}_{G,k}) = \left(1 - \epsilon\left(1 + \frac{n/k - 2}{1-\epsilon}\right)\right)w(\textsc{OPT}_{G,k}).

Since $k = \Omega(n)$, the ratio $n/k$ is bounded above by some constant $c$, so the loss in the approximation factor is $O(\epsilon)$. Therefore, for any desired approximation parameter $\epsilon$, we can select a sufficiently small error $\delta$ for Ghaffari et al. (2018)'s algorithm such that our algorithm gives a $(1-\epsilon)$-approximation.

The algorithm searches for the minimum $Q$ satisfying the halting condition, so it remains to show that if some $Q < \frac{1}{k(1-\delta)}w(M_{G,k})$ triggers the halting condition, then the resulting $k$-sized matching $M_{\mathcal{A},G}$ with $Q = \frac{1}{k(1-\delta)}w(M_{\mathcal{A},G})$ is still a $(1-\epsilon)$-approximation. If the halting condition holds, the algorithm must have matched all $n-2k$ vertices of $U$ to vertices of $V$ and selected $k$ edges inside $G$. Substituting our value of $Q$, the total weight is:

w(M_{\mathcal{A}}) = (n-2k)Q + w(M_{\mathcal{A},G}) = \frac{n-2k}{k(1-\delta)}\,w(M_{\mathcal{A},G}) + w(M_{\mathcal{A},G}) = \left(1 + \frac{n-2k}{k(1-\delta)}\right)w(M_{\mathcal{A},G}).

Therefore, the weight of the matching on the transformed graph is a fixed multiple of its weight inside $G$, so the approximation factor on the transformed graph carries over to $G$. Since we ran Ghaffari et al. (2018)'s algorithm on the transformed graph with error $\delta \leq \epsilon$, this yields a matching within an $\epsilon$ error of $\textsc{OPT}_{G,k}$.

Therefore, our algorithm returns the desired approximation. The algorithm requires $O(\log(nW))$ iterations for the binary search on $Q$. In each iteration, the only significant computation, in both round and space complexity, is the call to Ghaffari et al. (2018)'s algorithm, which uses $O(\log\log(n)\cdot(1/\epsilon)^{1/\epsilon})$ rounds and $O(n/\mathrm{polylog}(n))$ machine space. Thus our algorithm runs in $O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{1/\epsilon})$ rounds with $O(n/\mathrm{polylog}(n))$ machine space. ∎

Appendix B Revenue approximation

Lemma 1.

After the first round of merges, Matching Affinity Clustering maintains cluster balance (ie, the minimum ratio between cluster sizes) of 1/21/2 whp.

Proof.

In the first round of merges, any duplicated vertex must be matched with its duplicate, because the edge weight between a vertex and its duplicate is essentially infinite (for the purposes of this paper, arbitrarily large). Thus no duplicate is matched with another duplicate, so every 2-sized cluster contains at least one real vertex. Any subsequent merge only unions such clusters, so the property is preserved, and it therefore holds for all clusters beyond the initial singletons. ∎
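For intuition, here is a minimal sketch of the padding step that Lemma 1 relies on; the helper name pad_to_power_of_two is ours, a large finite weight stands in for the "arbitrarily large" weight, and the weights of a duplicate's remaining edges (not needed for this lemma) are omitted.

import math
import networkx as nx


def pad_to_power_of_two(G, big=1e18):
    """Pad G with duplicate vertices until the vertex count is a power of two.
    Each duplicate is tied to its original by an (effectively infinite) weight,
    so the first round of matching pairs every duplicate with its original and
    every 2-sized cluster contains at least one real vertex, as in Lemma 1."""
    n = G.number_of_nodes()
    target = 1 if n <= 1 else 2 ** math.ceil(math.log2(n))
    H = G.copy()
    for v in list(G.nodes())[: target - n]:   # which vertices are duplicated is immaterial here
        dup = ("dup", v)
        H.add_node(dup)
        H.add_edge(dup, v, weight=big)        # a duplicate's other edges are omitted in this sketch
    return H


# example: pad a 6-vertex graph up to 8 vertices
H = pad_to_power_of_two(nx.complete_graph(6))
assert H.number_of_nodes() == 8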

Lemma 2.

Given 𝒞(G,Ci)\mathcal{C}(G,C^{i}) and Ci+1C^{i+1} where clusters are all composed of two subclusters in CiC^{i}, 𝒞(G,Ci+1)\mathcal{C}(G,C^{i+1}) can be computed in the MPC model with O~(n)\widetilde{O}(n) machine space and one round.

Proof.

We start by constructing the vertex set, Vi+1V^{i+1}, which corresponds to the clusters at the new i+1i+1th level. So for every Cji+1Ci+1C^{i+1}_{j}\in C^{i+1}, create a vertex vCji+1v_{C^{i+1}_{j}} and put it in vertex set Vi+1V^{i+1}. It must be a complete graph, so we can add edges between all pairs of vertices.

Consider two vertices vCji+1v_{C^{i+1}_{j}} and vCki+1v_{C^{i+1}_{k}}. Since we merge sets of two clusters at each round, these must have come from two clusters in CiC^{i} each. Say they merged clusters from vertices u1,u2u_{1},u_{2} and v1,v2v_{1},v_{2} respectively. Note that these vertices are from the previous clustering graph, 𝒞(G,Ci)\mathcal{C}(G,C^{i}). Then the edges (u1,v1),(u1,v2),(u2,v1),(u2,v2)(u_{1},v_{1}),(u_{1},v_{2}),(u_{2},v_{1}),(u_{2},v_{2}) have weights that are the average distances between corresponding ii-level clusters (because they were from the previous clustering graph, 𝒞(G,Ci)\mathcal{C}(G,C^{i})). Since the clusters in CiC^{i} all have the same size, we can calculate the weight as follows.

w_{\mathcal{C}(G,C^{i+1})}(v_{C^{i+1}_{j}},v_{C^{i+1}_{k}}) = \frac{1}{4}\left(w_{\mathcal{C}(G,C^{i})}(u_{1},v_{1}) + w_{\mathcal{C}(G,C^{i})}(u_{1},v_{2}) + w_{\mathcal{C}(G,C^{i})}(u_{2},v_{1}) + w_{\mathcal{C}(G,C^{i})}(u_{2},v_{2})\right).

This holds because the weights in $\mathcal{C}(G,C^{i})$ are already averages over the vertex pairs between $i$th-level clusters, and each such pair set contains $1/4$ of the vertex pairs that contribute to the edge weight at the next level. So when we sum the four edge weights, we account for all edges contributing to the next-level edge weight, and dividing by $4$ gives the new average.

Matching Affinity Clustering can utilize one machine per $(i+1)$-level cluster. Each machine needs to keep track of the distances between its two subclusters and all other subclusters at level $i$. It can then perform this calculation to compute the new edge weights in one round with $O(n)$ space. ∎
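A small single-machine sketch of this update, assuming a dense weight matrix for the current clustering graph; the function name next_clustering_graph is ours.

import numpy as np


def next_clustering_graph(W, merges):
    """Given the current clustering-graph weights W (W[a, b] = average weight
    between equal-sized clusters a and b) and a pairing `merges` of cluster
    indices, return the next clustering graph's weights: each new edge weight
    is the average of the four edge weights between the two merged pairs, as
    in the proof of Lemma 2."""
    m = len(merges)
    W_next = np.zeros((m, m))
    for x, (a1, a2) in enumerate(merges):
        for y, (b1, b2) in enumerate(merges):
            if x != y:
                W_next[x, y] = (W[a1, b1] + W[a1, b2] + W[a2, b1] + W[a2, b2]) / 4.0
    return W_next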

Lemma 3.

Let clusters $C^{i}$ and $C^{i+1}$ be the $i$th and $(i+1)$th level clusterings found by Matching Affinity Clustering, where $C^{0}=V$. Let $p$ be the indicator that is $1$ if the number of vertices is not a power of $2$. Then the clustering revenue of Matching Affinity Clustering at the $i$th level is at least (whp):

clusteringrevG(Ci,Ci+1)23i2p+1(2ni11)(A,B)Miw𝒞(G,Ci)(vA,vB).\displaystyle\operatorname{clustering-rev}_{G}(C^{i},C^{i+1})\geq 2^{3i-2p+1}\left(2^{n-i-1}-1\right)\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}).
Proof.

First, we want to break down the clustering revenue into the sum of its merge revenues. Since each match in our matching MiM_{i} defines a cluster in the next level of the hierarchy, we can view each merge as a match in MiM_{i}. Then we apply the definition of merge revenue.

clusteringrevG(Ci,Ci+1)\displaystyle\operatorname{clustering-rev}_{G}(C^{i},C^{i+1}) =(A,B)MimergerevG(A,B),\displaystyle=\sum_{(A,B)\in M_{i}}\operatorname{merge-rev}_{G}(A,B),
=(A,B)Mi(2n|A||B|)aA,bBwG(a,b).\displaystyle=\sum_{(A,B)\in M_{i}}(2^{n}-|A|-|B|)\sum_{a\in A,b\in B}w_{G}(a,b).

Because we start with a power-of-two number of vertices after padding, each step can find a perfect matching, yielding a power-of-two number of equal-sized clusters at each level. Since the cluster size doubles each round, the cluster size at the $i$th iteration is $2^{i}$. Even though some of these vertices may not contribute to the revenue, this is an upper bound on the size. So $2^{n}-|A|-|B|$ in this formula is at least $2^{n}-2^{i+1}$. This is the work done in (1) below.

clusteringrevG(Ci,Ci+1)\displaystyle\operatorname{clustering-rev}_{G}(C^{i},C^{i+1}) (2n2i+1)(A,B)MiaA,bBwG(a,b),\displaystyle\geq(2^{n}-2^{i+1})\sum_{(A,B)\in M_{i}}\sum_{a\in A,b\in B}w_{G}(a,b), (1)
=(2n2i+1)(A,B)Mi|A||B|w𝒞(G,Ci)(vA,vB),\displaystyle=(2^{n}-2^{i+1})\sum_{(A,B)\in M_{i}}|A||B|w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}), (2)
=22i2p(2n2i+1)(A,B)Miw𝒞(G,Ci)(vA,vB),\displaystyle=2^{2i-2p}(2^{n}-2^{i+1})\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}), (3)
=23i2p+1(2ni11)(A,B)Miw𝒞(G,Ci)(vA,vB).\displaystyle=2^{3i-2p+1}(2^{n-i-1}-1)\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}). (4)

In (2), we substitute for the sum of the edge weights between $A$ and $B$ in $G$: by definition, the edge weight between $v_{A}$ and $v_{B}$ in the clustering graph is the average of those edge weights in $G$, so scaling it by $|A||B|$ recovers their sum. In (3), we pull out $|A||B|$. Each of $A$ and $B$ contains $2^{i}$ total vertices, and by Lemma 1, at least $2^{i-1}$ of them are real and contribute to the revenue, for a total factor of $2^{2i-2}$. If there were $2^{n}$ vertices to start, then all clusters contain only real vertices, and the factor is $2^{2i}$. With the indicator, this is $2^{2i-2p}$. We then simplify in (4). ∎
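The quantities manipulated in this proof can be computed directly from their definitions; a small single-machine sketch follows, where w is a symmetric weight lookup and clusters are lists of vertices (function names are ours).

def merge_rev(w, A, B, n_total):
    """Merge revenue of combining clusters A and B, following the expression
    used in the proof of Lemma 3: the number of vertices outside A and B
    (the non-leaves of this merge) times the total weight of edges between A and B."""
    cross = sum(w[a][b] for a in A for b in B)
    return (n_total - len(A) - len(B)) * cross


def clustering_rev(w, matching, n_total):
    """Clustering revenue of one level: the sum of merge revenues over the
    pairs (A, B) merged by the matching at that level."""
    return sum(merge_rev(w, A, B, n_total) for A, B in matching)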

Lemma 4.

Let $C^{i}$ and $C^{i+1}$ be the $i$th and $(i+1)$th level clusterings found by Matching Affinity Clustering, where the $i$th step uses a matching $M_{i}$ with $w(M_{i}) \geq (1-\epsilon)\,w(M^{*})$ for maximum matching $M^{*}$, and $C^{0}=V$. Then the clustering cost of Matching Affinity Clustering at the $i$th level is at most (whp):

clusteringcostG(Ci,Ci+1)22p+11ϵclusteringrevG(Ci,Ci+1).\displaystyle\operatorname{clustering-cost}_{G}(C^{i},C^{i+1})\leq\frac{2^{2p+1}}{1-\epsilon}\operatorname{clustering-rev}_{G}(C^{i},C^{i+1}).
Proof of Lemma 4.

As in Lemma 3, we want to break apart the clustering cost at the iith level into a series of merge costs. Again, we know the merge costs can be defined through the matching MiM_{i}.

clusteringcostG(Ci,Ci+1)\displaystyle\operatorname{clustering-cost}_{G}(C^{i},C^{i+1}) =(A,B)MimergecostG(A,B),\displaystyle=\sum_{(A,B)\in M_{i}}\operatorname{merge-cost}_{G}(A,B), (1)
=(A,B)Mi(|A|bB,cABwG(b,c)+|B|aA,cABwG(a,c)),\displaystyle=\sum_{(A,B)\in M_{i}}\left(|A|\sum_{b\in B,c\notin A\cup B}w_{G}(b,c)+|B|\sum_{a\in A,c\notin A\cup B}w_{G}(a,c)\right), (2)
2i(A,B)Mi(bB,cABwG(b,c)+aA,cABwG(a,c)).\displaystyle\leq 2^{i}\sum_{(A,B)\in M_{i}}\left(\sum_{b\in B,c\notin A\cup B}w_{G}(b,c)+\sum_{a\in A,c\notin A\cup B}w_{G}(a,c)\right). (3)

At this step, we broke the clustering cost into merge costs of the matching (1), applied the definition of merge cost (2), and pulled out |A|=|B|2i|A|=|B|\leq 2^{i} (3). Consider the inner clusters. Instead of selecting cABc\notin A\cup B, we can consider cc being in any other cluster from CiC^{i}. Let C1i,,CkiCiC_{1}^{i},\ldots,C_{k}^{i}\in C^{i} be all clusters other than AA or BB (i.e., CjiA,BC^{i}_{j}\neq A,B). Then we can define cc as an element in any cluster CjiC^{i}_{j} for j[k]j\in[k]. After, we simply rearrange the indices of summation.

clusteringcostG(Ci,Ci+1)\displaystyle\operatorname{clustering-cost}_{G}(C^{i},C^{i+1}) 2i(A,B)Mi(bB,j[k]cCjiwG(b,c)+aA,j[k]cCjiwG(a,c)),\displaystyle\leq 2^{i}\sum_{(A,B)\in M_{i}}\left(\sum_{b\in B,j\in[k]}\sum_{c\in C_{j}^{i}}w_{G}(b,c)+\sum_{a\in A,j\in[k]}\sum_{c\in C_{j}^{i}}w_{G}(a,c)\right),
=2i(A,B)Mi(j[k]bB,cCjiwG(b,c)+j[k]aA,cCjiwG(a,c)).\displaystyle=2^{i}\sum_{(A,B)\in M_{i}}\left(\sum_{j\in[k]}\sum_{b\in B,c\in C_{j}^{i}}w_{G}(b,c)+\sum_{j\in[k]}\sum_{a\in A,c\in C_{j}^{i}}w_{G}(a,c)\right).

Recall that the edge weights in 𝒞(G,Ci)\mathcal{C}(G,C^{i}) are the average edge weights between clusters in CiC^{i} on graph GG. Again, to turn this into just the summation of the edge weights, we must scale by |B||Cji||B||C^{i}_{j}| and |A||Cji||A||C^{i}_{j}|.

clusteringcostG(Ci,Ci+1)\displaystyle\operatorname{clustering-cost}_{G}(C^{i},C^{i+1}) 2i(A,B)Mi(j[k]|B||Cji|w𝒞(G,Ci)(vB,vCj)+j[k]|A||Cji|w𝒞(G,Ci)(vA,vCji)),\displaystyle\leq 2^{i}\sum_{(A,B)\in M_{i}}\left(\sum_{j\in[k]}|B||C^{i}_{j}|w_{\mathcal{C}(G,C^{i})}(v_{B},v_{C_{j}})+\sum_{j\in[k]}|A||C^{i}_{j}|w_{\mathcal{C}(G,C^{i})}(v_{A},v_{C^{i}_{j}})\right),
=23i(A,B)Mi(j[k]w𝒞(G,Ci)(vB,vCj)+j[k]w𝒞(G,Ci)(vA,vCj)).\displaystyle=2^{3i}\sum_{(A,B)\in M_{i}}\left(\sum_{j\in[k]}w_{\mathcal{C}(G,C^{i})}(v_{B},v_{C_{j}})+\sum_{j\in[k]}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{C_{j}})\right).

For each iteration of the outer summation, we are taking all the edges with one endpoint as AA and all edges with one as BB (besides the edge from AA to BB itself) and adding their weights. Since the only edge in MiM_{i} with an endpoint at AA or BB is the edge from AA to BB, the summation covers all edges with one endpoint as either AA or BB that are not in MiM_{i}. Consider an edge from some CC to CC^{\prime} that isn’t in MiM_{i}. In every iteration besides possibly the first, MiM_{i} matches everything, so we will consider CC and CC^{\prime} in separate iterations of the sum. In both of these iterations, we add the weight w𝒞(G,Ci)(vC,vC)w_{\mathcal{C}(G,C^{i})}(v_{C},v_{C^{\prime}}). Thus, each edge in 𝒞(G,Ci)\mathcal{C}(G,C^{i}) outside of MiM_{i} is accounted for twice, and no edge in MiM_{i} is accounted for.

Consider a multigraph $H$ with vertex set $V(\mathcal{C}(G,C^{i}))$ whose edge set contains two copies of every edge not in $M_{i}$, and no edges of $M_{i}$. Since $\mathcal{C}(G,C^{i})$ is a complete graph, $H$ is a $2(|V(\mathcal{C}(G,C^{i}))|-2)$-regular graph. Thus we can find a perfect matching in $H$, remove its edges to decrease all degrees by $1$, and repeat on the resulting regular graph until all vertices have degree $0$. Since every degree is decremented by $1$ in each iteration, this yields $2(|V(\mathcal{C}(G,C^{i}))|-2)$ matchings $N_{1},N_{2},\ldots,N_{2(|V(\mathcal{C}(G,C^{i}))|-2)}$. Thus the clustering cost can alternatively be viewed as a sum of the weights of these matchings in the clustering graph $\mathcal{C}(G,C^{i})$.

\operatorname{clustering-cost}_{G}(C^{i},C^{i+1})
\leq 2^{3i}\sum_{j\in[2(|V(\mathcal{C}(G,C^{i}))|-2)]}w_{\mathcal{C}(G,C^{i})}(N_{j}), \quad (4)
\leq 2^{3i}\sum_{j\in[2(|V(\mathcal{C}(G,C^{i}))|-2)]}w_{\mathcal{C}(G,C^{i})}(M_{i}^{*}), \quad (5)
\leq 2^{3i+1}(|V(\mathcal{C}(G,C^{i}))|-2)\,w_{\mathcal{C}(G,C^{i})}(M_{i}^{*}). \quad (6)

In (4), we viewed the summations as the sum of weights of the alternative matchings described earlier. Step (5) utilizes the fact that MiM_{i}^{*} is a maximum matching, so the weight of any NjN_{j} is bounded above by the weight of MiM_{i}^{*}. Finally, in (6), we note that the summation does not depend on jj, and so we remove the summation.

Since MiM_{i} is an approximation of the maximum matching on 𝒞(G,Ci)\mathcal{C}(G,C^{i}), we know w𝒞(G,Ci)(Mi)w𝒞(G,Ci)(Mi)/(1ϵ)w_{\mathcal{C}(G,C^{i})}(M_{i}^{*})\leq w_{\mathcal{C}(G,C^{i})}(M_{i})/(1-\epsilon). We can substitute this in and then rewrite it as the summation of edge weights in MiM_{i}.

\operatorname{clustering-cost}_{G}(C^{i},C^{i+1}) \leq 2^{3i+1}(|V(\mathcal{C}(G,C^{i}))|-2)\frac{1}{1-\epsilon}w_{\mathcal{C}(G,C^{i})}(M_{i}) = 2^{3i+1}(|V(\mathcal{C}(G,C^{i}))|-2)\frac{1}{1-\epsilon}\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}).

The total number of vertices in $\mathcal{C}(G,C^{i})$ (i.e., the total number of clusters at the $i$th level) is just the total number of vertices divided by the cluster size: $2^{n}/2^{i}$. Plugging that in gives the desired result.

clusteringcostG(Ci,Ci+1)\displaystyle\operatorname{clustering-cost}_{G}(C^{i},C^{i+1}) 23i+1(2n2i2)11ϵ(A,B)Miw𝒞(G,Ci)(vA,vB),\displaystyle\leq 2^{3i+1}\left(\frac{2^{n}}{2^{i}}-2\right)\frac{1}{1-\epsilon}\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}), (7)
23i+2(2ni11)11ϵ(A,B)Miw𝒞(G,Ci)(vA,vB),\displaystyle\leq 2^{3i+2}\left(2^{n-i-1}-1\right)\frac{1}{1-\epsilon}\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}),
22p+11ϵclusteringrevG(Ci,Ci+1).\displaystyle\leq\frac{2^{2p+1}}{1-\epsilon}\operatorname{clustering-rev}_{G}(C^{i},C^{i+1}). (8)

The first steps consist of plugging in the number of clusters and simplifying algebraically. Finally, step (8) applies Lemma 3, which lower bounds the clustering revenue by the same sum, yielding the claimed upper bound on the clustering cost. This establishes the lemma for every level in which $M_{i}$ is a perfect matching.

So far, we have covered most of the lemma's claim. It remains to account for the first step, which uses the $k$-sized matching. Note that the argument above depends on $M_{i}$ being a perfect matching, whereas $M_{0}$ may only be a maximum matching on $2N-2^{n}$ vertices. The proof structure is similar. We still construct a multigraph $H$ as described, on the subgraph of $G$ induced by the vertices matched in $M_{0}$, adding two copies of every edge between matched vertices that are not matched to each other; this gives maximum degree $4N-2^{n+1}-4$, and thus $4N-2^{n+1}-4$ matchings on the $2N-2^{n}$ matched vertices that cover these edges. However, the cost also accounts for edges from the matched vertices to the unmatched vertices. We can place all of these edges, once each, in a bipartite graph between the matched and unmatched vertices. Then all vertices on one side of the bipartition have degree $2^{n}-N$, and the vertices on the other side have degree $2N-2^{n}$, so we can cover these edges with $2^{n}-N$ matchings of $2N-2^{n}$ edges each; alternatively, this can be viewed as $2^{n+1}-2N$ matchings on $2N-2^{n}$ vertices. In total, we obtain $2N-4$ matchings $N_{1},N_{2},\ldots,N_{2N-4}$ on the $2N-2^{n}$ matched vertices, and the rest of the argument holds. Since $i=0$, step (7) becomes the following.

clusteringcostG(C0,C1)\displaystyle\operatorname{clustering-cost}_{G}(C^{0},C^{1}) 2(N2)11ϵ(A,B)Miw𝒞(G,Ci)(vA,vB),\displaystyle\leq 2\left(N-2\right)\cdot\frac{1}{1-\epsilon}\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}), (9)
2(2n2)11ϵ(A,B)Miw𝒞(G,Ci)(vA,vB),\displaystyle\leq 2(2^{n}-2)\frac{1}{1-\epsilon}\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}), (10)
\leq 4(2^{n-1}-1)\frac{1}{1-\epsilon}\sum_{(A,B)\in M_{i}}w_{\mathcal{C}(G,C^{i})}(v_{A},v_{B}), \quad (11)
231ϵclusteringrevG(C0,C1).\displaystyle\leq\frac{2^{3}}{1-\epsilon}\operatorname{clustering-rev}_{G}(C^{0},C^{1}). (12)

This mirrors the computation in steps (7) through (8). In (9), we substitute $i=0$ into (7) and replace the number of matchings with the new count (recall that at this point we have already halved that value). Step (10) applies the fact that $N<2^{n}$, and (11) is an algebraic simplification. Finally, in (12), we substitute in the clustering revenue. In the revenue analysis, no unreal vertices exist yet, so we may take $p=0$ when applying Lemma 3. However, this lemma handles the case where we do eventually duplicate vertices, so we analyze this level together with the other clusterings where $p=1$, and it only needs to satisfy the bound with $p=1$.

clusteringcostG(C0,C1)\displaystyle\operatorname{clustering-cost}_{G}(C^{0},C^{1}) 22p+11ϵclusteringrevG(C0,C1).\displaystyle\leq\frac{2^{2p+1}}{1-\epsilon}\operatorname{clustering-rev}_{G}(C^{0},C^{1}). (13)

This concludes our proof. ∎
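For reference, the merge cost used in step (2) of this proof can be computed directly from its definition; a small sketch (function name ours):

def merge_cost(w, A, B, other_clusters):
    """Merge cost of combining clusters A and B, as used in the proof of
    Lemma 4: |A| times the total weight from B to everything outside A and B,
    plus |B| times the total weight from A to everything outside A and B."""
    outside = [c for cluster in other_clusters for c in cluster]
    b_out = sum(w[b][c] for b in B for c in outside)
    a_out = sum(w[a][c] for a in A for c in outside)
    return len(A) * b_out + len(B) * a_out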

Lemma 5.

Matching Affinity Clustering obtains a (1/3ϵ)(1/3-\epsilon)-approximation for revenue on graphs of size 2N2^{N}, and a (1/9ϵ)(1/9-\epsilon)-approximation on general graphs whp.

Proof.

We prove this by describing Matching Affinity Clustering. The algorithm starts by allocating one machine to each cluster. It runs Algorithm 1 for either the desired $k$-sized or $n/2$-sized matching, which finds our $(1-\epsilon)$-approximate matching, to create clusters of two vertices each, and then applies the algorithm from Lemma 2 to construct the next clustering graph based on this clustering. This process repeats until a single cluster remains.

From Lemma 4, we see that at each round the clustering cost is bounded above by $\frac{2^{2p+1}}{1-\epsilon}$ times the clustering revenue. Then we use the definition of the cost of an entire hierarchy tree $T$ to obtain bounds.

costG(T)\displaystyle\operatorname{cost}_{G}(T) =2u,vG,uvwG(u,v)+merges of A,BmergecostG(A,B),\displaystyle=2\sum_{u,v\in G,u\neq v}w_{G}(u,v)+\sum_{\text{merges of }A,B}\operatorname{merge-cost}_{G}(A,B), (1)
=2u,vG,uvwG(u,v)+i[logn]clusteringcostG(Ci,Ci+1),\displaystyle=2\sum_{u,v\in G,u\neq v}w_{G}(u,v)+\sum_{i\in[\log n]}\operatorname{clustering-cost}_{G}(C^{i},C^{i+1}), (2)
\leq 2\sum_{u,v\in G,u\neq v}w_{G}(u,v)+\frac{2^{2p+1}}{1-\epsilon}\sum_{i\in[\log n]}\operatorname{clustering-rev}_{G}(C^{i},C^{i+1}), \quad (3)
2u,vG,uvwG(u,v)+22p+11ϵrevG(T).\displaystyle\leq 2\sum_{u,v\in G,u\neq v}w_{G}(u,v)+\frac{2^{2p+1}}{1-\epsilon}\operatorname{rev}_{G}(T). (4)

Step (1) simply breaks down the total cost into merge costs, and step (2) consolidates the merge costs at each level of the hierarchy into clustering costs. Note that every iteration halves the number of clusters, so there are $\log n$ iterations. In (3), we apply the result of Lemma 4, and finally in (4), we sum all the clustering revenues into the total hierarchy revenue. We can then examine the hierarchy revenue.

revG(T)\displaystyle\operatorname{rev}_{G}(T) =nu,vG,uvwG(u,v)costG(T),\displaystyle=n\sum_{u,v\in G,u\neq v}w_{G}(u,v)-\operatorname{cost}_{G}(T),
nu,vG,uvwG(u,v)2u,vG,uvwG(u,v)22p+11ϵrevG(T).\displaystyle\geq n\sum_{u,v\in G,u\neq v}w_{G}(u,v)-2\sum_{u,v\in G,u\neq v}w_{G}(u,v)-\frac{2^{2p+1}}{1-\epsilon}\operatorname{rev}_{G}(T).

The above simply utilizes the duality of revenue and cost, and then substitution from step (4). Next we only require algebraic manipulation to isolate revG(T)\operatorname{rev}_{G}(T).

22p+1+1ϵ1ϵrevG(T)(n2)u,vG,uvwG(u,v).\displaystyle\frac{2^{2p+1}+1-\epsilon}{1-\epsilon}\operatorname{rev}_{G}(T)\geq(n-2)\sum_{u,v\in G,u\neq v}w_{G}(u,v).
revG(T)1ϵ22p+1+1ϵ(n2)u,vG,uvwG(u,v).\displaystyle\operatorname{rev}_{G}(T)\geq\frac{1-\epsilon}{2^{2p+1}+1-\epsilon}(n-2)\sum_{u,v\in G,u\neq v}w_{G}(u,v).

Next, note that in any hierarchy, and in particular in the optimal solution $T^{*}$, every merge has at most $n-2$ non-leaves, so each edge contributes at most $n-2$ times its weight to the revenue. Thus $\operatorname{rev}_{G}(T^{*})\leq(n-2)\sum_{u,v\in G,u\neq v}w_{G}(u,v)$. In addition, since $\frac{1-\epsilon}{2^{2p+1}+1-\epsilon}$ can be made arbitrarily close to $\frac{1}{2^{2p+1}+1}$, we may rewrite it as $\frac{1}{2^{2p+1}+1}-\epsilon$. Applying both facts to our most recent inequality gives the desired result.

revG(T)\displaystyle\operatorname{rev}_{G}(T)\geq (122p+1+1ϵ)revG(T).\displaystyle\left(\frac{1}{2^{2p+1}+1}-\epsilon\right)\operatorname{rev}_{G}(T^{*}).

For an input whose size is a power of two, we have $p=0$ and get a $(\frac{1}{3}-\epsilon)$-approximation for revenue. For all other inputs, $p=1$, and we get a $(\frac{1}{9}-\epsilon)$-approximation. We note that these applications of cost and revenue properties are heavily inspired by Moseley and Wang's proof of the approximation factor of Average Linkage (Moseley & Wang, 2017). ∎
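The cost/revenue duality invoked in this proof can be checked directly on a small hierarchy; below is a sketch under the standard Dasgupta and Moseley-Wang definitions used in the paper (function and variable names are ours).

from itertools import combinations


def cost_and_revenue(w, leaves, merges):
    """Dasgupta cost and Moseley-Wang revenue of a hierarchy over `leaves`,
    where `merges` lists, bottom-up, the pairs of clusters (as frozensets of
    leaves) joined at each internal node.  For every pair of leaves, the first
    merge joining them is their lowest common ancestor; cost charges
    |leaves(LCA)| * w(u, v) and revenue charges (n - |leaves(LCA)|) * w(u, v),
    so revenue = n * (total weight) - cost, the duality used in Lemma 5."""
    n = len(leaves)
    cost = rev = 0.0
    for u, v in combinations(leaves, 2):
        for A, B in merges:
            if (u in A and v in B) or (u in B and v in A):
                size = len(A) + len(B)
                weight = w.get((u, v), w.get((v, u), 0.0))
                cost += size * weight
                rev += (n - size) * weight
                break
    return cost, rev


# toy check of the duality on four points merged as ({1,2}, {3,4}), then all together
w = {(1, 2): 3.0, (3, 4): 1.0, (1, 3): 2.0}
merges = [(frozenset({1}), frozenset({2})), (frozenset({3}), frozenset({4})),
          (frozenset({1, 2}), frozenset({3, 4}))]
cost, rev = cost_and_revenue(w, [1, 2, 3, 4], merges)
assert abs(rev - (4 * sum(w.values()) - cost)) < 1e-9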

Lemma 6.

Matching Affinity Clustering uses O~(n)\widetilde{O}(n) space per machine and runs in O(log(n)loglog(n)(1/ϵ)O(1/ϵ))O(\log(n)\allowbreak\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds on graphs of size 2N2^{N}, and O(log(nW)loglog(n)(1/ϵ)O(1/ϵ))O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)}) rounds in general.

Proof.

First, we use Algorithm 1 to obtain a $k$-sized matching, which runs in $O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{1/\epsilon})$ rounds and $O(n/\mathrm{polylog}(n))$ machine space. After this, there are $\log n$ iterations, and in each iteration we use Ghaffari et al. (2018)'s matching algorithm, which finds our $(1-\epsilon)$-approximate matching in $O(\log\log n\cdot(1/\epsilon)^{O(1/\epsilon)})$ rounds and $O(n/\mathrm{polylog}(n))$ machine space. We also transform the graph as in Lemma 2, which adds no round complexity but requires $O(n)$ space. In total, this requires $O((\log(nW)+\log(n))\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)})=O(\log(nW)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)})$ rounds and $O(n)$ space per machine. Note that when $p=0$, we can run Ghaffari et al. (2018)'s algorithm directly and achieve the slightly better bound of $O(\log(n)\log\log(n)\cdot(1/\epsilon)^{O(1/\epsilon)})$ rounds and $O(n)$ space per machine. ∎

Theorem 5.

There is a graph GG on which Matching Affinity Clustering achieves no better than a (1/3+o(1))(1/3+o(1))-approximation of the optimal revenue.

Proof of Theorem 5.

The graph GG consists of NN vertices. We divide the vertices into N1/3N^{1/3} “columns” to make large cliques. In every column, make a clique with edge weights of 1. In addition, enumerate all vertices in each column. For some index ii, we take all iith vertices in each column and make a “row” (so there are N2/3N^{2/3} rows that are essentially orthogonal to the columns). Rows become cliques as well, with edge weights of 1+ϵ1+\epsilon. All non-edges are assumed to have weight zero.

This is the graph described by Charikar et al. (2019) to show Average Linkage only achieves a 1/31/3-approximation, at best, for revenue. They are able to achieve this because Average Linkage will greedily select all the 1+ϵ1+\epsilon edges to merge across first. We can then leverage these results by showing that Matching Affinity Clustering, too, merges across these edges first.

In our formulation, we assume $N=2^{3n}$ for some $n$. Then there are $2^{n}$ columns and $2^{2n}$ rows, with $2^{2n}$ and $2^{n}$ vertices respectively. Additionally, our algorithm skips the $k$-sized matching step (i.e., all vertices are real). In the first round, we can clearly find a maximum-weight perfect matching by matching within the rows, and we can ensure our approximate matching does the same by selecting a small enough error. In the next clustering graph, since edge weights are the average linkage between nodes, the highest edge weights are still the $1+\epsilon$ weights within the rows. Therefore, this matching pattern continues as long as perfect matchings within the rows exist. Since the rows are cliques of $2^{n}$ vertices, this happens until each row is merged into a single cluster. This is then sufficient to refer to the results of Charikar et al. (2019) to obtain the $1/3$ bound. ∎
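For completeness, the instance used in this proof can be written out explicitly; the sketch below (function name ours) returns only the nonzero weights.

import itertools


def rows_and_columns_graph(n, eps=0.1):
    """Lower-bound instance from the proof of Theorem 5, with N = 2^(3n)
    vertices indexed by (column, row): 2^n column-cliques of size 2^(2n)
    whose edges have weight 1, and 2^(2n) row-cliques of size 2^n whose
    edges have weight 1 + eps; all remaining pairs have weight 0 (omitted)."""
    cols, rows = 2 ** n, 2 ** (2 * n)
    w = {}
    for c in range(cols):
        for r1, r2 in itertools.combinations(range(rows), 2):
            w[(c, r1), (c, r2)] = 1.0
    for r in range(rows):
        for c1, c2 in itertools.combinations(range(cols), 2):
            w[(c1, r), (c2, r)] = 1.0 + eps
    return w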

Appendix C Value approximation

Our goal in this section is to prove Theorem 6.

Theorem 6.

Assume there exists an MPC algorithm that achieves an α\alpha-approximation for minimum weight kk-sized matching whp in O(f(n))O(f(n)) rounds and O~(n)\widetilde{O}(n) machine space. In the value context (where edge weights are data distances) and in O(f(n)log(n))O(f(n)\log(n)) rounds with O~(n)\widetilde{O}(n) machine space, Matching Affinity Clustering achieves (whp):

  • a 23α\frac{2}{3}\alpha-approximation for value when n=2Nn=2^{N},

  • and a 13α\frac{1}{3}\alpha-approximation for value in general.

Since this is effectively the same algorithm as in the revenue context, we can reuse the analysis of Lemma 6 and simplify it to establish the complexity of the algorithm. All that is left is to prove the approximation factor. Our proof is quite similar to that of Cohen-Addad et al. (2018); however, we require some careful manipulation to handle many merges at a time. Fortunately, the fact that we merge based on a minimum matching with respect to average linkage makes the analysis still follow the Cohen-Addad et al. (2018) proof quite well. We start with a lemma.

Lemma 9.

Let $T$ be the tree returned by Matching Affinity Clustering in the distance context. Consider any clustering $C$ at some iteration of Matching Affinity Clustering above the first level, and let $C_{i}$ be its $i$th cluster, formed by merging clusters $A_{i}$ and $B_{i}$ in the previous iteration. Say there are $k$ clusters in $C$. Then, given an $\alpha$-approximation MPC algorithm for minimum weight $k$-sized matching, whp:

i=1kw(Ai,Bi)i=1k|Ai||Bi|2αi=1k(w(Ai)+w(Bi))i=1k(|Ai|(|Ai|1)+|Bi|(|Bi|1))\displaystyle\frac{\sum_{i=1}^{k}w(A_{i},B_{i})}{\sum_{i=1}^{k}|A_{i}|\cdot|B_{i}|}\geq 2\alpha\frac{\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))}{\sum_{i=1}^{k}(|A_{i}|(|A_{i}|-1)+|B_{i}|(|B_{i}|-1))}
Proof.

Let a=12i=1k|Ai|(|Ai|1)a=\frac{1}{2}\sum_{i=1}^{k}|A_{i}|(|A_{i}|-1) and b=12i=1k|Bi|(|Bi|1)b=\frac{1}{2}\sum_{i=1}^{k}|B_{i}|(|B_{i}|-1). Let A=i=1kAiA=\cup_{i=1}^{k}A_{i} and B=i=1kBiB=\cup_{i=1}^{k}B_{i}. Using these, one can see that the average edge weight of all edges contained in any AiA_{i} or BiB_{i} cluster is:

i=1k(w(Ai)+w(Bi))a+b\frac{\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))}{a+b}

These edges were all merged across at some point lower in the hierarchy; that is, the set of edges contained within the $A_{i}$'s and $B_{i}$'s is the union of the edge sets merged across at lower levels of the hierarchy. Therefore, by an averaging argument, there exists a clustering $C^{\prime}$ (with $|C^{\prime}|=k^{\prime}$) below $C$ in the hierarchy, with clusters and merged parts defined analogously as $A_{i}^{\prime}$ and $B_{i}^{\prime}$, such that:

\frac{\sum_{i=1}^{k^{\prime}}w(A_{i}^{\prime},B_{i}^{\prime})}{\sum_{i=1}^{k^{\prime}}|A_{i}^{\prime}|\cdot|B_{i}^{\prime}|}\geq\frac{\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))}{a+b}

Now we would like to build a similar expression for the edges between all AiA_{i} and BiB_{i}. The average of these edge weights is the following expression:

i=1kw(Ai,Bi)i=1k|Ai||Bi|\frac{\sum_{i=1}^{k}w(A_{i},B_{i})}{\sum_{i=1}^{k}|A_{i}|\cdot|B_{i}|}

Consider the iteration that formed CC^{\prime}. Notice because CC is a higher level of the hierarchy, every cluster AjA_{j}^{\prime} and BjB_{j}^{\prime} must be a subset of some AiA_{i} or BiB_{i}. Fix some ii, and consider all the edges that cross from some AjA_{j}^{\prime} to some BkB_{k}^{\prime} such that Aj,BkAiBi=CiA_{j}^{\prime},B_{k}^{\prime}\subset A_{i}\cup B_{i}=C_{i}. The union of all these edges precisely makes up the set of edges between AiA_{i} and BiB_{i}. Do this for every ii, and we can see this makes up all the edges of interest. We can decompose this into a set of matchings across the entire dataset. By another averaging argument, we can say there exists another alternate clustering C′′C^{\prime\prime} (as opposed to CC^{\prime}) which only matches clusters AjA_{j}^{\prime} and BkB_{k}^{\prime} that are descendants of AiA_{i} and BiB_{i} respectively such that:

i=1kw(Ai′′,Bi′′)i=1k|Ai′′||Bi′′|i=1kw(Ai,Bi)i=1k|Ai||Bi|\frac{\sum_{i=1}^{k}w(A_{i}^{\prime\prime},B_{i}^{\prime\prime})}{\sum_{i=1}^{k}|A_{i}^{\prime\prime}|\cdot|B_{i}^{\prime\prime}|}\leq\frac{\sum_{i=1}^{k}w(A_{i},B_{i})}{\sum_{i=1}^{k}|A_{i}|\cdot|B_{i}|}

Note that $C^{\prime\prime}$ corresponds to a valid matching, and therefore a valid clustering, at the time $C^{\prime}$ was selected. Also, note that either both $C^{\prime}$ and $C^{\prime\prime}$ correspond to perfect matchings, or they were both restricted to the same size $k$ in the first step of the algorithm. Thus, since $C^{\prime}$ was formed by an $\alpha$-approximate minimum ($k$-sized) matching in the graph whose edge weights are the average edge weights between clusters, we know:

i=1kw(Ai,Bi)i=1k|Ai||Bi|αi=1kw(Ai′′,Bi′′)i=1k|Ai′′||Bi′′|\frac{\sum_{i=1}^{k}w(A_{i}^{\prime},B_{i}^{\prime})}{\sum_{i=1}^{k}|A_{i}^{\prime}|\cdot|B_{i}^{\prime}|}\leq\alpha\frac{\sum_{i=1}^{k}w(A_{i}^{\prime\prime},B_{i}^{\prime\prime})}{\sum_{i=1}^{k}|A_{i}^{\prime\prime}|\cdot|B_{i}^{\prime\prime}|}

Putting this all together, we find the desired result:

i=1k(w(Ai)+w(Bi))a+bαi=1kw(Ai,Bi)i=1k|Ai||Bi|\frac{\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))}{a+b}\leq\alpha\frac{\sum_{i=1}^{k}w(A_{i},B_{i})}{\sum_{i=1}^{k}|A_{i}|\cdot|B_{i}|}

And we can use this to prove our theorem, similar to the results of Cohen-Addad et al. (2018).

Proof of Theorem 6.

We prove this by induction on the level of the tree. At some level, with clustering $C$, consider truncating the entire tree $T$ at that level, and thus only consider the subtrees below $C$, i.e., with roots in $C$. Call this tree $T_{C}$. We will consider the aggregate value accumulated by this level. The base case holds trivially. We can split the value of $T_{C}$ into the value accumulated at the most recent clustering step and the value one step below $C$, applying induction to the latter. Since the approximation ratio is $\frac{1}{3}$ for $p=1$ and $\frac{2}{3}$ for $p=0$, we can write the ratio as $\left(\frac{1}{3}\right)^{p}\left(\frac{2}{3}\right)^{1-p}$.

val(TC)\displaystyle\operatorname{val}(T_{C})\geq i=1k(|Ai|+|Bi|)w(Ai,Bi)+(13)p(23)1pi=1k(|Ai|w(Ai)+|Bi|w(Bi))\displaystyle\sum_{i=1}^{k}(|A_{i}|+|B_{i}|)w(A_{i},B_{i})+\left(\frac{1}{3}\right)^{p}\left(\frac{2}{3}\right)^{1-p}\sum_{i=1}^{k}(|A_{i}|w(A_{i})+|B_{i}|w(B_{i}))

Now we would like to apply Lemma 9 to modify the first term in a manner similar to Cohen-Addad et al. Specifically, we want to extract terms of the form $|A_{i}|w(B_{i})$ and $|B_{i}|w(A_{i})$. This is harder to do with our formulation of Lemma 9, so we also rely on Lemma 1, which says that the cluster balance is at least $\frac{1}{2}$. Let $m=\min\left(\{|A_{i}|\}_{i=1}^{k}\cup\{|B_{i}|\}_{i=1}^{k}\right)$ be the minimum cluster size. Together with our cluster balance ratio, this implies $m\leq|A_{i}|,|B_{i}|\leq 2m$ for all $i$. Note that when $n=2^{N}$, we have perfect cluster balance, so $|A_{i}|=|B_{i}|=m$. Thus, with our indicator $p$, $m\leq|A_{i}|,|B_{i}|\leq 2^{p}m$. Now we can manipulate the result of Lemma 9, starting by isolating its left-hand numerator.

i=1kw(Ai,Bi)2αi=1k|Ai||Bi|i=1k(w(Ai)+w(Bi))i=1k(|Ai|(|Ai|1)+|Bi|(|Bi|1))\displaystyle\sum_{i=1}^{k}w(A_{i},B_{i})\geq 2\alpha\frac{\sum_{i=1}^{k}|A_{i}|\cdot|B_{i}|\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))}{\sum_{i=1}^{k}(|A_{i}|(|A_{i}|-1)+|B_{i}|(|B_{i}|-1))}

Note now that $\sum_{i=1}^{k}|A_{i}|\cdot|B_{i}|\geq km^{2}$ and $|A_{i}|(|A_{i}|-1)\leq 2^{2p}m^{2}$ (and similarly for $B_{i}$).

i=1kw(Ai,Bi)\displaystyle\sum_{i=1}^{k}w(A_{i},B_{i})\geq 2αkm2i=1k(w(Ai)+w(Bi))22p+1km2\displaystyle 2\alpha\frac{km^{2}\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))}{2^{2p+1}km^{2}}
=\displaystyle= αi=1k(w(Ai)+w(Bi))22p\displaystyle\alpha\frac{\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))}{2^{2p}}

To get the correct term on the left, we see that i=1k(|Ai|+|Bi|)w(Ai,Bi)2mi=1kw(Ai,Bi)\sum_{i=1}^{k}(|A_{i}|+|B_{i}|)w(A_{i},B_{i})\geq 2m\sum_{i=1}^{k}w(A_{i},B_{i}). So we can multiply this result by 2m2m, and then plug it into a portion of the first term. To preserve the ratio for both p=1p=1 and p=0p=0, we multiply it by (23)p(12)1p\left(\frac{2}{3}\right)^{p}\left(\frac{1}{2}\right)^{1-p}.

val(TC)\displaystyle\operatorname{val}(T_{C})\geq (123)1p(113)pi=1k(|Ai|+|Bi|)w(Ai,Bi)\displaystyle\left(1-\frac{2}{3}\right)^{1-p}\left(1-\frac{1}{3}\right)^{p}\sum_{i=1}^{k}(|A_{i}|+|B_{i}|)w(A_{i},B_{i})
+2m22p(23)p(13)1pαi=1k(w(Ai)+w(Bi))\displaystyle+\frac{2m}{2^{2p}}\cdot\left(\frac{2}{3}\right)^{p}\left(\frac{1}{3}\right)^{1-p}\alpha\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))
+(13)p(23)1pαi=1k(|Ai|w(Ai)+|Bi|w(Bi))\displaystyle+\left(\frac{1}{3}\right)^{p}\left(\frac{2}{3}\right)^{1-p}\alpha\sum_{i=1}^{k}(|A_{i}|w(A_{i})+|B_{i}|w(B_{i}))

Next, note m|Ai|,|Bi|m\geq|A_{i}|,|B_{i}|. This can be used on the second term.

val(TC)\displaystyle\operatorname{val}(T_{C})\geq (13)p(23)1pi=1k(|Ai|+|Bi|)w(Ai,Bi)\displaystyle\left(\frac{1}{3}\right)^{p}\left(\frac{2}{3}\right)^{1-p}\sum_{i=1}^{k}(|A_{i}|+|B_{i}|)w(A_{i},B_{i})
+\left(\frac{1}{3}\right)^{p}\left(\frac{2}{3}\right)^{1-p}\alpha\sum_{i=1}^{k}(w(A_{i})+w(B_{i}))
+(13)p(12)1pαi=1k(|Ai|w(Ai)+|Bi|w(Bi))\displaystyle+\left(\frac{1}{3}\right)^{p}\left(\frac{1}{2}\right)^{1-p}\alpha\sum_{i=1}^{k}(|A_{i}|w(A_{i})+|B_{i}|w(B_{i}))
\displaystyle\geq (13)p(23)1pαi=1k|Ci|w(Ci)\displaystyle\left(\frac{1}{3}\right)^{p}\left(\frac{2}{3}\right)^{1-p}\alpha\sum_{i=1}^{k}|C_{i}|w(C_{i})

Therefore, this captures a $\frac{2}{3}\alpha$ fraction of the value of each subtree rooted at a cluster of $C$ when $n=2^{N}$, and a $\frac{1}{3}\alpha$ fraction more generally. ∎

Finally, we can show Theorem 7, which shows the tightness of the stronger approximation factor.

Theorem 7.

There is a graph $G$ on which Matching Affinity Clustering achieves no better than a $(2/3+o(1))$-approximation of the optimal value.

Proof.

Consider GG which is almost a bipartite graph between partitions AA and BB (with |A|=|B||A|=|B|), except with a single perfect matching removed. For instance, if we enumerate A={a1,,an}A=\{a_{1},\ldots,a_{n}\} and B={b1,,bn}B=\{b_{1},\ldots,b_{n}\}, we have w(ai,bj)=1w(a_{i},b_{j})=1 for all iji\neq j and w(ai,bi)=0w(a_{i},b_{i})=0. And since it’s bipartite, w(ai,aj)=w(bi,bj)=0w(a_{i},a_{j})=w(b_{i},b_{j})=0 for all i,ji,j.
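This instance is easy to generate explicitly; a short sketch (function name ours) returning only the nonzero distances:

def bipartite_minus_matching(n):
    """Distance instance from the proof of Theorem 7: parts A = {a_0, ..., a_{n-1}}
    and B = {b_0, ..., b_{n-1}} with w(a_i, b_j) = 1 for i != j; the matched
    pairs (a_i, b_i) and all within-part pairs are at distance 0 (omitted)."""
    return {(("a", i), ("b", j)): 1.0 for i in range(n) for j in range(n) if i != j}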

Consider the removed perfect matching to get clusters {a1,b1},,{an,bn}\{a_{1},b_{1}\},\ldots,\{a_{n},b_{n}\}. Matching Affinity Clustering could start by executing these merges, as this is a zero weight (and thus minimum) matching.

Now consider $G^{\prime}$, the graph remaining after these merges, with a vertex for each cluster and edge weights equal to the total edge weight between clusters. This is a complete graph on $n$ vertices with weight-$2$ edges. By Dasgupta (2016)'s results, we know the value (which is calculated the same way as cost) of any hierarchy on $G^{\prime}$ is $\frac{2}{3}(n^{3}-n)$. However, this ignores the fact that each cluster has size $2$, which doubles every leaf count, so the contribution of this part of the hierarchy on $G$ is a value of $\frac{4}{3}(n^{3}-n)$.

Now consider the obvious good hierarchy, which simply merges all of $A$, then all of $B$, then merges $A$ and $B$ together. All $n(n-1)$ unit-weight edges are then merged into a cluster of size $2n$, for a total value of $2n^{2}(n-1)=2(n^{3}-n^{2})$.

Asymptotically, then, Matching Affinity Clustering only achieves a ratio of 2/32/3 on this graph. ∎

Appendix D Affinity Clustering approximation bounds

We now formally prove our bounds on the theoretical performance of Affinity Clustering.

Theorem 3.

Affinity Clustering cannot achieve better than a O(1/n)O(1/n)-factor approximation for revenue or value.

Recall that this is shown with the following two lemmas:

Lemma 7.

There exists a family of graphs on which Affinity Clustering cannot achieve better than a O(1/n)O(1/n)-factor approximation for revenue.

Proof.

Consider a complete bipartite graph $G$ with $2^{n}$ vertices such that each part, $L$ and $R$, has $2^{n-1}$ vertices, and all edges between the parts have weight $1$. To make it a complete graph, we fill in the remaining pairs with weight-$0$ edges. We first consider how Affinity Clustering might act on $G$. To start, each vertex reaches across one of its highest-weight adjacent edges, a weight-$1$ edge, and merges with that vertex. Therefore, a vertex in $L$ merges with a vertex in $R$, and vice versa. There are many ways this could occur; we consider one specific possibility.
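The instance is simple to construct; a short networkx sketch (function name ours) listing only the weight-1 edges:

import networkx as nx


def lemma7_instance(n):
    """Revenue lower-bound instance from the proof of Lemma 7: a complete
    bipartite graph whose parts L and R each contain 2^(n-1) vertices, with
    all cross edges of weight 1 (the remaining, weight-0 pairs are omitted)."""
    G = nx.Graph()
    L = [("L", i) for i in range(2 ** (n - 1))]
    R = [("R", i) for i in range(2 ** (n - 1))]
    G.add_weighted_edges_from((u, v, 1.0) for u in L for v in R)
    return G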

Take vertices cLc_{L} and cRc_{R} from LL and RR respectively. It is possible that every vertex in LcLL\setminus c_{L} will merge with cRc_{R} and every vertex in RcRR\setminus c_{R} will merge with cLc_{L}. It doesn’t matter which vertices cRc_{R} and cLc_{L} try to merge with. Then GG is divided into two subgraphs, both of which are stars centered at cLc_{L} and cRc_{R} respectively. The spokes of the stars have unit edge weights, and all other edges have weight 0. In the next step, the two subgraphs will merge into one cluster. Since there are no non-leaves at that point, it contributes nothing to the total revenue. Therefore, all revenues are encoded in the first step.

Since the subgraphs are identical, they will contribute the same amount to the hierarchy revenue. Recall that we are trying to prove a bound for whatever method Affinity Clustering might choose to break down a merging of a large subgraph into a series of independent clusters. Notice, however, due to the symmetries of the subgraphs, it does not matter in what order the independent merges occur. Therefore, we consider an arbitrary order. Let TT be the hierarchy of Affinity Clustering with this arbitrary order. We must break it down into individual merges, and let T0T_{0} be the portion of the hierarchy contributing to one of the stars. We only need to sum over the merges of nonzero weight edges.

revG(T)=\displaystyle\operatorname{rev}_{G}(T)= 2revG(T0).\displaystyle 2\operatorname{rev}_{G}(T_{0}).

At each step of merging a star subgraph, we merge a single vertex across a unit-weight edge into the cluster containing the star center. Call this growing cluster $C_{i}$ at the $i$th merge, and let $v_{i}$ be the vertex merged with $C_{i}$ at the $i$th step. Then we can break this down into $2^{n-1}-1$ total merges.

revG(T)=\displaystyle\operatorname{rev}_{G}(T)= 2i=12n11merge-revG({vi},Ci),\displaystyle 2\sum_{i=1}^{2^{n-1}-1}merge\text{-}\operatorname{rev}_{G}(\{v_{i}\},C_{i}), (1)
= 2\sum_{i=0}^{2^{n-1}-2}(2^{n}-i-1), \quad (2)
=\displaystyle= O(22n).\displaystyle O(2^{2n}). (3)

In step (1), we simply break a single star's merging into a series of individual merges in temporal order. Step (2) applies the definition of merge revenue, using the fact that there are $2^{n}$ total vertices and that the two groups being merged contain $i$ and $1$ vertices respectively. Finally, in (3), we evaluate the summation.

Now we consider how Matching Affinity Clustering acts on this graph: it simply finds a maximum-weight matching at each level. In the first iteration, it must match across unit-weight edges, and therefore finds a perfect matching on the bipartite graph. After this, due to symmetry, it finds some perfect matching at each iteration until all clusters are merged; since the number of vertices is $2^{n}$, such a perfect matching always exists. Let $T^{\prime}$ be Matching Affinity Clustering's hierarchy on $G$. We break this down into clusterings at each level, as in Section 4. Note there are $\log_{2}(2^{n})=n$ total clusterings in the hierarchy, and at each clustering we have a matching $M_{i}$ that we merge across, so we can break the revenue into merges across matches. Let $C^{i}$ denote the clustering at the $i$th level of this algorithm.

revG(T)=\displaystyle\operatorname{rev}_{G}(T^{\prime})= i=1nclusteringrevG(Ci,Ci+1),\displaystyle\sum_{i=1}^{n}\operatorname{clustering-rev}_{G}(C^{i},C^{i+1}),
=\displaystyle= i=1n(A,B)MimergerevG(A,B).\displaystyle\sum_{i=1}^{n}\sum_{(A,B)\in M_{i}}\operatorname{merge-rev}_{G}(A,B).

Note that by symmetry, each merge on a level contributes the same amount to the revenue, so we can simply count the number of merges and their contributions. At the $i$th level, there are $2^{n-i}$ total merges. The size of each cluster being merged is $2^{i-1}$, so the number of non-leaves is $2^{n}-2\cdot 2^{i-1}=2^{n}-2^{i}$. Finally, for $i\geq 2$ each merged cluster contains $2^{i-2}$ vertices from each side of the bipartition, so the number of weight-$1$ edges merged across at each merge is $2\cdot 2^{i-2}\cdot 2^{i-2}=2^{2i-3}$ (and a single such edge when $i=1$).

\operatorname{rev}_{G}(T^{\prime})=\sum_{i=2}^{n}2^{n-i}(2^{n}-2^{i})\cdot 2^{2i-3}+2^{n-1}(2^{n}-2)=\Theta(2^{3n}).

Therefore, taking the ratio of the two revenues, we find the following.

\frac{\operatorname{rev}_{G}(T^{\prime})}{\operatorname{rev}_{G}(T)}=\Omega(2^{n}).

So by selecting $n$ large, we can make the hierarchy found by Affinity Clustering arbitrarily worse relative to a valid alternative. If the graph has $N=2^{n}$ vertices, then since the optimal revenue is at least $\operatorname{rev}_{G}(T^{\prime})$, Affinity Clustering achieves at best an $O(1/N)$-factor approximation for revenue on this family of graphs, which matches the $O(1/n)$ bound of the lemma with $n$ taken to be the number of vertices. ∎

Lemma 8.

There exists a family of graphs on which Affinity Clustering cannot achieve better than a O(1/n)O(1/n)-factor approximation for value.

Proof.

Consider a graph $G$ on $4n$ vertices that is simply a matching (i.e., each vertex is connected to exactly one other vertex by an edge of weight $1$, and all remaining edges have weight $0$). Again, recall that Affinity Clustering merges along the edges of a minimum spanning tree.

Partition the vertices into sets of four, which consist of two pairs. Consider one such set: v1v_{1} is matched to v2v_{2} and u1u_{1} is matched to u2u_{2}. Most of the edges here are zero. Therefore, a potential component of the minimum spanning tree is the line v1,u1,u2,v2v_{1},u_{1},u_{2},v_{2}. If we do this for all sets, we can then simply pick an arbitrary root for each set (ie, v1v_{1}), make some arbitrary order of sets, and connect the roots in a line. All of these added edges are weight 0, so this is clearly a valid MST.

However, note that each weight-$1$ edge is contained within some set of four. So if Affinity Clustering merges across the edges of this MST first, then the largest cluster in which a weight-$1$ edge is merged has four vertices. Let $T$ be the tree returned by Affinity Clustering, and note that there are $2n$ weight-$1$ edges.

val(T)4i=12n1=8n.val(T)\leq 4\sum_{i=1}^{2n}1=8n.

We now consider the optimal solution. Since the weight-$1$ edges form a perfect matching, we can place one endpoint of each matched pair on each side of a bipartition, merge each side first, and then merge the two sides together at the top of the hierarchy. This way, every weight-$1$ edge is merged only at the final, $4n$-sized cluster. Call this tree $T^{*}$.

val(T)=4ni=12n1=8n2.val(T^{*})=4n\sum_{i=1}^{2n}1=8n^{2}.

Thus $\operatorname{val}(T)/\operatorname{val}(T^{*})=O(1/n)$, so Affinity Clustering cannot achieve better than an $O(1/n)$-factor approximation on this family of graphs. ∎
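The two value computations above reduce to simple counting; a small sketch of the arithmetic (names ours):

def value_of_unit_edges(cluster_size, num_unit_edges):
    """Value contributed by unit-weight edges merged inside clusters of the
    given size: cluster_size * num_unit_edges (an upper bound when the size
    only upper bounds the merging cluster)."""
    return cluster_size * num_unit_edges


n = 5  # any toy size
affinity_upper = value_of_unit_edges(4, 2 * n)       # every unit edge merged in a <= 4-cluster: 8n
optimal_value = value_of_unit_edges(4 * n, 2 * n)    # every unit edge merged only at the 4n-root: 8n^2
assert affinity_upper == 8 * n and optimal_value == 8 * n ** 2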

D.1 Experiments

Here we provide the full depiction of all experiments.

[Figure 4: Rand Index scores and cluster balance on raw and filtered (randomly pruned so ground truth is balanced, $n=2^{N}$) UCI datasets. Panels: (a) legend; (b) Rand Index on raw data; (c) Rand Index on filtered data; (d) cluster balance on raw data; (e) cluster balance on filtered data.]