
Department of Computer Science and Engineering, Indian Institute of Technology Delhi.
Email: {rjaiswal, amitk}@cse.iitd.ac.in

Clustering What Matters in Constrained Settings

(Improved Outlier to Outlier-Free Reductions)
Ragesh Jaiswal and Amit Kumar
Abstract

Constrained clustering problems generalize classical clustering formulations, e.g., $k$-median and $k$-means, by imposing additional constraints on the feasibility of a clustering. There has been significant recent progress in obtaining approximation algorithms for these problems, both in the metric and the Euclidean settings. However, the outlier version of these problems, where the solution is allowed to leave out $m$ points from the clustering, is not well understood. In this work, we give a general framework for reducing the outlier version of a constrained $k$-median or $k$-means problem to the corresponding outlier-free version with only a $(1+\varepsilon)$-factor loss in the approximation ratio. The reduction is obtained by mapping the original instance of the problem to $f(k,m,\varepsilon)$ instances of the outlier-free version, where $f(k,m,\varepsilon)=\left(\frac{k+m}{\varepsilon}\right)^{O(m)}$. As specific applications, we get the following results:

  • First FPT (in the parameters $k$ and $m$) $(1+\varepsilon)$-approximation algorithm for the outlier version of capacitated $k$-median and $k$-means in Euclidean spaces with hard capacities.

  • First FPT (in the parameters $k$ and $m$) $(3+\varepsilon)$- and $(9+\varepsilon)$-approximation algorithms for the outlier versions of capacitated $k$-median and $k$-means, respectively, in general metric spaces with hard capacities.

  • First FPT (in the parameters $k$ and $m$) $(2-\delta)$-approximation algorithm for the outlier version of the $k$-median problem under the Ulam metric.

Our work generalizes the results of [BGJK20] and [AISX23] to a larger class of constrained clustering problems. Further, our reduction works for arbitrary metric spaces and so can extend clustering algorithms for outlier-free versions in both Euclidean and arbitrary metric spaces.

1 Introduction

Center-based clustering problems such as $k$-median and $k$-means are important data processing tasks. Given a metric $D$ on a set of $n$ points $\mathcal{X}$ and a parameter $k$, the goal is to partition the points into $k$ clusters, say $C_1,\ldots,C_k$, and assign the points in each cluster to a corresponding cluster center, say $c_1,\ldots,c_k$ respectively, such that the objective $\sum_{i=1}^{k}\sum_{x\in C_i}D(x,c_i)^{z}$ is minimized. Here $z$ is a parameter which is 1 for $k$-median and 2 for $k$-means. The outlier version of these problems is specified by another parameter $m$: a solution is allowed to leave out up to $m$ points from the clusters. Outlier versions capture settings where the input may contain a few highly erroneous data points. Both the outlier and the outlier-free versions have been well studied, with constant factor approximations known for both the $k$-means and the $k$-median problem [ANSW17, AGK+04, CGETS02]. In addition, fixed-parameter tractable (FPT) $(1+\varepsilon)$-approximation algorithms are known for these problems in the Euclidean setting [KSS10, FMS07, BGJK20]: the running time of such algorithms is of the form $f(k,m,\varepsilon)\cdot poly(n,d)$, where $f(\cdot)$ is an exponential function of the parameters $k,m,\varepsilon$ and $d$ denotes the dimensionality of the points.

A more recent development in clustering has been the notion of constrained clustering. A constrained clustering problem specifies additional conditions on a feasible partitioning of the input points into $k$ clusters. For example, the $r$-gathering problem requires that each cluster in a feasible partitioning must contain at least $r$ data points. Similarly, the well-known capacitated clustering problem specifies an upper bound on the size of each cluster. Constrained clustering formulations can also capture various types of fairness constraints: each data point has a label assigned to it, and we may require upper or lower bounds on the number (or fraction) of points with a certain label in each cluster. Table 1 gives a list of some of these problems. FPT (in the parameter $k$) constant factor approximation algorithms are known for a large class of these problems (see Table 2).

It is worth noting that constrained clustering problems are distinct from outlier clustering: the former restricts the set of feasible partitionings of the input points, whereas the latter allows us to reduce the set of points that need to be partitioned into clusters. There has not been much progress on constrained clustering problems in the outlier setting (also see [KLS18] for an unbounded integrality gap of the natural LP relaxation of the outlier clustering versions). In this work, we bridge this gap between the outlier and the outlier-free versions of constrained clustering problems by giving an almost approximation-preserving reduction from the former to the latter. As long as the parameters of interest (i.e., $k,m$) are small, the reduction works in polynomial time. Using our reduction, an FPT $\alpha$-approximation algorithm for the outlier-free version of a constrained clustering problem yields an FPT $(\alpha+\varepsilon)$-approximation algorithm for the outlier version of the same problem. For general metric spaces, this implies the first FPT constant-factor approximations for the outlier versions of several constrained clustering problems; similarly, we get new FPT $(1+\varepsilon)$-approximation algorithms for several outlier constrained clustering problems. See Table 2 for the precise details.

This kind of FPT approximation-preserving reduction in the context of Euclidean $k$-means was first given by [BGJK20] using a sampling-based approach. [GJK20] extended the sampling ideas of [BGJK20] to general metric spaces but did not give an approximation-preserving reduction. [AISX23] gave a reduction for general metric spaces using a coreset construction. In this work, we use the sampling-based ideas of [BGJK20] to obtain an approximation-preserving reduction from the outlier version to the outlier-free version with improved parameters over [AISX23]. Moreover, our reduction works for most known constrained clustering settings.

1.1 Preliminaries

We give a general definition of a constrained clustering problem. For a positive integer $k$, we use $[k]$ to denote the set $\{1,\ldots,k\}$. Let $(\mathcal{X},D)$ denote the metric space with distance function $D$. For a point $x$ and a subset $S$ of points, we use $D(x,S)$ to denote $\min_{y\in S}D(x,y)$.

The set $\mathcal{X}$ contains subsets $F$ and $X$: here $X$ denotes the set of input points and $F$ the set of points where a center can be located. An outlier constrained clustering problem is specified by the following parameters and functions (a small code sketch after this list illustrates the check and cost functions):

  • $k$: the number of clusters.

  • $m$: the number of points which can be left out of the clusters.

  • a function check: given a partitioning $X_0,X_1,\ldots,X_k$ of $X$ (here $X_0$ is the set of outliers) and centers $f_1,\ldots,f_k$, each lying in the set $F$, the function ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ outputs 1 iff this is a feasible clustering. For example, in the $r$-gathering problem, ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ outputs 1 iff $|X_i|\geq r$ for each $i\in[k]$. The check function depends only on the cardinalities of the sets $X_1,\ldots,X_k$ and the locations $f_1,\ldots,f_k$. This already captures many of the constrained clustering problems. Our framework also applies to the more general labelled version (see details below).

  • a cost function cost: given a partitioning $X_0,X_1,\ldots,X_k$ of $X$ and centers $f_1,\ldots,f_k$,

    {\textsf{cost}}(X_{1},\ldots,X_{k},f_{1},\ldots,f_{k}):=\sum_{i\in[k]}\sum_{x\in X_{i}}D^{z}(x,f_{i}),

    where $z$ is either 1 (the outlier constrained $k$-median problem) or 2 (the outlier constrained $k$-means problem).
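To make the interface concrete, here is a minimal Python sketch of the check function for the $r$-gathering example and of the cost function (the function names, the use of Euclidean distance, and the list-of-clusters representation are illustrative assumptions, not part of the formal model):

    import math

    def dist(x, f):
        # Euclidean distance as a stand-in for a general metric D.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, f)))

    def check_r_gathering(clusters, centers, r):
        # Feasible iff every cluster has at least r points; the answer depends
        # only on the cluster sizes, as required by the framework.
        return 1 if all(len(X_i) >= r for X_i in clusters) else 0

    def cost(clusters, centers, z=1):
        # z = 1 gives the k-median objective, z = 2 the k-means objective.
        return sum(dist(x, f_i) ** z
                   for X_i, f_i in zip(clusters, centers)
                   for x in X_i)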

Given an instance ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ of an outlier constrained clustering problem as above, the goal is to find a partitioning $X_0,X_1,\ldots,X_k$ of $X$ and centers $f_1,\ldots,f_k\in F$ such that $|X_0|\leq m$,
${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ is 1, and ${\textsf{cost}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ is minimized. The outlier-free constrained clustering problem is specified as above, except that the parameter $m$ is 0. For the sake of brevity, we leave out the parameter $m$ and the set $X_0$ while defining the instance ${\mathcal{I}}$ and the functions check and cost.

We also consider a more general class of constrained clustering problems, where each input point is assigned a label. In other words, an instance ${\mathcal{I}}$ of such a problem is specified by a tuple $(X,F,k,m,\sigma,{\textsf{check}},{\textsf{cost}})$, where $\sigma:X\rightarrow L$ for a finite set $L$. Note that the check function may depend on the function $\sigma$. For example, $\sigma$ could assign a label "red" or "blue" to each point in $X$, and the check function could require that each cluster $X_i$ have an equal number of red and blue points. In addition to the locations $f_1,\ldots,f_k$, the function ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k,\sigma)$ also depends on $|\sigma^{-1}(l)\cap X_j|$ for each $l\in L, j\in[k]$, i.e., the number of points with a particular label in each of the clusters. Indirectly, this also implies that the check function can impose conditions on the labels of the outlier points. For example, the colourful $k$-median problem discussed in [AISX23] has the constraint that $m_i$ clients of label type $i$ should be designated as outliers, given that every client has a unique label. Table 1 gives a description of some of these problems.

We use the approximate triangle inequality, which states that for $z\in\{1,2\}$ and any three points $x_1,x_2,x_3\in\mathcal{X}$,

D^{z}(x_{1},x_{3})\leq z\left(D^{z}(x_{1},x_{2})+D^{z}(x_{2},x_{3})\right). \qquad (1)
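For $z=1$, inequality (1) is simply the triangle inequality; for $z=2$, it follows from the triangle inequality together with the elementary bound $(a+b)^{2}\leq 2(a^{2}+b^{2})$:

D^{2}(x_{1},x_{3})\leq\left(D(x_{1},x_{2})+D(x_{2},x_{3})\right)^{2}\leq 2\left(D^{2}(x_{1},x_{2})+D^{2}(x_{2},x_{3})\right).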

Unconstrained $k$-median (Constraint type: unconstrained)
    Input: $(F,X,k)$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: None, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ always equals 1.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (This includes various versions corresponding to specific metrics, such as the Ulam metric on permutations, metric spaces with constant doubling dimension, etc.)

Fault-tolerant $k$-median (Constraint type: unconstrained but labelled) [HHL+16, IV20]
    Input: $(F,X,k)$ and a number $h(x)\leq k$ for every client $x\in X$
    Output: $(f_1,\ldots,f_k)$
    Constraints: None.
    Objective: Minimise $\sum_{x\in X}\sum_{j=1}^{h(x)}D(x,f_{\pi_x(j)})$, where $\pi_x(j)$ is the index of the $j^{th}$ nearest center to $x$ in $(f_1,\ldots,f_k)$.
    (Label: $h(x)$ may be regarded as the label of the client $x$, so the number of distinct labels $\ell\leq k$.)

Balanced $k$-median (Constraint type: size) [APF+10, Din20]
    Input: $(F,X,k)$ and integers $(r_1,\ldots,r_k)$, $(l_1,\ldots,l_k)$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: $X_i$ should have at least $r_i$ and at most $l_i$ clients, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,\ r_i\leq|X_i|\leq l_i$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (Versions corresponding to specific values of the $r_i$'s and $l_i$'s are known by different names: the version with $l_1=\ldots=l_k=|X|$ is called the $r$-gather problem, and the version with $r_1=\ldots=r_k=0$ is called the $l$-capacity problem.)

Capacitated $k$-median (Constraint type: center + size) [CAL19]
    Input: $(F,X,k)$ with a capacity $s(f)$ for every facility $f\in F$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The number of clients assigned to $f_i$, i.e., $|X_i|$, is at most $s(f_i)$; ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,\ |X_i|\leq s(f_i)$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.

Matroid $k$-median (Constraint type: center) [KKN+11, CAGK+19]
    Input: $(F,X,k)$ and a matroid on $F$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The chosen centers must form an independent set of the matroid, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\{f_1,\ldots,f_k\}$ is an independent set of the matroid.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.

Strongly private $k$-median (Constraint type: label + size) [RS18]
    Input: $(F,X,k)$ and numbers $(l_1,\ldots,l_w)$; each client has a label in $\{1,\ldots,w\}$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: Every $X_i$ has at least $l_j$ clients with label $j$, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,j,\ |X_i\cap S_j|\geq l_j$, where $S_j$ is the set of clients with label $j$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (Labels: the number of distinct labels $\ell=w$.)

$l$-diversity $k$-median (Constraint type: label + size) [BGK+19]
    Input: $(F,X,k)$ and a number $l>1$; each client has one colour from $\{1,\ldots,w\}$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The fraction of clients with colour $j$ in every $X_i$ is at most $1/l$, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,j,\ |X_i\cap S_j|\leq|X_i|/l$, where $S_j$ is the set of clients with colour $j$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (Labels: each colour can be regarded as a label, and hence the number of distinct labels $\ell=w$.)

Fair $k$-median (Constraint type: label + size) [BGK+19, BCFN19]
    Input: $(F,X,k)$ and fairness values $(\alpha_1,\ldots,\alpha_w),(\beta_1,\ldots,\beta_w)$; each client has colours from $\{1,\ldots,w\}$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The fraction of clients with colour $j$ in every $X_i$ is between $\alpha_j$ and $\beta_j$, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,j,\ \alpha_j|X_i|\leq|X_i\cap S_j|\leq\beta_j|X_i|$, where $S_j$ is the set of clients with colour $j$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (There are two versions: (i) each client has a unique label, and (ii) a client can have multiple labels. Labels: $\ell=w$ for the first version and $\ell=2^{w}$ for the second.)

Table 1: The table defines various outlier-free versions of the constrained $k$-median problem. The $k$-means versions are defined similarly using $D^{2}$ instead of $D$. We include a few references. The problems are categorized based on the type of constraints. There are three main types of constraints: (i) size (constraints on the cluster size), (ii) center (constraints on the points a center can serve), and (iii) label (constraints on the labels of points in clusters). A constrained problem can have a combination of these constraint types.

1.2 Our results

Our main result reduces the outlier constrained clustering problem to the outlier-free version. In our reduction, we also use approximation algorithms for the (unconstrained) $k$-median and $k$-means problems. We assume we have a constant factor approximation algorithm for these problems (several such algorithms exist [ANSW17, AGK+04, CGETS02]): let $\mathcal{C}$ denote such an algorithm with running time $T_{\mathcal{C}}(n)$ on an input of size $n$. Note that $\mathcal{C}$ is an algorithm for the $k$-median or the $k$-means problem, depending on whether $z=1$ or $2$ in the definition of the cost function.

Theorem 1.1 (Main Theorem)

Consider an instance ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ of an outlier constrained clustering problem. Let $\mathcal{A}$ be an $\alpha$-approximation algorithm for the corresponding outlier-free constrained clustering problem, and let $T_{\mathcal{A}}(n)$ be the running time of $\mathcal{A}$ on an input of size $n$. Given an $\varepsilon>0$, there is an $\alpha(1+\varepsilon)$-approximation algorithm for ${\mathcal{I}}$ with running time $T_{\mathcal{C}}(n)+q\cdot T_{\mathcal{A}}(n)+O\left(n\cdot\left(k+\frac{m^{z+1}\log m}{\varepsilon^{z}}\right)\right)+O\left(qm^{2}(k+m)^{3}\right)$, where $n$ is the size of ${\mathcal{I}}$, $q=f(k,m,\varepsilon)=\left(\frac{k+m}{\varepsilon}\right)^{O(m)}$, and $z=1$ or $2$ depending on the cost function (i.e., $z=1$ for the $k$-median objective and $z=2$ for the $k$-means objective).

The above theorem implies that as long as there is an FPT or polynomial-time approximation algorithm for the constrained, outlier-free $k$-median or $k$-means clustering problem, there is an FPT approximation algorithm (with almost the same approximation ratio) for the corresponding outlier version. We prove this result by creating $q$ instances of the outlier-free version of ${\mathcal{I}}$ and picking the best solution obtained by running the algorithm $\mathcal{A}$ on these instances. We also extend the above result to the labelled version:

Theorem 1.2 (Main Theorem: labelled version)

Consider an instance ${\mathcal{I}}=(X,F,k,m,\sigma,{\textsf{check}},{\textsf{cost}})$ of an outlier constrained clustering problem with labels on input points. Let $\mathcal{A}$ be an $\alpha$-approximation algorithm for the corresponding outlier-free constrained clustering problem, and let $T_{\mathcal{A}}(n)$ be the running time of $\mathcal{A}$ on an input of size $n$. Given an $\varepsilon>0$, there is an $\alpha(1+\varepsilon)$-approximation algorithm for ${\mathcal{I}}$ with running time $T_{\mathcal{C}}(n)+q\cdot T_{\mathcal{A}}(n)+O\left(n\cdot\left(k+\frac{m^{z+1}\log m}{\varepsilon^{z}}\right)\right)+O\left(q\ell m^{2}(k+m)^{3}\right)$, where $n$ is the size of ${\mathcal{I}}$, $q=f(k,m,\varepsilon)=\left(\frac{(k+m)\ell}{\varepsilon}\right)^{O(m)}$ with $\ell$ being the number of distinct labels, and $z=1$ or $2$ depending on the cost function (i.e., $z=1$ for the $k$-median objective and $z=2$ for the $k$-means objective).

The consequences of our results for specific constrained clustering problems are summarized in Table 2. We also give the results of related works [BGJK20, GJK20, AISX23] in the same table to highlight the contributions of this work. Our contributions can be divided into two main categories:

  1. Matching the best-known result: This can be further divided into two categories:

     (a) Matching results of [AISX23]: [AISX23] gives an outlier to outlier-free reduction. We also give such a reduction, using a different technique and with slightly better parameters. This means that we match all the results of [AISX23], which include problems such as the classical $k$-median/means problems, the Matroid $k$-median problem, the colourful $k$-median problem, and $k$-median in certain special metrics. See rows 2-6 in Table 2.

     (b) Matching results of [GJK20]: [GJK20] gives FPT approximation algorithms for certain constrained problems on which the coreset-based approach of [AISX23] is not known to work; see the last row of Table 2. [GJK20] gives algorithms for the outlier and outlier-free versions with the same approximation guarantee. Since the best outlier-free approximation is also from [GJK20], our results currently only match the approximation guarantees of [GJK20]. However, if the outlier-free guarantee for any of these problems improves, our results will immediately beat the known outlier results of [GJK20].

  2. Best known results: Since our results hold for a larger class of constrained problems than earlier works, there are certain problems for which our results give the best-known FPT approximation algorithms. The list includes capacitated $k$-median/$k$-means with hard capacities in general metric and Euclidean spaces. It also includes the $k$-median problem in the Ulam metric: a recent development for the Ulam $k$-median problem [CDK23] has broken the 2-approximation barrier, and our reduction allows us to carry this development to the outlier setting as well. The outlier-free results from which our best results are derived via our reduction are given in Table 2 (see rows 7-9).

Problem | Outlier-free | Outlier: [GJK20] | Outlier: [AISX23] | Outlier: this work
Euclidean $k$-means (i.e., $F=\mathbb{R}^{d}$, $X\subset\mathbb{R}^{d}$) | $(1+\varepsilon)$ [BJK18] | $\times$ | $(1+\varepsilon)$ | $(1+\varepsilon)$
$k$-median | $\left(1+\frac{2}{e}+\varepsilon\right)$ [CAGK+19] | $(3+\varepsilon)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$
$k$-means | $\left(1+\frac{8}{e}+\varepsilon\right)$ [CAGK+19] | $(9+\varepsilon)$ | $\left(1+\frac{8}{e}+\varepsilon\right)$ | $\left(1+\frac{8}{e}+\varepsilon\right)$
$k$-median/means in metrics with (i) constant doubling dimension, (ii) graphs of bounded treewidth, (iii) graphs that exclude a fixed graph as a minor | $(1+\varepsilon)$ [CASS21] | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means | $(1+\varepsilon)$ | $(1+\varepsilon)$
Matroid $k$-median | $(2+\varepsilon)$ [CAGK+19] | $(3+\varepsilon)$ | $(2+\varepsilon)$ | $(2+\varepsilon)$
Colourful $k$-median | $\left(1+\frac{2}{e}+\varepsilon\right)$ [CAGK+19] | $(3+\varepsilon)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$
Ulam $k$-median (here $F=X$) | $(2-\delta)$ [CDK23] | $(2+\varepsilon)$ | $\times$ | $(2-\delta)$
Euclidean capacitated $k$-median/means | $(1+\varepsilon)$ [CAL19] | $\times$ | $\times$ | $(1+\varepsilon)$
Capacitated $k$-median / capacitated $k$-means | $(3+\varepsilon)$ / $(9+\varepsilon)$ [CAL19] | $\times$ / $\times$ | $\times$ / $\times$ | $(3+\varepsilon)$ / $(9+\varepsilon)$
Uniform/non-uniform $r$-gather, $l$-capacity, and balanced $k$-median/means (uniform means $r_1=\ldots=r_k$ and/or $l_1=\ldots=l_k$); uniform/non-uniform fault-tolerant $k$-median/means (uniform means the same $h(x)$ for every $x$); strongly private, $l$-diversity, and fair $k$-median/means | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means [GJK20] | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means | $\times$ | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means

Table 2: A $\times$ means that the techniques are not known to apply to the problem. The new results that do not follow from previously known results are shaded; the results that were not explicitly reported but follow from the techniques in the respective papers are shaded in a second shade. The techniques of [AISX23] do not apply to the Ulam $k$-median problem since the outlier-free algorithm works only on unweighted instances. Note that all the FPT $(3+\varepsilon)$ and $(9+\varepsilon)$ approximations for the outlier-free versions (leftmost column) in the last row follow from the outlier-free results in [GJK20]. However, the approximation guarantees in the rightmost column depend on those in the leftmost column. This means that, unlike the rigid $(3+\varepsilon)$ and $(9+\varepsilon)$ approximation guarantees of [GJK20] in the middle column, the approximation guarantee in the rightmost column will improve with every improvement in the leftmost.

1.3 Comparison with earlier work

As discussed earlier, the idea of a reduction from an outlier clustering problem to the corresponding outlier-free version, in the context of the Euclidean $k$-means problem, was suggested by [BGJK20] using a $D^2$-sampling based idea. [GJK20] used these sampling ideas to design approximation algorithms for the outlier versions of various constrained clustering problems. However, the approximation guarantee obtained by [GJK20] was limited to $(3+\varepsilon)$ for a large class of constrained $k$-median problems and $(9+\varepsilon)$ for the corresponding constrained $k$-means problems, and it was not clear how to extend these techniques to get improved guarantees. As a result, their techniques could not exploit the recent developments by [CAGK+19] in the design of $(1+2/e+\varepsilon)$ and $(1+8/e+\varepsilon)$ FPT approximation algorithms for the classical outlier-free $k$-median and $k$-means problems, respectively, in general metric spaces. [AISX23] gave an outlier-to-outlier-free reduction, making it possible to extend the above-mentioned FPT approximation guarantees for the outlier-free setting to the outlier setting.

The reduction of [AISX23] is based on the coreset construction of [Che09], which uses uniform sampling. A coreset for a dataset is a weighted set of points such that the clustering cost of the coreset points with respect to any set of $k$ centers is the same (within a $1\pm\varepsilon$ factor) as that of the original point set. The coreset construction in [Che09] starts with a set $C$ of centers that gives a constant factor approximation. It considers $O(\log n)$ "rings" around these centers, uniformly samples points from each of these rings, and sets the weights of the sampled points appropriately. The number of sampled points, and hence the size of the coreset, is $\left(\frac{|C|\log n}{\varepsilon}\right)^{2}$. [AISX23] showed that when one starts with $(k+m)$ centers that give a constant approximation to the classical $(k+m)$-median problem, the coreset obtained as above has the following additional property: for any set of $k$ centers, the clustering cost of the original set of points excluding $m$ outliers is the same (again, within a $1\pm\varepsilon$ factor) as that of the coreset, again allowing for the exclusion of a subset of $m$ points from it. This means that by trying out all subsets of size at most $m$ from the coreset, we ensure that at least one subset acts as a good outlier set. Since the coreset size is $\left(\frac{(k+m)\log n}{\varepsilon}\right)^{2}$, the number of outlier-free instances constructed is $\left(\frac{(k+m)\log n}{\varepsilon}\right)^{O(m)}$. Using $(\log n)^{O(m)}=\max\{m^{O(m)},n^{O(1)}\}$, this is of the form $f(k,m,\varepsilon)\cdot n^{O(1)}$ for a suitable function $f$. At this point, we note the first quantitative difference from our result: our algorithm saves the $(\log n)^{O(m)}$ factor, which also means that the number of instances does not depend on the problem size $n$. Further, a coreset-based construction restricts the kind of problems it can be applied to. The coreset property that the cost of the original points equals the weighted cost of the coreset points holds when points are assigned to the closest center (i.e., the entire weight of a coreset point goes to the closest center); the reason is how Haussler's lemma is applied to bound the cost difference. This works for the classical unconstrained $k$-median and $k$-means problems (as well as the few other settings considered in [AISX23]). However, for several constrained clustering problems, it may not hold that every point is assigned to the closest center. There have been some recent developments [BFS21, BCAJ+22] in designing coresets for constrained clustering settings; however, they have not been shown to apply to the outlier setting. Another recent work [HJLW22] designs coresets for the outlier setting, but, like [AISX23], it has limited scope and has not been shown to extend to most constrained settings. Our $D^z$-sampling-based technique has the advantage that instead of running the outlier-free algorithm on a coreset as in [AISX23], it works directly with the dataset. That is, we run the outlier-free algorithm on the dataset itself (after removing outlier candidates). This also makes our results helpful in weighted settings (e.g., see [CDK23]) where the outlier-free algorithm is known to work only for unweighted datasets (note that a coreset is a weighted set).

1.4 Our Techniques

In this section, we give a high-level description of our algorithm. Let ${\mathcal{I}}$ denote an instance of outlier constrained clustering on a set of points $X$, and let ${\cal O}$ denote an optimal solution to ${\mathcal{I}}$. The first observation is that the optimal cost of the outlier-free and unconstrained clustering with $k+m$ centers on $X$ is a lower bound on the cost of ${\cal O}$ (Claim 1); this observation was used in both [BGJK20] and [AISX23]. Let $C$ denote the set of these $(k+m)$ centers (we can use any constant factor approximation for the unconstrained version to find $C$). The intuition behind choosing $C$ is that the centers in ${\cal O}$ should be close to $C$.

Now we divide the set of $m$ outliers in ${\cal O}$ into two subsets: those which are far from $C$ and the remaining ones, which are close to $C$ ("near" outliers). Our first idea is to randomly sample a subset $S$ of $O(m\log m)$ points from $X$ with sampling probability proportional to the distance (or squared distance) from the set $C$. This sampling ensures that $S$ contains the far outliers with high probability (Claim 2). We can then cycle through all subsets of $S$ to guess the exact subset of far outliers. Handling the near outliers is more challenging and forms the heart of the technical contribution of this paper.

We "assign" each near outlier to its closest point in $C$; let $X^{{\textsf{opt}}}_{N,j}$ be the set of outliers assigned to $c_j$. By cycling over all choices, we can guess the cardinality $t_j$ of each of the sets $X^{{\textsf{opt}}}_{N,j}$. We then set up a suitable minimum cost bipartite $b$-matching instance which assigns a set of $t_j$ points to each center $c_j$. Let ${\widehat{X}}_j$ be the set of points assigned to $c_j$. Our algorithm uses $\cup_{j}{\widehat{X}}_{j}$ as the set of near outliers. In the analysis, we need to argue that there is a way of matching the points in $X^{{\textsf{opt}}}_{N,j}$ to ${\widehat{X}}_j$ whose total cost (sum of distances or squared distances between matched points) is small (Lemma 1). The hope is that we can go from the optimal set of outliers in ${\cal O}$ to the ones chosen by the algorithm and argue that the increase in cost is small. Since we are dealing with constrained clustering, we need to ensure that this process does not change the size of each of the clusters. To achieve this, we further modify the matching between the two sets of outliers (Lemma 2). Finally, with this modified matching, we are able to argue that the cost of the solution produced by the algorithm is close to that of the optimal solution. The extension to the labelled version follows along similar lines.

In the remainder of the paper, we prove our two main results, Theorem 1.1 and Theorem 1.2. The main discussion is for Theorem 1.1, since Theorem 1.2 is an extension of Theorem 1.1 that uses the same proof ideas. In the following sections, we give the details of our algorithm (Section 2) and its analysis (Section 3). In Section 3.1, we discuss the extension to the labelled version.

2 Algorithm

In this section, we describe the algorithm for the outlier constrained clustering problem. Consider an instance ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ of this problem. Recall that the parameter $z=1$ or $2$ depending on whether the cost function is the $k$-median or the $k$-means objective, respectively. In addition, we assume the existence of the following algorithms:

  • A constant factor approximation algorithm for the $k$-median or the $k$-means problem (depending on whether $z=1$ or $z=2$, respectively): an instance here is specified by a tuple $(X',F',k')$ only, where $X'$ is the set of input points, $F'$ is the set of potential locations for a center, and $k'$ denotes the number of clusters. We use $\mathcal{C}$ to denote this algorithm.

  • An algorithm $\mathcal{A}$ for the outlier-free version of this problem. An instance here is given by a tuple $(X',F',k,{\textsf{check}},{\textsf{cost}})$, where the check and cost functions are the same as those in ${\mathcal{I}}$.

  • An algorithm $\mathcal{M}$ for the $b$-matching problem: an instance of the $b$-matching problem is specified by a weighted bipartite graph $G=(L,R=\{v_1,\ldots,v_r\},E)$, where each edge $e$ has an associated weight, and a tuple $(t_1,\ldots,t_r)$ of non-negative integers. A solution is a subset $E'$ of $E$ such that each vertex of $L$ is incident with at most one edge of $E'$, and each vertex $v_j\in R$ is incident with exactly $t_j$ edges of $E'$. The goal is to find such a set $E'$ of minimum total weight.

We now define $D^z$-sampling:

Definition 2.1

Given sets $C$ and $X$ of points, $D^z$-sampling from $X$ w.r.t. $C$ samples a point $x\in X$ with probability proportional to $D^z(x,C)$.
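A minimal Python sketch of this sampling step follows (the function names and the use of Euclidean distance are illustrative assumptions; any metric $D$ can be plugged in):

    import math, random

    def dist(x, c):
        # Euclidean distance as a stand-in for a general metric D.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

    def dz_sample(X, C, z, num_samples, rng=random):
        # D^z-sampling: pick points of X with probability proportional to
        # D^z(x, C), the z-th power of the distance to the closest center in C.
        weights = [min(dist(x, c) for c in C) ** z for x in X]
        return rng.choices(X, weights=weights, k=num_samples)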

The algorithm is described in Algorithm 1. It first runs the algorithm $\mathcal{C}$ to obtain a set $C$ of $(k+m)$ centers (line 2). In line 3, we sample a subset $S$, where each point in $S$ is sampled independently using $D^z$-sampling w.r.t. $C$. Given a subset $Y$, we say that a tuple ${\boldsymbol{\tau}}=(t_1,\ldots,t_{k+m})$ is valid if $t_j\geq 0$ for all $j\in[k+m]$ and $\sum_j t_j+|Y|=m$. For each subset $Y$ of $S$ of size at most $m$ and for each valid tuple ${\boldsymbol{\tau}}$, the algorithm constructs a solution $(X^{(Y,{\boldsymbol{\tau}})}_0,X^{(Y,{\boldsymbol{\tau}})}_1,\ldots,X^{(Y,{\boldsymbol{\tau}})}_k)$, where $X^{(Y,{\boldsymbol{\tau}})}_0$ denotes the set of outlier points. This is done by first computing the set $X^{(Y,{\boldsymbol{\tau}})}_0$ and then using the algorithm $\mathcal{A}$ on the remaining points (line 8). To find the set $X^{(Y,{\boldsymbol{\tau}})}_0$, we first construct an instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ of $b$-matching (line 6). This instance is defined as follows: the bipartite graph has the set of points $X$ on one side (playing the role of $L$) and the set $C$ of $(k+m)$ centers on the other side (playing the role of $R$). The weight of an edge between a vertex $v\in C$ and $w\in X$ is equal to $D^z(v,w)$. For each vertex $v_j\in C$, we require that it is matched to exactly $t_j$ points of $X$. We run the algorithm $\mathcal{M}$ on this instance of $b$-matching (line 7) and define $X^{(Y,{\boldsymbol{\tau}})}_0$ as the set of points of $X$ matched by this algorithm. Finally, we output the solution of minimum cost (line 13).

 1: Input: ${\mathcal{I}}:=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$
 2: Execute $\mathcal{C}$ on the instance ${\mathcal{I}}^{\prime}:=(X,F,k+m)$ to obtain a set $C$ of $k+m$ centers.
 3: Sample a set $S$ of $\frac{4\beta m\log m}{\varepsilon}$ points with replacement, each using $D^{z}$-sampling from $X$ w.r.t. $C$.
 4: for each subset $Y\subseteq S$ with $|Y|\leq m$ do
 5:     for each valid tuple ${\boldsymbol{\tau}}=(t_{1},\ldots,t_{k+m})$ do
 6:         Construct the instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$.
 7:         Run $\mathcal{M}$ on ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ and let $X_{0}^{(Y,{\boldsymbol{\tau}})}$ be the set of matched points in $X$.
 8:         Run the algorithm $\mathcal{A}$ on the instance $(X\setminus(X_{0}^{(Y,{\boldsymbol{\tau}})}\cup Y),F,k,{\textsf{check}},{\textsf{cost}})$.
 9:         Let $(X^{(Y,{\boldsymbol{\tau}})}_{1},\ldots,X^{(Y,{\boldsymbol{\tau}})}_{k})$ be the clustering produced by $\mathcal{A}$.
10:     end for
11: end for
12: Let $(Y^{\star},{\boldsymbol{\tau}}^{\star})$ be the pair for which ${\textsf{cost}}(X^{(Y,{\boldsymbol{\tau}})}_{1},\ldots,X^{(Y,{\boldsymbol{\tau}})}_{k})$ is minimized.
13: Output $(X^{(Y^{\star},{\boldsymbol{\tau}}^{\star})}_{0},X^{(Y^{\star},{\boldsymbol{\tau}}^{\star})}_{1},\ldots,X^{(Y^{\star},{\boldsymbol{\tau}}^{\star})}_{k})$.
Algorithm 1: Algorithm for outlier constrained clustering.
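To illustrate lines 6-7 concretely, here is a minimal Python sketch (an illustrative simplification under the assumption of Euclidean points stored as NumPy arrays, not the paper's implementation) that realizes the $b$-matching instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ by making $t_j$ copies of each center $c_j$ and solving a minimum-cost assignment with SciPy:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def near_outliers(X, C, tau, z):
        # X: list of points (NumPy arrays), C: list of k+m centers, tau[j] = t_j.
        # Make t_j copies of center c_j; each copy must be matched to a distinct
        # point of X, which is exactly the b-matching requirement of Algorithm 1.
        copies = [j for j, t_j in enumerate(tau) for _ in range(t_j)]
        if not copies:
            return set()
        # cost[i][p] = D^z(c_{copies[i]}, x_p)
        cost = np.array([[np.linalg.norm(C[j] - x) ** z for x in X] for j in copies])
        rows, cols = linear_sum_assignment(cost)  # optimal: fewer rows than columns
        return set(cols)  # indices of matched points, i.e., X_0^{(Y, tau)}

As noted in the running-time discussion in Section 3, it suffices to restrict each center to its $m$ closest points, which keeps the matching instance small.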

3 Analysis

We now analyze Algorithm 1 and refer to the notation used in it. Let ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ be the instance of the outlier constrained clustering problem, and let ${\textsf{opt}}({\mathcal{I}})$ denote the optimal cost of a solution for the instance ${\mathcal{I}}$. Assume that the algorithm $\mathcal{C}$ for the unconstrained clustering problem (used in line 2 of Algorithm 1) is a $\beta$-approximation algorithm. We overload notation and use ${\textsf{cost}}_{{\mathcal{I}}'}(C)$ to denote the cost of the solution $C$ for the instance ${\mathcal{I}}'$. Observe that the quantity ${\textsf{cost}}_{{\mathcal{I}}'}(C)$ can be computed as follows: each point in $X$ is assigned to the closest point in $C$, and then we compute the total cost (which is the $k$-median or the $k$-means cost, depending on the value of the parameter $z$) of this assignment. We first relate ${\textsf{cost}}_{{\mathcal{I}}'}(C)$ to ${\textsf{opt}}({\mathcal{I}})$.

Claim 1

${\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)\leq\beta\cdot{\textsf{opt}}({\mathcal{I}})$.

Proof

Let $(X_0,X_1,\ldots,X_k)$ denote the optimal solution for ${\mathcal{I}}$, where $X_0$ denotes the set of $m$ outlier points. Let $c_1,\ldots,c_k$ be the centers of the clusters $X_1,\ldots,X_k$, respectively. Consider the solution to ${\mathcal{I}}'$ consisting of the centers $C':=X_0\cup\{c_1,\ldots,c_k\}$. Clearly, ${\textsf{cost}}_{{\mathcal{I}}'}(C')\leq{\textsf{opt}}({\mathcal{I}})$ (we have an inequality here because the solution $X_1,\ldots,X_k$ may not be a Voronoi partition with respect to $c_1,\ldots,c_k$, and each outlier point in $X_0$ is served at zero cost by its own copy in $C'$). Since $\mathcal{C}$ is a $\beta$-approximation algorithm, we know that ${\textsf{cost}}_{{\mathcal{I}}'}(C)\leq\beta\cdot{\textsf{cost}}_{{\mathcal{I}}'}(C')$. Combining these two facts implies the desired result. ∎

We now consider an optimal solution for the instance ${\mathcal{I}}$: let $X^{{\textsf{opt}}}_0,X^{{\textsf{opt}}}_1,\ldots,X^{{\textsf{opt}}}_k$ be the partition of the input points $X$ in this solution, with $X^{{\textsf{opt}}}_0$ being the set of $m$ outliers. Depending on the distance from $C$, we divide the set $X^{{\textsf{opt}}}_0$ into two subsets, $X^{{\textsf{opt}}}_F$ ("far" points) and $X^{{\textsf{opt}}}_N$ ("near" points), as follows:

X^{{\textsf{opt}}}_{F}:=\left\{x\in X^{{\textsf{opt}}}_{0}\;\middle|\;D^{z}(x,C)\geq\frac{\varepsilon\,{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}{2\beta m}\right\},\quad X^{{\textsf{opt}}}_{N}:=X^{{\textsf{opt}}}_{0}\setminus X^{{\textsf{opt}}}_{F}.

Recall that we sample a set $S$ of $\frac{4\beta m\log m}{\varepsilon}$ points using $D^z$-sampling with respect to the center set $C$ (line 3 of Algorithm 1). Note that the probability of sampling a point $x$ is given by

\frac{D^{z}(x,C)}{\sum_{x^{\prime}\in X}D^{z}(x^{\prime},C)}=\frac{D^{z}(x,C)}{{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}. \qquad (2)

We first show that $S$ contains all the points in $X^{{\textsf{opt}}}_F$ with high probability.

Claim 2

$\mathbf{Pr}[X^{{\textsf{opt}}}_{F}\subseteq S]\geq 1-1/m$.

Proof

Equation (2) shows that the probability of sampling a point $x\in X^{{\textsf{opt}}}_F$ in a single draw is $\frac{D^{z}(x,C)}{{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}\geq\frac{\varepsilon}{2\beta m}$. So the probability that $x$ is not present in $S$ is at most $\left(1-\frac{\varepsilon}{2\beta m}\right)^{\frac{4\beta m\log m}{\varepsilon}}\leq e^{-2\log m}=\frac{1}{m^{2}}$, using $1-t\leq e^{-t}$. Since $|X^{{\textsf{opt}}}_F|\leq m$, the desired result now follows from the union bound. ∎

For the rest of the analysis, we assume that the event in Claim 2 holds. We now note that the total cost of assigning $X^{{\textsf{opt}}}_N$ to $C$ is at most $\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}})$.

Claim 3

$\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,C)\leq\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}})$.

Proof

The claim follows from the following sequence of inequalities:

\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,C)<\sum_{x\in X^{{\textsf{opt}}}_{N}}\frac{\varepsilon\,{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}{2\beta m}\leq\sum_{x\in X^{{\textsf{opt}}}_{N}}\frac{\varepsilon\cdot{\textsf{opt}}({\mathcal{I}})}{2m}\leq\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}}),

where the first inequality follows from the definition of $X^{{\textsf{opt}}}_N$, the second from Claim 1, and the last from $|X^{{\textsf{opt}}}_N|\leq m$. ∎

For every point in $X^{{\textsf{opt}}}_N$, we identify the closest center in $C=\{c_1,\ldots,c_{m+k}\}$ (breaking ties arbitrarily). For each $j\in[k+m]$, let $X^{{\textsf{opt}}}_{N,j}$ be the set of points in $X^{{\textsf{opt}}}_N$ which are closest to $c_j$, and let ${\hat{t}}_j$ denote $|X^{{\textsf{opt}}}_{N,j}|$. Consider the iteration of the loops in lines 4-5 where $Y=X^{{\textsf{opt}}}_F$ and ${\boldsymbol{\tau}}=({\hat{t}}_1,\ldots,{\hat{t}}_{k+m})$. Observe that ${\boldsymbol{\tau}}$ is valid with respect to $Y$ because $\sum_{j\in[m+k]}{\hat{t}}_j+|Y|=m$. Let ${\widehat{X}}_1,\ldots,{\widehat{X}}_{m+k}$ be the sets of points assigned to $c_1,\ldots,c_{m+k}$, respectively, by the algorithm $\mathcal{M}$. Intuitively, we would like to construct a solution where the set of outliers is given by ${\widehat{X}}:=X^{{\textsf{opt}}}_F\cup{\widehat{X}}_1\cup\cdots\cup{\widehat{X}}_{m+k}$. We now show that the set ${\widehat{X}}$ is "close" to $X^{{\textsf{opt}}}_0$, the set of outliers in the optimal solution. In order to do this, we set up a bijection $\mu:X^{{\textsf{opt}}}_0\rightarrow{\widehat{X}}$, where $\mu$ restricted to $X^{{\textsf{opt}}}_F$ is the identity, and $\mu$ restricted to any of the sets $X^{{\textsf{opt}}}_{N,j}$ is a bijection from $X^{{\textsf{opt}}}_{N,j}$ to ${\widehat{X}}_j$. Such a function $\mu$ exists because for each $j\in[m+k]$, $|X^{{\textsf{opt}}}_{N,j}|=|{\widehat{X}}_j|={\hat{t}}_j$. We now prove this closeness property.

Lemma 1
\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,\mu(x))\leq\varepsilon\cdot z\cdot{\textsf{opt}}({\mathcal{I}}).
Proof

We first note a useful property of the solution given by the algorithm $\mathcal{M}$. One possible solution for the instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ is to assign $X^{{\textsf{opt}}}_{N,j}$ to the center $c_j$ for each $j$. Since $\mathcal{M}$ is an optimal algorithm for $b$-matching, we get

\sum_{j\in[k+m]}\sum_{x\in{\widehat{X}}_{j}}D^{z}(x,c_{j})\leq\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}D^{z}(x,c_{j})=\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,C)\leq\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}}), \qquad (3)

where the last inequality follows from Claim 3. Now,

\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,\mu(x))=\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,\mu(x))=\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}D^{z}(x,\mu(x))
\stackrel{(1)}{\leq}z\cdot\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}\left(D^{z}(x,c_{j})+D^{z}(c_{j},\mu(x))\right), \qquad (4)

where the first equality follows from the fact that $\mu$ is the identity on $X^{{\textsf{opt}}}_F$. Since $\mu$ is a bijection from $X^{{\textsf{opt}}}_{N,j}$ to ${\widehat{X}}_j$, the right-hand side above can also be written as

z\cdot\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}D^{z}(x,c_{j})+z\cdot\sum_{j\in[k+m]}\sum_{x\in{\widehat{X}}_{j}}D^{z}(x,c_{j})\leq z\cdot\varepsilon\cdot{\textsf{opt}}({\mathcal{I}}),

where the last inequality follows from Claim 3 and (3). This proves the desired result. ∎

The mapping $\mu$ described above may have the following undesirable property: there could be a point $x\in X^{{\textsf{opt}}}_0\cap{\widehat{X}}$ such that $\mu(x)\neq x$. This could happen if $x\in X^{{\textsf{opt}}}_{N,j}$ and $x\in{\widehat{X}}_i$ with $i\neq j$. We now show that $\mu$ can be modified into another bijection ${\widehat{\mu}}:X^{{\textsf{opt}}}_0\rightarrow{\widehat{X}}$ which is the identity on $X^{{\textsf{opt}}}_0\cap{\widehat{X}}$. Note that the mapping ${\widehat{\mu}}$ is only needed for the analysis of the algorithm.

Lemma 2

There is a bijection ${\widehat{\mu}}:X^{{\textsf{opt}}}_{0}\rightarrow{\widehat{X}}$ such that ${\widehat{\mu}}(x)=x$ for all $x\in X^{{\textsf{opt}}}_{0}\cap{\widehat{X}}$ and

\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,{\widehat{\mu}}(x))\leq m^{z-1}\,\varepsilon\cdot z\cdot{\textsf{opt}}({\mathcal{I}}).
Proof

We construct a directed graph $H=(V_1,E_1)$, where $V_1=X^{{\textsf{opt}}}_0\cup{\widehat{X}}$. For every $x\in X^{{\textsf{opt}}}_0$, we add the directed arc $(x,\mu(x))$ to $E_1$. Observe that a self-loop in $H$ implies that $\mu(x)=x$. Every vertex in $X^{{\textsf{opt}}}_0\setminus{\widehat{X}}$ has in-degree 0 and out-degree 1, whereas a vertex in ${\widehat{X}}\setminus X^{{\textsf{opt}}}_0$ has in-degree 1 and out-degree 0. Vertices in ${\widehat{X}}\cap X^{{\textsf{opt}}}_0$ have exactly one incoming and one outgoing arc (a self-loop counts towards both the in-degree and the out-degree of the corresponding vertex).

The desired bijection ${\widehat{\mu}}$ is initialized to $\mu$. Let ${\textsf{cost}}({\widehat{\mu}})$ denote $\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,{\widehat{\mu}}(x))$; define ${\textsf{cost}}(\mu)$ similarly. It is easy to check that $H$ is a vertex-disjoint union of directed cycles and paths. In the case of a directed cycle $C'$ on more than one vertex, it must be the case that each vertex of $C'$ belongs to ${\widehat{X}}\cap X^{{\textsf{opt}}}_0$. In this case, we update ${\widehat{\mu}}$ by defining ${\widehat{\mu}}(x)=x$ for each $x$ on $C'$. Clearly, this can only decrease ${\textsf{cost}}({\widehat{\mu}})$. Let $P_1,\ldots,P_l$ be the set of directed paths in $H$. For each path $P_j$, we perform the following update: let $P_j$ be a path from $a_j$ to $b_j$. We know that $a_j\in X^{{\textsf{opt}}}_0\setminus{\widehat{X}}$, $b_j\in{\widehat{X}}\setminus X^{{\textsf{opt}}}_0$, and each internal vertex of $P_j$ lies in ${\widehat{X}}\cap X^{{\textsf{opt}}}_0$. We update ${\widehat{\mu}}$ as follows: ${\widehat{\mu}}(a_j)=b_j$ and ${\widehat{\mu}}(v)=v$ for each internal vertex $v$ of $P_j$. The overall increase in ${\textsf{cost}}({\widehat{\mu}})$ is equal to

\sum_{j\in[l]}\left(D^{z}(a_{j},b_{j})-\sum_{i=1}^{n_{j}}D^{z}(v_{j}^{i},v_{j}^{i-1})\right), \qquad (5)

where $a_j=v_j^{0},v_j^{1},\ldots,v_j^{n_j}=b_j$ denotes the sequence of vertices in $P_j$. If $z=1$, the triangle inequality shows that the above quantity is at most 0. In case $z=2$,

D^{2}(a_{j},b_{j})\leq n_{j}\left(\sum_{i=1}^{n_{j}}D^{2}(v_{j}^{i},v_{j}^{i-1})\right),

and so the quantity in (5) is at most $(n_j-1)\sum_{i=1}^{n_j}D^{2}(v_j^{i},v_j^{i-1})$.

Since the internal vertices of each path lie in $X^{{\textsf{opt}}}_0\cap{\widehat{X}}$ and $|X^{{\textsf{opt}}}_0|\leq m$, each path has at most $m$ edges, so $n_j-1\leq m-1$. It follows that ${\textsf{cost}}({\widehat{\mu}})\leq m^{z-1}\,{\textsf{cost}}(\mu)$. The desired result now follows from Lemma 1. ∎
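The rerouting in the proof above is purely combinatorial. The following Python sketch (illustrative only, with hypothetical names; it is not needed by the algorithm itself, since ${\widehat{\mu}}$ is used only in the analysis) computes ${\widehat{\mu}}$ from $\mu$ by walking the directed paths and fixing the cycles:

    def fix_bijection(mu):
        # mu: dict mapping each (hashable) point of X0_opt to a distinct point of X_hat.
        # Returns mu_hat as in Lemma 2: identity on X0_opt ∩ X_hat, and each path
        # source a_j mapped directly to its path sink b_j.
        X0, Xhat = set(mu), set(mu.values())
        mu_hat = {}
        for a in X0 - Xhat:          # sources of the directed paths
            v = mu[a]
            while v in X0:           # internal vertices lie in X0 ∩ X_hat
                mu_hat[v] = v
                v = mu[v]
            mu_hat[a] = v            # v is now the sink b_j in X_hat \ X0
        for v in X0 - set(mu_hat):   # remaining vertices lie on cycles; keep them fixed
            mu_hat[v] = v
        return mu_hat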

We run the algorithm $\mathcal{A}$ on the outlier-free constrained clustering instance ${\mathcal{I}}''=(X\setminus{\widehat{X}},F,k,{\textsf{check}},{\textsf{cost}})$ (line 8 of Algorithm 1). Let ${\textsf{opt}}({\mathcal{I}}'')$ be the optimal cost of a solution for this instance. The following key lemma shows that ${\textsf{opt}}({\mathcal{I}}'')$ is close to ${\textsf{opt}}({\mathcal{I}})$.

Lemma 3

${\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq(1+\varepsilon^{\frac{1}{z}}(2m+1)^{z-1})\,{\textsf{opt}}({\mathcal{I}})$.

Proof

We use the solution $(X^{{\textsf{opt}}}_0,\ldots,X^{{\textsf{opt}}}_k)$ to construct a feasible solution for ${\mathcal{I}}''$. For each $j\in[k]$, let $Z_j$ denote $X^{{\textsf{opt}}}_j\cap{\widehat{X}}$, and let ${\widehat{\mu}}^{-1}(Z_j)$ denote the pre-image of $Z_j$ under ${\widehat{\mu}}$. Since $Z_j\subseteq{\widehat{X}}\setminus X^{{\textsf{opt}}}_0$ and ${\widehat{\mu}}$ is the identity on $X^{{\textsf{opt}}}_0\cap{\widehat{X}}$, we have ${\widehat{\mu}}^{-1}(Z_j)\subseteq X^{{\textsf{opt}}}_0\setminus{\widehat{X}}$. For each $j\in[k]$, define

X^{\prime}_{j}:=(X^{{\textsf{opt}}}_{j}\setminus Z_{j})\cup{\widehat{\mu}}^{-1}(Z_{j}).
Claim 4
$\bigcup_{j=1}^{k}X_{j}^{\prime}=X\setminus{\widehat{X}}$.
Proof

For any $j\in[k]$, we have already argued that ${\widehat{\mu}}^{-1}(Z_j)\subseteq X^{{\textsf{opt}}}_0\setminus{\widehat{X}}\subseteq X\setminus{\widehat{X}}$. Clearly, $X^{{\textsf{opt}}}_j\setminus Z_j\subseteq X\setminus{\widehat{X}}$. Therefore $X_j'\subseteq X\setminus{\widehat{X}}$, and hence $\cup_{j\in[k]}X_j'\subseteq X\setminus{\widehat{X}}$. The sets $X_1',\ldots,X_k'$ are pairwise disjoint (the sets $X^{{\textsf{opt}}}_j\setminus Z_j$ are disjoint, the sets ${\widehat{\mu}}^{-1}(Z_j)$ are disjoint since ${\widehat{\mu}}$ is a bijection, and the former are disjoint from $X^{{\textsf{opt}}}_0$) and $|X_j'|=|X^{{\textsf{opt}}}_j|$, so

\sum_{j\in[k]}|X_{j}^{\prime}|=n-m=|X\setminus{\widehat{X}}|.

This proves the claim. ∎

The above claim implies that $(X_1',\ldots,X_k')$ is a partition of $X\setminus{\widehat{X}}$. Since $|X_j'|=|X^{{\textsf{opt}}}_j|$ for all $j\in[k]$ and the function check only depends on the cardinalities of the sets in the partition, $(X_1',\ldots,X_k')$ is a feasible partition (under check) of $X\setminus{\widehat{X}}$. In the optimal solution for ${\mathcal{I}}$, let $f^{{\textsf{opt}}}_1,\ldots,f^{{\textsf{opt}}}_k$ be the $k$ centers corresponding to the clusters $X^{{\textsf{opt}}}_1,\ldots,X^{{\textsf{opt}}}_k$, respectively. Now,

{\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq{\textsf{cost}}(X_{1}^{\prime},\ldots,X_{k}^{\prime},f^{{\textsf{opt}}}_{1},\ldots,f^{{\textsf{opt}}}_{k})\leq\sum_{j\in[k]}\sum_{x\in X_{j}^{\prime}}D^{z}(x,f^{{\textsf{opt}}}_{j}). \qquad (6)

For each $j\in[k]$, we estimate the quantity $\sum_{x\in X_j'}D^{z}(x,f^{{\textsf{opt}}}_j)$. Using the definition of $X_j'$ and the triangle inequality, this quantity can be bounded by

\sum_{x\in X^{{\textsf{opt}}}_{j}\setminus Z_{j}}D^{z}(x,f^{{\textsf{opt}}}_{j})+\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}D^{z}(x,f^{{\textsf{opt}}}_{j})\leq\sum_{x\in X^{{\textsf{opt}}}_{j}\setminus Z_{j}}D^{z}(x,f^{{\textsf{opt}}}_{j})+\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}\left(D(x,{\widehat{\mu}}(x))+D({\widehat{\mu}}(x),f^{{\textsf{opt}}}_{j})\right)^{z}. \qquad (7)

When $z=1$, the above is at most the following (the terms $D({\widehat{\mu}}(x),f^{{\textsf{opt}}}_{j})$ sum over $Z_j$ after the change of variables $y={\widehat{\mu}}(x)$ and combine with the first sum):

\sum_{x\in X^{{\textsf{opt}}}_{j}}D(x,f^{{\textsf{opt}}}_{j})+\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}D(x,{\widehat{\mu}}(x)).

Using this bound in (6), we see that

{\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq{\textsf{opt}}({\mathcal{I}})+\sum_{x\in X^{{\textsf{opt}}}_{0}}D(x,{\widehat{\mu}}(x))\leq(1+\varepsilon)\,{\textsf{opt}}({\mathcal{I}}),

where the last inequality follows from Lemma 2. This proves the desired result for $z=1$. When $z=2$, we use the fact that for any two reals $a,b$,

(a+b)^{2}\leq(1+\sqrt{\varepsilon})a^{2}+\left(1+\frac{1}{\sqrt{\varepsilon}}\right)b^{2}.
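This follows from the AM-GM bound $2ab\leq\sqrt{\varepsilon}\,a^{2}+\frac{1}{\sqrt{\varepsilon}}\,b^{2}$:

(a+b)^{2}=a^{2}+2ab+b^{2}\leq a^{2}+\sqrt{\varepsilon}\,a^{2}+\frac{1}{\sqrt{\varepsilon}}\,b^{2}+b^{2}=(1+\sqrt{\varepsilon})a^{2}+\left(1+\frac{1}{\sqrt{\varepsilon}}\right)b^{2}.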

Using this fact, the expression in the RHS of (7) can be upper bounded by

(1+\sqrt{\varepsilon})\sum_{x\in X^{{\textsf{opt}}}_{j}}D^{2}(x,f^{{\textsf{opt}}}_{j})+\left(1+\frac{1}{\sqrt{\varepsilon}}\right)\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}D^{2}(x,{\widehat{\mu}}(x)).

Substituting this expression in (6) and using Lemma 2, we see that

{\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq(1+\sqrt{\varepsilon})\,{\textsf{opt}}({\mathcal{I}})+2m\sqrt{\varepsilon}\,{\textsf{opt}}({\mathcal{I}}).

This proves the desired result for z=2z=2. ∎

The approximation-preserving properties of Theorem 1.1 follow from the above analysis. For the $k$-means problem, since the approximation term is $(1+\sqrt{\varepsilon}(2m+1))$, we can replace $\varepsilon$ with $\varepsilon^{2}/(2m+1)^{2}$ in the algorithm and the analysis to obtain a $(1+\varepsilon)$ factor. Let us quickly check the running time of the algorithm. The algorithm first runs $\mathcal{C}$, which takes $T_{\mathcal{C}}(n)$ time. This is followed by $D^z$-sampling $O(\frac{m^{z+1}\log m}{\varepsilon^{z}})$ points, which takes $O(n\cdot(k+\frac{m^{z+1}\log m}{\varepsilon^{z}}))$ time. The number of iterations of the for-loops is determined by the number of subsets of $S$ of size at most $m$, which is $\sum_{i=0}^{m}\binom{|S|}{i}=\left(\frac{m}{\varepsilon}\right)^{O(m)}$, and the number of possibilities for ${\boldsymbol{\tau}}$, which is at most $\binom{2m+k-1}{m}=(m+k)^{O(m)}$. This gives the number of iterations $q=f(k,m,\varepsilon)=\left(\frac{k+m}{\varepsilon}\right)^{O(m)}$. In every iteration, in addition to running $\mathcal{A}$, we solve a weighted $b$-matching problem on a bipartite graph $(L\cup R,E)$, where $R$ has $(k+m)$ vertices (corresponding to the $k+m$ centers in the center set $C$) and $L$ has at most $(k+m)\cdot m$ vertices (considering the $m$ closest clients of every center is sufficient). So, every iteration costs $T_{\mathcal{A}}(n)+O((k+m)^{3}m^{2})$ time. This gives the running time expression in Theorem 1.1.

3.1 Extension to labelled version

In this section, we extend Algorithm 1 to the setting where points in $X$ have labels from a finite set $L$ and the ${\textsf{check}}()$ function can also depend on the number of points with a certain label in a cluster. The overall structure of Algorithm 1 remains unchanged; we just indicate the changes needed.

Given a non-negative integer $p$, a label partition of $p$ is defined as a tuple $\psi=(q_1,\ldots,q_{|L|})$ such that $\sum_i q_i=p$. The intuition is that, given a set $S$ of size $p$, $q_1$ points get the first label in $L$, $q_2$ points get the second label in $L$, and so on. Now, given a subset $Y$, define a valid tuple ${\boldsymbol{\tau}}$ w.r.t. $Y$ as a tuple $((t_1,\psi_1),\ldots,(t_{k+m},\psi_{k+m}))$, where (i) $\sum_j t_j+|Y|=m$, and (ii) $\psi_j$ is a label partition of $t_j$. As in line 5 of Algorithm 1, we cycle over all such valid tuples. The definition of a solution to the $b$-matching instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ changes as follows. Let $\psi_j=(n_j^{1},\ldots,n_j^{\ell})$, where $\ell=|L|$. Then a solution to ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ needs to satisfy the condition that for each point $c_j\in C$ and each label $l\in L$, exactly $n_j^{l}$ points of $X$ with label $l$ are matched to $c_j$. Note that this also implies that exactly $t_j$ points are matched to $c_j$. This matching problem can easily be reduced to weighted bipartite matching by making $t_j$ copies of each point $c_j$ and, for each label $l$, adding edges between $n_j^{l}$ distinct copies of $c_j$ and the vertices of label $l$ only. The rest of the details of Algorithm 1 remain unchanged. Note that the running time of the algorithm changes because we now have to cycle over all label partitions of each of the numbers $t_j$.
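As an illustration of this reduction, the following Python sketch (an assumption-laden simplification: Euclidean points stored as NumPy arrays, labels given as a parallel list, each $\psi_j$ represented as a dictionary, and a reasonably recent SciPy where np.inf marks a forbidden pairing) creates one copy of $c_j$ per required (label, slot) pair and solves the resulting assignment problem:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def labelled_near_outliers(X, labels, C, psi, z):
        # X: list of points, labels[p]: label of X[p], C: list of k+m centers,
        # psi[j][l] = n_j^l, the number of label-l points to match to center c_j.
        # A copy reserved for label l may only be matched to a point carrying
        # label l; all other edges are forbidden (infinite cost).
        copies = [(j, l) for j, psi_j in enumerate(psi)
                         for l, n_jl in psi_j.items() for _ in range(n_jl)]
        if not copies:
            return set()
        cost = np.full((len(copies), len(X)), np.inf)
        for i, (j, l) in enumerate(copies):
            for p, x in enumerate(X):
                if labels[p] == l:
                    cost[i][p] = np.linalg.norm(C[j] - x) ** z
        rows, cols = linear_sum_assignment(cost)
        return set(cols)  # indices of the matched points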

The analysis of the algorithm proceeds in a manner analogous to that of Algorithm 1. We only need to consider the iteration in which the algorithm correctly guesses the size of each set $X^{{\textsf{opt}}}_{N,j}$ together with the number of points of each label in that set.
