
Department of Computer Science and Engineering, Indian Institute of Technology Delhi.
Email: {rjaiswal, amitk}@cse.iitd.ac.in

Clustering What Matters in Constrained Settings

(Improved Outlier to Outlier-Free Reductions)
Ragesh Jaiswal and Amit Kumar
Abstract

Constrained clustering problems generalize classical clustering formulations, e.g., $k$-median and $k$-means, by imposing additional constraints on the feasibility of a clustering. There has been significant recent progress in obtaining approximation algorithms for these problems, both in the metric and the Euclidean settings. However, the outlier version of these problems, where the solution is allowed to leave out $m$ points from the clustering, is not well understood. In this work, we give a general framework for reducing the outlier version of a constrained $k$-median or $k$-means problem to the corresponding outlier-free version with only a $(1+\varepsilon)$-factor loss in the approximation ratio. The reduction is obtained by mapping the original instance of the problem to $f(k,m,\varepsilon)$ instances of the outlier-free version, where $f(k,m,\varepsilon)=\left(\frac{k+m}{\varepsilon}\right)^{O(m)}$. As specific applications, we get the following results:

  • First FPT (in the parameters $k$ and $m$) $(1+\varepsilon)$-approximation algorithm for the outlier version of capacitated $k$-median and $k$-means in Euclidean spaces with hard capacities.

  • First FPT (in the parameters $k$ and $m$) $(3+\varepsilon)$- and $(9+\varepsilon)$-approximation algorithms for the outlier versions of capacitated $k$-median and $k$-means, respectively, in general metric spaces with hard capacities.

  • First FPT (in the parameters $k$ and $m$) $(2-\delta)$-approximation algorithm for the outlier version of the $k$-median problem under the Ulam metric.

Our work generalizes the results of [BGJK20] and [AISX23] to a larger class of constrained clustering problems. Further, our reduction works for arbitrary metric spaces and so can extend clustering algorithms for outlier-free versions in both Euclidean and arbitrary metric spaces.

1 Introduction

Center-based clustering problems such as $k$-median and $k$-means are important data processing tasks. Given a metric $D$ on a set of $n$ points $\mathcal{X}$ and a parameter $k$, the goal is to partition the points into $k$ clusters, say $C_1,\ldots,C_k$, and assign the points in each cluster to a corresponding cluster center, say $c_1,\ldots,c_k$ respectively, such that the objective $\sum_{i=1}^{k}\sum_{x\in C_i}D(x,c_i)^{z}$ is minimized. Here $z$ is a parameter which is 1 for $k$-median and 2 for $k$-means. The outlier version of these problems is specified by another parameter $m$: a solution is allowed to leave out up to $m$ points from the clusters. Outlier versions capture settings where the input may contain a few highly erroneous data points. Both the outlier and the outlier-free versions have been well studied, with constant factor approximations known for both the $k$-means and the $k$-median problem [ANSW17, AGK+04, CGETS02]. In addition, fixed-parameter tractable (FPT) $(1+\varepsilon)$-approximation algorithms are known for these problems in the Euclidean setting [KSS10, FMS07, BGJK20]: the running time of such algorithms is of the form $f(k,m,\varepsilon)\cdot poly(n,d)$, where $f(\cdot)$ is an exponential function of the parameters $k,m,\varepsilon$ and $d$ denotes the dimensionality of the points.

A more recent development in clustering has been the notion of constrained clustering. A constrained clustering problem specifies additional conditions on a feasible partitioning of the input points into $k$ clusters. For example, the $r$-gathering problem requires that each cluster in a feasible partitioning must contain at least $r$ data points. Similarly, the well-known capacitated clustering problem specifies an upper bound on the size of each cluster. Constrained clustering formulations can also capture various types of fairness constraints: each data point has a label assigned to it, and we may require upper or lower bounds on the number (or fraction) of points with a certain label in each cluster. Table 1 gives a list of some of these problems. FPT (in the parameter $k$) constant factor approximation algorithms are known for a large class of these problems (see Table 2).

It is worth noting that constrained clustering problems are distinct from outlier clustering: the former restricts the set of feasible partitionings of the input points, whereas the latter allows us to reduce the set of points that need to be partitioned into clusters. There has not been much progress on constrained clustering problems in the outlier setting (also see [KLS18] for an unbounded integrality gap of the natural LP relaxation of the outlier clustering versions). In this work, we bridge this gap between the outlier and the outlier-free versions of constrained clustering problems by giving an almost approximation-preserving reduction from the former to the latter. As long as the parameters of interest (i.e., $k,m$) are small, the reduction works in polynomial time. Using our reduction, an FPT $\alpha$-approximation algorithm for the outlier-free version of a constrained clustering problem yields an FPT $(\alpha+\varepsilon)$-approximation algorithm for the outlier version of the same problem. For general metric spaces, this implies the first FPT constant-factor approximations for the outlier versions of several constrained clustering problems; similarly, we get new FPT $(1+\varepsilon)$-approximation algorithms for several outlier constrained clustering problems. See Table 2 for the precise details.

This kind of FPT approximation-preserving reduction in the context of Euclidean $k$-means was first given by [BGJK20] using a sampling-based approach. [GJK20] extended the sampling ideas of [BGJK20] to general metric spaces but did not give an approximation-preserving reduction. [AISX23] gave a reduction for general metric spaces using a coreset construction. In this work, we use the sampling-based ideas of [BGJK20] to obtain an approximation-preserving reduction from the outlier version to the outlier-free version with improved parameters over [AISX23]. Moreover, our reduction works for most known constrained clustering settings.

1.1 Preliminaries

We give a general definition of a constrained clustering problem. For a positive integer $k$, we use $[k]$ to denote the set $\{1,\ldots,k\}$. Let $(\mathcal{X},D)$ denote the metric space with distance function $D$. For a point $x$ and a subset $S$ of points, we use $D(x,S)$ to denote $\min_{y\in S}D(x,y)$.

The set $\mathcal{X}$ contains subsets $F$ and $X$: here $X$ denotes the set of input points and $F$ the set of points where a center can be located. An outlier constrained clustering problem is specified by the following parameters and functions (a small code sketch after this list illustrates the check and cost functions):

  • $k$: the number of clusters.

  • $m$: the number of points which can be left out of the clusters.

  • a function check: given a partitioning $X_0,X_1,\ldots,X_k$ of $X$ (here $X_0$ is the set of outliers) and centers $f_1,\ldots,f_k$, each lying in the set $F$, the function ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ outputs 1 iff this is a feasible clustering. For example, in the $r$-gathering problem, ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ outputs 1 iff $|X_i|\geq r$ for each $i\in[k]$. The check function depends only on the cardinalities of the sets $X_1,\ldots,X_k$ and the locations $f_1,\ldots,f_k$. This already captures many of the constrained clustering problems. Our framework also applies to the more general labelled version (see details below).

  • a cost function cost: given a partitioning $X_0,X_1,\ldots,X_k$ of $X$ and centers $f_1,\ldots,f_k$,

    {\textsf{cost}}(X_{1},\ldots,X_{k},f_{1},\ldots,f_{k}):=\sum_{i\in[k]}\sum_{x\in X_{i}}D^{z}(x,f_{i}),

    where $z$ is either 1 (the outlier constrained $k$-median problem) or 2 (the outlier constrained $k$-means problem).
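To make the interface concrete, here is a minimal Python sketch of the check function for the $r$-gathering example and of the cost function (the function names, the use of Euclidean distance, and the list-of-clusters representation are illustrative assumptions, not part of the formal model):

    import math

    def dist(x, f):
        # Euclidean distance as a stand-in for a general metric D.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, f)))

    def check_r_gathering(clusters, centers, r):
        # Feasible iff every cluster has at least r points; the answer depends
        # only on the cluster sizes, as required by the framework.
        return 1 if all(len(X_i) >= r for X_i in clusters) else 0

    def cost(clusters, centers, z=1):
        # z = 1 gives the k-median objective, z = 2 the k-means objective.
        return sum(dist(x, f_i) ** z
                   for X_i, f_i in zip(clusters, centers)
                   for x in X_i)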

Given an instance ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ of an outlier constrained clustering problem as above, the goal is to find a partitioning $X_0,X_1,\ldots,X_k$ of $X$ and centers $f_1,\ldots,f_k\in F$ such that $|X_0|\leq m$,
${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ is 1, and ${\textsf{cost}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ is minimized. The outlier-free constrained clustering problem is specified as above, except that the parameter $m$ is 0. For the sake of brevity, we leave out the parameter $m$ and the set $X_0$ while defining the instance ${\mathcal{I}}$ and the functions check and cost.

We also consider a more general class of constrained clustering problems, where each input point is assigned a label. In other words, an instance ${\mathcal{I}}$ of such a problem is specified by a tuple $(X,F,k,m,\sigma,{\textsf{check}},{\textsf{cost}})$, where $\sigma:X\rightarrow L$ for a finite set $L$. Note that the check function may depend on the function $\sigma$. For example, $\sigma$ could assign a label "red" or "blue" to each point in $X$, and the check function could require that each cluster $X_i$ have an equal number of red and blue points. In addition to the locations $f_1,\ldots,f_k$, the function ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k,\sigma)$ also depends on $|\sigma^{-1}(l)\cap X_j|$ for each $l\in L, j\in[k]$, i.e., the number of points with a particular label in each of the clusters. Indirectly, this also implies that the check function can impose conditions on the labels of the outlier points. For example, the colourful $k$-median problem discussed in [AISX23] has the constraint that $m_i$ clients of label type $i$ should be designated as outliers, given that every client has a unique label. Table 1 gives a description of some of these problems.

We use the approximate triangle inequality, which states that for $z\in\{1,2\}$ and any three points $x_1,x_2,x_3\in\mathcal{X}$,

D^{z}(x_{1},x_{3})\leq z\left(D^{z}(x_{1},x_{2})+D^{z}(x_{2},x_{3})\right). \qquad (1)
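For $z=1$, inequality (1) is simply the triangle inequality; for $z=2$, it follows from the triangle inequality together with the elementary bound $(a+b)^{2}\leq 2(a^{2}+b^{2})$:

D^{2}(x_{1},x_{3})\leq\left(D(x_{1},x_{2})+D(x_{2},x_{3})\right)^{2}\leq 2\left(D^{2}(x_{1},x_{2})+D^{2}(x_{2},x_{3})\right).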

Unconstrained $k$-median (Constraint type: unconstrained)
    Input: $(F,X,k)$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: None, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)$ always equals 1.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (This includes various versions corresponding to specific metrics, such as the Ulam metric on permutations, metric spaces with constant doubling dimension, etc.)

Fault-tolerant $k$-median (Constraint type: unconstrained but labelled) [HHL+16, IV20]
    Input: $(F,X,k)$ and a number $h(x)\leq k$ for every client $x\in X$
    Output: $(f_1,\ldots,f_k)$
    Constraints: None.
    Objective: Minimise $\sum_{x\in X}\sum_{j=1}^{h(x)}D(x,f_{\pi_x(j)})$, where $\pi_x(j)$ is the index of the $j^{th}$ nearest center to $x$ in $(f_1,\ldots,f_k)$.
    (Label: $h(x)$ may be regarded as the label of the client $x$, so the number of distinct labels $\ell\leq k$.)

Balanced $k$-median (Constraint type: size) [APF+10, Din20]
    Input: $(F,X,k)$ and integers $(r_1,\ldots,r_k)$, $(l_1,\ldots,l_k)$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: $X_i$ should have at least $r_i$ and at most $l_i$ clients, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,\ r_i\leq|X_i|\leq l_i$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (Versions corresponding to specific values of the $r_i$'s and $l_i$'s are known by different names: the version with $l_1=\ldots=l_k=|X|$ is called the $r$-gather problem, and the version with $r_1=\ldots=r_k=0$ is called the $l$-capacity problem.)

Capacitated $k$-median (Constraint type: center + size) [CAL19]
    Input: $(F,X,k)$ with a capacity $s(f)$ for every facility $f\in F$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The number of clients assigned to $f_i$, i.e., $|X_i|$, is at most $s(f_i)$; ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,\ |X_i|\leq s(f_i)$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.

Matroid $k$-median (Constraint type: center) [KKN+11, CAGK+19]
    Input: $(F,X,k)$ and a matroid on $F$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The chosen centers must form an independent set of the matroid, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\{f_1,\ldots,f_k\}$ is an independent set of the matroid.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.

Strongly private $k$-median (Constraint type: label + size) [RS18]
    Input: $(F,X,k)$ and numbers $(l_1,\ldots,l_w)$; each client has a label in $\{1,\ldots,w\}$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: Every $X_i$ has at least $l_j$ clients with label $j$, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,j,\ |X_i\cap S_j|\geq l_j$, where $S_j$ is the set of clients with label $j$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (Labels: the number of distinct labels $\ell=w$.)

$l$-diversity $k$-median (Constraint type: label + size) [BGK+19]
    Input: $(F,X,k)$ and a number $l>1$; each client has one colour from $\{1,\ldots,w\}$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The fraction of clients with colour $j$ in every $X_i$ is at most $1/l$, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,j,\ |X_i\cap S_j|\leq|X_i|/l$, where $S_j$ is the set of clients with colour $j$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (Labels: each colour can be regarded as a label, and hence the number of distinct labels $\ell=w$.)

Fair $k$-median (Constraint type: label + size) [BGK+19, BCFN19]
    Input: $(F,X,k)$ and fairness values $(\alpha_1,\ldots,\alpha_w),(\beta_1,\ldots,\beta_w)$; each client has colours from $\{1,\ldots,w\}$
    Output: $(X_1,\ldots,X_k,f_1,\ldots,f_k)$
    Constraints: The fraction of clients with colour $j$ in every $X_i$ is between $\alpha_j$ and $\beta_j$, i.e., ${\textsf{check}}(X_1,\ldots,X_k,f_1,\ldots,f_k)=1$ iff $\forall i,j,\ \alpha_j|X_i|\leq|X_i\cap S_j|\leq\beta_j|X_i|$, where $S_j$ is the set of clients with colour $j$.
    Objective: Minimise $\sum_i\sum_{x\in X_i}D(x,f_i)$.
    (There are two versions: (i) each client has a unique label, and (ii) a client can have multiple labels. Labels: $\ell=w$ for the first version and $\ell=2^{w}$ for the second.)

Table 1: The table defines various outlier-free versions of the constrained $k$-median problem. The $k$-means versions are defined similarly using $D^{2}$ instead of $D$. We include a few references. The problems are categorized based on the type of constraints. There are three main types of constraints: (i) size (constraints on the cluster size), (ii) center (constraints on the points a center can serve), and (iii) label (constraints on the labels of points in clusters). A constrained problem can have a combination of these constraint types.

1.2 Our results

Our main result reduces the outlier constrained clustering problem to the outlier-free version. In our reduction, we also use approximation algorithms for the (unconstrained) $k$-median and $k$-means problems. We assume we have a constant factor approximation algorithm for these problems (several such algorithms exist [ANSW17, AGK+04, CGETS02]): let $\mathcal{C}$ denote such an algorithm with running time $T_{\mathcal{C}}(n)$ on an input of size $n$. Note that $\mathcal{C}$ is an algorithm for the $k$-median or the $k$-means problem, depending on whether $z=1$ or $2$ in the definition of the cost function.

Theorem 1.1 (Main Theorem)

Consider an instance ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ of an outlier constrained clustering problem. Let $\mathcal{A}$ be an $\alpha$-approximation algorithm for the corresponding outlier-free constrained clustering problem, and let $T_{\mathcal{A}}(n)$ be the running time of $\mathcal{A}$ on an input of size $n$. Given an $\varepsilon>0$, there is an $\alpha(1+\varepsilon)$-approximation algorithm for ${\mathcal{I}}$ with running time $T_{\mathcal{C}}(n)+q\cdot T_{\mathcal{A}}(n)+O\left(n\cdot\left(k+\frac{m^{z+1}\log m}{\varepsilon^{z}}\right)\right)+O\left(qm^{2}(k+m)^{3}\right)$, where $n$ is the size of ${\mathcal{I}}$, $q=f(k,m,\varepsilon)=\left(\frac{k+m}{\varepsilon}\right)^{O(m)}$, and $z=1$ or $2$ depending on the cost function (i.e., $z=1$ for the $k$-median objective and $z=2$ for the $k$-means objective).

The above theorem implies that as long as there is an FPT or polynomial-time approximation algorithm for the constrained, outlier-free $k$-median or $k$-means clustering problem, there is an FPT approximation algorithm (with almost the same approximation ratio) for the corresponding outlier version. We prove this result by creating $q$ instances of the outlier-free version of ${\mathcal{I}}$ and picking the best solution obtained by running the algorithm $\mathcal{A}$ on these instances. We also extend the above result to the labelled version:

Theorem 1.2 (Main Theorem: labelled version)

Consider an instance ${\mathcal{I}}=(X,F,k,m,\sigma,{\textsf{check}},{\textsf{cost}})$ of an outlier constrained clustering problem with labels on input points. Let $\mathcal{A}$ be an $\alpha$-approximation algorithm for the corresponding outlier-free constrained clustering problem, and let $T_{\mathcal{A}}(n)$ be the running time of $\mathcal{A}$ on an input of size $n$. Given an $\varepsilon>0$, there is an $\alpha(1+\varepsilon)$-approximation algorithm for ${\mathcal{I}}$ with running time $T_{\mathcal{C}}(n)+q\cdot T_{\mathcal{A}}(n)+O\left(n\cdot\left(k+\frac{m^{z+1}\log m}{\varepsilon^{z}}\right)\right)+O\left(q\ell m^{2}(k+m)^{3}\right)$, where $n$ is the size of ${\mathcal{I}}$, $q=f(k,m,\varepsilon)=\left(\frac{(k+m)\ell}{\varepsilon}\right)^{O(m)}$ with $\ell$ being the number of distinct labels, and $z=1$ or $2$ depending on the cost function (i.e., $z=1$ for the $k$-median objective and $z=2$ for the $k$-means objective).

The consequences of our results for specific constrained clustering problems are summarized in Table 2. We also give the results of related works [BGJK20, GJK20, AISX23] in the same table to highlight the contributions of this work. Our contributions can be divided into two main categories:

  1. Matching the best-known result: This can be further divided into two categories:

     (a) Matching results of [AISX23]: [AISX23] gives an outlier to outlier-free reduction. We also give such a reduction, using a different technique and with slightly better parameters. This means that we match all the results of [AISX23], which include problems such as the classical $k$-median/means problems, the Matroid $k$-median problem, the colourful $k$-median problem, and $k$-median in certain special metrics. See rows 2-6 in Table 2.

     (b) Matching results of [GJK20]: [GJK20] gives FPT approximation algorithms for certain constrained problems on which the coreset-based approach of [AISX23] is not known to work; see the last row of Table 2. [GJK20] gives algorithms for the outlier and outlier-free versions with the same approximation guarantee. Since the best outlier-free approximation is also from [GJK20], our results currently only match the approximation guarantees of [GJK20]. However, if the outlier-free guarantee for any of these problems improves, our results will immediately beat the known outlier results of [GJK20].

  2. Best known results: Since our results hold for a larger class of constrained problems than earlier works, there are certain problems for which our results give the best-known FPT approximation algorithms. The list includes capacitated $k$-median/$k$-means with hard capacities in general metric and Euclidean spaces. It also includes the $k$-median problem in the Ulam metric: a recent development for the Ulam $k$-median problem [CDK23] has broken the 2-approximation barrier, and our reduction allows us to carry this development to the outlier setting as well. The outlier-free results from which our best results are derived via our reduction are given in Table 2 (see rows 7-9).

Problem | Outlier-free | Outlier: [GJK20] | Outlier: [AISX23] | Outlier: this work
Euclidean $k$-means (i.e., $F=\mathbb{R}^{d}$, $X\subset\mathbb{R}^{d}$) | $(1+\varepsilon)$ [BJK18] | $\times$ | $(1+\varepsilon)$ | $(1+\varepsilon)$
$k$-median | $\left(1+\frac{2}{e}+\varepsilon\right)$ [CAGK+19] | $(3+\varepsilon)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$
$k$-means | $\left(1+\frac{8}{e}+\varepsilon\right)$ [CAGK+19] | $(9+\varepsilon)$ | $\left(1+\frac{8}{e}+\varepsilon\right)$ | $\left(1+\frac{8}{e}+\varepsilon\right)$
$k$-median/means in metrics with (i) constant doubling dimension, (ii) graphs of bounded treewidth, (iii) graphs that exclude a fixed graph as a minor | $(1+\varepsilon)$ [CASS21] | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means | $(1+\varepsilon)$ | $(1+\varepsilon)$
Matroid $k$-median | $(2+\varepsilon)$ [CAGK+19] | $(3+\varepsilon)$ | $(2+\varepsilon)$ | $(2+\varepsilon)$
Colourful $k$-median | $\left(1+\frac{2}{e}+\varepsilon\right)$ [CAGK+19] | $(3+\varepsilon)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$ | $\left(1+\frac{2}{e}+\varepsilon\right)$
Ulam $k$-median (here $F=X$) | $(2-\delta)$ [CDK23] | $(2+\varepsilon)$ | $\times$ | $(2-\delta)$
Euclidean capacitated $k$-median/means | $(1+\varepsilon)$ [CAL19] | $\times$ | $\times$ | $(1+\varepsilon)$
Capacitated $k$-median / capacitated $k$-means | $(3+\varepsilon)$ / $(9+\varepsilon)$ [CAL19] | $\times$ / $\times$ | $\times$ / $\times$ | $(3+\varepsilon)$ / $(9+\varepsilon)$
Uniform/non-uniform $r$-gather, $l$-capacity, and balanced $k$-median/means (uniform means $r_1=\ldots=r_k$ and/or $l_1=\ldots=l_k$); uniform/non-uniform fault-tolerant $k$-median/means (uniform means the same $h(x)$ for every $x$); strongly private, $l$-diversity, and fair $k$-median/means | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means [GJK20] | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means | $\times$ | $(3+\varepsilon)$ $k$-median, $(9+\varepsilon)$ $k$-means

Table 2: A $\times$ means that the techniques are not known to apply to the problem. The new results that do not follow from previously known results are shaded; the results that were not explicitly reported but follow from the techniques in the respective papers are shaded in a second shade. The techniques of [AISX23] do not apply to the Ulam $k$-median problem since the outlier-free algorithm works only on unweighted instances. Note that all the FPT $(3+\varepsilon)$ and $(9+\varepsilon)$ approximations for the outlier-free versions (leftmost column) in the last row follow from the outlier-free results in [GJK20]. However, the approximation guarantees in the rightmost column depend on those in the leftmost column. This means that, unlike the rigid $(3+\varepsilon)$ and $(9+\varepsilon)$ approximation guarantees of [GJK20] in the middle column, the approximation guarantee in the rightmost column will improve with every improvement in the leftmost.

1.3 Comparison with earlier work

As discussed earlier, the idea of a reduction from an outlier clustering problem to the corresponding outlier-free version, in the context of the Euclidean $k$-means problem, was suggested by [BGJK20] using a $D^2$-sampling based idea. [GJK20] used these sampling ideas to design approximation algorithms for the outlier versions of various constrained clustering problems. However, the approximation guarantee obtained by [GJK20] was limited to $(3+\varepsilon)$ for a large class of constrained $k$-median problems and $(9+\varepsilon)$ for the corresponding constrained $k$-means problems, and it was not clear how to extend these techniques to get improved guarantees. As a result, their techniques could not exploit the recent developments by [CAGK+19] in the design of $(1+2/e+\varepsilon)$ and $(1+8/e+\varepsilon)$ FPT approximation algorithms for the classical outlier-free $k$-median and $k$-means problems, respectively, in general metric spaces. [AISX23] gave an outlier-to-outlier-free reduction, making it possible to extend the above-mentioned FPT approximation guarantees for the outlier-free setting to the outlier setting.

The reduction of [AISX23] is based on the coreset construction of [Che09], which uses uniform sampling. A coreset for a dataset is a weighted set of points such that the clustering cost of the coreset points with respect to any set of $k$ centers is the same (within a $1\pm\varepsilon$ factor) as that of the original point set. The coreset construction in [Che09] starts with a set $C$ of centers that gives a constant factor approximation. It considers $O(\log n)$ "rings" around these centers, uniformly samples points from each of these rings, and sets the weights of the sampled points appropriately. The number of sampled points, and hence the size of the coreset, is $\left(\frac{|C|\log n}{\varepsilon}\right)^{2}$. [AISX23] showed that when one starts with $(k+m)$ centers that give a constant approximation to the classical $(k+m)$-median problem, the coreset obtained as above has the following additional property: for any set of $k$ centers, the clustering cost of the original set of points excluding $m$ outliers is the same (again, within a $1\pm\varepsilon$ factor) as that of the coreset, again allowing for the exclusion of a subset of $m$ points from it. This means that by trying out all subsets of size at most $m$ from the coreset, we ensure that at least one subset acts as a good outlier set. Since the coreset size is $\left(\frac{(k+m)\log n}{\varepsilon}\right)^{2}$, the number of outlier-free instances constructed is $\left(\frac{(k+m)\log n}{\varepsilon}\right)^{O(m)}$. Using $(\log n)^{O(m)}=\max\{m^{O(m)},n^{O(1)}\}$, this is of the form $f(k,m,\varepsilon)\cdot n^{O(1)}$ for a suitable function $f$. At this point, we note the first quantitative difference from our result: our algorithm saves the $(\log n)^{O(m)}$ factor, which also means that the number of instances does not depend on the problem size $n$. Further, a coreset-based construction restricts the kind of problems it can be applied to. The coreset property that the cost of the original points equals the weighted cost of the coreset points holds when points are assigned to the closest center (i.e., the entire weight of a coreset point goes to the closest center); the reason is how Haussler's lemma is applied to bound the cost difference. This works for the classical unconstrained $k$-median and $k$-means problems (as well as the few other settings considered in [AISX23]). However, for several constrained clustering problems, it may not hold that every point is assigned to the closest center. There have been some recent developments [BFS21, BCAJ+22] in designing coresets for constrained clustering settings; however, they have not been shown to apply to the outlier setting. Another recent work [HJLW22] designs coresets for the outlier setting, but, like [AISX23], it has limited scope and has not been shown to extend to most constrained settings. Our $D^z$-sampling-based technique has the advantage that instead of running the outlier-free algorithm on a coreset as in [AISX23], it works directly with the dataset. That is, we run the outlier-free algorithm on the dataset itself (after removing outlier candidates). This also makes our results helpful in weighted settings (e.g., see [CDK23]) where the outlier-free algorithm is known to work only for unweighted datasets (note that a coreset is a weighted set).

1.4 Our Techniques

In this section, we give a high-level description of our algorithm. Let ${\mathcal{I}}$ denote an instance of outlier constrained clustering on a set of points $X$, and let ${\cal O}$ denote an optimal solution to ${\mathcal{I}}$. The first observation is that the optimal cost of the outlier-free and unconstrained clustering with $k+m$ centers on $X$ is a lower bound on the cost of ${\cal O}$ (Claim 1); this observation was used in both [BGJK20] and [AISX23]. Let $C$ denote the set of these $(k+m)$ centers (we can use any constant factor approximation for the unconstrained version to find $C$). The intuition behind choosing $C$ is that the centers in ${\cal O}$ should be close to $C$.

Now we divide the set of $m$ outliers in ${\cal O}$ into two subsets: those which are far from $C$ and the remaining ones, which are close to $C$ ("near" outliers). Our first idea is to randomly sample a subset $S$ of $O(m\log m)$ points from $X$ with sampling probability proportional to the distance (or squared distance) from the set $C$. This sampling ensures that $S$ contains the far outliers with high probability (Claim 2). We can then cycle through all subsets of $S$ to guess the exact subset of far outliers. Handling the near outliers is more challenging and forms the heart of the technical contribution of this paper.

We "assign" each near outlier to its closest point in $C$; let $X^{{\textsf{opt}}}_{N,j}$ be the set of outliers assigned to $c_j$. By cycling over all choices, we can guess the cardinality $t_j$ of each of the sets $X^{{\textsf{opt}}}_{N,j}$. We then set up a suitable minimum cost bipartite $b$-matching instance which assigns a set of $t_j$ points to each center $c_j$. Let ${\widehat{X}}_j$ be the set of points assigned to $c_j$. Our algorithm uses $\cup_{j}{\widehat{X}}_{j}$ as the set of near outliers. In the analysis, we need to argue that there is a way of matching the points in $X^{{\textsf{opt}}}_{N,j}$ to ${\widehat{X}}_j$ whose total cost (sum of distances or squared distances between matched points) is small (Lemma 1). The hope is that we can go from the optimal set of outliers in ${\cal O}$ to the ones chosen by the algorithm and argue that the increase in cost is small. Since we are dealing with constrained clustering, we need to ensure that this process does not change the size of each of the clusters. To achieve this, we further modify the matching between the two sets of outliers (Lemma 2). Finally, with this modified matching, we are able to argue that the cost of the solution produced by the algorithm is close to that of the optimal solution. The extension to the labelled version follows along similar lines.

In the remainder of the paper, we prove our two main results, Theorem 1.1 and Theorem 1.2. The main discussion is for Theorem 1.1, since Theorem 1.2 is an extension of Theorem 1.1 that uses the same proof ideas. In the following sections, we give the details of our algorithm (Section 2) and its analysis (Section 3). In Section 3.1, we discuss the extension to the labelled version.

2 Algorithm

In this section, we describe the algorithm for the outlier constrained clustering problem. Consider an instance ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ of this problem. Recall that the parameter $z=1$ or $2$ depending on whether the cost function is the $k$-median or the $k$-means objective, respectively. In addition, we assume the existence of the following algorithms:

  • A constant factor approximation algorithm for the $k$-median or the $k$-means problem (depending on whether $z=1$ or $z=2$, respectively): an instance here is specified by a tuple $(X',F',k')$ only, where $X'$ is the set of input points, $F'$ is the set of potential locations for a center, and $k'$ denotes the number of clusters. We use $\mathcal{C}$ to denote this algorithm.

  • An algorithm $\mathcal{A}$ for the outlier-free version of this problem. An instance here is given by a tuple $(X',F',k,{\textsf{check}},{\textsf{cost}})$, where the check and cost functions are the same as those in ${\mathcal{I}}$.

  • An algorithm $\mathcal{M}$ for the $b$-matching problem: an instance of the $b$-matching problem is specified by a weighted bipartite graph $G=(L,R=\{v_1,\ldots,v_r\},E)$, where each edge $e$ has an associated weight, and a tuple $(t_1,\ldots,t_r)$ of non-negative integers. A solution is a subset $E'$ of $E$ such that each vertex of $L$ is incident with at most one edge of $E'$, and each vertex $v_j\in R$ is incident with exactly $t_j$ edges of $E'$. The goal is to find such a set $E'$ of minimum total weight.

We now define $D^z$-sampling:

Definition 2.1

Given sets $C$ and $X$ of points, $D^z$-sampling from $X$ w.r.t. $C$ samples a point $x\in X$ with probability proportional to $D^z(x,C)$.
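A minimal Python sketch of this sampling step follows (the function names and the use of Euclidean distance are illustrative assumptions; any metric $D$ can be plugged in):

    import math, random

    def dist(x, c):
        # Euclidean distance as a stand-in for a general metric D.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

    def dz_sample(X, C, z, num_samples, rng=random):
        # D^z-sampling: pick points of X with probability proportional to
        # D^z(x, C), the z-th power of the distance to the closest center in C.
        weights = [min(dist(x, c) for c in C) ** z for x in X]
        return rng.choices(X, weights=weights, k=num_samples)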

The algorithm is described in Algorithm 1. It first runs the algorithm $\mathcal{C}$ to obtain a set $C$ of $(k+m)$ centers (line 2). In line 3, we sample a subset $S$, where each point in $S$ is sampled independently using $D^z$-sampling w.r.t. $C$. Given a subset $Y$, we say that a tuple ${\boldsymbol{\tau}}=(t_1,\ldots,t_{k+m})$ is valid if $t_j\geq 0$ for all $j\in[k+m]$ and $\sum_j t_j+|Y|=m$. For each subset $Y$ of $S$ of size at most $m$ and for each valid tuple ${\boldsymbol{\tau}}$, the algorithm constructs a solution $(X^{(Y,{\boldsymbol{\tau}})}_0,X^{(Y,{\boldsymbol{\tau}})}_1,\ldots,X^{(Y,{\boldsymbol{\tau}})}_k)$, where $X^{(Y,{\boldsymbol{\tau}})}_0$ denotes the set of outlier points. This is done by first computing the set $X^{(Y,{\boldsymbol{\tau}})}_0$ and then using the algorithm $\mathcal{A}$ on the remaining points (line 8). To find the set $X^{(Y,{\boldsymbol{\tau}})}_0$, we first construct an instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ of $b$-matching (line 6). This instance is defined as follows: the bipartite graph has the set of points $X$ on one side (playing the role of $L$) and the set $C$ of $(k+m)$ centers on the other side (playing the role of $R$). The weight of an edge between a vertex $v\in C$ and $w\in X$ is equal to $D^z(v,w)$. For each vertex $v_j\in C$, we require that it is matched to exactly $t_j$ points of $X$. We run the algorithm $\mathcal{M}$ on this instance of $b$-matching (line 7) and define $X^{(Y,{\boldsymbol{\tau}})}_0$ as the set of points of $X$ matched by this algorithm. Finally, we output the solution of minimum cost (line 13).

 1: Input: ${\mathcal{I}}:=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$
 2: Execute $\mathcal{C}$ on the instance ${\mathcal{I}}^{\prime}:=(X,F,k+m)$ to obtain a set $C$ of $k+m$ centers.
 3: Sample a set $S$ of $\frac{4\beta m\log m}{\varepsilon}$ points with replacement, each using $D^{z}$-sampling from $X$ w.r.t. $C$.
 4: for each subset $Y\subseteq S$ with $|Y|\leq m$ do
 5:     for each valid tuple ${\boldsymbol{\tau}}=(t_{1},\ldots,t_{k+m})$ do
 6:         Construct the instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$.
 7:         Run $\mathcal{M}$ on ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ and let $X_{0}^{(Y,{\boldsymbol{\tau}})}$ be the set of matched points in $X$.
 8:         Run the algorithm $\mathcal{A}$ on the instance $(X\setminus(X_{0}^{(Y,{\boldsymbol{\tau}})}\cup Y),F,k,{\textsf{check}},{\textsf{cost}})$.
 9:         Let $(X^{(Y,{\boldsymbol{\tau}})}_{1},\ldots,X^{(Y,{\boldsymbol{\tau}})}_{k})$ be the clustering produced by $\mathcal{A}$.
10:     end for
11: end for
12: Let $(Y^{\star},{\boldsymbol{\tau}}^{\star})$ be the pair for which ${\textsf{cost}}(X^{(Y,{\boldsymbol{\tau}})}_{1},\ldots,X^{(Y,{\boldsymbol{\tau}})}_{k})$ is minimized.
13: Output $(X^{(Y^{\star},{\boldsymbol{\tau}}^{\star})}_{0},X^{(Y^{\star},{\boldsymbol{\tau}}^{\star})}_{1},\ldots,X^{(Y^{\star},{\boldsymbol{\tau}}^{\star})}_{k})$.
Algorithm 1: Algorithm for outlier constrained clustering.
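To illustrate lines 6-7 concretely, here is a minimal Python sketch (an illustrative simplification under the assumption of Euclidean points stored as NumPy arrays, not the paper's implementation) that realizes the $b$-matching instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ by making $t_j$ copies of each center $c_j$ and solving a minimum-cost assignment with SciPy:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def near_outliers(X, C, tau, z):
        # X: list of points (NumPy arrays), C: list of k+m centers, tau[j] = t_j.
        # Make t_j copies of center c_j; each copy must be matched to a distinct
        # point of X, which is exactly the b-matching requirement of Algorithm 1.
        copies = [j for j, t_j in enumerate(tau) for _ in range(t_j)]
        if not copies:
            return set()
        # cost[i][p] = D^z(c_{copies[i]}, x_p)
        cost = np.array([[np.linalg.norm(C[j] - x) ** z for x in X] for j in copies])
        rows, cols = linear_sum_assignment(cost)  # optimal: fewer rows than columns
        return set(cols)  # indices of matched points, i.e., X_0^{(Y, tau)}

As noted in the running-time discussion in Section 3, it suffices to restrict each center to its $m$ closest points, which keeps the matching instance small.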

3 Analysis

We now analyze Algorithm 1 and refer to the notation used in it. Let ${\mathcal{I}}=(X,F,k,m,{\textsf{check}},{\textsf{cost}})$ be the instance of the outlier constrained clustering problem, and let ${\textsf{opt}}({\mathcal{I}})$ denote the optimal cost of a solution for the instance ${\mathcal{I}}$. Assume that the algorithm $\mathcal{C}$ for the unconstrained clustering problem (used in line 2 of Algorithm 1) is a $\beta$-approximation algorithm. We overload notation and use ${\textsf{cost}}_{{\mathcal{I}}'}(C)$ to denote the cost of the solution $C$ for the instance ${\mathcal{I}}'$. Observe that the quantity ${\textsf{cost}}_{{\mathcal{I}}'}(C)$ can be computed as follows: each point in $X$ is assigned to the closest point in $C$, and then we compute the total cost (which is the $k$-median or the $k$-means cost, depending on the value of the parameter $z$) of this assignment. We first relate ${\textsf{cost}}_{{\mathcal{I}}'}(C)$ to ${\textsf{opt}}({\mathcal{I}})$.

Claim 1

${\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)\leq\beta\cdot{\textsf{opt}}({\mathcal{I}})$.

Proof

Let $(X_0,X_1,\ldots,X_k)$ denote the optimal solution for ${\mathcal{I}}$, where $X_0$ denotes the set of $m$ outlier points. Let $c_1,\ldots,c_k$ be the centers of the clusters $X_1,\ldots,X_k$, respectively. Consider the solution to ${\mathcal{I}}'$ consisting of the centers $C':=X_0\cup\{c_1,\ldots,c_k\}$. Clearly, ${\textsf{cost}}_{{\mathcal{I}}'}(C')\leq{\textsf{opt}}({\mathcal{I}})$ (we have an inequality here because the solution $X_1,\ldots,X_k$ may not be a Voronoi partition with respect to $c_1,\ldots,c_k$, and each outlier point in $X_0$ is served at zero cost by its own copy in $C'$). Since $\mathcal{C}$ is a $\beta$-approximation algorithm, we know that ${\textsf{cost}}_{{\mathcal{I}}'}(C)\leq\beta\cdot{\textsf{cost}}_{{\mathcal{I}}'}(C')$. Combining these two facts implies the desired result. ∎

We now consider an optimal solution for the instance ${\mathcal{I}}$: let $X^{{\textsf{opt}}}_0,X^{{\textsf{opt}}}_1,\ldots,X^{{\textsf{opt}}}_k$ be the partition of the input points $X$ in this solution, with $X^{{\textsf{opt}}}_0$ being the set of $m$ outliers. Depending on the distance from $C$, we divide the set $X^{{\textsf{opt}}}_0$ into two subsets, $X^{{\textsf{opt}}}_F$ ("far" points) and $X^{{\textsf{opt}}}_N$ ("near" points), as follows:

X^{{\textsf{opt}}}_{F}:=\left\{x\in X^{{\textsf{opt}}}_{0}\;\middle|\;D^{z}(x,C)\geq\frac{\varepsilon\,{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}{2\beta m}\right\},\quad X^{{\textsf{opt}}}_{N}:=X^{{\textsf{opt}}}_{0}\setminus X^{{\textsf{opt}}}_{F}.

Recall that we sample a set $S$ of $\frac{4\beta m\log m}{\varepsilon}$ points using $D^z$-sampling with respect to the center set $C$ (line 3 of Algorithm 1). Note that the probability of sampling a point $x$ is given by

\frac{D^{z}(x,C)}{\sum_{x^{\prime}\in X}D^{z}(x^{\prime},C)}=\frac{D^{z}(x,C)}{{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}. \qquad (2)

We first show that $S$ contains all the points in $X^{{\textsf{opt}}}_F$ with high probability.

Claim 2

$\mathbf{Pr}[X^{{\textsf{opt}}}_{F}\subseteq S]\geq 1-1/m$.

Proof

Equation (2) shows that the probability of sampling a point $x\in X^{{\textsf{opt}}}_F$ in a single draw is $\frac{D^{z}(x,C)}{{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}\geq\frac{\varepsilon}{2\beta m}$. So the probability that $x$ is not present in $S$ is at most $\left(1-\frac{\varepsilon}{2\beta m}\right)^{\frac{4\beta m\log m}{\varepsilon}}\leq e^{-2\log m}=\frac{1}{m^{2}}$, using $1-t\leq e^{-t}$. Since $|X^{{\textsf{opt}}}_F|\leq m$, the desired result now follows from the union bound. ∎

For the rest of the analysis, we assume that the event in Claim 2 holds. We now note that the total cost of assigning $X^{{\textsf{opt}}}_N$ to $C$ is at most $\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}})$.

Claim 3

$\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,C)\leq\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}})$.

Proof

The claim follows from the following sequence of inequalities:

\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,C)<\sum_{x\in X^{{\textsf{opt}}}_{N}}\frac{\varepsilon\,{\textsf{cost}}_{{\mathcal{I}}^{\prime}}(C)}{2\beta m}\leq\sum_{x\in X^{{\textsf{opt}}}_{N}}\frac{\varepsilon\cdot{\textsf{opt}}({\mathcal{I}})}{2m}\leq\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}}),

where the first inequality follows from the definition of $X^{{\textsf{opt}}}_N$, the second from Claim 1, and the last from $|X^{{\textsf{opt}}}_N|\leq m$. ∎

For every point in $X^{{\textsf{opt}}}_N$, we identify the closest center in $C=\{c_1,\ldots,c_{m+k}\}$ (breaking ties arbitrarily). For each $j\in[k+m]$, let $X^{{\textsf{opt}}}_{N,j}$ be the set of points in $X^{{\textsf{opt}}}_N$ which are closest to $c_j$, and let ${\hat{t}}_j$ denote $|X^{{\textsf{opt}}}_{N,j}|$. Consider the iteration of the loops in lines 4-5 where $Y=X^{{\textsf{opt}}}_F$ and ${\boldsymbol{\tau}}=({\hat{t}}_1,\ldots,{\hat{t}}_{k+m})$. Observe that ${\boldsymbol{\tau}}$ is valid with respect to $Y$ because $\sum_{j\in[m+k]}{\hat{t}}_j+|Y|=m$. Let ${\widehat{X}}_1,\ldots,{\widehat{X}}_{m+k}$ be the sets of points assigned to $c_1,\ldots,c_{m+k}$, respectively, by the algorithm $\mathcal{M}$. Intuitively, we would like to construct a solution where the set of outliers is given by ${\widehat{X}}:=X^{{\textsf{opt}}}_F\cup{\widehat{X}}_1\cup\cdots\cup{\widehat{X}}_{m+k}$. We now show that the set ${\widehat{X}}$ is "close" to $X^{{\textsf{opt}}}_0$, the set of outliers in the optimal solution. In order to do this, we set up a bijection $\mu:X^{{\textsf{opt}}}_0\rightarrow{\widehat{X}}$, where $\mu$ restricted to $X^{{\textsf{opt}}}_F$ is the identity, and $\mu$ restricted to any of the sets $X^{{\textsf{opt}}}_{N,j}$ is a bijection from $X^{{\textsf{opt}}}_{N,j}$ to ${\widehat{X}}_j$. Such a function $\mu$ exists because for each $j\in[m+k]$, $|X^{{\textsf{opt}}}_{N,j}|=|{\widehat{X}}_j|={\hat{t}}_j$. We now prove this closeness property.

Lemma 1
\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,\mu(x))\leq\varepsilon\cdot z\cdot{\textsf{opt}}({\mathcal{I}}).
Proof

We first note a useful property of the solution given by the algorithm $\mathcal{M}$. One possible solution for the instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ is to assign $X^{{\textsf{opt}}}_{N,j}$ to the center $c_j$ for each $j$. Since $\mathcal{M}$ is an optimal algorithm for $b$-matching, we get

\sum_{j\in[k+m]}\sum_{x\in{\widehat{X}}_{j}}D^{z}(x,c_{j})\leq\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}D^{z}(x,c_{j})=\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,C)\leq\frac{\varepsilon}{2}\cdot{\textsf{opt}}({\mathcal{I}}), \qquad (3)

where the last inequality follows from Claim 3. Now,

\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,\mu(x))=\sum_{x\in X^{{\textsf{opt}}}_{N}}D^{z}(x,\mu(x))=\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}D^{z}(x,\mu(x))
\stackrel{(1)}{\leq}z\cdot\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}\left(D^{z}(x,c_{j})+D^{z}(c_{j},\mu(x))\right), \qquad (4)

where the first equality follows from the fact that $\mu$ is the identity on $X^{{\textsf{opt}}}_F$. Since $\mu$ is a bijection from $X^{{\textsf{opt}}}_{N,j}$ to ${\widehat{X}}_j$, the right-hand side above can also be written as

z\cdot\sum_{j\in[k+m]}\sum_{x\in X^{{\textsf{opt}}}_{N,j}}D^{z}(x,c_{j})+z\cdot\sum_{j\in[k+m]}\sum_{x\in{\widehat{X}}_{j}}D^{z}(x,c_{j})\leq z\cdot\varepsilon\cdot{\textsf{opt}}({\mathcal{I}}),

where the last inequality follows from Claim 3 and (3). This proves the desired result. ∎

The mapping $\mu$ described above may have the following undesirable property: there could be a point $x\in X^{{\textsf{opt}}}_0\cap{\widehat{X}}$ such that $\mu(x)\neq x$. This could happen if $x\in X^{{\textsf{opt}}}_{N,j}$ and $x\in{\widehat{X}}_i$ with $i\neq j$. We now show that $\mu$ can be modified into another bijection ${\widehat{\mu}}:X^{{\textsf{opt}}}_0\rightarrow{\widehat{X}}$ which is the identity on $X^{{\textsf{opt}}}_0\cap{\widehat{X}}$. Note that the mapping ${\widehat{\mu}}$ is only needed for the analysis of the algorithm.

Lemma 2

There is a bijection ${\widehat{\mu}}:X^{{\textsf{opt}}}_{0}\rightarrow{\widehat{X}}$ such that ${\widehat{\mu}}(x)=x$ for all $x\in X^{{\textsf{opt}}}_{0}\cap{\widehat{X}}$ and

\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,{\widehat{\mu}}(x))\leq m^{z-1}\,\varepsilon\cdot z\cdot{\textsf{opt}}({\mathcal{I}}).
Proof

We construct a directed graph $H=(V_1,E_1)$, where $V_1=X^{{\textsf{opt}}}_0\cup{\widehat{X}}$. For every $x\in X^{{\textsf{opt}}}_0$, we add the directed arc $(x,\mu(x))$ to $E_1$. Observe that a self-loop in $H$ implies that $\mu(x)=x$. Every vertex in $X^{{\textsf{opt}}}_0\setminus{\widehat{X}}$ has in-degree 0 and out-degree 1, whereas a vertex in ${\widehat{X}}\setminus X^{{\textsf{opt}}}_0$ has in-degree 1 and out-degree 0. Vertices in ${\widehat{X}}\cap X^{{\textsf{opt}}}_0$ have exactly one incoming and one outgoing arc (a self-loop counts towards both the in-degree and the out-degree of the corresponding vertex).

The desired bijection ${\widehat{\mu}}$ is initialized to $\mu$. Let ${\textsf{cost}}({\widehat{\mu}})$ denote $\sum_{x\in X^{{\textsf{opt}}}_{0}}D^{z}(x,{\widehat{\mu}}(x))$; define ${\textsf{cost}}(\mu)$ similarly. It is easy to check that $H$ is a vertex-disjoint union of directed cycles and paths. In the case of a directed cycle $C'$ on more than one vertex, it must be the case that each vertex of $C'$ belongs to ${\widehat{X}}\cap X^{{\textsf{opt}}}_0$. In this case, we update ${\widehat{\mu}}$ by defining ${\widehat{\mu}}(x)=x$ for each $x$ on $C'$. Clearly, this can only decrease ${\textsf{cost}}({\widehat{\mu}})$. Let $P_1,\ldots,P_l$ be the set of directed paths in $H$. For each path $P_j$, we perform the following update: let $P_j$ be a path from $a_j$ to $b_j$. We know that $a_j\in X^{{\textsf{opt}}}_0\setminus{\widehat{X}}$, $b_j\in{\widehat{X}}\setminus X^{{\textsf{opt}}}_0$, and each internal vertex of $P_j$ lies in ${\widehat{X}}\cap X^{{\textsf{opt}}}_0$. We update ${\widehat{\mu}}$ as follows: ${\widehat{\mu}}(a_j)=b_j$ and ${\widehat{\mu}}(v)=v$ for each internal vertex $v$ of $P_j$. The overall increase in ${\textsf{cost}}({\widehat{\mu}})$ is equal to

\sum_{j\in[l]}\left(D^{z}(a_{j},b_{j})-\sum_{i=1}^{n_{j}}D^{z}(v_{j}^{i},v_{j}^{i-1})\right), \qquad (5)

where $a_j=v_j^{0},v_j^{1},\ldots,v_j^{n_j}=b_j$ denotes the sequence of vertices in $P_j$. If $z=1$, the triangle inequality shows that the above quantity is at most 0. In case $z=2$,

D^{2}(a_{j},b_{j})\leq n_{j}\left(\sum_{i=1}^{n_{j}}D^{2}(v_{j}^{i},v_{j}^{i-1})\right),

and so the quantity in (5) is at most $(n_j-1)\sum_{i=1}^{n_j}D^{2}(v_j^{i},v_j^{i-1})$.

Since the internal vertices of each path lie in $X^{{\textsf{opt}}}_0\cap{\widehat{X}}$ and $|X^{{\textsf{opt}}}_0|\leq m$, each path has at most $m$ edges, so $n_j-1\leq m-1$. It follows that ${\textsf{cost}}({\widehat{\mu}})\leq m^{z-1}\,{\textsf{cost}}(\mu)$. The desired result now follows from Lemma 1. ∎
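The rerouting in the proof above is purely combinatorial. The following Python sketch (illustrative only, with hypothetical names; it is not needed by the algorithm itself, since ${\widehat{\mu}}$ is used only in the analysis) computes ${\widehat{\mu}}$ from $\mu$ by walking the directed paths and fixing the cycles:

    def fix_bijection(mu):
        # mu: dict mapping each (hashable) point of X0_opt to a distinct point of X_hat.
        # Returns mu_hat as in Lemma 2: identity on X0_opt ∩ X_hat, and each path
        # source a_j mapped directly to its path sink b_j.
        X0, Xhat = set(mu), set(mu.values())
        mu_hat = {}
        for a in X0 - Xhat:          # sources of the directed paths
            v = mu[a]
            while v in X0:           # internal vertices lie in X0 ∩ X_hat
                mu_hat[v] = v
                v = mu[v]
            mu_hat[a] = v            # v is now the sink b_j in X_hat \ X0
        for v in X0 - set(mu_hat):   # remaining vertices lie on cycles; keep them fixed
            mu_hat[v] = v
        return mu_hat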

We run the algorithm $\mathcal{A}$ on the outlier-free constrained clustering instance ${\mathcal{I}}''=(X\setminus{\widehat{X}},F,k,{\textsf{check}},{\textsf{cost}})$ (line 8 of Algorithm 1). Let ${\textsf{opt}}({\mathcal{I}}'')$ be the optimal cost of a solution for this instance. The following key lemma shows that ${\textsf{opt}}({\mathcal{I}}'')$ is close to ${\textsf{opt}}({\mathcal{I}})$.

Lemma 3

${\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq(1+\varepsilon^{\frac{1}{z}}(2m+1)^{z-1})\,{\textsf{opt}}({\mathcal{I}})$.

Proof

We use the solution $(X^{{\textsf{opt}}}_0,\ldots,X^{{\textsf{opt}}}_k)$ to construct a feasible solution for ${\mathcal{I}}''$. For each $j\in[k]$, let $Z_j$ denote $X^{{\textsf{opt}}}_j\cap{\widehat{X}}$, and let ${\widehat{\mu}}^{-1}(Z_j)$ denote the pre-image of $Z_j$ under ${\widehat{\mu}}$. Since $Z_j\subseteq{\widehat{X}}\setminus X^{{\textsf{opt}}}_0$ and ${\widehat{\mu}}$ is the identity on $X^{{\textsf{opt}}}_0\cap{\widehat{X}}$, we have ${\widehat{\mu}}^{-1}(Z_j)\subseteq X^{{\textsf{opt}}}_0\setminus{\widehat{X}}$. For each $j\in[k]$, define

X^{\prime}_{j}:=(X^{{\textsf{opt}}}_{j}\setminus Z_{j})\cup{\widehat{\mu}}^{-1}(Z_{j}).
Claim 4
$\bigcup_{j=1}^{k}X_{j}^{\prime}=X\setminus{\widehat{X}}$.
Proof

For any $j\in[k]$, we have already argued that ${\widehat{\mu}}^{-1}(Z_j)\subseteq X^{{\textsf{opt}}}_0\setminus{\widehat{X}}\subseteq X\setminus{\widehat{X}}$. Clearly, $X^{{\textsf{opt}}}_j\setminus Z_j\subseteq X\setminus{\widehat{X}}$. Therefore $X_j'\subseteq X\setminus{\widehat{X}}$, and hence $\cup_{j\in[k]}X_j'\subseteq X\setminus{\widehat{X}}$. The sets $X_1',\ldots,X_k'$ are pairwise disjoint (the sets $X^{{\textsf{opt}}}_j\setminus Z_j$ are disjoint, the sets ${\widehat{\mu}}^{-1}(Z_j)$ are disjoint since ${\widehat{\mu}}$ is a bijection, and the former are disjoint from $X^{{\textsf{opt}}}_0$) and $|X_j'|=|X^{{\textsf{opt}}}_j|$, so

\sum_{j\in[k]}|X_{j}^{\prime}|=n-m=|X\setminus{\widehat{X}}|.

This proves the claim. ∎

The above claim implies that $(X_1',\ldots,X_k')$ is a partition of $X\setminus{\widehat{X}}$. Since $|X_j'|=|X^{{\textsf{opt}}}_j|$ for all $j\in[k]$ and the function check only depends on the cardinalities of the sets in the partition, $(X_1',\ldots,X_k')$ is a feasible partition (under check) of $X\setminus{\widehat{X}}$. In the optimal solution for ${\mathcal{I}}$, let $f^{{\textsf{opt}}}_1,\ldots,f^{{\textsf{opt}}}_k$ be the $k$ centers corresponding to the clusters $X^{{\textsf{opt}}}_1,\ldots,X^{{\textsf{opt}}}_k$, respectively. Now,

{\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq{\textsf{cost}}(X_{1}^{\prime},\ldots,X_{k}^{\prime},f^{{\textsf{opt}}}_{1},\ldots,f^{{\textsf{opt}}}_{k})\leq\sum_{j\in[k]}\sum_{x\in X_{j}^{\prime}}D^{z}(x,f^{{\textsf{opt}}}_{j}). \qquad (6)

For each $j\in[k]$, we estimate the quantity $\sum_{x\in X_j'}D^{z}(x,f^{{\textsf{opt}}}_j)$. Using the definition of $X_j'$ and the triangle inequality, this quantity can be bounded by

\sum_{x\in X^{{\textsf{opt}}}_{j}\setminus Z_{j}}D^{z}(x,f^{{\textsf{opt}}}_{j})+\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}D^{z}(x,f^{{\textsf{opt}}}_{j})\leq\sum_{x\in X^{{\textsf{opt}}}_{j}\setminus Z_{j}}D^{z}(x,f^{{\textsf{opt}}}_{j})+\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}\left(D(x,{\widehat{\mu}}(x))+D({\widehat{\mu}}(x),f^{{\textsf{opt}}}_{j})\right)^{z}. \qquad (7)

When $z=1$, the above is at most the following (the terms $D({\widehat{\mu}}(x),f^{{\textsf{opt}}}_{j})$ sum over $Z_j$ after the change of variables $y={\widehat{\mu}}(x)$ and combine with the first sum):

\sum_{x\in X^{{\textsf{opt}}}_{j}}D(x,f^{{\textsf{opt}}}_{j})+\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}D(x,{\widehat{\mu}}(x)).

Using this bound in (6), we see that

{\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq{\textsf{opt}}({\mathcal{I}})+\sum_{x\in X^{{\textsf{opt}}}_{0}}D(x,{\widehat{\mu}}(x))\leq(1+\varepsilon)\,{\textsf{opt}}({\mathcal{I}}),

where the last inequality follows from Lemma 2. This proves the desired result for $z=1$. When $z=2$, we use the fact that for any two reals $a,b$,

(a+b)^{2}\leq(1+\sqrt{\varepsilon})a^{2}+\left(1+\frac{1}{\sqrt{\varepsilon}}\right)b^{2}.
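This follows from the AM-GM bound $2ab\leq\sqrt{\varepsilon}\,a^{2}+\frac{1}{\sqrt{\varepsilon}}\,b^{2}$:

(a+b)^{2}=a^{2}+2ab+b^{2}\leq a^{2}+\sqrt{\varepsilon}\,a^{2}+\frac{1}{\sqrt{\varepsilon}}\,b^{2}+b^{2}=(1+\sqrt{\varepsilon})a^{2}+\left(1+\frac{1}{\sqrt{\varepsilon}}\right)b^{2}.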

Using this fact, the expression in the RHS of (7) can be upper bounded by

(1+\sqrt{\varepsilon})\sum_{x\in X^{{\textsf{opt}}}_{j}}D^{2}(x,f^{{\textsf{opt}}}_{j})+\left(1+\frac{1}{\sqrt{\varepsilon}}\right)\sum_{x\in{\widehat{\mu}}^{-1}(Z_{j})}D^{2}(x,{\widehat{\mu}}(x)).

Substituting this expression in (6) and using Lemma 2, we see that

{\textsf{opt}}({\mathcal{I}}^{\prime\prime})\leq(1+\sqrt{\varepsilon})\,{\textsf{opt}}({\mathcal{I}})+2m\sqrt{\varepsilon}\,{\textsf{opt}}({\mathcal{I}}).

This proves the desired result for z=2z=2. ∎

The approximation-preserving properties of Theorem 1.1 follow from the above analysis. For the $k$-means problem, since the approximation term is $(1+\sqrt{\varepsilon}(2m+1))$, we can replace $\varepsilon$ with $\varepsilon^{2}/(2m+1)^{2}$ in the algorithm and the analysis to obtain a $(1+\varepsilon)$ factor. Let us quickly check the running time of the algorithm. The algorithm first runs $\mathcal{C}$, which takes $T_{\mathcal{C}}(n)$ time. This is followed by $D^z$-sampling $O(\frac{m^{z+1}\log m}{\varepsilon^{z}})$ points, which takes $O(n\cdot(k+\frac{m^{z+1}\log m}{\varepsilon^{z}}))$ time. The number of iterations of the for-loops is determined by the number of subsets of $S$ of size at most $m$, which is $\sum_{i=0}^{m}\binom{|S|}{i}=\left(\frac{m}{\varepsilon}\right)^{O(m)}$, and the number of possibilities for ${\boldsymbol{\tau}}$, which is at most $\binom{2m+k-1}{m}=(m+k)^{O(m)}$. This gives the number of iterations $q=f(k,m,\varepsilon)=\left(\frac{k+m}{\varepsilon}\right)^{O(m)}$. In every iteration, in addition to running $\mathcal{A}$, we solve a weighted $b$-matching problem on a bipartite graph $(L\cup R,E)$, where $R$ has $(k+m)$ vertices (corresponding to the $k+m$ centers in the center set $C$) and $L$ has at most $(k+m)\cdot m$ vertices (considering the $m$ closest clients of every center is sufficient). So, every iteration costs $T_{\mathcal{A}}(n)+O((k+m)^{3}m^{2})$ time. This gives the running time expression in Theorem 1.1.

3.1 Extension to labelled version

In this section, we extend Algorithm 1 to the setting where points in $X$ have labels from a finite set $L$ and the ${\textsf{check}}()$ function can also depend on the number of points with a certain label in a cluster. The overall structure of Algorithm 1 remains unchanged; we just indicate the changes needed.

Given a non-negative integer $p$, a label partition of $p$ is defined as a tuple $\psi=(q_1,\ldots,q_{|L|})$ such that $\sum_i q_i=p$. The intuition is that, given a set $S$ of size $p$, $q_1$ points get the first label in $L$, $q_2$ points get the second label in $L$, and so on. Now, given a subset $Y$, define a valid tuple ${\boldsymbol{\tau}}$ w.r.t. $Y$ as a tuple $((t_1,\psi_1),\ldots,(t_{k+m},\psi_{k+m}))$, where (i) $\sum_j t_j+|Y|=m$, and (ii) $\psi_j$ is a label partition of $t_j$. As in line 5 of Algorithm 1, we cycle over all such valid tuples. The definition of a solution to the $b$-matching instance ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ changes as follows. Let $\psi_j=(n_j^{1},\ldots,n_j^{\ell})$, where $\ell=|L|$. Then a solution to ${\mathcal{I}}^{(Y,{\boldsymbol{\tau}})}$ needs to satisfy the condition that for each point $c_j\in C$ and each label $l\in L$, exactly $n_j^{l}$ points of $X$ with label $l$ are matched to $c_j$. Note that this also implies that exactly $t_j$ points are matched to $c_j$. This matching problem can easily be reduced to weighted bipartite matching by making $t_j$ copies of each point $c_j$ and, for each label $l$, adding edges between $n_j^{l}$ distinct copies of $c_j$ and the vertices of label $l$ only. The rest of the details of Algorithm 1 remain unchanged. Note that the running time of the algorithm changes because we now have to cycle over all label partitions of each of the numbers $t_j$.
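As an illustration of this reduction, the following Python sketch (an assumption-laden simplification: Euclidean points stored as NumPy arrays, labels given as a parallel list, each $\psi_j$ represented as a dictionary, and a reasonably recent SciPy where np.inf marks a forbidden pairing) creates one copy of $c_j$ per required (label, slot) pair and solves the resulting assignment problem:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def labelled_near_outliers(X, labels, C, psi, z):
        # X: list of points, labels[p]: label of X[p], C: list of k+m centers,
        # psi[j][l] = n_j^l, the number of label-l points to match to center c_j.
        # A copy reserved for label l may only be matched to a point carrying
        # label l; all other edges are forbidden (infinite cost).
        copies = [(j, l) for j, psi_j in enumerate(psi)
                         for l, n_jl in psi_j.items() for _ in range(n_jl)]
        if not copies:
            return set()
        cost = np.full((len(copies), len(X)), np.inf)
        for i, (j, l) in enumerate(copies):
            for p, x in enumerate(X):
                if labels[p] == l:
                    cost[i][p] = np.linalg.norm(C[j] - x) ** z
        rows, cols = linear_sum_assignment(cost)
        return set(cols)  # indices of the matched points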

The analysis of the algorithm proceeds in a manner analogous to that of Algorithm 1. We only need to consider the iteration in which the algorithm correctly guesses the size of each set $X^{{\textsf{opt}}}_{N,j}$ together with the number of points of each label in that set.
