Outliers Detection Is Not So Hard: Approximation Algorithms for Robust Clustering Problems Using Local Search Techniques

Yishui Wang, School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, P.R. China. Email: wangys@ustb.edu.cn.

Rolf H. Möhring, Institute for Applied Optimization, Department of Computer Science and Technology, Hefei University, P.R. China, and The Combinatorial Optimization and Graph Algorithms (COGA) group, Institute for Mathematics, Technical University of Berlin, Germany. Email: Rolf.Moehring@tu-berlin.de.

Chenchen Wu (corresponding author), College of Science, Tianjin University of Technology, Tianjin 300384, P.R. China. Email: wu_chenchen_tjut@163.com.

Dachuan Xu, Department of Operations Research and Information Engineering, Beijing University of Technology, Beijing 100124, P.R. China. Email: xudc@bjut.edu.cn.

Dongmei Zhang, School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, P.R. China. Email: zhangdongmei@sdjzu.edu.cn.

In this paper, we consider two types of robust models of the $k$ -median/ $k$ -means problems: the outlier-version ( $k$ -MedO/ $k$ -MeaO) and the penalty-version ( $k$ -MedP/ $k$ -MeaP), in which we can mark some points as outliers and discard them. In $k$ -MedO/ $k$ -MeaO, the number of outliers is bounded by a given integer. In $k$ -MedP/ $k$ -MeaP, we do not bound the number of outliers, but each outlier incurs a penalty cost. We develop a new technique to analyze the approximation ratio of local search algorithms for these two problems by introducing an adapted cluster that can capture useful information about outliers in the local and the global optimal solution. For $k$ -MeaP, we improve the best known approximation ratio based on local search from $25+\varepsilon$ to $9+\varepsilon$ . For $k$ -MedP, we obtain the best known approximation ratio. For $k$ -MedO/ $k$ -MeaO, there exist only two bi-criteria approximation algorithms based on local search. One violates the outlier constraint (the constraint on the number of outliers), while the other violates the cardinality constraint (the constraint on the number of clusters). We consider the former algorithm and improve its approximation ratios from $17+\varepsilon$ to $3+\varepsilon$ for $k$ -MedO, and from $274+\varepsilon$ to $9+\varepsilon$ for $k$ -MeaO.

1 Introduction

Using large data sets to make better decisions is becoming more important and routinely applied in Operations Research, Management Science, Biology, Computer Science, and Machine Learning (see e.g. Bernstein et al. 2019, Borgwardt and Happach 2019, Hochbaum and Liu 2018, Lu and Wedig 2013). Clustering large data is a fundamental problem in data analytics. Among many clustering types, center-based clustering is the most popular and widely used one. Center-based clustering problems include $k$ -median, $k$ -means, $k$ -center, facility location problems, and so on (see e.g. Ahmadian et al. 2017, Byrka et al. 2014, Lloyd 1982, Li 2013, Li et al. 2013, Ni et al. 2020). The $k$ -median and $k$ -means problems are the most basic and classic clustering problems. The goal of $k$ -median/means clustering is to find $k$ cluster centers such that the total (squared) distance from each input datum to the closest cluster center is minimized. Usually, one considers $k$ -median problems in arbitrary metrics while $k$ -means problems in the Euclidean space $\mathbb{R}^{D}$ .

Both problems are NP-hard to approximate, with lower bounds of $1+2/e\approx 1.736$ (Jain et al. 2003) and $1.07$ (Cohen-Addad and Karthik 2019) for $k$ -median and $k$ -means, respectively. There are many papers on designing efficient approximation algorithms. The best known approximation ratios are $2.675+\varepsilon$ (Byrka et al. 2014) and $6.357+\varepsilon$ (Ahmadian et al. 2017) for $k$ -median and $k$ -means, respectively. If we restrict to a fixed-dimensional Euclidean space, the $k$ -median and $k$ -means problems have a PTAS (see Arora et al. 1998, Cohen-Addad et al. 2019, Friggstad et al. 2019b).

However, real-world data sets often contain outliers which may totally spoil $k$ -median/means clustering results. To overcome this problem, robust clustering techniques have been developed to avoid being affected by outliers. In general, there are two types of robust formulations: $k$ -median/means with outliers ( $k$ -MedO/ $k$ -MeaO) and $k$ -median/means with penalties ( $k$ -MedP/ $k$ -MeaP). We formally define these problems as follows.

Definition 1.1 ( $k$ -Median Problem with Outliers/Penalties).

In the $k$ -median problem with outliers ( $k$ -MedO), we are given a client set $\mathcal{X}$ of $n$ points, a facility set $\mathcal{F}$ of $m$ points, a metric space $(\mathcal{X}\cup\mathcal{F},d)$ , and two positive integers $k<m$ and $z<n$ . The aim is to find a subset $S\subseteq\mathcal{F}$ of cardinality at most $k$ , and an outlier set $P\subseteq\mathcal{X}$ of cardinality at most $z$ such that the objective function $\sum_{x\in\mathcal{X}\setminus P}\min_{s\in S}d(x,s)$ is minimized. In the $k$ -median problem with penalties ( $k$ -MedP), we have the same input except that the cardinality restriction on the penalty set $P\subseteq\mathcal{X}$ is replaced by a nonnegative penalty $p_{x}$ for each $x\in\mathcal{X}$ , and the objective is to minimize $\sum_{x\in\mathcal{X}\setminus P}\min_{s\in S}d(x,s)+\sum_{x\in P}p_{x}$ .

Definition 1.2 ( $k$ -Means Problem with Outliers/Penalties).

In the $k$ -means problem with outliers ( $k$ -MeaO), we are given a data set $\mathcal{X}$ in $\mathbb{R}^{d}$ of $n$ points and two positive integers $k$ and $z<n$ . Let $d(u,v):=\|u-v\|_{2}$ be the Euclidean distance of the points $u$ and $v$ . The aim is to find a cluster center set $S\subseteq\mathbb{R}^{d}$ of cardinality at most $k$ , and an outlier set $P\subseteq\mathcal{X}$ of cardinality at most $z$ such that the objective function $\sum_{x\in\mathcal{X}\setminus P}\min_{s\in S}d(x,s)^{2}$ is minimized. In the $k$ -means problem with penalties ( $k$ -MeaP), we have the same input except that the cardinality restriction on the penalty set $P\subseteq\mathcal{X}$ is replaced by a nonnegative penalty $p_{x}$ for each $x\in\mathcal{X}$ , and the objective is to minimize $\sum_{x\in\mathcal{X}\setminus P}\min_{s\in S}d(x,s)^{2}+\sum_{x\in P}p_{x}$ .

From the perspective of clustering, we can view a facility in the $k$ -median problem as a center, and view a client as a point. To avoid confusion, we use “center” and “point” for all problems we consider in this paper.

Basic versions of these problems have been widely studied, and many approximation algorithms based on many different techniques, including LP-rounding (see e.g. Charikar et al. 2002, Charikar and Li 2012, Li 2013), primal-dual (see e.g. Ahmadian et al. 2017, Jain and Vazirani 2001), dual-fitting (see e.g. Jain et al. 2003, Mahdian et al. 2006), local search (see e.g. Arya et al. 2004, Kanungo et al. 2004, Korupolu et al. 2000), Lagrangian relaxation (see e.g. Jain et al. 2003, Jain and Vazirani 2001), bi-point rounding (see e.g. Jain et al. 2003, Jain and Vazirani 2001), and pseudo-approximation (see e.g. Byrka et al. 2014, Li and Svensson 2016), have been developed and applied.

We now discuss the state-of-the-art approximation results for the robust versions of $k$ -median/ $k$ -means. Chen (2008) has presented the first constant but very large approximation algorithm for $k$ -MedO via successive local search. Krishnaswamy et al. (2018) have obtained an iterative LP rounding framework yielding $(7.081+\varepsilon)$ - and $(53.002+\varepsilon)$ -approximation algorithms for $k$ -MedO and $k$ -MeaO, respectively. To the best of our knowledge, these are the only two constant factor approximation results for $k$ -MedO and $k$ -MeaO.

The first constant $4$ -approximation for $k$ -MedP has been given by Charikar et al. (2001) using the Lagrangian relaxation framework of Jain and Vazirani (2001). The best $(3+\varepsilon)$ -approximation for $k$ -MedP has been obtained by Hajiaghayi et al. (2012), who called the problem the red-blue median problem. Three years later, Wang et al. (2015) have independently obtained the same approximation factor for $k$ -MedP and have further generalized it to a $(3.732+\varepsilon)$ -approximation for the $k$ -facility location problem with linear penalties, which is a common generalization of facility location (in which there are facility opening costs and no cardinality constraint) and $k$ -MedP. Both of them use local search techniques. Zhang (2007) has obtained the approximation ratio $3.732+\varepsilon$ for the $k$ -facility location problem ( $k$ -FLP). The currently best ratio of $3.25$ for $k$ -FLP is due to Charikar and Li (2012). For the $k$ -median problem with uniform penalties, Wu et al. (2018) have adapted the pseudo-approximation technique of Li and Svensson (2016) and obtained a $(2.732+\varepsilon)$ -approximation.

Zhang et al. (2019) have presented the first constant $(25+\varepsilon)$ -approximation algorithm for $k$ -MeaP using local search. Feng et al. (2019) have improved this to a $(19.849+\varepsilon)$ -approximation by combining Lagrangian relaxation with bipoint rounding. A summary of the up-to-date approximation results for $k$ -MedO/ $k$ -MeaO and $k$ -MedP/ $k$ -MeaP along with their ordinary versions is given in Table 1.

Table 1: Comparison of (robust) clustering problems.

Techniques and reference	$k$ -median	$k$ -MedO	$k$ -MedP	$k$ -means	$k$ -MeaO	$k$ -MeaP
LP rounding (Charikar et al. 2002)	$6\frac{2}{3}$
Lagrangian relaxation (Jain and Vazirani 2001)	$6$			$108$
Lagrangian relaxation (Charikar et al. 2001)			$4$
Lagrangian relaxation (Jain et al. 2003)	$4$
Local search (Arya et al. 2004)	$3+\varepsilon$
Local search (Kanungo et al. 2004)				$9+\varepsilon$
Successive local search (Chen 2008)		constant
Dependent LP rounding (Charikar and Li 2012)	$3.25$
Local search (Hajiaghayi et al. 2012)			$3+\varepsilon$
Pseudo-approximation (Li and Svensson 2016)	$2.732+\varepsilon$
Pseudo-approximation (Byrka et al. 2014)	$2.675+\varepsilon$
Iterative LP rounding (Krishnaswamy et al. 2018)		$7.081+\varepsilon$			$53.002+\varepsilon$
Primal-dual (Ahmadian et al. 2017)				$6.357+\varepsilon$
Local search (Zhang et al. 2019)						$25+\varepsilon$
Bipoint rounding (Feng et al. 2019)						$19.849+\varepsilon$

The available literature suggests two observations concerning the approximation factor: i) $k$ -MedP/ $k$ -MeaP seem easier to approximate than $k$ -MedO/ $k$ -MeaO. ii) The existence of outliers makes the approximation of the corresponding robust clustering problems much harder than that of the ordinary clustering problems.

The best known approximation ratios for $k$ -MedO and $k$ -MeaO have been obtained by LP-rounding, but these algorithms are not strongly polynomial-time since they involve solving linear programs. Concerning time complexity, local search is better than LP-rounding, and this technique has been well applied to $k$ -median/ $k$ -means and their penalty versions. Furthermore, the standard local search algorithm is also used for $k$ -median/ $k$ -means with some special metrics such as the minor-free metric (Cohen-Addad et al. 2019) and the doubling metric (Friggstad et al. 2019b). These two papers show that the standard local search scheme yields a PTAS for the considered problems when the dimension is fixed. Their results hold in particular for the Euclidean metric, since both the minor-free metric and the doubling metric are extensions of the Euclidean metric.

Unfortunately, the standard local search algorithm for $k$ -MedO/ $k$ -MeaO cannot produce a feasible solution with a bounded approximation ratio (Friggstad et al. 2019a). Some research therefore focuses on bi-criteria approximation algorithms based on local search for these two problems. These algorithms have a bounded approximation ratio but violate either the $k$ -constraint or the outlier constraint by a bounded factor. Gupta et al. (2017) have developed a method for addressing outliers in a local search algorithm, yielding a bi-criteria $(274+\varepsilon,O(\frac{k}{\varepsilon}\log(n\delta)))$ -approximation algorithm ( $\delta$ as defined in Section 1.2) that violates the outlier constraint. Friggstad et al. (2019a) have provided $(3+\varepsilon,1+\varepsilon)$ - and $(25+\varepsilon,1+\varepsilon)$ -local search bi-criteria approximation algorithms for $k$ -MedO and $k$ -MeaO, respectively.

We will consider the standard local search algorithm for $k$ -MedP/ $k$ -MeaP, and the outlier-based local search algorithm by Gupta et al. (2017) for $k$ -MedO/ $k$ -MeaO. Using our new technique, we will improve the approximation ratios for $k$ -MeaP, $k$ -MeaO and $k$ -MedO. For $k$ -MedP, we obtain the same approximation ratio as the best known one.

We list the related results about local search algorithms for $k$ -MedO/ $k$ -MeaO and $k$ -MedP/ $k$ -MeaP in Table 2.

Table 2: Local search algorithms for (robust) clustering problems. The # centers blowup means the factor by which the cardinality constraint is violated. The # outliers blowup means the factor by which the outlier constraint is violated.

Reference	Problem	Ratio	# centers blowup	# outliers blowup
Arya et al. (2004)	$k$ -median	$3+\varepsilon$	none	none
Kanungo et al. (2004)	$k$ -means	$9+\varepsilon$	none	none
Chen (2008)	$k$ -MedO	constant	none	none
Hajiaghayi et al. (2012)	$k$ -MedP	$3+\varepsilon$	none	none
Zhang et al. (2019)	$k$ -MeaP	$25+\varepsilon$	none	none
Cohen-Addad et al. (2019)	$k$ -median/ $k$ -means in minor-free metrics with fixed dimension	PTAS	none	none
Friggstad et al. (2019b)	$k$ -median/ $k$ -means with fixed doubling dimension	PTAS	none	none
Friggstad et al. (2019a)	$k$ -MedO	$3+\varepsilon$	$1+\varepsilon$	none
Friggstad et al. (2019a)	$k$ -MeaO	$25+\varepsilon$	$1+\varepsilon$	none
Gupta et al. (2017)	$k$ -MedO	$17+\varepsilon$	none	$O(k\log(n\delta)/\varepsilon)$
Gupta et al. (2017)	$k$ -MeaO	$274+\varepsilon$	none	$O(k\log(n\delta)/\varepsilon)$
Our results	$k$ -MedP	$3+\varepsilon$	none	none
Our results	$k$ -MeaP	$9+\varepsilon$	none	none
Our results	$k$ -MedO	$5+\varepsilon$	none	$O(k\log(n\delta)/\varepsilon)$
Our results	$k$ -MedO	$3+\varepsilon$	none	$O(k^{2}\log(n\delta)/\varepsilon)$
Our results	$k$ -MeaO	$25+\varepsilon$	none	$O(k\log(n\delta)/\varepsilon)$
Our results	$k$ -MeaO	$9+\varepsilon$	none	$O(k^{2}\log(n\delta)/\varepsilon)$

1.1 Our techniques

We concentrate on $k$ -MedP and $k$ -MeaP to illustrate our techniques. The associated outlier versions are then easy generalizations.

In the standard local search algorithm, one starts from an arbitrary feasible solution. Operations such as adding a center, deleting a center, or swapping centers define the neighborhood of the current feasible solution. One then searches the whole neighborhood for a best solution and takes it as the new current solution. This is iterated until the improvement becomes sufficiently small.

Similar to the previous analyses of local search algorithms for $k$ -median and $k$ -means, we want to find some valid inequalities by constructing swap operations in order to establish some “connections” between local and global optimal solutions. Combining all these inequalities or connections carefully, we can bound the cost of the local optimal solution by the global optimal cost.

In the analysis of $k$ -median (see Arya et al. 2004), these connections are given individually for each point (that is, each point yields an inequality that gives a bound of its cost after the constructed swap operation). We call this type of analysis an “individual form”.

Another analysis type is the “cluster form”, in which the connections between the local and global optimal solutions are revealed for some clusters containing several points. The cluster form analysis was first used for $k$ -means in Kanungo et al. (2004). In the work of Kanungo et al. (2004), the authors use the Centroid Lemma (introduced in Section 2.2) to obtain equality for each cluster in the optimal solution, and then deduce the approximation ratio by these equalities and the triangle inequality. They found that the cluster form analysis is tighter than the individual form. However, the same analysis does not apply to $k$ -MeaP due to the existence of outliers. Indeed, the clusters derived with equalities from the Centroid Lemma should contain no outliers in both the local and global optimal solutions, since they do not incur a cost for outliers.

To this end, we carefully identify and define an adapted cluster as a cluster that excludes outliers. In order to use the Centroid Lemma, we identify a new centroid for the adapted cluster and use the triangle inequality for the squared distances to identify the associated centroid of the global solution in the analysis. These new centroids can be found by a carefully defined mapping function.

Our cluster form analysis also applies to $k$ -MedP, although there is no result like the Centroid Lemma for this problem. In fact, we only need to denote the optimal center of the adapted cluster for $k$ -MedP (corresponding to the centroid in the analysis for $k$ -MeaP), and use its optimality to derive the inequality for the adapted cluster. During the entire process of the analysis, we do not compute the optimal center, so we do not need a result like the Centroid Lemma.

Our cluster form analysis establishes a bridge between local and global solutions for both robust and ordinary clusterings, and we obtain a clear and unified understanding of them. Furthermore, we believe that our technique can be generalized to other robust clustering problems such as the robust facility location and $k$ -center problems.

1.2 Our contributions

We use the standard local search algorithm for $k$ -MedP and $k$ -MeaP. Via a subtle cluster form analysis, we obtain the following result.

Theorem 1.3.

The standard local search algorithm yields $(3+\varepsilon)$ - and $(9+\varepsilon)$ -approximations for $k$ -MedP and $k$ -MeaP respectively.

Our analysis differs from that of Hajiaghayi et al. (2012), who have also obtained a local search $(3+\varepsilon)$ -approximation for $k$ -MedP. For $k$ -MeaP, our result improves on the previous local search $(25+\varepsilon)$ -approximation (Zhang et al. 2019) and the primal-dual $(19.849+\varepsilon)$ -approximation (Feng et al. 2019). Moreover, our result indicates that the penalty versions of the clustering problems have the same approximation ratios as the ordinary versions when we adopt the local search technique together with our cluster form analysis.

For $k$ -MedO and $k$ -MeaO, we use the outlier-based local search algorithm (based on Gupta et al. 2017).

The algorithm has a parameter for controlling the descending step-length of the cost in each iteration. This parameter is fixed in Gupta et al. (2017), while it is an input in our algorithm because both the approximation ratio and the number of outliers blowup are associated with the value of this parameter. This helps us to reveal a tradeoff between the approximation ratio and the outlier blowup. When selecting appropriate values for this parameter, we can obtain constant approximation ratios. In the following theorems, $\delta$ denotes the maximal distance between two points in the data set.

Theorem 1.4.

The outlier-based local search algorithm yields bicriteria $(5+\varepsilon,O(\frac{k}{\varepsilon}\log(n\delta)))$ - and $(3+\varepsilon,O(\frac{k^{2}}{\varepsilon}\log(n\delta)))$ -approximations for $k$ -MedO, and bicriteria $(25+\varepsilon,O(\frac{k}{\varepsilon}\log(n\delta)))$ - and $(9+\varepsilon,O(\frac{k^{2}}{\varepsilon}\log(n\delta)))$ -approximations for $k$ -MeaO, where $O(\frac{k}{\varepsilon}\log(n\delta))$ and $O(\frac{k^{2}}{\varepsilon}\log(n\delta))$ are the factors by which the outlier constraint is violated.

With the same outlier blowup, our ratios obtained with single-swap significantly improve the previous ratios $17+\varepsilon$ and $274+\varepsilon$ for the $k$ -MedO and $k$ -MeaO, respectively. The multi-swap version improves this even more, but with a larger outlier blowup.

These results strengthen our understanding of robust clustering problems from a local search perspective. Furthermore, our cluster form analysis has a high potential to be applied to the robust versions of FLP and $k$ -FLP, since the structures of these two problems are similar to $k$ -MedP, and the analyses for the connection cost and facility opening cost are separated in the previous papers that study local search algorithms for FLP and $k$ -FLP (see Arya et al. 2004, Zhang 2007).

1.3 Outline of the paper

Section 2 presents the unified models and notations for $k$ -MedP/ $k$ -MeaP and $k$ -MedO/ $k$ -MeaO, and some useful technical lemmas. Section 3 then presents our standard local search algorithms for $k$ -MedP/ $k$ -MeaP and our corresponding theoretical results. In Section 4, we develop our outlier-based local search algorithms for $k$ -MedO/ $k$ -MeaO and present our corresponding theoretical results. The conclusions are given in Section 5. All technical proofs are given in the appendices.

2 Preliminaries

2.1 The models

We use the following notation for the problems studied in this paper (in addition to the notation introduced in the introduction). ${\mathcal{C}}$ denotes the candidate center set, and $\Delta(a,b)$ denotes the connection cost between two points $a$ and $b$ . For $k$ -MedP and $k$ -MedO, we have ${\mathcal{C}}={\mathcal{F}}$ and $\Delta(a,b)=d(a,b)$ ; for $k$ -MeaP and $k$ -MeaO, we have ${\mathcal{C}}={\mathcal{X}}$ and $\Delta(a,b)=d^{2}(a,b)$ . Then, the penalty-version can be formulated as

\min_{S\subseteq{\mathcal{C}},P\subseteq{\mathcal{X}}}\sum\limits_{x\in{\mathcal{X}}\setminus P}\min_{s\in S}\Delta(s,x)+\sum\limits_{x\in P}p_{x},

and the outlier-version can be formulated as

\min_{S\subseteq{\mathcal{C}},P\subseteq{\mathcal{X}}:|P|\leq z}\sum\limits_{x\in{\mathcal{X}}\setminus P}\min_{s\in S}\Delta(s,x).

Considering $k$ -MeaP and $k$ -MedP, we assume that $S$ is a set of $k$ centers. It is obvious that the optimal penalized point set with respect to $S$ is $P=\{x\in\mathcal{X}|p_{x}\leq\min_{s\in S}d(s,x)\}$ for $k$ -MedP and $P=\{x\in\mathcal{X}|p_{x}\leq\min_{s\in S}d^{2}(s,x)\}$ for $k$ -MeaP, implying that $S$ determines the corresponding $k$ clusters $N(s):=\{x\in\mathcal{X}\setminus P|s_{x}=s\}$ for all $s\in S$ , where $s_{x}$ denotes the closest center in $S$ to $x\in\mathcal{X}\setminus P$ , i.e., $s_{x}:=\operatorname*{argmin}_{s\in S}d(s,x)$ . Thus, we also call $S$ a feasible solution for $k$ -MedP and $k$ -MeaP.
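As a small illustration of the observation above, the following Python sketch computes the optimal penalized set $P$ and the resulting cost for a fixed center set $S$. The function name `penalized_cost`, the representation of points as hashable values, and the `penalty` dictionary are assumptions made for this illustration; the callable `delta` plays the role of $\Delta$.

```python
def penalized_cost(points, centers, penalty, delta):
    """Return (cost, P) for a fixed center set in the penalty version.

    A point x is penalized exactly when p_x <= min_s delta(s, x),
    which matches the definition of the optimal penalized set P.
    """
    total = 0
    penalized = set()
    for x in points:
        connection = min(delta(s, x) for s in centers)
        if penalty[x] <= connection:
            penalized.add(x)
            total += penalty[x]
        else:
            total += connection
    return total, penalized


# Toy instance on the real line (illustrative data, not from the paper).
points = [0, 1, 10]
penalty = {x: 3 for x in points}
med_cost, med_P = penalized_cost(points, [0], penalty, lambda s, x: abs(s - x))
mea_cost, mea_P = penalized_cost(points, [0], penalty, lambda s, x: (s - x) ** 2)
```

In this toy instance the far point `10` is penalized under both the distance and the squared distance, because its penalty is smaller than its connection cost.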

Given a center set $S$ and a subset $R\subseteq{\mathcal{X}}$ , we write ${\mathcal{X}}\setminus R=\{x_{1},x_{2},\dots,x_{|{\mathcal{X}}\setminus R|}\}$ with the points ordered so that $d(s_{x_{1}},x_{1})\geq d(s_{x_{2}},x_{2})\geq\dots\geq d(s_{x_{|{\mathcal{X}}\setminus R|}},x_{|{\mathcal{X}}\setminus R|})$ . Let ${\rm outlier}(S,R):=\{x_{1},x_{2},\dots,x_{z}\}$ if $|{\mathcal{X}}\setminus R|\geq z$ ; otherwise, let ${\rm outlier}(S,R):={\mathcal{X}}\setminus R$ . We simplify ${\rm outlier}(S,\emptyset)$ to ${\rm outlier}(S)$ . For $k$ -MedO and $k$ -MeaO, it is obvious that the optimal outlier set with respect to $S$ is ${\rm outlier}(S)$ , implying that the set $S$ can be seen as a feasible solution. We also use $(S,P)$ to denote a solution (not necessarily feasible) in which the center set is $S$ and the outlier set is $P$ for $k$ -MedO and $k$ -MeaO.
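The definition of ${\rm outlier}(S,R)$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `outlier_set` and `outlier_cost` are hypothetical, ties between equally distant points are broken arbitrarily, and points are sorted by `delta` (for squared distances this gives the same order as sorting by $d$).

```python
def outlier_set(points, centers, z, delta, removed=frozenset()):
    """Return outlier(S, R): the z remaining points farthest from their closest center."""
    remaining = [x for x in points if x not in removed]
    # Sort by decreasing connection cost; ties are broken arbitrarily.
    remaining.sort(key=lambda x: min(delta(s, x) for s in centers), reverse=True)
    # Slicing also covers the case |X \ R| < z, where everything left is returned.
    return set(remaining[:z])


def outlier_cost(points, centers, z, delta):
    """Connection cost of the outlier version with the optimal outlier set outlier(S)."""
    outliers = outlier_set(points, centers, z, delta)
    return sum(min(delta(s, x) for s in centers) for x in points if x not in outliers)


# Toy instance on the real line (illustrative data, not from the paper).
points = [0, 1, 2, 10, 11]
dist = lambda s, x: abs(s - x)
top2 = outlier_set(points, [0], 2, dist)
cost2 = outlier_cost(points, [0], 2, dist)
few = outlier_set(points, [0], 2, dist, removed={0, 1, 2, 10})
```

Here `top2` contains the two farthest points, and `few` shows the second branch of the definition, where fewer than $z$ points remain after removing $R$.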

2.2 Some technical lemmas

Given a data subset $D\subseteq\mathcal{X}$ and a point $c\in{\mathcal{C}}$ , we define $\Delta(c,D):=\sum_{x\in D}\Delta(c,x)$ . Let ${\rm cent}_{{\mathcal{C}}}(D)$ be a center point in ${\mathcal{C}}$ that optimizes the objective of the $k$ -means/ $k$ -median problem, i.e., ${\rm cent}_{{\mathcal{C}}}(D):=\operatorname*{argmin}_{c\in{\mathcal{C}}}\Delta(c,D)$ . We remark that the notation $\operatorname*{argmin}$ ( $\operatorname*{argmax}$ ) denotes an arbitrary element that minimizes (maximizes) the objective. From the well-known centroid lemma (Kanungo et al. 2004), we get ${\rm cent}_{{\mathcal{C}}}(D)={\rm cent}(D)$ for $k$ -means, where ${\rm cent}(D)$ is the centroid of $D$ , that is defined as follows.

Definition 2.1 (Centroid).

Given a set $D\subseteq{\mathbb{R}}^{d}$ , we call the point $\sum_{x\in D}x/|D|$ denoted by ${\rm cent}(D)$ the centroid of $D$ .

Lemma 2.2 (Centroid Lemma (Kanungo et al. 2004)).

For any data subset $D\subseteq\mathcal{X}$ and a point $c\in\mathbb{R}^{d}$ , we have $d^{2}(c,D)=d^{2}({\rm cent}(D),D)+|D|d^{2}({\rm cent}(D),c)$ .
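The identity in the Centroid Lemma is easy to check numerically. The following Python sketch does so on a small hand-picked instance; the helper names `sq_dist`, `centroid`, and `sq_dist_to_set` and the data are assumptions made for this illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(D):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(D)
    return tuple(sum(x[i] for x in D) / n for i in range(len(D[0])))


def sq_dist_to_set(c, D):
    """d^2(c, D): sum of squared distances from c to all points of D."""
    return sum(sq_dist(c, x) for x in D)


# Illustrative data, not from the paper.
D = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = (4.0, -1.0)
m = centroid(D)
lhs = sq_dist_to_set(c, D)
rhs = sq_dist_to_set(m, D) + len(D) * sq_dist(m, c)
```

Both sides evaluate to the same number for any choice of `c`, which is exactly what makes the centroid the optimal single center for squared distances.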

So, the candidate center points of a $k$ -means problem are the centroids of all nonempty subsets of $\mathcal{X}$ . Note that the total number of these candidate center points is $2^{|\mathcal{X}|}-1$ . To cut down this exponential magnitude, Matoušek (2000) introduces the concept of an approximate centroid set, given in the following definition.

Definition 2.3.

A set ${\mathcal{C}}^{\prime}\subseteq\mathbb{R}^{d}$ is an $\varepsilon$ -approximate centroid set for $\mathcal{X}\subseteq\mathbb{R}^{d}$ if for any set $D\subseteq\mathcal{X}$ , we have $\min_{c\in{\mathcal{C}}^{\prime}}d^{2}(c,D)\leq(1+\varepsilon)\min_{c\in\mathbb{R}^{d}}d^{2}(c,D)$ .

The following lemma shows the important observation that a polynomial-size $\hat{\varepsilon}$ -approximate centroid set for $\mathcal{X}$ can be found in polynomial time. Using this observation, in the remainder of this paper we restrict the candidate center set of $k$ -MeaP/ $k$ -MeaO to the $\hat{\varepsilon}$ -approximate centroid set ${\mathcal{C}}^{\prime}$ .

Lemma 2.4 (Matoušek (2000)).

Given an $n$ -point set ${\mathcal{X}}$ and a real number $\varepsilon>0$ , an $\varepsilon$ -approximate centroid set for ${\mathcal{X}}$ , of size $O\left(n\varepsilon^{-d}\log(1/\varepsilon)\right)$ , can be computed in time $O\left(n\log n+n\varepsilon^{-d}\log(1/\varepsilon)\right)$ .

For the $k$ -median problem, we do not need the approximate center set, since the candidate centers are in the finite set ${\mathcal{F}}$ .

3 Local search approximation algorithms for $k$ -MedP and $k$ -MeaP

Let $\rho$ be a fixed integer. For any feasible solution $S$ , $A\subseteq S$ and $B\subseteq\mathcal{C}\setminus S$ with $|A|=|B|\leq\rho$ , we define the so-called multi-swap operation ${\rm swap}(A,B)$ such that all centers in $A$ are dropped from $S$ and all centers in $B$ are added to $S$ .

We further denote the connection cost of the point $x\in{\mathcal{X}}$ by ${\rm cost}_{c}(x)$ , i.e., ${\rm cost}_{c}(x):=\Delta(s_{x},x)$ , and denote by ${\rm cost}_{c}$ , ${\rm cost}_{p}$ , and ${\rm cost}(S)$ the following expressions ${\rm cost}_{c}:=\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)$ ; ${\rm cost}_{p}:=\sum_{x\in P}p_{x}$ ; ${\rm cost}(S):={\rm cost}_{c}+{\rm cost}_{p}$ , where $P$ is the optimal penalized point set with respect to $S$ .

Now we are ready to present our multi-swap local search algorithm.

Algorithm 1 The multi-swap local search algorithm: LS-Multi-Swap( ${\mathcal{X}},C,k,\{p_{j}\}_{j\in{\mathcal{X}}},\rho$ )

Input: data set ${\mathcal{X}}$ , candidate center set $C$ , penalty cost $p_{j}$ for all $j\in{\mathcal{X}}$ , positive integers $k$ and $\rho\leq k$

Output: center set $S\subseteq C$

1: Arbitrarily choose a $k$ -center subset $S$ from $C$

2: Compute $(A,B):=\arg\min_{A\subseteq S,B\subseteq C\setminus S,|A|=|B|\leq\rho}{\rm cost}(S\setminus A\cup B)$

3: while ${\rm cost}(S\setminus A\cup B)<{\rm cost}(S)$ do

4: Set $S:=S\setminus A\cup B$

5: Compute $(A,B):=\arg\min_{A\subseteq S,B\subseteq C\setminus S,|A|=|B|\leq\rho}{\rm cost}(S\setminus A\cup B)$

6: end while

7: return $S$

For $k$ -MedP, we run LS-Multi-Swap( ${\mathcal{X}},{\mathcal{F}},k,\{p_{j}\}_{j\in{\mathcal{X}}},\rho$ ); for $k$ -MeaP, we first call the algorithm of Makarychev et al. (2016) to construct an ${\hat{\varepsilon}}$ -approximate centroid set ${\mathcal{C}}^{\prime}\subseteq{\mathcal{X}}$ , then run LS-Multi-Swap( ${\mathcal{X}},{\mathcal{C}}^{\prime},k,\{p_{j}\}_{j\in{\mathcal{X}}},\rho$ ). The values of $\rho$ and $\hat{\varepsilon}$ will be determined in our analysis of the algorithm.
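To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch that enumerates all multi-swaps exhaustively. It is an illustration under stated assumptions rather than the implementation analyzed in this paper: the names `cost` and `ls_multi_swap` are hypothetical, the "arbitrary" initial solution is simply the first $k$ candidates, the stopping rule is strict improvement exactly as in the listing, and the enumeration is exponential in $\rho$.

```python
from itertools import combinations


def cost(points, centers, penalty, delta):
    """cost(S): each point pays the cheaper of its penalty and its connection cost."""
    return sum(min(penalty[x], min(delta(s, x) for s in centers)) for x in points)


def ls_multi_swap(points, candidates, k, penalty, rho, delta):
    candidates = list(candidates)
    S = set(candidates[:k])  # step 1: an arbitrary k-subset (here: the first k)
    current = cost(points, S, penalty, delta)
    while True:
        best_S, best_cost = None, current
        outside = [c for c in candidates if c not in S]
        # Steps 2 and 5: search all swaps (A, B) with |A| = |B| <= rho.
        for size in range(1, rho + 1):
            for A in combinations(list(S), size):
                for B in combinations(outside, size):
                    T = (S - set(A)) | set(B)
                    c = cost(points, T, penalty, delta)
                    if c < best_cost:
                        best_S, best_cost = T, c
        if best_S is None:  # no swap strictly improves the cost
            return S
        S, current = best_S, best_cost  # step 4


# Toy k-MedP instance on the real line (illustrative data, not from the paper).
points = [0, 1, 2, 10, 11, 12, 50]
penalty = {x: 5 for x in points}
dist = lambda s, x: abs(s - x)
S = ls_multi_swap(points, points, 2, penalty, 1, dist)
final_cost = cost(points, S, penalty, dist)
```

On this instance a single swap already moves the initial solution to the two cluster medians, and the isolated point `50` ends up penalized.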

3.1 The analysis

Let $S^{*}$ be a global optimal solution with the penalized set $P^{*}=\{x\in\mathcal{X}|p_{x}\leq\min_{s^{*}\in S^{*}}\Delta(x,s^{*})\}$ . Similar to the feasible solution $S$ , we introduce the corresponding notations $s^{*}_{x}$ , $N^{*}(s^{*})$ , ${\rm cost}^{*}_{c}(x)$ , ${\rm cost}_{c}^{*}$ , ${\rm cost}_{p}^{*}$ and ${\rm cost}(S^{*})$ .

We use the standard analysis for a local search algorithm, in which some swap operations between $S$ and $S^{*}$ are constructed, and then each point is reassigned to a center in the new solution. In the cluster form analysis, we try to bound the new cost for a set of points, rather than bounding the cost of each point individually and independently. To this end, we introduce the adapted cluster as follows.

N^{*}_{q}(s^{*}):=N^{*}(s^{*})\setminus P,\qquad\forall s^{*}\in S^{*}.

With the adapted cluster, we set $\tilde{S}^{*}:=\{{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))|s^{*}\in S^{*}\}$ . We introduce a mapping $\phi:\tilde{S}^{*}\rightarrow S$ and map each point $c\in\tilde{S}^{*}$ to $\phi(c):=\arg\min_{s\in S}d(c,s)$ . We say that the center $\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))$ captures $s^{*}$ . In each of the constructed swap operations, we will reassign some points to a center determined by the mapping $\phi$ (for instance, reassign the point $x$ to $\phi({\rm cent}(N^{*}_{q}(s^{*}_{x})))$ ); the details will be stated later.
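The mapping $\phi$ and the notion of capturing can be illustrated by a short Python sketch. The function name `capture_map`, the dictionary `adapted_centers` (which stores a hypothetical value of ${\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))$ for each $s^{*}$), and the one-dimensional data are all assumptions made for this illustration.

```python
def capture_map(adapted_centers, S, dist):
    """Map each optimal center s* to phi(cent_C(N*_q(s*))), the center of S capturing it."""
    return {
        s_star: min(S, key=lambda s: dist(c, s))
        for s_star, c in adapted_centers.items()
    }


# Illustrative data, not from the paper: local centers S and, for each
# optimal center s*, a hypothetical optimal center of its adapted cluster.
S = [0, 10]
adapted_centers = {1: 1.5, 9: 8.0, 12: 12.5}
phi = capture_map(adapted_centers, S, lambda a, b: abs(a - b))

# Group the optimal centers by the local center that captures them.
captured = {s: sorted(t for t, v in phi.items() if v == s) for s in S}
```

In this example the local center `10` captures two optimal centers while `0` captures one; a local center may also capture none.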

Combining all swap operations, the sum of the costs of these points appears on the right-hand side of the inequality which is derived from the local optimality of $S$ . For $k$ -MeaP, we can bound this sum by the connection costs of $S$ and $S^{*}$ , see Lemma 3.1. Note that none of these points is an outlier in either $S$ or $S^{*}$ . This is the reason why we need to use the adapted cluster rather than the cluster $N^{*}(s^{*})$ which was used in the analysis for $k$ -means (Kanungo et al. 2004).

In the proof of Lemma 3.1, we divide the set ${\mathcal{X}}\setminus(P\cup P^{*})$ into some adapted clusters with respect to all $s^{*}\in S^{*}$ , and apply the Centroid Lemma to each adapted cluster. Afterwards we bound the square of distances between a centroid $c$ of the adapted cluster and its mapped point $\phi(c)$ . This explains why the domain of the mapping $\phi$ is the set of centroids of adapted clusters.

Lemma 3.1.

Let $S$ and $S^{*}$ be a local optimal solution and a global optimal solution of $k$ -MeaP, respectively. Then,

$$
\begin{aligned}
\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}d^{2}(\phi({\rm cent}(N^{*}_{q}(s^{*}_{x}))),x)
\leq{}&\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}\left(2{\rm cost}^{*}_{c}(x)+{\rm cost}_{c}(x)\right)\\
&+2\sqrt{\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}^{*}_{c}(x)}\cdot\sqrt{\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}_{c}(x)}.
\end{aligned}
\tag{1}
$$

Proof.

With the Cauchy-Schwarz inequality, we obtain

$$
\begin{aligned}
&\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))\cdot d(x,s_{x})\\
\leq{}&\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))}\cdot\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})}.
\end{aligned}
\tag{2}
$$

Lemma 2.2 and the definition of $\phi(\cdot)$ then yield

$$
\begin{aligned}
&\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)\\
={}&\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))),x)\\
={}&\sum\limits_{s^{*}\in S^{*}}\left[d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+|N^{*}_{q}(s^{*})|\cdot d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))))\right]\\
={}&\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))))\\
\leq{}&\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),s_{x}).
\end{aligned}
\tag{3}
$$

Using the triangle inequality for $d(\cdot,\cdot)$ , we obtain

	$\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),s_{x})$	(4)
$\displaystyle\leq$	$\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}\left(d(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))+d(x,s_{x})\right)^{2}$
$\displaystyle=$	$\displaystyle\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})$
	$\displaystyle+\ 2\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))\cdot d(x,s_{x}).$

Integrating (2)-(4) and using the definition of ${\rm cent}_{{\mathcal{C}}}(\cdot)$ then gives

			$\displaystyle\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)$
		$\displaystyle\leq$	$\displaystyle 2\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})$
			$\displaystyle+\ 2\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))}\cdot\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})}$
		$\displaystyle\leq$	$\displaystyle 2\sum\limits_{s^{*}\in S^{*}}d^{2}(s^{*},N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})$
			$\displaystyle+\ 2\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s^{*})}\cdot\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})}$
		$\displaystyle=$	$\displaystyle\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}\left(2{\rm cost}^{*}_{c}(x)+{\rm cost}_{c}(x)\right)+2\sqrt{\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}^{*}_{c}(x)}\cdot\sqrt{\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}_{c}(x)}.$

Note that ${\rm cent}_{{\mathcal{C}}}(\cdot)={\rm cent}(\cdot)$ for $k$ -MeaP. So we complete the proof. ∎
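The second equality in (3) rests on the centroid identity: for a finite set $N\subset\mathbb{R}^{d}$ and any point $c$ , $\sum_{x\in N}d^{2}(c,x)=\sum_{x\in N}d^{2}({\rm cent}(N),x)+|N|\cdot d^{2}({\rm cent}(N),c)$ . The following is a minimal numerical sanity check of that identity, assuming Euclidean points stored as plain tuples; the helper names are illustrative and do not come from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def centroid_identity_gap(points, c):
    """Return lhs - rhs of: sum d^2(c,x) = sum d^2(cent,x) + |N| * d^2(cent,c)."""
    cent = centroid(points)
    lhs = sum(sq_dist(c, x) for x in points)
    rhs = sum(sq_dist(cent, x) for x in points) + len(points) * sq_dist(cent, c)
    return lhs - rhs

random.seed(0)
cluster = [(random.random(), random.random()) for _ in range(50)]
gap = centroid_identity_gap(cluster, (2.0, -1.0))
print(abs(gap) < 1e-9)  # the gap should vanish up to rounding error
```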

Consider now $\phi(\tilde{S}^{*})$ , i.e., the image set of $\tilde{S}^{*}$ under $\phi$ . We list all elements of $\phi(\tilde{S}^{*})$ as $\phi(\tilde{S}^{*})=\{s_{1},...,s_{m}\}$ where $m:=|\phi(\tilde{S}^{*})|$ . For each $l\in\{1,...,m\}$ , let $S_{l}:=\{s_{l}\}$ and $S^{*}_{l}:=\{s^{*}\in S^{*}|\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))=s_{l}\}$ . Thus, $S^{*}$ is partitioned into $S^{*}_{1},S^{*}_{2},...,S^{*}_{m}$ . Noting that $|S|=|S^{*}|=k$ , we can enlarge each $S_{l}$ such that $S_{1},S_{2},...,S_{m}$ is a partition of $S$ with $|S_{l}|=|S^{*}_{l}|$ for each $l\in\{1,2,...,m\}$ .

We will construct a swap operation between the points in $S_{l}$ and $S^{*}_{l}$ for each pair $(S_{l},S^{*}_{l})$ . Before doing this, we note that a center $s^{*}\in S^{*}$ need not belong to the candidate center set ${\mathcal{C}}^{\prime}$ for $k$ -MeaP. Thus, we introduce a center ${\hat{s}^{*}}\in{\mathcal{C}}^{\prime}$ associated with each $s^{*}\in S^{*}$ to ensure that the swap operation involving $s^{*}$ can be implemented in Algorithm 3. For each $s^{*}\in S^{*}$ , let ${\hat{s}^{*}}:=\arg\min_{c\in{\mathcal{C}}^{\prime}}d(c,N^{*}(s^{*}))$ . Combined with Definition 2.3, we have (see Zhang et al. 2019)

$\displaystyle\sum\limits_{x\in N^{*}(s^{*})}d^{2}({\hat{s}^{*}},x)$	$\displaystyle=$	$\displaystyle d^{2}({\hat{s}^{*}},N^{*}(s^{*}))=\min\limits_{c\in{\mathcal{C}}^{\prime}}d^{2}(c,N^{*}(s^{*}))$	(5)
	$\displaystyle\leq$	$\displaystyle(1+{\hat{\varepsilon}})\min\limits_{c\in\mathbb{R}^{d}}d^{2}(c,N^{*}(s^{*}))=(1+{\hat{\varepsilon}})d^{2}(s^{*},N^{*}(s^{*}))$
	$\displaystyle=$	$\displaystyle(1+{\hat{\varepsilon}})\sum\limits_{x\in N^{*}(s^{*})}d^{2}(s^{*},x).$

The algorithm allows at most $\rho$ points to be swapped. To satisfy this condition, we consider the following two cases to construct swap operations (cf. Figure 1 for $\rho=3$ ).

Figure 1, panel (a): $|S_{l}|\leq\rho$ .

Case 1

(cf. Figure 1(a)). For each $l$ with $|S_{l}|=|S^{*}_{l}|\leq\rho$ , we consider the pair $(S_{l},S^{*}_{l})$ . Let ${\hat{S}^{*}_{l}}:=\{\hat{s}^{*}|s^{*}\in S^{*}_{l}\}$ . W.l.o.g., we assume that ${\hat{S}^{*}_{l}}\subseteq{\mathcal{X}}\setminus S$ . For $k$ -MedP, we consider the swap $(S_{l},S^{*}_{l})$ ; for $k$ -MeaP, we consider the swap $(S_{l},{\hat{S}^{*}_{l}})$ . Utilizing these swap operations, we obtain the following result.

Lemma 3.2.

If $|S_{l}|=|S^{*}_{l}|\leq\rho$ , then we have

	$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+$		(6)
			$\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x})$

for $k$ -MedP, and

	$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+$		(7)
			$\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x})$

for $k$ -MeaP.

Case 2

(cf. Figure 1(b)). For each $l$ with $|S_{l}|=|S^{*}_{l}|=m_{l}>\rho$ , we consider $(m_{l}-1)m_{l}$ pairs $(s,s^{*})$ with $s\in S_{l}\backslash\{s_{l}\}$ and $s^{*}\in S^{*}_{l}$ . For $k$ -MedP, we consider the swap $(s,s^{*})$ ; for $k$ -MeaP, we consider the swap $(s,{\hat{s}^{*}})$ . Utilizing these swap operations, we obtain the following result.

Lemma 3.3.

For any $s\in S_{l}\backslash\{s_{l}\}$ and $s^{*}\in S^{*}_{l}$ , we have

	$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+$		(8)
			$\displaystyle\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x})$

for $k$ -MedP, and

	$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+$		(9)
			$\displaystyle\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x})$

for $k$ -MeaP.

Lemma 3.2 shows a relationship between the sets $S_{l}$ and $S^{*}_{l}$ , while Lemma 3.3 shows a relationship between two points in $S_{l}$ and $S^{*}_{l}$ respectively. We remark that Lemma 3.3 holds for all pairs $(S_{l},S^{*}_{l})$ (no matter whether $|S_{l}|>\rho$ ). This is useful for the analysis of the algorithm for $k$ -MedO/ $k$ -MeaO in Section 4.

Proof of Lemma 3.2.

We only prove it for $k$ -MeaP. The proof for $k$ -MedP is similar. After the operation swap $(S_{l},{\hat{S}^{*}_{l}})$ , we penalize all points in $N(s)\cap P^{*}$ for all $s\in S_{l}$ , reassign each point $x\in N^{*}(s^{*})$ to $\hat{s}^{*}$ for all $s^{*}\in S^{*}_{l}$ , and reassign each point $x\in N(s)\setminus\left(\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}\right)$ to $\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))$ for all $s\in S_{l}$ ( $s^{*}_{x}\notin S^{*}_{l}$ implies $\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))\notin S_{l}$ ). Since the operation swap $(S_{l},{\hat{S}^{*}_{l}})$ does not improve the local optimal solution $S$ , we have

$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle{\rm cost}(S\setminus\{S_{l}\}\cup\{\hat{S}^{*}_{l}\})-{\rm cost}(S)$
	$\displaystyle\leq$	$\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+$
		$\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus\left(\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}\right)}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+$
		$\displaystyle\sum\limits_{s^{*}\in S^{*}_{l}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}(d^{2}({\hat{s}^{*}},x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S^{*}_{l}}\sum\limits_{x\in N^{*}(s^{*})\cap P}(d^{2}({\hat{s}^{*}},x)-p_{x}).$

Combining this with $\sum_{x\in N^{*}(s^{*})}=\sum_{x\in N^{*}(s^{*})\setminus P}+\sum_{x\in N^{*}(s^{*})\cap P}$ and inequality (5) completes the proof. ∎

Proof of Lemma 3.3.

We again only prove it for $k$ -MeaP; the proof for $k$ -MedP is similar. Recall the definition of ${\hat{s}^{*}}$ . W.l.o.g., we assume that ${\hat{s}^{*}}\notin S$ . It follows from $s\neq s_{l}$ and $\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))=s_{l}$ that $\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))\neq s$ when $x\in N(s)\setminus(N^{*}(s^{*})\cup P^{*})$ . Since the operation swap $(s,{\hat{s}^{*}})$ does not improve the current solution $S$ , we have

$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle{\rm cost}(S\setminus\{s\}\cup\{\hat{s}^{*}\})-{\rm cost}(S)$
	$\displaystyle\leq$	$\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus(N^{*}(s^{*})\cup P^{*})}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$
		$\displaystyle+\sum\limits_{x\in N^{*}(s^{*})\setminus P}(d^{2}({\hat{s}^{*}},x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}(d^{2}({\hat{s}^{*}},x)-p_{x})$
	$\displaystyle\leq$	$\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$
		$\displaystyle+\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x}).$

This completes the proof. ∎

Combining Lemmas 3.2 and 3.3, we estimate the cost of $S$ for $k$ -MedP and $k$ -MeaP in the following two theorems respectively.

Theorem 3.4.

LS-Multi-Swap( ${\mathcal{X}},{\mathcal{F}},k,\{p_{j}\}_{j\in{\mathcal{X}}},\rho$ ) for $k$ -MedP produces a local optimal solution $S$ satisfying ${\rm cost}_{c}+{\rm cost}_{p}\leq(3+2/\rho){\rm cost}_{c}^{*}+(1+1/\rho){\rm cost}_{p}^{*}$ .

Theorem 3.5.

Let ${\mathcal{C}}^{\prime}$ be an $\hat{\varepsilon}$ -approximate centroid set for ${\mathcal{X}}$ . LS-Multi-Swap( ${\mathcal{X}},$ ${\mathcal{C}}^{\prime},k,\{p_{j}\}_{j\in{\mathcal{X}}},\rho$ ) for $k$ -MeaP produces a local optimal solution $S$ satisfying ${\rm cost}_{c}+{\rm cost}_{p}\leq\left(3+2/\rho+{\hat{\varepsilon}}\right)^{2}{\rm cost}_{c}^{*}+\left(3+2/\rho+{\hat{\varepsilon}}\right)\left(1+1/\rho\right){\rm cost}_{p}^{*}$ .

Proof of Theorem 3.4.

Note that $m_{l}/(m_{l}-1)\leq(\rho+1)/\rho$ and $d(\phi(c),x)\geq{\rm cost}_{c}(x)$ for any $c\in\tilde{S}^{*}$ and any $x\in{\mathcal{X}}$ . Summing the inequality (6) with weight $1$ and inequality (8) with weight $1/(m_{l}-1)$ over all constructed swap operations, we have

$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))$	(10)
		$\displaystyle+\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$
		$\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))$
		$\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x}).$

The triangle inequality and the definition of $\phi(\cdot)$ imply that

	$\displaystyle d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)$	(11)
$\displaystyle\leq$	$\displaystyle d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)$
$\displaystyle\leq$	$\displaystyle d(s_{x},{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)$
$\displaystyle\leq$	$\displaystyle d(s_{x},x)+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)$
$\displaystyle=$	$\displaystyle{\rm cost}_{c}(x)+2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x).$

Combining inequalities (10) and (11), we obtain

$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\setminus P^{*}}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)$	(12)
		$\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x})$
	$\displaystyle\leq$	$\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{x\in P^{*}}p_{x}+\left(1+\frac{1}{\rho}\right)\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)$
		$\displaystyle+\sum\limits_{x\in{\mathcal{X}}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum\limits_{x\in{\mathcal{X}}\setminus P}{\rm cost}_{c}(x)-\sum\limits_{x\in P}p_{x}$
	$\displaystyle=$	$\displaystyle\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}+\left(1+\frac{1}{\rho}\right)\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)$
		$\displaystyle+\ {\rm cost}_{c}^{*}-{\rm cost}_{c}-{\rm cost}_{p}.$

From the definitions of $N^{*}_{q}(\cdot)$ and ${\rm cent}_{{\mathcal{C}}}(\cdot)$ , we get that

	$\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)\leq\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d(s^{*}_{x},x)\leq 2{\rm cost}_{c}^{*}.$	(13)

Finally, we complete the proof by combining inequalities (12)-(13) for $\rho=2/\varepsilon$ . ∎
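For reference, the final combination can be written out explicitly. Substituting (13) into the last expression of (12) and moving ${\rm cost}_{c}+{\rm cost}_{p}$ to the left-hand side gives

	$\displaystyle{\rm cost}_{c}+{\rm cost}_{p}\leq\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}+2\left(1+\frac{1}{\rho}\right){\rm cost}_{c}^{*}+{\rm cost}_{c}^{*}=\left(3+\frac{2}{\rho}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*},$

which is the bound stated in Theorem 3.4. With $\rho=2/\varepsilon$ the coefficient of ${\rm cost}_{c}^{*}$ becomes $3+\varepsilon$ and that of ${\rm cost}_{p}^{*}$ becomes $1+\varepsilon/2$ .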

Proof of Theorem 3.5.

Similar to the proof of Theorem 3.4, we obtain by summing inequality (7) with weight $1$ and inequality (9) with weight $1/(m_{l}-1)$ over all constructed swap operations that

$\displaystyle 0$	$\displaystyle\leq$	$\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))$	(14)
		$\displaystyle+\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$
		$\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))$
		$\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x}).$

Because of $\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}=\sum_{x\in{\mathcal{X}}\setminus(P\cup P^{*})}$ and Lemma 3.1, the RHS of (14) is not larger than

	$\displaystyle\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right)\sum\limits_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum\limits_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)$	(15)
	$\displaystyle+\ 2\left(1+\frac{1}{\rho}\right)\sqrt{\sum\limits_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)}\sqrt{\sum\limits_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)}+\left(1+\frac{1}{\rho}\right)\sum\limits_{x\in P^{*}}p_{x}-\sum\limits_{x\in P}p_{x}$
$\displaystyle=$	$\displaystyle\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}-{\rm cost}_{c}+2\left(1+\frac{1}{\rho}\right)\sqrt{{\rm cost}_{c}^{*}{\rm cost}_{c}}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}-{\rm cost}_{p}$
$\displaystyle\leq$	$\displaystyle\left(\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}\right)-\left({\rm cost}_{c}+{\rm cost}_{p}\right)$
	$\displaystyle+\frac{2\left(1+\frac{1}{\rho}\right)}{\sqrt{3+\frac{2}{\rho}+{\hat{\varepsilon}}}}\sqrt{\left(\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}\right)\left({\rm cost}_{c}+{\rm cost}_{p}\right)}.$

The RHS of (15) is equal to

	$\displaystyle\left(\sqrt{\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}}+\alpha\sqrt{{\rm cost}_{c}+{\rm cost}_{p}}\right)$
	$\displaystyle\times\left(\sqrt{\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}}-\beta\sqrt{{\rm cost}_{c}+{\rm cost}_{p}}\right)$

where

	$\displaystyle\alpha=\frac{1+1/\rho}{\sqrt{3+2/\rho+\hat{\varepsilon}}}+\sqrt{\frac{(1+1/\rho)^{2}}{3+2/\rho+\hat{\varepsilon}}+1},$
	$\displaystyle\beta=-\frac{1+1/\rho}{\sqrt{3+2/\rho+\hat{\varepsilon}}}+\sqrt{\frac{(1+1/\rho)^{2}}{3+2/\rho+\hat{\varepsilon}}+1}.$

This implies that

	$\displaystyle\sqrt{\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}}-\beta\sqrt{{\rm cost}_{c}+{\rm cost}_{p}}\geq 0,$

which is equivalent to

	$\displaystyle{\rm cost}_{c}+{\rm cost}_{p}\leq\frac{1}{\beta^{2}}\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\frac{1}{\beta^{2}}\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}.$

Observe that

	$\displaystyle\frac{1}{\beta^{2}}\leq 3+\frac{2}{\rho}+\hat{\varepsilon}.$

So, we have

	$\displaystyle{\rm cost}_{c}+{\rm cost}_{p}\leq\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right)^{2}{\rm cost}_{c}^{*}+\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right)\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}.$	(16)

Substituting ${\hat{\varepsilon}}=1/\rho$ into (16) completes the proof. ∎
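The bound $1/\beta^{2}\leq 3+2/\rho+\hat{\varepsilon}$ used above can be checked numerically. The sketch below assumes that $\beta$ is the positive root $-c+\sqrt{c^{2}+1}$ with $c=(1+1/\rho)/\sqrt{3+2/\rho+\hat{\varepsilon}}$ , i.e., that the product of the two roots $\alpha\beta$ equals the coefficient $1$ of ${\rm cost}_{c}+{\rm cost}_{p}$ ; the function name and the parameter grid are illustrative choices, not part of the paper.

```python
import math

def inv_beta_sq(rho, eps_hat):
    """Return (1/beta^2, D) for beta = -c + sqrt(c^2 + 1), c = (1 + 1/rho)/sqrt(D)."""
    D = 3.0 + 2.0 / rho + eps_hat
    c = (1.0 + 1.0 / rho) / math.sqrt(D)
    beta = -c + math.sqrt(c * c + 1.0)
    return 1.0 / (beta * beta), D

for rho in (1, 2, 5, 50):
    for eps_hat in (0.0, 0.01, 0.5):
        value, D = inv_beta_sq(rho, eps_hat)
        # A small tolerance is needed because equality holds when eps_hat = 0.
        assert value <= D + 1e-9
```

When $\hat{\varepsilon}=0$ the inequality is tight, which is why the squared factor $(3+2/\rho+\hat{\varepsilon})^{2}$ appears in (16).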

We remark that Algorithm 3 can be adapted to a polynomial-time algorithm that only sacrifices $\varepsilon$ in the approximation factor (see Arya et al. 2004). Combining this adaptation and Theorems 3.4-3.5, we obtain a $(3+\varepsilon)$ -approximation algorithm for $k$ -MedP, and a $(9+\varepsilon)$ -approximation algorithm for $k$ -MeaP, if $\rho$ is sufficiently large and $\hat{\varepsilon}$ is sufficiently small.

4 Local search algorithm for $k$ -MedO/ $k$ -MeaO

In this section, we focus on $k$ -MedO and $k$ -MeaO. We apply the technique for addressing outliers in a local search algorithm provided by Gupta et al. (2017) to $k$ -MedO and $k$ -MeaO, and use our new analysis to improve the approximation ratio.

4.1 The algorithm

Each iteration of the outlier-based multi-swap local search algorithm has a no-swap step and a swap step. Supposing that the current solution is $(S,P)$ , the no-swap step implements an “add outliers” operation that adds the points in ${\rm outlier}(S,P)$ (defined in Section 2.1) to $P$ , if this operation can reduce the cost by a given factor. Then, the swap step searches for a better solution by the multi-swap together with the “add outliers” operations. The algorithm terminates when neither the no-swap step nor the swap step can reduce the cost by the given factor.

Let ${\rm cost}(S,P)$ denote the cost of the solution $(S,P)$ . We give the formal description of the outlier-based local search algorithm in Algorithm 4.1. This algorithm has three parameters: $\rho$ is the number of points which are allowed to be swapped in a solution, while $q$ and $\varepsilon$ control the step length by which the cost must decrease. The parameter $q$ is fixed as $k$ in the algorithm provided by Gupta et al. (2017), while it is an input in our algorithm, because the approximation ratio is associated with the value of this parameter.

The following proposition holds for this algorithm.

Proposition 4.1 (Gupta et al. 2017).

Let $(S,P)$ be the solution produced by LS-Multi-Swap-Outlier( ${\mathcal{X}},$ $C,z,k,\rho,\varepsilon$ ), and set $q=k$ if $\rho=1$ , otherwise, set $q=k^{2}-k$ . Then

(i)

${\rm cost}(S,P\cup{\rm outlier}(S,P))\geq\left(1-\varepsilon/q\right){\rm cost}(S,P)$ ,
(ii)

${\rm cost}(S\setminus A\cup B,P\cup{\rm outlier}(S\setminus A\cup B,P))\geq\left(1-\varepsilon/q\right){\rm cost}(S,P)$ for any $A\subseteq S$ and $B\subseteq C$ .

Algorithm 2 The outlier-based local search algorithm: LS-Multi-Swap-Outlier( ${\mathcal{X}},C,z,k,\rho,q,\varepsilon$ )

Input: Data set ${\mathcal{X}}$ , candidate center set $C$ , positive integers $z$ , $k$ , $q$ and $\rho\leq k$ , real number $\varepsilon>0$

Output: Center set $S\subseteq C$ and outlier set $P\subseteq{\mathcal{X}}$

1: Arbitrarily choose a $k$ -center subset $S$ from $C$

2: Set $P:={\rm outlier}(C)$

3: Set $\alpha:=+\infty$

4: while ${\rm cost}(S,P)<\alpha$ do

5: $\alpha\leftarrow{\rm cost}(S,P)$

6: if ${\rm cost}(S,P\cup{\rm outlier}(S,P))<\left(1-\dfrac{\varepsilon}{q}\right){\rm cost}(S,P)$ then

7: Set $P:=P\cup{\rm outlier}(S,P)$

8: end if

9: Compute $(A,B):=\arg\min_{A\subseteq S,B\subseteq C\setminus S,|A|=|B|\leq\rho}{\rm cost}(S\setminus A\cup B,P\cup{\rm outlier}(S\setminus A\cup B,P))$

10: Set $S^{\prime}:=S\setminus A\cup B$ and $P^{\prime}:=P\cup{\rm outlier}(S\setminus A\cup B,P)$

11: if ${\rm cost}(S^{\prime},P^{\prime})<\left(1-\dfrac{\varepsilon}{q}\right){\rm cost}(S,P)$ then

12: Set $S:=S^{\prime}$ and $P:=P^{\prime}$

13: end if

14: end while

15: return $S$ and $P$

For $k$ -MedO, we run LS-Multi-Swap-Outlier( ${\mathcal{X}},{\mathcal{F}},z,k,\rho,q,\varepsilon$ ). For $k$ -MeaO, we run LS-Multi-Swap-Outlier( ${\mathcal{X}},{\mathcal{C}}^{\prime},z,k,\rho,q,\varepsilon$ ), where ${\mathcal{C}}^{\prime}$ is an ${\hat{\varepsilon}}$ -approximate centroid set for ${\mathcal{X}}$ . The values of $\rho$ , $\varepsilon$ , and $\hat{\varepsilon}$ will be determined in the analysis of the algorithm.
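To make the control flow of LS-Multi-Swap-Outlier concrete, the following is a small brute-force sketch in Python. It is not the authors' implementation and rests on several assumptions: the cost is the $k$ -MeaO-style sum of squared Euclidean distances; `outlier(S, P)` is read as the $z$ points outside $P$ that are farthest from $S$ (the actual definition is in Section 2.1 and may differ); the initial outlier set is taken as `outlier(S, {})`; and the initial centers are simply the first $k$ candidates.

```python
from itertools import combinations

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cost(X, S, P):
    """Sum of squared distances to the nearest center, skipping indices in P."""
    return sum(min(sq_dist(x, s) for s in S) for i, x in enumerate(X) if i not in P)

def outlier(X, S, P, z):
    """ASSUMPTION: indices of the z points outside P that are farthest from S."""
    rest = [i for i in range(len(X)) if i not in P]
    rest.sort(key=lambda i: min(sq_dist(X[i], s) for s in S), reverse=True)
    return frozenset(rest[:z])

def ls_multi_swap_outlier(X, C, z, k, rho, q, eps):
    S = list(C[:k])                      # step 1: an arbitrary k-subset of C
    P = outlier(X, S, frozenset(), z)    # step 2 (assumed reading)
    alpha = float("inf")
    while cost(X, S, P) < alpha:
        alpha = cost(X, S, P)
        # No-swap step: add outliers if that alone lowers the cost enough.
        P_new = P | outlier(X, S, P, z)
        if cost(X, S, P_new) < (1 - eps / q) * cost(X, S, P):
            P = P_new
        # Swap step: exhaustively search swaps of at most rho centers.
        best = None
        others = [c for c in C if c not in S]
        for t in range(1, rho + 1):
            for A in combinations(S, t):
                for B in combinations(others, t):
                    S2 = [s for s in S if s not in A] + list(B)
                    P2 = P | outlier(X, S2, P, z)
                    val = cost(X, S2, P2)
                    if best is None or val < best[0]:
                        best = (val, S2, P2)
        if best is not None and best[0] < (1 - eps / q) * cost(X, S, P):
            S, P = best[1], best[2]
    return S, P

# Toy data on the line: two tight pairs plus one far-away point.
X = [(0.0,), (1.0,), (10.0,), (11.0,), (100.0,)]
S, P = ls_multi_swap_outlier(X, X, z=1, k=2, rho=1, q=2, eps=0.1)
print(S, sorted(P))
```

Because the outlier set only grows, the returned $P$ can contain more than $z$ points; this is the violation of the outlier constraint that the analysis below bounds.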

4.2 The analysis

The time complexity of Algorithm 4.1 is shown in the following theorem.

Theorem 4.2.

The running time of LS-Multi-Swap-Outlier( ${\mathcal{X}},{\mathcal{C}},z,k,\rho,q,\varepsilon$ ) is $O\left(\frac{k^{\rho}n^{\rho}q}{\varepsilon}\log(n\delta)\right)$ .

Proof.

The proof is similar to that in Gupta et al. (2017). For the sake of completeness, we also present it here. W.l.o.g., we can assume that the optimal value of the problem is at least $1$ by scaling the distances, except for the trivial case that $k=n-z$ . Under this assumption, the cost of any solution is at most $n\delta\geq 1$ . The number of iterations is at most $O(-\log_{1-\varepsilon/q}(n\delta))=O(\frac{q}{\varepsilon}\log(n\delta))$ , since the cost is reduced to at most $(1-\varepsilon/q)$ times the old cost in each iteration. The number of solutions searched by a swap operation is at most $O((kn)^{\rho})$ , since $|A|=|B|\leq\rho$ . This completes the proof. ∎
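The step from $-\log_{1-\varepsilon/q}(n\delta)$ to $\frac{q}{\varepsilon}\log(n\delta)$ uses $-\ln(1-t)\geq t$ for $t\in(0,1)$ . A minimal numerical illustration, with arbitrary parameter values and a hypothetical helper name:

```python
import math

def iteration_bounds(n_delta, q, eps):
    """Return (exact, relaxed) upper bounds on the number of improving iterations."""
    t = eps / q                                       # requires 0 < t < 1
    exact = math.log(n_delta) / -math.log(1.0 - t)    # -log_{1-t}(n * delta)
    relaxed = (q / eps) * math.log(n_delta)           # (q / eps) * ln(n * delta)
    return exact, relaxed

exact, relaxed = iteration_bounds(n_delta=1e6, q=10, eps=0.1)
assert exact <= relaxed
```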

The algorithm may violate the outlier constraint in order to yield a bounded approximation ratio. We can also bound the number of outliers by a suitable factor, which is shown in the following result.

Theorem 4.3.

The number of outliers returned by LS-Multi-Swap-Outlier( ${\mathcal{X}},{\mathcal{C}},z,k,\rho,q,\varepsilon$ ) is $O\left(\frac{zq}{\varepsilon}\log(n\delta)\right)$ .

Proof.

From the proof of Theorem 4.2, we know that LS-Multi-Swap-Outlier( ${\mathcal{X}},{\mathcal{C}},$ $z,k,\rho,q,\varepsilon$ ) has at most $O\left(\frac{q}{\varepsilon}\log(n\delta)\right)$ iterations. In each iteration, the algorithm removes at most $2z$ outliers. This completes the proof. ∎

Let $(S,P)$ be the solution returned by Algorithm 4.1, and $(S^{*},P^{*})$ be the global optimal solution. As in the penalty version, we use the same notation (except that the outlier version has no penalty cost) and adopt the same partition of $S$ and $S^{*}$ ( $S=\cup_{l}S_{l},\ S^{*}=\cup_{l}S^{*}_{l}$ ). Similar to Lemmas 3.2 and 3.3, we obtain the following two results.

Lemma 4.4.

If $|S_{l}|=|S^{*}_{l}|\leq\rho$ , we have

	$\displaystyle-\dfrac{\varepsilon}{q}\cdot{\rm cost}(S,P)$	$\displaystyle\leq$	$\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+$		(17)
			$\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x)$

for $k$ -MedO, and

	$\displaystyle-\dfrac{\varepsilon}{q}\cdot{\rm cost}(S,P)$	$\displaystyle\leq$	$\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+$		(18)
			$\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})}(1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x)$

for $k$ -MeaO.

Proof.

We only prove it for $k$ -MeaO. The proof for $k$ -MedO is similar. We consider the swap $(S_{l},{\hat{S}}^{*}_{l})$ . Since the swap step of the algorithm produces at most $z$ additional outliers in each iteration, and $|P\setminus\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}|\leq|P|+z$ , we can let the points in $P\setminus\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}$ be the additional outliers after the constructed swap operation. For the other points, it is obvious that we can apply the reassignments in the proof of Lemma 3.2 also here. Then, Proposition 4.1 yields

			$\displaystyle-\dfrac{\varepsilon}{q}\cdot{\rm cost}(S,P)$
		$\displaystyle\leq$	$\displaystyle{\rm cost}(S\setminus S_{l}\cup\hat{S}^{*}_{l},P\cup{\rm outlier}(S\setminus S_{l}\cup\hat{S}^{*}_{l},P))-{\rm cost}(S,P)$
		$\displaystyle\leq$	$\displaystyle-\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}{\rm cost}_{c}(x)$
			$\displaystyle+\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus\left(\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}\right)}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$
			$\displaystyle+\sum\limits_{s^{*}\in S^{*}_{l}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}(d^{2}({\hat{s}^{*}},x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S^{*}_{l}}\sum\limits_{x\in N^{*}(s^{*})\cap P}d^{2}({\hat{s}^{*}},x)$
		$\displaystyle\leq$	$\displaystyle-\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}{\rm cost}_{c}(x)+\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}(1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)$
			$\displaystyle+\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$
			$\displaystyle+\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))$
		$\displaystyle\leq$	$\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$
			$\displaystyle+\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})}(1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x),$

where the third inequality follows from (5).

∎

Lemma 4.5.

For any point $s\in S_{l}\setminus\{s_{l}\}$ and $s^{*}\in S^{*}_{l}$ , we have

	$\displaystyle-\dfrac{\varepsilon}{q}\cdot{\rm cost}(S,P)$	$\displaystyle\leq$	$\displaystyle\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$		(19)
			$\displaystyle+\sum\limits_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum\limits_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x)$

for $k$ -MedO, and

	$\displaystyle-\dfrac{\varepsilon}{q}\cdot{\rm cost}(S,P)$	$\displaystyle\leq$	$\displaystyle\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)$		(20)
			$\displaystyle+\sum\limits_{x\in N^{*}(s^{*})}(1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-\sum\limits_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x)$

for $k$ -MeaO.

Proof.

The proof is similar to those for Lemmas 3.3 and 4.4. ∎

Next we will construct some swap operations for each pair $(S_{l},S^{*}_{l})$ , and then apply Lemmas 4.4 and 4.5 to these swaps. Similar to the analysis for the penalty version, we consider two cases according to the size of $S_{l}$ : $|S_{l}|\leq\rho$ and $|S_{l}|=m_{l}>\rho$ .

Note that the number of constructed swap operations will appear in the coefficient of ${\rm cost}(S,P)$ after summing the inequalities in Lemmas 4.4 and 4.5. We want this number to be as small as possible, since it is proportional to the approximation ratio due to the later analysis. On the other hand, to obtain the entire cost of the solution $(S,P)$ , we need to swap all centers in $S$ at least once. Thus, for the case of $|S_{l}|\leq\rho$ , we consider the same swap operations in the analysis for the penalty version (we state it again in the following Case 1), since each center in $S_{l}$ is swapped exactly once.

For the case of $|S_{l}|=m_{l}>\rho$ , there are $m_{l}(m_{l}-1)$ single-swap operations in the analysis for the penalty version. This makes the coefficient of the cost of $(S^{*},P^{*})$ small ( $m_{l}/(m_{l}-1)\rightarrow 1$ when $m_{l}\rightarrow+\infty$ ), but the number of swaps is large. In this section, we consider two methods to construct swap operations for this case, which are stated in Methods 1 and 2 in the following Case 2. Note that Method 2 is the same as that in Section 3.

Case 1

(cf. Figure 1(a) for $\rho=3$ ). For each $l$ with $|S_{l}|=|S^{*}_{l}|\leq\rho$ , let $S_{l}=\{s_{l}\}$ and $S^{*}_{l}=\{s^{*}_{l}\}$ . We construct the swap $(s_{l},s^{*}_{l})$ for $k$ -MedO, and the swap $(s_{l},{\hat{s}^{*}_{l}})$ for $k$ -MeaO.

Case 2.

For each $l$ with $|S_{l}|=|S^{*}_{l}|=m_{l}>1$ , let $S_{l}=\{s_{l},s_{l,2},\dots,s_{l,m_{l}}\}$ and $S^{*}_{l}=\{s^{*}_{l,1},s^{*}_{l,2},\dots,s^{*}_{l,m_{l}}\}$ .

Method 1

(cf. Figure 2). Set

	$\displaystyle\psi(s^{*}):=\left\{\begin{array}{ll}s_{l,2},&{\rm if}\ s^{*}=s^{*}_{l,1};\\ s_{l,2},&{\rm if}\ s^{*}=s^{*}_{l,2};\\ s_{l,3},&{\rm if}\ s^{*}=s^{*}_{l,3};\\ \dots&\dots\\ s_{l,m_{l}},&{\rm if}\ s^{*}=s^{*}_{l,m_{l}}.\end{array}\right.$

For each $s^{*}\in S^{*}_{l}$ , we construct swap $(\psi(s^{*}),s^{*})$ for $k$ -MedO, and swap $(\psi(s^{*}),{\hat{s}^{*}})$ for $k$ -MeaO.

Method 2

(cf. Figure 1(b)). We consider $(m_{l}-1)m_{l}$ pairs $(s,s^{*})$ with $s\in S_{l}\backslash\{s_{l}\}$ and $s^{*}\in S^{*}_{l}$ . For $k$ -MedO, we construct swap $(s,s^{*})$ for each pair; for $k$ -MeaO, we construct swap $(s,{\hat{s}^{*}})$ for each pair.
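The difference between the two constructions is easiest to see by listing the swaps for a single pair $(S_{l},S^{*}_{l})$ . The sketch below uses placeholder string labels for the centers and assumes that the first entry of `S_l` plays the role of $s_{l}$ ; the function names are illustrative only.

```python
def method1_swaps(S_l, S_star_l):
    """Method 1: pair s*_{l,1} with s_{l,2}, and s*_{l,i} with s_{l,i} for i >= 2."""
    # S_l[0] stands for s_l and is never swapped out; requires len(S_l) >= 2.
    return [(S_l[max(i, 1)], S_star_l[i]) for i in range(len(S_star_l))]

def method2_swaps(S_l, S_star_l):
    """Method 2: all pairs (s, s*) with s in S_l minus {s_l} and s* in S*_l."""
    return [(s, s_star) for s in S_l[1:] for s_star in S_star_l]

S_l = ["s_l", "s_l2", "s_l3", "s_l4"]
S_star_l = ["t1", "t2", "t3", "t4"]
m1 = method1_swaps(S_l, S_star_l)   # m_l swaps in total
m2 = method2_swaps(S_l, S_star_l)   # m_l * (m_l - 1) swaps in total
print(len(m1), len(m2))
```

Method 1 uses only $m_{l}$ swaps but swaps out $s_{l,2}$ twice, whereas Method 2 uses $m_{l}(m_{l}-1)$ swaps and treats every center in $S_{l}\setminus\{s_{l}\}$ symmetrically.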

Combining these swap operations, we obtain the main results for Algorithm 4.1, which are shown in the following two theorems.

Theorem 4.6.

Let $(S,P)$ be the solution returned by LS-Multi-Swap-Outlier( ${\mathcal{X}},$ ${\mathcal{F}},z,k,\rho,q,\varepsilon$ ) for $k$ -MedO. If $(1+k)\varepsilon<q$ , then we have

\displaystyle{\rm cost}(S,P)\leq\frac{5}{1-(1+k)\varepsilon/q}\cdot{\rm cost}(S^{*},P^{*}).

(21)

If $(1+k^{2}-k)\varepsilon<q$ , then we have

\displaystyle{\rm cost}(S,P)\leq\frac{3+2/\rho}{1-(1+k^{2}-k)\varepsilon/q}\cdot{\rm cost}(S^{*},P^{*}).

(22)

Theorem 4.7.

Let ${\mathcal{C}}^{\prime}$ be an $\hat{\varepsilon}$ -approximate centroid set for the data set ${\mathcal{X}}$ , and $(S,P)$ be the solution returned by LS-Multi-Swap-Outlier( ${\mathcal{X}},{\mathcal{C}}^{\prime},z,k,\rho,q,\varepsilon$ ) for $k$ -MeaO. If $(5+\hat{\varepsilon})(1+k)\varepsilon<(9+\hat{\varepsilon})q$ , then we have

\displaystyle{\rm cost}(S,P)\leq\dfrac{5+\hat{\varepsilon}}{\beta_{1}^{2}}\cdot{\rm cost}(S^{*},P^{*})

(23)

where

\beta_{1}=-\frac{2}{\sqrt{5+\hat{\varepsilon}}}+\sqrt{\frac{4}{5+\hat{\varepsilon}}+1-\frac{(1+k)\varepsilon}{q}}.

If $(1+k^{2}-k)\varepsilon/q<(1+1/\rho)^{2}/(3+2/\rho+\hat{\varepsilon})+1$ , then we have

\displaystyle{\rm cost}(S,P)\leq\dfrac{3+2/\rho+\hat{\varepsilon}}{\beta_{2}^{2}}\cdot{\rm cost}(S^{*},P^{*})

(24)

where

\beta_{2}=-\frac{1+1/\rho}{\sqrt{3+2/\rho+\hat{\varepsilon}}}+\sqrt{\frac{(1+1/\rho)^{2}}{3+2/\rho+\hat{\varepsilon}}+1-\frac{(1+k^{2}-k)\varepsilon}{q}}.

Each of these two theorems gives two approximation ratios for Algorithm 4.1. The first one is obtained by Method 1, while the second one is obtained by Method 2.

Proof of Theorem 4.6.

We first prove inequality (21). For Case 2, we use Method 1 to construct swap operations. Note that each point in $S$ is swapped at most twice, and each point in $S^{*}$ is swapped exactly once, implying that the number of constructed swap operations is $|S^{*}|=k$. Summing inequality (19) over these $k$ swaps and using Proposition 4.1, we obtain

$$
\begin{aligned}
-\frac{k\varepsilon}{q}\cdot{\rm cost}(S,P)
&\leq 2\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)\\
&\quad+\sum_{s^{*}\in S^{*}}\left(\sum_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x)\right)\\
&\leq 2\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}\left(d(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)\\
&\quad+\sum_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\\
&\leq 4\,{\rm cost}(S^{*},P^{*})+{\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\\
&=5\,{\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x),
\end{aligned}
\tag{25}
$$

where the third inequality follows from (11) and (13).

Using the definition of ${\rm outlier}(\cdot,\cdot)$ , we obtain

$$
\begin{aligned}
\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)
&\leq\sum_{x\in{\rm outlier}(S,P)}{\rm cost}_{c}(x)\\
&={\rm cost}(S,P)-{\rm cost}(S,P\cup{\rm outlier}(S,P))\\
&\leq\frac{\varepsilon}{q}\cdot{\rm cost}(S,P).
\end{aligned}
\tag{26}
$$

Combining inequalities (25)-(26), we have

\displaystyle 0\leq 5{\rm cost}(S^{*},P^{*})-\left(1-\dfrac{(1+k)\varepsilon}{q}\right){\rm cost}(S,P),

which is equivalent to (21) under the condition that $(1+k)\varepsilon<q$ .

Next we prove inequality (22). For Case 2, we use Method 2 to construct swap operations. Let $L_{1}:=\{l~|~|S_{l}|\leq\rho\}$ and $L_{2}:=\{l~|~|S_{l}|>\rho\}$. Summing inequality (17) with weight 1 and inequality (19) with weight $1/(m_{l}-1)$ over all constructed swap operations, and observing that $m_{l}/(m_{l}-1)\leq(\rho+1)/\rho$, we obtain

$$
\begin{aligned}
&-\sum_{l\in L_{1}}\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)-\sum_{l\in L_{2}}\frac{1}{m_{l}-1}\cdot m_{l}(m_{l}-1)\cdot\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)\\
&\quad\leq\left(1+\frac{1}{\rho}\right)\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)\\
&\qquad+\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x).
\end{aligned}
\tag{27}
$$

Note that there are at most $k(k-1)$ constructed swap operations. It follows from $1/(m_{l}-1)\leq 1$ that

$$
\begin{aligned}
{\rm LHS~of~(27)}
&\geq-\left(|L_{1}|+\sum_{l\in L_{2}}m_{l}(m_{l}-1)\right)\cdot\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)\\
&\geq-\frac{(k^{2}-k)\varepsilon}{q}\cdot{\rm cost}(S,P).
\end{aligned}
\tag{28}
$$

Inequality (11) then yields the following upper bound for the RHS of (27).

$$
\begin{aligned}
{\rm RHS~of~(27)}
&\leq\left(3+\frac{2}{\rho}\right)\sum_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\\
&=\left(3+\frac{2}{\rho}\right){\rm cost}_{c}^{*}-{\rm cost}_{c}+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x).
\end{aligned}
\tag{29}
$$

Combining inequalities (26)-(29), we have

\displaystyle 0\leq\left(3+\frac{2}{\rho}\right){\rm cost}(S^{*},P^{*})-\left(1-\dfrac{(1+k^{2}-k)\varepsilon}{q}\right){\rm cost}(S,P),

which is equivalent to (22) under the condition $(1+k^{2}-k)\varepsilon<q$ .

∎

Proof of Theorem 4.7.

We first use Method 1 for Case 2 to prove (23). Similar to the proof for $k$ -MedO, we have

$$
\begin{aligned}
-\frac{k\varepsilon}{q}\cdot{\rm cost}(S,P)
&\leq 2\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)\\
&\quad+\sum_{x\in\mathcal{X}\setminus P^{*}}(1+\hat{\varepsilon}){\rm cost}_{c}^{*}(x)-\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\\
&\leq 4\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}^{*}_{c}(x)\\
&\quad+4\sqrt{\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}^{*}_{c}(x)}\cdot\sqrt{\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}_{c}(x)}\\
&\quad+\sum_{x\in\mathcal{X}\setminus P^{*}}(1+\hat{\varepsilon}){\rm cost}_{c}^{*}(x)-\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\\
&\leq 4\sqrt{{\rm cost}(S^{*},P^{*})}\cdot\sqrt{{\rm cost}(S,P)}\\
&\quad+(5+\hat{\varepsilon}){\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\frac{\varepsilon}{q}\cdot{\rm cost}(S,P),
\end{aligned}
\tag{30}
$$

where the second inequality follows from Lemma 3.1 (this lemma still holds for the outlier version of $k$ -means), and the third inequality follows from (26).

When $(5+\hat{\varepsilon})(1+k)\varepsilon<(9+\hat{\varepsilon})q$ , it follows by factorization that inequality (30) is equivalent to

$$
\begin{aligned}
0\leq{}&\left(\sqrt{(5+\hat{\varepsilon}){\rm cost}(S^{*},P^{*})}+\alpha\sqrt{{\rm cost}(S,P)}\right)\\
&\times\left(\sqrt{(5+\hat{\varepsilon}){\rm cost}(S^{*},P^{*})}-\beta_{1}\sqrt{{\rm cost}(S,P)}\right),
\end{aligned}
\tag{31}
$$

where

$$
\begin{aligned}
\alpha&=\frac{2}{\sqrt{5+\hat{\varepsilon}}}+\sqrt{\frac{4}{5+\hat{\varepsilon}}+1-\frac{(1+k)\varepsilon}{q}},\\
\beta_{1}&=-\frac{2}{\sqrt{5+\hat{\varepsilon}}}+\sqrt{\frac{4}{5+\hat{\varepsilon}}+1-\frac{(1+k)\varepsilon}{q}}.
\end{aligned}
$$

Since the first factor on the RHS of (31) is non-negative (and vanishes only if both costs are zero), we obtain

\sqrt{\left(5+\hat{\varepsilon}\right){\rm cost}(S^{*},P^{*})}-\beta_{1}\sqrt{{\rm cost}(S,P)}\geq 0,

which gives (23).
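The factorization step can be checked directly. The derivation below is a sketch that uses shorthand not present in the original: the symbols $a$, $\gamma$, $u$ and $v$ are introduced here only for illustration.

```latex
\begin{aligned}
&a:=5+\hat{\varepsilon},\quad
 \gamma:=1-\frac{(1+k)\varepsilon}{q},\quad
 u:=\sqrt{a\,{\rm cost}(S^{*},P^{*})},\quad
 v:=\sqrt{{\rm cost}(S,P)},\\
&\text{the outer inequality of (30) rearranges to}\quad
 0\leq u^{2}+\frac{4}{\sqrt{a}}\,uv-\gamma v^{2},\\
&\alpha-\beta_{1}=\frac{4}{\sqrt{a}},\qquad
 \alpha\beta_{1}=\left(\frac{4}{a}+\gamma\right)-\frac{4}{a}=\gamma,\\
&(u+\alpha v)(u-\beta_{1}v)
 =u^{2}+(\alpha-\beta_{1})\,uv-\alpha\beta_{1}v^{2}
 =u^{2}+\frac{4}{\sqrt{a}}\,uv-\gamma v^{2}.
\end{aligned}
```

In this notation the condition $(5+\hat{\varepsilon})(1+k)\varepsilon<(9+\hat{\varepsilon})q$ is the same as $4/a+\gamma>0$, which is what makes the square root in $\alpha$ and $\beta_{1}$ well defined.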

Next we prove inequality (24). For Case 2, we use Method 2 to construct swap operations. Similar to the proof of Theorem 4.6, summing the $k$-MeaO inequality of Lemma 4.4 with weight 1 and inequality (20) with weight $1/(m_{l}-1)$ over all constructed swap operations implies that

$$
\begin{aligned}
&-\sum_{l\in L_{1}}\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)-\sum_{l\in L_{2}}\frac{1}{m_{l}-1}\cdot m_{l}(m_{l}-1)\cdot\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)\\
&\quad\leq\left(1+\frac{1}{\rho}\right)\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)\\
&\qquad+\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})}(1+\hat{\varepsilon}){\rm cost}_{c}^{*}(x)-\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x).
\end{aligned}
\tag{32}
$$

Because of Lemma 3.1, the RHS of (32) is bounded from above by

$$
\begin{aligned}
{\rm RHS~of~(32)}
&\leq\left(3+\frac{2}{\rho}+\hat{\varepsilon}\right)\sum_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\\
&\quad+2\left(1+\frac{1}{\rho}\right)\sqrt{\sum_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)}\sqrt{\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)}\\
&=\left(3+\frac{2}{\rho}+\hat{\varepsilon}\right){\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\\
&\quad+2\left(1+\frac{1}{\rho}\right)\sqrt{{\rm cost}(S^{*},P^{*})\,{\rm cost}(S,P)}.
\end{aligned}
\tag{33}
$$

Combining inequalities (26), (28), (32) and (33), we have

$$
\begin{aligned}
0\leq{}&\left(3+\frac{2}{\rho}+\hat{\varepsilon}\right){\rm cost}(S^{*},P^{*})-\left(1-\frac{(1+k^{2}-k)\varepsilon}{q}\right){\rm cost}(S,P)\\
&+2\left(1+\frac{1}{\rho}\right)\sqrt{{\rm cost}(S^{*},P^{*})\,{\rm cost}(S,P)}.
\end{aligned}
$$

Using the factorization for this inequality and the condition in this theorem, we obtain the desired result.

∎

Consequently, we have the following corollaries that specify the tradeoff between the approximation ratio and the outlier blowup.

Corollary 4.8.

There exists a bi-criteria $(5+\varepsilon,O(\frac{k}{\varepsilon}\log(n\delta)))$ -, and a bi-criteria $(3+\varepsilon,O(\frac{k^{2}}{\varepsilon}\log(n\delta)))$ -approximation algorithm for $k$ -MedO.

Proof.

If $q\geq k+1$ , then

\frac{5}{1-(1+k)\varepsilon/q}\leq\frac{5}{1-\varepsilon}\sim 5+O(\varepsilon).

If $q\geq k^{2}-k+1$ and $\rho\geq 2/O(\varepsilon)$ , then

\frac{3+2/\rho}{1-(1+k^{2}-k)\varepsilon/q}\leq\frac{3+O(\varepsilon)}{1-\varepsilon}\sim 3+O(\varepsilon).

Combining the above results, Theorems 4.3 and 4.6 complete the proof. ∎
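The parameter choices in this proof can be illustrated numerically. The Python sketch below is only an illustration under stated assumptions: it takes $\rho=\lceil 2/\varepsilon\rceil$ as one concrete choice satisfying $2/\rho\leq\varepsilon$, and the function names and the sample values $k=10$, $\varepsilon=0.05$ are hypothetical.

```python
import math


def medo_parameters(k, eps):
    """Return (q, rho) with q = k^2 - k + 1 and rho = ceil(2 / eps) (assumed choice)."""
    return k * k - k + 1, math.ceil(2.0 / eps)


def ratio_method1(k, q, eps):
    # Factor on the right of (21); needs (1 + k) * eps < q.
    return 5.0 / (1.0 - (1 + k) * eps / q)


def ratio_method2(k, rho, q, eps):
    # Factor on the right of (22); needs (1 + k^2 - k) * eps < q.
    return (3.0 + 2.0 / rho) / (1.0 - (1 + k * k - k) * eps / q)


k, eps = 10, 0.05  # illustrative values only
q2, rho = medo_parameters(k, eps)
r1 = ratio_method1(k, k + 1, eps)      # q = k + 1
r2 = ratio_method2(k, rho, q2, eps)    # q = k^2 - k + 1
print(r1, r2)
```

With these choices the first factor is bounded by $5/(1-\varepsilon)$ and the second by $(3+\varepsilon)/(1-\varepsilon)$, at the price of a larger $q$ for the second.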

Corollary 4.9.

There exists a bi-criteria $(25+\varepsilon,O(\frac{k}{\varepsilon}\log(n\delta)))$ -, and a bi-criteria $(9+\varepsilon,O(\frac{k^{2}}{\varepsilon}\log(n\delta)))$ -approximation algorithm for $k$ -MeaO.

Proof.

Recall the definitions of $\beta_{1}$ and $\beta_{2}$ in Theorem 4.7. We then have

\dfrac{5+\hat{\varepsilon}}{\beta_{1}^{2}}\sim 25+O(\varepsilon+\hat{\varepsilon})

when $q=k+1$ , and

\dfrac{3+2/\rho+\hat{\varepsilon}}{\beta_{2}^{2}}\sim 9+O(\varepsilon+\hat{\varepsilon})

when $q=k^{2}-k+1$ and $\rho$ is sufficiently large. Combining these results with Theorems 4.3 and 4.7 completes the proof. ∎
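The limits $25$ and $9$ can be checked numerically from the definitions of $\beta_{1}$ and $\beta_{2}$ in Theorem 4.7. The following Python sketch is an illustration only: the helper `beta` and the sample parameter values are assumptions made here, and setting $\hat{\varepsilon}=\varepsilon$ in the loop is purely for convenience.

```python
import math


def beta(a, b, gamma):
    """Common shape of beta_1 and beta_2: -b / sqrt(a) + sqrt(b^2 / a + gamma)."""
    return -b / math.sqrt(a) + math.sqrt(b * b / a + gamma)


def ratio_method1(k, q, eps, eps_hat):
    # Factor in (23): a = 5 + eps_hat, b = 2, gamma = 1 - (1 + k) * eps / q.
    a = 5.0 + eps_hat
    return a / beta(a, 2.0, 1.0 - (1 + k) * eps / q) ** 2


def ratio_method2(k, rho, q, eps, eps_hat):
    # Factor in (24): a = 3 + 2/rho + eps_hat, b = 1 + 1/rho,
    # gamma = 1 - (1 + k^2 - k) * eps / q.
    a = 3.0 + 2.0 / rho + eps_hat
    return a / beta(a, 1.0 + 1.0 / rho, 1.0 - (1 + k * k - k) * eps / q) ** 2


k = 10  # illustrative value
for eps in (0.1, 0.01, 0.001):
    r1 = ratio_method1(k, k + 1, eps, eps)
    r2 = ratio_method2(k, 10**6, k * k - k + 1, eps, eps)
    print(eps, r1, r2)
```

As $\varepsilon$ and $\hat{\varepsilon}$ shrink, the printed values approach $25$ and $9$ from above, since $\beta_{1}\to 1/\sqrt{5}$ and, for large $\rho$, $\beta_{2}\to 1/\sqrt{3}$.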

5 Conclusions

Previous analyses of local search algorithms for the robust $k$-median/$k$-means problems use only the individual form, in which the connections constructed between the local and global optimal solutions are individual for each point. This has the disadvantage that the joint information about outliers remains hidden. In this paper, we develop a cluster form analysis and define the adapted cluster, which captures the outlier information. This new technique works better than the previous analysis methods for local search algorithms: it improves the approximation ratios of local search algorithms for $k$-MeaP, $k$-MeaO and $k$-MedO, and matches the best known ratio for $k$-MedP.

We believe that our new technique will also work for the robust FLP, since the structure of FLP is similar to that of $k$-median/$k$-means. Our technique also seems promising for the robust $k$-center problem, and more generally for any local-search-based algorithm for robust clustering problems.

6 Acknowledgments

The first author is supported by the NSFC under Grant No. 12001039. The second author is supported by the Science Foundation of the Anhui Education Department under Grant No. KJ2019A0834. The third author is supported by the NSFC under Grant No. 11971349. The fourth and fifth authors are supported by the NSFC under Grant No. 11871081. The fourth author is also supported by Beijing Natural Science Foundation Project under Grant No. Z200002.

References

Ahmadian et al. (2017) Ahmadian S, Norouzi-Fard A, Svensson O, Ward J (2017) Better guarantees for k-means and euclidean k-median by primal-dual algorithms. Proceedings of the 58th Annual Symposium on Foundations of Computer Science, 61–72 (IEEE).
Arora et al. (1998) Arora S, Raghavan P, Rao S (1998) Approximation schemes for euclidean k-medians and related problems. Proceedings of the 30th Annual ACM Symposium on Theory of Computing, 106–113.
Arya et al. (2004) Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for k-median and facility location problems. SIAM Journal on Computing 33(3):544–562.
Bernstein et al. (2019) Bernstein F, Modaresi S, Sauré D (2019) A dynamic clustering approach to data-driven assortment personalization. Management Science 65(5):2095–2115.
Borgwardt and Happach (2019) Borgwardt S, Happach F (2019) Good clusterings have large volume. Operations Research 67(1):215–231.
Byrka et al. (2014) Byrka J, Pensyl T, Rybicki B, Srinivasan A, Trinh K (2017) An improved approximation for k-median, and positive correlation in budgeted optimization. ACM Transactions on Algorithms 13(2):23:1–23:31.
Charikar et al. (2002) Charikar M, Guha S, Tardos É, Shmoys DB (2002) A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences 65(1):129–149.
Charikar et al. (2001) Charikar M, Khuller S, Mount DM, Narasimhan G (2001) Algorithms for facility location problems with outliers. Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms, 642–651 (Society for Industrial and Applied Mathematics).
Charikar and Li (2012) Charikar M, Li S (2012) A dependent lp-rounding approach for the k-median problem. International Colloquium on Automata, Languages, and Programming, 194–205 (Springer).
Chen (2008) Chen K (2008) A constant factor approximation algorithm for k-median clustering with outliers. Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 826–835.
Cohen-Addad and Karthik (2019) Cohen-Addad V, Karthik C (2019) Inapproximability of clustering in lp metrics. Proceedings of the 60th Annual Symposium on Foundations of Computer Science, 519–539 (IEEE).
Cohen-Addad et al. (2019) Cohen-Addad V, Klein PN, Mathieu C (2019) Local search yields approximation schemes for k-means and k-median in euclidean and minor-free metrics. SIAM Journal on Computing 48(2):644–667.
Feng et al. (2019) Feng Q, Zhang Z, Shi F, Wang J (2019) An improved approximation algorithm for the k-means problem with penalties. Proceedings of the 2nd Annual International Workshop on Frontiers in Algorithmics, 170–181 (Springer).
Friggstad et al. (2019a) Friggstad Z, Khodamoradi K, Rezapour M, Salavatipour MR (2019a) Approximation schemes for clustering with outliers. ACM Transactions on Algorithms 15(2):1–26.
Friggstad et al. (2019b) Friggstad Z, Rezapour M, Salavatipour MR (2019b) Local search yields a ptas for k-means in doubling metrics. SIAM Journal on Computing 48(2):452–480.
Gupta et al. (2017) Gupta S, Kumar R, Lu K, Moseley B, Vassilvitskii S (2017) Local search methods for k-means with outliers. Proceedings of the 43rd International Conference on Very Large Data Bases 10(7):757–768.
Hajiaghayi et al. (2012) Hajiaghayi M, Khandekar R, Kortsarz G (2012) Local search algorithms for the red-blue median problem. Algorithmica 63(4):795–814.
Hochbaum and Liu (2018) Hochbaum DS, Liu S (2018) Adjacency-clustering and its application for yield prediction in integrated circuit manufacturing. Operations Research 66(6):1571–1585.
Jain et al. (2003) Jain K, Mahdian M, Markakis E, Saberi A, Vazirani VV (2003) Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM 50(6):795–824.
Jain and Vazirani (2001) Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and lagrangian relaxation. Journal of the ACM 48(2):274–296.
Kanungo et al. (2004) Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2004) A local search approximation algorithm for k-means clustering. Computational Geometry 28(2-3):89–112.
Korupolu et al. (2000) Korupolu MR, Plaxton CG, Rajaraman R (2000) Analysis of a local search heuristic for facility location problems. Journal of Algorithms 37(1):146–188.
Krishnaswamy et al. (2018) Krishnaswamy R, Li S, Sandeep S (2018) Constant approximation for k-median and k-means with outliers via iterative rounding. Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 646–659.
Li (2013) Li S (2013) A 1.488 approximation algorithm for the uncapacitated facility location problem. Information and Computation 222:45–58.
Li and Svensson (2016) Li S, Svensson O (2016) Approximating k-median via pseudo-approximation. SIAM Journal on Computing 45(2):530–547.
Li et al. (2013) Li Y, Shu J, Wang X, Xiu N, Xu D, Zhang J (2013) Approximation algorithms for integrated distribution network design problems. INFORMS Journal on Computing 25(3):572–584.
Lloyd (1982) Lloyd S (1982) Least squares quantization in pcm. IEEE Transactions on Information Theory 28(2):129–137.
Lu and Wedig (2013) Lu SF, Wedig GJ (2013) Clustering, agency costs and operating efficiency: Evidence from nursing home chains. Management Science 59(3):677–694.
Mahdian et al. (2006) Mahdian M, Ye Y, Zhang J (2006) Approximation algorithms for metric facility location problems. SIAM Journal on Computing 36(2):411–432.
Makarychev et al. (2016) Makarychev K, Makarychev Y, Sviridenko M, Ward J (2016) A bi-criteria approximation algorithm for k-means. Proceedings of the 19th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX), and the 20th International Workshop on Randomization and Computation (RANDOM), 14:1–14:20 (Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik).
Matoušek (2000) Matoušek J (2000) On approximate geometric k-clustering. Discrete & Computational Geometry 24(1):61–84.
Ni et al. (2020) Ni W, Shu J, Song M, Xu D, Zhang K (2020) A branch-and-price algorithm for facility location with general facility cost functions. INFORMS Journal on Computing URL http://dx.doi.org/10.1287/ijoc.2019.0921.
Wang et al. (2015) Wang Y, Xu D, Du D, Wu C (2015) Local search algorithms for k-median and k-facility location problems with linear penalties. Proceedings of the 9th Annual International Conference on Combinatorial Optimization and Applications, 60–71 (Springer).
Wu et al. (2018) Wu C, Du D, Xu D (2018) An approximation algorithm for the k-median problem with uniform penalties via pseudo-solution. Theoretical Computer Science 749:80–92.
Zhang et al. (2019) Zhang D, Hao C, Wu C, Xu D, Zhang Z (2019) Local search approximation algorithms for the k-means problem with penalties. Journal of Combinatorial Optimization 37(2):439–453.
Zhang (2007) Zhang P (2007) A new approximation algorithm for the k-facility location problem. Theoretical Computer Science 384(1):126–135.
