
Outliers Detection Is Not So Hard: Approximation Algorithms for Robust Clustering Problems Using Local Search Techniques

Yishui Wang School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, P.R. China. Email: wangys@ustb.edu.cn.    Rolf H. Möhring Institute for Applied Optimization, Department of Computer Science and Technology, Hefei University, P.R. China, and The Combinatorial Optimization and Graph Algorithms (COGA) group, Institute for Mathematics, Technical University of Berlin, Germany. Email: Rolf.Moehring@tu-berlin.de.    Chenchen Wu Corresponding author. College of Science, Tianjin University of Technology, Tianjin 300384, P.R. China. Email: wu_chenchen_tjut@163.com.    Dachuan Xu Department of Operations Research and Information Engineering, Beijing University of Technology, Beijing 100124, P.R. China. Email: xudc@bjut.edu.cn.    Dongmei Zhang School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, P.R. China. Email: zhangdongmei@sdjzu.edu.cn.

In this paper, we consider two types of robust models of the k-median/k-means problem: the outlier-version (k-MedO/k-MeaO) and the penalty-version (k-MedP/k-MeaP), in both of which we may mark some points as outliers and discard them. In k-MedO/k-MeaO, the number of outliers is bounded by a given integer. In k-MedP/k-MeaP, the number of outliers is not bounded, but each outlier incurs a penalty cost. We develop a new technique for analyzing the approximation ratio of local search algorithms for these two problems by introducing an adapted cluster that captures useful information about the outliers in the local and the global optimal solutions. For k-MeaP, we improve the best known approximation ratio based on local search from 25+ε to 9+ε. For k-MedP, we obtain the best known approximation ratio. For k-MedO/k-MeaO, only two bi-criteria approximation algorithms based on local search exist. One violates the outlier constraint (the constraint on the number of outliers), while the other violates the cardinality constraint (the constraint on the number of clusters). We consider the former algorithm and improve its approximation ratios from 17+ε to 3+ε for k-MedO, and from 274+ε to 9+ε for k-MeaO.

1 Introduction

Using large data sets to make better decisions is becoming more important and routinely applied in Operations Research, Management Science, Biology, Computer Science, and Machine Learning (see e.g. Bernstein et al. 2019, Borgwardt and Happach 2019, Hochbaum and Liu 2018, Lu and Wedig 2013). Clustering large data is a fundamental problem in data analytics. Among the many clustering types, center-based clustering is the most popular and widely used one. Center-based clustering problems include the k-median, k-means, k-center, and facility location problems, among others (see e.g. Ahmadian et al. 2017, Byrka et al. 2014, Lloyd 1982, Li 2013, Li et al. 2013, Ni et al. 2020). The k-median and k-means problems are the most basic and classic clustering problems. The goal of k-median/k-means clustering is to find k cluster centers such that the total (squared) distance from each input point to its closest cluster center is minimized. Usually, one considers the k-median problem in arbitrary metrics and the k-means problem in the Euclidean space ℝ^D.

Both problems are NP-hard to approximate within factors better than 1+2/e ≈ 1.736 (Jain et al. 2003) for k-median and 1.07 (Cohen-Addad and Karthik 2019) for k-means. There are many papers on designing efficient approximation algorithms. The best known approximation ratios are 2.675+ε (Byrka et al. 2014) and 6.357+ε (Ahmadian et al. 2017) for k-median and k-means, respectively. If we restrict to a fixed-dimensional Euclidean space, the k-median and k-means problems admit a PTAS (see Arora et al. 1998, Cohen-Addad et al. 2019, Friggstad et al. 2019b).

However, real-world data sets often contain outliers, which may totally spoil the k-median/k-means clustering results. To overcome this problem, robust clustering techniques have been developed that avoid being affected by outliers. In general, there are two types of robust formulations: k-median/k-means with outliers (k-MedO/k-MeaO) and k-median/k-means with penalties (k-MedP/k-MeaP). We formally define these problems as follows.

Definition 1.1 (k-Median Problem with Outliers/Penalties).

In the k-median problem with outliers (k-MedO), we are given a client set 𝒳 of n points, a facility set ℱ of m points, a metric space (𝒳∪ℱ, d), and two positive integers k < m and z < n. The aim is to find a subset S ⊆ ℱ of cardinality at most k and an outlier set P ⊆ 𝒳 of cardinality at most z such that the objective function ∑_{x∈𝒳∖P} min_{s∈S} d(x, s) is minimized. In the k-median problem with penalties (k-MedP), we have the same input except that the cardinality restriction on the penalty set P ⊆ 𝒳 is replaced by a nonnegative penalty p_x for each x∈𝒳, and the objective is to minimize ∑_{x∈𝒳∖P} min_{s∈S} d(x, s) + ∑_{x∈P} p_x.

Definition 1.2 (k-Means Problem with Outliers/Penalties).

In the k-means problem with outliers (k-MeaO), we are given a data set 𝒳 ⊆ ℝ^d of n points and two positive integers k and z < n. Let d(u, v) := ‖u − v‖₂ be the Euclidean distance between the points u and v. The aim is to find a cluster center set S ⊆ ℝ^d of cardinality at most k and an outlier set P ⊆ 𝒳 of cardinality at most z such that the objective function ∑_{x∈𝒳∖P} min_{s∈S} d(x, s)² is minimized. In the k-means problem with penalties (k-MeaP), we have the same input except that the cardinality restriction on the penalty set P ⊆ 𝒳 is replaced by a nonnegative penalty p_x for each x∈𝒳, and the objective is to minimize ∑_{x∈𝒳∖P} min_{s∈S} d(x, s)² + ∑_{x∈P} p_x.
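To make the two objectives concrete, here is a minimal sketch in Python; the helper names (`sq_dist`, `kmeao_cost`, `kmeap_cost`) and the toy instance are our own illustration, not part of the formal model.

```python
def sq_dist(a, b):
    # squared Euclidean distance between two points given as tuples
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeao_cost(X, S, z):
    # k-MeaO objective: discard the z points with the largest squared
    # distance to their nearest center, sum the remaining costs
    costs = sorted(min(sq_dist(x, s) for s in S) for x in X)
    return sum(costs[:len(costs) - z])

def kmeap_cost(X, S, p):
    # k-MeaP objective: each point pays the cheaper of its connection
    # cost and its penalty; the per-point minimum realizes the optimal
    # penalized set P for the fixed center set S
    return sum(min(min(sq_dist(x, s) for s in S), p[x]) for x in X)

X = [(0, 0), (1, 0), (10, 0)]                # toy data set
S = [(0, 0)]                                 # a single center (k = 1)
print(kmeao_cost(X, S, z=1))                 # (10, 0) is the outlier: prints 1
print(kmeap_cost(X, S, {x: 5 for x in X}))   # 0 + 1 + 5: prints 6
```

The k-MedO/k-MedP objectives are obtained by replacing the squared distance with the metric distance.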

From the perspective of clustering, we can view a facility in the k-median problem as a center and a client as a point. To avoid confusion, we use “center” and “point” for all problems considered in this paper.

Basic versions of these problems have been widely studied, and many approximation algorithms based on many different techniques, including LP-rounding (see e.g. Charikar et al. 2002, Charikar and Li 2012, Li 2013), primal-dual (see e.g. Ahmadian et al. 2017, Jain and Vazirani 2001), dual-fitting (see e.g. Jain et al. 2003, Mahdian et al. 2006), local search (see e.g. Arya et al. 2004, Kanungo et al. 2004, Korupolu et al. 2000), Lagrangian relaxation (see e.g. Jain et al. 2003, Jain and Vazirani 2001), bi-point rounding (see e.g. Jain et al. 2003, Jain and Vazirani 2001), and pseudo-approximation (see e.g. Byrka et al. 2014, Li and Svensson 2016), have been developed and applied.

We now discuss the state-of-the-art approximation results for the robust versions of k-median/k-means. Chen (2008) has presented the first constant-factor (though very large) approximation algorithm for k-MedO via successive local search. Krishnaswamy et al. (2018) have obtained an iterative LP rounding framework yielding (7.081+ε)- and (53.002+ε)-approximation algorithms for k-MedO and k-MeaO, respectively. To the best of our knowledge, these are the only two constant-factor approximation results for k-MedO and k-MeaO.

The first constant-factor approximation for k-MedP, with ratio 4, has been given by Charikar et al. (2001) using the Lagrangian relaxation framework of Jain and Vazirani (2001). The best (3+ε)-approximation for k-MedP has been obtained by Hajiaghayi et al. (2012), who called the problem the red-blue median problem. Three years later, Wang et al. (2015) have independently obtained the same approximation factor for k-MedP and have further generalized it to a (3.732+ε)-approximation for the k-facility location problem with linear penalties, which is a common generalization of facility location (in which there are facility opening costs and no cardinality constraint) and k-MedP. Both of them use local search techniques. Zhang (2007) has obtained the approximation ratio 3.732+ε for the k-facility location problem (k-FLP). The currently best ratio of 3.25 for k-FLP is due to Charikar and Li (2012). For the k-median problem with uniform penalties, Wu et al. (2018) have adapted the pseudo-approximation technique of Li and Svensson (2016) and obtained a (2.732+ε)-approximation.

Zhang et al. (2019) have presented the first constant-factor approximation algorithm for k-MeaP, a (25+ε)-approximation based on local search. Feng et al. (2019) have improved this to a (19.849+ε)-approximation by combining Lagrangian relaxation with bi-point rounding. A summary of the up-to-date approximation results for k-MedO/k-MeaO and k-MedP/k-MeaP, along with their ordinary versions, is given in Table 1.

Table 1: Comparison of (robust) clustering problems.
Techniques and reference: approximation ratio (problem)
LP rounding (Charikar et al. 2002): 6⅔ (k-median)
Lagrangian relaxation (Jain and Vazirani 2001): 6 (k-median), 108 (k-means)
Lagrangian relaxation (Charikar et al. 2001): 4 (k-MedP)
Lagrangian relaxation (Jain et al. 2003): 4 (k-median)
Local search (Arya et al. 2004): 3+ε (k-median)
Local search (Kanungo et al. 2004): 9+ε (k-means)
Successive local search (Chen 2008): constant (k-MedO)
Dependent LP rounding (Charikar and Li 2012): 3.25 (k-median)
Local search (Hajiaghayi et al. 2012): 3+ε (k-MedP)
Pseudo-approximation (Li and Svensson 2016): 2.732+ε (k-median)
Pseudo-approximation (Byrka et al. 2014): 2.675+ε (k-median)
Iterative LP rounding (Krishnaswamy et al. 2018): 7.081+ε (k-MedO), 53.002+ε (k-MeaO)
Primal-dual (Ahmadian et al. 2017): 6.357+ε (k-means)
Local search (Zhang et al. 2019): 25+ε (k-MeaP)
Bipoint rounding (Feng et al. 2019): 19.849+ε (k-MeaP)

The available literature suggests two observations concerning the approximation factor: i) k-MedP/k-MeaP seems easier to approximate than k-MedO/k-MeaO; ii) the existence of outliers makes the approximation of the robust clustering problems much harder than that of the ordinary clustering problems.

The best known approximation ratios for k-MedO and k-MeaO have been obtained by LP-rounding, but these algorithms are not strongly polynomial-time since they involve solving linear programs. Concerning time complexity, local search is better than LP-rounding, and this technique has been successfully applied to k-median/k-means and their penalty versions. Furthermore, the standard local search algorithm is also used for k-median/k-means in some special metrics such as minor-free metrics (Cohen-Addad et al. 2019) and doubling metrics (Friggstad et al. 2019b). These two papers show that the standard local search scheme yields a PTAS for the considered problems when the dimension is fixed. Their results hold in particular for the Euclidean metric, since both minor-free metrics and doubling metrics are extensions of the Euclidean metric.

Unfortunately, the standard local search algorithm for k-MedO/k-MeaO cannot produce a feasible solution with a bounded approximation ratio (Friggstad et al. 2019a). Some research has therefore focused on bi-criteria approximation algorithms based on local search for these two problems. These algorithms have a bounded approximation ratio but violate either the cardinality constraint or the outlier constraint by a bounded factor. Gupta et al. (2017) have developed a method for addressing outliers in a local search algorithm, yielding a bi-criteria (274+ε, O((k/ε)log(nδ)))-approximation algorithm (δ as defined in Section 1.2) that violates the outlier constraint. Friggstad et al. (2019a) have provided (3+ε, 1+ε)- and (25+ε, 1+ε)-local search bi-criteria approximation algorithms for k-MedO and k-MeaO, respectively.

We will consider the standard local search algorithm for k-MedP/k-MeaP, and the outlier-based local search algorithm of Gupta et al. (2017) for k-MedO/k-MeaO. Using our new technique, we will improve the approximation ratios for k-MeaP, k-MeaO and k-MedO. For k-MedP, we obtain the same approximation ratio, which is the best known one.

We list the related results on local search algorithms for k-MedO/k-MeaO and k-MedP/k-MeaP in Table 2.

Table 2: Local search algorithms for (robust) clustering problems. The # centers blowup is the factor by which the cardinality constraint is violated; the # outliers blowup is the factor by which the outlier constraint is violated.
Reference: problem, ratio, # centers blowup, # outliers blowup
Arya et al. (2004): k-median, 3+ε, none, none
Kanungo et al. (2004): k-means, 9+ε, none, none
Chen (2008): k-MedO, constant, none, none
Hajiaghayi et al. (2012): k-MedP, 3+ε, none, none
Zhang et al. (2019): k-MeaP, 25+ε, none, none
Cohen-Addad et al. (2019): k-median/k-means in minor-free metrics with fixed dimension, PTAS, none, none
Friggstad et al. (2019b): k-median/k-means with fixed doubling dimension, PTAS, none, none
Friggstad et al. (2019a): k-MedO, 3+ε, 1+ε, none; k-MeaO, 25+ε, 1+ε, none
Gupta et al. (2017): k-MedO, 17+ε, none, O(k log(nδ)/ε); k-MeaO, 274+ε, none, O(k log(nδ)/ε)
Our results: k-MedP, 3+ε, none, none; k-MeaP, 9+ε, none, none; k-MedO, 5+ε, none, O(k log(nδ)/ε); k-MedO, 3+ε, none, O(k² log(nδ)/ε); k-MeaO, 25+ε, none, O(k log(nδ)/ε); k-MeaO, 9+ε, none, O(k² log(nδ)/ε)

1.1 Our techniques

We concentrate on k-MedP and k-MeaP to illustrate our techniques. The associated outlier versions are then easy generalizations.

In the standard local search algorithm, one starts from an arbitrary feasible solution. Operations such as adding a center, deleting a center, or swapping centers define the neighborhood of the current feasible solution. One then searches the whole neighborhood for the best solution and takes it as the new current solution. This is iterated until the improvement becomes sufficiently small.

Similar to previous analyses of local search algorithms for k-median and k-means, we want to find valid inequalities by constructing swap operations in order to establish “connections” between the local and global optimal solutions. Integrating all these inequalities or connections carefully, we can bound the cost of the local optimal solution by the global optimal cost.

In the analysis of k-median (see Arya et al. 2004), these connections are given individually for each point (that is, each point yields an inequality that bounds its cost after the constructed swap operation). We call this type of analysis an “individual form”.

Another analysis type is the “cluster form”, in which the connections between the local and global optimal solutions are revealed for clusters containing several points. The cluster form analysis was first used for k-means by Kanungo et al. (2004), who use the Centroid Lemma (introduced in Section 2.2) to obtain an equality for each cluster in the optimal solution, and then deduce the approximation ratio from these equalities and the triangle inequality. They found that the cluster form analysis is tighter than the individual form. However, the same analysis does not apply to k-MeaP due to the existence of outliers. Indeed, the clusters for which the Centroid Lemma yields equalities must contain no outliers of either the local or the global optimal solution, since outliers incur no connection cost.

To this end, we carefully recognize and define an adapted cluster as a cluster that excludes outliers. In order to use the Centroid Lemma, we identify a new centroid for each adapted cluster and use the triangle inequality for squared distances to identify the associated centroid of the global solution in the analysis. These new centroids can be found by a carefully defined mapping function.

Our cluster form analysis also applies to k-MedP, although there is no result like the Centroid Lemma for this problem. In fact, we only need to denote the optimal center of the adapted cluster for k-MedP (corresponding to the centroid in the analysis for k-MeaP) and use its optimality to derive the inequality for the adapted cluster. During the entire analysis, we never compute this optimal center, so we do not need a result like the Centroid Lemma.

Our cluster form analysis establishes a bridge between local and global solutions for both robust and ordinary clusterings, and we obtain a clear and unified understanding of them. Furthermore, we believe that our technique can be generalized to other robust clustering problems such as the robust facility location and kk-center problems.

1.2 Our contributions

We use the standard local search algorithm for k-MedP and k-MeaP. Via a subtle cluster form analysis, we obtain the following result.

Theorem 1.3.

The standard local search algorithm yields (3+ε)- and (9+ε)-approximations for k-MedP and k-MeaP, respectively.

Our analysis is different from that of Hajiaghayi et al. (2012), who have also obtained a local search (3+ε)-approximation for k-MedP. For k-MeaP, we improve the previous local search (25+ε)-approximation (Zhang et al. 2019) and the (19.849+ε)-approximation (Feng et al. 2019). Moreover, our result indicates that the penalty versions of these clustering problems have the same approximation ratios as the ordinary versions when we adopt the local search technique together with our cluster form analysis.

For k-MedO and k-MeaO, we use the outlier-based local search algorithm (based on Gupta et al. 2017).

The algorithm has a parameter controlling the descending step length of the cost in each iteration. This parameter is fixed in Gupta et al. (2017), while it is an input to our algorithm, because both the approximation ratio and the outlier blowup depend on its value. This allows us to reveal a tradeoff between the approximation ratio and the outlier blowup. By selecting appropriate values for this parameter, we can obtain constant approximation ratios. In the following theorems, δ denotes the maximal distance between two points in the data set.

Theorem 1.4.

The outlier-based local search algorithm yields bi-criteria (5+ε, O((k/ε)log(nδ)))- and (3+ε, O((k²/ε)log(nδ)))-approximations for k-MedO, and bi-criteria (25+ε, O((k/ε)log(nδ)))- and (9+ε, O((k²/ε)log(nδ)))-approximations for k-MeaO, where O((k/ε)log(nδ)) and O((k²/ε)log(nδ)) are the factors by which the outlier constraint is violated.

With the same outlier blowup, our ratios obtained with single-swap significantly improve the previous ratios 17+ε and 274+ε for k-MedO and k-MeaO, respectively. The multi-swap version improves this even more, but with a larger outlier blowup.

These results strengthen our understanding of robust clustering problems from a local search perspective. Furthermore, our cluster form analysis has high potential to be applied to the robust versions of FLP and k-FLP, since the structures of these two problems are similar to k-MedP, and the analyses of the connection cost and the facility opening cost are separated in the previous papers that study local search algorithms for FLP and k-FLP (see Arya et al. 2004, Zhang 2007).

1.3 Outline of the paper

Section 2 presents the unified models and notation for k-MedP/k-MeaP and k-MedO/k-MeaO, and some useful technical lemmas. Section 3 then presents our standard local search algorithms for k-MedP/k-MeaP and the corresponding theoretical results. In Section 4, we develop our outlier-based local search algorithms for k-MedO/k-MeaO and present the corresponding theoretical results. Conclusions are given in Section 5. All technical proofs are given in the appendices.

2 Preliminaries

2.1 The models

We use the following notation for the problems studied in this paper (in addition to the notation introduced in the introduction). 𝒞 denotes the candidate center set, and Δ(a, b) denotes the connection cost between two points a and b. For k-MedP and k-MedO, we have 𝒞 = ℱ and Δ(a, b) = d(a, b); for k-MeaP and k-MeaO, we have 𝒞 = 𝒳 and Δ(a, b) = d²(a, b). Then, the penalty-version can be formulated as

min_{S⊆𝒞, P⊆𝒳} ∑_{x∈𝒳∖P} min_{s∈S} Δ(s, x) + ∑_{x∈P} p_x,

and the outlier-version can be formulated as

min_{S⊆𝒞, P⊆𝒳: |P|≤z} ∑_{x∈𝒳∖P} min_{s∈S} Δ(s, x).

Considering k-MeaP and k-MedP, we assume that S is a set of k centers. It is obvious that the optimal penalized point set with respect to S is P = {x∈𝒳 | p_x ≤ min_{s∈S} d(s, x)} for k-MedP and P = {x∈𝒳 | p_x ≤ min_{s∈S} d²(s, x)} for k-MeaP, implying that S determines the corresponding k clusters N(s) := {x∈𝒳∖P | s_x = s} for all s∈S, where s_x denotes the closest center in S to x∈𝒳∖P, i.e., s_x := argmin_{s∈S} d(s, x). Thus, we also call S a feasible solution for k-MedP and k-MeaP.

Given a center set S and a subset R ⊆ 𝒳, we suppose that 𝒳∖R = {x_1, x_2, …, x_{|𝒳∖R|}} with d(s_{x_1}, x_1) ≥ d(s_{x_2}, x_2) ≥ … ≥ d(s_{x_{|𝒳∖R|}}, x_{|𝒳∖R|}). Let outlier(S, R) := {x_1, x_2, …, x_z} if |𝒳∖R| ≥ z; otherwise, outlier(S, R) := 𝒳∖R. We simplify outlier(S, ∅) to outlier(S). For k-MedO and k-MeaO, it is obvious that the optimal outlier set with respect to S is outlier(S), implying that the set S can be seen as a feasible solution. We also use (S, P) to denote a solution (not necessarily feasible) for k-MedO and k-MeaO in which the center set is S and the outlier set is P.
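As a small illustrative sketch (the helper name `outlier_set` is our own, and we assume points are given as hashable tuples), outlier(S, R) can be computed by sorting 𝒳∖R by the distance to the nearest center:

```python
def outlier_set(X, S, R, z, dist):
    # outlier(S, R): among the points of X \ R, the z points with the
    # largest distance to their nearest center in S (all of X \ R if
    # fewer than z points remain)
    rest = [x for x in X if x not in R]
    rest.sort(key=lambda x: -min(dist(x, s) for s in S))
    return rest[:z]

euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
print(outlier_set([(0, 0), (3, 0), (7, 0)], [(0, 0)], set(), 1, euclid))
# the farthest point is discarded: prints [(7, 0)]
```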

2.2 Some technical lemmas

Given a data subset D ⊆ 𝒳 and a point c∈𝒞, we define Δ(c, D) := ∑_{x∈D} Δ(c, x). Let cent_𝒞(D) be a center point in 𝒞 that optimizes the objective of the k-means/k-median problem, i.e., cent_𝒞(D) := argmin_{c∈𝒞} Δ(c, D). We remark that the notation argmin (argmax) denotes an arbitrary element that minimizes (maximizes) the objective. From the well-known Centroid Lemma (Kanungo et al. 2004), we get cent_𝒞(D) = cent(D) for k-means, where cent(D) is the centroid of D, defined as follows.

Definition 2.1 (Centroid).

Given a set D ⊆ ℝ^d, we call the point ∑_{x∈D} x/|D|, denoted by cent(D), the centroid of D.

Lemma 2.2 (Centroid Lemma (Kanungo et al. 2004)).

For any data subset D ⊆ 𝒳 and any point c∈ℝ^d, we have d²(c, D) = d²(cent(D), D) + |D|·d²(cent(D), c).
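The identity can be checked numerically; the following short sketch (helper names our own) verifies it on a small instance.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(D):
    # cent(D): coordinate-wise average of the points of D
    n = len(D)
    return tuple(sum(x[i] for x in D) / n for i in range(len(D[0])))

D = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = (5.0, 4.0)
mu = centroid(D)                     # mu = (1.0, 1.0)
lhs = sum(sq_dist(c, x) for x in D)  # d^2(c, D)
rhs = sum(sq_dist(mu, x) for x in D) + len(D) * sq_dist(mu, c)
assert abs(lhs - rhs) < 1e-9         # both sides equal 83.0
```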

So, the candidate center points of a k-means problem are the centroids of all subsets of 𝒳. Note that the total number of these candidate center points is 2^{|𝒳|} − 1. To cut down this exponential magnitude, Matoušek (2000) introduced the concept of an approximate centroid set, given in the following definition.

Definition 2.3.

A set 𝒞′ ⊆ ℝ^d is an ε-approximate centroid set for 𝒳 ⊆ ℝ^d if for any set D ⊆ 𝒳, we have min_{c∈𝒞′} d²(c, D) ≤ (1+ε) min_{c∈ℝ^d} d²(c, D).

The following lemma shows the important observation that a polynomial-size ε̂-approximate centroid set for 𝒳 can be found in polynomial time. Utilizing this observation, we restrict the candidate center set of k-MeaP/k-MeaO in the remainder of this paper to the ε̂-approximate centroid set 𝒞′.

Lemma 2.4 (Matoušek (2000)).

Given an n-point set 𝒳 and a real number ε > 0, an ε-approximate centroid set for 𝒳 of size O(nε^{−d} log(1/ε)) can be computed in time O(n log n + nε^{−d} log(1/ε)).

For the k-median problem, we do not need an approximate centroid set, since the candidate centers lie in the finite set ℱ.

3 Local search approximation algorithms for k-MedP and k-MeaP

Let ρ be a fixed integer. For any feasible solution S, A ⊆ S and B ⊆ 𝒞∖S with |A| = |B| ≤ ρ, we define the so-called multi-swap operation swap(A, B), which drops all centers in A from S and adds all centers in B to S.

We further denote the connection cost of a point x∈𝒳 by cost_c(x), i.e., cost_c(x) := Δ(s_x, x), and define cost_c := ∑_{x∈𝒳∖P} cost_c(x), cost_p := ∑_{x∈P} p_x, and cost(S) := cost_c + cost_p, where P is the optimal penalized point set with respect to S.

Now we are ready to present our multi-swap local search algorithm.

Algorithm 1 The multi-swap local search algorithm: LS-Multi-Swap(𝒳, C, k, {p_j}_{j∈𝒳}, ρ)

Input: data set 𝒳, candidate center set C, penalty cost p_j for all j∈𝒳, positive integers k and ρ ≤ k.
Output: center set S ⊆ C.
1: Arbitrarily choose a k-center subset S from C.
2: Compute (A, B) := argmin_{A⊆S, B⊆C∖S, |A|=|B|≤ρ} cost(S∖A∪B).
3: while cost(S∖A∪B) < cost(S) do
4:   Set S := S∖A∪B.
5:   Compute (A, B) := argmin_{A⊆S, B⊆C∖S, |A|=|B|≤ρ} cost(S∖A∪B).
6: end while
7: return S

For k-MedP, we run LS-Multi-Swap(𝒳, ℱ, k, {p_j}_{j∈𝒳}, ρ); for k-MeaP, we first call the algorithm of Makarychev et al. (2016) to construct an ε̂-approximate centroid set 𝒞′ ⊆ 𝒳, and then run LS-Multi-Swap(𝒳, 𝒞′, k, {p_j}_{j∈𝒳}, ρ). The values of ρ and ε̂ will be determined in our analysis of the algorithm.
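For intuition, here is a minimal single-swap (ρ = 1) rendering of the algorithm for k-MedP in Python; it is our own simplified sketch (first-improvement search with strict improvement as the stopping rule), not the exact implementation analyzed below.

```python
def cost(S, X, p, dist):
    # k-MedP objective of center set S: each point pays the cheaper of
    # its connection cost and its penalty
    return sum(min(min(dist(x, s) for s in S), p[x]) for x in X)

def ls_single_swap(X, C, k, p, dist):
    S = set(C[:k])                       # arbitrary initial k centers
    improved = True
    while improved:                      # repeat until no swap improves
        improved = False
        for a in list(S):
            for b in (c for c in C if c not in S):
                T = (S - {a}) | {b}      # the operation swap(a, b)
                if cost(T, X, p, dist) < cost(S, X, p, dist):
                    S, improved = T, True
                    break
            if improved:
                break
    return S

X = C = [0, 1, 2, 10]                    # points on a line, centers = points
p = {x: 100 for x in X}                  # penalties too large to be used
S = ls_single_swap(X, C, 1, p, lambda a, b: abs(a - b))
print(sorted(S))                         # prints [1], a local (here also global) optimum
```

The multi-swap version replaces the inner loops by an enumeration of all pairs (A, B) with |A| = |B| ≤ ρ, exactly as in Algorithm 1.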

3.1 The analysis

Let S* be a global optimal solution with the penalized set P* = {x∈𝒳 | p_x ≤ min_{s*∈S*} Δ(x, s*)}. Analogously to the feasible solution S, we introduce the corresponding notations s*_x, N*(s*), cost*_c(x), cost*_c, cost*_p and cost(S*).

We use the standard analysis for a local search algorithm, in which some swap operations between S and S* are constructed, and then each point is reassigned to a center in the new solution. In the cluster form analysis, we try to bound the new cost for a set of points, rather than bounding the cost of each point individually and independently. To this end, we introduce the adapted cluster as follows.

N*_q(s*) := N*(s*) ∖ P,  ∀ s*∈S*.

With the adapted cluster, we set S̃* := {cent_𝒞(N*_q(s*)) | s*∈S*}. We introduce a mapping φ: S̃* → S that maps each point c∈S̃* to φ(c) := argmin_{s∈S} d(c, s). We say that the center φ(cent_𝒞(N*_q(s*))) captures s*. Considering one of the constructed swap operations, we will reassign some points to a center determined by the mapping φ (for instance, reassign the point x to φ(cent(N*_q(s*_x))); the details will be stated later).
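To make these definitions concrete, the following small sketch (helper names our own; points as tuples) builds the adapted clusters N*_q(s*), their centroids, and the mapping φ:

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def adapted_clusters(X, S_star, P, P_star):
    # N*_q(s*) = N*(s*) \ P: points of the optimal cluster of s* that are
    # penalized neither in the global solution (P*) nor in the local one (P)
    clusters = {s: [] for s in S_star}
    for x in X:
        if x in P or x in P_star:
            continue
        nearest = min(S_star, key=lambda s: sq_dist(x, s))
        clusters[nearest].append(x)
    return clusters

def centroid(D):
    n = len(D)
    return tuple(sum(x[i] for x in D) / n for i in range(len(D[0])))

def phi(c, S):
    # phi maps the centroid of an adapted cluster to its nearest local center
    return min(S, key=lambda s: sq_dist(c, s))
```

Here `min(..., key=...)` plays the role of argmin, with ties broken arbitrarily as in the remark of Section 2.2.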

Combining all swap operations, the sum of the costs of these points appears on the right-hand side of the inequality derived from the local optimality of S. For k-MeaP, we can bound this sum by the connection costs of S and S*; see Lemma 3.1. Note that none of these points is an outlier in S or in S*. This is the reason why we need to use the adapted cluster rather than the cluster N*(s*) that was used in the analysis for k-means (Gupta et al. 2017).

In the proof of Lemma 3.1, we divide the set 𝒳∖(P∪P*) into the adapted clusters with respect to all s*∈S*, and apply the Centroid Lemma to each adapted cluster. Afterwards, we bound the squared distance between the centroid c of each adapted cluster and its mapped point φ(c). This explains why the domain of the mapping φ is the set of centroids of the adapted clusters.

Lemma 3.1.

Let S and S* be a local optimal solution and a global optimal solution of k-MeaP, respectively. Then,

∑_{x∈𝒳∖(P∪P*)} d²(φ(cent(N*_q(s*_x))), x) ≤ ∑_{x∈𝒳∖(P∪P*)} (2·cost*_c(x) + cost_c(x)) + 2·√(∑_{x∈𝒳∖(P∪P*)} cost*_c(x)) · √(∑_{x∈𝒳∖(P∪P*)} cost_c(x)).  (1)
Proof.

With the Cauchy-Schwarz inequality, we obtain

sSxNq(s)d(x,cent𝒞(Nq(s)))d(x,sx)\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))\cdot d(x,s_{x}) (2)
\displaystyle\leq sSxNq(s)d2(x,cent𝒞(Nq(s)))sSxNq(s)d2(x,sx).\displaystyle\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))}\cdot\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})}.

Lemma 2.2 and the definition of ϕ()\phi(\cdot) then yield

x𝒳(PP)d2(ϕ(cent𝒞(Nq(sx))),x)\displaystyle\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x) (3)
=\displaystyle= sSxNq(s)d2(ϕ(cent𝒞(Nq(s))),x)\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))),x)
=\displaystyle= sS[d2(cent𝒞(Nq(s)),Nq(s))+|Nq(s)|d2(cent𝒞(Nq(s)),ϕ(cent𝒞(Nq(s))))]\displaystyle\sum\limits_{s^{*}\in S^{*}}\left[d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+|N^{*}_{q}(s^{*})|\cdot d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))))\right]
=\displaystyle= sSd2(cent𝒞(Nq(s)),Nq(s))+sSxNq(s)d2(cent𝒞(Nq(s)),ϕ(cent𝒞(Nq(s))))\displaystyle\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}))))
\displaystyle\leq sSd2(cent𝒞(Nq(s)),Nq(s))+sSxNq(s)d2(cent𝒞(Nq(s)),sx).\displaystyle\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),s_{x}).

Using the triangle inequality for d(,)d(\cdot,\cdot), we obtain

sSxNq(s)d2(cent𝒞(Nq(s)),sx)\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),s_{x}) (4)
\displaystyle\leq sSxNq(s)(d(x,cent𝒞(Nq(s)))+d(x,sx))2\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}\left(d(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))+d(x,s_{x})\right)^{2}
=\displaystyle= sSd2(cent𝒞(Nq(s)),Nq(s))+sSxNq(s)d2(x,sx)\displaystyle\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})
+ 2sSxNq(s)d(x,cent𝒞(Nq(s)))d(x,sx).\displaystyle+\ 2\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))\cdot d(x,s_{x}).

Integrating (2)-(4) and using the definition of cent𝒞(){\rm cent}_{{\mathcal{C}}}(\cdot) then gives

x𝒳(PP)d2(ϕ(cent𝒞(Nq(sx))),x)\displaystyle\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)
\displaystyle\leq 2sSd2(cent𝒞(Nq(s)),Nq(s))+sSxNq(s)d2(x,sx)\displaystyle 2\sum\limits_{s^{*}\in S^{*}}d^{2}({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})),N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})
+ 2sSxNq(s)d2(x,cent𝒞(Nq(s)))sSxNq(s)d2(x,sx)\displaystyle+\ 2\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))}\cdot\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})}
\displaystyle\leq 2sSd2(s,Nq(s))+sSxNq(s)d2(x,sx)\displaystyle 2\sum\limits_{s^{*}\in S^{*}}d^{2}(s^{*},N^{*}_{q}(s^{*}))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})
+ 2sSxNq(s)d2(x,s)sSxNq(s)d2(x,sx)\displaystyle+\ 2\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s^{*})}\cdot\sqrt{\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}_{q}(s^{*})}d^{2}(x,s_{x})}
=\displaystyle= x𝒳(PP)(2costc(x)+costc(x))+2x𝒳(PP)costc(x)x𝒳(PP)costc(x).\displaystyle\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}\left(2{\rm cost}^{*}_{c}(x)+{\rm cost}_{c}(x)\right)+2\sqrt{\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}^{*}_{c}(x)}\cdot\sqrt{\sum\limits_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}_{c}(x)}.

Note that cent𝒞()=cent(){\rm cent}_{{\mathcal{C}}}(\cdot)={\rm cent}(\cdot) for kk-MeaP, which completes the proof. ∎
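The second equality in (3) rests on the Centroid Lemma identity xd2(x,c)=xd2(x,μ)+|X|d2(μ,c)\sum_{x}d^{2}(x,c)=\sum_{x}d^{2}(x,\mu)+|X|\,d^{2}(\mu,c), where μ\mu is the centroid of XX. A quick numerical sanity check of this identity (our own sketch, not part of the paper):

```python
import numpy as np

# Sanity check of the Centroid Lemma identity used in (3):
#   sum_x d^2(x, c) = sum_x d^2(x, mu) + |X| * d^2(mu, c),
# where mu is the centroid of X and c is an arbitrary point.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # a small random point set
c = rng.normal(size=3)         # an arbitrary reference point
mu = X.mean(axis=0)            # centroid of X

lhs = np.sum((X - c) ** 2)
rhs = np.sum((X - mu) ** 2) + len(X) * np.sum((mu - c) ** 2)
assert np.isclose(lhs, rhs)
```

The identity is exact (the cross terms vanish at the centroid), so the assertion holds for any choice of XX and cc.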

Consider now ϕ(S~)\phi(\tilde{S}^{*}), i.e., the image set of S~\tilde{S}^{*} under ϕ\phi. We list all elements of ϕ(S~)\phi(\tilde{S}^{*}) as ϕ(S~)={s1,,sm}\phi(\tilde{S}^{*})=\{s_{1},...,s_{m}\}, where m:=|ϕ(S~)|m:=|\phi(\tilde{S}^{*})|. For each l{1,,m}l\in\{1,...,m\}, let Sl:={sl}S_{l}:=\{s_{l}\} and Sl:={sS|ϕ(cent𝒞(Nq(s)))=sl}S^{*}_{l}:=\{s^{*}\in S^{*}|\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))=s_{l}\}. Thus, SS^{*} is partitioned into S1,S2,,SmS^{*}_{1},S^{*}_{2},...,S^{*}_{m}. Noting that |S|=|S|=k|S|=|S^{*}|=k, we can enlarge each SlS_{l} such that S1,S2,,SmS_{1},S_{2},...,S_{m} is a partition of SS with |Sl|=|Sl||S_{l}|=|S^{*}_{l}| for each l{1,2,,m}l\in\{1,2,...,m\}.
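This partition step can be sketched as follows; the padding of each SlS_{l} with unused centers of SS relies on |S|=|S|=k|S|=|S^{*}|=k, and the function name is our own:

```python
from collections import defaultdict

def partition_centers(phi, S, S_star):
    """Partition S* by the image under phi, then enlarge each S_l = {s_l}
    with unused centers of S so that |S_l| = |S*_l| (possible since
    |S| = |S*| = k). Here phi maps each s* in S* to a center of S."""
    groups = defaultdict(list)                  # s_l -> S*_l
    for s_star in S_star:
        groups[phi[s_star]].append(s_star)
    unused = [s for s in S if s not in groups]  # centers never hit by phi
    parts = []
    for s_l, S_star_l in groups.items():
        S_l = [s_l] + [unused.pop() for _ in range(len(S_star_l) - 1)]
        parts.append((S_l, S_star_l))
    return parts
```

Each returned pair (Sl,Sl)(S_{l},S^{*}_{l}) has equal cardinality, and the SlS_{l} together partition SS.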

We will construct a swap operation between the points in SlS_{l} and SlS^{*}_{l} for each pair (Sl,Sl)(S_{l},S^{*}_{l}). Before doing this, we note that a center sSs^{*}\in S^{*} need not belong to the candidate center set 𝒞{\mathcal{C}}^{\prime} for kk-MeaP. Thus, we introduce a center s^𝒞{\hat{s}^{*}}\in{\mathcal{C}}^{\prime} associated with each sSs^{*}\in S^{*} to ensure that the swap operation involving ss^{*} can be implemented in Algorithm 3. For each sSs^{*}\in S^{*}, let s^:=argminc𝒞d(c,N(s)){\hat{s}^{*}}:=\arg\min_{c\in{\mathcal{C}}^{\prime}}d(c,N^{*}(s^{*})). Combined with Definition 2.3, we have (see Zhang et al. 2019)

xN(s)d2(s^,x)\displaystyle\sum\limits_{x\in N^{*}(s^{*})}d^{2}({\hat{s}^{*}},x) =\displaystyle= d2(s^,N(s))=minc𝒞′d2(c,N(s))d^{2}({\hat{s}^{*}},N^{*}(s^{*}))=\min\limits_{c\in{\mathcal{C}}^{\prime}}d^{2}(c,N^{*}(s^{*})) (5)
\displaystyle\leq (1+ε^)mincdd2(c,N(s))=(1+ε^)d2(s,N(s))\displaystyle(1+{\hat{\varepsilon}})\min\limits_{c\in\mathbb{R}^{d}}d^{2}(c,N^{*}(s^{*}))=(1+{\hat{\varepsilon}})d^{2}(s^{*},N^{*}(s^{*}))
=\displaystyle= (1+ε^)xN(s)d2(s,x).\displaystyle(1+{\hat{\varepsilon}})\sum\limits_{x\in N^{*}(s^{*})}d^{2}(s^{*},x).
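The choice of s^{\hat{s}^{*}} is a plain argmin over the candidate set; a sketch, where `hat_center` is our illustrative name and clusters/candidates are plain coordinate arrays:

```python
import numpy as np

def hat_center(cluster, candidates):
    """hat{s}* := the candidate center minimizing the summed squared
    distance d^2(c, N*(s*)) to the cluster, as in (5)."""
    cluster = np.asarray(cluster)
    costs = [np.sum((cluster - c) ** 2) for c in candidates]
    return candidates[int(np.argmin(costs))]
```

If the candidates form an ε^\hat{\varepsilon}-approximate centroid set, the cost of the selected center exceeds the centroid cost by at most a factor (1+ε^)(1+\hat{\varepsilon}), which is exactly inequality (5).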

The algorithm allows at most ρ\rho points to be swapped. To satisfy this condition, we consider the following two cases to construct swap operations (cf. Figure 1 for ρ=3\rho=3).

Refer to caption
(a) |Sl|ρ|S_{l}|\leq\rho
Refer to caption
(b) |Sl|>ρ|S_{l}|>\rho
Figure 1: Two cases for constructing the swap operations between SlS_{l} and SlS_{l}^{*} for ρ=3\rho=3. The solid squares belong to ϕ(S~)\phi(\tilde{S}^{*}).
Case 1

(cf. Figure 1(a)). For each ll with |Sl|=|Sl|ρ|S_{l}|=|S^{*}_{l}|\leq\rho, we consider the pair (Sl,Sl)(S_{l},S^{*}_{l}). Let S^l:={s^|sSl}{\hat{S}^{*}_{l}}:=\{\hat{s}^{*}|s^{*}\in S^{*}_{l}\}. W.l.o.g., we assume that S^l𝒳S{\hat{S}^{*}_{l}}\subseteq{\mathcal{X}}\setminus S. For kk-MedP, we consider the swap(Sl,Sl)(S_{l},S^{*}_{l}); for kk-MeaP, we consider the swap(Sl,S^l)(S_{l},{\hat{S}^{*}_{l}}). Utilizing these swap operations, we obtain the following result.

Lemma 3.2.

If |Sl|=|Sl|ρ|S_{l}|=|S^{*}_{l}|\leq\rho, then we have

0\displaystyle 0 \displaystyle\leq sSlxN(s)P(pxcostc(x))+sSlxN(s)P(d(ϕ(cent𝒞(Nq(sx))),x)costc(x))+\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+ (6)
sSlxN(s)P(costc(x)costc(x))+sSlxN(s)P(costc(x)px).\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x}).

for kk-MedP, and

0\displaystyle 0 \displaystyle\leq sSlxN(s)P(pxcostc(x))+sSlxN(s)P(d2(ϕ(cent𝒞(Nq(sx))),x)costc(x))+\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+ (7)
sSlxN(s)P((1+ε^)costc(x)costc(x))+sSlxN(s)P((1+ε^)costc(x)px).\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x}).

for kk-MeaP.

Case 2

(cf. Figure 1(b)). For each ll with |Sl|=|Sl|=ml>ρ|S_{l}|=|S^{*}_{l}|=m_{l}>\rho, we consider (ml1)ml(m_{l}-1)m_{l} pairs (s,s)(s,s^{*}) with sSl\{sl}s\in S_{l}\backslash\{s_{l}\} and sSls^{*}\in S^{*}_{l}. For kk-MedP, we consider the swap(s,s)(s,s^{*}); for kk-MeaP, we consider the swap(s,s^)(s,{\hat{s}^{*}}). Utilizing these swap operations, we obtain the following result.

Lemma 3.3.

For any sSl\{sl}s\in S_{l}\backslash\{s_{l}\} and sSls^{*}\in S^{*}_{l}, we have

0\displaystyle 0 \displaystyle\leq xN(s)P(pxcostc(x))+xN(s)P(d(ϕ(cent𝒞(Nq(sx))),x)costc(x))+\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+ (8)
xN(s)P(costc(x)costc(x))+xN(s)P(costc(x)px)\displaystyle\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x})

for kk-MedP, and

0\displaystyle 0 \displaystyle\leq xN(s)P(pxcostc(x))+xN(s)P(d2(ϕ(cent𝒞(Nq(sx))),x)costc(x))+\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+ (9)
xN(s)P((1+ε^)costc(x)costc(x))+xN(s)P((1+ε^)costc(x)px)\displaystyle\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x})

for kk-MeaP.

Lemma 3.2 shows a relationship between the sets SlS_{l} and SlS^{*}_{l}, while Lemma 3.3 shows a relationship between two points in SlS_{l} and SlS^{*}_{l} respectively. We remark that Lemma 3.3 holds for all pairs (Sl,Sl)(S_{l},S^{*}_{l}) (no matter whether |Sl|>ρ|S_{l}|>\rho). This is useful for the analysis of the algorithm for kk-MedO/kk-MeaO in Section 4.
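The two cases above generate a concrete list of swap operations from the pairs (Sl,Sl)(S_{l},S^{*}_{l}); a sketch, assuming (as in our partition above) that the representative sls_{l} is stored first in SlS_{l}:

```python
def swap_operations(parts, rho):
    """Enumerate the swap operations of Cases 1 and 2 for the pairs
    (S_l, S*_l); the representative s_l is assumed to be the first
    element of S_l. Each operation is (outgoing centers, incoming centers)."""
    ops = []
    for S_l, S_star_l in parts:
        if len(S_l) <= rho:
            ops.append((S_l, S_star_l))          # Case 1: one multi-swap
        else:
            for s in S_l[1:]:                    # Case 2: all single swaps
                for s_star in S_star_l:          # (m_l - 1) * m_l of them
                    ops.append(([s], [s_star]))
    return ops
```

A pair with |Sl|ρ|S_{l}|\leq\rho contributes one operation, while a pair with |Sl|=ml>ρ|S_{l}|=m_{l}>\rho contributes (ml1)ml(m_{l}-1)m_{l} single swaps, matching the counts in the text.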

Proof of Lemma 3.2.

We only prove it for kk-MeaP. The proof for kk-MedP is similar. After the operation swap(Sl,S^l)(S_{l},{\hat{S}^{*}_{l}}), we penalize all points in N(s)PN(s)\cap P^{*} for all sSls\in S_{l}, reassign each point xN(s)x\in N^{*}(s^{*}) to s^\hat{s}^{*} for all sSls^{*}\in S^{*}_{l}, and reassign each point xN(s)(sSlN(s)P)x\in N(s)\setminus\left(\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}\right) to ϕ(cent𝒞(Nq(sx)))\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))) for all sSls\in S_{l} (sxSls^{*}_{x}\notin S^{*}_{l} implies ϕ(cent𝒞(Nq(sx)))Sl\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))\notin S_{l}). Since the operation swap(Sl,S^l)(S_{l},{\hat{S}^{*}_{l}}) does not improve the local optimal solution SS, we have

0\displaystyle 0 \displaystyle\leq cost(SSlS^l)cost(S)\displaystyle{\rm cost}(S\setminus S_{l}\cup\hat{S}^{*}_{l})-{\rm cost}(S)
\displaystyle\leq sSlxN(s)P(pxcostc(x))+\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+
sSlxN(s)(sSlN(s)P)(d2(ϕ(cent𝒞(Nq(sx))),x)costc(x))+\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus\left(\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}\right)}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+
sSlxN(s)P(d2(s^,x)costc(x))+sSlxN(s)P(d2(s^,x)px).\displaystyle\sum\limits_{s^{*}\in S^{*}_{l}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}(d^{2}({\hat{s}^{*}},x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S^{*}_{l}}\sum\limits_{x\in N^{*}(s^{*})\cap P}(d^{2}({\hat{s}^{*}},x)-p_{x}).\

Combining this with xN(s)=xN(s)P+xN(s)P\sum_{x\in N^{*}(s^{*})}=\sum_{x\in N^{*}(s^{*})\setminus P}+\sum_{x\in N^{*}(s^{*})\cap P} and inequality (5) completes the proof. ∎

Proof of Lemma 3.3.

We again only prove it for kk-MeaP, and the proof for kk-MedP is again similar. Recall the definition of s^{\hat{s}^{*}}. W.l.o.g., we assume that s^S{\hat{s}^{*}}\notin S. It follows from ssls\neq s_{l} and ϕ(cent𝒞(Nq(s)))=sl\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*})))=s_{l} that ϕ(cent𝒞(Nq(sx)))s\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))\neq s when xN(s)(N(s)P)x\in N(s)\setminus(N^{*}(s^{*})\cup P^{*}). Since the operation swap(s,s^)(s,{\hat{s}^{*}}) does not improve the current solution SS, we have

0\displaystyle 0 \displaystyle\leq cost(S{s}{s^})cost(S)\displaystyle{\rm cost}(S\setminus\{s\}\cup\{\hat{s}^{*}\})-{\rm cost}(S)
\displaystyle\leq xN(s)P(pxcostc(x))+xN(s)(N(s)P)(d2(ϕ(cent𝒞(Nq(sx))),x)costc(x))\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus(N^{*}(s^{*})\cup P^{*})}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+xN(s)P(d2(s^,x)costc(x))+xN(s)P(d2(s^,x)px)\displaystyle+\sum\limits_{x\in N^{*}(s^{*})\setminus P}(d^{2}({\hat{s}^{*}},x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}(d^{2}({\hat{s}^{*}},x)-p_{x})
\displaystyle\leq xN(s)P(pxcostc(x))+xN(s)P(d2(ϕ(cent𝒞(Nq(sx))),x)costc(x))\displaystyle\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+xN(s)P((1+ε^)costc(x)costc(x))+xN(s)P((1+ε^)costc(x)px).\displaystyle+\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x}).

This completes the proof. ∎

Combining Lemmas 3.2 and 3.3, we estimate the cost of SS for kk-MedP and kk-MeaP in the following two theorems respectively.

Theorem 3.4.

LS-Multi-Swap(𝒳,,k,{pj}j𝒳,ρ{\mathcal{X}},{\mathcal{F}},k,\{p_{j}\}_{j\in{\mathcal{X}}},\rho) for kk-MedP produces a local optimal solution SS satisfying costc+costp(3+2/ρ)costc+(1+1/ρ)costp{\rm cost}_{c}+{\rm cost}_{p}\leq(3+2/\rho){\rm cost}_{c}^{*}+(1+1/\rho){\rm cost}_{p}^{*}.

Theorem 3.5.

Let 𝒞{\mathcal{C}}^{\prime} be an ε^\hat{\varepsilon}-approximate centroid set for 𝒳{\mathcal{X}}. LS-Multi-Swap(𝒳,{\mathcal{X}}, 𝒞,k,{pj}j𝒳,ρ{\mathcal{C}}^{\prime},k,\{p_{j}\}_{j\in{\mathcal{X}}},\rho) for kk-MeaP produces a local optimal solution SS satisfying costc+costp(3+2/ρ+ε^)2costc+(3+2/ρ+ε^)(1+1/ρ)costp{\rm cost}_{c}+{\rm cost}_{p}\leq\left(3+2/\rho+{\hat{\varepsilon}}\right)^{2}{\rm cost}_{c}^{*}+\left(3+2/\rho+{\hat{\varepsilon}}\right)\left(1+1/\rho\right){\rm cost}_{p}^{*}.

Proof of Theorem 3.4..

Note that ml/(ml1)(ρ+1)/ρm_{l}/(m_{l}-1)\leq(\rho+1)/\rho and d(ϕ(c),x)costc(x)d(\phi(c),x)\geq{\rm cost}_{c}(x) for any cS~c\in\tilde{S}^{*} and any x𝒳x\in{\mathcal{X}}. Summing the inequality (6) with weight 11 and inequality (8) with weight 1/(ml1)1/(m_{l}-1) over all constructed swap operations, we have

0\displaystyle 0 \displaystyle\leq (1+1ρ)sSxN(s)P(pxcostc(x))\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x)) (10)
+(1+1ρ)sSxN(s)P(d(ϕ(cent𝒞(Nq(sx))),x)costc(x))\displaystyle+\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+sSxN(s)P(costc(x)costc(x))\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))
+sSxN(s)P(costc(x)px).\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x}).

The triangle inequality and the definition of ϕ()\phi(\cdot) imply that

d(ϕ(cent𝒞(Nq(sx))),x)\displaystyle d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x) (11)
\displaystyle\leq d(ϕ(cent𝒞(Nq(sx))),cent𝒞(Nq(sx)))+d(cent𝒞(Nq(sx)),x)\displaystyle d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)
\displaystyle\leq d(sx,cent𝒞(Nq(sx)))+d(cent𝒞(Nq(sx)),x)\displaystyle d(s_{x},{\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})))+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)
\displaystyle\leq d(sx,x)+d(cent𝒞(Nq(sx)),x)+d(cent𝒞(Nq(sx)),x)\displaystyle d(s_{x},x)+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)+d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)
=\displaystyle= costc(x)+2d(cent𝒞(Nq(sx)),x).\displaystyle{\rm cost}_{c}(x)+2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x).

Combining inequalities (10) and (11), we obtain

0\displaystyle 0 \displaystyle\leq (1+1ρ)sSxN(s)P(pxcostc(x))+(1+1ρ)sSxN(s)P2d(cent𝒞(Nq(sx)),x)\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x))+\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\setminus P^{*}}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x) (12)
+sSxN(s)P(costc(x)costc(x))+sSxN(s)P(costc(x)px)\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}({\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}({\rm cost}_{c}^{*}(x)-p_{x})
\displaystyle\leq (1+1ρ)xPpx+(1+1ρ)sSxNq(s)2d(cent𝒞(Nq(sx)),x)\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{x\in P^{*}}p_{x}+\left(1+\frac{1}{\rho}\right)\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)
+x𝒳Pcostc(x)x𝒳Pcostc(x)xPpx\displaystyle+\sum\limits_{x\in{\mathcal{X}}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum\limits_{x\in{\mathcal{X}}\setminus P}{\rm cost}_{c}(x)-\sum\limits_{x\in P}p_{x}
=\displaystyle= (1+1ρ)costp+(1+1ρ)sSxNq(s)2d(cent𝒞(Nq(sx)),x)\displaystyle\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}+\left(1+\frac{1}{\rho}\right)\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)
+costccostccostp.\displaystyle+\ {\rm cost}_{c}^{*}-{\rm cost}_{c}-{\rm cost}_{p}.

From the definitions of Nq()N^{*}_{q}(\cdot) and cent𝒞(){\rm cent}_{{\mathcal{C}}}(\cdot), we get that

sSxNq(s)2d(cent𝒞(Nq(sx)),x)sSxNq(s)2d(sx,x)2costc.\displaystyle\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x})),x)\leq\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N_{q}^{*}(s^{*})}2d(s^{*}_{x},x)\leq 2{\rm cost}_{c}^{*}. (13)

Finally, we complete the proof by combining inequalities (12) and (13). ∎

Proof of Theorem 3.5..

Similar to the proof of Theorem 3.4, we obtain by summing inequality (7) with weight 11 and inequality (9) with weight 1/(ml1)1/(m_{l}-1) over all constructed swap operations that

0\displaystyle 0 \displaystyle\leq (1+1ρ)sSxN(s)P(pxcostc(x))\displaystyle\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\cap P^{*}}(p_{x}-{\rm cost}_{c}(x)) (14)
+(1+1ρ)sSxN(s)P(d2(ϕ(cent𝒞(Nq(sx))),x)costc(x))\displaystyle+\left(1+\frac{1}{\rho}\right)\sum\limits_{s\in S}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+sSxN(s)P((1+ε^)costc(x)costc(x))\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x))
+sSxN(s)P((1+ε^)costc(x)px).\displaystyle+\sum\limits_{s^{*}\in S^{*}}\sum\limits_{x\in N^{*}(s^{*})\cap P}((1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-p_{x}).

Because of sSxN(s)P=x𝒳(PP)\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}=\sum_{x\in{\mathcal{X}}\setminus(P\cup P^{*})} and Lemma 3.1, the RHS of (14) is not larger than

(3+2ρ+ε^)x𝒳Pcostc(x)x𝒳Pcostc(x)\displaystyle\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right)\sum\limits_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum\limits_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x) (15)
+ 2(1+1ρ)x𝒳Pcostc(x)x𝒳Pcostc(x)+(1+1ρ)xPpxxPpx\displaystyle+\ 2\left(1+\frac{1}{\rho}\right)\sqrt{\sum\limits_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)}\sqrt{\sum\limits_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)}+\left(1+\frac{1}{\rho}\right)\sum\limits_{x\in P^{*}}p_{x}-\sum\limits_{x\in P}p_{x}
=\displaystyle= (3+2ρ+ε^)costccostc+2(1+1ρ)costccostc+(1+1ρ)costpcostp\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}-{\rm cost}_{c}+2\left(1+\frac{1}{\rho}\right)\sqrt{{\rm cost}_{c}^{*}\,{\rm cost}_{c}}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}-{\rm cost}_{p}
\displaystyle\leq ((3+2ρ+ε^)costc+(1+1ρ)costp)(costc+costp)\displaystyle\left(\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}\right)-\left({\rm cost}_{c}+{\rm cost}_{p}\right)
+2(1+1ρ)3+2ρ+ε^((3+2ρ+ε^)costc+(1+1ρ)costp)(costc+costp).\displaystyle+\frac{2\left(1+\frac{1}{\rho}\right)}{\sqrt{3+\frac{2}{\rho}+{\hat{\varepsilon}}}}\sqrt{\left(\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}\right)\left({\rm cost}_{c}+{\rm cost}_{p}\right)}.

The RHS of (15) is equal to

((3+2ρ+ε^)costc+(1+1ρ)costp+αcostc+costp)\displaystyle\left(\sqrt{\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}}+\alpha\sqrt{{\rm cost}_{c}+{\rm cost}_{p}}\right)
×((3+2ρ+ε^)costc+(1+1ρ)costpβcostc+costp)\displaystyle\times\left(\sqrt{\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}}-\beta\sqrt{{\rm cost}_{c}+{\rm cost}_{p}}\right)

where

α=1+1/ρ3+2/ρ+ε^+(1+1/ρ)23+2/ρ+ε^+1,\displaystyle\alpha=\frac{1+1/\rho}{\sqrt{3+2/\rho+\hat{\varepsilon}}}+\sqrt{\frac{(1+1/\rho)^{2}}{3+2/\rho+\hat{\varepsilon}}+1},
β=1+1/ρ3+2/ρ+ε^+(1+1/ρ)23+2/ρ+ε^+1.\displaystyle\beta=-\frac{1+1/\rho}{\sqrt{3+2/\rho+\hat{\varepsilon}}}+\sqrt{\frac{(1+1/\rho)^{2}}{3+2/\rho+\hat{\varepsilon}}+1}.

This implies that

(3+2ρ+ε^)costc+(1+1ρ)costpβcostc+costp0,\displaystyle\sqrt{\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}}-\beta\sqrt{{\rm cost}_{c}+{\rm cost}_{p}}\geq 0,

which is equivalent to

costc+costp\displaystyle{\rm cost}_{c}+{\rm cost}_{p} \displaystyle\leq 1β2(3+2ρ+ε^)costc+1β2(1+1ρ)costp.\displaystyle\frac{1}{\beta^{2}}\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right){\rm cost}_{c}^{*}+\frac{1}{\beta^{2}}\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}.

Observe that

1β23+2ρ+ε^.\frac{1}{\beta^{2}}\leq 3+\frac{2}{\rho}+\hat{\varepsilon}.

So, we have

costc+costp(3+2ρ+ε^)2costc+(3+2ρ+ε^)(1+1ρ)costp.\displaystyle{\rm cost}_{c}+{\rm cost}_{p}\leq\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right)^{2}{\rm cost}_{c}^{*}+\left(3+\frac{2}{\rho}+{\hat{\varepsilon}}\right)\left(1+\frac{1}{\rho}\right){\rm cost}_{p}^{*}. (16)

Inequality (16) is exactly the claimed bound, which completes the proof. ∎

We remark that Algorithm 3 can be adapted to a polynomial-time algorithm that sacrifices only ε\varepsilon in the approximation factor (see Arya et al. 2004). Combining this adaptation with Theorems 3.4-3.5, we obtain a (3+ε)(3+\varepsilon)-approximation algorithm for kk-MedP and a (9+ε)(9+\varepsilon)-approximation algorithm for kk-MeaP, provided ρ\rho is sufficiently large and ε^\hat{\varepsilon} is sufficiently small.

4 Local search algorithm for kk-MedO/kk-MeaO

In this section, we focus on kk-MedO and kk-MeaO. We apply the technique for addressing outliers in a local search algorithm provided by Gupta et al. (2017) to kk-MedO and kk-MeaO, and use our new analysis to improve the approximation ratio.

4.1 The algorithm

Each iteration of the outlier-based multi-swap local search algorithm has a no-swap step and a swap step. Supposing that the current solution is (S,P)(S,P), the no-swap step implements an “add outliers” operation that adds the points in outlier(S,P){\rm outlier}(S,P) (defined in Section 2.1) to PP, provided this operation reduces the cost by a given factor. The swap step then searches for a better solution by combining the multi-swap with the “add outliers” operation. The algorithm terminates when neither the no-swap step nor the swap step can reduce the cost by the given factor.
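Assuming outlier(S,P){\rm outlier}(S,P) denotes, as we read Section 2.1, the zz points of 𝒳P{\mathcal{X}}\setminus P farthest from SS, the operation can be sketched as:

```python
import numpy as np

def outlier(X, S, P, z):
    """The z points of X \\ P farthest from the center set S (our reading
    of the operation outlier(S, P) from Section 2.1); X is a list of
    coordinate arrays, P a set of indices into X."""
    rest = [i for i in range(len(X)) if i not in P]
    dist = {i: min(np.sum((X[i] - s) ** 2) for s in S) for i in rest}
    return set(sorted(rest, key=lambda i: -dist[i])[:z])
```

The “add outliers” operation then simply replaces PP by P ∪ outlier(S,P), removing the zz currently most expensive points from the objective.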

Let cost(S,P){\rm cost}(S,P) denote the cost of the solution (S,P)(S,P). We give the formal description of the outlier-based local search algorithm in Algorithm 4.1. This algorithm has three parameters: ρ\rho is the number of points which are allowed to be swapped in a solution, while qq and ε\varepsilon control the factor by which the cost must decrease in each step. The parameter qq is fixed to kk in the algorithm provided by Gupta et al. (2017), whereas it is an input of our algorithm, because the approximation ratio depends on its value.

The following proposition holds for this algorithm.

Proposition 4.1 (Gupta et al. 2017).

Let (S,P)(S,P) be the solution produced by LS-Multi-Swap-Outlier(𝒳,{\mathcal{X}}, C,z,k,ρ,εC,z,k,\rho,\varepsilon), and set q=kq=k if ρ=1\rho=1; otherwise, set q=k2kq=k^{2}-k. Then

  • (i)

    cost(S,Poutlier(S,P))(1ε/q)cost(S,P){\rm cost}(S,P\cup{\rm outlier}(S,P))\geq\left(1-\varepsilon/q\right){\rm cost}(S,P),

  • (ii)

    cost(SAB,Poutlier(SAB,P))(1ε/q)cost(S,P){\rm cost}(S\setminus A\cup B,P\cup{\rm outlier}(S\setminus A\cup B,P))\geq\left(1-\varepsilon/q\right){\rm cost}(S,P) for any ASA\subseteq S and BCB\subseteq C.

  Algorithm 2 The outlier-based local search algorithm: LS-Multi-Swap-Outlier(𝒳,C,z,k,ρ,q,ε{\mathcal{X}},C,z,k,\rho,q,\varepsilon)

 

Input:  Data set 𝒳{\mathcal{X}}, candidate center set CC, positive integers zz, kk, qq and ρk\rho\leq k, real number ε>0\varepsilon>0.
Output:  Center set SCS\subseteq C and outlier set P𝒳P\subseteq{\mathcal{X}}.
1:  Arbitrarily choose a kk-center subset SS from CC.
2:  Set P:=outlier(S,∅)P:={\rm outlier}(S,\emptyset).
3:  Set α:=+\alpha:=+\infty.
4:  while cost(S,P)<α{\rm cost}(S,P)<\alpha do
5:     αcost(S,P)\alpha\leftarrow{\rm cost}(S,P)
6:     if cost(S,Poutlier(S,P))<(1εq)cost(S,P){\rm cost}(S,P\cup{\rm outlier}(S,P))<\left(1-\dfrac{\varepsilon}{q}\right){\rm cost}(S,P) then
7:        Set P:=Poutlier(S,P)P:=P\cup{\rm outlier}(S,P).
8:     end if
9:     Compute (A,B):=argminAS,BCS,|A|=|B|ρcost(SAB,Poutlier(SAB,P)).(A,B):=\arg\min_{A\subseteq S,B\subseteq C\setminus S,|A|=|B|\leq\rho}{\rm cost}(S\setminus A\cup B,P\cup{\rm outlier}(S\setminus A\cup B,P)).
10:     Set S:=SABS^{\prime}:=S\setminus A\cup B and P:=Poutlier(SAB,P)P^{\prime}:=P\cup{\rm outlier}(S\setminus A\cup B,P).
11:     if cost(S,P)<(1εq)cost(S,P){\rm cost}(S^{\prime},P^{\prime})<\left(1-\dfrac{\varepsilon}{q}\right){\rm cost}(S,P) then
12:        Set S:=SS:=S^{\prime} and P:=PP:=P^{\prime}.
13:     end if
14:  end while
15:  return  SS and PP

 

For kk-MedO, we run LS-Multi-Swap-Outlier(𝒳,,z,k,ρ,q,ε{\mathcal{X}},{\mathcal{F}},z,k,\rho,q,\varepsilon). For kk-MeaO, we run LS-Multi-Swap-Outlier(𝒳,𝒞,z,k,ρ,q,ε{\mathcal{X}},{\mathcal{C}}^{\prime},z,k,\rho,q,\varepsilon), where 𝒞{\mathcal{C}}^{\prime} is an ε^{\hat{\varepsilon}}-approximate centroid set for 𝒳{\mathcal{X}}. The values of ρ\rho, ε\varepsilon, and ε^\hat{\varepsilon} will be determined in the analysis of the algorithm.
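For concreteness, the following is a compact single-swap (ρ=1\rho=1) sketch of the outlier-based local search for kk-MeaO. The helper names are ours, the candidate set is taken to be the data itself, every point of CC is tried as an incoming center (trying a center already in SS is harmless), and no attention is paid to efficiency:

```python
import itertools
import numpy as np

def cost(X, S, P):
    """cost(S, P): summed squared distance of the non-outliers to their
    nearest centers (the k-MeaO objective)."""
    return sum(min(np.sum((X[i] - s) ** 2) for s in S)
               for i in range(len(X)) if i not in P)

def farthest(X, S, P, z):
    """The z points of X \\ P farthest from S."""
    rest = [i for i in range(len(X)) if i not in P]
    rest.sort(key=lambda i: -min(np.sum((X[i] - s) ** 2) for s in S))
    return set(rest[:z])

def ls_swap_outlier(X, C, z, k, q, eps):
    """Single-swap (rho = 1) variant of the outlier-based local search."""
    S = list(C[:k])                        # arbitrary initial centers
    P = farthest(X, S, set(), z)
    alpha = float('inf')
    while cost(X, S, P) < alpha:
        alpha = cost(X, S, P)
        # no-swap step: add z outliers if that reduces the cost enough
        P2 = P | farthest(X, S, P, z)
        if cost(X, S, P2) < (1 - eps / q) * cost(X, S, P):
            P = P2
        # swap step: best single swap combined with adding z outliers
        best = None
        for a, b in itertools.product(range(len(S)), C):
            S2 = S[:a] + S[a + 1:] + [b]
            P2 = P | farthest(X, S2, P, z)
            c2 = cost(X, S2, P2)
            if best is None or c2 < best[0]:
                best = (c2, S2, P2)
        if best[0] < (1 - eps / q) * cost(X, S, P):
            _, S, P = best
    return S, P
```

On a toy instance with one far-away point, the returned PP contains that point, but |P||P| may exceed zz over the iterations; this is exactly the bi-criteria violation bounded in Theorem 4.3.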

4.2 The analysis

The time complexity of Algorithm 4.1 is shown in the following theorem.

Theorem 4.2.

The running time of LS-Multi-Swap-Outlier(𝒳,𝒞,z,k,ρ,q,ε{\mathcal{X}},{\mathcal{C}},z,k,\rho,q,\varepsilon) is O(kρnρqεlog(nδ))O\left(\frac{k^{\rho}n^{\rho}q}{\varepsilon}\log(n\delta)\right).

Proof.

The proof is similar to that in Gupta et al. (2017). For the sake of completeness, we also present it here. W.l.o.g., we can assume that the optimal value of the problem is at least 11 by scaling the distances, except for the trivial case that k=nzk=n-z. Under this assumption, the cost of any solution is at most nδ1n\delta\geq 1. The number of iterations is at most O(log1ε/q(nδ))=O(qεlog(nδ))O(-\log_{1-\varepsilon/q}(n\delta))=O(\frac{q}{\varepsilon}\log(n\delta)), since each iteration reduces the cost to at most (1ε/q)(1-\varepsilon/q) times its previous value. The number of candidate solutions examined in each iteration is at most O((kn)ρ)O((kn)^{\rho}), since |A|=|B|ρ|A|=|B|\leq\rho. This completes the proof. ∎

The algorithm may violate the outlier constraint in order to achieve a bounded approximation ratio. However, the number of outliers it returns can itself be bounded, as the following result shows.

Theorem 4.3.

The number of outliers returned by LS-Multi-Swap-Outlier(𝒳,𝒞,z,k,ρ,q,ε{\mathcal{X}},{\mathcal{C}},z,k,\rho,q,\varepsilon) is O(zqεlog(nδ))O\left(\frac{zq}{\varepsilon}\log(n\delta)\right).

Proof.

From the proof of Theorem 4.2, we know that LS-Multi-Swap-Outlier(𝒳,𝒞,{\mathcal{X}},{\mathcal{C}}, z,k,ρ,q,εz,k,\rho,q,\varepsilon) has at most O(qεlog(nδ))O\left(\frac{q}{\varepsilon}\log(n\delta)\right) iterations. In each iteration, the algorithm marks at most 2z2z additional points as outliers: at most zz in the no-swap step and at most zz in the swap step. This completes the proof. ∎

Let (S,P)(S,P) be the solution returned by Algorithm 4.1, and (S,P)(S^{*},P^{*}) be the global optimal solution. Similar to the penalty version, we use the same notation (except that the outlier version has no penalty cost) and adopt the same partition of SS and SS^{*} (S=lSl,S=lSlS=\cup_{l}S_{l},\ S^{*}=\cup_{l}S^{*}_{l}). Similar to Lemmas 3.2 and 3.3, we obtain the following two results.

Lemma 4.4.

If |Sl|=|Sl|ρ|S_{l}|=|S^{*}_{l}|\leq\rho, we have

εqcost(S,P)\displaystyle-\dfrac{\varepsilon}{q}\cdot{\rm cost}(S,P) \displaystyle\leq sSlxN(s)P(d(ϕ(cent𝒞(Nq(sx))),x)costc(x))+\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+ (17)
sSlxN(s)costc(x)sSlxN(s)Pcostc(x).\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x).

for kk-MedO, and

εqcost(S,P)\displaystyle-\dfrac{\varepsilon}{q}\cdot{\rm cost}(S,P) \displaystyle\leq sSlxN(s)P(d2(ϕ(cent𝒞(Nq(sx))),x)costc(x))+\displaystyle\sum\limits_{s\in S_{l}}\sum\limits_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{{\mathcal{C}}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+
sSlxN(s)(1+ε^)costc(x)sSlxN(s)Pcostc(x).\displaystyle\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})}(1+{\hat{\varepsilon}}){\rm cost}_{c}^{*}(x)-\sum\limits_{s^{*}\in S_{l}^{*}}\sum\limits_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x).

for kk-MeaO.

Proof.

We only prove it for kk-MeaO; the proof for kk-MedO is similar. We consider the swap(Sl,S^l)(S_{l},{\hat{S}}^{*}_{l}). Since the swap step of the algorithm produces at most zz additional outliers in each iteration, and |(PsSlN(s))P||P|+z|(P\setminus\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*}))\cup P^{*}|\leq|P|+z, we can let the points in (PsSlN(s))P(P\setminus\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*}))\cup P^{*} be the outliers after the constructed swap operation. For the other points, it is obvious that we can apply the reassignments in the proof of Lemma 3.2 also here. Then, Proposition 4.1 yields

-\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)
\leq {\rm cost}(S\setminus S_{l}\cup\hat{S}^{*}_{l},\,P\cup{\rm outlier}(S\setminus S_{l}\cup\hat{S}^{*}_{l},P))-{\rm cost}(S,P)
\leq -\sum_{s\in S_{l}}\sum_{x\in N(s)\cap P^{*}}{\rm cost}_{c}(x)
+\sum_{s\in S_{l}}\sum_{x\in N(s)\setminus\left(\bigcup_{s^{*}\in S^{*}_{l}}N^{*}(s^{*})\cup P^{*}\right)}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{s^{*}\in S^{*}_{l}}\sum_{x\in N^{*}(s^{*})\setminus P}\left(d^{2}(\hat{s}^{*},x)-{\rm cost}_{c}(x)\right)+\sum_{s^{*}\in S^{*}_{l}}\sum_{x\in N^{*}(s^{*})\cap P}d^{2}(\hat{s}^{*},x)
\leq -\sum_{s\in S_{l}}\sum_{x\in N(s)\cap P^{*}}{\rm cost}_{c}(x)+\sum_{s^{*}\in S_{l}^{*}}\sum_{x\in N^{*}(s^{*})\cap P}(1+\hat{\varepsilon})\,{\rm cost}_{c}^{*}(x)
+\sum_{s\in S_{l}}\sum_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{s^{*}\in S_{l}^{*}}\sum_{x\in N^{*}(s^{*})\setminus P}\left((1+\hat{\varepsilon})\,{\rm cost}_{c}^{*}(x)-{\rm cost}_{c}(x)\right)
\leq \sum_{s\in S_{l}}\sum_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{s^{*}\in S_{l}^{*}}\sum_{x\in N^{*}(s^{*})}(1+\hat{\varepsilon})\,{\rm cost}_{c}^{*}(x)-\sum_{s^{*}\in S_{l}^{*}}\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x),

where the third inequality follows from (5). ∎

Lemma 4.5.

For any s\in S_{l}\setminus\{s_{l}\} and s^{*}\in S^{*}_{l}, we have

-\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)\leq\sum_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+\sum_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x) \qquad (19)

for k-MedO, and

-\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)\leq\sum_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)+\sum_{x\in N^{*}(s^{*})}(1+\hat{\varepsilon})\,{\rm cost}_{c}^{*}(x)-\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x) \qquad (20)

for k-MeaO.

Proof.

The proof is similar to those for Lemmas 3.3 and 4.4. ∎

Next we will construct some swap operations for each pair (S_{l},S^{*}_{l}), and then apply Lemmas 4.4 and 4.5 to these swaps. Similar to the analysis for the penalty version, we consider two cases according to the size of S_{l}: |S_{l}|\leq\rho and |S_{l}|=m_{l}>\rho.

Note that the number of constructed swap operations will appear in the coefficient of {\rm cost}(S,P) after summing the inequalities in Lemmas 4.4 and 4.5. We want this number to be as small as possible, since the later analysis shows that it enters the approximation ratio proportionally. On the other hand, to account for the entire cost of the solution (S,P), we need to swap every center in S at least once. Thus, for the case of |S_{l}|\leq\rho, we use the same swap operations as in the analysis for the penalty version (restated as Case 1 below), since there each center in S_{l} is swapped exactly once.

For the case of |S_{l}|=m_{l}>\rho, the analysis for the penalty version uses m_{l}(m_{l}-1) single-swap operations. This makes the coefficient of the cost of (S^{*},P^{*}) small (m_{l}/(m_{l}-1)\rightarrow 1 as m_{l}\rightarrow+\infty), but the number of swaps is large. In this section, we consider two methods to construct swap operations for this case, stated as Methods 1 and 2 in the following Case 2. Note that Method 2 is the same as that in Section 3.

Case 1

(cf. Figure 1(a) for \rho=3). For each l with |S_{l}|=|S^{*}_{l}|\leq\rho, we construct the swap(S_{l},S^{*}_{l}) for k-MedO, and the swap(S_{l},\hat{S}^{*}_{l}) for k-MeaO.

Case 2.

For each l with |S_{l}|=|S^{*}_{l}|=m_{l}>\rho, let S_{l}=\{s_{l},s_{l,2},\dots,s_{l,m_{l}}\} and S^{*}_{l}=\{s^{*}_{l,1},s^{*}_{l,2},\dots,s^{*}_{l,m_{l}}\}.

  • Method 1

    (cf. Figure 2). Set

\psi(s^{*}):=\left\{\begin{array}{ll}s_{l,2},&{\rm if}\ s^{*}=s^{*}_{l,1};\\ s_{l,2},&{\rm if}\ s^{*}=s^{*}_{l,2};\\ s_{l,3},&{\rm if}\ s^{*}=s^{*}_{l,3};\\ \ \ \vdots&\ \ \vdots\\ s_{l,m_{l}},&{\rm if}\ s^{*}=s^{*}_{l,m_{l}}.\end{array}\right.

For each s^{*}\in S^{*}_{l}, we construct swap(\psi(s^{*}),s^{*}) for k-MedO, and swap(\psi(s^{*}),\hat{s}^{*}) for k-MeaO.

  • Method 2

(cf. Figure 1(b)). We consider (m_{l}-1)m_{l} pairs (s,s^{*}) with s\in S_{l}\setminus\{s_{l}\} and s^{*}\in S^{*}_{l}. For k-MedO, we construct swap(s,s^{*}) for each pair; for k-MeaO, we construct swap(s,\hat{s}^{*}) for each pair.

Figure 2: Single-swap operations for the case of |S_{l}|>\rho.
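The two constructions can be made concrete. Below is a small hypothetical Python sketch (list-based representation in which S_l[0] plays the role of s_l, and the optimal centers of the adapted cluster are listed in order) that enumerates the swap pairs produced by Method 1 and Method 2; Method 1 issues one swap per optimal center (m_l swaps, and s_l is never swapped out), while Method 2 issues (m_l-1)m_l swaps.

```python
def method1_swaps(S_l, S_l_star):
    # psi: both s*_{l,1} and s*_{l,2} map to s_{l,2};
    # s*_{l,j} maps to s_{l,j} for j >= 3.
    psi = {S_l_star[0]: S_l[1], S_l_star[1]: S_l[1]}
    for j in range(2, len(S_l)):
        psi[S_l_star[j]] = S_l[j]
    # one swap(psi(s*), s*) per optimal center; s_l = S_l[0] is never swapped out
    return [(psi[s_star], s_star) for s_star in S_l_star]

def method2_swaps(S_l, S_l_star):
    # every center in S_l \ {s_l} is paired with every optimal center
    return [(s, s_star) for s in S_l[1:] for s_star in S_l_star]
```

For m_l = 4 this yields 4 and 12 swap pairs, respectively, matching the counts used in the proofs of Theorems 4.6 and 4.7.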

Combining these swap operations, we obtain the main results for Algorithm 4.1, which are shown in the following two theorems.
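Algorithm 4.1 itself is specified earlier in the paper; for intuition only, the following is a minimal, hypothetical single-swap sketch of local search for k-means with z outliers (one-dimensional points, first-improvement rule, and the role of q simplified to a fixed constant). It keeps exactly z outliers, namely the z points farthest from their nearest center, whereas Algorithm 4.1 may additionally discard up to z extra outliers per accepted swap, which is the source of the outlier blowup in the bi-criteria guarantees. The acceptance rule mirrors the termination guarantee used here (cf. Proposition 4.1): a swap is taken only if it decreases the cost below a (1 - ε/q) fraction of the current cost.

```python
import itertools

def kmeans_cost(points, centers, outliers):
    """Squared-distance cost of the non-outlier points (1-D points)."""
    return sum(min((p - c) ** 2 for c in centers)
               for i, p in enumerate(points) if i not in outliers)

def farthest_z(points, centers, z):
    """Indices of the z points farthest from their nearest center."""
    order = sorted(range(len(points)),
                   key=lambda i: min((points[i] - c) ** 2 for c in centers),
                   reverse=True)
    return set(order[:z])

def local_search_outliers(points, candidates, k, z, eps=0.01, q=1.0):
    centers = list(candidates[:k])          # arbitrary initial solution
    P = farthest_z(points, centers, z)      # current outlier set
    improved = True
    while improved:
        improved = False
        for s, t in itertools.product(list(centers), candidates):
            if t in centers:
                continue
            trial = [c for c in centers if c != s] + [t]
            trial_P = farthest_z(points, trial, z)
            # accept only sufficiently improving swaps (cf. Proposition 4.1)
            if kmeans_cost(points, trial, trial_P) < (1 - eps / q) * kmeans_cost(points, centers, P):
                centers, P, improved = trial, trial_P, True
                break
    return centers, P
```

On a toy instance such as [0.0, 0.1, 0.2, 10.0, 10.1, 100.0] with k = 2 and z = 1, the procedure places one center in each of the two clusters and marks the point 100.0 as the outlier.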

Theorem 4.6.

Let (S,P) be the solution returned by LS-Multi-Swap-Outlier({\mathcal{X}},{\mathcal{F}},z,k,\rho,q,\varepsilon) for k-MedO. If (1+k)\varepsilon<q, then we have

{\rm cost}(S,P)\leq\frac{5}{1-(1+k)\varepsilon/q}\cdot{\rm cost}(S^{*},P^{*}). \qquad (21)

If (1+k^{2}-k)\varepsilon<q, then we have

{\rm cost}(S,P)\leq\frac{3+2/\rho}{1-(1+k^{2}-k)\varepsilon/q}\cdot{\rm cost}(S^{*},P^{*}). \qquad (22)
Theorem 4.7.

Let {\mathcal{C}}^{\prime} be an \hat{\varepsilon}-approximate centroid set for the data set {\mathcal{X}}, and let (S,P) be the solution returned by LS-Multi-Swap-Outlier({\mathcal{X}},{\mathcal{C}}^{\prime},z,k,\rho,q,\varepsilon) for k-MeaO. If (5+\hat{\varepsilon})(1+k)\varepsilon<(9+\hat{\varepsilon})q, then we have

{\rm cost}(S,P)\leq\frac{5+\hat{\varepsilon}}{\beta_{1}^{2}}\cdot{\rm cost}(S^{*},P^{*}), \qquad (23)

where

\beta_{1}=-\frac{2}{\sqrt{5+\hat{\varepsilon}}}+\sqrt{\frac{4}{5+\hat{\varepsilon}}+1-\frac{(1+k)\varepsilon}{q}}.

If (1+k^{2}-k)\varepsilon/q<(1+1/\rho)^{2}/(3+2/\rho+\hat{\varepsilon})+1, then we have

{\rm cost}(S,P)\leq\frac{3+2/\rho+\hat{\varepsilon}}{\beta_{2}^{2}}\cdot{\rm cost}(S^{*},P^{*}), \qquad (24)

where

\beta_{2}=-\frac{1+1/\rho}{\sqrt{3+2/\rho+\hat{\varepsilon}}}+\sqrt{\frac{(1+1/\rho)^{2}}{3+2/\rho+\hat{\varepsilon}}+1-\frac{(1+k^{2}-k)\varepsilon}{q}}.

Each of these two theorems gives two approximation ratios for Algorithm 4.1. The first one is obtained by Method 1, while the second one is obtained by Method 2.
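To see the limiting behavior of the two bounds in Theorem 4.7, one can evaluate them numerically; the following is a small Python sketch (the parameter values used below are illustrative only, not prescribed by the theorem).

```python
import math

def ratio_method1(k, eps, q, eps_hat=0.0):
    # (5 + eps_hat) / beta_1^2, with beta_1 as in Theorem 4.7
    a = 5.0 + eps_hat
    beta1 = -2.0 / math.sqrt(a) + math.sqrt(4.0 / a + 1.0 - (1 + k) * eps / q)
    return a / beta1 ** 2

def ratio_method2(k, eps, q, rho, eps_hat=0.0):
    # (3 + 2/rho + eps_hat) / beta_2^2, with beta_2 as in Theorem 4.7
    b = 3.0 + 2.0 / rho + eps_hat
    beta2 = (-(1 + 1.0 / rho) / math.sqrt(b)
             + math.sqrt((1 + 1.0 / rho) ** 2 / b + 1.0 - (1 + k * k - k) * eps / q))
    return b / beta2 ** 2
```

As \varepsilon,\hat{\varepsilon}\rightarrow 0 (and \rho\rightarrow\infty for the second bound), the first ratio tends to 25 and the second to 9, which is the content of Corollary 4.9 below.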

Proof of Theorem 4.6.

We first prove inequality (21). For Case 2, we use Method 1 to construct swap operations. Note that each point in S is swapped at most twice, and each point in S^{*} is swapped exactly once, so the number of constructed swap operations is at most |S^{*}|=k. Summing inequalities (17) and (19) over these swaps and using Proposition 4.1, we obtain

-\frac{k\varepsilon}{q}\cdot{\rm cost}(S,P)
\leq 2\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{s^{*}\in S^{*}}\left(\sum_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x)\right)
\leq 2\sum_{x\in{\mathcal{X}}\setminus(P\cup P^{*})}\left(d(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{x\in{\mathcal{X}}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum_{x\in{\mathcal{X}}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)
\leq 4\,{\rm cost}(S^{*},P^{*})+{\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)
= 5\,{\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x), \qquad (25)

where the third inequality follows from (11) and (13).

Using the definition of {\rm outlier}(\cdot,\cdot), we obtain

\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)\leq\sum_{x\in{\rm outlier}(S,P)}{\rm cost}_{c}(x)={\rm cost}(S,P)-{\rm cost}(S,P\cup{\rm outlier}(S,P))\leq\frac{\varepsilon}{q}\cdot{\rm cost}(S,P). \qquad (26)

Combining inequalities (25)-(26), we have

0\leq 5\,{\rm cost}(S^{*},P^{*})-\left(1-\frac{(1+k)\varepsilon}{q}\right){\rm cost}(S,P),

which is equivalent to (21) under the condition that (1+k)\varepsilon<q.

Next we will prove the inequality (22). For Case 2, we use Method 2 to construct swap operations. Let L_{1}:=\{l\,|\,|S_{l}|\leq\rho\} and L_{2}:=\{l\,|\,|S_{l}|>\rho\}. Summing inequality (17) with weight 1 and inequality (19) with weight 1/(m_{l}-1) over all constructed swap operations, and observing that m_{l}/(m_{l}-1)\leq(\rho+1)/\rho, we obtain

-\sum_{l\in L_{1}}\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)-\sum_{l\in L_{2}}\frac{1}{m_{l}-1}\cdot m_{l}(m_{l}-1)\cdot\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)
\leq\left(1+\frac{1}{\rho}\right)\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}\left(d(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})}{\rm cost}_{c}^{*}(x)-\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x). \qquad (27)

Note that there are at most k(k-1) constructed swap operations. It follows from 1/(m_{l}-1)\leq 1 that

{\rm LHS\ of\ (27)}\geq-\left(|L_{1}|+\sum_{l\in L_{2}}m_{l}(m_{l}-1)\right)\cdot\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)\geq-\frac{(k^{2}-k)\varepsilon}{q}\cdot{\rm cost}(S,P). \qquad (28)

Inequality (11) then yields the following upper bound for the RHS of (27).

{\rm RHS\ of\ (27)}\leq\left(3+\frac{2}{\rho}\right)\sum_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)=\left(3+\frac{2}{\rho}\right){\rm cost}_{c}^{*}-{\rm cost}_{c}+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x). \qquad (29)

Combining inequalities (26)-(29), we have

0\leq\left(3+\frac{2}{\rho}\right){\rm cost}(S^{*},P^{*})-\left(1-\frac{(1+k^{2}-k)\varepsilon}{q}\right){\rm cost}(S,P),

which is equivalent to (22) under the condition (1+k^{2}-k)\varepsilon<q. ∎

Proof of Theorem 4.7.

We first use Method 1 for Case 2 to prove (23). Similar to the proof for kk-MedO, we have

-\frac{k\varepsilon}{q}\cdot{\rm cost}(S,P)
\leq 2\sum_{x\in{\mathcal{X}}\setminus(P\cup P^{*})}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{x\in{\mathcal{X}}\setminus P^{*}}(1+\hat{\varepsilon})\,{\rm cost}_{c}^{*}(x)-\sum_{x\in{\mathcal{X}}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)
\leq 4\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}^{*}_{c}(x)+4\sqrt{\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}^{*}_{c}(x)}\cdot\sqrt{\sum_{x\in\mathcal{X}\setminus(P\cup P^{*})}{\rm cost}_{c}(x)}
+\sum_{x\in{\mathcal{X}}\setminus P^{*}}(1+\hat{\varepsilon})\,{\rm cost}_{c}^{*}(x)-\sum_{x\in{\mathcal{X}}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)
\leq 4\sqrt{{\rm cost}(S^{*},P^{*})}\cdot\sqrt{{\rm cost}(S,P)}+(5+\hat{\varepsilon})\,{\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\frac{\varepsilon}{q}\cdot{\rm cost}(S,P), \qquad (30)

where the second inequality follows from Lemma 3.1 (this lemma still holds for the outlier version of kk-means), and the third inequality follows from (26).

When (5+\hat{\varepsilon})(1+k)\varepsilon<(9+\hat{\varepsilon})q, it follows by factorization that inequality (30) is equivalent to

0\leq\left(\sqrt{(5+\hat{\varepsilon})\,{\rm cost}(S^{*},P^{*})}+\alpha\sqrt{{\rm cost}(S,P)}\right)\times\left(\sqrt{(5+\hat{\varepsilon})\,{\rm cost}(S^{*},P^{*})}-\beta_{1}\sqrt{{\rm cost}(S,P)}\right), \qquad (31)

where

\alpha=\frac{2}{\sqrt{5+\hat{\varepsilon}}}+\sqrt{\frac{4}{5+\hat{\varepsilon}}+1-\frac{(1+k)\varepsilon}{q}},\qquad
\beta_{1}=-\frac{2}{\sqrt{5+\hat{\varepsilon}}}+\sqrt{\frac{4}{5+\hat{\varepsilon}}+1-\frac{(1+k)\varepsilon}{q}}.
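The factorization can be checked directly: writing a=5+\hat{\varepsilon}, x=\sqrt{{\rm cost}(S,P)} and y=\sqrt{{\rm cost}(S^{*},P^{*})}, the product in (31) expands as

```latex
\left(\sqrt{a}\,y+\alpha x\right)\left(\sqrt{a}\,y-\beta_{1}x\right)
= a\,y^{2}+\sqrt{a}\,(\alpha-\beta_{1})\,xy-\alpha\beta_{1}\,x^{2}
= a\,y^{2}+4xy-\left(1-\frac{(1+k)\varepsilon}{q}\right)x^{2},
```

since \alpha-\beta_{1}=4/\sqrt{a} and \alpha\beta_{1}=\left(4/a+1-(1+k)\varepsilon/q\right)-4/a=1-(1+k)\varepsilon/q; this is exactly inequality (30) rearranged with all terms on one side.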

Since the first factor on the RHS of (31) is non-negative, we obtain

\sqrt{(5+\hat{\varepsilon})\,{\rm cost}(S^{*},P^{*})}-\beta_{1}\sqrt{{\rm cost}(S,P)}\geq 0,

which gives (23).

Next we prove inequality (24). For Case 2, we use Method 2 to construct swap operations. Similar to the proof of Theorem 4.6, summing the inequality of Lemma 4.4 for k-MeaO with weight 1 and inequality (20) with weight 1/(m_{l}-1) over all constructed swap operations implies that

-\sum_{l\in L_{1}}\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)-\sum_{l\in L_{2}}\frac{1}{m_{l}-1}\cdot m_{l}(m_{l}-1)\cdot\frac{\varepsilon}{q}\cdot{\rm cost}(S,P)
\leq\left(1+\frac{1}{\rho}\right)\sum_{s\in S}\sum_{x\in N(s)\setminus P^{*}}\left(d^{2}(\phi({\rm cent}_{\mathcal{C}}(N^{*}_{q}(s^{*}_{x}))),x)-{\rm cost}_{c}(x)\right)
+\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})}(1+\hat{\varepsilon})\,{\rm cost}_{c}^{*}(x)-\sum_{s^{*}\in S^{*}}\sum_{x\in N^{*}(s^{*})\setminus P}{\rm cost}_{c}(x). \qquad (32)

Because of Lemma 3.1, the RHS of (32) is bounded from above by

{\rm RHS\ of\ (32)}\leq\left(3+\frac{2}{\rho}+\hat{\varepsilon}\right)\sum_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)-\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)
+2\left(1+\frac{1}{\rho}\right)\sqrt{\sum_{x\in\mathcal{X}\setminus P^{*}}{\rm cost}_{c}^{*}(x)}\cdot\sqrt{\sum_{x\in\mathcal{X}\setminus P}{\rm cost}_{c}(x)}
=\left(3+\frac{2}{\rho}+\hat{\varepsilon}\right){\rm cost}(S^{*},P^{*})-{\rm cost}(S,P)+\sum_{x\in P^{*}\setminus P}{\rm cost}_{c}(x)
+2\left(1+\frac{1}{\rho}\right)\sqrt{{\rm cost}(S^{*},P^{*})\,{\rm cost}(S,P)}. \qquad (33)

Combining inequalities (26), (28), (32) and (33), we have

0\leq\left(3+\frac{2}{\rho}+\hat{\varepsilon}\right){\rm cost}(S^{*},P^{*})-\left(1-\frac{(1+k^{2}-k)\varepsilon}{q}\right){\rm cost}(S,P)+2\left(1+\frac{1}{\rho}\right)\sqrt{{\rm cost}(S^{*},P^{*})\,{\rm cost}(S,P)}.

Using the factorization for this inequality and the condition in this theorem, we obtain the desired result. ∎

Consequently, we have the following corollaries that specify the tradeoff between the approximation ratio and the outlier blowup.

Corollary 4.8.

There exist a bi-criteria (5+\varepsilon,O(\frac{k}{\varepsilon}\log(n\delta)))- and a bi-criteria (3+\varepsilon,O(\frac{k^{2}}{\varepsilon}\log(n\delta)))-approximation algorithm for k-MedO.

Proof.

If q\geq k+1, then

\frac{5}{1-(1+k)\varepsilon/q}\leq\frac{5}{1-\varepsilon}\sim 5+O(\varepsilon).

If q\geq k^{2}-k+1 and \rho\geq 2/\varepsilon, then

\frac{3+2/\rho}{1-(1+k^{2}-k)\varepsilon/q}\leq\frac{3+\varepsilon}{1-\varepsilon}\sim 3+O(\varepsilon).

Combining the above results, Theorems 4.3 and 4.6 complete the proof. ∎
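As a quick numerical sanity check of Corollary 4.8 (a sketch; the parameter values below are illustrative only), the two bounds of Theorem 4.6 can be evaluated at the choices q=k+1 and q=k^{2}-k+1 used in the proof:

```python
def kmedo_ratio_method1(k, eps):
    q = k + 1                      # so (1 + k) * eps / q = eps
    return 5.0 / (1.0 - (1 + k) * eps / q)

def kmedo_ratio_method2(k, eps, rho):
    q = k * k - k + 1              # so (1 + k^2 - k) * eps / q = eps
    return (3.0 + 2.0 / rho) / (1.0 - (1 + k * k - k) * eps / q)
```

For eps = 0.01, k = 20 and rho = 200, these evaluate to roughly 5.05 and 3.04, i.e. 5 + O(\varepsilon) and 3 + O(\varepsilon).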

Corollary 4.9.

There exist a bi-criteria (25+\varepsilon,O(\frac{k}{\varepsilon}\log(n\delta)))- and a bi-criteria (9+\varepsilon,O(\frac{k^{2}}{\varepsilon}\log(n\delta)))-approximation algorithm for k-MeaO.

Proof.

Recall the definitions of \beta_{1} and \beta_{2} in Theorem 4.7. We then have

\frac{5+\hat{\varepsilon}}{\beta_{1}^{2}}\sim 25+O(\varepsilon+\hat{\varepsilon})

when q=k+1, and

\frac{3+2/\rho+\hat{\varepsilon}}{\beta_{2}^{2}}\sim 9+O(\varepsilon+\hat{\varepsilon})

when q=k^{2}-k+1 and \rho is sufficiently large. Combining these results, Theorems 4.3 and 4.7 complete the proof. ∎

5 Conclusions

The previous analyses of local search algorithms for the robust k-median/k-means problems use only the individual form, in which the connections constructed between the local optimal solution and the global optimal solution are individual for each point. This has the disadvantage that the joint information about outliers remains hidden. In this paper, we develop a cluster-form analysis and define the adapted cluster, which captures the outlier information. This new technique works better than the previous analysis methods for local search algorithms: it improves the approximation ratios of local search algorithms for k-MeaP, k-MeaO and k-MedO, and matches the best known ratio for k-MedP.

We believe that our new technique will also work for the robust facility location problem (FLP), since the structure of FLP is similar to that of k-median/k-means. Our technique also seems promising for the robust k-center problem, and indeed for any local-search-based algorithm for robust clustering problems.

6 Acknowledgments

The first author is supported by the NSFC under Grant No. 12001039. The second author is supported by the Science Foundation of the Anhui Education Department under Grant No. KJ2019A0834. The third author is supported by the NSFC under Grant No. 11971349. The fourth and fifth authors are supported by the NSFC under Grant No. 11871081. The fourth author is also supported by Beijing Natural Science Foundation Project under Grant No. Z200002.
