
WaveCluster with Differential Privacy

Ling Chen #1, Ting Yu #1,2, Rada Chirkova #1
#1 Department of Computer Science, North Carolina State University, Raleigh, USA
#2 Qatar Computing Research Institute, Doha, Qatar
1 lchen10@ncsu.edu, 1,2 tyu@{ncsu.edu, qf.org.qa}, 1 rychirko@ncsu.edu
Abstract

WaveCluster is an important family of grid-based clustering algorithms that are capable of finding clusters of arbitrary shapes. In this paper, we investigate techniques to perform WaveCluster while ensuring differential privacy. Our goal is to develop a general technique for achieving differential privacy on WaveCluster that accommodates different wavelet transforms. We show that straightforward techniques based on synthetic data generation and introduction of random noise when quantizing the data, though generally preserving the distribution of data, often introduce too much noise to preserve useful clusters. We then propose two optimized techniques, PrivTHR and PrivTHREM, which can significantly reduce data distortion during two key steps of WaveCluster: the quantization step and the significant grid identification step. We conduct extensive experiments based on four datasets that are particularly interesting in the context of clustering, and show that PrivTHR and PrivTHREM achieve high utility when privacy budgets are properly allocated.

1 Introduction

Clustering is an important class of data analysis that has been extensively applied in a variety of fields, such as identifying different groups of customers in marketing and grouping homologous gene sequences in biology research [21]. Clustering results allow data analysts to gain valuable insights into data distribution when it is challenging to make hypotheses on raw data. Among various clustering techniques, a grid-based clustering algorithm called WaveCluster [35, 36] is famous for detecting clusters of arbitrary shapes. WaveCluster relies on wavelet transforms, a family of convolutions with appropriate kernel functions, to convert data into a transformed space, where the natural clusters in the data become more distinguishable.

In many data-analysis scenarios, when the data being analyzed contains personal information and the result of the analysis needs to be shared with the public or untrusted third parties, sensitive private information may be leaked, e.g., whether certain personal information is stored in a database or has contributed to the analysis. Consider the databases A and B in Figure 1. These two databases have two attributes, Monthly Income and Monthly Living Expenses, and the records differ only in one record, u. Without u’s participation in database A, WaveCluster identifies two separate clusters, marked by blue and red, respectively. With u’s participation, WaveCluster identifies only one cluster marked by color blue from database B. Therefore, merely from the number of clusters returned (rather than which data points belong to which cluster), an adversary may infer a user’s participation. Due to such potential leak of private information, data holders may be reluctant to share the original data or data-analysis results with each other or with the public.

Figure 1: Example of personal privacy breach in cluster analysis. [Panels (a) and (b) omitted.]

In this paper, we develop techniques to perform WaveCluster with differential privacy [12, 14]. Differential privacy provides a provably strong privacy guarantee that the output of a computation is insensitive to any particular individual. In other words, based on the output, an adversary has limited ability to infer whether an individual is present or absent in the dataset. Differential privacy is often achieved by the perturbation of randomized algorithms, and the privacy level is controlled by a parameter ε called the “privacy budget”. Intuitively, the privacy protection grows stronger as ε grows smaller.

WaveCluster provides a framework that allows any kind of wavelet transform to be plugged in for data transformation, such as the Haar transform [4] and Biorthogonal transform [28]. There are various wavelet transforms that are suitable for different types of applications, such as image compression and signal processing [5]. Plugged in different wavelet transforms, WaveCluster can leverage different properties of the data, such as frequency and location, for finding the dense regions as clusters. Thus, in this paper, we aim to develop a general technique for achieving differential privacy on WaveCluster that accommodates different wavelet transforms.

Figure 2: Inaccurate clustering result produced by Baseline. [Panels (a) Original and (b) Baseline omitted.] (a) shows the WaveCluster results on the original data and (b) shows the WaveCluster results of Baseline, which leverages the adaptive-grid [33] approach to generate the synthetic data. Points in different clusters are shown in different colors, and the points marked in red are considered noise that does not form a cluster.

We first consider a general technique, Baseline, that adapts existing differentially private data-publishing techniques to WaveCluster through synthetic data generation. Specifically, we could generate synthetic data based on any data model of the original data that is published under differential privacy, and then apply WaveCluster with any wavelet transform over the synthetic data. Baseline seems particularly promising, as many effective differentially private data-publishing techniques have been proposed in the literature, all of which strive to preserve important properties of the original data. One would therefore hope that the “shape” of the original data is also preserved in the synthetic data, and consequently could be discovered by WaveCluster. Unfortunately, as we show later in the paper, this synthetic data-generation technique often cannot produce accurate results. Differentially private data-publishing techniques such as spatial decompositions [10], adaptive-grid [33], and Privelet [39] output noisy descriptions of the data distribution that often contain negative counts for sparse partitions due to random noise. These negative counts do not affect the accuracy of large range queries (often one of the main utility measures in private data publishing), since the zero-mean noise distribution smooths out the effect of negative counts. However, negative counts cannot be smoothed away in the synthesized dataset, where they are typically reset to zero. Figure 2 shows an example of inaccurate clustering results produced by Baseline using adaptive-grid [33]. As we can see, the synthetic data generated by Baseline significantly distorts the data distribution, causing two clusters to be merged into one and reducing the accuracy of the WaveCluster results.

Motivated by the above challenge, we propose three techniques that enforce differential privacy on the key steps of WaveCluster, rather than relying on synthetic data generation. WaveCluster accepts as input a set of data points in a multi-dimensional space, and consists of the following main steps. First, in the quantization step, WaveCluster quantizes the multi-dimensional space by dividing the space into grids and computes the count of the data points in each grid. These grid counts form a count matrix M. Second, in the wavelet transform step, WaveCluster applies a wavelet transform to the count matrix M to obtain the approximation of the multi-dimensional space. Third, in the significant grid identification step, WaveCluster identifies significant grids based on the pre-defined density threshold. Fourth, in the cluster identification step, WaveCluster outputs as clusters the connected components formed from these significant grids [23]. To enforce differential privacy on WaveCluster, we first propose a technique, PrivQT, that introduces Laplacian noise in the quantization step. However, such straightforward privacy enforcement cannot produce usable private WaveCluster results, since the noise introduced in this step significantly distorts the density threshold for identifying significant grids. To address this issue, we further propose two techniques, PrivTHR and PrivTHREM, which enforce differential privacy on both the quantization step and the significant grid identification step. The two techniques differ in how they determine the noisy density threshold. We show that by allocating appropriate budgets between these two steps, both techniques achieve differential privacy with significantly improved utility.

Traditionally, the effectiveness of WaveCluster is evaluated through visual inspection by human experts (i.e., visually determining whether the discovered clusters match those reflected in the user’s mind) [35, 36]. Unfortunately, visual inspection is inappropriate for assessing the utility of differentially private WaveCluster. Visual inspection is not quantitative, and thus it is hard to systematically compare the impact of different techniques through visual inspection. Generally, researchers use quantitative measures to assess the utility of differentially private results, such as relative or absolute errors for range queries and prediction accuracy for classification. But there are no existing utility measures for density-based clustering algorithms with differential privacy.

To mitigate this problem, in this paper we propose two types of utility measures. The first is to measure the dissimilarity between the true and private WaveCluster results by measuring the differences in significant grids and clusters, which correspond to the outputs of the two key steps (significant grid identification and cluster identification) of WaveCluster. To more intuitively understand the usefulness of discovered clusters, our second utility measure considers one concrete application of cluster analysis, i.e., building a classifier based on the discovered clusters and then using that classifier to predict future data. The prediction accuracy of the classifier therefore reflects one aspect of the actual utility of private WaveCluster.

To evaluate the proposed techniques, our experiments use four datasets containing different data shapes that are particularly interesting in the context of clustering [1, 9]. Our results show that PrivTHR and PrivTHREM achieve high utility for both types of utility measures, and are superior to Baseline and PrivQT.

2 Related Work

Syntactic approaches for privacy-preserving clustering [18] output k-anonymous clusters. Friedman et al. [17] presented an algorithm that outputs k-anonymous clusters using a minimum spanning tree. Karakasidis et al. [24] created k-anonymous clusters by merging clusters so that each cluster contains at least k key values of the records. Fung et al. [19] proposed an approach that converts the anonymity problem for cluster analysis into the counterpart problem for classification analysis. Aggarwal et al. [3] proposed a perturbation method called r-gather clustering, which releases the cluster centers together with their sizes, radii, and a set of associated sensitive values. However, these approaches only satisfy syntactic privacy notions such as k-anonymity and cannot provide formal privacy guarantees like differential privacy.

In this work, our goal is to perform WaveCluster under differential privacy. The focus of initial work on differential privacy [12, 14, 25, 13, 15] concerned the theoretical proof of its feasibility on various data analysis tasks, e.g., histogram and logistic regression.

More recent work has focused on practical applications of differential privacy for privacy-preserving data publishing. An approach proposed by Barak et al. [7] encoded marginals with Fourier coefficients and then added noise to the released coefficients. Hay et al. [22] exploited consistency constraints to reduce noise for histogram counts. Xiao et al. [39] proposed Privelet, which uses wavelet transforms to reduce noise for histogram counts. Cormode et al. [10] indexed data by kd-trees and quad-trees, developing effective budget allocation strategies for building the noisy trees and obtaining noisy counts for the tree nodes. Qardaji et al. [33] proposed uniform-grid and adaptive-grid methods to derive appropriate partition granularity in differentially private synopsis publishing. Xu et al. [40] proposed the NoiseFirst and StructureFirst techniques for constructing optimal noisy histograms, using dynamic programming and Exponential mechanism. These data publishing techniques are specifically crafted for answering range queries. Unfortunately, synthesizing the dataset and applying WaveCluster on top of it often render WaveCluster results useless, since these differentially private data publishing techniques do not capture the essence of WaveCluster and introduce too much unnecessary noise for WaveCluster.

Another important line of prior work focuses on integrating differential privacy into other practical data analysis tasks, such as regression analysis, model fitting, and classification. Chaudhuri et al. [8] proposed a differentially private regularized logistic regression algorithm that balances privacy with learnability. Zhang et al. [42] proposed a differentially private approach for logistic and linear regressions that perturbs the objective function of the regression model, rather than simply introducing noise into the results. Friedman et al. [16] incorporated differential privacy into several types of decision trees and demonstrated the tradeoff among privacy, accuracy, and sample size. Using decision trees as an example application, Mohammed et al. [31] investigated a generalization-based algorithm for achieving differential privacy for classification problems.

Differentially private cluster analysis has also been studied in prior work. Zhang et al. [41] proposed differentially private model fitting based on genetic algorithms, with applications to k-means clustering. McSherry [29] introduced the PINQ framework, which has been applied to achieve differential privacy for k-means clustering using an iterative algorithm [38]. Nissim et al. [32] proposed the sample-aggregate framework, which calibrates the noise magnitude according to the smooth sensitivity of a function. They showed that their framework can be applied to k-means clustering under the assumption that the dataset is well-separated. These research efforts primarily focus on centroid-based clustering, such as k-means, which is best suited for separating convex clusters and provides insufficient spatial information to detect clusters with complex shapes, e.g., concave shapes. In contrast, we propose techniques that enforce differential privacy on WaveCluster, which is not restricted to well-separated datasets and can detect clusters with arbitrary shapes.

3 Preliminaries

In this section, we first present the background of differential privacy. Then we describe the WaveCluster algorithm followed by our problem statement.

3.1 Differential Privacy

Differential privacy [12] is a recent privacy model, which guarantees that an adversary cannot infer an individual’s presence in a dataset from the randomized output, despite having knowledge of all remaining individuals in the dataset.

Definition 1

(ε-differential privacy): Given any pair of neighboring databases D and D′ that differ in only one individual record, a randomized algorithm A is ε-differentially private iff for any S ⊆ Range(A):

Pr[A(D) ∈ S] ≤ Pr[A(D′) ∈ S] · e^ε

The parameter ε indicates the level of privacy: smaller ε provides stronger privacy. When ε is very small, e^ε ≈ 1 + ε. Since the value of ε directly affects the level of privacy, we refer to it as the privacy budget. Appropriate allocation of the privacy budget across a computational process is important for reaching a favorable trade-off between privacy and utility. The most common strategy to achieve ε-differential privacy is to add noise to the output of a function. The magnitude of the introduced noise is calibrated by the privacy budget ε and the sensitivity of the query function, defined as the maximum difference between the outputs of the query function on any pair of neighboring databases:

Δf = max_{D,D′} ‖f(D) − f(D′)‖₁

There are two common approaches for achieving ε-differential privacy: the Laplace mechanism [14] and the Exponential mechanism [30].

Laplace Mechanism: The output of a query function f is perturbed by adding noise drawn from the Laplace distribution with probability density function f(x|b) = (1/2b)·exp(−|x|/b), where b = Δf/ε. The following randomized mechanism A_l satisfies ε-differential privacy:

A_l(D) = f(D) + Lap(Δf/ε)
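As an illustration (not part of the paper), the Laplace mechanism can be sketched in a few lines of Python; the sampler draws Lap(b) as the difference of two exponential variates, a standard identity:

```python
import random

def laplace_noise(scale: float, rng=random) -> float:
    """Sample Lap(0, scale): the difference of two Exp(1) variates,
    multiplied by the scale b, is Laplace-distributed with scale b."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release f(D) + Lap(Δf/ε); the noise scale is sensitivity/epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon)
```

For a count query (sensitivity 1) with budget ε = 0.5, each released count is the true count plus Lap(2) noise.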

Exponential Mechanism: This mechanism returns an output that is close to the optimum with respect to a quality function. A quality function q(D, r) assigns a score to each possible output r ∈ R, where R is the output range of f, and better outputs receive higher scores. A randomized mechanism A_e that outputs r ∈ R with probability

Pr[A_e(D) = r] ∝ exp(ε·q(D, r) / (2·S(q)))

satisfies ε-differential privacy, where S(q) is the sensitivity of the quality function.
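A minimal Python sketch of the Exponential mechanism (illustrative, not from the paper) samples an output index with probability proportional to the exponentiated, scaled quality scores; the maximum score is subtracted first for numerical stability, which does not change the sampling probabilities:

```python
import math
import random

def exponential_mechanism(quality_scores, epsilon, sensitivity=1.0, rng=random):
    """Sample an index r with Pr[r] ∝ exp(ε·q(D,r) / (2·S(q)))."""
    m = max(quality_scores)  # shift scores so the largest weight is 1
    weights = [math.exp(epsilon * (q - m) / (2.0 * sensitivity))
               for q in quality_scores]
    target = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if target <= acc:
            return i
    return len(weights) - 1  # guard against floating-point rounding
```

With a large budget the mechanism almost always returns the highest-scoring output; with a small budget the choice approaches uniform.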

Differential privacy has two composition properties: sequential composition and parallel composition. Sequential composition: given n independent randomized mechanisms A₁, A₂, …, Aₙ where Aᵢ (1 ≤ i ≤ n) satisfies εᵢ-differential privacy, the sequence of the Aᵢ over the same dataset D satisfies ε-differential privacy, where ε = Σᵢ εᵢ. Parallel composition: given n independent randomized mechanisms A₁, A₂, …, Aₙ where each Aᵢ satisfies ε-differential privacy, applying the Aᵢ over a set of disjoint datasets Dᵢ satisfies ε-differential privacy.
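The two composition properties amount to simple budget accounting, as the following sketch makes explicit (illustrative only; the max rule for parallel composition covers the general case of unequal budgets, of which the equal-budget statement above is a special case):

```python
def sequential_budget(epsilons):
    """Sequential composition over the same data: budgets add up."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Parallel composition over disjoint partitions: the overall cost
    is the largest individual budget."""
    return max(epsilons)
```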

Figure 3: Illustration of WaveCluster. [Figure omitted.]

3.2 WaveCluster

WaveCluster is an algorithm developed by Sheikholeslami et al. [35, 36] for the purpose of clustering spatial data. It works by using a wavelet transform to detect the boundaries between clusters. A wavelet transform allows the algorithm to distinguish between areas of high contrast (high frequency components) and areas of low contrast (low frequency components). The motivation behind this distinction is that within a cluster there should be low contrast and between clusters there should be an area of high contrast (the border). WaveCluster has the following steps as shown in Figure 3:

Quantization: Quantize the feature space into grids of a specified size, creating a count matrix M.

Wavelet Transform: Apply a wavelet transform, such as the Haar transform [4] or the Biorthogonal transform [28], to the count matrix M, decomposing M into the average subband, which gives the approximation of the count matrix, and the detail subband, which carries information about the boundaries of clusters. We refer to the average subband as the wavelet-transformed-value matrix (W).

Significant Grid Identification: Identify the significant grids from the average subband W. WaveCluster constructs a sorted list L of the positive wavelet-transformed values obtained from W and computes the p-th percentile of the values in L. The values below the p-th percentile of L are non-significant; their corresponding grids are considered non-significant grids, and the data points in non-significant grids are considered noise.

Cluster Identification: Identify clusters from the significant grids using the connected component labeling algorithm [23] (two grids are connected if they are adjacent), map the clusters back to the original multi-dimensional space, and label the data points based on the cluster in which they reside.
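The four steps above can be sketched end-to-end for 2-D data. The following Python sketch is illustrative only: it assumes points lie in the unit square, uses the unnormalized 2-D Haar average subband (the mean of each 2×2 block), and a simplified top-k thresholding rule; all function names are ours, not the paper's:

```python
from collections import deque

def quantize(points, num_grid, lo=0.0, hi=1.0):
    """Step 1: partition [lo,hi)^2 into num_grid x num_grid cells, count points."""
    M = [[0] * num_grid for _ in range(num_grid)]
    cell = (hi - lo) / num_grid
    for x, y in points:
        i = min(int((x - lo) / cell), num_grid - 1)
        j = min(int((y - lo) / cell), num_grid - 1)
        M[i][j] += 1
    return M

def haar_average_subband(M):
    """Step 2: one-level 2-D Haar approximation (mean of each 2x2 block)."""
    n = len(M)
    return [[(M[2*i][2*j] + M[2*i+1][2*j] + M[2*i][2*j+1] + M[2*i+1][2*j+1]) / 4.0
             for j in range(n // 2)] for i in range(n // 2)]

def significant_grids(W, p):
    """Step 3: keep grids whose value is among the top k = (1-p)|L| positives."""
    L = sorted(v for row in W for v in row if v > 0)
    k = int((1 - p) * len(L))
    if k == 0:
        return set()
    threshold = L[len(L) - k]  # k-th largest positive value
    return {(i, j) for i, row in enumerate(W)
            for j, v in enumerate(row) if v >= threshold}

def connected_clusters(grids):
    """Step 4: label 4-connected components among the significant grids."""
    seen, clusters = set(), []
    for g in grids:
        if g in seen:
            continue
        comp, q = [], deque([g])
        seen.add(g)
        while q:
            i, j = q.popleft()
            comp.append((i, j))
            for nb in ((i+1, j), (i-1, j), (i, j+1), (i, j-1)):
                if nb in grids and nb not in seen:
                    seen.add(nb)
                    q.append(nb)
        clusters.append(comp)
    return clusters
```

Two dense, well-separated point groups yield two connected components, matching the intuition behind Figure 1.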

In WaveCluster, users need to specify four parameters:

num_grid (g₁, g₂, …, gₙ): the number of grids that the n-dimensional space is partitioned into along each dimension. For brevity, we simply use g to refer to the partitions (g₁, g₂, …, gₙ) of the n-dimensional space. This parameter controls the scaling of quantization. Inappropriate scaling can cause over-quantization and under-quantization, affecting the accuracy of clustering [36].

density threshold (p): a percentage value p specifying that p% of the values in L are non-significant. For ease of presentation, we use k = (1 − p)|L| to denote the number of top values in L; their corresponding grids are considered significant grids.

level: a wavelet decomposition level, which indicates how many times a wavelet transform is applied. The larger the level is, the more approximate the result is. In our techniques, we set level to 1 since a smaller level value provides more accurate results [36].

wavelet: the wavelet transform to be applied. The Haar transform [4] is one of the simplest and most widely used wavelet transforms; it is computed by iteratively taking differences and averages between odd and even samples of a signal (or a sequence of data points). Other commonly used wavelet transforms include the Biorthogonal transform [28], the Daubechies transform [11], and so on.
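The averaging-and-differencing view of the Haar transform can be shown in a few lines. This is an illustrative sketch of the unnormalized one-level variant (average (a+b)/2 and detail (a−b)/2 per adjacent pair), not the paper's implementation:

```python
def haar_1d(signal):
    """One-level (unnormalized) Haar decomposition of an even-length
    sequence: for each adjacent pair (a, b), the average subband holds
    (a+b)/2 and the detail subband holds (a-b)/2."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    average = [(a + b) / 2.0 for a, b in pairs]
    detail = [(a - b) / 2.0 for a, b in pairs]
    return average, detail
```

For example, haar_1d([4, 2, 6, 6]) yields the average subband [3.0, 6.0] and the detail subband [1.0, 0.0]; the large detail value marks the contrast between the first pair.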

Motivating Scenario. Consider a scenario with two participants: the data owner (e.g., a hospital) and the querier (e.g., a data miner). The data owner holds the raw data and has the legal obligation to protect individuals’ privacy, while the querier is eager to obtain cluster analysis results for further exploration. The goal of our work is to enable the data owner to release cluster analysis results using WaveCluster without compromising the privacy of any individual who contributes to the raw data. The data owner has good knowledge of the raw data, so it is not difficult for her to pick appropriate parameters (e.g., num_grid, density threshold, and wavelet) for non-private WaveCluster. For example, the data owner may draw on her past experience with similar data to determine the appropriate parameters for the current dataset. The parameters picked for the non-private setting are used directly in the private setting, so the data owner does not need to infer another set of parameters for the private setting.

Problem Statement. Given a raw dataset D, appropriate WaveCluster parameters for D, and a privacy budget ε, our goal is to investigate an effective approach A such that A (1) satisfies ε-differential privacy, and (2) achieves high utility of the private WaveCluster results with regard to the utility metrics U.

4 Approaches

In this section, we present four techniques for achieving differential privacy on WaveCluster. We first describe the Baseline technique, which achieves differential privacy through synthetic data generation. We then describe three techniques that enforce differential privacy on the key steps of WaveCluster.

4.1 Baseline Approach (Baseline)

A straightforward technique to achieve differential privacy on WaveCluster is as follows: (1) adapt an existing ε-differentially private data publishing method to obtain a noisy description of the data distribution in some fashion, such as a set of contingency tables or a spatial decomposition tree [40, 39, 10, 33]; (2) generate a synthetic dataset according to the noisy description; (3) apply WaveCluster on the synthetic dataset. We refer to this technique as Baseline; its pseudocode is shown in Algorithm 1.

Algorithm 1 Baseline
Input: Dataset D, num_grid g, density threshold p, wavelet transform w, privacy budget ε
Output: A set of differentially private clusters
1: procedure Baseline(D, g, p, w, ε)
2:   D′ = DiffPrivPublishing(D, ε)
3:   M′ = Quantization(D′, g)
4:   W′ = WaveletTransform(M′, w)
5:   L′ = ConvertToPosSortedArray(W′)
6:   k′ = (1 − p)|L′|
7:   d′ = Top(k′, L′)
8:   return ConnCompLabel(W′, d′)
9: end procedure

Baseline first leverages an ε-differentially private data publishing method to obtain a noisy dataset D′ (Line 2) and partitions D′ based on the number of grids g to obtain the noisy count matrix M′ (Line 3). Baseline then applies a wavelet transform on M′ to obtain W′ (Line 4). W′ is then turned into a list L′ that keeps only the positive values, sorted in ascending order (Line 5). From L′, k′ is computed based on the specified density threshold p and the size of L′ (Line 6). Finally, Baseline obtains d′ as the top-k′-th value in L′ (Line 7), where any value in L′ greater than d′ is considered significant, and applies the connected component labeling algorithm to identify clusters of significant grids (Line 8).

Discussion. Baseline achieves differential privacy on WaveCluster through differentially private data publishing. However, it does not produce accurate WaveCluster results in most cases. The adapted ε-differentially private data publishing method is designed for answering range queries. The noisy descriptions of the data distribution generated by the method may contain negative counts for certain partitions, since the noise distribution is Laplacian with zero mean. These negative counts do not affect range query accuracy much, since the zero-mean noise distribution smooths out the effect of the noise. For example, suppose a partition p₁ has a true count of 2 and a noisy count of −2; its noise is canceled by another partition p₂ with a true count of 10 and a noisy count of 14 when both p₁ and p₂ are included in a range query. In particular, when a range query spans a large portion of the dataset, a single partition with a noisy negative count does not affect its accuracy much. However, when the method is used to generate a synthetic dataset, the noisy negative counts are reset to zero, causing the data distribution to change radically as a whole and leading to severe deviation in the differentially private WaveCluster results.
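The bias introduced by resetting negative counts to zero can be seen with a small simulation. The sketch below (illustrative, not from the paper) adds Lap(1/ε) noise to a block of empty partitions and clips negatives, as synthetic data generation must; the clipped total drifts far above the true total of zero, whereas unclipped noise would average out:

```python
import random

def clipped_synthetic_total(true_counts, epsilon, seed=0):
    """Add Lap(1/epsilon) noise to each count (sampled as the difference
    of two exponential variates), then clip negatives to zero, as a
    synthetic-data generator must. Returns the total synthetic count."""
    rng = random.Random(seed)
    total = 0.0
    for c in true_counts:
        noisy = c + (rng.expovariate(epsilon) - rng.expovariate(epsilon))
        total += max(0.0, noisy)  # negative counts are reset to zero
    return total
```

With 1000 empty partitions and ε = 0.5, the expected clipped total is about 1000 phantom points (b/2 per partition with b = 2), which is exactly the kind of distortion that merges the two clusters in Figure 2.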

4.2 Private Quantization (PrivQT)

To address the challenge faced by Baseline, we propose techniques that enforce differential privacy on the key steps of WaveCluster. Our first approach, called Private Quantization (PrivQT), introduces independent Laplacian noise in the quantization step to achieve differential privacy. In the quantization step, the data is divided into grids and the count matrix M is computed. To ensure differential privacy in this step, we rely on the Laplace mechanism, which introduces independent Laplacian noise into M. Clearly, if we change one individual in the input data, by adding, removing, or modifying an individual, at most one entry of M changes. According to the parallel composition property of differential privacy, the noise amount introduced into each grid is Lap(1/ε), given a privacy budget ε. Since the subsequent steps of WaveCluster are carried out on the differentially private count matrix M′, the clusters derived from these steps are also differentially private. Algorithm 2 shows the pseudocode of PrivQT. Aside from the first step, which introduces independent Laplacian noise into M (Line 2), the remaining steps (Lines 3-7) are the same as in Baseline.

Algorithm 2 PrivQT
Input: Dataset D, num_grid g, density threshold p, wavelet transform w, privacy budget ε
Output: A set of differentially private clusters
1: procedure PrivQT(D, g, p, w, ε)
2:   M′ = PrivQuantization(D, g, ε)
3:   W′ = WaveletTransform(M′, w)
4:   L′ = ConvertToPosSortedArray(W′)
5:   k′ = (1 − p)|L′|
6:   d′ = Top(k′, L′)
7:   return ConnCompLabel(W′, d′)
8: end procedure

Selecting an appropriate grid size (reflected by the parameter num_grid g) in the quantization step strongly affects the accuracy of WaveCluster results [36], and likewise the differentially private WaveCluster results. A coarse quantization (small g) causes more data points to fall into each grid, so the count of data points in each grid becomes larger, making the count matrix M resistant to Laplacian noise. However, coarse quantization is not helpful for WaveCluster to detect clusters with accurate shapes and renders the results less useful. On the other hand, although a finer quantization (large g) captures the density distribution of the data more clearly, it makes each grid’s count too small and thus sensitive to Laplacian noise, which dramatically affects the identification of significant grids and, in turn, the shapes of clusters. Our empirical results show that differentially private WaveCluster results maintain high utility only when an appropriate grid size is given.

Discussion. Although PrivQT achieves differential privacy on the WaveCluster results, the noisy count matrix M′ and the resulting noisy L′ are significantly distorted, and consequently so are the clustering results. The reason is as follows. Given a specified percentage value p, PrivQT computes k′ from the positive values in W′, where W′ is derived from M′, which is perturbed by Laplacian noise. The Laplace distribution is symmetric with zero mean. Due to this randomness, approximately half of the zero-count grids become noisy positive-count grids (from positive noise), while the remaining ones become noisy negative-count grids (from negative noise). The noisy positive-count grids may cause their corresponding wavelet-transformed values in W′ to become positive (depending on the targeted wavelet transform), so that they inappropriately participate in the computation of k′ and further distort k′. Due to the dominating errors introduced by approximately half of the zero-count grids becoming noisy positive-count grids, our empirical results show that the utility of private WaveCluster results under PrivQT improves only marginally even for a large privacy budget.
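The claim that roughly half of the zero-count grids turn positive follows from the symmetry of the Laplace distribution and can be checked with a quick simulation (an illustrative sketch, not part of PrivQT itself):

```python
import random

def fraction_turned_positive(num_zero_grids, epsilon, seed=0):
    """Fraction of zero-count grids whose Lap(1/epsilon)-noisy count is
    positive. Each Laplace draw is the difference of two exponential
    variates; by symmetry about zero, the fraction should approach 1/2."""
    rng = random.Random(seed)
    positives = sum(
        1 for _ in range(num_zero_grids)
        if (rng.expovariate(epsilon) - rng.expovariate(epsilon)) > 0)
    return positives / num_zero_grids
```

Because the fraction is essentially 1/2 regardless of ε, increasing the privacy budget shrinks the noise magnitude but not the number of spurious positive grids, consistent with the marginal utility improvement observed above.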

4.3 Private Quantization with Refined Noisy Density Threshold (PrivTHR)

The limitation of PrivQT lies in the severe distortion of k′ caused by the Laplacian noise introduced into the count matrix M′. To mitigate this distortion, we propose a technique, PrivTHR, which prunes a portion of the noisy positive values in W′ to refine the computation of k′. Algorithm 3 shows the pseudocode of PrivTHR.

PrivTHR first introduces random noise to the count matrix MM, similar to PrivQT, and obtains a noisy count matrix MM^{\prime} (Line 2). PrivTHR then applies a wavelet transform on MM^{\prime} to obtain WW^{\prime} (Line 3). WW^{\prime} is then turned into a list LL^{\prime} that keeps only positive values and the values in LL^{\prime} is sorted in ascending order (Line 4). Thus, only the positive values in WW^{\prime} will be used for computing kk^{\prime} based on the specified density threshold pp. To reduce the distortion of kk^{\prime}, starting from the smallest noisy positive values in LL^{\prime}, PrivTHR discards the first |Z|2\frac{|Z|^{\prime}}{2} values (Line 6), where ZZ represents the non-positive (negative or zero) values in the WW and |Z||Z|^{\prime} is a noisy estimate of |Z||Z| (Line 5). The reason why PrivTHR removes |Z|2\frac{|Z|^{\prime}}{2} values from LL^{\prime} is based on the utility analysis (in Section 5.2) that approximately |Z|2\frac{|Z|}{2} non-positive values in WW are turned into positive values due to the randomness of Laplacian noise. Since |Z||Z| partially describes the data distribution and releasing |Z||Z| without protection may leak private information, PrivTHR also introduces Laplacian noise to |Z||Z|, ensuring the whole process correctly enforces differentially privacy (Lines 11-17). The noise introduced to |Z||Z| depends on the wavelet transform used to compute WW. For example, if we use Haar transform for nn-dimensional data, a value in WW is computed by applying average for two neighboring elements along each dimension. 
Since any single change in the input only causes one entry of the count matrix M to change by 1, and that change affects at most one value in W, |Z| changes by at most 1, i.e., the sensitivity of |Z| is 1. (For other wavelet transforms that use circular convolutions, such as the Biorthogonal transform, the sensitivity of |Z| depends on the count of positive values and the count of negative values in the matrix computed from the coefficient vector [28].) Finally, PrivTHR obtains d' as the top-k'th value in L'' (Line 8), where any value in L'' greater than d' is considered a significant value, and applies the connected component labeling algorithm to identify clusters of significant grids (Line 9).

Algorithm 3 PrivTHR
Input: dataset D, num_grid g, density threshold p, wavelet transform w, differential privacy budget ϵ, allocation percentage α
Output: a set of differentially private clusters
1: procedure PrivTHR(D, g, p, w, ϵ, α)
2:   M' = PrivQuantization(D, g, αϵ)
3:   W' = WaveletTransform(M', w)
4:   L' = ConvertToPosSortedArray(W')
5:   |Z|' = NoisyCountOfNonPosValues(D, g, w, (1-α)ϵ)
6:   L'' = RemoveFrom(L', 0, |Z|'/2)
7:   k' = (1-p)|L''|
8:   d' = Top(k', L'')
9:   return ConnCompLabel(W', d')
10: end procedure
11: procedure NoisyCountOfNonPosValues(D, g, w, ϵ)
12:   M = Quantization(D, g)
13:   W = WaveletTransform(M, w)
14:   |Z| = CountOfNonPos(W)
15:   |Z|' = |Z| + Lap(Sensitivity(|Z|)/ϵ)
16:   return |Z|'
17: end procedure
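To make the two key steps concrete, the following is a minimal Python sketch of PrivTHR's noisy threshold computation for 2-D data. The helper names and the one-level Haar approximation (sum of each 2x2 block of counts divided by 2, matching the paper's convention that four counts summing to 1 yield 0.5) are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def haar_2d(m):
    """One-level 2-D Haar approximation: each 2x2 block of grid counts
    is reduced to (sum of the four counts) / 2."""
    return (m[0::2, 0::2] + m[0::2, 1::2] + m[1::2, 0::2] + m[1::2, 1::2]) / 2.0

def priv_thr_threshold(counts, p, eps, alpha, rng):
    """Sketch of PrivTHR's noisy density threshold d' (Algorithm 3)."""
    eps1, eps2 = alpha * eps, (1.0 - alpha) * eps
    # Private quantization: Lap(1/eps1) noise on every grid count.
    m_noisy = counts + rng.laplace(scale=1.0 / eps1, size=counts.shape)
    w_noisy = haar_2d(m_noisy)
    # L': positive transformed values, sorted in ascending order.
    l_prime = np.sort(w_noisy[w_noisy > 0])
    # |Z|': noisy count of non-positive values of the true W (sensitivity 1).
    w_true = haar_2d(counts)
    z_noisy = np.sum(w_true <= 0) + rng.laplace(scale=1.0 / eps2)
    # Prune the smallest |Z|'/2 noisy positives: roughly half of the
    # non-positive entries become positive under symmetric noise.
    drop = int(np.clip(round(float(z_noisy) / 2.0), 0, max(len(l_prime) - 1, 0)))
    l_refined = l_prime[drop:]
    # d' is the top-k'th value of the refined list, with k' = (1-p)|L''|.
    k_prime = max(int((1.0 - p) * len(l_refined)), 1)
    return l_refined[len(l_refined) - k_prime]
```

Grids whose transformed values in W' exceed the returned d' would then be passed to connected component labeling.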

Budget Allocation. PrivTHR first introduces Laplacian noise in the quantization step using a privacy budget ϵ1 = αϵ, where 0 < α < 1. In the significant grid identification step, PrivTHR further introduces Laplacian noise to |Z| using the remaining privacy budget ϵ2 = (1-α)ϵ. Based on the utility analysis in Section 5.2.2, ϵ1 requires a smaller amount of budget than ϵ2. Our empirical results in Section 7 further show in detail the impact of α on clustering accuracy.

4.4 Private Quantization with Noisy Threshold using Exponential Mechanism (PrivTHREM)

Besides pruning noisy positive values in W', we propose an alternative technique that employs the Exponential mechanism to derive k' from the sorted list L. Algorithm 4 shows the pseudocode of PrivTHREM.

PrivTHREM first introduces Laplacian noise to the count matrix M, as in PrivQT and PrivTHR, obtaining a noisy count matrix M' (Line 2) and the corresponding W' (Line 3). Different from the previous two techniques, which compute k' from W', PrivTHREM derives k' from W using the Exponential mechanism (Lines 7-15). In this case, although the sorted list derived from W' is severely distorted, the derivation of k' is not affected by the distorted W' at all. Given a reasonable privacy budget, k' derived from W using the Exponential mechanism is reasonably accurate, compared to the case where k' is derived from W'.

The quality function fed into the Exponential mechanism is [10]:

q(L, x) = -|rank(x) - k|,

where L represents the sorted positive values in W with Min and Max values (Line 10), and X represents the possible output space, i.e., all the possible values in the range (0, Max]. Given a W with m positive values x1 ≥ x2 ≥ … ≥ xm, these m values divide the range (0, Max] into m partitions: (0, xm], (xm, xm-1], …, (x2, x1], whose ranks are m, m-1, …, 2, 1, respectively. For any x in (x_{i+1}, x_i], its rank is rank(x_i) = i. For example, if x is in (x2, x1], then rank(x) = rank(x1) = 1. Similar to PrivTHR, when using the Haar transform, any single change in the input causes only one value in W to change. Thus, at most one value will be added to or removed from L, causing the outcome of q(L, x) to change by 1, i.e., the sensitivity of q(L, x) is 1. (Similarly, for other wavelet transforms that use circular convolutions, the sensitivity of q(L, x) depends on the count of positive values and the count of negative values in the matrix computed from the coefficient vector [28].)

Plugging the above quality function into the Exponential mechanism, we obtain the following algorithm: for any value x in (0, Max], the Exponential mechanism (EM) returns x with probability Pr[EM(L) = x] ∝ exp(-(ϵ/2)|rank(x) - k|) (Line 12). Since all the values in a partition have the same probability of being chosen, a random value from the partition Pt_i = (x_{i+1}, x_i] is chosen with probability proportional to |Pt_i| · exp(-(ϵ/2)|i - k|). In other words, once a rank k' is chosen, PrivTHREM further draws a uniform random value d' from Pt_{k'} (Line 13), and any value in L' greater than d' is considered a significant value.
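The partition-based sampling can be sketched in Python as follows. The function name and parameters are illustrative assumptions: a rank is sampled with probability proportional to |Pt_i| · exp(-(ϵ/2)|i - k|), and a uniform value is then drawn from the chosen partition.

```python
import numpy as np

def em_rank_sample(pos_values, k, eps, rng):
    """Sample a noisy density threshold d' with the Exponential mechanism.
    With x1 >= x2 >= ... >= xm, partition Pt_i = (x_{i+1}, x_i]
    (Pt_m = (0, x_m]) has rank i."""
    xs = np.sort(np.asarray(pos_values, dtype=float))[::-1]  # descending
    m = len(xs)
    lower = np.append(xs[1:], 0.0)       # lower edge of each partition
    widths = xs - lower                  # partition widths |Pt_i|
    ranks = np.arange(1, m + 1)
    # Pr[pick Pt_i] proportional to |Pt_i| * exp(-(eps/2)|i - k|)
    scores = widths * np.exp(-(eps / 2.0) * np.abs(ranks - k))
    probs = scores / scores.sum()
    i = rng.choice(m, p=probs)           # sampled rank k' (0-based index)
    return rng.uniform(lower[i], xs[i])  # uniform d' within the partition
```

With a large ϵ the sampled rank concentrates on k, so d' falls in the partition (x_{k+1}, x_k].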

Budget Allocation. Similar to PrivTHR, the privacy budget is split between two steps: the introduction of Laplacian noise in quantization and the derivation of k' using the Exponential mechanism. Previous empirical experiments [10] on splitting budget between obtaining a noisy median and noisy counts suggest that a 30% vs. 70% allocation strategy performs best. Specifically, 70% of the budget is allocated for obtaining the noisy count matrix M' (Line 2) and the remaining budget is allocated for computing k' (Line 4).

Algorithm 4 PrivTHREM
Input: dataset D, num_grid g, density threshold p, wavelet transform w, differential privacy budget ϵ, allocation percentage α
Output: a set of differentially private clusters
1: procedure PrivTHR_EM(D, g, p, w, ϵ, α)
2:   M' = PrivQuantization(D, g, αϵ)
3:   W' = WaveletTransform(M', w)
4:   d' = NoisyDensityThreshold(D, g, p, w, (1-α)ϵ)
5:   return ConnCompLabel(W', d')
6: end procedure
7: procedure NoisyDensityThreshold(D, g, p, w, ϵ)
8:   M = Quantization(D, g)
9:   W = WaveletTransform(M, w)
10:   L = ConvertToPosSortedArray(W)
11:   k = (1-p)|L|
12:   k' = ExponentialMechanism(L, k, ϵ)
13:   d' = UniformRandom(L, k')
14:   return d'
15: end procedure

5 Privacy and Utility Analysis

In this section, we present the theoretical analysis of the proposed techniques PrivQT, PrivTHR, and PrivTHREM.

5.1 Privacy Analysis

In this part, we establish the privacy guarantees of PrivQT, PrivTHR, and PrivTHREM.

Theorem 1

PrivQT is ϵ-differentially private.

Proof.

PrivQT introduces independent Laplacian noise Lap(1/ϵ) to grid counts, which are computed on disjoint subsets of the data. According to the parallel composition property of differential privacy described in Section 3.1, the privacy cost depends only on the worst guarantee among all computations over disjoint datasets. Therefore, PrivQT is ϵ-differentially private.

Theorem 2.

PrivTHR is ϵ-differentially private.

Proof.

PrivTHR splits the privacy budget into two parts. First, for private quantization, adding Laplacian noise Lap(1/(αϵ)) achieves αϵ-differential privacy; the proof is the same as for PrivQT. Second, PrivTHR introduces Laplacian noise Lap(1/((1-α)ϵ)) to the true count of non-positive values in W, which achieves (1-α)ϵ-differential privacy. By the composition property of differential privacy, PrivTHR is ϵ-differentially private since ϵ = αϵ + (1-α)ϵ.

Theorem 3.

PrivTHREM is ϵ-differentially private.

Proof.

Similar to PrivTHR, PrivTHREM has two steps of randomization: private quantization and obtaining the noisy density threshold d'. Private quantization achieves αϵ-differential privacy according to the Laplace mechanism and the parallel composition property. Sampling the noisy density threshold d' via the Exponential mechanism consumes a budget of (1-α)ϵ and thus achieves (1-α)ϵ-differential privacy. According to the composition property of differential privacy, PrivTHREM is ϵ-differentially private.

5.2 Utility Analysis

In this section, we present utility guarantees of our algorithms (PrivQT, PrivTHR, and PrivTHREM) with theoretical analysis. In WaveCluster, the significant grid identification step determines the clustering results, and under differential privacy PrivQT, PrivTHR, and PrivTHREM return a list of noisy significant grids. To quantify their utility, we view finding significant grids whose wavelet transformed values surpass a threshold as analogous to finding the top-k frequent itemsets whose frequencies surpass a threshold. In significant grid identification, L is the list of positive wavelet transformed values from W sorted in ascending order, Z represents the set of non-positive values from W, and k indicates the threshold position in L: all the top-k values in L correspond to significant grids, where k = (1-p)|L|. One parameter that specifies k is the density threshold p, which remains the same with or without noise introduction. However, |L|, the other parameter that determines k, changes to |L'| under differential privacy, where L' is the list of positive wavelet transformed values from W' sorted in ascending order. L differs from L' since noise introduction might turn a portion of the non-positive values in Z positive and a small portion of the positive values in L non-positive.

5.2.1 Utility Analysis for PrivQT.

We first analyze the difference between k and k' in PrivQT. The difference depends on two factors: (1) the set of non-positive values in Z that become noisy positive, Z'_p = {W'_Z | W'_Z = W_Z + Noise, W_Z ∈ Z, W'_Z > 0}, where W'_Z is the noisy version of a value in Z, and (2) the set of positive values in L that become noisy non-positive, L'_n = {W'_L | W'_L = W_L + Noise, W_L ∈ L, W'_L ≤ 0}, where W'_L is the noisy version of a positive value in L. That is, k' = (1-p)(|L| + |Z'_p| - |L'_n|).

Analysis of |Z'_p|. In PrivQT, since we add Lap(1/ϵ) noise to each grid count and the Haar transform averages four adjacent grids, the noise added to a wavelet transformed value is the sum of four i.i.d. samples from the Laplace distribution. The sum of h i.i.d. Laplace random variables with mean 0 is the difference of two i.i.d. Gamma random variables [26]; we refer to this distribution as T. The density of T is a polynomial in |x| divided by e^{|x|}, which is a symmetric function, so the probability that T produces a positive value is 1/2. Thus, the events of values in Z receiving positive noise from distribution T conform to the Binomial distribution with parameters |Z| and 1/2, whose expected value is |Z|/2.
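This symmetry argument is easy to verify numerically. The sketch below, with an illustrative ϵ and sample size of our choosing, estimates the probability that the four-fold Laplacian noise sum is positive, which should be close to 1/2.

```python
import numpy as np

# Empirical check: the sum of four i.i.d. Lap(1/eps) samples is symmetric
# around 0, so a zero entry of W turns positive with probability ~1/2,
# giving E[|Z'_p|] = |Z|/2.
rng = np.random.default_rng(42)
eps = 1.0
noise_sums = rng.laplace(scale=1.0 / eps, size=(200_000, 4)).sum(axis=1)
frac_positive = float(np.mean(noise_sums > 0))
```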

Analysis of |L'_n|. Each value in L receives noise drawn from the symmetric distribution T. The probability that exactly x values become non-positive is f_{|L'_n|}(x) = C(|L|, x) · ∏_{i=1}^{x} f_T(y ≤ -W_i) · ∏_{i=x+1}^{|L|} f_T(y > -W_i), with W_i ∈ L, and the expected value is E[|L'_n|] = ∑_{i=1}^{|L|} f_T(y ≤ -W_i). E[|L'_n|] is large when the W_i are small and the privacy budget is limited. Consider an extreme case that is not suitable for clustering: all the positive values in L equal the minimum value 0.5 (the sum of four adjacent grid counts being the minimum value 1), resulting in a high E[|L'_n|]. Clustering, and the WaveCluster algorithm in particular, is useful when the dataset has dense areas (clusters) and empty areas (gaps between clusters). This extreme case is not suitable for clustering since its data distribution is close to uniform. Datasets that are interesting for clustering typically have highly dense cluster centers and cluster borders with low density; only the values corresponding to border grids can become noisy non-positive, and the number of border grids is relatively small. Therefore, E[|L'_n|] is a small constant, and we refer to the value of |L'_n| as θ in the following analysis.

Analysis of k' - k. In PrivQT, E[k' - k] = (1-p)(|Z|/2 - θ). There are two extreme cases where k' - k ≈ 0. In one extreme, |Z| = |L| and all the positive values in L equal the minimum value 0.5; when ϵ = 1, θ ≈ 0.43|L| ≈ |Z|/2, which makes k' - k ≈ 0. In the other extreme, |Z| = 0 and all the positive values in L are large, e.g., ≥ 15; when ϵ = 1, θ ≈ 0 and k' - k ≈ 0. For datasets that are interesting in the context of clustering, |Z| is large relative to the whole space, since the empty grids in Z separate the clusters. Moreover, dense areas within clusters are typically larger than the low-density cluster borders, i.e., θ is far smaller than |Z|/2. In PrivQT, |Z|/2 thus dominates the difference k' - k, which increases the false positive rate. In PrivTHR and PrivTHREM, we use different strategies to minimize the difference k' - k.

Theorem 4.

In PrivQT with the Haar transform, given 0 < ω < 1, let η1 = |Z|/2 - sqrt(|Z| ln(1/ω)/2), η2 = |Z|/2 + sqrt(|Z| ln(1/ω)/2), and γ = (8/ϵ) ln(4(|L|+|Z|)/ω). Then with probability at least (1-ω)^2, (1) all values in L greater than W_{k'_min} + γ are output, where k'_min = k + (1-p)(η1 - θ), and (2) no values in L less than W_{k'_max} - γ are output, where k'_max = k + (1-p)(η2 - θ).

Proof.

In PrivQT, k' = (1-p)(|L| + |Z'_p| - |L'_n|). Since |Z'_p| follows the Binomial distribution with parameters |Z| and 1/2, and |L'_n| is the small value θ, k' follows a Binomial distribution and decides the number of values in L that are output. Given ω, we can derive k''s lower bound k'_min and show that values greater than W_{k'_min} + γ are output, i.e., subclaim (1). Let 1 - ω = Pr(k' ≥ k'_min) = Pr(|Z'_p| ≥ η1). As Pr(|Z'_p| ≥ η1) = 1 - Pr(|Z'_p| ≤ η1) and Pr(|Z'_p| ≤ η1) ≤ e^{-2(|Z|/2 - η1)^2/|Z|} [6], we have η1 = |Z|/2 - sqrt(|Z| ln(1/ω)/2). For constant ω, η1 = O(|Z|/2 - sqrt(|Z|/2)) suffices.

Similarly to k'_min, we can derive the bound γ on the noise added to each value in L ∪ Z based on ω. For the Haar wavelet transform, each value in L ∪ Z receives noise that is the sum of 4 Laplacian random variables divided by 2 (i.e., 4·Lap(1/ϵ)/2). Suppose all 4(|L|+|Z|) Laplacian random variables generate noise within [-γ/4, γ/4]. The probability that no Laplacian random variable's value falls outside [-γ/4, γ/4] is 1 - Pr(A), where A is the event that at least one Laplacian random variable's value falls outside [-γ/4, γ/4]. By the union bound, Pr(A) ≤ ∑_{i=1}^{4(|L|+|Z|)} Pr(B_i), where B_i is the event that the ith Laplacian random variable's noise falls outside [-γ/4, γ/4] and Pr(B_i) = e^{-ϵγ/8}. Thus, with probability at least 1 - 4(|L|+|Z|)e^{-ϵγ/8}, no Laplacian random variable's value falls outside [-γ/4, γ/4], and each value in L ∪ Z has its noise amount within [-γ/2, γ/2]. Let ω = 4(|L|+|Z|)e^{-ϵγ/8}; then -ϵγ/8 = ln(ω/(4(|L|+|Z|))) and γ = (8/ϵ) ln(4(|L|+|Z|)/ω). For constant ω, γ = O(ln(4(|L|+|Z|))/ϵ).

Subclaim (1) can be derived from (a) with probability at least 1-ω, k' ≥ k'_min, and (b) with probability at least 1-ω, the noise of each value in L being within [-γ/2, γ/2]; the detailed proof is omitted here. Since subclaim (1) requires both conditions (a) and (b) to hold, the probability is at least (1-ω)^2.

We can also derive the upper bound k'_max of k' given ω. Let 1 - ω = Pr(k' ≤ k'_max) = Pr(|Z'_p| ≤ η2). Recall that |Z'_p| follows the Binomial distribution (|Z|, 1/2), which is symmetric with respect to |Z|/2. Thus, the probability of sampling a value from the range [0, η2] is the same as that of sampling a value from the range [η1, |Z|], and we have η2 = |Z| - η1 = |Z|/2 + sqrt(|Z| ln(1/ω)/2). For constant ω, η2 = O(|Z|/2 + sqrt(|Z|/2)) suffices.

Subclaim (2) can be proved from (c) with probability at least 1-ω, k' ≤ k'_max, and (b) with probability at least 1-ω, the noise of each value in L being within [-γ/2, γ/2]. As subclaim (2) requires both conditions (c) and (b) to hold, the probability is at least (1-ω)^2.

For other wavelet transforms that use circular convolutions, such as the Biorthogonal transform, the derivation of the bounds of k' via η1 and η2 remains the same, since the fact that |Z'_p| follows a Binomial distribution is independent of the wavelet transform being adopted. Thus, our framework is extensible to other wavelet transforms, and the bound γ on the noise magnitude depends on the number of adjacent grid counts involved in computing a wavelet transformed value.
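As a quick numeric sanity check of the bound γ = (8/ϵ) ln(4(|L|+|Z|)/ω), plugging γ back into the union bound 4(|L|+|Z|)·e^{-ϵγ/8} recovers ω exactly. The concrete values for ϵ, |L|+|Z|, and ω below are illustrative choices of ours.

```python
import math

# Illustrative values: eps = 1, |L|+|Z| = 1024 transformed values, omega = 0.05.
eps, n_cells, omega = 1.0, 1024, 0.05
gamma = (8.0 / eps) * math.log(4 * n_cells / omega)
# Substituting gamma back into the union bound recovers omega.
failure_prob = 4 * n_cells * math.exp(-eps * gamma / 8.0)
```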

5.2.2 Utility Analysis for PrivTHR.

Theorem 5.

In PrivTHR with the Haar transform, given 0 < ω < 1, let η1 = |Z|/2 - sqrt(|Z| ln(1/ω)/2), η2 = |Z|/2 + sqrt(|Z| ln(1/ω)/2), γ = (8/ϵ1) ln(4(|L|+|Z|)/ω), and β = (2/ϵ2) ln(1/ω). Then with probability at least (1-ω)^3, (1) all values in L greater than W_{k'_min} + γ are output, where k'_min = k + (1-p)(η1 - θ - |Z|/2 - β), and (2) no values in L less than W_{k'_max} - γ are output, where k'_max = k + (1-p)(η2 - θ - |Z|/2 + β).

Proof.

In PrivTHR, we allocate ϵ1 for private quantization and ϵ2 for protecting |Z|/2, which makes k' = (1-p)(|L| + |Z'_p| - θ - |Z|/2 + Lap(1/ϵ2)). With probability at least 1 - e^{-ϵ2β/2}, the noise amount of Lap(1/ϵ2) is within β. Let 1 - ω = 1 - e^{-ϵ2β/2}; then β = (2/ϵ2) ln(1/ω). For constant ω, β = O(1/ϵ2) suffices. The proofs of η1, η2, γ, and subclaims (1) and (2) are the same as in Theorem 4, with γ = O(ln(4(|L|+|Z|))/ϵ1) for constant ω.

Difference between PrivTHR and PrivQT: By Theorem 4, in PrivQT k'_min = k + (1-p)(η1 - θ) and k'_max = k + (1-p)(η2 - θ). By Theorem 5, in PrivTHR k'_min = k + (1-p)(η1 - θ - |Z|/2 - β) and k'_max = k + (1-p)(η2 - θ - |Z|/2 + β), with η1 = O(|Z|/2 - sqrt(|Z|/2)) and η2 = O(|Z|/2 + sqrt(|Z|/2)). As we can see, by removing |Z|/2 ± β positive values from L', PrivTHR provides a better utility guarantee than PrivQT, since the difference between k' and k becomes O(sqrt(|Z|/2)) ± (β + θ), where θ is a small constant and β is small when a sufficient budget ϵ2 is provided.
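The scale of this improvement is easy to see with concrete numbers; the value |Z| = 1000 below is an illustrative assumption of ours, not from the paper.

```python
import math

# PrivQT's deviation of k' is driven by |Z|/2 grids, while PrivTHR's
# is driven by sqrt(|Z|/2) grids (up to the small beta and theta terms).
z = 1000
privqt_scale = z / 2                # 500 grids of deviation
privthr_scale = math.sqrt(z / 2)    # roughly 22 grids of deviation
```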

5.2.3 Utility Analysis for PrivTHREM.

Theorem 6.

In PrivTHREM with the Haar transform, given 0 < ω < 1, let η1 = |L| - k - 1 + (2/ϵ2) ln(|W_max - W_k| / (|W_k| ω)), η2 = k - 2 + (2/ϵ2) ln(|W_k| / (|W_max - W_k| ω)), and γ = (8/ϵ1) ln(4(|L|+|Z|)/ω). Then with probability at least (1-ω)^2, (1) all values in L greater than W_{k'_min} + γ are output, where k'_min = k - η1, and (2) no values in L less than W_{k'_max} - γ are output, where k'_max = k + η2.

Proof.

In PrivTHREM, we allocate ϵ2 for deriving k' from k via the Exponential mechanism, a general method proposed in [30]. The probability of selecting a rank i is proportional to |Pt_i| · exp(-(ϵ2/2)|i - rank(W_k)|), where Pt_i is the range (W_{i-1}, W_i] determined by the (i-1)th and ith wavelet transformed values.

Let 1 - ω be the probability of sampling a k' with k - k' ≤ η1. Then

ω < (|W_max - W_k| · e^{-(ϵ2/2)(η1+1)}) / (|W_k| · e^{-(ϵ2/2)(|L|-k)})
⇔ η1 < |L| - k - 1 + (2/ϵ2) ln(|W_max - W_k| / (|W_k| ω)).

For constant ω, η1 = O(|L| - k + (1/ϵ2) ln(|W_max - W_k| / |W_k|)) suffices.

Let 1 - ω be the probability of sampling a k' with k' - k ≤ η2. Then

ω < (|W_k| · e^{-(ϵ2/2)(η2+1)}) / (|W_max - W_k| · e^{-(ϵ2/2)(k-1)})
⇔ η2 < k - 2 + (2/ϵ2) ln(|W_k| / (|W_max - W_k| ω)).

For constant ω, η2 = O(k + (1/ϵ2) ln(|W_k| / |W_max - W_k|)) suffices. The proofs of γ and subclaims (1) and (2) are the same as in Theorem 4.

Analysis of PrivTHR and PrivTHREM. By Theorem 5 and Theorem 6, the accuracy of sampling k' in PrivTHR is dominated by 1/ϵ2, while in PrivTHREM it is dominated by (1/ϵ2) ln(|W_k| / |W_max - W_k|). Depending on the data distribution, PrivTHREM may provide a better or worse utility guarantee than PrivTHR: ln(|W_k| / |W_max - W_k|) is positive when |W_k| / |W_max - W_k| > 1, in which case sampling k' in PrivTHREM is more sensitive to ϵ2 than in PrivTHR; it becomes negative when |W_k| / |W_max - W_k| < 1, in which case the utility bounds of PrivTHREM are better than those of PrivTHR.

Section 7 demonstrates that by reducing the difference between k' and k, PrivTHR and PrivTHREM achieve more accurate results than PrivQT, which conforms to the above analysis.

Figure 4: Illustration of datasets and their WaveCluster results. (a) DS1, g = 64, p = 58; (b) DS2, g = 40, p = 10; (c) DS3, g = 36, p = 23; (d) Gowalla, g = 80, p = 31.

6 Quantitative Measures

To quantitatively assess the utility of differentially private WaveCluster, we propose two types of measures of the dissimilarity between true and differentially private WaveCluster results. The first type, DSG_C, measures the dissimilarity of the significant grids and the clusters between true and private results. The second type focuses on the usefulness of differentially private WaveCluster results for further data analysis, since a slight difference in the significant grids or clusters may cause a significant difference when the WaveCluster results are used downstream. In this paper, we choose a typical application for further data analysis: building a classifier from the clustering results to predict unlabeled data [20]. The classifier built from the true WaveCluster results is called the true classifier clf_t, while the classifier built from the differentially private WaveCluster results is called the private classifier clf_p. To measure the dissimilarity between clf_t and clf_p, we propose two metrics: OCM and 2CE.

6.1 Dissimilarity based on Significant Grids and Clusters

DSG_C considers the dissimilarities of significant grids and clusters. Assume that there are t clusters of true significant grids and s clusters of differentially private significant grids. t might not equal s, and the cluster labels in the t true clusters and the s private clusters are completely arbitrary. To accommodate these differences, we adopt the Hungarian method [27], a combinatorial optimization algorithm, to solve the matching problem between the t true clusters and the s private clusters while minimizing the matching difference.

When a cluster C_i is matched to a cluster C_j, we define the distance d between them as max{|C_i \ C_j|, |C_j \ C_i|}. Consider a cluster C_i = {g1, g3, g5} and a cluster C_j = {g1, g5, g7, g9}; the distance d between C_i and C_j is max{|{g3}|, |{g7, g9}|} = 2. Given t true clusters and s private clusters with t ≥ s, a matching M_{t,s} of the t true clusters and the s private clusters is a set of cluster pairs in which each private cluster is matched with a distinct true cluster. We then define the cost of a matching (M_cost) as the sum of the distances over all cluster pairs in M_{t,s} plus the count of significant grids in the non-matched true clusters:

M_cost = ∑_{(C_{i_x}, C_{j_y}) ∈ M_{t,s}} max{|C_{i_x} \ C_{j_y}|, |C_{j_y} \ C_{i_x}|} + ∑_{non-matched C_z} |C_z|

Here, i_x and j_y indicate the subscripts of the clusters in a matched pair, and |C_z| is the count of significant grids in a non-matched true cluster. Among all possible matchings of clusters, we use the Hungarian method to find the optimal matching with the minimum M_cost, and compute DSG_C as:

DSG_C = M_cost / |T|

Here T denotes the set of significant grids in the true WaveCluster results.
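The computation above can be sketched as follows. The paper uses the Hungarian method; for illustration we brute-force all matchings instead (equivalent for small t and s), and the representation of clusters as Python sets of grid ids is our own choice.

```python
from itertools import permutations

def dsg_c(true_clusters, priv_clusters):
    """DSG_C via exhaustive search over matchings of private clusters
    to distinct true clusters; assumes t >= s."""
    t, s = len(true_clusters), len(priv_clusters)
    total_true = sum(len(c) for c in true_clusters)  # |T|
    best = float("inf")
    for perm in permutations(range(t), s):
        # distance of each matched pair: max of the two set differences
        cost = sum(max(len(true_clusters[i] - priv_clusters[j]),
                       len(priv_clusters[j] - true_clusters[i]))
                   for j, i in enumerate(perm))
        # plus the significant grids of the non-matched true clusters
        cost += sum(len(true_clusters[i]) for i in range(t) if i not in perm)
        best = min(best, cost)
    return best / total_true
```

On the example above, matching {g1, g3, g5} to {g1, g5, g7, g9} contributes a pair distance of 2.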

6.2 Dissimilarity based on Classifier Prediction

OCM and 2CE measure the dissimilarity between clf_t and clf_p. We call this evaluation scheme "clustering-first-then-classification": given a set of unlabeled data points, we use a portion of the data points (e.g., 90%) to compute WaveCluster results, where each cluster is a set of significant grids. Using the significant grids with cluster labels as training data, we build the classifiers clf_t and clf_p, and use them to predict the classes of the remaining data points (e.g., 10%).

Dissimilarity of Classifiers based on Optimal Class Matching (OCM). OCM measures the dissimilarity between the two sets of classes predicted by clf_t and clf_p for the same test samples. We use L_t to denote the set of classes predicted by clf_t and L_p to denote the set of classes predicted by clf_p. Since the labels in L_t and L_p are completely arbitrary, we exploit the Hungarian method to find the optimal matching between L_t and L_p.

Assume that a class L_{t,i} predicted by clf_t is matched to a class L_{p,j} predicted by clf_p, forming a class pair. We compute the count of common test samples in L_{t,i} and L_{p,j}, and sum the common test samples over all class pairs to compute CT:

CT = ∑_{(L_{t,i}, L_{p,j}) matched} |L_{t,i} ∩ L_{p,j}|

Here c1 is the count of classes in L_t and c2 is the count of classes in L_p, and we assume c1 ≥ c2. Since there are many possible mappings from the classes in L_t to the classes in L_p, we use the Hungarian method to find the optimal mapping that maximizes CT. Based on CT and the total count TT of test samples, we derive the dissimilarity OCM:

OCM = 1 - CT / TT

When the dissimilarity is smaller, the differentially private WaveCluster results are more similar to the true WaveCluster results and maintain high utility for classification use.

Dissimilarity of Classifiers based on 2-Combination Enumeration (2CE). 2CE measures the dissimilarity between clf_t and clf_p based on the relationship of every pair of test samples, i.e., whether the two samples are in the same class. Given a pair of test samples A and B, we say A and B are classified consistently if either (1) clf_t(A) = clf_t(B) and clf_p(A) = clf_p(B), or (2) clf_t(A) ≠ clf_t(B) and clf_p(A) ≠ clf_p(B). 2CE is the ratio of the count of test sample pairs that are not classified consistently to the total number of test sample pairs, i.e., the number of 2-combinations of the test samples. By using pairs of test samples, 2CE eliminates the need to find an optimal matching between the classes predicted by clf_t and clf_p.
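The definition translates directly into code; the O(n^2) pair enumeration below is our own illustrative implementation, taking the two classifiers' predictions as label sequences.

```python
from itertools import combinations

def two_ce(labels_t, labels_p):
    """2CE: fraction of test-sample pairs classified inconsistently by
    the true and private classifiers."""
    n, inconsistent = len(labels_t), 0
    for i, j in combinations(range(n), 2):
        same_t = labels_t[i] == labels_t[j]   # same class under clf_t?
        same_p = labels_p[i] == labels_p[j]   # same class under clf_p?
        if same_t != same_p:
            inconsistent += 1
    return inconsistent / (n * (n - 1) // 2)  # all 2-combinations
```

Note that 2CE compares only co-membership of pairs, so it is invariant to any relabeling of the predicted classes.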

Figure 5: Comparing the private k' of the 4 techniques with the true k on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show k' and k on DS1, DS2, DS3, and Gowalla, respectively.

7 Experiments

We evaluate the proposed techniques using three datasets that are widely used by previous clustering algorithms [1], and one large-scale dataset derived from the check-in information of the Gowalla geo-social networking website [9] (https://snap.stanford.edu/data/loc-gowalla.html), which was used to evaluate grid-based clustering algorithms in [37].

7.1 Experiment Setup

In our experiments, we compare the performance of the four techniques, Baseline, PrivQT, PrivTHR, and PrivTHREM, on the four datasets using the two types of measures proposed in Section 6, and analyze the results. We use the Haar transform as the wavelet transform and set the wavelet decomposition level to 1 for all four techniques. Baseline uses the adaptive-grid method [33] for synthetic data generation. The classification algorithm used for measuring OCM and 2CE is the C4.5 decision tree algorithm [34]. We conduct experiments with privacy budgets ranging from 0.1 to 2.0; for each budget and each metric, we apply the techniques to each dataset 10 times and report the average performance. All experiments were conducted on a machine with an Intel 2.67GHz CPU and 8GB RAM.

Datasets. The four clustering datasets contain different data shapes that are especially interesting for clustering. Figure 4 shows the WaveCluster results on the four datasets under particular settings of the grid size g and the density threshold p. Any two adjacent clusters are marked with different colors. The points in red are identified as noise; they fall into the non-significant grids.

DS1 is a dataset containing 15 Gaussian clusters with different degrees of cluster overlap. It contains 30000 data points. These 15 clusters are all convex. The center area of each cluster has higher density and is resistant to noise. However, the overlapping area of two adjacent clusters has lower density and is prone to be affected by noise, which might turn the corresponding non-significant grids into significant grids and thereby connect two separate clusters. DS2 is a dataset with 3 spiral clusters. It contains 31200 data points. The head of each spiral is quite close to the others, so noisy significant grids are very likely to bridge the gap between adjacent spirals and merge them into one cluster. DS3 is a dataset with 5 clusters of various shapes, including concave shapes. It contains 31520 data points. Two of the clusters each consist of two sub-components joined by a narrow line-shaped area. This narrow bridging area has low density and might be turned into non-significant grids, causing a cluster to split into two. Gowalla is a check-in dataset resembling the world map, which records the time and location of users' check-ins. We use only the location information for evaluation. There are about 6.4M records in total. The large size of the dataset makes it infeasible to run experiments with C4.5 and Baseline due to memory constraints. Thus, similar to [33], we sampled 1M records from the dataset for evaluation.

We next present the results of comparing k' with k, and then the results of the two types of measures.

Figure 6: Comparing DSG_C of the 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show DSG_C on DS1, DS2, DS3, and Gowalla, respectively.

7.2 Comparing Private k' With True k

We first measure the differences between the true k and the private k' on each dataset with ε ranging from 0.1 to 2.0; the results are shown in Figure 5. For all datasets, when ε ≥ 0.5, the relative errors of k', i.e., |k' − k|/k, in PrivTHR and PrivTHREM are less than 4.7% on average, while the relative errors of k' in Baseline and PrivQT range from 32.2% to 150.5%. For example, in DS2, the true k is 144. When ε is 1, the average private k' is 141.0 (2.1%) for PrivTHR and 142.8 (0.8%) for PrivTHREM, while Baseline and PrivQT obtain average k' values of 284.0 (97.2%) and 249.2 (73.1%), respectively. Note that |Z| is 241 in DS2, and the difference between the average k' and k is 105.2 for PrivQT, which is quite close to the theoretical bound (1 − p)|Z|/2 = 108.45 derived from our utility analysis in Section 5.2.1. When ε is 0.1, the k' in PrivTHREM deviates from k more significantly than the k' in PrivTHR, indicating that PrivTHREM is more sensitive to ε than PrivTHR, as discussed in Section 5.2.3. For example, in DS2, the average k' in PrivTHREM is 82.8 (42.5%) while the average k' in PrivTHR is 131.2 (8.9%).
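The quoted percentages and the theoretical bound can be reproduced from the reported averages. In the sketch below, the density threshold p = 0.1 is inferred from the quoted bound (1 − p)|Z|/2 = 108.45 rather than stated in this section:

```python
# DS2 figures quoted above; p = 0.1 is an inferred assumption.
k_true, z, p = 144, 241, 0.1

avg_k = {"PrivTHR": 141.0, "PrivTHREM": 142.8,
         "Baseline": 284.0, "PrivQT": 249.2}

# Relative error |k' - k| / k for each technique.
rel_err = {name: abs(kp - k_true) / k_true for name, kp in avg_k.items()}

bound = (1 - p) * z / 2                    # (1 - p)|Z|/2 from Section 5.2.1
gap_privqt = abs(avg_k["PrivQT"] - k_true) # 105.2, just under the bound

print({n: round(100 * e, 1) for n, e in rel_err.items()})
# → {'PrivTHR': 2.1, 'PrivTHREM': 0.8, 'Baseline': 97.2, 'PrivQT': 73.1}
print(round(bound, 2), round(gap_privqt, 1))  # → 108.45 105.2
```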

7.3 Results of DSG_C

Figure 6 shows the results of DSG_C for the four techniques when the privacy budget ranges from 0.1 to 2.0. The X-axis shows the privacy budgets, and the Y-axis denotes the values of DSG_C. Both PrivTHR and PrivTHREM achieve smaller DSG_C values than Baseline and PrivQT on all four datasets for all budgets. The reason is that, although the noisy significant grids generated by Baseline and PrivQT may be similar to the true significant grids, these noisy significant grids produce very different cluster shapes and thus a large value of DSG_C, while PrivTHR and PrivTHREM preserve more accurate cluster shapes. For example, in DS3, the narrow line-shaped areas and the gap between two adjacent clusters are sensitive to noise. If noisy significant grids appear in these areas, two clusters may be merged into one; if significant grids disappear due to noise, one cluster might be split into two. Such changes cause DSG_C to increase significantly.

Unlike the other techniques, PrivQT benefits little from increased privacy budgets. For PrivQT, the difference between k' and k is dominated by |Z|/2. Increasing the privacy budget only reduces the noise magnitude and cannot smooth out this difference.

Figure 7: Comparing F-Measure of the 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show F-Measure on DS1, DS2, DS3, and Gowalla, respectively.

Comparison to F-Measure Results. Clustering analysis usually uses F-measure as a representative external validation to measure the similarity between the ground truth (known class labels) and the clustering results [2]. In our experiments, we treat the true WaveCluster results as the ground truth; the F-measure results are shown in Figure 7. PrivQT and Baseline achieve high F-measure scores (above 0.8) for almost all budgets on DS1, even though their private results are quite different from the true results. For example, when ε = 0.1, the private results of PrivQT and Baseline have more than 30 clusters while the true results have only 15. In contrast, Figure 6 (a) shows that DSG_C clearly differentiates the performance of the four techniques. The reason is that, unlike DSG_C, which allows only one-to-one mappings between true and private clusters, F-measure allows one-to-many or many-to-one mappings. If there are more true clusters than private clusters, F-measure allows many-to-one mappings, and vice versa. Thus, DSG_C provides a stricter evaluation than F-measure when computing similarity/dissimilarity.
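The leniency of many-to-one mapping can be seen in a common cluster-level F-measure, sketched below. This is our own illustrative variant, not necessarily the exact formulation of [2]: each true cluster is scored against its best-matching private cluster, and the same private cluster may be reused, unlike the strict one-to-one matching behind DSG_C.

```python
def cluster_f_measure(true_labels, priv_labels):
    # Weighted average, over true clusters, of the best F1 against any
    # private cluster (many-to-one mapping allowed). Hypothetical sketch.
    n = len(true_labels)

    def groups(labels):
        return [{i for i, l in enumerate(labels) if l == c}
                for c in set(labels)]

    score = 0.0
    for t in groups(true_labels):
        best = 0.0
        for p in groups(priv_labels):
            inter = len(t & p)
            if inter:
                prec, rec = inter / len(p), inter / len(t)
                best = max(best, 2 * prec * rec / (prec + rec))
        score += len(t) / n * best   # weight by true cluster size
    return score

# Splitting one true cluster into two only mildly lowers the score:
print(round(cluster_f_measure([0, 0, 1, 1], [7, 7, 8, 9]), 4))  # → 0.8333
```

Under a one-to-one matching such as the one DSG_C enforces, the split cluster in this example would have to be paired with a single private cluster, so the same distortion is penalized more heavily.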

Figure 8: Comparing OCM and 2CE of the 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show OCM and panels (e)–(h) show 2CE on DS1, DS2, DS3, and Gowalla, respectively.

7.4 Results of OCM and 2CE

Results of OCM. Figure 8 shows the results of OCM for the four techniques; the X-axis denotes the privacy budgets and the Y-axis the values of OCM. PrivTHR and PrivTHREM achieve smaller OCM values than Baseline and PrivQT on all datasets when ε ranges from 0.5 to 2.0. When ε is greater than 0.5, the OCM values of PrivTHR and PrivTHREM are less than 0.15 on DS1, DS3, and Gowalla, indicating that the private classifier clf_p produces predictions highly similar to those of the true classifier clf_t. On DS2, which contains 3 spirals, PrivTHREM still maintains a very low OCM value (< 0.1) when ε is greater than 0.5, while PrivTHR has a slightly worse OCM value (ranging from 0.1 to 0.2). These results show that PrivTHREM is more resilient to noise on concave-shaped data than PrivTHR.

Results of 2CE. Figure 8 also shows the results of 2CE for the four techniques; the X-axis denotes the privacy budgets and the Y-axis the values of 2CE. PrivTHR and PrivTHREM achieve smaller 2CE values than Baseline and PrivQT on all datasets when ε ranges from 0.5 to 2.0.

In general, all four techniques exhibit similar trends in 2CE as in OCM. On DS1, all four techniques have very low 2CE values (< 0.1) even though their corresponding OCM values are much higher (ranging from 0.05 to 0.5). The reason is that 2CE captures the relationships between data points while OCM focuses on the mappings of classes. If k test samples out of N total samples have different predictions in the true and private results, 2CE expresses the differences as C(k, 2) + k(N − k) over the total number of test sample pairs C(N, 2), while OCM expresses the differences as k over N. On DS1, the k affected test samples are predicted to be in the same cluster in the private results, so C(k, 2) is close to 0 and only k(N − k) matters in the computation of 2CE. Given that C(N, 2) is much larger than both N and k(N − k) when N is about 30,000 for DS1, 2CE yields a smaller value than OCM for measuring the differences, and is thus less sensitive to noise on DS1.

Budget Allocation for PrivTHR. Based on the utility analysis in Section 5.2.2, ε_1 for private quantization affects the accuracy of γ, and ε_2 for obtaining |Z|' affects the accuracy of β. As the constant factor of γ, (8/ε_1) ln(4(|L| + |Z|)/ω), is larger than the constant factor of β, (2/ε_2) ln(1/ω), more budget should be allocated to ε_1 to achieve better utility. We evaluate the DSG_C values of PrivTHR on DS1 under different budget allocation strategies, ranging from 1% to 99% of the budget for ε_1. The allocation with 90% for ε_1 and 10% for ε_2 performs best. The other measures on DS1 show similar results, as do both types of measures on the other datasets; detailed results are omitted.
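The trade-off behind this allocation can be sketched numerically. The sketch below uses the sum of the two constant factors as a rough error proxy, which is our own simplification rather than the exact objective of the utility analysis, and assumes hypothetical values for |L| + |Z| and ω (neither is fixed in this section):

```python
import math

# Hypothetical values: total budget, grid count |L| + |Z|, and omega.
eps, lz, omega = 1.0, 1024, 0.05

def error_proxy(alpha):
    # eps_1 = alpha * eps, eps_2 = (1 - alpha) * eps; sum the constant
    # factors of gamma and beta as a crude stand-in for overall error.
    gamma_factor = 8 / (alpha * eps) * math.log(4 * lz / omega)
    beta_factor = 2 / ((1 - alpha) * eps) * math.log(1 / omega)
    return gamma_factor + beta_factor

# Grid-search the split of the budget given to eps_1.
best = min((a / 100 for a in range(1, 100)), key=error_proxy)
print(best)   # a heavily eps_1-weighted split (0.8 under these values)
```

Even this crude proxy lands on a split that strongly favors ε_1, consistent with the empirical finding that a 90%/10% allocation performs best.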

8 Conclusion

In this paper we have addressed the problem of cluster analysis with differential privacy. We take a well-known, effective, and efficient clustering algorithm called WaveCluster, and propose several ways to introduce randomness into its computation. We also devise several new quantitative measures for examining the dissimilarity between the non-private and differentially private results and the usefulness of differentially private results for classification. In the future, we will investigate other categories of clustering algorithms under differential privacy, such as hierarchical clustering. Another important problem is to explore the applicability of differentially private clustering in cases where users do not have good knowledge of the dataset and the parameters of the algorithms must be inferred in a differentially private way.

Acknowledgments. This work is supported in part by the National Science Foundation under award CNS-1314229.

References

  • [1] Clustering datasets. http://cs.joensuu.fi/sipu/datasets/.
  • [2] E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, and A. Zimek. Evaluation of clusterings - metrics and visual support. In ICDE, 2012.
  • [3] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving anonymity via clustering. In PODS, 2006.
  • [4] A. N. Akansu and R. A. Haddad. Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets. Academic Press, Inc., 1992.
  • [5] A. N. Akansu, W. A. Serdijn, and I. W. Selesnick. Emerging applications of wavelets: A review. Phys. Commun., 3(1), 2010.
  • [6] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley, 1992.
  • [7] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In PODS, 2007.
  • [8] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In NIPS, 2008.
  • [9] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: User movement in location-based social networks. In KDD, 2011.
  • [10] G. Cormode, C. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In ICDE, 2012.
  • [11] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992.
  • [12] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
  • [13] C. Dwork and J. Lei. Differential privacy and robust statistics. In STOC, 2009.
  • [14] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
  • [15] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. In STOC, 2009.
  • [16] A. Friedman and A. Schuster. Data mining with differential privacy. In KDD, 2010.
  • [17] A. Friedman, R. Wolff, and A. Schuster. Providing k-anonymity in data mining. The VLDB Journal, 17(4), July 2008.
  • [18] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4), 2010.
  • [19] B. C. M. Fung, K. Wang, L. Wang, and P. C. K. Hung. Privacy-preserving data publishing for cluster analysis. Data Knowl. Eng., 68(6), 2009.
  • [20] P. Green, F. J. Carmone, and S. M. Smith. Multidimensional scaling, section five: Dimension reducing methods and cluster analysis. 1989. Addison Wesley.
  • [21] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2011.
  • [22] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. PVLDB, 3(1-2), 2010.
  • [23] B. K. P. Horn. Robot Vision. The MIT Press, 1988.
  • [24] A. Karakasidis and V. S. Verykios. Reference table based k-anonymous private blocking. In SAC, 2012.
  • [25] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In FOCS, 2008.
  • [26] S. Kotz, T. Kozubowski, and K. Podgórski. The Laplace distribution and generalizations : a revisit with applications to communications, economics, engineering, and finance. Birkhäuser, 2001.
  • [27] H. W. Kuhn. Variants of the Hungarian method for assignment problems. Naval Research Logistics Quarterly, 3, 1956.
  • [28] S. G. Mallat. A Wavelet Tour of Signal Processing. Academic Press. Academic Press, Inc., 1999.
  • [29] F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9), 2010.
  • [30] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.
  • [31] N. Mohammed, R. Chen, B. C. Fung, and P. S. Yu. Differentially private data release for data mining. In KDD, 2011.
  • [32] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.
  • [33] W. H. Qardaji, W. Yang, and N. Li. Differentially private grids for geospatial data. In ICDE, 2013.
  • [34] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
  • [35] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In VLDB, 1998.
  • [36] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB J., 8(3-4), 2000.
  • [37] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung. Density-based place clustering in geo-social networks. In SIGMOD, 2014.
  • [38] M. Winslett, Y. Yang, and Z. Zhang. Demonstration of damson: Differential privacy for analysis of large data. ICPADS, IEEE Computer Society, 2012.
  • [39] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. TKDE, 23(8), 2011.
  • [40] J. Xu, Z. Zhang, X. Xiao, Y. Yang, G. Yu, and M. Winslett. Differentially private histogram publication. VLDB J., 22(6), 2013.
  • [41] J. Zhang, X. Xiao, Y. Yang, Z. Zhang, and M. Winslett. Privgene: Differentially private model fitting using genetic algorithms. In SIGMOD, 2013.
  • [42] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. PVLDB, 5(11), 2012.