
WaveCluster with Differential Privacy

Ling Chen #1, Ting Yu #1,2, Rada Chirkova #1
#1 Department of Computer Science, North Carolina State University, Raleigh, USA
#2 Qatar Computing Research Institute, Doha, Qatar
1 lchen10@ncsu.edu, 1,2 tyu@{ncsu.edu, qf.org.qa}, 1 rychirko@ncsu.edu
Abstract

WaveCluster is an important family of grid-based clustering algorithms that are capable of finding clusters of arbitrary shapes. In this paper, we investigate techniques to perform WaveCluster while ensuring differential privacy. Our goal is to develop a general technique for achieving differential privacy on WaveCluster that accommodates different wavelet transforms. We show that straightforward techniques based on synthetic data generation and introduction of random noise when quantizing the data, though generally preserving the distribution of data, often introduce too much noise to preserve useful clusters. We then propose two optimized techniques, PrivTHR and PrivTHREM, which can significantly reduce data distortion during two key steps of WaveCluster: the quantization step and the significant grid identification step. We conduct extensive experiments based on four datasets that are particularly interesting in the context of clustering, and show that PrivTHR and PrivTHREM achieve high utility when privacy budgets are properly allocated.

1 Introduction

Clustering is an important class of data analysis that has been extensively applied in a variety of fields, such as identifying different groups of customers in marketing and grouping homologous gene sequences in biology research [21]. Clustering results allow data analysts to gain valuable insights into data distribution when it is challenging to make hypotheses on raw data. Among various clustering techniques, a grid-based clustering algorithm called WaveCluster [35, 36] is famous for detecting clusters of arbitrary shapes. WaveCluster relies on wavelet transforms, a family of convolutions with appropriate kernel functions, to convert data into a transformed space, where the natural clusters in the data become more distinguishable.

In many data-analysis scenarios, when the data being analyzed contains personal information and the result of the analysis needs to be shared with the public or untrusted third parties, sensitive private information may be leaked, e.g., whether certain personal information is stored in a database or has contributed to the analysis. Consider the databases A and B in Figure 1. These two databases have two attributes, Monthly Income and Monthly Living Expenses, and the records differ only in one record, u. Without u’s participation in database A, WaveCluster identifies two separate clusters, marked by blue and red, respectively. With u’s participation, WaveCluster identifies only one cluster marked by color blue from database B. Therefore, merely from the number of clusters returned (rather than which data points belong to which cluster), an adversary may infer a user’s participation. Due to such potential leak of private information, data holders may be reluctant to share the original data or data-analysis results with each other or with the public.

Figure 1: Example of personal privacy breach in cluster analysis. [Panels (a) and (b) omitted.]

In this paper, we develop techniques to perform WaveCluster with differential privacy [12, 14]. Differential privacy provides a provably strong privacy guarantee that the output of a computation is insensitive to any particular individual. In other words, based on the output, an adversary has limited ability to infer whether an individual is present or absent in the dataset. Differential privacy is often achieved by the perturbation of randomized algorithms, and the privacy level is controlled by a parameter ε called the “privacy budget”. Intuitively, the privacy protection grows stronger as ε grows smaller.

WaveCluster provides a framework that allows any kind of wavelet transform to be plugged in for data transformation, such as the Haar transform [4] and Biorthogonal transform [28]. There are various wavelet transforms that are suitable for different types of applications, such as image compression and signal processing [5]. Plugged in different wavelet transforms, WaveCluster can leverage different properties of the data, such as frequency and location, for finding the dense regions as clusters. Thus, in this paper, we aim to develop a general technique for achieving differential privacy on WaveCluster that accommodates different wavelet transforms.

Figure 2: Inaccurate clustering result produced by Baseline. [Panels (a) Original and (b) Baseline omitted.] (a) shows the WaveCluster results on the original data and (b) shows the WaveCluster results of Baseline, which leverages the adaptive-grid [33] approach to generate the synthetic data. Points in different clusters are shown in different colors, and the points marked in red are considered noise that does not form a cluster.

We first consider a general technique, Baseline, that adapts existing differentially private data-publishing techniques to WaveCluster through synthetic data generation. Specifically, we could generate synthetic data based on any data model of the original data that is published under differential privacy, and then apply WaveCluster with any wavelet transform over the synthetic data. Baseline seems particularly promising, as many effective differentially private data-publishing techniques have been proposed in the literature, all of which strive to preserve important properties of the original data. One would therefore hope that the “shape” of the original data is also preserved in the synthetic data, and consequently could be discovered by WaveCluster. Unfortunately, as we show later in the paper, this synthetic data-generation technique often cannot produce accurate results. Differentially private data-publishing techniques such as spatial decompositions [10], adaptive-grid [33], and Privelet [39] output noisy descriptions of the data distribution that often contain negative counts for sparse partitions due to random noise. These negative counts do not affect the accuracy of large range queries (often one of the main utility measures in private data publishing), since the zero-mean noise distribution smooths out the effect of negative counts. However, negative counts cannot be smoothed away in the synthesized dataset, where they are typically reset to zero. Figure 2 shows an example of inaccurate clustering results produced by Baseline using adaptive-grid [33]. As we can see, the synthetic data generated by Baseline significantly distorts the data distribution, causing two clusters to be merged into one and reducing the accuracy of the WaveCluster results.

Motivated by the above challenge, we propose three techniques that enforce differential privacy on the key steps of WaveCluster, rather than relying on synthetic data generation. WaveCluster accepts as input a set of data points in a multi-dimensional space, and consists of the following main steps. First, in the quantization step, WaveCluster quantizes the multi-dimensional space by dividing the space into grids and computes the count of the data points in each grid. These grid counts form a count matrix M. Second, in the wavelet transform step, WaveCluster applies a wavelet transform to the count matrix M to obtain the approximation of the multi-dimensional space. Third, in the significant grid identification step, WaveCluster identifies significant grids based on the pre-defined density threshold. Fourth, in the cluster identification step, WaveCluster outputs as clusters the connected components formed from these significant grids [23]. To enforce differential privacy on WaveCluster, we first propose a technique, PrivQT, that introduces Laplacian noise in the quantization step. However, such straightforward privacy enforcement cannot produce usable private WaveCluster results, since the noise introduced in this step significantly distorts the density threshold for identifying significant grids. To address this issue, we further propose two techniques, PrivTHR and PrivTHREM, which enforce differential privacy on both the quantization step and the significant grid identification step. The two techniques differ in how they determine the noisy density threshold. We show that by allocating appropriate budgets between these two steps, both techniques achieve differential privacy with significantly improved utility.

Traditionally, the effectiveness of WaveCluster is evaluated through visual inspection by human experts (i.e., visually determining whether the discovered clusters match those reflected in the user’s mind) [35, 36]. Unfortunately, visual inspection is inappropriate for assessing the utility of differentially private WaveCluster. Visual inspection is not quantitative, and thus it is hard to systematically compare the impact of different techniques through visual inspection. Generally, researchers use quantitative measures to assess the utility of differentially private results, such as relative or absolute errors for range queries and prediction accuracy for classification. But there are no existing utility measures for density-based clustering algorithms with differential privacy.

To mitigate this problem, in this paper we propose two types of utility measures. The first is to measure the dissimilarity between the true and private WaveCluster results by measuring the differences in significant grids and clusters, which correspond to the outputs of the two key steps (significant grid identification and cluster identification) of WaveCluster. To more intuitively understand the usefulness of discovered clusters, our second utility measure considers one concrete application of cluster analysis, i.e., building a classifier based on the discovered clusters and then using that classifier to predict future data. The prediction accuracy of the classifier therefore reflects one aspect of the actual utility of private WaveCluster.

To evaluate the proposed techniques, our experiments use four datasets containing different data shapes that are particularly interesting in the context of clustering [1, 9]. Our results show that PrivTHR and PrivTHREM achieve high utility for both types of utility measures, and are superior to Baseline and PrivQT.

2 Related Work

Syntactic approaches for privacy-preserving clustering [18] output k-anonymous clusters. Friedman et al. [17] presented an algorithm that outputs k-anonymous clusters using a minimum spanning tree. Karakasidis et al. [24] created k-anonymous clusters by merging clusters so that each cluster contains at least k key values of the records. Fung et al. [19] proposed an approach that converts the anonymity problem for cluster analysis into the counterpart problem for classification analysis. Aggarwal et al. [3] proposed a perturbation method called r-gather clustering, which releases the cluster centers together with their sizes, radii, and a set of associated sensitive values. However, these approaches only satisfy syntactic privacy notions such as k-anonymity and cannot provide formal privacy guarantees like differential privacy.

In this work, our goal is to perform WaveCluster under differential privacy. The focus of initial work on differential privacy [12, 14, 25, 13, 15] concerned the theoretical proof of its feasibility on various data analysis tasks, e.g., histogram and logistic regression.

More recent work has focused on practical applications of differential privacy for privacy-preserving data publishing. An approach proposed by Barak et al. [7] encoded marginals with Fourier coefficients and then added noise to the released coefficients. Hay et al. [22] exploited consistency constraints to reduce noise for histogram counts. Xiao et al. [39] proposed Privelet, which uses wavelet transforms to reduce noise for histogram counts. Cormode et al. [10] indexed data by kd-trees and quad-trees, developing effective budget allocation strategies for building the noisy trees and obtaining noisy counts for the tree nodes. Qardaji et al. [33] proposed uniform-grid and adaptive-grid methods to derive appropriate partition granularity in differentially private synopsis publishing. Xu et al. [40] proposed the NoiseFirst and StructureFirst techniques for constructing optimal noisy histograms, using dynamic programming and Exponential mechanism. These data publishing techniques are specifically crafted for answering range queries. Unfortunately, synthesizing the dataset and applying WaveCluster on top of it often render WaveCluster results useless, since these differentially private data publishing techniques do not capture the essence of WaveCluster and introduce too much unnecessary noise for WaveCluster.

Another important line of prior work focuses on integrating differential privacy into other practical data analysis tasks, such as regression analysis, model fitting, and classification. Chaudhuri et al. [8] proposed a differentially private regularized logistic regression algorithm that balances privacy with learnability. Zhang et al. [42] proposed a differentially private approach for logistic and linear regressions that perturbs the objective function of the regression model, rather than simply introducing noise into the results. Friedman et al. [16] incorporated differential privacy into several types of decision trees and demonstrated the tradeoff among privacy, accuracy, and sample size. Using decision trees as an example application, Mohammed et al. [31] investigated a generalization-based algorithm for achieving differential privacy for classification problems.

Differentially private cluster analysis has also been studied in prior work. Zhang et al. [41] proposed differentially private model fitting based on genetic algorithms, with applications to k-means clustering. McSherry [29] introduced the PINQ framework, which has been applied to achieve differential privacy for k-means clustering using an iterative algorithm [38]. Nissim et al. [32] proposed the sample-aggregate framework, which calibrates the noise magnitude according to the smooth sensitivity of a function. They showed that their framework can be applied to k-means clustering under the assumption that the dataset is well-separated. These research efforts primarily focus on centroid-based clustering, such as k-means, which is best suited for separating convex clusters and provides insufficient spatial information to detect clusters with complex shapes, e.g., concave shapes. In contrast, we propose techniques that enforce differential privacy on WaveCluster, which is not restricted to well-separated datasets and can detect clusters with arbitrary shapes.

3 Preliminaries

In this section, we first present the background of differential privacy. Then we describe the WaveCluster algorithm followed by our problem statement.

3.1 Differential Privacy

Differential privacy [12] is a recent privacy model, which guarantees that an adversary cannot infer an individual’s presence in a dataset from the randomized output, despite having knowledge of all remaining individuals in the dataset.

Definition 1

(ε-differential privacy): Given any pair of neighboring databases D and D′ that differ in only one individual record, a randomized algorithm A is ε-differentially private iff for any S ⊆ Range(A):

Pr[A(D) ∈ S] ≤ Pr[A(D′) ∈ S] · e^ε

The parameter ε indicates the level of privacy: smaller ε provides stronger privacy. When ε is very small, e^ε ≈ 1 + ε. Since the value of ε directly affects the level of privacy, we refer to it as the privacy budget. Appropriate allocation of the privacy budget across a computational process is important for reaching a favorable trade-off between privacy and utility. The most common strategy to achieve ε-differential privacy is to add noise to the output of a function. The magnitude of the introduced noise is calibrated by the privacy budget ε and the sensitivity of the query function, defined as the maximum difference between the outputs of the query function on any pair of neighboring databases:

Δf = max_{D,D′} ‖f(D) − f(D′)‖₁

There are two common approaches for achieving ε-differential privacy: the Laplace mechanism [14] and the Exponential mechanism [30].

Laplace Mechanism: The output of a query function f is perturbed by adding noise drawn from the Laplace distribution with probability density function f(x|b) = (1/2b)·exp(−|x|/b), where b = Δf/ε. The following randomized mechanism A_l satisfies ε-differential privacy:

A_l(D) = f(D) + Lap(Δf/ε)
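As an illustration (not part of the paper), the Laplace mechanism can be sketched in a few lines of Python; the sampler draws Lap(b) as the difference of two exponential variates, a standard identity:

```python
import random

def laplace_noise(scale: float, rng=random) -> float:
    """Sample Lap(0, scale): the difference of two Exp(1) variates,
    multiplied by the scale b, is Laplace-distributed with scale b."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release f(D) + Lap(Δf/ε); the noise scale is sensitivity/epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon)
```

For a count query (sensitivity 1) with budget ε = 0.5, each released count is the true count plus Lap(2) noise.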

Exponential Mechanism: This mechanism returns an output that is close to the optimum with respect to a quality function. A quality function q(D, r) assigns a score to each possible output r ∈ R, where R is the output range of f, and better outputs receive higher scores. A randomized mechanism A_e that outputs r ∈ R with probability

Pr[A_e(D) = r] ∝ exp(ε·q(D, r) / (2·S(q)))

satisfies ε-differential privacy, where S(q) is the sensitivity of the quality function.
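A minimal Python sketch of the Exponential mechanism (illustrative, not from the paper) samples an output index with probability proportional to the exponentiated, scaled quality scores; the maximum score is subtracted first for numerical stability, which does not change the sampling probabilities:

```python
import math
import random

def exponential_mechanism(quality_scores, epsilon, sensitivity=1.0, rng=random):
    """Sample an index r with Pr[r] ∝ exp(ε·q(D,r) / (2·S(q)))."""
    m = max(quality_scores)  # shift scores so the largest weight is 1
    weights = [math.exp(epsilon * (q - m) / (2.0 * sensitivity))
               for q in quality_scores]
    target = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if target <= acc:
            return i
    return len(weights) - 1  # guard against floating-point rounding
```

With a large budget the mechanism almost always returns the highest-scoring output; with a small budget the choice approaches uniform.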

Differential privacy has two composition properties: sequential composition and parallel composition. Sequential composition: given n independent randomized mechanisms A₁, A₂, …, Aₙ where Aᵢ (1 ≤ i ≤ n) satisfies εᵢ-differential privacy, the sequence of the Aᵢ over the same dataset D satisfies ε-differential privacy, where ε = Σᵢ εᵢ. Parallel composition: given n independent randomized mechanisms A₁, A₂, …, Aₙ where each Aᵢ satisfies ε-differential privacy, applying the Aᵢ over a set of disjoint datasets Dᵢ satisfies ε-differential privacy.
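The two composition properties amount to simple budget accounting, as the following sketch makes explicit (illustrative only; the max rule for parallel composition covers the general case of unequal budgets, of which the equal-budget statement above is a special case):

```python
def sequential_budget(epsilons):
    """Sequential composition over the same data: budgets add up."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Parallel composition over disjoint partitions: the overall cost
    is the largest individual budget."""
    return max(epsilons)
```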

Figure 3: Illustration of WaveCluster. [Figure omitted.]

3.2 WaveCluster

WaveCluster is an algorithm developed by Sheikholeslami et al. [35, 36] for the purpose of clustering spatial data. It works by using a wavelet transform to detect the boundaries between clusters. A wavelet transform allows the algorithm to distinguish between areas of high contrast (high frequency components) and areas of low contrast (low frequency components). The motivation behind this distinction is that within a cluster there should be low contrast and between clusters there should be an area of high contrast (the border). WaveCluster has the following steps as shown in Figure 3:

Quantization: Quantize the feature space into grids of a specified size, creating a count matrix M.

Wavelet Transform: Apply a wavelet transform, such as the Haar transform [4] or the Biorthogonal transform [28], to the count matrix M, decomposing M into the average subband, which gives the approximation of the count matrix, and the detail subband, which carries information about the boundaries of clusters. We refer to the average subband as the wavelet-transformed-value matrix (W).

Significant Grid Identification: Identify the significant grids from the average subband W. WaveCluster constructs a sorted list L of the positive wavelet-transformed values obtained from W and computes the p-th percentile of the values in L. The values below the p-th percentile of L are non-significant; their corresponding grids are considered non-significant grids, and the data points in non-significant grids are considered noise.

Cluster Identification: Identify clusters from the significant grids using the connected component labeling algorithm [23] (two grids are connected if they are adjacent), map the clusters back to the original multi-dimensional space, and label the data points based on the cluster in which they reside.
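The four steps above can be sketched end-to-end for 2-D data. The following Python sketch is illustrative only: it assumes points lie in the unit square, uses the unnormalized 2-D Haar average subband (the mean of each 2×2 block), and a simplified top-k thresholding rule; all function names are ours, not the paper's:

```python
from collections import deque

def quantize(points, num_grid, lo=0.0, hi=1.0):
    """Step 1: partition [lo,hi)^2 into num_grid x num_grid cells, count points."""
    M = [[0] * num_grid for _ in range(num_grid)]
    cell = (hi - lo) / num_grid
    for x, y in points:
        i = min(int((x - lo) / cell), num_grid - 1)
        j = min(int((y - lo) / cell), num_grid - 1)
        M[i][j] += 1
    return M

def haar_average_subband(M):
    """Step 2: one-level 2-D Haar approximation (mean of each 2x2 block)."""
    n = len(M)
    return [[(M[2*i][2*j] + M[2*i+1][2*j] + M[2*i][2*j+1] + M[2*i+1][2*j+1]) / 4.0
             for j in range(n // 2)] for i in range(n // 2)]

def significant_grids(W, p):
    """Step 3: keep grids whose value is among the top k = (1-p)|L| positives."""
    L = sorted(v for row in W for v in row if v > 0)
    k = int((1 - p) * len(L))
    if k == 0:
        return set()
    threshold = L[len(L) - k]  # k-th largest positive value
    return {(i, j) for i, row in enumerate(W)
            for j, v in enumerate(row) if v >= threshold}

def connected_clusters(grids):
    """Step 4: label 4-connected components among the significant grids."""
    seen, clusters = set(), []
    for g in grids:
        if g in seen:
            continue
        comp, q = [], deque([g])
        seen.add(g)
        while q:
            i, j = q.popleft()
            comp.append((i, j))
            for nb in ((i+1, j), (i-1, j), (i, j+1), (i, j-1)):
                if nb in grids and nb not in seen:
                    seen.add(nb)
                    q.append(nb)
        clusters.append(comp)
    return clusters
```

Two dense, well-separated point groups yield two connected components, matching the intuition behind Figure 1.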

In WaveCluster, users need to specify four parameters:

num_grid (g₁, g₂, …, gₙ): the number of grids that the n-dimensional space is partitioned into along each dimension. For brevity, we simply use g to refer to the partitions (g₁, g₂, …, gₙ) of the n-dimensional space. This parameter controls the scaling of quantization. Inappropriate scaling can cause over-quantization and under-quantization, affecting the accuracy of clustering [36].

density threshold (p): a percentage value p specifying that p% of the values in L are non-significant. For ease of presentation, we use k = (1 − p)|L| to denote the number of top values in L; their corresponding grids are considered significant grids.

level: a wavelet decomposition level, which indicates how many times a wavelet transform is applied. The larger the level is, the more approximate the result is. In our techniques, we set level to 1 since a smaller level value provides more accurate results [36].

wavelet: the wavelet transform to be applied. The Haar transform [4] is one of the simplest and most widely used wavelet transforms; it is computed by iteratively taking differences and averages between odd and even samples of a signal (or a sequence of data points). Other commonly used wavelet transforms include the Biorthogonal transform [28], the Daubechies transform [11], and so on.
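The averaging-and-differencing view of the Haar transform can be shown in a few lines. This is an illustrative sketch of the unnormalized one-level variant (average (a+b)/2 and detail (a−b)/2 per adjacent pair), not the paper's implementation:

```python
def haar_1d(signal):
    """One-level (unnormalized) Haar decomposition of an even-length
    sequence: for each adjacent pair (a, b), the average subband holds
    (a+b)/2 and the detail subband holds (a-b)/2."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    average = [(a + b) / 2.0 for a, b in pairs]
    detail = [(a - b) / 2.0 for a, b in pairs]
    return average, detail
```

For example, haar_1d([4, 2, 6, 6]) yields the average subband [3.0, 6.0] and the detail subband [1.0, 0.0]; the large detail value marks the contrast between the first pair.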

Motivating Scenario. Consider a scenario with two participants: the data owner (e.g., a hospital) and the querier (e.g., a data miner). The data owner holds the raw data and has the legal obligation to protect individuals’ privacy, while the querier is eager to obtain cluster analysis results for further exploration. The goal of our work is to enable the data owner to release cluster analysis results using WaveCluster without compromising the privacy of any individual who contributes to the raw data. The data owner has good knowledge of the raw data, so it is not difficult for her to pick appropriate parameters (e.g., num_grid, density threshold, and wavelet) for non-private WaveCluster. For example, the data owner may draw on her past experience with similar data to determine the appropriate parameters for the current dataset. The parameters picked for the non-private setting are used directly in the private setting, so the data owner does not need to infer another set of parameters for the private setting.

Problem Statement. Given a raw dataset D, appropriate WaveCluster parameters for D, and a privacy budget ε, our goal is to investigate an effective approach A such that A (1) satisfies ε-differential privacy, and (2) achieves high utility of the private WaveCluster results with regard to the utility metrics U.

4 Approaches

In this section, we present four techniques for achieving differential privacy on WaveCluster. We first describe the Baseline technique, which achieves differential privacy through synthetic data generation. We then describe three techniques that enforce differential privacy on the key steps of WaveCluster.

4.1 Baseline Approach (Baseline)

A straightforward technique to achieve differential privacy on WaveCluster is as follows: (1) adapt an existing ε-differentially private data publishing method to obtain a noisy description of the data distribution in some fashion, such as a set of contingency tables or a spatial decomposition tree [40, 39, 10, 33]; (2) generate a synthetic dataset according to the noisy description; (3) apply WaveCluster on the synthetic dataset. We refer to this technique as Baseline; its pseudocode is shown in Algorithm 1.

Algorithm 1 Baseline
Input: Dataset D, num_grid g, density threshold p, wavelet transform w, privacy budget ε
Output: A set of differentially private clusters
1: procedure Baseline(D, g, p, w, ε)
2:   D′ = DiffPrivPublishing(D, ε)
3:   M′ = Quantization(D′, g)
4:   W′ = WaveletTransform(M′, w)
5:   L′ = ConvertToPosSortedArray(W′)
6:   k′ = (1 − p)|L′|
7:   d′ = Top(k′, L′)
8:   return ConnCompLabel(W′, d′)
9: end procedure

Baseline first leverages an ε-differentially private data publishing method to obtain a noisy dataset D′ (Line 2) and partitions D′ based on the number of grids g to obtain the noisy count matrix M′ (Line 3). Baseline then applies a wavelet transform on M′ to obtain W′ (Line 4). W′ is then turned into a list L′ that keeps only the positive values, sorted in ascending order (Line 5). From L′, k′ is computed based on the specified density threshold p and the size of L′ (Line 6). Finally, Baseline obtains d′ as the top-k′-th value in L′ (Line 7), where any value in L′ greater than d′ is considered significant, and applies the connected component labeling algorithm to identify clusters of significant grids (Line 8).

Discussion. Baseline achieves differential privacy on WaveCluster through differentially private data publishing. However, it does not produce accurate WaveCluster results in most cases. The adapted ε-differentially private data publishing method is designed for answering range queries. The noisy descriptions of the data distribution generated by the method may contain negative counts for certain partitions, since the noise distribution is Laplacian with zero mean. These negative counts do not affect range query accuracy much, since the zero-mean noise distribution smooths out the effect of the noise. For example, suppose a partition p₁ has a true count of 2 and a noisy count of −2; its noise is canceled by another partition p₂ with a true count of 10 and a noisy count of 14 when both p₁ and p₂ are included in a range query. In particular, when a range query spans a large portion of the dataset, a single partition with a noisy negative count does not affect its accuracy much. However, when the method is used to generate a synthetic dataset, the noisy negative counts are reset to zero, causing the data distribution to change radically as a whole and leading to severe deviation in the differentially private WaveCluster results.
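The bias introduced by resetting negative counts to zero can be seen with a small simulation. The sketch below (illustrative, not from the paper) adds Lap(1/ε) noise to a block of empty partitions and clips negatives, as synthetic data generation must; the clipped total drifts far above the true total of zero, whereas unclipped noise would average out:

```python
import random

def clipped_synthetic_total(true_counts, epsilon, seed=0):
    """Add Lap(1/epsilon) noise to each count (sampled as the difference
    of two exponential variates), then clip negatives to zero, as a
    synthetic-data generator must. Returns the total synthetic count."""
    rng = random.Random(seed)
    total = 0.0
    for c in true_counts:
        noisy = c + (rng.expovariate(epsilon) - rng.expovariate(epsilon))
        total += max(0.0, noisy)  # negative counts are reset to zero
    return total
```

With 1000 empty partitions and ε = 0.5, the expected clipped total is about 1000 phantom points (b/2 per partition with b = 2), which is exactly the kind of distortion that merges the two clusters in Figure 2.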

4.2 Private Quantization (PrivQT)

To address the challenge faced by Baseline, we propose techniques that enforce differential privacy on the key steps of WaveCluster. Our first approach, called Private Quantization (PrivQT), introduces independent Laplacian noise in the quantization step to achieve differential privacy. In the quantization step, the data is divided into grids and the count matrix M is computed. To ensure differential privacy in this step, we rely on the Laplace mechanism, which introduces independent Laplacian noise into M. Clearly, if we change one individual in the input data, by adding, removing, or modifying an individual, at most one entry of M changes. According to the parallel composition property of differential privacy, the noise amount introduced into each grid is Lap(1/ε), given a privacy budget ε. Since the subsequent steps of WaveCluster are carried out on the differentially private count matrix M′, the clusters derived from these steps are also differentially private. Algorithm 2 shows the pseudocode of PrivQT. Aside from the first step, which introduces independent Laplacian noise into M (Line 2), the remaining steps (Lines 3-7) are the same as in Baseline.

Algorithm 2 PrivQT
Input: Dataset D, num_grid g, density threshold p, wavelet transform w, privacy budget ε
Output: A set of differentially private clusters
1: procedure PrivQT(D, g, p, w, ε)
2:   M′ = PrivQuantization(D, g, ε)
3:   W′ = WaveletTransform(M′, w)
4:   L′ = ConvertToPosSortedArray(W′)
5:   k′ = (1 − p)|L′|
6:   d′ = Top(k′, L′)
7:   return ConnCompLabel(W′, d′)
8: end procedure

Selecting an appropriate grid size (reflected by the parameter num_grid g) in the quantization step strongly affects the accuracy of WaveCluster results [36], and likewise the differentially private WaveCluster results. A coarse quantization (small g) causes more data points to fall into each grid, so the count of data points in each grid becomes larger, making the count matrix M resistant to Laplacian noise. However, coarse quantization is not helpful for WaveCluster to detect clusters with accurate shapes and renders the results less useful. On the other hand, although a finer quantization (large g) captures the density distribution of the data more clearly, it makes each grid’s count too small and thus sensitive to Laplacian noise, which dramatically affects the identification of significant grids and, in turn, the shapes of clusters. Our empirical results show that differentially private WaveCluster results maintain high utility only when an appropriate grid size is given.

Discussion. Although PrivQT achieves differential privacy on the WaveCluster results, the noisy count matrix M′ and the resulting noisy L′ are significantly distorted, and consequently so are the clustering results. The reason is as follows. Given a specified percentage value p, PrivQT computes k′ from the positive values in W′, where W′ is derived from M′, which is perturbed by Laplacian noise. The Laplace distribution is symmetric with zero mean. Due to this randomness, approximately half of the zero-count grids become noisy positive-count grids (from positive noise), while the remaining ones become noisy negative-count grids (from negative noise). The noisy positive-count grids may cause their corresponding wavelet-transformed values in W′ to become positive (depending on the targeted wavelet transform), so that they inappropriately participate in the computation of k′ and further distort k′. Due to the dominating errors introduced by approximately half of the zero-count grids becoming noisy positive-count grids, our empirical results show that the utility of private WaveCluster results under PrivQT improves only marginally even for a large privacy budget.
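The claim that roughly half of the zero-count grids turn positive follows from the symmetry of the Laplace distribution and can be checked with a quick simulation (an illustrative sketch, not part of PrivQT itself):

```python
import random

def fraction_turned_positive(num_zero_grids, epsilon, seed=0):
    """Fraction of zero-count grids whose Lap(1/epsilon)-noisy count is
    positive. Each Laplace draw is the difference of two exponential
    variates; by symmetry about zero, the fraction should approach 1/2."""
    rng = random.Random(seed)
    positives = sum(
        1 for _ in range(num_zero_grids)
        if (rng.expovariate(epsilon) - rng.expovariate(epsilon)) > 0)
    return positives / num_zero_grids
```

Because the fraction is essentially 1/2 regardless of ε, increasing the privacy budget shrinks the noise magnitude but not the number of spurious positive grids, consistent with the marginal utility improvement observed above.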

4.3 Private Quantization with Refined Noisy Density Threshold (PrivTHR)

The limitation of PrivQT lies in the severe distortion of k′ caused by the Laplacian noise introduced into the count matrix M′. To mitigate this distortion, we propose a technique, PrivTHR, which prunes a portion of the noisy positive values in W′ to refine the computation of k′. Algorithm 3 shows the pseudocode of PrivTHR.

PrivTHR first introduces random noise to the count matrix MM, similar to PrivQT, and obtains a noisy count matrix MM^{\prime} (Line 2). PrivTHR then applies a wavelet transform on MM^{\prime} to obtain WW^{\prime} (Line 3). WW^{\prime} is then turned into a list LL^{\prime} that keeps only positive values and the values in LL^{\prime} is sorted in ascending order (Line 4). Thus, only the positive values in WW^{\prime} will be used for computing kk^{\prime} based on the specified density threshold pp. To reduce the distortion of kk^{\prime}, starting from the smallest noisy positive values in LL^{\prime}, PrivTHR discards the first |Z|2\frac{|Z|^{\prime}}{2} values (Line 6), where ZZ represents the non-positive (negative or zero) values in the WW and |Z||Z|^{\prime} is a noisy estimate of |Z||Z| (Line 5). The reason why PrivTHR removes |Z|2\frac{|Z|^{\prime}}{2} values from LL^{\prime} is based on the utility analysis (in Section 5.2) that approximately |Z|2\frac{|Z|}{2} non-positive values in WW are turned into positive values due to the randomness of Laplacian noise. Since |Z||Z| partially describes the data distribution and releasing |Z||Z| without protection may leak private information, PrivTHR also introduces Laplacian noise to |Z||Z|, ensuring the whole process correctly enforces differentially privacy (Lines 11-17). The noise introduced to |Z||Z| depends on the wavelet transform used to compute WW. For example, if we use Haar transform for nn-dimensional data, a value in WW is computed by applying average for two neighboring elements along each dimension. 
Since any single change in the input only causes one entry of the count matrix M to change by 1, and that change affects at most one value in W, |Z| changes by at most 1, i.e., the sensitivity of |Z| is 1. (For other wavelet transforms that use circular convolutions, such as the Biorthogonal transform, the sensitivity of |Z| depends on the count of positive values and the count of negative values in the matrix computed from the coefficient vector [28].) Finally, PrivTHR obtains d' as the top-k'th value in L'' (Line 8), where any value in L'' greater than d' is considered a significant value, and applies the connected component labeling algorithm to identify clusters of significant grids (Line 9).

Algorithm 3 PrivTHR
Input: dataset D, num_grid g, density threshold p, wavelet transform w, differential privacy budget ϵ, allocation percentage α
Output: a set of differentially private clusters
1: procedure PrivTHR(D, g, p, w, ϵ, α)
2:   M' = PrivQuantization(D, g, αϵ)
3:   W' = WaveletTransform(M', w)
4:   L' = ConvertToPosSortedArray(W')
5:   |Z|' = NoisyCountOfNonPosValues(D, g, w, (1-α)ϵ)
6:   L'' = RemoveFrom(L', 0, |Z|'/2)
7:   k' = (1-p)|L''|
8:   d' = Top(k', L'')
9:   return ConnCompLabel(W', d')
10: end procedure
11: procedure NoisyCountOfNonPosValues(D, g, w, ϵ)
12:   M = Quantization(D, g)
13:   W = WaveletTransform(M, w)
14:   |Z| = CountOfNonPos(W)
15:   |Z|' = |Z| + Lap(Sensitivity(|Z|)/ϵ)
16:   return |Z|'
17: end procedure
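To make the two key steps concrete, the following is a minimal Python sketch of PrivTHR's noisy threshold computation for 2-D data. The helper names and the one-level Haar approximation (sum of each 2x2 block of counts divided by 2, matching the paper's convention that four counts summing to 1 yield 0.5) are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def haar_2d(m):
    """One-level 2-D Haar approximation: each 2x2 block of grid counts
    is reduced to (sum of the four counts) / 2."""
    return (m[0::2, 0::2] + m[0::2, 1::2] + m[1::2, 0::2] + m[1::2, 1::2]) / 2.0

def priv_thr_threshold(counts, p, eps, alpha, rng):
    """Sketch of PrivTHR's noisy density threshold d' (Algorithm 3)."""
    eps1, eps2 = alpha * eps, (1.0 - alpha) * eps
    # Private quantization: Lap(1/eps1) noise on every grid count.
    m_noisy = counts + rng.laplace(scale=1.0 / eps1, size=counts.shape)
    w_noisy = haar_2d(m_noisy)
    # L': positive transformed values, sorted in ascending order.
    l_prime = np.sort(w_noisy[w_noisy > 0])
    # |Z|': noisy count of non-positive values of the true W (sensitivity 1).
    w_true = haar_2d(counts)
    z_noisy = np.sum(w_true <= 0) + rng.laplace(scale=1.0 / eps2)
    # Prune the smallest |Z|'/2 noisy positives: roughly half of the
    # non-positive entries become positive under symmetric noise.
    drop = int(np.clip(round(float(z_noisy) / 2.0), 0, max(len(l_prime) - 1, 0)))
    l_refined = l_prime[drop:]
    # d' is the top-k'th value of the refined list, with k' = (1-p)|L''|.
    k_prime = max(int((1.0 - p) * len(l_refined)), 1)
    return l_refined[len(l_refined) - k_prime]
```

Grids whose transformed values in W' exceed the returned d' would then be passed to connected component labeling.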

Budget Allocation. PrivTHR first introduces Laplacian noise in the quantization step using a privacy budget ϵ1 = αϵ, where 0 < α < 1. In the significant grid identification step, PrivTHR further introduces Laplacian noise to |Z| using the remaining privacy budget ϵ2 = (1-α)ϵ. Based on the utility analysis in Section 5.2.2, ϵ1 requires a smaller amount of budget than ϵ2. Our empirical results in Section 7 further show in detail the impact of α on clustering accuracy.

4.4 Private Quantization with Noisy Threshold using Exponential Mechanism (PrivTHREM)

Besides pruning noisy positive values in W', we propose an alternative technique that employs the Exponential mechanism to derive k' from the sorted list L. Algorithm 4 shows the pseudocode of PrivTHREM.

PrivTHREM first introduces Laplacian noise to the count matrix M, as in PrivQT and PrivTHR, obtaining a noisy count matrix M' (Line 2) and the corresponding W' (Line 3). Different from the previous two techniques, which compute k' from W', PrivTHREM derives k' from W using the Exponential mechanism (Lines 7-15). In this case, although the sorted list derived from W' is severely distorted, the derivation of k' is not affected by the distorted W' at all. Given a reasonable privacy budget, k' derived from W using the Exponential mechanism is reasonably accurate, compared to the case where k' is derived from W'.

The quality function fed into the Exponential mechanism is [10]:

q(L, x) = -|rank(x) - k|,

where L represents the sorted positive values in W with Min and Max values (Line 10), and X represents the possible output space, i.e., all the possible values in the range (0, Max]. Given a W with m positive values x1 ≥ x2 ≥ … ≥ xm, these m values divide the range (0, Max] into m partitions: (0, xm], (xm, xm-1], …, (x2, x1], whose ranks are m, m-1, …, 2, 1, respectively. For any x in (x_{i+1}, x_i], its rank is rank(x_i) = i. For example, if x is in (x2, x1], then rank(x) = rank(x1) = 1. Similar to PrivTHR, when using the Haar transform, any single change in the input causes only one value in W to change. Thus, at most one value will be added to or removed from L, causing the outcome of q(L, x) to change by 1, i.e., the sensitivity of q(L, x) is 1. (Similarly, for other wavelet transforms that use circular convolutions, the sensitivity of q(L, x) depends on the count of positive values and the count of negative values in the matrix computed from the coefficient vector [28].)

Plugging the above quality function into the Exponential mechanism, we obtain the following algorithm: for any value x in (0, Max], the Exponential mechanism (EM) returns x with probability Pr[EM(L) = x] ∝ exp(-(ϵ/2)|rank(x) - k|) (Line 12). Since all the values in a partition have the same probability of being chosen, a random value from the partition Pt_i = (x_{i+1}, x_i] is chosen with probability proportional to |Pt_i| · exp(-(ϵ/2)|i - k|). In other words, once a rank k' is chosen, PrivTHREM further draws a uniform random value d' from Pt_{k'} (Line 13), and any value in L' greater than d' is considered a significant value.
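The partition-based sampling can be sketched in Python as follows. The function name and parameters are illustrative assumptions: a rank is sampled with probability proportional to |Pt_i| · exp(-(ϵ/2)|i - k|), and a uniform value is then drawn from the chosen partition.

```python
import numpy as np

def em_rank_sample(pos_values, k, eps, rng):
    """Sample a noisy density threshold d' with the Exponential mechanism.
    With x1 >= x2 >= ... >= xm, partition Pt_i = (x_{i+1}, x_i]
    (Pt_m = (0, x_m]) has rank i."""
    xs = np.sort(np.asarray(pos_values, dtype=float))[::-1]  # descending
    m = len(xs)
    lower = np.append(xs[1:], 0.0)       # lower edge of each partition
    widths = xs - lower                  # partition widths |Pt_i|
    ranks = np.arange(1, m + 1)
    # Pr[pick Pt_i] proportional to |Pt_i| * exp(-(eps/2)|i - k|)
    scores = widths * np.exp(-(eps / 2.0) * np.abs(ranks - k))
    probs = scores / scores.sum()
    i = rng.choice(m, p=probs)           # sampled rank k' (0-based index)
    return rng.uniform(lower[i], xs[i])  # uniform d' within the partition
```

With a large ϵ the sampled rank concentrates on k, so d' falls in the partition (x_{k+1}, x_k].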

Budget Allocation. Similar to PrivTHR, the privacy budget is split between two steps: the introduction of Laplacian noise in quantization and the derivation of k' using the Exponential mechanism. Previous empirical experiments [10] on splitting budget between obtaining a noisy median and noisy counts suggest that a 30% vs. 70% allocation strategy performs best. Specifically, 70% of the budget is allocated for obtaining the noisy count matrix M' (Line 2) and the remaining budget is allocated for computing k' (Line 4).

Algorithm 4 PrivTHREM
Input: dataset D, num_grid g, density threshold p, wavelet transform w, differential privacy budget ϵ, allocation percentage α
Output: a set of differentially private clusters
1: procedure PrivTHR_EM(D, g, p, w, ϵ, α)
2:   M' = PrivQuantization(D, g, αϵ)
3:   W' = WaveletTransform(M', w)
4:   d' = NoisyDensityThreshold(D, g, p, w, (1-α)ϵ)
5:   return ConnCompLabel(W', d')
6: end procedure
7: procedure NoisyDensityThreshold(D, g, p, w, ϵ)
8:   M = Quantization(D, g)
9:   W = WaveletTransform(M, w)
10:   L = ConvertToPosSortedArray(W)
11:   k = (1-p)|L|
12:   k' = ExponentialMechanism(L, k, ϵ)
13:   d' = UniformRandom(L, k')
14:   return d'
15: end procedure

5 Privacy and Utility Analysis

In this section, we present the theoretical analysis of the proposed techniques PrivQT, PrivTHR, and PrivTHREM.

5.1 Privacy Analysis

In this part, we establish the privacy guarantees of PrivQT, PrivTHR, and PrivTHREM.

Theorem 1

PrivQT is ϵ-differentially private.

Proof.

PrivQT introduces independent Laplacian noise Lap(1/ϵ) to grid counts, which are computed on disjoint subsets of the data. According to the parallel composition property of differential privacy described in Section 3.1, the privacy cost depends only on the worst guarantee among all computations over disjoint datasets. Therefore, PrivQT is ϵ-differentially private.

Theorem 2.

PrivTHR is ϵ-differentially private.

Proof.

PrivTHR splits the privacy budget into two parts. First, for private quantization, adding Laplacian noise Lap(1/(αϵ)) achieves αϵ-differential privacy; the proof is the same as for PrivQT. Second, PrivTHR introduces Laplacian noise Lap(1/((1-α)ϵ)) to the true count of non-positive values in W, which achieves (1-α)ϵ-differential privacy. By the composition property of differential privacy, PrivTHR is ϵ-differentially private since ϵ = αϵ + (1-α)ϵ.

Theorem 3.

PrivTHREM is ϵ-differentially private.

Proof.

Similar to PrivTHR, PrivTHREM has two steps of randomization: private quantization and obtaining the noisy density threshold d'. Private quantization achieves αϵ-differential privacy according to the Laplace mechanism and the parallel composition property. Sampling the noisy density threshold d' via the Exponential mechanism consumes a budget of (1-α)ϵ and thus achieves (1-α)ϵ-differential privacy. According to the composition property of differential privacy, PrivTHREM is ϵ-differentially private.

5.2 Utility Analysis

In this section, we present utility guarantees of our algorithms (PrivQT, PrivTHR, and PrivTHREM) with theoretical analysis. In WaveCluster, the significant grid identification step determines the clustering results, and under differential privacy PrivQT, PrivTHR, and PrivTHREM return a list of noisy significant grids. To quantify their utility, we view finding significant grids whose wavelet transformed values surpass a threshold as analogous to finding the top-k frequent itemsets whose frequencies surpass a threshold. In significant grid identification, L is the list of positive wavelet transformed values from W sorted in ascending order, Z represents the set of non-positive values from W, and k indicates the threshold position in L: all the top-k values in L correspond to significant grids, where k = (1-p)|L|. One parameter that specifies k is the density threshold p, which remains the same with or without noise introduction. However, |L|, the other parameter that determines k, changes to |L'| under differential privacy, where L' is the list of positive wavelet transformed values from W' sorted in ascending order. L differs from L' since noise introduction might turn a portion of the non-positive values in Z positive and a small portion of the positive values in L non-positive.

5.2.1 Utility Analysis for PrivQT.

We first analyze the difference between k and k' in PrivQT. The difference depends on two factors: (1) the set of non-positive values in Z that become noisy positive, Z'_p = {W'_Z | W'_Z = W_Z + Noise, W_Z ∈ Z, W'_Z > 0}, where W'_Z is the noisy version of a value in Z, and (2) the set of positive values in L that become noisy non-positive, L'_n = {W'_L | W'_L = W_L + Noise, W_L ∈ L, W'_L ≤ 0}, where W'_L is the noisy version of a positive value in L. That is, k' = (1-p)(|L| + |Z'_p| - |L'_n|).

Analysis of |Z'_p|. In PrivQT, since we add Lap(1/ϵ) noise to each grid count and the Haar transform averages four adjacent grids, the noise added to a wavelet transformed value is the sum of four i.i.d. samples from the Laplace distribution. The sum of h i.i.d. Laplace random variables with mean 0 is the difference of two i.i.d. Gamma random variables [26]; we refer to this distribution as T. The density of T is a polynomial in |x| divided by e^{|x|}, which is a symmetric function, so the probability that T produces a positive value is 1/2. Thus, the events of values in Z receiving positive noise from distribution T conform to the Binomial distribution with parameters |Z| and 1/2, whose expected value is |Z|/2.
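This symmetry argument is easy to verify numerically. The sketch below, with an illustrative ϵ and sample size of our choosing, estimates the probability that the four-fold Laplacian noise sum is positive, which should be close to 1/2.

```python
import numpy as np

# Empirical check: the sum of four i.i.d. Lap(1/eps) samples is symmetric
# around 0, so a zero entry of W turns positive with probability ~1/2,
# giving E[|Z'_p|] = |Z|/2.
rng = np.random.default_rng(42)
eps = 1.0
noise_sums = rng.laplace(scale=1.0 / eps, size=(200_000, 4)).sum(axis=1)
frac_positive = float(np.mean(noise_sums > 0))
```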

Analysis of |L'_n|. Each value in L receives noise drawn from the symmetric distribution T. The probability that exactly x values become non-positive is f_{|L'_n|}(x) = C(|L|, x) · ∏_{i=1}^{x} f_T(y ≤ -W_i) · ∏_{i=x+1}^{|L|} f_T(y > -W_i), with W_i ∈ L, and the expected value is E[|L'_n|] = ∑_{i=1}^{|L|} f_T(y ≤ -W_i). E[|L'_n|] is large when the W_i are small and the privacy budget is limited. Consider an extreme case that is not suitable for clustering: all the positive values in L equal the minimum value 0.5 (the sum of four adjacent grid counts being the minimum value 1), resulting in a high E[|L'_n|]. Clustering, and the WaveCluster algorithm in particular, is useful when the dataset has dense areas (clusters) and empty areas (gaps between clusters). This extreme case is not suitable for clustering since its data distribution is close to uniform. Datasets that are interesting for clustering typically have highly dense cluster centers and cluster borders with low density; only the values corresponding to border grids can become noisy non-positive, and the number of border grids is relatively small. Therefore, E[|L'_n|] is a small constant, and we refer to the value of |L'_n| as θ in the following analysis.

Analysis of k' - k. In PrivQT, E[k' - k] = (1-p)(|Z|/2 - θ). There are two extreme cases where k' - k ≈ 0. In one extreme, |Z| = |L| and all the positive values in L equal the minimum value 0.5; when ϵ = 1, θ ≈ 0.43|L| ≈ |Z|/2, which makes k' - k ≈ 0. In the other extreme, |Z| = 0 and all the positive values in L are large, e.g., ≥ 15; when ϵ = 1, θ ≈ 0 and k' - k ≈ 0. For datasets that are interesting in the context of clustering, |Z| is large relative to the whole space, since the empty grids in Z separate the clusters. Moreover, dense areas within clusters are typically larger than the low-density cluster borders, i.e., θ is far smaller than |Z|/2. In PrivQT, |Z|/2 thus dominates the difference k' - k, which increases the false positive rate. In PrivTHR and PrivTHREM, we use different strategies to minimize the difference k' - k.

Theorem 4.

In PrivQT with the Haar transform, given 0 < ω < 1, let η1 = |Z|/2 - sqrt(|Z| ln(1/ω)/2), η2 = |Z|/2 + sqrt(|Z| ln(1/ω)/2), and γ = (8/ϵ) ln(4(|L|+|Z|)/ω). Then with probability at least (1-ω)^2, (1) all values in L greater than W_{k'_min} + γ are output, where k'_min = k + (1-p)(η1 - θ), and (2) no values in L less than W_{k'_max} - γ are output, where k'_max = k + (1-p)(η2 - θ).

Proof.

In PrivQT, k' = (1-p)(|L| + |Z'_p| - |L'_n|). Since |Z'_p| follows the Binomial distribution with parameters |Z| and 1/2, and |L'_n| is the small value θ, k' follows a Binomial distribution and decides the number of values in L that are output. Given ω, we can derive k''s lower bound k'_min and show that values greater than W_{k'_min} + γ are output, i.e., subclaim (1). Let 1 - ω = Pr(k' ≥ k'_min) = Pr(|Z'_p| ≥ η1). As Pr(|Z'_p| ≥ η1) = 1 - Pr(|Z'_p| ≤ η1) and Pr(|Z'_p| ≤ η1) ≤ e^{-2(|Z|/2 - η1)^2/|Z|} [6], we have η1 = |Z|/2 - sqrt(|Z| ln(1/ω)/2). For constant ω, η1 = O(|Z|/2 - sqrt(|Z|/2)) suffices.

Similarly to k'_min, we can derive the bound γ on the noise added to each value in L ∪ Z based on ω. For the Haar wavelet transform, each value in L ∪ Z receives noise that is the sum of 4 Laplacian random variables divided by 2 (i.e., 4·Lap(1/ϵ)/2). Suppose all 4(|L|+|Z|) Laplacian random variables generate noise within [-γ/4, γ/4]. The probability that no Laplacian random variable's value falls outside [-γ/4, γ/4] is 1 - Pr(A), where A is the event that at least one Laplacian random variable's value falls outside [-γ/4, γ/4]. By the union bound, Pr(A) ≤ ∑_{i=1}^{4(|L|+|Z|)} Pr(B_i), where B_i is the event that the ith Laplacian random variable's noise falls outside [-γ/4, γ/4] and Pr(B_i) = e^{-ϵγ/8}. Thus, with probability at least 1 - 4(|L|+|Z|)e^{-ϵγ/8}, no Laplacian random variable's value falls outside [-γ/4, γ/4], and each value in L ∪ Z has its noise amount within [-γ/2, γ/2]. Let ω = 4(|L|+|Z|)e^{-ϵγ/8}; then -ϵγ/8 = ln(ω/(4(|L|+|Z|))) and γ = (8/ϵ) ln(4(|L|+|Z|)/ω). For constant ω, γ = O(ln(4(|L|+|Z|))/ϵ).

Subclaim (1) can be derived from (a) with probability at least 1-ω, k' ≥ k'_min, and (b) with probability at least 1-ω, the noise of each value in L being within [-γ/2, γ/2]; the detailed proof is omitted here. Since subclaim (1) requires both conditions (a) and (b) to hold, the probability is at least (1-ω)^2.

We can also derive the upper bound k'_max of k' given ω. Let 1 - ω = Pr(k' ≤ k'_max) = Pr(|Z'_p| ≤ η2). Recall that |Z'_p| follows the Binomial distribution (|Z|, 1/2), which is symmetric with respect to |Z|/2. Thus, the probability of sampling a value from the range [0, η2] is the same as that of sampling a value from the range [η1, |Z|], and we have η2 = |Z| - η1 = |Z|/2 + sqrt(|Z| ln(1/ω)/2). For constant ω, η2 = O(|Z|/2 + sqrt(|Z|/2)) suffices.

Subclaim (2) can be proved from (c) with probability at least 1-ω, k' ≤ k'_max, and (b) with probability at least 1-ω, the noise of each value in L being within [-γ/2, γ/2]. As subclaim (2) requires both conditions (c) and (b) to hold, the probability is at least (1-ω)^2.

For other wavelet transforms that use circular convolutions, such as the Biorthogonal transform, the derivation of the bounds of k' via η1 and η2 remains the same, since the fact that |Z'_p| follows a Binomial distribution is independent of the wavelet transform being adopted. Thus, our framework is extensible to other wavelet transforms, and the bound γ on the noise magnitude depends on the number of adjacent grid counts involved in computing a wavelet transformed value.
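As a quick numeric sanity check of the bound γ = (8/ϵ) ln(4(|L|+|Z|)/ω), plugging γ back into the union bound 4(|L|+|Z|)·e^{-ϵγ/8} recovers ω exactly. The concrete values for ϵ, |L|+|Z|, and ω below are illustrative choices of ours.

```python
import math

# Illustrative values: eps = 1, |L|+|Z| = 1024 transformed values, omega = 0.05.
eps, n_cells, omega = 1.0, 1024, 0.05
gamma = (8.0 / eps) * math.log(4 * n_cells / omega)
# Substituting gamma back into the union bound recovers omega.
failure_prob = 4 * n_cells * math.exp(-eps * gamma / 8.0)
```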

5.2.2 Utility Analysis for PrivTHR.

Theorem 5.

In PrivTHR with the Haar transform, given 0 < ω < 1, let η1 = |Z|/2 - sqrt(|Z| ln(1/ω)/2), η2 = |Z|/2 + sqrt(|Z| ln(1/ω)/2), γ = (8/ϵ1) ln(4(|L|+|Z|)/ω), and β = (2/ϵ2) ln(1/ω). Then with probability at least (1-ω)^3, (1) all values in L greater than W_{k'_min} + γ are output, where k'_min = k + (1-p)(η1 - θ - |Z|/2 - β), and (2) no values in L less than W_{k'_max} - γ are output, where k'_max = k + (1-p)(η2 - θ - |Z|/2 + β).

Proof.

In PrivTHR, we allocate ϵ1 for private quantization and ϵ2 for protecting |Z|/2, which makes k' = (1-p)(|L| + |Z'_p| - θ - |Z|/2 + Lap(1/ϵ2)). With probability at least 1 - e^{-ϵ2β/2}, the noise amount of Lap(1/ϵ2) is within β. Let 1 - ω = 1 - e^{-ϵ2β/2}; then β = (2/ϵ2) ln(1/ω). For constant ω, β = O(1/ϵ2) suffices. The proofs of η1, η2, γ, and subclaims (1) and (2) are the same as in Theorem 4, with γ = O(ln(4(|L|+|Z|))/ϵ1) for constant ω.

Difference between PrivTHR and PrivQT: By Theorem 4, in PrivQT k'_min = k + (1-p)(η1 - θ) and k'_max = k + (1-p)(η2 - θ). By Theorem 5, in PrivTHR k'_min = k + (1-p)(η1 - θ - |Z|/2 - β) and k'_max = k + (1-p)(η2 - θ - |Z|/2 + β), with η1 = O(|Z|/2 - sqrt(|Z|/2)) and η2 = O(|Z|/2 + sqrt(|Z|/2)). As we can see, by removing |Z|/2 ± β positive values from L', PrivTHR provides a better utility guarantee than PrivQT, since the difference between k' and k becomes O(sqrt(|Z|/2)) ± (β + θ), where θ is a small constant and β is small when a sufficient budget ϵ2 is provided.
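The scale of this improvement is easy to see with concrete numbers; the value |Z| = 1000 below is an illustrative assumption of ours, not from the paper.

```python
import math

# PrivQT's deviation of k' is driven by |Z|/2 grids, while PrivTHR's
# is driven by sqrt(|Z|/2) grids (up to the small beta and theta terms).
z = 1000
privqt_scale = z / 2                # 500 grids of deviation
privthr_scale = math.sqrt(z / 2)    # roughly 22 grids of deviation
```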

5.2.3 Utility Analysis for PrivTHREM.

Theorem 6.

In PrivTHREM with the Haar transform, given 0 < ω < 1, let η1 = |L| - k - 1 + (2/ϵ2) ln(|W_max - W_k| / (|W_k| ω)), η2 = k - 2 + (2/ϵ2) ln(|W_k| / (|W_max - W_k| ω)), and γ = (8/ϵ1) ln(4(|L|+|Z|)/ω). Then with probability at least (1-ω)^2, (1) all values in L greater than W_{k'_min} + γ are output, where k'_min = k - η1, and (2) no values in L less than W_{k'_max} - γ are output, where k'_max = k + η2.

Proof.

In PrivTHREM, we allocate ϵ2 for deriving k' from k via the Exponential mechanism, a general method proposed in [30]. The probability of selecting a rank i is proportional to |Pt_i| · exp(-(ϵ2/2)|i - rank(W_k)|), where Pt_i is the range (W_{i-1}, W_i] determined by the (i-1)th and ith wavelet transformed values.

Let 1 - ω be the probability of sampling a k' with k - k' ≤ η1. Then

ω < (|W_max - W_k| · e^{-(ϵ2/2)(η1+1)}) / (|W_k| · e^{-(ϵ2/2)(|L|-k)})
⇔ η1 < |L| - k - 1 + (2/ϵ2) ln(|W_max - W_k| / (|W_k| ω)).

For constant ω, η1 = O(|L| - k + (1/ϵ2) ln(|W_max - W_k| / |W_k|)) suffices.

Let 1 - ω be the probability of sampling a k' with k' - k ≤ η2. Then

ω < (|W_k| · e^{-(ϵ2/2)(η2+1)}) / (|W_max - W_k| · e^{-(ϵ2/2)(k-1)})
⇔ η2 < k - 2 + (2/ϵ2) ln(|W_k| / (|W_max - W_k| ω)).

For constant ω, η2 = O(k + (1/ϵ2) ln(|W_k| / |W_max - W_k|)) suffices. The proofs of γ and subclaims (1) and (2) are the same as in Theorem 4.

Analysis of PrivTHR and PrivTHREM. By Theorem 5 and Theorem 6, the accuracy of sampling k' in PrivTHR is dominated by 1/ϵ2, while in PrivTHREM it is dominated by (1/ϵ2) ln(|W_k| / |W_max - W_k|). Depending on the data distribution, PrivTHREM may provide a better or worse utility guarantee than PrivTHR: ln(|W_k| / |W_max - W_k|) is positive when |W_k| / |W_max - W_k| > 1, in which case sampling k' in PrivTHREM is more sensitive to ϵ2 than in PrivTHR; it becomes negative when |W_k| / |W_max - W_k| < 1, in which case the utility bounds of PrivTHREM are better than those of PrivTHR.

Section 7 demonstrates that by reducing the difference between k' and k, PrivTHR and PrivTHREM achieve more accurate results than PrivQT, which conforms to the above analysis.

Figure 4: Illustration of datasets and their WaveCluster results. (a) DS1, g = 64, p = 58; (b) DS2, g = 40, p = 10; (c) DS3, g = 36, p = 23; (d) Gowalla, g = 80, p = 31.

6 Quantitative Measures

To quantitatively assess the utility of differentially private WaveCluster, we propose two types of measures of the dissimilarity between true and differentially private WaveCluster results. The first type, DSG_C, measures the dissimilarity of the significant grids and the clusters between true and private results. The second type focuses on the usefulness of differentially private WaveCluster results for further data analysis, since a slight difference in the significant grids or clusters may cause a significant difference when the WaveCluster results are used downstream. In this paper, we choose a typical application for further data analysis: building a classifier from the clustering results to predict unlabeled data [20]. The classifier built from the true WaveCluster results is called the true classifier clf_t, while the classifier built from the differentially private WaveCluster results is called the private classifier clf_p. To measure the dissimilarity between clf_t and clf_p, we propose two metrics: OCM and 2CE.

6.1 Dissimilarity based on Significant Grids and Clusters

DSG_C considers the dissimilarities of significant grids and clusters. Assume that there are t clusters of true significant grids and s clusters of differentially private significant grids. t might not equal s, and the cluster labels in the t true clusters and the s private clusters are completely arbitrary. To accommodate these differences, we adopt the Hungarian method [27], a combinatorial optimization algorithm, to solve the matching problem between the t true clusters and the s private clusters while minimizing the matching difference.

When a cluster C_i is matched to a cluster C_j, we define the distance d between them as max{|C_i \ C_j|, |C_j \ C_i|}. Consider a cluster C_i = {g1, g3, g5} and a cluster C_j = {g1, g5, g7, g9}; the distance d between C_i and C_j is max{|{g3}|, |{g7, g9}|} = 2. Given t true clusters and s private clusters with t ≥ s, a matching M_{t,s} of the t true clusters and the s private clusters is a set of cluster pairs in which each private cluster is matched with a distinct true cluster. We then define the cost of a matching (M_cost) as the sum of the distances over all cluster pairs in M_{t,s} plus the count of significant grids in the non-matched true clusters:

M_cost = ∑_{(C_{i_x}, C_{j_y}) ∈ M_{t,s}} max{|C_{i_x} \ C_{j_y}|, |C_{j_y} \ C_{i_x}|} + ∑_{non-matched C_z} |C_z|

Here, i_x and j_y indicate the subscripts of the clusters in a matched pair, and |C_z| is the count of significant grids in a non-matched true cluster. Among all possible matchings of clusters, we use the Hungarian method to find the optimal matching with the minimum M_cost, and compute DSG_C as:

DSG_C = M_cost / |T|

Here T denotes the set of significant grids in the true WaveCluster results.
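The computation above can be sketched as follows. The paper uses the Hungarian method; for illustration we brute-force all matchings instead (equivalent for small t and s), and the representation of clusters as Python sets of grid ids is our own choice.

```python
from itertools import permutations

def dsg_c(true_clusters, priv_clusters):
    """DSG_C via exhaustive search over matchings of private clusters
    to distinct true clusters; assumes t >= s."""
    t, s = len(true_clusters), len(priv_clusters)
    total_true = sum(len(c) for c in true_clusters)  # |T|
    best = float("inf")
    for perm in permutations(range(t), s):
        # distance of each matched pair: max of the two set differences
        cost = sum(max(len(true_clusters[i] - priv_clusters[j]),
                       len(priv_clusters[j] - true_clusters[i]))
                   for j, i in enumerate(perm))
        # plus the significant grids of the non-matched true clusters
        cost += sum(len(true_clusters[i]) for i in range(t) if i not in perm)
        best = min(best, cost)
    return best / total_true
```

On the example above, matching {g1, g3, g5} to {g1, g5, g7, g9} contributes a pair distance of 2.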

6.2 Dissimilarity based on Classifier Prediction

OCM and 2CE measure the dissimilarity between clf_t and clf_p. We call this evaluation scheme "clustering-first-then-classification": given a set of unlabeled data points, we use a portion of the data points (e.g., 90%) to compute WaveCluster results, where each cluster is a set of significant grids. Using the significant grids with cluster labels as training data, we build the classifiers clf_t and clf_p, and use them to predict the classes of the remaining data points (e.g., 10%).

Dissimilarity of Classifiers based on Optimal Class Matching (OCM). OCM measures the dissimilarity between the two sets of classes predicted by clf_t and clf_p for the same test samples. We use L_t to denote the set of classes predicted by clf_t and L_p to denote the set of classes predicted by clf_p. Since the labels in L_t and L_p are completely arbitrary, we exploit the Hungarian method to find the optimal matching between L_t and L_p.

Assume that a class L_{t,i} predicted by clf_t is matched to a class L_{p,j} predicted by clf_p, forming a class pair. We compute the count of common test samples in L_{t,i} and L_{p,j}, and sum the common test samples over all class pairs to compute CT:

CT = ∑_{(L_{t,i}, L_{p,j}) matched} |L_{t,i} ∩ L_{p,j}|

Here c1 is the count of classes in L_t and c2 is the count of classes in L_p, and we assume c1 ≥ c2. Since there are many possible mappings from the classes in L_t to the classes in L_p, we use the Hungarian method to find the optimal mapping that maximizes CT. Based on CT and the total count TT of test samples, we derive the dissimilarity OCM:

OCM = 1 - CT / TT

When the dissimilarity is smaller, the differentially private WaveCluster results are more similar to the true WaveCluster results and maintain high utility for classification use.

Dissimilarity of Classifiers based on 2-Combination Enumeration (2CE). 2CE measures the dissimilarity between clf_t and clf_p based on the relationship of every pair of test samples, i.e., whether the two samples are in the same class. Given a pair of test samples A and B, we say A and B are classified consistently if either (1) clf_t(A) = clf_t(B) and clf_p(A) = clf_p(B), or (2) clf_t(A) ≠ clf_t(B) and clf_p(A) ≠ clf_p(B). 2CE is the ratio of the count of test sample pairs that are not classified consistently to the total number of test sample pairs, i.e., the number of 2-combinations of the test samples. By using pairs of test samples, 2CE eliminates the need to find an optimal matching between the classes predicted by clf_t and clf_p.
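The definition translates directly into code; the O(n^2) pair enumeration below is our own illustrative implementation, taking the two classifiers' predictions as label sequences.

```python
from itertools import combinations

def two_ce(labels_t, labels_p):
    """2CE: fraction of test-sample pairs classified inconsistently by
    the true and private classifiers."""
    n, inconsistent = len(labels_t), 0
    for i, j in combinations(range(n), 2):
        same_t = labels_t[i] == labels_t[j]   # same class under clf_t?
        same_p = labels_p[i] == labels_p[j]   # same class under clf_p?
        if same_t != same_p:
            inconsistent += 1
    return inconsistent / (n * (n - 1) // 2)  # all 2-combinations
```

Note that 2CE compares only co-membership of pairs, so it is invariant to any relabeling of the predicted classes.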

Figure 5: Comparing the private k' of the 4 techniques with the true k on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show k' and k on DS1, DS2, DS3, and Gowalla, respectively.

7 Experiments

We evaluate the proposed techniques using three datasets that are widely used by previous clustering algorithms [1], and one large-scale dataset derived from the check-in information of the Gowalla geo-social networking website [9] (https://snap.stanford.edu/data/loc-gowalla.html), which was used to evaluate grid-based clustering algorithms in [37].

7.1 Experiment Setup

In our experiments, we compare the performance of the four techniques, Baseline, PrivQT, PrivTHR, and PrivTHREM, on the four datasets using the two types of measures proposed in Section 6, and analyze the results. We use the Haar transform as the wavelet transform and set the wavelet decomposition level to 1 for all four techniques. Baseline uses the adaptive-grid method [33] for synthetic data generation. The classification algorithm used for measuring OCM and 2CE is the C4.5 decision tree algorithm [34]. We conduct experiments with privacy budgets ranging from 0.1 to 2.0; for each budget and each metric, we apply the techniques to each dataset 10 times and report the average performance. All experiments were conducted on a machine with an Intel 2.67GHz CPU and 8GB RAM.

Datasets. The four clustering datasets contain different data shapes that are especially interesting for clustering. Figure 4 shows the WaveCluster results on the four datasets under particular settings of the grid size g and the density threshold p. Any two adjacent clusters are marked with different colors. The points in red are identified as noise; they fall into the non-significant grids.

DS1 is a dataset containing 15 Gaussian clusters with different degrees of cluster overlap. It contains 30000 data points. These 15 clusters are all convex. The center area of each cluster has higher density and is resistant to noise. However, the overlapping area of two adjacent clusters has lower density and is prone to be affected by noise, which might turn the corresponding non-significant grids into significant grids and thereby connect two separate clusters. DS2 is a dataset with 3 spiral clusters. It contains 31200 data points. The head of each spiral is quite close to the others, so noisy significant grids are very likely to bridge the gap between adjacent spirals and merge them into one cluster. DS3 is a dataset with 5 clusters of various shapes, including concave shapes. It contains 31520 data points. Two of the clusters each consist of two sub-components joined by a narrow line-shaped area. This narrow bridging area has low density and might be turned into non-significant grids, causing a cluster to split into two. Gowalla is a check-in dataset resembling the world map, which records the time and location of users' check-ins. We use only the location information for evaluation. There are about 6.4M records in total. The large size of the dataset makes it infeasible to run experiments with C4.5 and Baseline due to memory constraints. Thus, similar to [33], we sampled 1M records from the dataset for evaluation.

We next present the results of comparing k' with k, and then the results of the two types of measures.

Figure 6: Comparing DSG_C of the 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show DSG_C on DS1, DS2, DS3, and Gowalla, respectively.

7.2 Comparing Private k' With True k

We first measure the differences between the true k and the private k' on each dataset with ε ranging from 0.1 to 2.0; the results are shown in Figure 5. For all datasets, when ε ≥ 0.5, the relative errors of k', i.e., |k' − k|/k, in PrivTHR and PrivTHREM are less than 4.7% on average, while the relative errors of k' in Baseline and PrivQT range from 32.2% to 150.5%. For example, in DS2, the true k is 144. When ε is 1, the average private k' is 141.0 (2.1%) for PrivTHR and 142.8 (0.8%) for PrivTHREM, while Baseline and PrivQT obtain average k' values of 284.0 (97.2%) and 249.2 (73.1%), respectively. Note that |Z| is 241 in DS2, and the difference between the average k' and k is 105.2 for PrivQT, which is quite close to the theoretical bound (1 − p)|Z|/2 = 108.45 derived from our utility analysis in Section 5.2.1. When ε is 0.1, the k' in PrivTHREM deviates from k more significantly than the k' in PrivTHR, indicating that PrivTHREM is more sensitive to ε than PrivTHR, as discussed in Section 5.2.3. For example, in DS2, the average k' in PrivTHREM is 82.8 (42.5%) while the average k' in PrivTHR is 131.2 (8.9%).
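The quoted percentages and the theoretical bound can be reproduced from the reported averages. In the sketch below, the density threshold p = 0.1 is inferred from the quoted bound (1 − p)|Z|/2 = 108.45 rather than stated in this section:

```python
# DS2 figures quoted above; p = 0.1 is an inferred assumption.
k_true, z, p = 144, 241, 0.1

avg_k = {"PrivTHR": 141.0, "PrivTHREM": 142.8,
         "Baseline": 284.0, "PrivQT": 249.2}

# Relative error |k' - k| / k for each technique.
rel_err = {name: abs(kp - k_true) / k_true for name, kp in avg_k.items()}

bound = (1 - p) * z / 2                    # (1 - p)|Z|/2 from Section 5.2.1
gap_privqt = abs(avg_k["PrivQT"] - k_true) # 105.2, just under the bound

print({n: round(100 * e, 1) for n, e in rel_err.items()})
# → {'PrivTHR': 2.1, 'PrivTHREM': 0.8, 'Baseline': 97.2, 'PrivQT': 73.1}
print(round(bound, 2), round(gap_privqt, 1))  # → 108.45 105.2
```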

7.3 Results of DSG_C

Figure 6 shows the results of DSG_C for the four techniques when the privacy budget ranges from 0.1 to 2.0. The X-axis shows the privacy budgets, and the Y-axis denotes the values of DSG_C. Both PrivTHR and PrivTHREM achieve smaller DSG_C values than Baseline and PrivQT on all four datasets for all budgets. The reason is that, although the noisy significant grids generated by Baseline and PrivQT may be similar to the true significant grids, these noisy significant grids produce very different cluster shapes and thus a large value of DSG_C, while PrivTHR and PrivTHREM preserve more accurate cluster shapes. For example, in DS3, the narrow line-shaped areas and the gap between two adjacent clusters are sensitive to noise. If noisy significant grids appear in these areas, two clusters may be merged into one; if significant grids disappear due to noise, one cluster might be split into two. Such changes cause DSG_C to increase significantly.

Unlike the other techniques, PrivQT benefits little from increased privacy budgets. For PrivQT, the difference between k' and k is dominated by |Z|/2. Increasing the privacy budget only reduces the noise magnitude and cannot smooth out this difference.

Figure 7: Comparing F-Measure of the 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show F-Measure on DS1, DS2, DS3, and Gowalla, respectively.

Comparison to F-Measure Results. Clustering analysis usually uses F-measure as a representative external validation to measure the similarity between the ground truth (known class labels) and the clustering results [2]. In our experiments, we treat the true WaveCluster results as the ground truth; the F-measure results are shown in Figure 7. PrivQT and Baseline achieve high F-measure scores (above 0.8) for almost all budgets on DS1, even though their private results are quite different from the true results. For example, when ε = 0.1, the private results of PrivQT and Baseline have more than 30 clusters while the true results have only 15. In contrast, Figure 6 (a) shows that DSG_C clearly differentiates the performance of the four techniques. The reason is that, unlike DSG_C, which allows only one-to-one mappings between true and private clusters, F-measure allows one-to-many or many-to-one mappings. If there are more true clusters than private clusters, F-measure allows many-to-one mappings, and vice versa. Thus, DSG_C provides a stricter evaluation than F-measure when computing similarity/dissimilarity.
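The leniency of many-to-one mapping can be seen in a common cluster-level F-measure, sketched below. This is our own illustrative variant, not necessarily the exact formulation of [2]: each true cluster is scored against its best-matching private cluster, and the same private cluster may be reused, unlike the strict one-to-one matching behind DSG_C.

```python
def cluster_f_measure(true_labels, priv_labels):
    # Weighted average, over true clusters, of the best F1 against any
    # private cluster (many-to-one mapping allowed). Hypothetical sketch.
    n = len(true_labels)

    def groups(labels):
        return [{i for i, l in enumerate(labels) if l == c}
                for c in set(labels)]

    score = 0.0
    for t in groups(true_labels):
        best = 0.0
        for p in groups(priv_labels):
            inter = len(t & p)
            if inter:
                prec, rec = inter / len(p), inter / len(t)
                best = max(best, 2 * prec * rec / (prec + rec))
        score += len(t) / n * best   # weight by true cluster size
    return score

# Splitting one true cluster into two only mildly lowers the score:
print(round(cluster_f_measure([0, 0, 1, 1], [7, 7, 8, 9]), 4))  # → 0.8333
```

Under a one-to-one matching such as the one DSG_C enforces, the split cluster in this example would have to be paired with a single private cluster, so the same distortion is penalized more heavily.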

Figure 8: Comparing OCM and 2CE of the 4 techniques on DS1, DS2, DS3 and Gowalla with increasing ε. Panels (a)–(d) show OCM and panels (e)–(h) show 2CE on DS1, DS2, DS3, and Gowalla, respectively.

7.4 Results of OCM and 2CE

Results of OCM. Figure 8 shows the results of OCM for the four techniques; the X-axis denotes the privacy budgets and the Y-axis the values of OCM. PrivTHR and PrivTHREM achieve smaller OCM values than Baseline and PrivQT on all datasets when ε ranges from 0.5 to 2.0. When ε is greater than 0.5, the OCM values of PrivTHR and PrivTHREM are less than 0.15 on DS1, DS3, and Gowalla, indicating that the private classifier clf_p produces predictions highly similar to those of the true classifier clf_t. On DS2, which contains 3 spirals, PrivTHREM still maintains a very low OCM value (< 0.1) when ε is greater than 0.5, while PrivTHR has a slightly worse OCM value (ranging from 0.1 to 0.2). These results show that PrivTHREM is more resilient to noise on concave-shaped data than PrivTHR.

Results of 2CE. Figure 8 also shows the results of 2CE for the four techniques; the X-axis denotes the privacy budgets and the Y-axis the values of 2CE. PrivTHR and PrivTHREM achieve smaller 2CE values than Baseline and PrivQT on all datasets when ε ranges from 0.5 to 2.0.

In general, all four techniques exhibit similar trends in 2CE as in OCM. On DS1, all four techniques have very low 2CE values (< 0.1) even though their corresponding OCM values are much higher (ranging from 0.05 to 0.5). The reason is that 2CE captures the relationships between data points while OCM focuses on the mappings of classes. If k test samples out of N total samples have different predictions in the true and private results, 2CE expresses the differences as C(k, 2) + k(N − k) over the total number of test sample pairs C(N, 2), while OCM expresses the differences as k over N. On DS1, the k affected test samples are predicted to be in the same cluster in the private results, so C(k, 2) is close to 0 and only k(N − k) matters in the computation of 2CE. Given that C(N, 2) is much larger than both N and k(N − k) when N is about 30,000 for DS1, 2CE yields a smaller value than OCM for measuring the differences, and is thus less sensitive to noise on DS1.

Budget Allocation for PrivTHR. Based on the utility analysis in Section 5.2.2, ε_1 for private quantization affects the accuracy of γ, and ε_2 for obtaining |Z|' affects the accuracy of β. As the constant factor of γ, (8/ε_1) ln(4(|L| + |Z|)/ω), is larger than the constant factor of β, (2/ε_2) ln(1/ω), more budget should be allocated to ε_1 to achieve better utility. We evaluate the DSG_C values of PrivTHR on DS1 under different budget allocation strategies, ranging from 1% to 99% of the budget for ε_1. The allocation with 90% for ε_1 and 10% for ε_2 performs best. The other measures on DS1 show similar results, as do both types of measures on the other datasets; detailed results are omitted.
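The trade-off behind this allocation can be sketched numerically. The sketch below uses the sum of the two constant factors as a rough error proxy, which is our own simplification rather than the exact objective of the utility analysis, and assumes hypothetical values for |L| + |Z| and ω (neither is fixed in this section):

```python
import math

# Hypothetical values: total budget, grid count |L| + |Z|, and omega.
eps, lz, omega = 1.0, 1024, 0.05

def error_proxy(alpha):
    # eps_1 = alpha * eps, eps_2 = (1 - alpha) * eps; sum the constant
    # factors of gamma and beta as a crude stand-in for overall error.
    gamma_factor = 8 / (alpha * eps) * math.log(4 * lz / omega)
    beta_factor = 2 / ((1 - alpha) * eps) * math.log(1 / omega)
    return gamma_factor + beta_factor

# Grid-search the split of the budget given to eps_1.
best = min((a / 100 for a in range(1, 100)), key=error_proxy)
print(best)   # a heavily eps_1-weighted split (0.8 under these values)
```

Even this crude proxy lands on a split that strongly favors ε_1, consistent with the empirical finding that a 90%/10% allocation performs best.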

8 Conclusion

In this paper we have addressed the problem of cluster analysis with differential privacy. We take a well-known, effective, and efficient clustering algorithm called WaveCluster, and propose several ways to introduce randomness into its computation. We also devise several new quantitative measures for examining the dissimilarity between the non-private and differentially private results and the usefulness of differentially private results for classification. In the future, we will investigate other categories of clustering algorithms under differential privacy, such as hierarchical clustering. Another important problem is to explore the applicability of differentially private clustering in cases where users do not have good knowledge of the dataset and the parameters of the algorithms must be inferred in a differentially private way.

Acknowledgments. This work is supported in part by the National Science Foundation under award CNS-1314229.

References

  • [1] Clustering datasets. http://cs.joensuu.fi/sipu/datasets/.
  • [2] E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, and A. Zimek. Evaluation of clusterings - metrics and visual support. In ICDE, 2012.
  • [3] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving anonymity via clustering. In PODS, 2006.
  • [4] A. N. Akansu and R. A. Haddad. Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets. Academic Press, Inc., 1992.
  • [5] A. N. Akansu, W. A. Serdijn, and I. W. Selesnick. Emerging applications of wavelets: A review. Phys. Commun., 3(1), 2010.
  • [6] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley, 1992.
  • [7] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In PODS, 2007.
  • [8] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In NIPS, 2008.
  • [9] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: User movement in location-based social networks. In KDD, 2011.
  • [10] G. Cormode, C. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In ICDE, 2012.
  • [11] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992.
  • [12] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
  • [13] C. Dwork and J. Lei. Differential privacy and robust statistics. In STOC, 2009.
  • [14] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
  • [15] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. In STOC, 2009.
  • [16] A. Friedman and A. Schuster. Data mining with differential privacy. In KDD, 2010.
  • [17] A. Friedman, R. Wolff, and A. Schuster. Providing k-anonymity in data mining. The VLDB Journal, 17(4), July 2008.
  • [18] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4), 2010.
  • [19] B. C. M. Fung, K. Wang, L. Wang, and P. C. K. Hung. Privacy-preserving data publishing for cluster analysis. Data Knowl. Eng., 68(6), 2009.
  • [20] P. Green, F. J. Carmone, and S. M. Smith. Multidimensional scaling, section five: Dimension reducing methods and cluster analysis. 1989. Addison Wesley.
  • [21] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2011.
  • [22] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. PVLDB, 3(1-2), 2010.
  • [23] B. K. P. Horn. Robot Vision. The MIT Press, 1988.
  • [24] A. Karakasidis and V. S. Verykios. Reference table based k-anonymous private blocking. In SAC, 2012.
  • [25] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In FOCS, 2008.
  • [26] S. Kotz, T. Kozubowski, and K. Podgórski. The Laplace distribution and generalizations : a revisit with applications to communications, economics, engineering, and finance. Birkhäuser, 2001.
  • [27] H. W. Kuhn. Variants of the Hungarian method for assignment problems. Naval Research Logistics Quarterly, 3, 1956.
  • [28] S. G. Mallat. A Wavelet Tour of Signal Processing. Academic Press. Academic Press, Inc., 1999.
  • [29] F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9), 2010.
  • [30] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.
  • [31] N. Mohammed, R. Chen, B. C. Fung, and P. S. Yu. Differentially private data release for data mining. In KDD, 2011.
  • [32] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.
  • [33] W. H. Qardaji, W. Yang, and N. Li. Differentially private grids for geospatial data. In ICDE, 2013.
  • [34] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
  • [35] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In VLDB, 1998.
  • [36] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB J., 8(3-4), 2000.
  • [37] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung. Density-based place clustering in geo-social networks. In SIGMOD, 2014.
  • [38] M. Winslett, Y. Yang, and Z. Zhang. Demonstration of damson: Differential privacy for analysis of large data. ICPADS, IEEE Computer Society, 2012.
  • [39] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. TKDE, 23(8), 2011.
  • [40] J. Xu, Z. Zhang, X. Xiao, Y. Yang, G. Yu, and M. Winslett. Differentially private histogram publication. VLDB J., 22(6), 2013.
  • [41] J. Zhang, X. Xiao, Y. Yang, Z. Zhang, and M. Winslett. Privgene: Differentially private model fitting using genetic algorithms. In SIGMOD, 2013.
  • [42] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. PVLDB, 5(11), 2012.