
Efficient and Model-Agnostic Parameter Estimation Under Privacy-Preserving Post-randomization Data

Qinglong Tian, University of Waterloo
Jiwei Zhao, University of Wisconsin-Madison
Abstract

Protecting individual privacy is crucial when releasing sensitive data for public use. While data de-identification helps, it is not enough. This paper addresses parameter estimation in scenarios where data are perturbed using the Post-Randomization Method (PRAM) to enhance privacy. Existing methods for parameter estimation with PRAM data suffer from several limitations: they are parameter-specific, model-dependent, and lack efficiency guarantees. We propose a novel, efficient method that overcomes these limitations. Our method is applicable to general parameters defined through estimating equations and makes no assumptions about the underlying data model. We further prove that the proposed estimator achieves the semiparametric efficiency bound, making it optimal in terms of asymptotic variance.

1 Introduction

1.1 Background

To make data accessible to the public, agencies like the Census Bureau, medical institutes, and law enforcement often release statistical databases. However, when statistical databases are released publicly, attackers can use them to identify individuals or uncover sensitive information by combining them with other available databases. For example, Sweeney (2001) was able to re-identify some cancer patients in an anonymized medical database by matching information (e.g., gender and zip code) against named external databases (e.g., a voter registration database). Thus, data de-identification alone is not enough to protect individuals' private information. In addition to data de-identification, where information like names is removed, statistical disclosure control (SDC) methods aim to prevent sensitive information from being recovered through inference attacks. One can find a general introduction to SDC methodology in Hundepool et al. (2012) and Willenborg and De Waal (2012).

Data perturbation is a commonly used type of SDC method. It involves intentionally modifying data before publication to safeguard individual privacy. The core concept is straightforward: sensitive information (e.g., gender, zip code, and date of birth) is altered in a way that prevents linking it to external datasets, thus making inference attacks difficult to carry out. Common approaches include adding Gaussian noise or using the Laplacian noise mechanism within the differential privacy framework. We refer to Mivule (2012) and Okkalioglu et al. (2015) for more details.

This paper considers the problem of perturbing categorical variables. The post-randomization method (PRAM), first proposed by Gouweleeuw et al. (1998), is a natural way to add noise to categorical variables. Suppose there is a sensitive binary variable Z\in\{0,1\} that we wish to perturb; PRAM performs the perturbation using a known transition matrix, denoted by {\bf P}\in\mathbb{R}^{2\times 2}. In this example, suppose the transition matrix is defined as

{\bf P}=\begin{bmatrix}\Pr(Z^{\ast}=0|Z=0)&\Pr(Z^{\ast}=0|Z=1)\\ \Pr(Z^{\ast}=1|Z=0)&\Pr(Z^{\ast}=1|Z=1)\end{bmatrix}=\begin{bmatrix}0.8&0.1\\ 0.2&0.9\end{bmatrix},

where Z is the original unperturbed sensitive variable and Z^{\ast} is the perturbed one. We can see that PRAM perturbs the data probabilistically: with probability \Pr(Z^{\ast}=0|Z=0)=0.8 or \Pr(Z^{\ast}=1|Z=1)=0.9, the variable keeps its original value after perturbation; otherwise, its value is flipped to the opposite category. Similarly, the PRAM method extends to variables with k\geq 3 categories via a k\times k transition matrix.

The transition matrix {\bf P} is crucial in balancing data privacy with data utility. While all columns of {\bf P} must sum to 1, with each entry between 0 and 1, the diagonal entries (i.e., \Pr(Z^{\ast}=i|Z=i) for i=0,\dots,k-1) are typically set to be greater than 0.5. Such a requirement preserves the original data to a certain extent, as overly emphasizing privacy would significantly reduce the usefulness of the published data. In the extreme case where all entries of {\bf P} equal 0.5 in the binary example, the perturbed variable would contain no usable information.
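As a concrete illustration, here is a minimal sketch of the PRAM mechanism in Python (assuming numpy; the helper name pram and the data are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# The 2x2 transition matrix from the text: column j holds Pr(Z* = . | Z = j).
P = np.array([[0.8, 0.1],
              [0.2, 0.9]])

def pram(z, P, rng):
    """Perturb each categorical value z_i by drawing Z* from column z_i of P."""
    k = P.shape[0]
    return np.array([rng.choice(k, p=P[:, zi]) for zi in z])

z = rng.binomial(1, 0.3, size=10_000)   # original sensitive variable
z_star = pram(z, P, rng)                # released, perturbed variable
```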

1.2 Existing Work and Motivations

PRAM's probabilistic nature inherently protects against inference attacks. However, for users who do not have access to the original data, using the perturbed data for faithful statistical analysis poses significant challenges. Consider a simple example: suppose we have a logistic regression task with the "PRAM-ed" response variable Y^{\ast} (perturbed from the original variable Y) and the original covariate X, and we are interested in estimating a parameter \beta defined as the solution to an estimating equation E\left\{U(X,Y;\beta)\right\}=0 for some known function U(\cdot) (e.g., \beta can be the coefficient of X in a logistic regression by choosing U(\cdot) accordingly; see Section 2.1 for more details). The estimation of \beta is generally biased if we treat the perturbed Y^{\ast} as if it were the original response Y without making any adjustment. This is because \beta^{\ast} (the solution to the equation E\left\{U(X,Y^{\ast};\beta^{\ast})\right\}=0) is generally different from \beta. So, even if the estimation method is unbiased given the original data, it becomes biased with PRAM data because it targets an entirely different parameter \beta^{\ast}\neq\beta.
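To see the bias concretely, take the simplest case U(x,y;\beta)=y-\beta, so that \beta=E(Y). A quick numerical check (a sketch reusing the hypothetical pram setup above):

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.8, 0.1],
              [0.2, 0.9]])

y = rng.binomial(1, 0.3, size=200_000)                      # E(Y) = 0.3
y_star = np.array([rng.choice(2, p=P[:, yi]) for yi in y])

# The naive estimate targets E(Y*) = 0.2 * 0.7 + 0.9 * 0.3 = 0.41, not 0.3.
print(y.mean(), y_star.mean())
```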

There is some existing work on parameter estimation with PRAM-ed data. In their seminal work, Gouweleeuw et al. (1998) proposed an unbiased moment estimator for frequency counts, whereas van den Hout and van der Heijden (2002) studied a general framework to estimate odds ratios. Provided that a suitable parametric model is available, the EM algorithm (Dempster et al., 1977) appeared to be a popular choice to adjust for PRAM data. van den Hout and Kooiman (2006) estimated the parameters in a linear regression model when covariates are subject to randomized response. Woo and Slavković (2012) developed and implemented EM-type algorithms to obtain maximum likelihood estimates in logistic regression models, and Woo and Slavković (2015) further extended the framework to generalized linear models. In both Woo and Slavković (2012) and Woo and Slavković (2015), the variables subject to PRAM could be the response, a covariate, or both. However, existing methods have the following major limitations:

  1. Parameter-specificity: these methods are often designed for specific parameters and may not apply to more general ones.

  2. Parametric model dependence: many methods rely on specific assumptions about the underlying data model, such as assuming a logistic regression model for p(y|x). This dependence makes them vulnerable to model misspecification.

  3. Limited optimality: existing methods may not offer a guaranteed optimal solution for all situations, and specific data characteristics and analysis goals can heavily influence their performance.

1.3 Related Work

The PRAM problem is closely related to the label noise problem in the machine learning literature (e.g., Lawrence and Schölkopf, 2001; Scott, 2015; Li et al., 2021; Liu et al., 2023; Guo et al., 2024), as well as the misclassification problem in the statistical literature (e.g., Carroll et al., 2006; Buonaccorsi, 2010; Yi, 2017, 2021). The label noise problem is usually considered in a supervised learning setting, where the goal is to train a classifier using labeled data; however, we can only observe the contaminated label Y^{\ast} instead of the clean label Y. Such a problem is common in real-world applications. In a medical context, obtaining an accurate gold standard for diagnosis can be challenging due to factors such as cost, time limitations, and ethical considerations. This often necessitates the use of less reliable, "imperfect" diagnostic procedures, leading to potential misdiagnoses. Consequently, the labels assigned (healthy or diseased) based on these imperfect methods can introduce noise into the data.

In the context of label noise or misclassification, a common assumption known as class-dependent noise (Lawrence and Schölkopf, 2001) states that the distribution of the noisy label Y^{\ast} given the true label Y and the features {\bf X} equals its distribution given only the true label Y: \Pr(Y^{\ast}|Y,{\bf X})=\Pr(Y^{\ast}|Y). Under the class-dependent noise assumption, the label noise and PRAM problems share similar settings. However, a crucial distinction lies in the intentionality of the noise. In label noise scenarios, misclassification occurs unintentionally, arising from factors such as human error, measurement limitations, or imperfect diagnostic procedures. Conversely, in the PRAM setting the noise is deliberately introduced through the known transition matrix {\bf P} (representing \Pr(Y^{\ast}|Y)). Unlike the label noise setting, where the transition matrix is typically unknown, the PRAM setting assumes the transition matrix {\bf P} is readily available to the general public because {\bf P} itself is not sensitive. This access to the transition matrix allows for different approaches and analyses compared to the unintentional noise encountered in typical label noise problems.

1.4 Overview

We propose a novel method for parameter estimation with PRAM data, and we allow the PRAM-ed variable to be either the response or a covariate. We address the aforementioned limitations of existing methods as follows.

  1. General parameters: we consider a parameter {\bm{\beta}} defined through an estimating equation E\left\{U({\bf X},Y;{\bm{\beta}})\right\}={\bf 0}, which is general and covers many commonly used parameters.

  2. Model-agnostic method: we do not impose any parametric assumptions on the data model, so our proposed method is free of model misspecification issues.

  3. Estimation efficiency: we answer the optimality question by proving that our proposed method achieves the semiparametric efficiency bound (Bickel et al., 1993; Tsiatis, 2006). In other words, the proposed estimator has the smallest possible asymptotic variance among all regular asymptotically linear estimators.

The rest of the paper is organized as follows. Section 2 first formally introduces the problem setup and then explains why existing methods rely on parametric assumptions. Section 3 utilizes the findings in Section 2 and proposes an efficient and model-agnostic estimator. Section 4 conducts comprehensive numerical studies to evaluate and compare the proposed method and existing methods. Section 5 concludes the paper with discussions on potential future research.

2 Understanding Model Dependence in Existing Methods

2.1 Problem Setting

We start this section by formally introducing the notation used throughout the paper. We consider a random vector (Y,Y^{\ast},{\bf X})\sim p(y,y^{\ast},{\bf x}), where

  • Y: the original sensitive categorical variable, Y\in\{0,\dots,k-1\};

  • Y^{\ast}: the PRAM-ed variable, i.e., the sensitive variable after applying the privacy-preserving transformation;

  • {\bf X}: a vector of covariates associated with the variable of interest.

Furthermore, due to the PRAM mechanism, the perturbed variable Y^{\ast} is independent of {\bf X} conditional on Y (i.e., p(y^{\ast}|y,{\bf x})=p(y^{\ast}|y)). However, the original y is unobservable; users can only access data from p(y^{\ast},{\bf x}) and the transition matrix {\bf P}, whose (i,j)th entry is \Pr(Y^{\ast}=i-1|Y=j-1), the probability of transforming Y=j-1 into Y^{\ast}=i-1. Though we present the method by perturbing the response variable Y, it is crucial to note that the following discussion and the proposed method apply seamlessly to perturbed sensitive covariates as well. We perturb Y here solely for illustrative purposes.

We are interested in estimating a parameter {\bm{\beta}}\in\mathbb{R}^{d}, defined as the solution to the estimating equation E\left\{{\bf U}(Y,{\bf X};{\bm{\beta}})\right\}={\bf 0} for a known function {\bf U}(\cdot)\in\mathbb{R}^{d}. This general form encompasses various quantities of interest. For example, by letting U(y,x;\beta)=y-\beta, the parameter of interest becomes the mean of the response, \beta=E(Y). For binary y\in\{0,1\}, if we let

{\bf U}(y,x;{\bm{\beta}})=\begin{bmatrix}y-\mathrm{expit}(\beta_{0}+\beta_{1}x)\\ \left\{y-\mathrm{expit}(\beta_{0}+\beta_{1}x)\right\}x\end{bmatrix},

where \mathrm{expit}(v)=1/\left\{1+\exp(-v)\right\}, then {\bm{\beta}}^{\rm T}=(\beta_{0},\beta_{1}) corresponds to the intercept and the coefficient of x in a logistic regression model. For continuous y (now supposing x is a categorical sensitive variable), if we let

{\bf U}(y,x;{\bm{\beta}})=\begin{bmatrix}y-\beta_{0}-\beta_{1}x\\ (y-\beta_{0}-\beta_{1}x)x\end{bmatrix},

then {\bm{\beta}}^{\rm T}=(\beta_{0},\beta_{1}) corresponds to the intercept and coefficient of x in a simple linear regression model. Importantly, we do not make any model assumptions on p(y,x) when defining these parameters. The definition of the parameter is independent of the model, making it well-defined even if the model (e.g., p(y|x)) is misspecified or unknown. Lastly, the goal is to estimate and perform statistical inference on the parameter {\bm{\beta}} given the PRAM data \{({\bf x}_{i},y^{\ast}_{i}),i=1,\dots,n\}.

2.2 Why Are Existing Methods Model-Dependent?

This section examines the key limitation of existing methods (e.g., van den Hout and Kooiman, 2006; Woo and Slavković, 2012, 2015): their reliance on parametric assumptions. By analyzing this limitation, we aim to pave the way for introducing our proposed model-agnostic method, which offers greater flexibility and robustness. The rest of this section provides a high-level examination of existing model-dependent methods.

Due to the absence of the original label Y, we cannot use the estimating equation E\left\{{\bf U}(Y,{\bf X};{\bm{\beta}})\right\}={\bf 0} directly to solve for {\bm{\beta}}. However, we can rewrite the estimating equation as

E\left\{{\bf U}(Y,{\bf X};{\bm{\beta}})\right\}=E\left[E\left\{{\bf U}(Y,{\bf X};{\bm{\beta}})|Y^{\ast},{\bf X}\right\}\right]={\bf 0}, (1)

where the conditional expectation {\bf V}(y^{\ast},{\bf x};{\bm{\beta}})\equiv E\left\{{\bf U}(Y,{\bf X};{\bm{\beta}})|Y^{\ast}=y^{\ast},{\bf X}={\bf x}\right\} in (1) is a function of the observable variables y^{\ast} and {\bf x}. This conditional expectation is crucial for estimating {\bm{\beta}}: once we can derive (or at least estimate) {\bf V}(y^{\ast},{\bf x};{\bm{\beta}}), we can estimate {\bm{\beta}} by solving the empirical version of (1), which is given by

\frac{1}{n}\sum_{i=1}^{n}{\bf V}(y_{i}^{\ast},{\bf x}_{i};{\bm{\beta}})={\bf 0}. (2)

By the definition of conditional expectation, we have

{\bf V}(y^{\ast},{\bf x};{\bm{\beta}})=E\left\{{\bf U}(Y,{\bf X};{\bm{\beta}})|Y^{\ast}=y^{\ast},{\bf X}={\bf x}\right\}=\int{\bf U}(y,{\bf x};{\bm{\beta}})p(y|y^{\ast},{\bf x})dy,

where we can see that {\bf V}(y^{\ast},{\bf x};{\bm{\beta}}) is determined by p(y|y^{\ast},{\bf x}). Using Bayes' rule, we can further write p(y|y^{\ast},{\bf x}) as

p(y|y^{\ast},{\bf x})=\frac{p(y,y^{\ast},{\bf x})}{p(y^{\ast},{\bf x})}=\frac{p(y^{\ast}|y,{\bf x})p(y,{\bf x})}{p(y^{\ast},{\bf x})}=\frac{p(y^{\ast}|y)p(y|{\bf x})}{p(y^{\ast}|{\bf x})}. (3)

Equation (3) shows that p(y|y^{\ast},{\bf x}), or equivalently {\bf V}(y^{\ast},{\bf x};{\bm{\beta}}), is determined by p(y^{\ast}|y), p(y|{\bf x}), and p(y^{\ast}|{\bf x}). The first, p(y^{\ast}|y), is known, so we only need to focus on the latter two: p(y|{\bf x}) and p(y^{\ast}|{\bf x}). These two conditional distributions are closely connected, as we can show that

p(y^{\ast}|{\bf x})=\int p(y^{\ast}|y)p(y|{\bf x})dy. (4)

Namely, if we specify p(y|{\bf x}), then p(y^{\ast}|{\bf x}) is determined by (4). Conversely, under mild conditions, one can also solve for p(y|{\bf x}) from the integral equation (4) when given p(y^{\ast}|{\bf x}). To summarize, once we impose a parametric assumption on p(y|{\bf x}) or p(y^{\ast}|{\bf x}), we can estimate p(y|y^{\ast},{\bf x}) and hence {\bf V}(y^{\ast},{\bf x};{\bm{\beta}}). Then, by replacing {\bf V}(y^{\ast},{\bf x};{\bm{\beta}}) with the estimated \widehat{{\bf V}}(y^{\ast},{\bf x};{\bm{\beta}}) in (2), we can estimate the parameter {\bm{\beta}} by solving the equation. This explains why existing methods hinge on imposing a model assumption on p(y|{\bf x}) or p(y^{\ast}|{\bf x}).
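For a categorical y, the integral in (4) is a finite sum, so (4) stacks into a matrix identity (our restatement, anticipating Section 3):

\begin{bmatrix}p(y^{\ast}=0|{\bf x})\\ \vdots\\ p(y^{\ast}=k-1|{\bf x})\end{bmatrix}={\bf P}\begin{bmatrix}p(y=0|{\bf x})\\ \vdots\\ p(y=k-1|{\bf x})\end{bmatrix},

so that, whenever {\bf P} is nonsingular, p(y|{\bf x}) is recovered by left-multiplying the vector of p(y^{\ast}|{\bf x}) values by {\bf P}^{-1}; the "mild conditions" above amount to the invertibility of {\bf P}.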

So far, our analysis highlights the fundamental limitation of existing methods: their dependence on parametric assumptions for estimating {\bm{\beta}}. The question is: how to bypass the parametric assumption? We provide our answer in the next section.

3 Towards Efficient Estimation: A Model-Agnostic Approach

3.1 All Roads Lead to Rome (But Some Are Better)

We consider a simple example where the sensitive variable is binary, Y\in\{0,1\}. We try to answer the following question: given the distribution of the perturbed variable, p(y^{\ast}), and the transition matrix {\bf P}, how can we recover the distribution of the latent p(y)?

From a probabilistic point of view, the marginal distributions of Y and Y^{\ast} are connected through a reversion matrix {\bf Q}_{1} as follows:

\begin{bmatrix}\Pr(Y=0)\\ \Pr(Y=1)\end{bmatrix}={\bf Q}_{1}\begin{bmatrix}\Pr(Y^{\ast}=0)\\ \Pr(Y^{\ast}=1)\end{bmatrix}, (5)

where

{\bf Q}_{1}\equiv\begin{bmatrix}\Pr(Y=0|Y^{\ast}=0)&\Pr(Y=0|Y^{\ast}=1)\\ \Pr(Y=1|Y^{\ast}=0)&\Pr(Y=1|Y^{\ast}=1)\end{bmatrix}.

The problem with this approach is that the reversion matrix {\bf Q}_{1} is not readily available. However, we can compute {\bf Q}_{1} using Bayes' rule:

{\bf Q}_{1}=\begin{bmatrix}\dfrac{p_{11}\pi_{0}}{p_{11}\pi_{0}+p_{12}\pi_{1}}&\dfrac{p_{21}\pi_{0}}{p_{21}\pi_{0}+p_{22}\pi_{1}}\\ \dfrac{p_{12}\pi_{1}}{p_{11}\pi_{0}+p_{12}\pi_{1}}&\dfrac{p_{22}\pi_{1}}{p_{21}\pi_{0}+p_{22}\pi_{1}}\end{bmatrix}, (6)

where p_{ij} is the (i,j)th entry of the transition matrix {\bf P} and \pi_{i}\equiv\Pr(Y=i) for i=0,1.

Another way to reach the same goal is simply to invert the transition matrix. By the definition of the transition matrix, we have

\begin{bmatrix}\Pr(Y^{\ast}=0)\\ \Pr(Y^{\ast}=1)\end{bmatrix}={\bf P}\begin{bmatrix}\Pr(Y=0)\\ \Pr(Y=1)\end{bmatrix}.

Therefore, as long as {\bf P} is nonsingular, we can recover the marginal distribution of Y as follows:

\begin{bmatrix}\Pr(Y=0)\\ \Pr(Y=1)\end{bmatrix}={\bf Q}_{2}\begin{bmatrix}\Pr(Y^{\ast}=0)\\ \Pr(Y^{\ast}=1)\end{bmatrix}, (7)

where the reversion matrix is given by {\bf Q}_{2}={\bf P}^{-1}.

While both reversion matrices {\bf Q}_{1} and {\bf Q}_{2} achieve the same goal, comparing them reveals a key difference: {\bf Q}_{1} depends on the data model (i.e., \pi_{0} and \pi_{1}) while {\bf Q}_{2} does not. After all, {\bf Q}_{1} contains \pi_{0}=\Pr(Y=0) and \pi_{1}=\Pr(Y=1), while {\bf Q}_{2} is purely the inverse of a known matrix and involves no unknown probabilities. Calling \Pr(Y=0) a model may sound strange, but that is only because we are considering a simple example.
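The contrast is easy to see numerically; below is a small sketch (assuming numpy; the latent marginal pi is made up for illustration):

```python
import numpy as np

P = np.array([[0.8, 0.1],    # column j holds Pr(Y* = . | Y = j)
              [0.2, 0.9]])
pi = np.array([0.7, 0.3])    # latent marginal: Pr(Y=0), Pr(Y=1)

# Q1 via Bayes' rule, as in (6): entry (i, j) = Pr(Y = i | Y* = j).
joint = P * pi                       # joint[i, j] = Pr(Y* = i, Y = j)
Q1 = joint.T / joint.sum(axis=1)     # divide column j by Pr(Y* = j)

# Q2 is simply the inverse of the known transition matrix, as in (7).
Q2 = np.linalg.inv(P)

p_star = P @ pi          # released marginal of Y*
print(Q1 @ p_star)       # recovers pi, but building Q1 required knowing pi
print(Q2 @ p_star)       # recovers pi using only the public P
```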

To further illustrate the distinction between these approaches and lay the groundwork for our proposed method, we consider a more general scenario by introducing a covariate vector {\bf X}. The task now becomes recovering the joint distribution p(y,{\bf x}) using the noisy p(y^{\ast},{\bf x}) and the transition matrix {\bf P}. Similarly to the previous case in (5), we can derive p(y,{\bf x}) as follows:

\begin{bmatrix}p(y=0,{\bf x})\\ p(y=1,{\bf x})\end{bmatrix}={\bf Q}_{1}(y^{\ast},{\bf x})\begin{bmatrix}p(y^{\ast}=0,{\bf x})\\ p(y^{\ast}=1,{\bf x})\end{bmatrix},

where the reversion matrix

{\bf Q}_{1}(y^{\ast},{\bf x})\equiv\begin{bmatrix}p(y=0|y^{\ast}=0,{\bf x})&p(y=0|y^{\ast}=1,{\bf x})\\ p(y=1|y^{\ast}=0,{\bf x})&p(y=1|y^{\ast}=1,{\bf x})\end{bmatrix}

depends on the data model p(y|y^{\ast},{\bf x}), mirroring the issue encountered in Section 2.2. This dependence underscores why existing methods are model-dependent: they need to estimate p(y|y^{\ast},{\bf x}).

In contrast, the second approach only requires the known matrix {\bf Q}_{2}={\bf P}^{-1}, the inverse of the transition matrix. It can be verified that

\begin{bmatrix}p(y=0,{\bf x})\\ p(y=1,{\bf x})\end{bmatrix}={\bf Q}_{2}\begin{bmatrix}p(y^{\ast}=0,{\bf x})\\ p(y^{\ast}=1,{\bf x})\end{bmatrix}. (8)
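The verification takes one line (our restatement, using the conditional independence p(y^{\ast}|y,{\bf x})=p(y^{\ast}|y) from Section 2.1):

p(y^{\ast},{\bf x})=\sum_{y}p(y^{\ast}|y,{\bf x})p(y,{\bf x})=\sum_{y}p(y^{\ast}|y)p(y,{\bf x}),

so the vector \{p(y^{\ast}=j,{\bf x})\}_{j} equals {\bf P} times \{p(y=j,{\bf x})\}_{j}, and left-multiplying by {\bf P}^{-1}={\bf Q}_{2} gives (8).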

Our discussion above highlights the key advantage of using the reversion matrix {\bf Q}_{2}={\bf P}^{-1}: it enables us to bypass the need for parametric assumptions. Therefore, to answer the question posed at the end of Section 2.2 ("how to bypass the parametric assumption?"), our solution is simple: use the inverse of the transition matrix {\bf P}.

3.2 A Model-Agnostic Estimator

To enhance understanding, in this section we present our method using a simple example: the parameter of interest \beta is a scalar and the sensitive variable Y\in\{0,1\} is binary. We explain the core concept in intuitive language here before delving into the formal details in Section 3.3.

The parameter of interest \beta is defined as the solution to the estimating equation E\left\{{\mathrm{u}}(Y,{\bf X};\beta)\right\}=0. Suppose we have a sample \left\{({\bf x}_{i},y^{\ast}_{i})\right\} for i=1,\dots,n; then we can rewrite the expectation as

\begin{split}E\left\{{\mathrm{u}}(Y,{\bf X};\beta)\right\}&=\int{\mathrm{u}}({\bf x},y;\beta)p({\bf x},y)\,dy\,d{\bf x}=\int\sum_{j=0}^{1}{\mathrm{u}}({\bf x},y=j;\beta)p({\bf x},y=j)\,d{\bf x}\\ &=\sum_{i=1}^{n}\sum_{j=0}^{1}{\mathrm{u}}({\bf x}_{i},y=j;\beta)p({\bf x}_{i},y=j)=\sum_{i=1}^{n}\begin{bmatrix}{\mathrm{u}}({\bf x}_{i},y=0;\beta)\\ {\mathrm{u}}({\bf x}_{i},y=1;\beta)\end{bmatrix}^{\rm T}\begin{bmatrix}p({\bf x}_{i},y=0)\\ p({\bf x}_{i},y=1)\end{bmatrix}\end{split} (9)

One can understand the third equality in (9) by treating {\bf x} as if it were a discrete variable with support {\bf x}\in\{{\bf x}_{1},\dots,{\bf x}_{n}\}. By the relationship in (8), we can further write (9) as

E\left\{{\mathrm{u}}(Y,{\bf X};\beta)\right\}=\sum_{i=1}^{n}\begin{bmatrix}{\mathrm{u}}({\bf x}_{i},y=0;\beta)\\ {\mathrm{u}}({\bf x}_{i},y=1;\beta)\end{bmatrix}^{\rm T}{\bf P}^{-1}\begin{bmatrix}p({\bf x}={\bf x}_{i},y^{\ast}=0)\\ p({\bf x}={\bf x}_{i},y^{\ast}=1)\end{bmatrix}. (10)

Equation (10) implies that E\left\{{\mathrm{u}}(Y,{\bf X};\beta)\right\}, originally an expectation with respect to ({\bf X},Y), can be transformed into an expectation with respect to ({\bf X},Y^{\ast}) with the help of {\bf P}^{-1}. Therefore, we can estimate the expectation in (10) by the Monte Carlo method using the sample \left\{({\bf x}_{i},y^{\ast}_{i})\right\}, i=1,\dots,n, and estimate the parameter \beta by solving the following equation:

\widehat{E}\left\{{\mathrm{u}}(Y,{\bf X};\beta)\right\}=\frac{1}{n}\sum_{i=1}^{n}\begin{bmatrix}{\mathrm{u}}({\bf x}_{i},y=0;\beta)\\ {\mathrm{u}}({\bf x}_{i},y=1;\beta)\end{bmatrix}^{\rm T}{\bf P}^{-1}\begin{bmatrix}\mathbbm{1}_{y^{\ast}_{i}=0}\\ 1-\mathbbm{1}_{y^{\ast}_{i}=0}\end{bmatrix}=0, (11)

where the indicator function satisfies \mathbbm{1}_{y^{\ast}_{i}=0}=1 if y_{i}^{\ast}=0 and \mathbbm{1}_{y^{\ast}_{i}=0}=0 otherwise. The empirical estimating equation (11) does not involve any model, which makes our proposed estimator model-agnostic.
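As a concrete sketch, the following solves (11) for the logistic-regression parameter of Section 2.1 (assuming numpy/scipy; the function names and interface are ours, and y_star is an integer array):

```python
import numpy as np
from scipy.optimize import root

def expit(v):
    return 1.0 / (1.0 + np.exp(-v))

def u(x, y, beta):
    """Logistic-regression estimating function from Section 2.1; returns (n, 2)."""
    r = y - expit(beta[0] + beta[1] * x)
    return np.stack([r, r * x], axis=-1)

def pram_eq(beta, x, y_star, P_inv):
    """Empirical estimating equation (11): P^{-1} re-weights u at y = 0 and y = 1."""
    w = P_inv[:, y_star]                 # w[j, i] = (P^{-1})_{j+1, y*_i+1}
    return (w[0][:, None] * u(x, 0, beta)
            + w[1][:, None] * u(x, 1, beta)).mean(axis=0)

def fit(x, y_star, P, beta0=np.zeros(2)):
    """Solve (11) by root finding, given the perturbed sample and the known P."""
    return root(pram_eq, beta0, args=(x, y_star, np.linalg.inv(P))).x
```

With {\bf P} set to the identity matrix and the original labels supplied in place of y_star, the same routine reduces to the standard estimating-equation solver, i.e., the oracle estimator discussed in Remark 1.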

3.3 Theoretical Results

Influence Function

Firstly, we extend our discussion in Section 3.2 to a general setting with a vector of parameters {\bm{\beta}}\in\mathbb{R}^{d} and a multi-class variable Y\in\{0,\dots,K-1\}. The parameter of interest is defined by the solution of E\left\{{\bf U}({\bf X},Y;{\bm{\beta}})\right\}={\bf 0}\in\mathbb{R}^{d} for some known function {\bf U}(\cdot)\in\mathbb{R}^{d}. Using similar arguments as in Section 3.2, for i=1,\dots,K, the influence function of our proposed estimator \widehat{\bm{\beta}} is given by

\bm{\phi}(y^{\ast}=i-1,{\bf x};{\bm{\beta}})=\sum_{k=1}^{K}{\bf U}(y=k-1,{\bf x};{\bm{\beta}})\underbrace{q_{k,i}}_{\text{element of }{\bf P}^{-1}}=\mathbb{U}({\bf x};{\bm{\beta}}){\bf P}^{-1}{\bf e}_{i}, (12)

where q_{k,i} is the (k,i)th entry of {\bf P}^{-1} and \mathbb{U}({\bf x};{\bm{\beta}}) is the d\times K matrix whose kth column is {\bf U}(y=k-1,{\bf x};{\bm{\beta}}), that is,

\mathbb{U}({\bf x};{\bm{\beta}})=\begin{bmatrix}{\bf U}^{\rm T}(y=0,{\bf x};{\bm{\beta}})\\ \vdots\\ {\bf U}^{\rm T}(y=K-1,{\bf x};{\bm{\beta}})\end{bmatrix}^{\rm T},

and {\bf e}_{i}\in\mathbb{R}^{K} is the basis vector whose ith entry is 1 while all other entries are 0, for i=1,\dots,K. Based on the influence function in (12), our proposed estimator \widehat{\bm{\beta}}, given a sample \left\{({\bf x}_{i},y^{\ast}_{i})\right\}, i=1,\dots,n, is the solution to

\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}{\bf U}(y=k-1,{\bf x}_{i};{\bm{\beta}})\underbrace{q_{k,y_{i}^{\ast}+1}}_{\text{element of }{\bf P}^{-1}}={\bf 0}. (13)
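A sketch of (13) for general K in the same style as the earlier snippet (U is a user-supplied estimating function returning an n\times d array; all names are ours):

```python
import numpy as np
from scipy.optimize import root

def pram_eq_general(beta, x, y_star, P_inv, U, K):
    """Empirical version of (13): weight U(y=k-1, x_i; beta) by q_{k, y*_i+1}."""
    total = np.zeros(len(beta))
    for k in range(K):
        w = P_inv[k, y_star]                    # q_{k+1, y*_i+1} for each i
        total += (w[:, None] * U(x, k, beta)).mean(axis=0)
    return total

def fit_general(x, y_star, P, U, beta0):
    """Solve (13) by root finding; K is read off the transition matrix."""
    sol = root(pram_eq_general, beta0,
               args=(x, y_star, np.linalg.inv(P), U, P.shape[0]))
    return sol.x
```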

Semiparametric Efficiency

Using the theory of M-estimators (e.g., van der Vaart, 2000), the proposed estimator \widehat{\bm{\beta}} in (13) is readily consistent and asymptotically normal. Moreover, we prove in Theorem 1 that no alternative estimator is better in terms of estimation efficiency: our proposed \widehat{\bm{\beta}} has the smallest possible asymptotic variance (i.e., it achieves the semiparametric efficiency bound).

Notably, the proposed estimator \widehat{\bm{\beta}} achieves the semiparametric efficiency bound without relying on any additional assumptions. This is because it operates independently of any model, parametric or nonparametric. This is a significant advantage, as achieving semiparametric efficiency with other methods often requires challenging conditions. For example, the double machine learning literature usually requires that the nonparametric nuisance functions (functions whose specific form is unknown but belongs to a certain class) be estimated precisely, typically at a convergence rate of o_{p}(n^{-1/4}). Additionally, for most semiparametric estimators, the parametric part of the model needs to be specified correctly. We establish the efficiency results in the following theorem; the proofs are given in the supplementary materials.

Theorem 1.

The efficient influence function for estimating {\bm{\beta}} is given by

{\bm{\Omega}}^{-1}\bm{\phi}(y^{\ast},{\bf x};{\bm{\beta}}),

where {\bm{\Omega}}\in\mathbb{R}^{d\times d} is a constant matrix defined as E\left\{\partial{\bf U}({\bf X},Y;{\bm{\beta}})/\partial{\bm{\beta}}^{\rm T}\right\} evaluated at the true value {\bm{\beta}}={\bm{\beta}}_{0}, and \bm{\phi}(\cdot) is given in (12).

Proposition 1.

The proposed estimator \widehat{{\bm{\beta}}} has the asymptotic representation

\sqrt{n}(\widehat{\bm{\beta}}-{\bm{\beta}}_{0})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}{\bm{\Omega}}^{-1}\bm{\phi}(y^{\ast}_{i},{\bf x}_{i};{\bm{\beta}}_{0})+o_{p}(1)\xrightarrow{d}\mathrm{Norm}\left({\bf 0},{\bm{\Omega}}^{-1}E(\bm{\phi}\bm{\phi}^{\rm T})({\bm{\Omega}}^{-1})^{\rm T}\right).

The proposed estimator \widehat{{\bm{\beta}}} achieves the semiparametric efficiency bound (i.e., {\bm{\Omega}}^{-1}E(\bm{\phi}\bm{\phi}^{\rm T})({\bm{\Omega}}^{-1})^{\rm T}) and is therefore semiparametrically efficient.

Remark 1 (Efficiency Loss of \widehat{{\bm{\beta}}} Compared to the Oracle Estimator \widehat{{\bm{\beta}}}_{o}).

We define the oracle estimator \widehat{\bm{\beta}}_{o} as the solution of

\frac{1}{n}\sum^{n}_{i=1}{\bf U}(y_{i},{\bf x}_{i};{\bm{\beta}})={\bf 0}.

We call \widehat{\bm{\beta}}_{o} an oracle estimator because it requires the unobservable original label y_{i}. The asymptotic representation of the oracle estimator is given as

\sqrt{n}(\widehat{{\bm{\beta}}}_{o}-{\bm{\beta}}_{0})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}{\bm{\Omega}}^{-1}{\bf U}(y_{i},{\bf x}_{i};{\bm{\beta}}_{0})+o_{p}(1)\xrightarrow{d}\mathrm{Norm}\left({\bf 0},{\bm{\Omega}}^{-1}E({\bf U}{\bf U}^{\rm T})({\bm{\Omega}}^{-1})^{\rm T}\right).

Noticing that E\left\{\bm{\phi}(y^{\ast},{\bf x};{\bm{\beta}})|Y=y,{\bf X}={\bf x}\right\}={\bf U}(y,{\bf x};{\bm{\beta}}), one can write

{\bm{\Omega}}^{-1}E(\bm{\phi}\bm{\phi}^{\rm T})({\bm{\Omega}}^{-1})^{\rm T}={\bm{\Omega}}^{-1}E({\bf U}{\bf U}^{\rm T})({\bm{\Omega}}^{-1})^{\rm T}+{\bm{\Omega}}^{-1}E\left\{(\bm{\phi}-{\bf U})(\bm{\phi}-{\bf U})^{\rm T}\right\}({\bm{\Omega}}^{-1})^{\rm T}.

Therefore, {\bm{\Omega}}^{-1}E\left\{(\bm{\phi}-{\bf U})(\bm{\phi}-{\bf U})^{\rm T}\right\}({\bm{\Omega}}^{-1})^{\rm T} is the efficiency loss of \widehat{{\bm{\beta}}} compared to the oracle estimator \widehat{{\bm{\beta}}}_{o}; it is the price to pay for preserving privacy. In the special case where {\bf U}(\cdot) does not involve Y, (12) gives \bm{\phi}={\bf U}, and there is no efficiency loss because Y is not involved.

Statistical Inference

Finally, to preserve the model-agnostic feature of \widehat{\bm{\beta}} when conducting statistical inference, we propose to use the resampling method of Jin et al. (2001) to estimate the asymptotic variance of \widehat{\bm{\beta}}. The details of this method have been carefully studied in Jin et al. (2001), so we only briefly describe the procedure here. Note that \widehat{\bm{\beta}} is the solution of n^{-1}\sum_{i=1}^{n}\bm{\phi}(y^{\ast}_{i},{\bf x}_{i};{\bm{\beta}})={\bf 0}, where the randomness of \widehat{\bm{\beta}} comes from different realizations of the data sample \{(y_{i}^{\ast},{\bf x}_{i})\}, i=1,\dots,n. Now, conditional on one observed sample \{(y_{i}^{\ast},{\bf x}_{i})\}, i=1,\dots,n, we denote the observed estimate of {\bm{\beta}} as \widehat{\bm{\beta}} (which is now fixed). Let L_{i}, i=1,\ldots,n, be n independent and identically distributed copies of a nonnegative, completely known random variable L with mean one and variance one (e.g., L\sim\mathrm{Exp}(1)). Then, we denote \widetilde{\bm{\beta}} as the solution of n^{-1}\sum_{i=1}^{n}L_{i}\bm{\phi}(y^{\ast}_{i},{\bf x}_{i};{\bm{\beta}})={\bf 0}, where only the L_{i}'s are random in this estimating equation. It can be shown that \sqrt{n}(\widetilde{\bm{\beta}}-\widehat{\bm{\beta}}) (conditional on the sample) shares the same asymptotic distribution as \sqrt{n}(\widehat{\bm{\beta}}-{\bm{\beta}}_{0}). In practice, conditional on (y_{i}^{\ast},{\bf x}_{i}), i=1,\ldots,n, and \widehat{\bm{\beta}}, the distribution of \widetilde{\bm{\beta}} can be estimated by generating a large number, say M, of random samples L_{i}, i=1,\ldots,n. Denoting the solution in the mth replication by \widetilde{\bm{\beta}}_{m}, the asymptotic variance of \widehat{\bm{\beta}} can then be estimated by the sample covariance matrix constructed from \widetilde{\bm{\beta}}_{m}, m=1,\ldots,M.
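A minimal sketch of this procedure (assuming numpy/scipy; phi(beta) is a user-supplied function returning the n\times d matrix of values \bm{\phi}(y^{\ast}_{i},{\bf x}_{i};\beta), an interface we made up for illustration):

```python
import numpy as np
from scipy.optimize import root

def resampling_cov(phi, beta_hat, n, M=500, rng=None):
    """Perturbation resampling of Jin et al. (2001): refit the estimating
    equation with i.i.d. Exp(1) multipliers (mean one, variance one) M times,
    then take the sample covariance of the refitted solutions."""
    rng = rng or np.random.default_rng()
    draws = []
    for _ in range(M):
        L = rng.exponential(1.0, size=n)    # nonnegative, mean 1, variance 1
        eq = lambda b: (L[:, None] * phi(b)).mean(axis=0)
        draws.append(root(eq, beta_hat).x)  # beta-tilde for this replicate
    return np.cov(np.array(draws), rowvar=False)
```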

4 Numerical Studies

4.1 Simulation Studies

We conduct comprehensive simulation studies to investigate the proposed method's finite-sample performance and to compare it with existing methods. In Simulation A1, we consider a binary Y\in\{0,1\} and use the logistic regression model \Pr(y=1|x)=\mathrm{expit}(\widetilde{\beta}_{0}+\widetilde{\beta}_{1}x) with \widetilde{\beta}_{0}=-1 and \widetilde{\beta}_{1}=1.5. The marginal distribution of X is chosen as \mathrm{Norm}(0.5,1). The response variable Y is subject to PRAM. We choose three different transition matrices with \Pr(Y^{\ast}=0|Y=0)=\Pr(Y^{\ast}=1|Y=1)\in\{0.75,0.85,0.95\}, and we vary the sample size n\in\{1000,1200,1400,1600,1800,2000\}. The parameters of interest {\bm{\beta}}=(\beta_{1},\beta_{2})^{\rm T} are defined as the solution of the following estimating equation:

E\begin{bmatrix}Y-\mathrm{expit}(\beta_{1}+\beta_{2}X)\\ \left\{Y-\mathrm{expit}(\beta_{1}+\beta_{2}X)\right\}X\end{bmatrix}={\bf 0}.

Such a choice of {\bm{\beta}} is equivalent to estimating the coefficients of a logistic regression model, so immediately \beta_{1}=\widetilde{\beta}_{0} and \beta_{2}=\widetilde{\beta}_{1}; however, the estimating equation itself does not imply any model assumptions. Using the same way of generating (y,{\bf x}) as in Simulation A1, we conduct Simulation A2, where we fix the sample size at n=1000. However, we do not require p_{00}\equiv\Pr(Y^{\ast}=0|Y=0)=\Pr(Y^{\ast}=1|Y=1)\equiv p_{11} for the transition matrix; we vary p_{00} and p_{11} between 0.75 and 0.95 in increments of 0.01.
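As an illustration, here is a sketch of the Simulation A1 data-generating step, reusing the hypothetical fit helper from the Section 3.2 sketch (names and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(2024)
n, b0, b1 = 1000, -1.0, 1.5

x = rng.normal(0.5, 1.0, size=n)                            # X ~ Norm(0.5, 1)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(b0 + b1 * x))))   # logistic model for Y | X

p = 0.85                                       # keep probability p11 = p00
P = np.array([[p, 1.0 - p],
              [1.0 - p, p]])
y_star = np.where(rng.random(n) < p, y, 1 - y)  # symmetric PRAM on the response

beta_hat = fit(x, y_star, P)             # proposed estimator
beta_naive = fit(x, y_star, np.eye(2))   # ignores PRAM: biased
beta_oracle = fit(x, y, np.eye(2))       # uses the original y
```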

In Simulation B, we consider the situation where the sensitive variable is a covariate rather than the response variable. The data-generating mechanism is Y|X=x\sim\mathrm{Norm}(\widetilde{\beta}_{1}+\widetilde{\beta}_{2}x,1) and \Pr(X=1)=0.5 for binary X\in\{0,1\}. We set \widetilde{\beta}_{1}=-1 and \widetilde{\beta}_{2}=1. The parameters of interest {\bm{\beta}}=(\beta_{1},\beta_{2}) are the coefficients of a simple linear regression model:

E\begin{bmatrix}Y-\beta_{1}-\beta_{2}X\\ (Y-\beta_{1}-\beta_{2}X)X\end{bmatrix}={\bf 0}.

Again, the definition of the parameters does not depend on parametric model assumptions, but we align the parameter definition with the data-generating mechanism so that we can obtain the true parameter values \beta_{1}=\widetilde{\beta}_{1} and \beta_{2}=\widetilde{\beta}_{2}. Similar to Simulations A1 and A2, in Simulation B1 we vary the sample size n\in\{1000,1200,1400,1600,1800,2000\} and use p_{11}\equiv\Pr(X^{\ast}=1|X=1)=\Pr(X^{\ast}=0|X=0)\equiv p_{00}\in\{0.75,0.85,0.95\} as the transition probabilities, while in Simulation B2 we fix n=1000 and allow p_{11} and p_{00} to vary freely between 0.75 and 0.95 in increments of 0.01. We run 2000 Monte Carlo replicates of the following methods in all the aforementioned simulations.

  1. The proposed estimator \widehat{\bm{\beta}} in (13).

  2. The oracle estimator \widehat{\bm{\beta}}_{o}: uses the unperturbed original sensitive variable.

  3. The naive estimator \widehat{\bm{\beta}}_{b}: treats the perturbed variable as the original one; this estimator is generally biased.

  4. The model-dependent estimator \widehat{\bm{\beta}}_{m1} (Model 1): models (3) using a parametric model.

  5. The model-dependent estimator \widehat{\bm{\beta}}_{m2} (Model 2): models (3) using a parametric model, but with a relatively worse model than the one used in \widehat{\bm{\beta}}_{m1}.

The purpose of introducing \widehat{\bm{\beta}}_{m1} and \widehat{\bm{\beta}}_{m2} is to investigate the effect of the degree of model misspecification on the estimation. In Simulations A (resp., B), we model p(y^{\ast}|y,x) (resp., p(x^{\ast}|x,y)) using a logistic regression for \widehat{\bm{\beta}}_{m1}, while we force the intercept of the logistic regression to be 0 for \widehat{\bm{\beta}}_{m2}. The motivations for this setting are:

  • Neither the model in \widehat{\bm{\beta}}_{m1} nor that in \widehat{\bm{\beta}}_{m2} is correct, but the parametric model used in \widehat{\bm{\beta}}_{m1} provides a better fit than that used in \widehat{\bm{\beta}}_{m2}. In other words, the degree of model misspecification is worse in \widehat{\bm{\beta}}_{m2}.

  • By comparing \widehat{\bm{\beta}}_{m1} with \widehat{\bm{\beta}}, as well as \widehat{\bm{\beta}}_{m2} with \widehat{\bm{\beta}}, we can investigate the benefits brought by our model-agnostic method.

  • By comparing \widehat{\bm{\beta}}_{m1} and \widehat{\bm{\beta}}_{m2}, we can investigate the severity of the effect of model misspecification on the estimation. It is worth noting that \widehat{\bm{\beta}}_{m2} does not have any practical meaning, nor do we seek to establish the superiority of our proposed method by comparing it with \widehat{\bm{\beta}}_{m2}.

  • We do not include an estimator where p(y^{\ast}|y,x) (or p(x^{\ast}|x,y)) is correctly specified. This is because using the correct model requires oracle information, and we consider the semiparametric setting where the model p(y^{\ast}|y,x) is unknown (which mimics the real-world application scenario).

Firstly, Figure 1(a) summarizes the empirical bias results in Simulation A1. In general, the oracle and proposed estimators \widehat{\bm{\beta}}_{o} and \widehat{\bm{\beta}} are nearly unbiased, and \widehat{\bm{\beta}}_{m1} has a small bias; however, the estimators \widehat{\bm{\beta}}_{b} and \widehat{\bm{\beta}}_{m2} have substantial biases. These noticeable biases shrink when p_{11}=p_{00} increases from 0.75 to 0.95 but do not diminish as the sample size n increases. The empirical bias results in Simulation B1 are very similar and thus omitted here (they are contained in the supplementary materials).

We then compare the mean squared error (MSE) across these estimators and report the Simulation A1 results in Figure 1(b). The results in Simulation B1 are similar and omitted here (they can be found in the supplementary materials). Again, the naive and Model 2 estimators \widehat{\bm{\beta}}_{b} and \widehat{\bm{\beta}}_{m2} perform badly. Among the three estimators \widehat{\bm{\beta}}_{o}, \widehat{\bm{\beta}}, and \widehat{\bm{\beta}}_{m1}, the oracle estimator \widehat{\bm{\beta}}_{o} has the smallest MSE and \widehat{\bm{\beta}}_{m1} the largest, with our proposed estimator \widehat{\bm{\beta}} in between, which matches our theoretical results. The efficiency loss of \widehat{\bm{\beta}} relative to \widehat{\bm{\beta}}_{o} was quantified in Remark 1; it is the price to pay for protecting private information.

Figure 1: Simulation A1: empirical bias (upper) and MSE (lower) of the five estimators.

Next, we compare the estimation efficiency of \widehat{\bm{\beta}} and \widehat{\bm{\beta}}_{m1} by assessing the relative efficiency (RE) of \widehat{\bm{\beta}} to \widehat{{\bm{\beta}}}_{m1}, defined as

\mathrm{RE}(\widehat{\bm{\beta}},\widehat{{\bm{\beta}}}_{m1})=\mathrm{MSE}(\widehat{\bm{\beta}})/\mathrm{MSE}(\widehat{\bm{\beta}}_{m1}).

If \mathrm{RE}(\widehat{\bm{\beta}},\widehat{{\bm{\beta}}}_{m1})<1, then \widehat{\bm{\beta}} is preferable to \widehat{\bm{\beta}}_{m1}, and the smaller \mathrm{RE}(\widehat{\bm{\beta}},\widehat{{\bm{\beta}}}_{m1}) is, the more efficient \widehat{\bm{\beta}} is. Our results in Figure 2 show that the proposed estimator \widehat{\bm{\beta}} is more efficient than \widehat{\bm{\beta}}_{m1} in every situation we consider. The RE in Simulation A1 is about 50% to 80% (see Figure 2(a)), while the RE in Simulation B1 is about 90% to 97.5% (see Figure 3(a)). When we vary p_{00} and p_{11} over the wider interval [0.75,0.95], the RE in Simulation A2 ranges from about 30% to 80% (see Figure 2(b)), and the RE in Simulation B2 can be as small as 80% (see Figure 3(b)). Interestingly, in Simulations A2 and B2, the region where \widehat{\bm{\beta}} is most efficient lies in the top-left and bottom-right corners.

Lastly, we report the proposed estimator \widehat{\bm{\beta}}'s estimation and inference results in detail. Table 1 is for Simulation A1; a similar table for Simulation B1 is in the supplementary materials. In Table 1, we summarize the empirical bias (Bias; the sample bias across 2000 replicates), the empirical standard deviation (SD; the sample standard deviation across 2000 replicates), the estimated standard error (SE; the average of the 2000 estimated standard errors, each computed using 500 perturbed samples), and the coverage probability (CP) of the equal-tailed 95% confidence interval. The estimated standard error matches the empirical standard deviation closely, and the coverage probability is close to the nominal level of 95%.

Figure 2: Simulation A1 (upper) and Simulation A2 (lower): relative efficiency of \widehat{\bm{\beta}} to \widehat{\bm{\beta}}_{m1} (relative efficiency smaller than 1 indicates that \widehat{\bm{\beta}} is more efficient).
Figure 3: Simulation B1 (upper) and Simulation B2 (lower): relative efficiency of \widehat{\bm{\beta}} to \widehat{\bm{\beta}}_{m1} (relative efficiency smaller than 1 indicates that \widehat{\bm{\beta}} is more efficient).
p_{11}=p_{00}   n     Bias            SD              SE              CP
                      β1      β2      β1      β2      β1      β2      β1      β2
0.75          1000  -0.030   0.038   0.270   0.325   0.283   0.351   0.968   0.960
              1200  -0.027   0.049   0.252   0.302   0.258   0.320   0.962   0.960
              1400  -0.013   0.023   0.224   0.270   0.233   0.287   0.959   0.960
              1600  -0.015   0.024   0.208   0.252   0.217   0.268   0.960   0.957
              1800  -0.022   0.032   0.199   0.239   0.205   0.252   0.958   0.960
              2000  -0.010   0.020   0.184   0.228   0.192   0.235   0.967   0.959
0.85          1000  -0.003   0.019   0.173   0.208   0.177   0.213   0.957   0.958
              1200  -0.005   0.013   0.159   0.188   0.161   0.193   0.960   0.962
              1400  -0.007   0.011   0.148   0.174   0.149   0.178   0.957   0.955
              1600  -0.008   0.013   0.134   0.160   0.139   0.166   0.963   0.967
              1800  -0.006   0.011   0.126   0.150   0.130   0.156   0.960   0.965
              2000  -0.005   0.011   0.121   0.146   0.123   0.147   0.954   0.954
0.95          1000  -0.008   0.012   0.119   0.136   0.119   0.135   0.953   0.953
              1200  -0.006   0.013   0.107   0.122   0.108   0.123   0.958   0.955
              1400  -0.003   0.002   0.100   0.112   0.100   0.113   0.951   0.955
              1600  -0.003   0.003   0.093   0.106   0.094   0.106   0.947   0.946
              1800  -0.001   0.003   0.088   0.101   0.088   0.099   0.949   0.952
              2000  -0.001   0.006   0.084   0.094   0.083   0.095   0.954   0.949
Table 1: Simulation A1: estimation and inference results for the proposed estimator \widehat{\bm{\beta}}.

4.2 Real Data Application

To evaluate our method's performance in a practical setting, we applied it to a real dataset from the Korean Labor & Income Panel Study (provided in the supplementary materials). This dataset includes information on n=2505 regular wage earners for the year 2005. We focus on modeling the average monthly income (denoted by Y, a continuous variable) as a linear function of three demographic covariates: age (X_{1}, also continuous), education level (X_{2}, a binary variable, where 1 indicates education beyond high school and 0 otherwise), and gender (X_{3}, a binary variable, where 1 indicates female and 0 indicates male). Both income and age were standardized before the analysis.

The parameters of interest are denoted by the vector {\bm{\beta}}=(\beta_{0},\beta_{1},\beta_{2},\beta_{3})^{\rm T} in our data analysis. These parameters are defined as the solution to the following estimating equation:

E\left\{(Y-\beta_{0}-\beta_{1}X_{1}-\beta_{2}X_{2}-\beta_{3}X_{3})\begin{bmatrix}1&X_{1}&X_{2}&X_{3}\end{bmatrix}^{\rm T}\right\}={\bf 0}.

In other words, the parameters are the minimizers of the mean squared error

E\left\{(Y-\beta_{0}-\beta_{1}X_{1}-\beta_{2}X_{2}-\beta_{3}X_{3})^{2}\right\}.

However, the true value of {\bm{\beta}} is unknown, as we only have a finite sample of data. To assess the estimators' performance, we compare them to an oracle estimator (denoted by \widehat{\bm{\beta}}_{o}) obtained by the ordinary least squares method with access to all the original variables.

We perform the PRAM procedure on the education-level variable X_{2} to generate a perturbed version X_{2}^{\ast}. The transition probabilities controlling the amount of noise introduced during perturbation were set to p=\Pr(X_{2}^{\ast}=0|X_{2}=0)=\Pr(X_{2}^{\ast}=1|X_{2}=1)\in\{0.75,0.85,0.95\}. For each transition probability value, we compare the performance of four estimators:

  1. The proposed estimator \widehat{\bm{\beta}}.

  2. The naive estimator \widehat{\bm{\beta}}_{b}: treats the perturbed variable X_{2}^{\ast} as the original X_{2}, ignoring the impact of PRAM.

  3. The model-dependent estimator \widehat{\bm{\beta}}_{m1} (Model 1): uses logistic regression to model p(x_{2}^{\ast}|y,x_{1},x_{3}).

  4. The model-dependent estimator \widehat{\bm{\beta}}_{m2} (Model 2): uses probit regression for the same purpose as Model 1.

We estimate the standard errors for each method using the resampling approach described in Section 3.3. To evaluate estimator performance, we calculate two metrics:

  • Bias: the difference between the estimate and the oracle estimate.

  • Root mean square error (rMSE): combines the bias and standard error to provide a more comprehensive measure of estimation accuracy: \text{rMSE}=\sqrt{\text{Bias}^{2}+\text{SE}^{2}}.

A lower bias and/or rMSE indicates a better estimator.

The results are summarized in Table 2. Our proposed estimator consistently achieves the smallest bias and rMSE across all perturbation levels, demonstrating its superiority over the other methods. This implies that the proposed estimator provides estimates closest to the oracle values, even when the data are perturbed for privacy preservation. The performance of the other three estimators varies: while the model-dependent methods generally outperform the naive estimator, they can also perform worse in some cases (particularly at low perturbation levels).

Parameter Method Measure p=0.75 p=0.85 p=0.95
Intercept Oracle Estimate (SE) 0.410 (0.043)
Proposed Estimate (SE) 0.370 (0.062) 0.411 (0.051) 0.388 (0.041)
“Bias” -0.040 0.001 -0.022
“rMSE” 0.074 0.051 0.047
Model1 Estimate (SE) 0.317 (0.034) 0.332 (0.034) 0.281 (0.031)
“Bias” -0.093 -0.078 -0.129
“rMSE” 0.099 0.085 0.133
Model2 Estimate (SE) 0.235 (0.038) 0.267 (0.041) 0.293 (0.030)
“Bias” -0.175 -0.143 -0.117
“rMSE” 0.179 0.149 0.121
Naive Estimate (SE) 0.651 (0.042) 0.583 (0.043) 0.454 (0.043)
“Bias” 0.242 0.173 0.044
“rMSE” 0.246 0.178 0.062
Age Oracle Estimate (SE) 0.396 (0.038)
Proposed Estimate (SE) 0.423 (0.049) 0.404 (0.045) 0.411 (0.039)
“Bias” 0.027 0.008 0.015
“rMSE” 0.056 0.046 0.042
Model1 Estimate (SE) 0.544 (0.034) 0.524 (0.036) 0.538 (0.032)
“Bias” 0.148 0.128 0.142
“rMSE” 0.152 0.133 0.146
Model2 Estimate (SE) 0.609 (0.038) 0.584 (0.041) 0.519 (0.032)
“Bias” 0.213 0.188 0.123
“rMSE” 0.216 0.192 0.127
Naive Estimate (SE) 0.242 (0.038) 0.287 (0.038) 0.366 (0.038)
“Bias” -0.154 -0.109 -0.030
“rMSE” 0.159 0.115 0.048
Edu (PRAM-ed) Oracle Estimate (SE) 0.342 (0.019)
Proposed Estimate (SE) 0.371 (0.042) 0.335 (0.032) 0.363 (0.021)
“Bias” 0.029 -0.007 0.021
“rMSE” 0.051 0.032 0.030
Model1 Estimate (SE) 0.258 (0.010) 0.274 (0.014) 0.339 (0.013)
“Bias” -0.084 -0.068 -0.003
“rMSE” 0.085 0.069 0.013
Model2 Estimate (SE) 0.277 (0.011) 0.296 (0.013) 0.349 (0.014)
“Bias” -0.065 -0.046 0.007
“rMSE” 0.066 0.048 0.016
Naive Estimate (SE) 0.170 (0.019) 0.220 (0.020) 0.320 (0.019)
“Bias” -0.172 -0.122 -0.020
“rMSE” 0.173 0.124 0.028
Gender Oracle Estimate (SE) -0.290 (0.019)
Proposed Estimate (SE) -0.282 (0.022) -0.296 (0.018) -0.288 (0.017)
“Bias” 0.008 -0.006 0.002
“rMSE” 0.023 0.019 0.017
Model1 Estimate (SE) -0.275 (0.015) -0.293 (0.015) -0.281 (0.014)
“Bias” 0.015 -0.003 0.009
“rMSE” 0.021 0.015 0.017
Model2 Estimate (SE) -0.261 (0.014) -0.289 (0.016) -0.285 (0.015)
“Bias” 0.029 0.001 0.005
“rMSE” 0.032 0.016 0.016
Naive Estimate (SE) -0.316 (0.020) -0.312 (0.020) -0.296 (0.020)
“Bias” -0.026 -0.022 -0.006
“rMSE” 0.033 0.030 0.021
Table 2: Real data application: the parameter estimates and standard errors (in parentheses) of all five estimators, as well as the "Bias" and "rMSE" (both relative to the oracle estimator \widehat{\bm{\beta}}_{o}) of the other four estimators: \widehat{\bm{\beta}} (Proposed), \widehat{\bm{\beta}}_{m1} (Model1), \widehat{\bm{\beta}}_{m2} (Model2), and \widehat{\bm{\beta}}_{b} (Naive).

5 Concluding Remarks

This paper proposes a novel method for efficient and model-agnostic parameter estimation with data perturbed by PRAM for privacy preservation. Our estimator offers significant advantages over existing methods by overcoming their limitations of parameter-specificity and model dependence. Notably, we prove that the proposed estimator achieves the semiparametric efficiency bound; in simpler terms, our method offers the best possible asymptotic accuracy even when the true data distribution is unknown, which is almost always the case in real-world applications. Looking towards future research, several interesting questions emerge. First, extending the framework to handle continuous sensitive variables presents a natural challenge: while categorical variables allow a simple matrix inversion in the reversion step, continuous variables might require solving integral equations. Second, the method can be further generalized to cases where multiple variables have undergone privacy-preserving transformations. Combining multiple covariates into a single variable offers a straightforward solution for certain scenarios (e.g., combining binary variables X_{1} and X_{2} into a new variable with four levels). However, the problem becomes significantly more complex when dealing with a mix of categorical and continuous variables.

References

  • Bickel, P. J., Klaassen, J., Ritov, Y., and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
  • Buonaccorsi, J. P. (2010). Measurement Error: Models, Methods, and Applications. Chapman and Hall/CRC.
  • Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39:1–22.
  • Gouweleeuw, J. M., Kooiman, P., and De Wolf, P. (1998). Post randomisation for statistical disclosure control: Theory and implementation. Journal of Official Statistics, 14:463–478.
  • Guo, H., Wang, B., and Yi, G. (2024). Label correction of crowdsourced noisy annotations with an instance-dependent noise transition model. In Advances in Neural Information Processing Systems.
  • Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E. S., Spicer, K., and De Wolf, P.-P. (2012). Statistical Disclosure Control, volume 2. Wiley.
  • Jin, Z., Ying, Z., and Wei, L. (2001). A simple resampling method by perturbing the minimand. Biometrika, 88:381–390.
  • Lawrence, N. and Schölkopf, B. (2001). Estimating a kernel Fisher discriminant in the presence of label noise. In International Conference on Machine Learning.
  • Li, X., Liu, T., Han, B., Niu, G., and Sugiyama, M. (2021). Provably end-to-end label-noise learning without anchor points. In International Conference on Machine Learning.
  • Liu, Y., Cheng, H., and Zhang, K. (2023). Identifiability of label noise transition matrix. In International Conference on Machine Learning.
  • Mivule, K. (2012). Utilizing noise addition for data privacy, an overview. In International Conference on Information and Knowledge Engineering.
  • Okkalioglu, B. D., Okkalioglu, M., Koc, M., and Polat, H. (2015). A survey: deriving private information from perturbed data. Artificial Intelligence Review, 44:547–569.
  • Scott, C. (2015). A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In International Conference on Artificial Intelligence and Statistics.
  • Sweeney, L. (2001). Computational Disclosure Control: A Primer on Data Privacy Protection. PhD thesis, Massachusetts Institute of Technology.
  • Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer.
  • van den Hout, A. and Kooiman, P. (2006). Estimating the linear regression model with categorical covariates subject to randomized response. Computational Statistics & Data Analysis, 50:3311–3323.
  • van den Hout, A. and van der Heijden, P. G. (2002). Randomized response, statistical disclosure control and misclassification: a review. International Statistical Review, 70:269–288.
  • van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.
  • Willenborg, L. and De Waal, T. (2012). Elements of Statistical Disclosure Control. Springer.
  • Woo, Y. M. J. and Slavković, A. B. (2012). Logistic regression with variables subject to post randomization method. In International Conference on Privacy in Statistical Databases.
  • Woo, Y. M. J. and Slavković, A. B. (2015). Generalised linear models with variables subject to post randomization method. Statistica Applicata - Italian Journal of Applied Statistics, 24:29–56.
  • Yi, G. Y. (2017). Statistical Analysis with Measurement Error or Misclassification: Strategy, Method and Application. Springer.
  • Yi, G. Y. (2021). Likelihood methods with measurement error and misclassification. In Handbook of Measurement Error Models, pages 99–126. Chapman and Hall/CRC.