
StableDR: Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random

Haoxuan Li1  Chunyuan Zheng2  Peng Wu3
1Peking University
2University of California, San Diego
3Beijing Technology and Business University
hxli@stu.pku.edu.cn,  czheng@ucsd.edu,  pengwu@btbu.edu.cn
Corresponding author.
Abstract

In recommender systems, users tend to rate the items they prefer, which leads to data missing not at random and poses a great challenge for the unbiased evaluation and learning of prediction models. Currently, doubly robust (DR) methods have been widely studied and demonstrate superior performance. However, in this paper, we show that DR methods are unstable and have unbounded bias, variance, and generalization bounds under extremely small propensities. Moreover, DR's heavier reliance on extrapolation leads to suboptimal performance. To address the above limitations while retaining double robustness, we propose a stabilized doubly robust (StableDR) learning approach with a weaker reliance on extrapolation. Theoretical analysis shows that StableDR has a bounded bias, variance, and generalization error bound simultaneously under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for StableDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approaches significantly outperform the existing methods.

1 Introduction

Modern recommender systems (RSs) are rapidly evolving with the adoption of sophisticated deep learning models (Zhang et al., 2019). However, it is well documented that directly using advanced deep models usually achieves sub-optimal performance due to the existence of various biases in RS (Chen et al., 2020; Wu et al., 2022b), and these biases are amplified over time (Mansoury et al., 2020; Wen et al., 2022). A large number of debiasing methods have emerged and gradually become a trend. For many practical tasks in RS, such as rating prediction (Schnabel et al., 2016; Wang et al., 2020a; 2019), post-view click-through rate prediction (Guo et al., 2021), post-click conversion rate prediction (Zhang et al., 2020; Dai et al., 2022), and uplift modeling (Saito et al., 2019; Sato et al., 2019; 2020), a critical challenge is to combat the selection bias and confounding bias that lead to a significant difference between the training sample and the target population (Hernán & Robins, 2020). Various methods have been designed to address this problem, and among them, doubly robust (DR) methods (Wang et al., 2019; Zhang et al., 2020; Chen et al., 2021; Dai et al., 2022; Ding et al., 2022) play the dominant role due to their better performance and theoretical properties.

The success of DR is attributed to its double robustness and joint-learning technique. However, DR methods still have many limitations. Theoretical analysis shows that inverse probability scoring (IPS) and DR methods may have infinite bias, variance, and generalization error bounds in the presence of extremely small propensity scores (Schnabel et al., 2016; Wang et al., 2019; Guo et al., 2021; Li et al., 2023b). In addition, because users are more inclined to rate the items they prefer, the problem of data missing not at random (MNAR) often occurs in RS. This causes selection bias and results in inaccuracy for methods that rely more on extrapolation, such as error imputation based (EIB) (Marlin et al., 2007; Steck, 2013) and DR methods.

To overcome the above limitations while maintaining double robustness, we propose a stabilized doubly robust (SDR) estimator with a weaker reliance on extrapolation, which reduces the negative impact of extrapolation and the MNAR effect on the imputation model. Through theoretical analysis, we demonstrate that SDR has a bounded bias and generalization error bound for arbitrarily small propensities, which further indicates that SDR can achieve more stable predictions.

Furthermore, we propose a novel cycle learning approach for SDR. Figure 1 shows the differences between the proposed cycle learning of SDR and the existing unbiased learning approaches. Two-phase learning (Marlin et al., 2007; Steck, 2013; Schnabel et al., 2016) first obtains an imputation/propensity model to estimate the ideal loss and then updates the prediction model by minimizing the estimated loss. DR-JL (Wang et al., 2019), MRDR-DL (Guo et al., 2021), and AutoDebias (Chen et al., 2021) alternately update the model used to estimate the ideal loss and the prediction model. The proposed learning method cyclically uses different losses to update the three models with the aim of achieving more stable and accurate prediction results. We have conducted extensive experiments on two real-world datasets, and the results show that the proposed approach significantly improves debiasing and convergence performance compared to the existing methods.

Figure 1: During the training of a prediction model, two-phase learning (Marlin et al., 2007; Steck, 2013; Schnabel et al., 2016) uses a fixed imputation/propensity model (left), whereas DR-JL (Wang et al., 2019), MRDR-DL (Guo et al., 2021), and AutoDebias (Chen et al., 2021) use alternating learning between the imputation/propensity model and the prediction model (middle). The proposed learning approach updates the three models cyclically with stabilization (right).

2 Preliminaries

2.1 Problem Setting

In RS, due to the fact that users are more inclined to rate the items they prefer, the collected ratings are always missing not at random (MNAR). We formulate the data MNAR problem using the widely adopted potential outcome framework (Neyman, 1990; Imbens & Rubin, 2015). Let $\mathcal{U}=\{1,2,\ldots,U\}$, $\mathcal{I}=\{1,2,\ldots,I\}$, and $\mathcal{D}=\mathcal{U}\times\mathcal{I}$ be the index sets of users, items, and all user-item pairs, respectively. For each $(u,i)\in\mathcal{D}$, we have a treatment $o_{u,i}\in\{0,1\}$, a feature vector $x_{u,i}$, and an observed rating $r_{u,i}$, where $o_{u,i}=1$ if user $u$ rated item $i$ in the logging data and $o_{u,i}=0$ if the rating is missing. Let $r_{u,i}(1)$ be the rating that would be observed if item $i$ had been rated by user $u$, which is observable only for $\mathcal{O}=\{(u,i)\mid(u,i)\in\mathcal{D},o_{u,i}=1\}$. Many tasks in RS can be formulated as predicting the potential outcome $r_{u,i}(1)$ using the feature $x_{u,i}$ for each $(u,i)$.
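To make the setting concrete, the following minimal NumPy sketch simulates data MNAR; the rating and exposure distributions are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
U, I = 100, 120                          # numbers of users and items

# Hypothetical potential outcomes r_{u,i}(1): binary "like" labels.
r1 = rng.binomial(1, 0.3, size=(U, I)).astype(float)

# MNAR exposure: preferred items are more likely to be rated, so the
# propensity p_{u,i} = P(o_{u,i} = 1 | x_{u,i}) depends on the preference.
p = np.where(r1 == 1, 0.5, 0.05)
o = rng.binomial(1, p).astype(float)     # treatment indicators o_{u,i}

# r_{u,i}(1) is observable only on O = {(u,i) : o_{u,i} = 1}.
print("mean rating over observed pairs:", r1[o == 1].mean())
print("mean rating over all pairs     :", r1.mean())  # the gap is the selection bias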

Let $\hat{r}_{u,i}(1)=f(x_{u,i};\phi)$ be a prediction model with parameters $\phi$. If all the potential outcomes $\{r_{u,i}(1):(u,i)\in\mathcal{D}\}$ were observed, the ideal loss function for solving the parameters $\phi$ is given as

$$\mathcal{L}_{ideal}(\phi)=|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}e_{u,i},$$

where $e_{u,i}$ is the prediction error, such as the squared loss $e_{u,i}=(\hat{r}_{u,i}(1)-r_{u,i}(1))^{2}$. $\mathcal{L}_{ideal}(\phi)$ can be regarded as a benchmark unbiased loss function, even though it is infeasible due to the missingness of $\{r_{u,i}(1):o_{u,i}=0\}$. As such, a variety of methods have been developed to address the selection bias by approximating $\mathcal{L}_{ideal}(\phi)$, among which propensity-based estimators show relatively superior performance (Schnabel et al., 2016; Wang et al., 2019). The IPS and DR estimators are

$$\mathcal{E}_{IPS}=|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}e_{u,i}}{\hat{p}_{u,i}}\quad\text{and}\quad\mathcal{E}_{DR}=|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}\Big[\hat{e}_{u,i}+\frac{o_{u,i}(e_{u,i}-\hat{e}_{u,i})}{\hat{p}_{u,i}}\Big],$$

where $\hat{p}_{u,i}$ is an estimate of the propensity score $p_{u,i}:=\mathbb{P}(o_{u,i}=1\mid x_{u,i})$, and $\hat{e}_{u,i}$ is an estimate of $e_{u,i}$.
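As a quick illustration, the sketch below computes the naive, IPS, and DR estimates of the ideal loss on simulated data, where the ideal loss is computable only because the errors are synthetic; the data-generating distributions and the imputation noise level are assumptions for illustration, not part of the paper.

import numpy as np

rng = np.random.default_rng(0)
D = 100_000                                    # |D|, flattened user-item pairs
e = rng.uniform(0.0, 1.0, size=D)              # true prediction errors e_{u,i}
p = np.clip(1.0 - e, 0.05, 0.95)               # MNAR: exposure favors low-error pairs
o = rng.binomial(1, p).astype(float)           # exposure indicators o_{u,i}

p_hat = p                                      # accurate propensity estimates
e_hat = e + rng.normal(0.0, 0.2, size=D)       # inaccurate imputed errors

ideal = e.mean()                               # infeasible benchmark L_ideal
naive = e[o == 1].mean()                       # biased under MNAR
ips = np.mean(o * e / p_hat)                   # E_IPS
dr = np.mean(e_hat + o * (e - e_hat) / p_hat)  # E_DR
print(f"ideal={ideal:.4f} naive={naive:.4f} ips={ips:.4f} dr={dr:.4f}")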

2.2 Related Work

Debiased learning in recommendation. The data collected in RS suffer from various biases (Chen et al., 2020; Wu et al., 2022b), which are entangled with the true preferences of users and pose a great challenge to unbiased learning. There has been increasing interest in coping with different biases in recent years (Zhang et al., 2021; Ai et al., 2018; Liu et al., 2016; 2021). Schnabel et al. (2016) proposed using inverse propensity score (IPS) and self-normalized IPS (SNIPS) methods to address the selection bias on data missing not at random, and Saito (2019) and Saito et al. (2020) extended them to implicit feedback data. Marlin et al. (2007) and Steck (2013) derived an error imputation-based (EIB) unbiased learning method. These three approaches adopt two-phase learning (Wang et al., 2021), which first learns a propensity/imputation model and then applies it to construct an unbiased estimator of the ideal loss to train the recommendation model. A doubly robust joint learning (DR-JL) method (Wang et al., 2019) was proposed by combining the IPS and EIB approaches. Subsequently, strands of enhanced joint learning methods were developed, including MRDR (Guo et al., 2021), Multi-task DR (Zhang et al., 2020), DR-MSE (Dai et al., 2022), BRD-DR (Ding et al., 2022), TDR (Li et al., 2023b), uniform data-aware methods (Bonner & Vasile, 2018; Liu et al., 2020; Chen et al., 2021; Wang et al., 2021; Li et al., 2023c) that seek better recommendation strategies by leveraging a small uniform dataset, and the multiple robust method (Li et al., 2023a) that specifies multiple propensity and imputation models and achieves unbiased learning if any of the propensity models, imputation models, or even a linear combination of these models can accurately estimate the true propensities or prediction errors. Chen et al. (2020) reviewed various biases in RS and discussed the recent progress on debiasing tasks. Wu et al. (2022b) established the connections between the biases in causal inference and the biases in RS, thereby presenting formal causal definitions of the latter.

Stabilized causal effect estimation. The proposed method builds on the stabilized average causal effect estimation approaches in causal inference. Molenberghs et al. (2015) summarized the limitations of doubly robust methods, including instability under small propensities (Kang & Schafer, 2007; Wu et al., 2022a), unboundedness (van der Laan & Rose, 2011), and large variance (Tan, 2007). These issues inspired a series of stabilized causal effect estimation methods in statistics (Kang & Schafer, 2007; Bang & Robins, 2005; van der Laan & Rose, 2011; Molenberghs et al., 2015). Unlike previous works that focused only on achieving unbiased learning in RS, this paper provides a new perspective for developing doubly robust estimators with much more stable statistical properties.

3 Stabilized Doubly Robust Estimator

In this section, we elaborate on the limitations of DR methods and propose a stabilized DR (SDR) estimator with a weaker reliance on extrapolation. Theoretical analysis shows that SDR has a bounded bias and generalization error bound for arbitrarily small propensities, whereas IPS and DR do not.

3.1 Motivation

Even though the DR estimator has the double robustness property, its performance could be significantly improved if the following three stabilization aspects were enhanced.

More stable to small propensities. As shown in Schnabel et al. (2016), Wang et al. (2019), and Guo et al. (2021), if there exist some extremely small estimated propensity scores, the IPS/DR estimator and its bias, variance, and tail bound are unbounded, deteriorating the prediction accuracy. Moreover, such problems are widespread in practice, given that there are many long-tailed users and items in RS, resulting in the presence of extreme propensities.

More stable through weakening extrapolation. DR relies more on extrapolation because the imputation model in DR is learned from the exposed events $\mathcal{O}$ and extrapolated to the unexposed events. If the distributional disparity of $e_{u,i}$ between $o_{u,i}=0$ and $o_{u,i}=1$ is large, the imputed errors are likely to be inaccurate on the unexposed events and incur bias in DR. Therefore, it is beneficial for reducing bias to develop an enhanced DR method with a weaker reliance on extrapolation.

More stable training process of updating a prediction model. In general, alternating training between models results in better performance. As shown in Figure 1, Wang et al. (2019) proposed joint learning for DR, alternately updating the error imputation and prediction models with given estimated propensities. Double learning (Guo et al., 2021) further incorporates parameter sharing between the imputation and prediction models. Bi-level optimization (Wang et al., 2021; Chen et al., 2021) can be viewed as alternately updating the prediction model and the other parameters used to estimate the loss. To the best of our knowledge, this is the first paper that proposes an algorithm to update the three models (i.e., the error imputation model, the propensity model, and the prediction model) separately using different optimizers, which may result in more stable and accurate rating predictions.

3.2 Stabilized Doubly Robust Estimator

We propose a stabilized doubly robust (SDR) estimator that has a weaker dependence on extrapolation and is robust to small propensities. The SDR estimator consists of the following three steps.

Step 1 (Initialize imputed errors). Pre-train an imputation model $\hat{e}_{u,i}$, and let $\hat{\mathcal{E}}\triangleq|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}\hat{e}_{u,i}$.

Step 2 (Learn constrained propensities). Learn a propensity model $\hat{p}_{u,i}$ satisfying

$$\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}\left(\hat{e}_{u,i}-\hat{\mathcal{E}}\right)=0.\qquad(1)$$

Step 3 (SDR estimator). The SDR estimator is given as

$$\mathcal{E}_{SDR}=\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}e_{u,i}}{\hat{p}_{u,i}}\Big/\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}\triangleq\sum_{(u,i)\in\mathcal{D}}w_{u,i}e_{u,i},$$

where $w_{u,i}=\frac{o_{u,i}}{\hat{p}_{u,i}}\big/\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}$. The SDR estimator has the same form as the SNIPS estimator, but the propensities are learned differently: in SDR, the estimation of the propensity model relies on the imputed errors, whereas in SNIPS it does not.
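The sketch below illustrates Step 3 on the same kind of simulated data as before: $\mathcal{E}_{SDR}$ is a convex combination of the observed errors with self-normalized weights, so it always lies within the range of the observed $e_{u,i}$. The data-generating choices here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
D = 100_000
e = rng.uniform(0.0, 1.0, size=D)            # prediction errors e_{u,i}
p = np.clip(1.0 - e, 0.05, 0.95)             # MNAR propensities
o = rng.binomial(1, p).astype(float)
p_hat = p                                    # accurate propensity estimates

# Self-normalized weights w_{u,i}: each lies in [0, 1] and they sum to 1,
# no matter how small the individual p_hat values are.
w = (o / p_hat) / np.sum(o / p_hat)
sdr = np.sum(w * e)                          # E_SDR
print(f"ideal={e.mean():.4f}  sdr={sdr:.4f}")
print("weights in [0,1]:", bool((w >= 0).all() and (w <= 1).all()))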

Each step in the construction of the SDR estimator plays a different role. Specifically, Step 2 is designed to enable the double robustness property, as shown in Theorem 1 (see Appendix A.1 for proofs).

Theorem 1 (Double Robustness).

$\mathcal{E}_{SDR}$ is an asymptotically unbiased estimator of $\mathcal{L}_{ideal}$ (i.e., unbiased as the sample size goes to infinity), when either the learned propensities $\hat{p}_{u,i}$ or the imputed errors $\hat{e}_{u,i}$ are accurate for all user-item pairs.

We provide an intuitive way to illustrate the rationale of SDR. On the one hand, suppose the propensities can be accurately estimated (i.e., $\hat{p}_{u,i}=p_{u,i}$) by using a common model (e.g., logistic regression) without imposing constraint (1). Then the expectation of the left-hand side of constraint (1) becomes

$$\mathbb{E}_{\mathcal{O}}\Big[\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}\left(\hat{e}_{u,i}-\hat{\mathcal{E}}\right)\Big]=\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\left(\hat{e}_{u,i}-\hat{\mathcal{E}}\right)\equiv 0,$$

which indicates that constraint (1) always holds as the sample size goes to infinity by the strong law of large numbers (this is the reason why we adopt the term "asymptotically unbiased"), irrespective of the accuracy of the imputed errors $\hat{e}_{u,i}$. This implies that constraint (1) imposes almost no restriction on the estimation of propensities; in this case, the SDR estimator is almost equivalent to the original SNIPS estimator. On the other hand, suppose the propensities cannot be accurately estimated by using a common model, but the imputed errors are accurate (i.e., $\hat{e}_{u,i}=e_{u,i}$). In this case, $\hat{\mathcal{E}}$ is an unbiased estimator. Specifically, $\mathcal{E}_{SDR}$ satisfies

$$\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}(e_{u,i}-\mathcal{E}_{SDR})}{\hat{p}_{u,i}}=0.\qquad(2)$$

Combining the constraint (1) and equation (2) gives

$$\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\left[\frac{o_{u,i}(e_{u,i}-\hat{e}_{u,i})}{\hat{p}_{u,i}}+\frac{o_{u,i}(\hat{\mathcal{E}}-\mathcal{E}_{SDR})}{\hat{p}_{u,i}}\right]=0,\qquad(3)$$

where the first term equals 0 if $\hat{e}_{u,i}=e_{u,i}$, which implies that $\mathcal{E}_{SDR}=\hat{\mathcal{E}}$; the unbiasedness of $\mathcal{E}_{SDR}$ then follows immediately from the unbiasedness of $\hat{\mathcal{E}}$.

In addition, Step 3 is designed for two main reasons to achieve stability. First, $\mathcal{E}_{SDR}$ is more robust to extrapolation compared with DR, because the propensities are learned from the entire data and thus place less demand on extrapolation. Second, $\mathcal{E}_{SDR}$ is more stable to small propensities, since the self-normalization forces each weight $w_{u,i}$ to fall in the interval $[0,1]$.

In summary, forcing the propensities to satisfy constraint (1) makes the SDR estimator not only doubly robust but also able to capture the advantages of both the SNIPS and DR estimators. The design of SDR enables the constrained propensities to adaptively find the direction of debiasing if either the learned propensities without imposing constraint (1) or the imputed errors are accurate.
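The first case of this intuition is easy to check numerically: in a minimal sketch (with illustrative, assumed distributions), when the propensities are accurate, the left-hand side of constraint (1) is already close to zero by the law of large numbers, even for arbitrarily poor imputed errors.

import numpy as np

rng = np.random.default_rng(0)
D = 500_000
e_hat = rng.uniform(0.0, 1.0, size=D)        # arbitrary (poor) imputed errors
p = rng.uniform(0.05, 0.9, size=D)           # true propensities
o = rng.binomial(1, p).astype(float)
p_hat = p                                    # accurate propensity model

E_hat = e_hat.mean()                         # \hat{E}
lhs = np.mean(o / p_hat * (e_hat - E_hat))   # left-hand side of constraint (1)
print(f"constraint (1) LHS = {lhs:.5f}")     # ~0 without imposing the constraint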

3.3 Theoretical Analysis of Stableness

Through theoretical analysis, we note that previous debiasing estimators such as IPS (Schnabel et al., 2016) and DR-based methods (Wang et al., 2019; Guo et al., 2021) tend to have infinite biases, variances, tail bounds, and corresponding generalization error bounds in the presence of extremely small estimated propensities. Remarkably, the proposed SDR estimator does not suffer from such problems and is stable to arbitrarily small propensities, as shown in the following theorems (see Appendices A.2, A.3, and A.4 for proofs).

Theorem 2 (Bias of SDR).

Given imputed errors $\hat{e}_{u,i}$ and learned propensities $\hat{p}_{u,i}$ satisfying the stabilization constraint (1), with $\hat{p}_{u,i}>0$ for all user-item pairs, the bias of $\mathcal{E}_{SDR}$ is

$$\operatorname{Bias}(\mathcal{E}_{SDR})=\Biggl|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\left(\delta_{u,i}-\frac{\sum_{(u,i)\in\mathcal{D}}\delta_{u,i}p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}}p_{u,i}/\hat{p}_{u,i}}\right)\Biggr|+O(|\mathcal{D}|^{-1}),$$

where $\delta_{u,i}=e_{u,i}-\hat{e}_{u,i}$ is the error deviation.

Theorem 2 shows that the bias of the SDR estimator consists of a dominant term, given by the difference between $\delta_{u,i}$ and its weighted average, and a negligible term of order $O(|\mathcal{D}|^{-1})$. Since $\delta_{u,i}$ and its convex combinations are bounded, the bias is bounded for arbitrarily small $\hat{p}_{u,i}$. Compared with $\operatorname{Bias}(\mathcal{E}_{IPS})=|\mathcal{D}|^{-1}|\sum_{(u,i)\in\mathcal{D}}(\hat{p}_{u,i}-p_{u,i})e_{u,i}/\hat{p}_{u,i}|$ and $\operatorname{Bias}(\mathcal{E}_{DR})=|\mathcal{D}|^{-1}|\sum_{(u,i)\in\mathcal{D}}(\hat{p}_{u,i}-p_{u,i})\delta_{u,i}/\hat{p}_{u,i}|$, this indicates that IPS and DR will have an extremely large bias whenever there exists an extremely small $\hat{p}_{u,i}$.

Theorem 3 (Variance of SDR).

Under the conditions of Theorem 2, the variance of $\mathcal{E}_{SDR}$ is

$$\operatorname{Var}(\mathcal{E}_{SDR})=\frac{\sum_{(u,i)\in\mathcal{D}}p_{u,i}(1-p_{u,i})h^{2}_{u,i}/\hat{p}^{2}_{u,i}}{\left(\sum_{(u,i)\in\mathcal{D}}p_{u,i}/\hat{p}_{u,i}\right)^{2}}+O(|\mathcal{D}|^{-2}),$$

where $h_{u,i}=(e_{u,i}-\hat{e}_{u,i})-\sum_{(u,i)\in\mathcal{D}}\{p_{u,i}(e_{u,i}-\hat{e}_{u,i})/\hat{p}_{u,i}\}/\sum_{(u,i)\in\mathcal{D}}\{p_{u,i}/\hat{p}_{u,i}\}$ is a bounded difference between $e_{u,i}-\hat{e}_{u,i}$ and its weighted average.

Theorem 3 shows that the variance of the SDR estimator consists of a dominant term and a negligible term of order $O(|\mathcal{D}|^{-2})$. The boundedness of the variance for arbitrarily small $\hat{p}_{u,i}$ follows directly from the fact that SDR has a bounded range given by its self-normalized form. Compared with $\operatorname{Var}(\mathcal{E}_{IPS})=|\mathcal{D}|^{-2}\sum_{(u,i)\in\mathcal{D}}p_{u,i}(1-p_{u,i})e^{2}_{u,i}/\hat{p}^{2}_{u,i}$ and $\operatorname{Var}(\mathcal{E}_{DR})=|\mathcal{D}|^{-2}\sum_{(u,i)\in\mathcal{D}}p_{u,i}(1-p_{u,i})(e_{u,i}-\hat{e}_{u,i})^{2}/\hat{p}^{2}_{u,i}$, this indicates that IPS and DR will have an extremely large variance (tending to infinity) whenever there exists an extremely small $\hat{p}_{u,i}$ (tending to 0).

Theorem 4 (Tail Bound of SDR).

Under the conditions of Theorem 2, for any prediction model, with probability $1-\eta$, the deviation of $\mathcal{E}_{SDR}$ from its expectation has the following tail bound

$$\left|\mathcal{E}_{SDR}-\mathbb{E}_{\mathcal{O}}(\mathcal{E}_{SDR})\right|\leq\sqrt{\frac{1}{2}\log\left(\frac{4}{\eta}\right)\sum_{(u,i)\in\mathcal{D}}\frac{(\delta_{\max}-\delta_{u,i})^{2}+(\delta_{u,i}-\delta_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}},$$

where $\delta_{\min}=\min_{(u,i)\in\mathcal{D}}\delta_{u,i}$, $\delta_{\max}=\max_{(u,i)\in\mathcal{D}}\delta_{u,i}$, $\epsilon^{\prime}=\sqrt{\log(4/\eta)/2\cdot\sum_{\mathcal{D}\setminus(u,i)}1/\hat{p}^{2}_{u,i}}$, and $\mathcal{D}\setminus(u,i)$ is the set $\mathcal{D}$ excluding the element $(u,i)$.

Note that $\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}=O(|\mathcal{D}|)$ and $\epsilon^{\prime}=O(|\mathcal{D}|^{1/2})$ in Theorem 4; it follows that the tail bound of the SDR estimator converges to 0 for large samples. In addition, the tail bound is bounded for arbitrarily small $\hat{p}_{u,i}$. In comparison, for the tail bounds of IPS and DR, with probability $1-\eta$, we have

$$\left|\mathcal{E}_{IPS}-\mathbb{E}_{\mathcal{O}}[\mathcal{E}_{IPS}]\right|\leq\sqrt{\frac{\log(2/\eta)}{2|\mathcal{D}|^{2}}\sum_{(u,i)\in\mathcal{D}}\left(\frac{e_{u,i}}{\hat{p}_{u,i}}\right)^{2}},\qquad\left|\mathcal{E}_{DR}-\mathbb{E}_{\mathcal{O}}[\mathcal{E}_{DR}]\right|\leq\sqrt{\frac{\log(2/\eta)}{2|\mathcal{D}|^{2}}\sum_{(u,i)\in\mathcal{D}}\left(\frac{\delta_{u,i}}{\hat{p}_{u,i}}\right)^{2}},$$

which are both unbounded when $\hat{p}_{u,i}\rightarrow 0$. For SDR in the prediction model training phase, the boundedness of the generalization error bound (see Theorem 5 in Appendix A.5) follows immediately from the boundedness of the bias and tail bound. The above analysis demonstrates that SDR can comprehensively mitigate the negative effects caused by extreme propensities and result in more stable predictions. Theorems 2-5 are stated under constraint (1). If we estimate the propensities with constraint (1), but constraint (1) somehow does not hold exactly, the associated bias, variance, and generalization error bound of SDR are presented in Appendix B.
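The contrast in stability can be seen in a small stress test, sketched below with illustrative data-generating assumptions: a handful of estimated propensities are corrupted to be extremely small, so IPS and DR explode with $1/\hat{p}_{u,i}$, while SDR remains within the range of the observed errors.

import numpy as np

rng = np.random.default_rng(0)
D = 5_000
e = rng.uniform(0.0, 1.0, size=D)            # true errors e_{u,i}
e_hat = e + rng.normal(0.0, 0.3, size=D)     # inaccurate imputed errors
p = rng.uniform(0.2, 0.9, size=D)            # true propensities
o = rng.binomial(1, p).astype(float)

p_hat = p.copy()
idx = np.flatnonzero(o == 1)[:5]             # corrupt a few observed pairs with
p_hat[idx] = 1e-6                            # extremely small estimated propensities

ips = np.mean(o * e / p_hat)
dr = np.mean(e_hat + o * (e - e_hat) / p_hat)
sdr = np.sum(o * e / p_hat) / np.sum(o / p_hat)
print(f"ideal={e.mean():.3f}  ips={ips:.1f}  dr={dr:.1f}  sdr={sdr:.3f}")
# IPS and DR blow up by orders of magnitude; SDR stays within [e.min(), e.max()].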

4 Cycle Learning with Stabilization

In this section, we propose a novel SDR-based cycle learning approach that not only exploits the stable statistical properties of the SDR estimator itself, but also carefully designs the updating process among the various models to achieve higher stability. In general, inspired by the idea of value iteration in reinforcement learning (Sutton & Barto, 2018), alternately updating the models tends to achieve better predictive performance, as existing debiasing training approaches suggest (Wang et al., 2019; Guo et al., 2021; Chen et al., 2021). As shown in Figure 1, the proposed approach dynamically interacts with three models, utilizing the propensity model and imputation model simultaneously in a differentiated way, and can be regarded as an extension of these methods. In cycle learning, given pre-trained propensities, the inverse propensity weighted imputation error loss is first used to obtain an imputation model; constraint (1) is then taken as a regularization term to train a stabilized propensity model and ensure the double robustness of SDR. Finally, the prediction model is updated by minimizing the SDR loss and is used to readjust the imputed errors. By repeating the above update processes cyclically, the cycle learning approach can fully utilize and combine the advantages of the three models to achieve more accurate rating predictions.

Specifically, data MNAR leads to the presence of missing $r_{u,i}(1)$, so that the prediction errors $e_{u,i}$ cannot all be computed directly. Therefore, we obtain imputed errors by learning a pseudo-labeling model $\tilde{r}_{u,i}(1)$ parameterized by $\beta$, and the imputed errors $\hat{e}_{u,i}=\mathrm{CE}(\tilde{r}_{u,i}(1),\hat{r}_{u,i}(1))$ are updated by minimizing

$$\mathcal{L}_{e}(\phi,\alpha,\beta)=|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}(\hat{e}_{u,i}-e_{u,i})^{2}}{\pi(x_{u,i};\alpha)}+\lambda_{e}\|\beta\|_{F}^{2},$$

where $e_{u,i}=\mathrm{CE}(r_{u,i}(1),\hat{r}_{u,i}(1))$, $\lambda_{e}\geq 0$, $\hat{p}_{u,i}=\pi(x_{u,i};\alpha)$ is the propensity model, and $\|\cdot\|_{F}^{2}$ is the Frobenius norm. For each observed rating, the inverse of the estimated propensity is used as a weight to account for MNAR effects. Next, we consider two methods for estimating propensity scores: Naive Bayes with Laplace smoothing and logistic regression. The former provides a wide range of opportunities for achieving the stability constraint (1) through the selection of smoothing coefficients. The latter requires user and item embeddings, which are obtained by employing MF before performing cycle learning. The learned propensities need to be both accurate, which is evaluated with cross entropy, and satisfy constraint (1) for stabilization and double robustness. The propensity model $\pi(x_{u,i};\alpha)$ is updated by using the loss $\mathcal{L}_{ce}(\phi,\alpha,\beta)+\eta\cdot\mathcal{L}_{stable}(\phi,\alpha,\beta)$, where $\mathcal{L}_{ce}(\phi,\alpha,\beta)$ is the cross entropy loss of the propensity model and

$$\mathcal{L}_{stable}(\phi,\alpha,\beta)=|\mathcal{D}|^{-1}\Big\{\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\pi(x_{u,i};\alpha)}\left(\hat{e}_{u,i}-\hat{\mathcal{E}}\right)\Big\}^{2}+\lambda_{stable}\|\alpha\|_{F}^{2},$$

where $\lambda_{stable}\geq 0$, and $\eta$ is a hyper-parameter for the trade-off. Finally, the prediction model $f(x_{u,i};\phi)$ is updated by minimizing the SDR loss

$$\mathcal{L}_{sdr}(\phi,\alpha,\beta)=\Big[\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}e_{u,i}}{\pi(x_{u,i};\alpha)}\Big]\Big/\Big[\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\pi(x_{u,i};\alpha)}\Big]+\lambda_{sdr}\|\phi\|_{F}^{2},$$

where the first term is equivalent to the left-hand side of equation (3), and $\lambda_{sdr}\geq 0$. In cycle learning, the updated prediction model is then used to re-update the imputation model on the next sample batch. Notably, the designed algorithm strictly follows the proposed SDR estimator in Section 3.2. As shown in Figure 1 and Alg. 1, our algorithm first updates the imputed errors $\hat{e}$ in Step 1, and then learns propensities $\hat{p}$ based on the learned $\hat{e}$ to satisfy constraint (1) in Step 2. The main purpose of the first two steps is to ensure that the SDR estimator in Step 3 is doubly robust and has a lower extrapolation dependence compared to the previous DR methods. Finally, in Step 3 we update the predicted ratings $\hat{r}$ by minimizing the estimate of the ideal loss given by the proposed SDR estimator. For the next round, instead of re-initializing, Step 1 updates the imputed errors $\hat{e}$ according to the new prediction model, Step 2 re-updates the constrained propensities $\hat{p}$, and Step 3 updates the prediction model $\hat{r}$ again, and so on. We summarize the cycle learning approach in Alg. 1.

Input: observed ratings $\mathbf{Y}^{o}$, and $\eta,\lambda_{e},\lambda_{stable},\lambda_{sdr}\geq 0$
while stopping criteria is not satisfied do
    for number of steps for training the imputation model do
        Sample a batch of user-item pairs $\{(u_{j},i_{j})\}_{j=1}^{J}$ from $\mathcal{O}$;
        Update $\beta$ by descending along the gradient $\nabla_{\beta}\mathcal{L}_{e}(\phi,\alpha,\beta)$;
    end for
    for number of steps for training the propensity model do
        Sample a batch of user-item pairs $\{(u_{k},i_{k})\}_{k=1}^{K}$ from $\mathcal{D}$;
        Calculate the gradient of the propensity cross entropy loss $\nabla_{\alpha}\mathcal{L}_{ce}(\phi,\alpha,\beta)$;
        Calculate the gradient of the propensity stabilization constraint (1) $\nabla_{\alpha}\mathcal{L}_{stable}(\phi,\alpha,\beta)$;
        Update $\alpha$ by descending along the gradient $\nabla_{\alpha}\mathcal{L}_{ce}(\phi,\alpha,\beta)+\eta\cdot\nabla_{\alpha}\mathcal{L}_{stable}(\phi,\alpha,\beta)$;
    end for
    for number of steps for training the prediction model do
        Sample a batch of user-item pairs $\{(u_{l},i_{l})\}_{l=1}^{L}$ from $\mathcal{O}$;
        Update $\phi$ by descending along the gradient $\nabla_{\phi}\mathcal{L}_{sdr}(\phi,\alpha,\beta)$;
    end for
end while
Algorithm 1: The Proposed Stable-DR (MRDR) Cycle Learning
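A minimal, self-contained PyTorch sketch of Alg. 1 on toy synthetic data is given below. The MF architecture, full-matrix (non-batched) updates, hyper-parameters, and data generation are illustrative simplifications of our own choosing, not the paper's exact implementation; they are meant only to show how the three losses and optimizers interact within one cycle.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
U, I, k = 100, 120, 8

# Toy MNAR data (illustrative): binary ratings, preference-driven exposure.
r1 = torch.bernoulli(torch.full((U, I), 0.3))       # potential outcomes r_{u,i}(1)
o = torch.bernoulli(0.05 + 0.45 * r1)               # exposure indicators o_{u,i}

class MF(torch.nn.Module):
    """Logit-output matrix factorization, reused for all three models."""
    def __init__(self):
        super().__init__()
        self.pu = torch.nn.Embedding(U, k)
        self.qi = torch.nn.Embedding(I, k)
    def forward(self):
        return self.pu.weight @ self.qi.weight.T    # U x I matrix of logits

def ce(y, prob):
    """Elementwise cross entropy CE(y, prob)."""
    prob = prob.clamp(1e-6, 1 - 1e-6)
    return -(y * prob.log() + (1 - y) * (1 - prob).log())

pred, imp, prop = MF(), MF(), MF()                  # phi, beta, alpha
opt_pred = torch.optim.Adam(pred.parameters(), lr=0.01)
opt_imp = torch.optim.Adam(imp.parameters(), lr=0.01)
opt_prop = torch.optim.Adam(prop.parameters(), lr=0.01)
eta = 100.0                                         # stabilization strength

for step in range(200):
    # Step 1: update the imputation model beta with the IPS-weighted loss L_e.
    p_hat = torch.sigmoid(prop()).detach()
    r_hat = torch.sigmoid(pred()).detach()
    e = ce(r1, r_hat)          # e_{u,i}; the factor o below zeroes unobserved pairs
    e_hat = ce(torch.sigmoid(imp()), r_hat)         # \hat{e}_{u,i}
    loss_e = (o * (e_hat - e) ** 2 / p_hat).mean()
    opt_imp.zero_grad(); loss_e.backward(); opt_imp.step()

    # Step 2: update the propensity model alpha with L_ce + eta * L_stable.
    e_hat = ce(torch.sigmoid(imp()), r_hat).detach()
    p_hat = torch.sigmoid(prop())
    loss_ce = F.binary_cross_entropy(p_hat, o)
    loss_stable = (o / p_hat * (e_hat - e_hat.mean())).mean() ** 2  # constraint (1)
    loss_prop = loss_ce + eta * loss_stable
    opt_prop.zero_grad(); loss_prop.backward(); opt_prop.step()

    # Step 3: update the prediction model phi with the self-normalized SDR loss.
    p_hat = torch.sigmoid(prop()).detach()
    e = ce(r1, torch.sigmoid(pred()))
    loss_sdr = (o * e / p_hat).sum() / (o / p_hat).sum()
    opt_pred.zero_grad(); loss_sdr.backward(); opt_pred.step()

print("final SDR loss:", float(loss_sdr))

In a faithful implementation, each step would sample mini-batches from $\mathcal{O}$ or $\mathcal{D}$ as in Alg. 1 and add the $\lambda$ regularizers (e.g., via weight decay); the full-matrix updates above are only for brevity.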

5 Real-world Experiments

In this section, several experiments are conducted to evaluate the proposed methods on two real-world benchmark datasets. We conduct experiments to answer the following questions:

  1. RQ1. Do the proposed Stable-DR and Stable-MRDR approaches improve in debiasing performance compared to the existing studies?
  2. RQ2. Do our methods stably perform well under various propensity models?
  3. RQ3. How does the performance of our method change under different strengths of the stabilization constraint?

5.1 Experimental Setup

Dataset and preprocessing. To answer the above RQs, we need datasets that contain both MNAR ratings and missing-at-random (MAR) ratings. Following the previous studies (Schnabel et al., 2016; Wang et al., 2019; Guo et al., 2021; Chen et al., 2021), we conduct experiments on two commonly used datasets. Coat (https://www.cs.cornell.edu/~schnabts/mnar/) contains ratings from 290 users on 300 items: each user evaluated 24 items, yielding 6,960 MNAR ratings in total, and additionally evaluated 16 randomly selected items, generating 4,640 MAR ratings. Yahoo! R3 (http://webscope.sandbox.yahoo.com/) contains in total 311,704 MNAR and 54,000 MAR ratings from 15,400 users on 1,000 items.

Table 1: Performance on Coat and Yahoo!R3, using MF, SLIM, and NCF as the base models.
Method               |            Coat              |           Yahoo!R3
                     | MSE    AUC    N@5    N@10    | MSE    AUC    N@5    N@10
MF                   | 0.2428 0.7063 0.6025 0.6774  | 0.2500 0.6722 0.6374 0.7634
+ IPS                | 0.2316 0.7166 0.6184 0.6897  | 0.2194 0.6742 0.6304 0.7556
+ SNIPS              | 0.2333 0.7070 0.6222 0.6851  | 0.1931 0.6831 0.6348 0.7608
+ AS-IPS             | 0.2121 0.7180 0.6160 0.6824  | 0.2391 0.6770 0.6364 0.7601
+ CVIB               | 0.2195 0.7239 0.6285 0.6947  | 0.2625 0.6853 0.6513 0.7729
+ DR                 | 0.2298 0.7132 0.6243 0.6918  | 0.2093 0.6873 0.6574 0.7741
+ DR-JL              | 0.2254 0.7209 0.6252 0.6961  | 0.2194 0.6863 0.6525 0.7701
+ Stable-DR (Ours)   | 0.2159 0.7508 0.6511 0.7073  | 0.2090 0.6946 0.6620 0.7786
+ MRDR-JL            | 0.2252 0.7318 0.6375 0.6989  | 0.2173 0.6830 0.6437 0.7652
+ Stable-MRDR (Ours) | 0.2076 0.7548 0.6532 0.7105  | 0.2081 0.6915 0.6585 0.7757
SLIM                 | 0.2419 0.7074 0.7064 0.7650  | 0.2126 0.6636 0.7190 0.8134
+ IPS                | 0.2411 0.7058 0.7235 0.7644  | 0.2046 0.6583 0.7285 0.8244
+ SNIPS              | 0.2420 0.7071 0.7369 0.7672  | 0.2155 0.6720 0.7303 0.8227
+ AS-IPS             | 0.2133 0.7105 0.6238 0.6975  | 0.1946 0.6769 0.6508 0.7702
+ CVIB               | 0.2413 0.7108 0.7214 0.7638  | 0.2024 0.6790 0.7335 0.8221
+ DR                 | 0.2334 0.7064 0.7267 0.7649  | 0.2054 0.6771 0.7344 0.8248
+ DR-JL              | 0.2407 0.7090 0.7279 0.7655  | 0.2044 0.6792 0.7360 0.8260
+ Stable-DR (Ours)   | 0.2356 0.7201 0.7389 0.7724  | 0.2080 0.6874 0.7473 0.8349
+ MRDR-JL            | 0.2409 0.7074 0.7329 0.7679  | 0.2016 0.6791 0.7338 0.8239
+ Stable-MRDR (Ours) | 0.2369 0.7148 0.7378 0.7711  | 0.2086 0.6842 0.7435 0.8308
NCF                  | 0.2050 0.7670 0.6228 0.6954  | 0.3215 0.6782 0.6501 0.7672
+ IPS                | 0.2042 0.7646 0.6327 0.7054  | 0.1777 0.6719 0.6548 0.7703
+ SNIPS              | 0.1904 0.7707 0.6271 0.7062  | 0.1694 0.6903 0.6693 0.7807
+ AS-IPS             | 0.2061 0.7630 0.6156 0.6983  | 0.1715 0.6879 0.6620 0.7769
+ CVIB               | 0.2042 0.7655 0.6176 0.6946  | 0.3088 0.6715 0.6669 0.7793
+ DR                 | 0.2081 0.7578 0.6119 0.6900  | 0.1705 0.6886 0.6628 0.7768
+ DR-JL              | 0.2115 0.7600 0.6272 0.6967  | 0.2452 0.6818 0.6516 0.7678
+ Stable-DR (Ours)   | 0.1896 0.7712 0.6337 0.7095  | 0.1664 0.6907 0.6756 0.7861
+ MRDR-JL            | 0.2046 0.7609 0.6182 0.6992  | 0.2367 0.6778 0.6465 0.7664
+ Stable-MRDR (Ours) | 0.1899 0.7710 0.6380 0.7082  | 0.1671 0.6910 0.6734 0.7846

Baselines. In our experiments, we take Matrix Factorization (MF) (Koren et al., 2009), Sparse LInear Method (SLIM) (Ning & Karypis, 2011), and Neural Collaborative Filtering (NCF) (He et al., 2017) as the base models, and compare the proposed methods with the following baselines: Base Model, IPS (Saito et al., 2020; Schnabel et al., 2016), SNIPS (Swaminathan & Joachims, 2015), AS-IPS (Saito, 2020), CVIB (Wang et al., 2020b), DR (Saito, 2020), DR-JL (Wang et al., 2019), and MRDR-JL (Guo et al., 2021). In addition, Naive Bayes with Laplace smoothing and logistic regression are used to build the propensity models.

Experimental protocols and details. The following four metrics are used simultaneously in the evaluation of debiasing performance: MSE, AUC, NDCG@5, and NDCG@10. All the experiments are implemented in PyTorch with Adam as the optimizer (for all experiments, we use an NVIDIA GeForce RTX 3090 as the computing resource). We tune the learning rate in $\{0.005,0.01,0.05,0.1\}$, the weight decay in $\{1e{-}6,5e{-}6,\dots,5e{-}3,1e{-}2\}$, the constraint parameter $\eta$ in $\{50,100,150,200\}$ for Coat and $\{500,1000,1500,2000\}$ for Yahoo! R3, and the batch size in $\{128,256,512,1024,2048\}$ for Coat and $\{1024,2048,4096,8192,16384\}$ for Yahoo! R3. In addition, for the Laplace smoothing parameter in the Naive Bayes model, the initial value is set to 0 and the learning rate is tuned in $\{5,10,15,20\}$ for Coat and in $\{50,100,150,200\}$ for Yahoo! R3.

5.2 Performance Comparison (RQ1)

Table 1 summarizes the performance of the proposed Stable-DR and Stable-MRDR methods compared with previous methods. First, the causally-inspired methods perform better than the base model, verifying the necessity of handling the selection bias in rating prediction. Among the previous methods, SNIPS, CVIB, and DR demonstrate competitive performance. Second, the proposed Stable-DR and Stable-MRDR achieve the best performance on all four metrics. On one hand, our methods outperform SNIPS, attributed to the inclusion of the propensity model in the training process, as well as the boundedness and double robustness of SDR. On the other hand, our methods outperform DR-JL and MRDR-JL, attributed to the stabilization constraint introduced in the training of the propensity model. This further demonstrates the benefit of cycle learning, in which the propensity model acts as a mediator between the imputation and prediction models during the training process, rather than the prediction model being updated from the imputation model directly.

5.3 Ablation and Parameter Sensitivity Study (RQ2, RQ3)

The debiasing performance under different stabilization constraint strengths and propensity models is shown in Figure 2. First, the proposed Stable-DR and Stable-MRDR outperform DR-JL and MRDR-JL when either Naive Bayes with Laplace smoothing or logistic regression is used as the propensity model, which indicates that our methods have better debiasing ability in both the feature-based and collaborative filtering scenarios. Second, when the strength of the stabilization constraint is zero, our method performs similarly to SNIPS and slightly worse than DR-JL and MRDR-JL, which indicates that simply using the cross-entropy loss to update the propensity model is not effective in improving the model performance. However, as the strength of the stabilization constraint increases, Stable-DR and Stable-MRDR using cycle learning show a stable and significant improvement compared to DR-JL and MRDR-JL. Our methods achieve the optimal performance at an appropriate constraint strength, which can be interpreted as simultaneous consideration of accuracy and stability to ensure the boundedness and double robustness of SDR.

Figure 2: MSE, AUC, and Increasing Ratio (IR) of Stable-DR and Stable-MRDR compared with the two baseline algorithms DR-JL and MRDR-JL under two different propensity model settings: Naive Bayes with Laplace smoothing (top) and logistic regression (bottom).

6 Conclusion

In this paper, we propose an SDR estimator for data MNAR that maintains double robustness and improves the stability of DR in the following three aspects: first, we show that SDR has a weaker extrapolation dependence than DR and can result in more stable and accurate predictions in the presence of MNAR effects. Next, through theoretical analysis, we show that the proposed SDR has bounded bias, variance, and generalization error bounds under inaccurate imputed errors and arbitrarily small estimated propensities, while DR does not. Finally, we propose a novel learning approach for SDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approach significantly outperforms the existing methods in terms of both convergence and prediction accuracy.

Ethics Statement

This work is mostly theoretical and experiments are based on synthetic and public datasets. We claim that this work does not present any foreseeable negative social impact.

Reproducibility Statement

Code is provided in Supplementary Materials to reproduce the experimental results.

Acknowledgments

The work was supported by the National Key R&D Program of China under Grant No. 2019YFB1705601.

References

  • Ai et al. (2018) Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. Unbiased learning to rank with unbiased propensity estimation. In SIGIR, 2018.
  • Bang & Robins (2005) Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972, 2005.
  • Bonner & Vasile (2018) Stephen Bonner and Flavian Vasile. Causal embeddings for recommendation. In RecSys, 2018.
  • Chen et al. (2020) Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. Bias and debias in recommender system: A survey and future directions. arXiv:2010.03240, 2020.
  • Chen et al. (2021) Jiawei Chen, Hande Dong, Yang Qiu, Xiangnan He, Xin Xin, Liang Chen, Guli Lin, and Keping Yang. Autodebias: Learning to debias for recommendation. In SIGIR, 2021.
  • Dai et al. (2022) Quanyu Dai, Haoxuan Li, Peng Wu, Zhenhua Dong, Xiao-Hua Zhou, Rui Zhang, Xiuqiang He, Rui Zhang, and Jie Sun. A generalized doubly robust learning framework for debiasing post-click conversion rate prediction. In KDD, 2022.
  • Ding et al. (2022) Sihao Ding, Peng Wu, Fuli Feng, Xiangnan He, Yitong Wang, Yong Liao, and Yongdong Zhang. Addressing unmeasured confounder for recommendation with sensitivity analysis. In KDD, 2022.
  • Guo et al. (2021) Siyuan Guo, Lixin Zou, Yiding Liu, Wenwen Ye, Suqi Cheng, Shuaiqiang Wang, Hechang Chen, Dawei Yin, and Yi Chang. Enhanced doubly robust learning for debiasing post-click conversion rate estimation. In SIGIR, 2021.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In WWW, 2017.
  • Hernán & Robins (2020) Miguel A. Hernán and James M. Robins. Causal Inference: What If. Boca Raton: Chapman and Hall/CRC, 2020.
  • Imbens & Rubin (2015) Guido W. Imbens and Donald B. Rubin. Causal Inference For Statistics Social and Biomedical Science. Cambridge University Press, 2015.
  • Kang & Schafer (2007) Joseph D.Y. Kang and Joseph L. Schafer. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22:523–539, 2007.
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
  • Li et al. (2023a) Haoxuan Li, Quanyu Dai, Yuru Li, Yan Lyu, Zhenhua Dong, Xiao-Hua Zhou, and Peng Wu. Multiple robust learning for recommendation. In AAAI, 2023a.
  • Li et al. (2023b) Haoxuan Li, Yan Lyu, Chunyuan Zheng, and Peng Wu. TDR-CL: Targeted doubly robust collaborative learning for debiased recommendations. In ICLR, 2023b.
  • Li et al. (2023c) Haoxuan Li, Yanghao Xiao, Chunyuan Zheng, and Peng Wu. Balancing unobserved confounding with a few unbiased ratings in debiased recommendations. In WWW, 2023c.
  • Liu et al. (2020) Dugang Liu, Pengxiang Cheng, Zhenhua Dong, Xiuqiang He, Weike Pan, and Zhong Ming. A general knowledge distillation framework for counterfactual recommendation via uniform data. In SIGIR, 2020.
  • Liu et al. (2021) Dugang Liu, Pengxiang Cheng, Hong Zhu, Zhenhua Dong, Xiuqiang He, Weike Pan, and Zhong Ming. Mitigating confounding bias in recommendation via information bottleneck. In RecSys, 2021.
  • Liu et al. (2016) Yiming Liu, Xuezhi Cao, and Yong Yu. Are you influenced by others when rating?: Improve rating prediction by conformity modeling. In RecSys, 2016.
  • Mansoury et al. (2020) Masoud Mansoury, Himan Abdollahpouri, Mykola Pechenizkiy, and Bamshad Mobasher. Feedback loop and bias amplification in recommender systems. In WWW, 2020.
  • Marlin et al. (2007) Benjamin Marlin, Richard S Zemel, Sam Roweis, and Malcolm Slaney. Collaborative filtering and the missing at random assumption. In UAI, 2007.
  • Molenberghs et al. (2015) Geert Molenberghs, Garrett Fitzmaurice, Michael G. Kenward, Anastasios Tsiatis, and Geert Verbeke. Handbook of Missing Data Methodology. Chapman & Hall/CRC, 2015.
  • Neyman (1990) Jerzy Splawa Neyman. On the application of probability theory to agricultural experiments. essay on principles. section 9. Statistical Science, 5:465–472, 1990.
  • Ning & Karypis (2011) Xia Ning and George Karypis. Slim: Sparse linear methods for top-n recommender systems. In ICDM, pp. 497–506, 2011.
  • Saito (2019) Yuta Saito. Unbiased pairwise learning from implicit feedback. In NeurIPS Workshop, 2019.
  • Saito (2020) Yuta Saito. Asymmetric tri-training for debiasing missing-not-at-random explicit feedback. In SIGIR, 2020.
  • Saito (2020) Yuta Saito. Doubly robust estimator for ranking metrics with post-click conversions. In RecSys, 2020.
  • Saito et al. (2019) Yuta Saito, Hayato Sakata, and Kazuhide Nakata. Doubly robust prediction and evaluation methods improve uplift modeling for observational data. In SIAM, 2019.
  • Saito et al. (2020) Yuta Saito, Suguru Yaginuma, Yuta Nishino, Hayato Sakata, and Kazuhide Nakata. Unbiased recommender learning from missing-not-at-random implicit feedback. In WSDM, 2020.
  • Sato et al. (2019) Masahiro Sato, Janmajay Singh, Sho Takemori, Takashi Sonoda, Qian Zhang, and Tomoko Ohkuma. Uplift-based evaluation and optimization of recommenders. In RecSys, 2019.
  • Sato et al. (2020) Masahiro Sato, Sho Takemori, Janmajay Singh, and Tomoko Ohkuma. Unbiased learning for the causal effect of recommendation. In RecSys, 2020.
  • Schnabel et al. (2016) Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In ICML, 2016.
  • Steck (2013) Harald Steck. Evaluation of recommendations: rating-prediction and ranking. In RecSys, 2013.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Swaminathan & Joachims (2015) Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In NeurIPS, 2015.
  • Tan (2007) Zhiqiang Tan. Comment: understanding or, ps and dr. Statistical Science, 22:560–568, 2007.
  • van der Laan & Rose (2011) Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, 2011.
  • Wang et al. (2019) Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Doubly robust joint learning for recommendation on data missing not at random. In ICML, 2019.
  • Wang et al. (2021) Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Combating selection biases in recommender systems with a few unbiased ratings. In WSDM, 2021.
  • Wang et al. (2020a) Yixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. Causal inference for recommender systems. In RecSys, 2020a.
  • Wang et al. (2020b) Zifeng Wang, Xi Chen, Rui Wen, Shao-Lun Huang, Ercan Kuruoglu, and Yefeng Zheng. Information theoretic counterfactual learning from missing-not-at-random feedback. NeurIPS, 2020b.
  • Wen et al. (2022) Hongyi Wen, Xinyang Yi, Tiansheng Yao, Jiaxi Tang, Lichan Hong, and Ed H. Chi. Distributionally-robust recommendations for improving worst-case user experience. In WWW, 2022.
  • Wu et al. (2022a) Peng Wu, Shasha Han, Xingwei Tong, and Runze Li. Propensity score regression for causal inference with treatment heterogeneity. Statistica Sinica, 2022a.
  • Wu et al. (2022b) Peng Wu, Haoxuan Li, Yuhao Deng, Wenjie Hu, Quanyu Dai, Zhenhua Dong, Jie Sun, Rui Zhang, and Xiao-Hua Zhou. On the opportunity of causal learning in recommendation systems: Foundation, estimation, prediction and challenges. In IJCAI, 2022b.
  • Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38, 2019.
  • Zhang et al. (2020) Wenhao Zhang, Wentian Bao, Xiao-Yang Liu, Keping Yang, Quan Lin, Hong Wen, and Ramin Ramezani. Large-scale causal approaches to debiasing post-click conversion rate estimation with multi-task learning. In WWW, 2020.
  • Zhang et al. (2021) Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui Ling, and Yongdong Zhang. Causal intervention for leveraging popularity bias in recommendation. In SIGIR, 2021.

Appendix

Throughout, following existing studies (Schnabel et al., 2016; Wang et al., 2019; Guo et al., 2021; Dai et al., 2022), we assume that the indicator matrix $\mathcal{O}$ contains independent random variables and each $o_{u,i}$ follows a Bernoulli distribution with probability $p_{u,i}$.

Appendix A Proofs of Theorems

A.1 Proof of Theorem 1

Proof of Theorem 1.

To demonstrate the double robustness of SDR, first note that $P\left(\lim_{|\mathcal{D}|\rightarrow\infty}\mathcal{E}_{SDR}=\mathcal{L}_{ideal}\right)=1$ if the learned propensities are accurate (Swaminathan & Joachims, 2015), since $|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}o_{u,i}/\hat{p}_{u,i}$ converges to 1 almost surely as $|\mathcal{D}|$ goes to infinity and IPS is unbiased. Besides, constraint (1) is constructed to ensure the unbiasedness of $\mathcal{E}_{SDR}$ if the error imputation model is correctly specified. In fact, $\mathcal{E}_{SDR}$ satisfies

$$\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}(e_{u,i}-\mathcal{E}_{SDR})}{\hat{p}_{u,i}}=0.\qquad(4)$$

Combining the constraint (1) and equation (4) gives

$$\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\left[\frac{o_{u,i}(e_{u,i}-\hat{e}_{u,i})}{\hat{p}_{u,i}}+\frac{o_{u,i}(\hat{\mathcal{E}}-\mathcal{E}_{SDR})}{\hat{p}_{u,i}}\right]=0,$$

where the first term equals 0 when the imputation model is correctly specified, which implies that $\mathcal{E}_{SDR}=\hat{\mathcal{E}}$; the unbiasedness of $\mathcal{E}_{SDR}$ then follows immediately from the unbiasedness of $\hat{\mathcal{E}}$.

A.2 Proof of Theorem 2

Proof of Theorem 2.

Equation (3) implies that $\mathcal{E}_{SDR}$ can be expressed as

$$\mathcal{E}_{SDR}=\Big[\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}(e_{u,i}-\hat{e}_{u,i}+\hat{\mathcal{E}})}{\hat{p}_{u,i}}\Big]\Big/\Big[\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}\Big].\qquad(5)$$

For notational simplicity, let

$$w_{u,i}\triangleq o_{u,i}/\hat{p}_{u,i}\quad\text{and}\quad v_{u,i}\triangleq o_{u,i}(e_{u,i}-\hat{e}_{u,i}+\hat{\mathcal{E}})/\hat{p}_{u,i},$$

then $\mathcal{E}_{SDR}$ can be written as a ratio statistic

$$\mathcal{E}_{SDR}=\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}v_{u,i}\Big/\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}w_{u,i}\triangleq f(\bar{v},\bar{w}),$$

where $f(v,w)=v/w$, $\bar{v}=|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}v_{u,i}$, and $\bar{w}=|\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}}w_{u,i}$.

Applying the Taylor expansion around $(\mu_{v},\mu_{w})\triangleq(\mathbb{E}[\bar{v}],\mathbb{E}[\bar{w}])$ yields

$$\begin{aligned}
f(\bar{v},\bar{w})&=f(\mu_{v},\mu_{w})+f_{v}^{\prime}(\mu_{v},\mu_{w})(\bar{v}-\mu_{v})+f_{w}^{\prime}(\mu_{v},\mu_{w})(\bar{w}-\mu_{w})\\
&\quad+\frac{1}{2}\left\{f_{vv}^{\prime\prime}(\mu_{v},\mu_{w})(\bar{v}-\mu_{v})^{2}+2f_{vw}^{\prime\prime}(\mu_{v},\mu_{w})(\bar{v}-\mu_{v})(\bar{w}-\mu_{w})+f_{ww}^{\prime\prime}(\mu_{v},\mu_{w})(\bar{w}-\mu_{w})^{2}\right\}\\
&\quad+R(\tilde{v},\tilde{w}),
\end{aligned}$$

where $R(\tilde{v},\tilde{w})$ is the remainder term. Note that $f_{vv}^{\prime\prime}(\mu_{v},\mu_{w})=0$, $f_{vw}^{\prime\prime}(\mu_{v},\mu_{w})=-1/\mu_{w}^{2}$, and $f_{ww}^{\prime\prime}(\mu_{v},\mu_{w})=2\mu_{v}/\mu_{w}^{3}$; then taking the expectation on both sides of the Taylor expansion leads to

$$\mathbb{E}(\bar{v}/\bar{w})=\frac{\mu_{v}}{\mu_{w}}-\frac{\operatorname{Cov}(\bar{v},\bar{w})}{\mu_{w}^{2}}+\frac{\operatorname{Var}(\bar{w})\mu_{v}}{\mu_{w}^{3}}+\mathbb{E}[R(\tilde{v},\tilde{w})].$$

By some calculations, we have $\operatorname{Cov}(\bar{v},\bar{w})=O(|\mathcal{D}|^{-1})$, $\operatorname{Var}(\bar{w})=O(|\mathcal{D}|^{-1})$, and $\mathbb{E}[R(\tilde{v},\tilde{w})]=o(|\mathcal{D}|^{-1})$. Thus, the bias of $\mathcal{E}_{SDR}$ is given as

$$\operatorname{Bias}(\mathcal{E}_{SDR})=\Biggl|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\left(\delta_{u,i}-\frac{\sum_{(u,i)\in\mathcal{D}}\delta_{u,i}p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}}p_{u,i}/\hat{p}_{u,i}}\right)\Biggr|+O(|\mathcal{D}|^{-1}).$$

A.3 Proof of Theorem 3

Proof of Theorem 3.

According to the proof of Theorem 2, we have

$$\mathbb{E}[\bar{v}/\bar{w}]-\mu_{v}/\mu_{w}=O(|\mathcal{D}|^{-1}).\qquad(6)$$

Then the variance of $\mathcal{E}_{SDR}$ can be decomposed as

$$\begin{aligned}
\operatorname{Var}(\mathcal{E}_{SDR})&=\operatorname{Var}(\bar{v}/\bar{w})=\mathbb{E}\left[\left\{\bar{v}/\bar{w}-\mathbb{E}(\bar{v}/\bar{w})\right\}^{2}\right]\\
&=\mathbb{E}\left[\left\{\bar{v}/\bar{w}-\mu_{v}/\mu_{w}\right\}^{2}-2O(|\mathcal{D}|^{-1})\cdot\left\{\bar{v}/\bar{w}-\mu_{v}/\mu_{w}\right\}+O(|\mathcal{D}|^{-2})\right]\\
&=\mathcal{V}_{1}+\mathcal{V}_{2}+O(|\mathcal{D}|^{-2}),
\end{aligned}$$

where $\mathcal{V}_{1}\triangleq\mathbb{E}[\{\bar{v}/\bar{w}-\mu_{v}/\mu_{w}\}^{2}]$ and $\mathcal{V}_{2}\triangleq-2O(|\mathcal{D}|^{-1})\cdot[\mathbb{E}(\bar{v}/\bar{w})-\mu_{v}/\mu_{w}]$. Equation (6) implies that $\mathcal{V}_{2}=O(|\mathcal{D}|^{-2})$.

Denote $f(v,w)=v/w$; applying the delta method around $(\mu_{v},\mu_{w})\triangleq(\mathbb{E}[\bar{v}],\mathbb{E}[\bar{w}])$ to calculate $\mathcal{V}_{1}$ yields

$$\begin{aligned}
\mathcal{V}_{1}&=\mathbb{E}\left\{\left[f(\mu_{v},\mu_{w})+f_{v}^{\prime}(\mu_{v},\mu_{w})(\bar{v}-\mu_{v})+f_{w}^{\prime}(\mu_{v},\mu_{w})(\bar{w}-\mu_{w})+O_{p}(|\mathcal{D}|^{-1})-f(\mu_{v},\mu_{w})\right]^{2}\right\}\\
&=\mathbb{E}\left\{\left[f_{v}^{\prime}(\mu_{v},\mu_{w})(\bar{v}-\mu_{v})+f_{w}^{\prime}(\mu_{v},\mu_{w})(\bar{w}-\mu_{w})+O_{p}(|\mathcal{D}|^{-1})\right]^{2}\right\}\\
&=\mathbb{E}\left\{f_{v}^{\prime 2}(\mu_{v},\mu_{w})(\bar{v}-\mu_{v})^{2}+2f_{v}^{\prime}(\mu_{v},\mu_{w})f_{w}^{\prime}(\mu_{v},\mu_{w})(\bar{v}-\mu_{v})(\bar{w}-\mu_{w})+f_{w}^{\prime 2}(\mu_{v},\mu_{w})(\bar{w}-\mu_{w})^{2}\right\}+O(|\mathcal{D}|^{-2})\\
&=f_{v}^{\prime 2}(\mu_{v},\mu_{w})\operatorname{Var}(\bar{v})+2f_{v}^{\prime}(\mu_{v},\mu_{w})f_{w}^{\prime}(\mu_{v},\mu_{w})\operatorname{Cov}(\bar{v},\bar{w})+f_{w}^{\prime 2}(\mu_{v},\mu_{w})\operatorname{Var}(\bar{w})+O(|\mathcal{D}|^{-2}).
\end{aligned}$$

Note that $f_{v}^{\prime}(\mu_{v},\mu_{w})=1/\mu_{w}$ and $f_{w}^{\prime}(\mu_{v},\mu_{w})=-\mu_{v}/\mu_{w}^{2}$. Then we have

$$\begin{aligned}
\mathcal{V}_{1}&=\frac{1}{\mu_{w}^{2}}\operatorname{Var}(\bar{v})-\frac{2\mu_{v}}{\mu_{w}^{3}}\operatorname{Cov}(\bar{v},\bar{w})+\frac{\mu_{v}^{2}}{\mu_{w}^{4}}\operatorname{Var}(\bar{w})+O(|\mathcal{D}|^{-2})\\
&=\frac{\mu_{v}^{2}}{\mu_{w}^{2}}\left[\frac{\operatorname{Var}(\bar{v})}{\mu_{v}^{2}}-\frac{2\operatorname{Cov}(\bar{v},\bar{w})}{\mu_{v}\mu_{w}}+\frac{\operatorname{Var}(\bar{w})}{\mu_{w}^{2}}\right]+O(|\mathcal{D}|^{-2})\\
&=\frac{\mathbb{E}\left(\bar{v}-\frac{\mu_{v}}{\mu_{w}}\bar{w}\right)^{2}}{\mu_{w}^{2}}+O(|\mathcal{D}|^{-2})\\
&=\frac{\sum_{(u,i)\in\mathcal{D}}p_{u,i}(1-p_{u,i})h^{2}_{u,i}/\hat{p}^{2}_{u,i}}{\left(\sum_{(u,i)\in\mathcal{D}}p_{u,i}/\hat{p}_{u,i}\right)^{2}}+O(|\mathcal{D}|^{-2}),
\end{aligned}$$

where $h_{u,i}=(e_{u,i}-\hat{e}_{u,i})-\sum_{(u,i)}\{p_{u,i}(e_{u,i}-\hat{e}_{u,i})/\hat{p}_{u,i}\}\big/\sum_{(u,i)}\{p_{u,i}/\hat{p}_{u,i}\}$ is the bounded difference between $e_{u,i}-\hat{e}_{u,i}$ and its propensity-weighted average. That the variance of SDR is bounded for arbitrary propensities follows directly from the self-normalized form of SDR, whose range is contained in $[\delta_{\min},\delta_{\max}]$.
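As a quick numerical sanity check of the leading variance term above, the following Python snippet (our own simulation with hypothetical values, not code from the paper) compares the delta-method formula against the Monte Carlo variance of the self-normalized ratio $\bar{v}/\bar{w}$:

import numpy as np

# Monte Carlo sanity check (hypothetical setup) for the delta-method variance.
rng = np.random.default_rng(0)
D = 2000                                    # number of user-item pairs |D|
p = rng.uniform(0.05, 0.9, size=D)          # true propensities p_{u,i}
p_hat = p * rng.uniform(0.5, 2.0, size=D)   # inaccurate learned propensities
delta = rng.normal(0.3, 0.2, size=D)        # error deviations e_{u,i} - e_hat_{u,i}

# h_{u,i}: delta minus its propensity-weighted average, as defined above
h = delta - np.sum(p * delta / p_hat) / np.sum(p / p_hat)

# leading term of Var(E_SDR) from the last display of the proof
var_delta = np.sum(p * (1 - p) * h**2 / p_hat**2) / np.sum(p / p_hat)**2

# Monte Carlo variance of the self-normalized ratio v_bar / w_bar
ratios = []
for _ in range(5000):
    o = rng.binomial(1, p)                  # observation indicators o_{u,i}
    ratios.append(np.sum(o * delta / p_hat) / np.sum(o / p_hat))
print(var_delta, np.var(ratios))            # the two should closely agree

With $|\mathcal{D}|=2000$, the two printed values should agree up to the $O(|\mathcal{D}|^{-2})$ remainder.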

A.4 Proof of Theorem 4

Proof of Theorem 4.

McDiarmid's inequality states the following: let $X_{1},X_{2},\ldots,X_{n}$ be independent bounded random variables with $X_{i}\in\mathcal{X}_{i}$ for all $i$, and let $f:\mathcal{X}_{1}\times\mathcal{X}_{2}\times\cdots\times\mathcal{X}_{n}\rightarrow\mathbb{R}$ be a mapping. Assume there exist constants $c_{1},c_{2},\ldots,c_{n}$ such that for all $i$,

\[
\sup_{x_{1},\ldots,x_{i-1},x_{i},x_{i}^{\prime},x_{i+1},\ldots,x_{n}}\left|f(x_{1},\ldots,x_{i-1},x_{i},x_{i+1},\ldots,x_{n})-f(x_{1},\ldots,x_{i-1},x_{i}^{\prime},x_{i+1},\ldots,x_{n})\right|\leq c_{i}.
\]

Then, for any $\epsilon>0$,

\[
\mathbb{P}\left(\left|f(X_{1},X_{2},\ldots,X_{n})-\mathbb{E}\left[f(X_{1},X_{2},\ldots,X_{n})\right]\right|\geq\epsilon\right)\leq 2\exp\left(-\frac{2\epsilon^{2}}{\sum_{i=1}^{n}c_{i}^{2}}\right).
\]

In fact, equation (5) implies that the SDR estimator can be written as

\[
\mathcal{E}_{SDR}=\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}(e_{u,i}-\hat{e}_{u,i})}{\hat{p}_{u,i}}\Big/\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}+\hat{\mathcal{E}},
\]

denoted as $f(o_{1,1},\dots,o_{u,i},\dots,o_{U,I})$. Note that

\begin{align*}
&\sup_{o_{u,i},o_{u,i}^{\prime}}\left|f(o_{1,1},\dots,o_{u,i},\dots,o_{U,I})-f(o_{1,1},\dots,o_{u,i}^{\prime},\dots,o_{U,I})\right| \\
&\leq\begin{cases}\delta_{\max}-\dfrac{\delta_{u,i}/\hat{p}_{u,i}+\delta_{\max}\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}}{1/\hat{p}_{u,i}+\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}},&\text{if }\delta_{u,i}\leq(\delta_{\min}+\delta_{\max})/2,\\[2ex]
\dfrac{\delta_{\min}\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}+\delta_{u,i}/\hat{p}_{u,i}}{\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}+1/\hat{p}_{u,i}}-\delta_{\min},&\text{if }\delta_{u,i}>(\delta_{\min}+\delta_{\max})/2,\end{cases}\tag{7}
\end{align*}

where $\mathcal{D}\setminus(u,i)$ denotes the set $\mathcal{D}$ excluding the element $(u,i)$.
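For concreteness, the self-normalized form of $\mathcal{E}_{SDR}$ above can be sketched in a few lines of Python (function and variable names are ours, not from the paper; we assume $\hat{\mathcal{E}}$ denotes the average of the imputed errors over all user-item pairs):

import numpy as np

def sdr_estimate(o, e, e_hat, p_hat):
    # Self-normalized IPS average of the error deviations e - e_hat over the
    # observed entries (o = 1), plus the average imputed error E_hat.
    correction = np.sum(o * (e - e_hat) / p_hat) / np.sum(o / p_hat)
    return correction + np.mean(e_hat)

Because the correction term is a weighted average of $e_{u,i}-\hat{e}_{u,i}$, the estimate always stays within $\hat{\mathcal{E}}+[\delta_{\min},\delta_{\max}]$, which is exactly the boundedness exploited in the McDiarmid argument below.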

Next, we focus on analyzing $\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}$. Hoeffding's inequality states that for independent bounded random variables $X_{1},\ldots,X_{n}$ taking values in intervals of sizes $\rho_{1},\ldots,\rho_{n}$ with probability 1, and for any $\epsilon>0$,

\[
\mathbb{P}\Bigl(\Bigl|\sum_{k}X_{k}-\mathbb{E}\Bigl(\sum_{k}X_{k}\Bigr)\Bigr|\geq\epsilon\Bigr)\leq 2\exp\left(\frac{-2\epsilon^{2}}{\sum_{k}\rho_{k}^{2}}\right).
\]

Since each summand $o_{u,i}/\hat{p}_{u,i}$ takes values in an interval of size $1/\hat{p}_{u,i}$, applying Hoeffding's inequality to $\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}$ gives

\[
\mathbb{P}\Bigl(\Bigl|\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}-\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}\Bigr|\geq\epsilon\Bigr)\leq 2\exp\left(\frac{-2\epsilon^{2}}{\sum_{\mathcal{D}\setminus(u,i)}1/\hat{p}_{u,i}^{2}}\right).
\]

Setting the last term equal to $\eta/2$ and solving for $\epsilon$ shows that, with probability at least $1-\eta/2$, the following inequality holds:

\[
\Bigl|\sum_{\mathcal{D}\setminus(u,i)}o_{u,i}/\hat{p}_{u,i}-\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}\Bigr|\leq\sqrt{\frac{1}{2}\log\frac{4}{\eta}\sum_{\mathcal{D}\setminus(u,i)}\frac{1}{\hat{p}_{u,i}^{2}}}\triangleq\epsilon^{\prime}. \tag{8}
\]

Therefore, combining (7) and (8) yields that, with probability at least $1-\eta/2$,

\begin{align*}
&\sup_{o_{1,1},\dots,o_{u,i},o_{u,i}^{\prime},\dots,o_{U,I}}\left|f(o_{1,1},\dots,o_{u,i},\dots,o_{U,I})-f(o_{1,1},\dots,o_{u,i}^{\prime},\dots,o_{U,I})\right| \\
&\leq\begin{cases}\delta_{\max}-\dfrac{\delta_{u,i}/\hat{p}_{u,i}+\delta_{\max}\bigl(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime}\bigr)}{1/\hat{p}_{u,i}+\bigl(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime}\bigr)},&\text{if }\delta_{u,i}\leq(\delta_{\min}+\delta_{\max})/2,\\[2ex]
\dfrac{\delta_{\min}\bigl(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime}\bigr)+\delta_{u,i}/\hat{p}_{u,i}}{\bigl(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime}\bigr)+1/\hat{p}_{u,i}}-\delta_{\min},&\text{if }\delta_{u,i}>(\delta_{\min}+\delta_{\max})/2,\end{cases} \\
&\leq\begin{cases}(\delta_{\max}-\delta_{u,i})\big/\bigl\{1+\hat{p}_{u,i}\bigl(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime}\bigr)\bigr\},&\text{if }\delta_{u,i}\leq(\delta_{\min}+\delta_{\max})/2,\\[1ex]
(\delta_{u,i}-\delta_{\min})\big/\bigl\{1+\hat{p}_{u,i}\bigl(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime}\bigr)\bigr\},&\text{if }\delta_{u,i}>(\delta_{\min}+\delta_{\max})/2,\end{cases}
\end{align*}

where $\delta_{u,i}=e_{u,i}-\hat{e}_{u,i}$ is the error deviation, $\delta_{\min}=\min_{(u,i)\in\mathcal{D}}\delta_{u,i}$, and $\delta_{\max}=\max_{(u,i)\in\mathcal{D}}\delta_{u,i}$.

Invoking McDiarmid's inequality then gives

\begin{align*}
&\mathbb{P}\left(\left|\mathcal{E}_{SDR}-\mathbb{E}_{\mathcal{O}}(\mathcal{E}_{SDR})\right|\geq\epsilon\right) \\
&\leq 2\exp\Biggl\{-2\epsilon^{2}\Big/\Biggl(\sum_{(u,i):\delta_{u,i}\leq\frac{\delta_{\min}+\delta_{\max}}{2}}\frac{(\delta_{\max}-\delta_{u,i})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}+\sum_{(u,i):\delta_{u,i}>\frac{\delta_{\min}+\delta_{\max}}{2}}\frac{(\delta_{u,i}-\delta_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}\Biggr)\Biggr\} \\
&\leq 2\exp\left(\frac{-2\epsilon^{2}}{\sum_{(u,i)}\{(\delta_{\max}-\delta_{u,i})^{2}+(\delta_{u,i}-\delta_{\min})^{2}\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}\right).
\end{align*}

Setting the last term equal to $\eta/2$, solving for $\epsilon$, and combining with the probability-$(1-\eta/2)$ event in (8) via a union bound completes the proof.
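To make the solved-for bound concrete, here is a small Python helper (our own sketch, not code from the paper; it presumes the true propensities $p_{u,i}$ are known, as in a simulation study) that evaluates the resulting tail bound for a given failure probability $\eta$:

import numpy as np

def sdr_tail_bound(delta, p, p_hat, eta):
    # delta: error deviations e - e_hat over all |D| pairs;
    # p, p_hat: true and learned propensities; eta: failure probability.
    d_min, d_max = delta.min(), delta.max()
    # leave-one-out sums over D \ (u,i), computed in vectorized form
    loo_weight = np.sum(p / p_hat) - p / p_hat
    eps_prime = np.sqrt(0.5 * np.log(4 / eta)
                        * (np.sum(1 / p_hat**2) - 1 / p_hat**2))
    numer = (d_max - delta)**2 + (delta - d_min)**2
    denom = (1 + p_hat * (loo_weight - eps_prime))**2
    # set 2 * exp(-2 eps^2 / S) = eta / 2 and solve for eps
    return np.sqrt(0.5 * np.log(4 / eta) * np.sum(numer / denom))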

A.5 Generalization Bound under Inaccurate Models

Theorem 5 (Generalization Bound under Inaccurate Models).

For any finite hypothesis space of predictions $\mathcal{H}=\{\hat{\mathbf{Y}}^{1},\ldots,\hat{\mathbf{Y}}^{|\mathcal{H}|}\}$, with probability $1-\eta$, the deviation of the true risk $R(\hat{\mathbf{Y}}^{\dagger})$ from the SDR estimator, with imputed errors $\hat{e}_{u,i}$ and learned propensities $\hat{p}_{u,i}$ satisfying stabilization constraint (1), is bounded by

\begin{align*}
R(\hat{\mathbf{Y}}^{\dagger})\leq{}&\hat{\mathcal{E}}_{SDR}(\hat{\mathbf{Y}}^{\dagger})+\underbrace{\Biggl|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\delta_{u,i}^{\dagger}-\frac{\sum_{(u,i)\in\mathcal{D}}\delta_{u,i}^{\dagger}p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}}p_{u,i}/\hat{p}_{u,i}}\Biggr|}_{\text{Bias Term}} \\
&+\underbrace{\sqrt{\frac{1}{2}\log\left(\frac{4|\mathcal{H}|}{\eta}\right)\sum_{(u,i)\in\mathcal{D}}\frac{(\delta_{\max}-\delta_{u,i}^{\S})^{2}+(\delta_{u,i}^{\S}-\delta_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}}}_{\text{Variance Term}},
\end{align*}

where $\delta_{u,i}^{\S}$ is the error deviation corresponding to the prediction model

\[
\hat{\mathbf{Y}}^{\S}=\operatorname*{argmax}_{\hat{\mathbf{Y}}^{h}\in\mathcal{H}}\sum_{(u,i)\in\mathcal{D}}\frac{(\delta_{\max}-\delta_{u,i}^{h})^{2}+(\delta_{u,i}^{h}-\delta_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}.
\]
Proof of Theorem 5.

Theorem 4 shows that for all predictions $\hat{\mathbf{Y}}^{h}\in\mathcal{H}$, we have

\begin{align*}
&P\left(\left|\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})-\mathbb{E}[\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})]\right|\geq\epsilon\right) \\
&\leq 2\exp\left(\frac{-2\epsilon^{2}}{\sum_{(u,i)}\{(\delta_{\max}-\delta_{u,i}^{h})^{2}+(\delta_{u,i}^{h}-\delta_{\min})^{2}\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}\right).
\end{align*}

McDiarmid's inequality and the union bound ensure the following uniform convergence result:

\begin{align*}
&P\left(\left|\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{\dagger})-\mathbb{E}[\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{\dagger})]\right|\leq\epsilon\right)\geq 1-\eta \\
\Leftarrow{}&P\left(\max_{\hat{\mathbf{Y}}^{h}\in\mathcal{H}}\left|\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})-\mathbb{E}[\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})]\right|\leq\epsilon\right)\geq 1-\eta \\
\Leftrightarrow{}&P\left(\bigvee_{\hat{\mathbf{Y}}^{h}\in\mathcal{H}}\left|\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})-\mathbb{E}[\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})]\right|\geq\epsilon\right)<\eta \\
\Leftarrow{}&\sum_{h=1}^{|\mathcal{H}|}P\left(\left|\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})-\mathbb{E}[\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{h})]\right|\geq\epsilon\right)<\eta \\
\Leftarrow{}&\sum_{h=1}^{|\mathcal{H}|}2\exp\left(\frac{-2\epsilon^{2}}{\sum_{(u,i)}\{(\delta_{\max}-\delta_{u,i}^{h})^{2}+(\delta_{u,i}^{h}-\delta_{\min})^{2}\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}\right)<\eta \\
\Leftarrow{}&|\mathcal{H}|\cdot 2\exp\left(\frac{-2\epsilon^{2}}{\sum_{(u,i)}\{(\delta_{\max}-\delta_{u,i}^{\S})^{2}+(\delta_{u,i}^{\S}-\delta_{\min})^{2}\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}\right)<\eta.
\end{align*}

Solving the last inequality for $\epsilon$, we conclude that, with probability $1-\eta$, the following inequality holds:

\[
\mathbb{E}[\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{\dagger})]-\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{\dagger})\leq\sqrt{\frac{1}{2}\log\left(\frac{4|\mathcal{H}|}{\eta}\right)\sum_{(u,i)\in\mathcal{D}}\frac{(\delta_{\max}-\delta_{u,i}^{\S})^{2}+(\delta_{u,i}^{\S}-\delta_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}}.
\]

Theorem 2 shows that for the optimal prediction model $\hat{\mathbf{Y}}^{\dagger}$, the following inequality holds:

\[
R(\hat{\mathbf{Y}}^{\dagger})-\mathbb{E}[\mathcal{E}_{SDR}(\hat{\mathbf{Y}}^{\dagger})]\leq\Biggl|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\delta_{u,i}^{\dagger}-\frac{\sum_{(u,i)\in\mathcal{D}}\delta_{u,i}^{\dagger}p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}}p_{u,i}/\hat{p}_{u,i}}\Biggr|.
\]

The stated result is obtained by adding the two inequalities above.
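For a finite hypothesis space, the variance term of Theorem 5 can be evaluated in the same way. The following hedged Python sketch (our own names, not code from the paper; each hypothesis is summarized by its vector of error deviations over $\mathcal{D}$, and $\delta_{\min},\delta_{\max}$ are taken per hypothesis as in Theorem 4) selects the worst-case model $\hat{\mathbf{Y}}^{\S}$ and evaluates the bound:

import numpy as np

def variance_term(deltas, p, p_hat, eta):
    # deltas: one error-deviation vector per hypothesis in H
    n_hyp = len(deltas)
    loo_weight = np.sum(p / p_hat) - p / p_hat
    eps_prime = np.sqrt(0.5 * np.log(4 / eta)
                        * (np.sum(1 / p_hat**2) - 1 / p_hat**2))

    def score(d):
        # summed squared-range term in the bound for one hypothesis
        numer = (d.max() - d)**2 + (d - d.min())**2
        return np.sum(numer / (1 + p_hat * (loo_weight - eps_prime))**2)

    worst = max(deltas, key=score)          # the worst-case model Y^section-sign
    return np.sqrt(0.5 * np.log(4 * n_hyp / eta) * score(worst))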

Appendix B Further Theoretical Analysis of SDR

Without loss of generality, we assume

\[
\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}(\hat{e}_{u,i}-\hat{\mathcal{E}})=\lambda,\quad\lambda\neq 0.
\]

In this case, the learned propensities must be inaccurate; otherwise, constraint (1) would hold naturally as the sample size increases. Thus, if the imputed errors are accurate, then $\mathcal{L}_{ideal}=\hat{\mathcal{E}}$. By exactly the same argument as for equation (3), we have

\[
\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}(\hat{\mathcal{E}}-\mathcal{E}_{SDR})=\lambda,
\]

which implies that

\[
\mathcal{E}_{SDR}=\mathcal{L}_{ideal}-\lambda\Big/\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}}.
\]

This means that the degree of violation of constraint (1) determines the size of the bias of SDR.
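In particular, taking absolute values in the identity above makes this quantitative:
\[
\bigl|\mathcal{E}_{SDR}-\mathcal{L}_{ideal}\bigr|=|\lambda|\Big/\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\frac{o_{u,i}}{\hat{p}_{u,i}},
\]
so the deviation of SDR from the ideal loss scales linearly in $|\lambda|$ and vanishes exactly when constraint (1) is satisfied.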

Furthermore, we can compute the bias, variance, tail bound, and generalization error bound of SDR. Specifically, if both the learned propensities and imputed errors are inaccurate, constraint (3) does not hold either. Then the bias of SDR is

\[
\operatorname{Bias}(\mathcal{E}_{SDR})=\Biggl|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}}\left(e_{u,i}-\frac{\sum_{(u,i)\in\mathcal{D}}e_{u,i}p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}}p_{u,i}/\hat{p}_{u,i}}\right)\Biggr|+O(|\mathcal{D}|^{-1}),
\]

the variance of SDR becomes

\[
\operatorname{Var}(\mathcal{E}_{SDR})=\frac{\sum_{(u,i)}p_{u,i}(1-p_{u,i})\tilde{h}_{u,i}^{2}/\hat{p}_{u,i}^{2}}{\left(\sum_{(u,i)}p_{u,i}/\hat{p}_{u,i}\right)^{2}}+O(|\mathcal{D}|^{-2}),
\]

where $\tilde{h}_{u,i}=e_{u,i}-\sum_{(u,i)\in\mathcal{D}}\{p_{u,i}e_{u,i}/\hat{p}_{u,i}\}\big/\sum_{(u,i)\in\mathcal{D}}\{p_{u,i}/\hat{p}_{u,i}\}$. The tail bound of SDR is given as

\[
\left|\mathcal{E}_{SDR}-\mathbb{E}_{\mathcal{O}}(\mathcal{E}_{SDR})\right|\leq\sqrt{\frac{1}{2}\log\left(\frac{4}{\eta}\right)\sum_{(u,i)\in\mathcal{D}}\frac{(e_{\max}-e_{u,i})^{2}+(e_{u,i}-e_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}},
\]

where $e_{\min}=\min_{(u,i)\in\mathcal{D}}e_{u,i}$, $e_{\max}=\max_{(u,i)\in\mathcal{D}}e_{u,i}$, $\epsilon^{\prime}=\sqrt{\log(4/\eta)/2\cdot\sum_{\mathcal{D}\setminus(u,i)}1/\hat{p}_{u,i}^{2}}$, and $\mathcal{D}\setminus(u,i)$ denotes the set $\mathcal{D}$ excluding the element $(u,i)$. In addition, we can derive the generalization error bound of SDR. Given a finite hypothesis space $\mathcal{H}$ of prediction models, for any prediction model $h\in\mathcal{H}$, with probability $1-\eta$, the deviation of the true risk $R(h)$ from the SDR estimator is bounded by

\[
R(h)\leq\hat{\mathcal{E}}_{SDR}(h)+\operatorname{Bias}(\mathcal{E}_{SDR})+\sqrt{\frac{1}{2}\log\left(\frac{4|\mathcal{H}|}{\eta}\right)\sum_{(u,i)\in\mathcal{D}}\frac{(e_{\max}-e_{u,i}^{\S})^{2}+(e_{u,i}^{\S}-e_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}},
\]

where $e_{u,i}^{\S}$ is the error corresponding to the prediction model

\[
h^{\S}=\operatorname*{argmax}_{h\in\mathcal{H}}\sum_{(u,i)\in\mathcal{D}}\frac{(e_{\max}-e_{u,i}^{h})^{2}+(e_{u,i}^{h}-e_{\min})^{2}}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)}p_{u,i}/\hat{p}_{u,i}-\epsilon^{\prime})\}^{2}}.
\]