
Robust Bayesian Inference for Big Data: Combining Sensor-based Records with Traditional Survey Data

Ali Rafei, Carol A. C. Flannagan, Brady T. West, and Michael R. Elliott (corresponding author; address: 426 Thompson St., Ann Arbor, MI 48109, Rm 4068 ISR; email: mrelliot@umich.edu)
Survey Methodology Program, University of Michigan; University of Michigan Transportation Research Institute
Abstract

Big Data often presents as massive non-probability samples. Not only is the selection mechanism often unknown, but the larger data volume amplifies the relative contribution of selection bias to total error. Existing bias adjustment approaches assume that the conditional mean structures have been correctly specified for the selection indicator or key substantive measures. In the presence of a reference probability sample, these methods rely on a parametric pseudo-likelihood approach to account for the sampling weights of the reference sample. Under a Bayesian framework, handling the sampling weights is an even bigger hurdle. To further protect against model misspecification, we expand the idea of double robustness such that more flexible non-parametric methods, as well as Bayesian models, can be used for prediction. In particular, we employ Bayesian additive regression trees, which not only capture non-linear associations automatically but also permit direct quantification of the uncertainty of point estimates through their posterior predictive draws. We apply our method to sensor-based naturalistic driving data from the second Strategic Highway Research Program, using the 2017 National Household Travel Survey as a benchmark.

Keywords: Big Data, non-probability sample, quasi-randomization, prediction model, doubly robust, augmented inverse propensity weighting, Bayesian additive regression trees

1 Introduction

The 21st century is witnessing a re-emergence of non-probability sampling in various domains (Murdoch and Detsky, 2013; Daas et al., 2015; Lane, 2016; Senthilkumar et al., 2018). Probability sampling is facing new challenges, mainly because of a steady decline in response rates (Groves, 2011; Johnson and Smith, 2017; Miller, 2017). At the same time, new modes of data collection via sensors, web portals, and smart devices have emerged that routinely capture a variety of human activities. These automated processes have led to an ever-accumulating massive volume of unstructured information, so-called “Big Data” (Couper, 2013; Kreuter and Peng, 2014; Japec et al., 2015). Being easy to access, inexpensive to collect, and highly detailed, this broad range of data is perceived to be valuable for producing official statistics as an alternative or supplement to probability surveys (Struijs et al., 2014; Kitchin, 2015). However, “Big Data” typically have a self-selecting data-generating process, which can lead to biased estimates. When this is the case, the larger data volume in the non-probability sample increases the relative contribution of selection bias to absolute or squared error. Meng et al. (2018) call this phenomenon a “Big Data Paradox”, and these authors show both theoretically and empirically that the impact of selection bias on the effective sample size can be extremely large.

The motivating application in this article comes from naturalistic driving studies (NDS), which are one real-world example of Big Data for rare event investigations. Since traffic collisions are inherently rare events, measuring pre-crash behaviors accurately, as well as exposure frequency in normal driving, demands long-term follow-up of the population of drivers. Thus, NDS are designed to continually monitor drivers' behavior via in-vehicle sensors, cameras, and advanced wireless technologies (Guo et al., 2009). The detailed information collected by NDS is considered a rich resource for assessing various aspects of transportation, such as traffic safety, crash causality, and travel patterns (Huisingh et al., 2019; Tan et al., 2017). In particular, we consider the sensor-based Big Data from the second phase of the Strategic Highway Research Program (SHRP2), which is the largest NDS conducted to date. This study recruited a convenience sample from geographically restricted regions (6 US states: Florida, Indiana, New York, North Carolina, Pennsylvania, and Washington) and attempted to oversample both younger and older drivers, leading to potential selection bias in the sample mean of some trip-related variables (Antin et al., 2015). To deal with this, we employ the 2017 National Household Travel Survey (NHTS) as a "reference survey", which can serve as a probability sample representing the population of American drivers (Santos et al., 2011). While daily trip measures in SHRP2 are recorded via sensors, NHTS asks respondents to self-report their trip measures through an online travel log. By analyzing the aggregated data at the day level, we develop adjusted sensor-based estimates from SHRP2 for measures such as frequency of trips, trip duration, trip speed, and starting time of trip that can be compared with self-reported weighted estimates in NHTS to assess the performance of our proposed methods in terms of bias and efficiency, as well as estimates of maximum speed, brake use per mile driven, and stop time that are available only in SHRP2.

Standard design-based approaches cannot be applied to non-probability samples for the simple reason that the probabilities of selection are unknown (Chen et al., 2020). Thus the American Association for Public Opinion Research (AAPOR) task force on non-probability samples recommends that adjustment methods should rely on models and external auxiliary information (Baker et al., 2013). In the presence of a relevant probability sample with a set of common auxiliary variables, which is often called a “reference survey”, two general approaches can be taken: (1) prediction modeling (PM)–fitting models on the non-probability sample to predict the response variable for units in the reference survey (Rivers, 2007; Kim and Rao, 2012; Wang et al., 2015; Kim et al., 2018), and (2) quasi-randomization (QR)–estimating the probabilities of being included in the non-probability sample, also known as propensity scores (PS), while treating the Big Data as a quasi-random sample (Lee, 2006; Lee and Valliant, 2009; Valliant and Dever, 2011). Our focus is on the PM setting, since in our application the key measures of interest are not available in the probability (reference) survey, and our goal is to use prediction to impute them.

Correct specification of the model predicting the outcome is critical for imputation. To help relax this assumption, the PM approach can be combined with the QR method, in a way that the adjusted estimate of a population quantity is consistent if either the propensity or the outcome model holds. Augmented inverse propensity weighting (AIPW) was the first of these so-called “doubly-robust” (DR) methods (Robins et al., 1994), with applications to causal inference (Scharfstein et al., 1999; Bang and Robins, 2005; Tan, 2006; Kang et al., 2007; Tan et al., 2019) and adjustment for non-response bias (Kott, 1994; Kim and Park, 2006; Kott, 2006; Haziza and Rao, 2006; Kott and Chang, 2010). Further extension to multiple robustness has been developed by Han and Wang (2013), where multiple models are specified and consistency is obtained as long as at least one of the models is correctly specified. Chen et al. (2020) offer further adjustments to adapt the AIPW estimator to a non-probability sampling setting where an external benchmark survey is available. While their method employs a modified pseudo-likelihood approach to estimate the selection probabilities for the non-probability sample, a parametric model is used to impute the outcome for units of the reference survey.

In a non-probability sample setting, Rafei et al. (2020) utilized Bayesian additive regression trees (BART) in the QR approach outlined in Elliott and Valliant (2017). In this paper, we extend Rafei et al. (2020) in two major ways: first, by blending the QR and PM approaches into a novel DR method that is made even more robust by using BART, which provides a strong non-parametric predictive tool that automatically captures non-linear associations as well as high-order interactions (Chipman et al., 2007). The proposed method separates the propensity model from the sampling weights in a two-step process, allowing a broader range of models to be used for imputing the missing inclusion probabilities. This allows us to consider both parametric (linear and generalized linear models) and non-parametric (BART) models for both the propensity and the outcome. In addition, the posterior predictive distribution produced by BART makes it easier to quantify the uncertainty due to the imputation of pseudo-weights and the outcome variable (Tan et al., 2019; Kaplan and Chen, 2012). Second, we derive asymptotic variance estimators for the QR estimators previously proposed in Rafei et al. (2020), as well as proofs of the double robustness of the proposed DR estimators.

The rest of the article is organized as follows. In Section 2, we develop the theoretical background behind the proposed methods and associated variance estimators. A simulation study is designed in Section 3 to assess the repeated sampling properties of the proposed estimator, i.e. bias and efficiency. Section 4 uses the NHTS to develop adjusted estimates from the SHRP2 using the methods discussed and developed in the previous sections. All the statistical analyses in both the simulations and empirical studies have been performed using R version 3.6.1. The annotated R code is available for public use at https://github.com/arafei/drinfNP. Finally, Section 5 reviews the strengths and weaknesses of the study in more detail and suggests some future research directions.

2 Methods

2.1 Notation

Denote by $U$ a finite population of size $N<\infty$. We consider the values of a scalar outcome variable, $y_i$, $i=1,2,\ldots,N$, and $x_i=[1,x_{i1},x_{i2},\ldots,x_{ip}]$, the values of a $p$-dimensional vector of relevant auxiliary variables, $X$. Let $S_B$ be a non-probability sample of size $n_B$ selected from $U$. The goal is to estimate an unknown finite population quantity, $Q(Y)$. Here, the quantity of interest is the finite population mean of the outcome variable, i.e. $Q(Y)=\overline{y}_U=\sum_{i=1}^{N}y_i/N$. Suppose $\delta^B_i=I(i\in S_B)$ $(i=1,\ldots,N)$ is the inclusion indicator variable of the "big data" sample $S_B$ of size $n_B$ in $U$. Further, we initially assume that, given $X$, elements in $S_B$ are independent draws from $U$; later, we relax this assumption by considering $S_B$ to have a single-stage clustering design, as is the case in the real-data application of this article.

Suppose $S_R$ is a parallel reference survey of size $n_R$, for which the same set of covariates, $X$, has been measured. We also define $d_i=[1,d_{i1},d_{i2},\ldots,d_{iq}]$, the values of a $q$-dimensional vector of design variables for the reference survey. We assume $y_i$ is unobserved for $i\in S_R$; otherwise, inference could be drawn directly from $S_R$. Also, let $\delta^R_i=I(i\in S_R)$ denote the inclusion indicator variable associated with $S_R$ for $i\in U$. To avoid unnecessary complexity, we assume that units of $S_R$ are selected independently. Being a full probability sample implies that the selection mechanism of $S_R$ is ignorable given its design features, i.e. $f(\delta^R_i|y_i,d_i)=f(\delta^R_i|d_i)$ for $i\in U$. Thus, one can define the selection probabilities and sampling weights in $S_R$ as $\pi^R_i=p(\delta^R_i=1|d_i)$ and $w^R_i=1/\pi^R_i$, respectively, for $i\in U$, which we assume are known.

While $X$ and $D$ may overlap or be correlated, we define $x^*_i=[x_i,d_i]^T$, the $(p+q)$-dimensional vector of all auxiliary variables associated with $S_B$ and $S_R$. To be able to draw unbiased inference from $S_B$, we consider the following assumptions for $S_B$:

  1. C1. Positivity: $S_B$ does in fact have a probabilistic sampling mechanism, albeit an unknown one; that is, $p(\delta^B_i=1|x_i)>0$ for all possible values of $x_i$ in $U$.

  2. C2. Ignorability: the selection mechanism of $S_B$ is fully governed by $X$, which implies $Y \perp \delta^B \mid X$. Then, for $i\in U$, the unknown "pseudo-inclusion" probability associated with $S_B$ is defined as $\pi^B_i=p(\delta^B_i=1|x_i)$.

  3. C3. Independence: conditional on $X^*$, $S_R$ and $S_B$ are selected independently, i.e. $\delta^B \perp \delta^R \mid X^*$.

Note that the first two assumptions are collectively called "strong ignorability" by Rosenbaum and Rubin (1983). Given C1-C3, the joint density of $y_i$, $\delta^B_i$ and $\delta^R_i$ can be factorized as follows:

$f(y_i,\delta^B_i,\delta^R_i|x^*_i;\theta,\beta,\gamma) = f(y_i|x^*_i;\theta)\, f(\delta^B_i|x_i;\beta)\, f(\delta^R_i|d_i;\gamma), \quad \forall i\in U$   (2.1)

where $\Psi=(\theta,\beta,\gamma)$ denotes the distributional parameters. While $\theta$ and $\beta$ are unknown, $\gamma$ may be known, since $S_R$ is a probability sample. A QR approach involves modeling $f(\delta^B_i|x^*_i;\beta)$, whereas a PM approach involves modeling $f(y_i|x^*_i;\theta)$.

Now suppose $S_B$ and $S_R$ have trivial overlap, i.e. $p(\delta^B_i+\delta^R_i=2)\approx 0$. This assumption is reasonable when the sampling fraction of both samples is small. Note that under the ignorability assumption, the propensity model for $S_B$ depends on $X$ observed for the entire population. Thus, given the combined sample, $S=S_B\cup S_R$, with total sample size $n=n_B+n_R$, it is reasonable to expect that the pseudo-inclusion probabilities, $\pi^B_i$, are a function of both $x_i$ and $d_i$ for $i\in S$. Let $z_i=I(i\in S_B|\delta_i=1)$ be the indicator that subject $i$ of the combined sample belongs to the non-probability sample, where $\delta_i=\delta^B_i+\delta^R_i$. Note that since $S_B\cap S_R=\emptyset$, $\delta_i$ can take the value 0 or 1 as follows:

$\delta_i = \begin{cases} 0, & \text{if } \delta^R_i=0 \text{ and } \delta^B_i=0 \\ 1, & \text{if } \delta^R_i=1 \text{ or } \delta^B_i=1 \end{cases}$

Figure 1 illustrates the data structure in both the finite population and the combined sample.

Figure 1: Data structure in the population and the combined sample

2.2 Quasi-randomization

In QR, the non-probability sample is treated as if its self-selection mechanism mimics a stochastic process with unknown selection probabilities. Attempts are then made to estimate these missing quantities in $S_B$ based on external information. Conditional on $x^*_i$, suppose $\pi^B_i$ follows a logistic regression model in the finite population. We have

$\pi^B(x_i;\beta) = p(\delta^B_i=1|x_i;\beta) = \dfrac{\exp\{\beta^T_0 x_i\}}{1+\exp\{\beta^T_0 x_i\}}, \quad \forall i\in U$   (2.2)

where $\beta$ is a vector of model parameters in $U$. Using a modified pseudo-maximum likelihood estimation (PMLE) approach, Chen et al. (2020) demonstrate that, given $S$, a consistent estimate of $\beta$ can be obtained by solving the following estimating equation with respect to $\beta$:

$U(\beta) = \sum_{i=1}^{n_B} x_i - \sum_{i=1}^{n_R} \pi^B(x_i;\beta)\, x_i / \pi^R_i = 0$   (2.3)

The estimates of the $\pi^B_i$'s, which we also call propensity scores (PS), are obtained by plugging the solution of Eq. 2.3, i.e. $\widehat{\beta}$, into Eq. 2.2. It is important to note that the PS estimator proposed by Chen et al. (2020) depends implicitly on $d_i$ in addition to $x_i$, because $\pi^R_i$ is a deterministic function of $d_i$ for $i\in U$. Under certain regularity conditions, the authors show that the inverse PS weighted (IPSW) mean from $S_B$ yields a consistent and asymptotically unbiased estimate of the population mean.
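As an illustration of Eq. 2.3, the following minimal R sketch solves the PMLE score by Newton-Raphson. The function name and inputs (design matrices `X_B` and `X_R` with intercepts, and reference weights `w_R` equal to $1/\pi^R_i$) are hypothetical placeholders, not the published code.

```r
# Minimal sketch: Newton-Raphson solution of the PMLE score in Eq. 2.3.
# X_B, X_R: model matrices for S_B and S_R (including the intercept);
# w_R: reference-sample weights 1/pi_R. All names are illustrative.
pmle_beta <- function(X_B, X_R, w_R, tol = 1e-8, maxit = 100) {
  beta <- rep(0, ncol(X_B))
  for (it in seq_len(maxit)) {
    p_R  <- plogis(drop(X_R %*% beta))                    # pi^B(x_i; beta) on S_R
    U    <- colSums(X_B) - colSums(w_R * p_R * X_R)       # score U(beta), Eq. 2.3
    H    <- -crossprod(X_R, w_R * p_R * (1 - p_R) * X_R)  # Jacobian of U(beta)
    step <- solve(H, U)
    beta <- beta - step
    if (max(abs(step)) < tol) break
  }
  beta
}
```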

Obviously, the solutions of Eq. 2.3 are not a typical output of the logistic regression procedures in standard software. With one additional assumption, mutual exclusiveness of the two samples, i.e. $S_B\cap S_R=\emptyset$, we show that estimating the $\pi^B_i$'s reduces to modeling $Z_i$ for $i\in S$ instead of modeling $\delta^B_i$ for $i\in U$. Intuitively, one can view the selection of the $i$-th population unit into $S_B$ as first being selected into the joint sample ($\delta_i=1$) and then being selected into $S_B$ given the combined sample ($Z_i=1$). Conditioning on $x^*_i$, the selection probabilities in $S_B$ factorize as

$p(\delta^B_i=1|x^*_i) = p(\delta^B_i=1,\delta_i=1|x^*_i)$   (2.4)
$\qquad = p(\delta^B_i=1|\delta_i=1,x^*_i)\, p(\delta_i=1|x^*_i)$
$\qquad = p(Z_i=1|x^*_i)\, p(\delta_i=1|x^*_i), \quad i\in S$

Note that the last expression in Eq. 2.4 follows from the definition of $Z_i$ given $S$. The same factorization can be derived for the selection probabilities in $S_R$. Thus, we have

$p(\delta^R_i=1|x^*_i) = p(Z_i=0|x^*_i)\, p(\delta_i=1|x^*_i)$   (2.5)

Dividing Eq. 2.4 by Eq. 2.5 eliminates $p(\delta_i=1|x^*_i)$ and yields the pseudo-selection probabilities in $S_B$:

$p(\delta^B_i=1|x^*_i) = p(\delta^R_i=1|x^*_i)\, \dfrac{p(Z_i=1|x^*_i)}{p(Z_i=0|x^*_i)}$   (2.6)

It is clear that $p(\delta^R_i=1|x^*_i)=\pi^R_i$, since $x^*_i$ contains $d_i$ and the sampling design of $S_R$ is known given $d_i$. As will be seen in Section 2.4, conditioning on $d_i$ is vital for the DR estimator, as Chen's method is limited to situations where the dimension of the auxiliary variables must be the same in QR and PM.

Note that Eq. 2.6 is identical to the pseudo-weighting formula that Elliott and Valliant (2017) derive for a non-probability sample. Unlike the PMLE approach, modeling $Z_i$ in $S$ can be performed using standard binary logistic regression or any alternative classification method, such as supervised machine learning algorithms. Under a logistic regression model, we have

$p(Z_i=1|x^*_i) = \dfrac{\exp\{\beta^T_1 x^*_i\}}{1+\exp\{\beta^T_1 x^*_i\}}$   (2.7)

where $\beta_1$ denotes the vector of model parameters, estimated via maximum likelihood estimation (MLE). Hence, in situations where $\pi^R_i$ is known or can be calculated for $i\in S_B$, the estimate of $\pi^B_i$ for $i\in S_B$ is given by

$\widehat{\pi}^B_i = \pi^R_i \exp\{\widehat{\beta}^T_1 x^*_i\} = \pi^R_i\, \dfrac{p_i(\widehat{\beta}_1)}{1-p_i(\widehat{\beta}_1)}$   (2.8)

where $\widehat{\beta}_1$ denotes the MLE of the logistic regression model parameters, and $p_i(\widehat{\beta}_1)$ is shorthand for $p(Z_i=1|x^*_i;\widehat{\beta}_1)$. Intuitively, the first factor in Eq. 2.8 treats $S_B$ as if it were selected under the design of $S_R$, and the second factor balances the distribution of $x$ in $S_B$ against that in $S_R$.

With $\pi^B_i$ estimated via Eq. 2.8 for all $i\in S_B$, one can construct the Hájek-type pseudo-weighted estimator of the finite population mean:

$\widehat{\overline{y}}_{PAPW} = \dfrac{1}{\widehat{N}_B} \sum_{i=1}^{n_B} \dfrac{y_i}{\widehat{\pi}^B_i}$   (2.9)

where $\widehat{N}_B=\sum_{i=1}^{n_B} 1/\widehat{\pi}^B_i$. Hereafter, we refer to the estimator in Eq. 2.9 as propensity-adjusted probability weighting (PAPW). Under mild regularity conditions, the ignorability assumption for $S_B$ given $x$, the logistic regression model, and the additional assumption that $S_B\cap S_R=\emptyset$, Appendix 8.1 shows that this estimator is consistent and asymptotically unbiased for $\overline{y}_U$. Further, when $\pi^R_i$ is known, a sandwich-type variance estimator for $\widehat{\overline{y}}_{PAPW}$ is given by

$\widehat{Var}\left(\widehat{\overline{y}}_{PAPW}\right) = \dfrac{1}{N^2}\sum_{i=1}^{n_B} \big\{1-\widehat{\pi}^B_i\big\} \left(\dfrac{y_i-\widehat{\overline{y}}_{PAPW}}{\widehat{\pi}^B_i}\right)^2 - \dfrac{2\,\widehat{b}^T}{N^2}\sum_{i=1}^{n_B} \big\{1-p_i(\widehat{\beta}_1)\big\} \left(\dfrac{y_i-\widehat{\overline{y}}_{PAPW}}{\widehat{\pi}^B_i}\right) x^*_i + \widehat{b}^T \left[\dfrac{1}{N^2}\sum_{i=1}^{n} p_i(\widehat{\beta}_1)\, x^*_i x^{*T}_i\right] \widehat{b}$   (2.10)

where

$\widehat{b}^T = \left\{\dfrac{1}{N}\sum_{i=1}^{n_B} \left(\dfrac{y_i-\widehat{\overline{y}}_{PAPW}}{\widehat{\pi}^B_i}\right) x^{*T}_i\right\} \left\{\dfrac{1}{N}\sum_{i=1}^{n} p_i(\widehat{\beta}_1)\, x^*_i x^{*T}_i\right\}^{-1}$   (2.11)

and $\widehat{\pi}^B_i$ is the estimated pseudo-selection probability based on Eq. 2.8 for $i\in S_B$. See Appendix 8.1 for the derivation.
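To fix ideas, a minimal R sketch of the PAPW steps in Eqs. 2.7-2.9 follows; the stacked data frame `dat` (with indicator `z`, illustrative covariates `x1` and `x2`, outcome `y`, and known `pi_R`) is a hypothetical construction.

```r
# Minimal PAPW sketch (Eqs. 2.7-2.9), assuming pi_R is known for units of S_B.
# dat stacks S_B (z = 1) and S_R (z = 0); x1, x2 stand in for x*.
fit_z  <- glm(z ~ x1 + x2, family = binomial, data = dat)  # p(Z = 1 | x*), Eq. 2.7
p_hat  <- predict(fit_z, type = "response")[dat$z == 1]
pi_B   <- dat$pi_R[dat$z == 1] * p_hat / (1 - p_hat)       # pseudo-probabilities, Eq. 2.8
y_B    <- dat$y[dat$z == 1]
y_papw <- sum(y_B / pi_B) / sum(1 / pi_B)                  # Hajek mean, Eq. 2.9
```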

In situations where $\pi^R_i$ is unknown for $i\in S_B$, Elliott and Valliant (2017) suggest predicting this quantity for units of the non-probability sample. Note that, in this situation, it is no longer required to condition on $d_i$ in addition to $x_i$. Treating $\pi^R_i$ as a random variable for $i\in S_B$ conditional on $x_i$, one can obtain this quantity by regressing the $\pi^R_i$'s on the $x_i$'s in the reference survey. We have

$p(\delta^R_i=1|x_i) = \int_0^1 p(\delta^R_i=1|\pi^R_i,x_i)\, p(\pi^R_i|x_i)\, d\pi^R_i$   (2.12)
$\qquad = \int_0^1 \pi^R_i\, p(\pi^R_i|x_i)\, d\pi^R_i$
$\qquad = E(\pi^R_i|x_i), \quad i\in S_R.$

Since this outcome is continuous and bounded within $(0,1)$, fitting a Beta regression model is recommended (Ferrari and Cribari-Neto, 2004). Note that $\pi^R_i$ is fixed given $d_i$, as $S_R$ is a probability sample, but conditional on $x_i$ alone, $\pi^R_i$ can be regarded as a random variable.

Rafei et al. (2020) call this approach propensity-adjusted probability prediction (PAPP). This two-step derivation of pseudo-inclusion probabilities is especially useful, as it computationally separates the sampling weights of $S_R$ from the propensity model. When the true model is unknown, this feature enables us to fit a broader and more flexible range of models, such as algorithmic tree-based methods. It is worth noting that modeling $E(\pi^R_i|x_i)$ does not impose an additional ignorability assumption on $S_R$ given $x$: in the extreme case where $\delta^R_i \perp x_i$, the weighted and unweighted distributions of $x$ are identical in $S_R$, and the $\pi^R_i$'s can therefore be safely ignored in propensity modeling.
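A minimal R sketch of PAPP follows, assuming the betareg package for the Beta regression step; the data frames `ref` (for $S_R$), `nps` (for $S_B$), and `dat` (their stack) and their columns are illustrative assumptions.

```r
# Minimal PAPP sketch: impute pi_R for S_B via Beta regression (Eq. 2.12),
# then combine with the propensity ratio as in Eq. 2.8.
library(betareg)
fit_pi   <- betareg(pi_R ~ x1 + x2, data = ref)              # E(pi_R | x) on S_R
pi_R_hat <- predict(fit_pi, newdata = nps, type = "response")
fit_z    <- glm(z ~ x1 + x2, family = binomial, data = dat)  # combined-sample propensity
p_hat    <- predict(fit_z, newdata = nps, type = "response")
pi_B     <- pi_R_hat * p_hat / (1 - p_hat)                   # PAPP pseudo-probabilities
y_papp   <- sum(nps$y / pi_B) / sum(1 / pi_B)                # Hajek mean
```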

2.3 Prediction modeling approach

An alternative approach for dealing with selectivity in Big Data is to model $f(y|x^*)$ (Smith, 1983). In a fully model-based fashion, this essentially involves imputing $y$ for the non-sampled population units, $U-S_B$. When $x^*$ is unobserved for non-sampled units, it is recommended that a synthetic population be generated by undoing the selection mechanism of $S_R$ through a non-parametric Bayesian bootstrap method using the design variables in $S_R$ (Dong et al., 2014; Zangeneh and Little, 2015). In the non-probability sample context, Elliott and Valliant (2017) propose an extension of the general regression estimator (GREG) when only summary information about $x^*$, such as totals, is known for $U$. In situations where an external probability sample with $x^*$ measured is available, an alternative is to limit the outcome prediction to the units in $S_R$ and then use design-based approaches to estimate the population quantity (Rivers, 2007; Kim et al., 2018).

However, to the best of our knowledge, none of the prior literature distinguishes the role of $D$ from that of $X$ in the conditional mean structure of the outcome, while the likelihood factorization in Eq. 2.1 indicates that predicting $y$ requires conditioning not only on $x$ but also on $d$. Suppose $U$ is a realization of a repeated random sampling process from a super-population under the following model:

$E(y_i|x^*_i;\theta) = m(x^*_i;\theta), \quad \forall i\in U$   (2.13)

where $m(x^*_i;\theta)$ can be either a parametric model, with $m$ a continuous differentiable function, or an unspecified non-parametric form. Under the ignorability condition, where

$f(y_i|x^*_i,z_i=1;\theta) = f(y_i|x^*_i,z_i=0;\theta) = f(y_i|x_i,d_i;\theta)$   (2.14)

an MLE of $\theta$ can be obtained by regressing $Y$ on $X^*$ in $S_B$. The predictions for units in $S_R$ are then given by

$\widehat{y}_i = E(y_i|x^*_i,z_i=0;\widehat{\theta}) = m(x^*_i;\widehat{\theta}), \quad \forall i\in S_R$   (2.15)

where, under a linear working model, $m(x^*_i;\widehat{\theta})=\widehat{\theta}^T x^*_i$. Once $y$ is imputed for all units of the reference survey, the population mean can be estimated through the Hájek formula:

$\widehat{\overline{y}}_{PM} = \dfrac{1}{\widehat{N}_R} \sum_{i=1}^{n_R} \dfrac{\widehat{y}_i}{\pi^R_i}$   (2.16)

where $\widehat{y}_i = m(x^*_i;\widehat{\theta})$ for $i\in S_R$, $\widehat{N}_R=\sum_{i=1}^{n_R} w^R_i$, and $\pi^R_i$ is the selection probability of subject $i\in S_R$.

The asymptotic properties of the estimator in Eq. 2.16, including consistency and unbiasedness, have been investigated by Kim et al. (2018). Note that in situations where $\pi^R_i$ is available for $i\in S_B$, one can use $w^R_i$ instead of the high-dimensional $d_i$ as a predictor in $m(\cdot)$. This method is known as linear-in-the-weight prediction (LWP) (Scharfstein et al., 1999; Bang and Robins, 2005; Zhang and Little, 2011). However, since outcome imputation relies fully on extrapolation, even minor misspecification of the underlying model can be seriously detrimental to bias correction.
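A minimal R sketch of the PM estimator in Eqs. 2.15-2.16 under a linear working model; `nps`, `ref`, and the covariate names (including the design variable `d1`) are hypothetical.

```r
# Minimal PM sketch: fit m(x*; theta) on S_B, impute y for S_R (Eq. 2.15),
# and take the Hajek-weighted mean of the imputations (Eq. 2.16).
fit_y <- lm(y ~ x1 + x2 + d1, data = nps)   # outcome model trained on S_B
y_hat <- predict(fit_y, newdata = ref)      # predictions for units of S_R
w_R   <- 1 / ref$pi_R
y_pm  <- sum(w_R * y_hat) / sum(w_R)        # Hajek estimator of the mean
```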

2.4 Doubly robust adjustment approach

To reduce the sensitivity to model misspecification, Chen et al. (2020) reconcile the two aforementioned approaches, QR and PM, such that estimates remain consistent even if one of the two models is incorrectly specified. Their method extends the augmented inverse propensity weighting (AIPW) estimator proposed by Robins et al. (1994). When $N$ is known, the expanded AIPW estimator takes the following form:

$\overline{y}_{DR} = \dfrac{1}{N}\sum_{i=1}^{n_B} \dfrac{y_i - m(x^*_i;\theta)}{\pi^B(x^*_i;\beta)} + \dfrac{1}{N}\sum_{j=1}^{n_R} \dfrac{m(x^*_j;\theta)}{\pi^R_j}$   (2.17)

where, given $x^*$, $\beta$ and $\theta$ are estimated using the modified PMLE and MLE methods discussed in Sections 2.2 and 2.3, respectively. The theoretical proof of the asymptotic unbiasedness of $\overline{y}_{DR}$ under correct specification of either $\pi^B(x^*_i;\beta)$ or $m(x^*_i;\theta)$ is reviewed in Appendix 8.1.

To avoid using $\pi^R$ in modeling $\delta^B_i$, because of the PMLE restrictions discussed in Section 2.2, we suggest estimating $\pi^B_i$ for $i\in S_B$ in Eq. 2.17 by the PAPW or PAPP method, depending on whether $\pi^R_i$ is available for $i\in S_B$. As a result, in situations where $\pi^R_i$ is known for $i\in S_B$, our proposed DR estimator is given by

$\widehat{\overline{y}}_{DR} = \dfrac{1}{N}\sum_{i=1}^{n_B} \dfrac{1}{\pi^R_i} \left[\dfrac{1-p_i(\beta_1)}{p_i(\beta_1)}\right] \{y_i - m(x^*_i;\theta)\} + \dfrac{1}{N}\sum_{j=1}^{n_R} \dfrac{m(x^*_j;\theta)}{\pi^R_j}$   (2.18)

where $\pi^B(x^*_i;\beta)$ has been substituted using Eq. 2.8. We demonstrate that this form of the AIPW estimator is identical to that defined by Kim and Haziza (2014) in the context of non-response adjustment for probability surveys. Assuming, hypothetically, that $y_i$ were fully observed for $i\in S_R$, define the following HT-estimator of the population mean:

$\widehat{\overline{y}}_U = \dfrac{1}{N}\sum_{i=1}^{n_R} \dfrac{y_i}{\pi^R_i}$   (2.19)

Now, one can easily conclude that

$\widehat{\overline{y}}_{DR} = \dfrac{1}{N}\sum_{i=1}^{n} \dfrac{1}{\pi^R_i} \left[Z_i\left(\dfrac{1-p_i(\beta_1)}{p_i(\beta_1)}\right)\{y_i - m(x^*_i;\theta)\} + (1-Z_i)\, m(x^*_i;\theta)\right]$   (2.20)
$\qquad = \widehat{\overline{y}}_U + \dfrac{1}{N}\sum_{i=1}^{n} \dfrac{1}{\pi^R_i} \left[\dfrac{Z_i}{p_i(\beta_1)} - 1\right] \big\{y_i - m(x^*_i;\theta)\big\}$

where $p_i(\beta_1)=p(Z_i=1|x^*_i;\beta_1)$. The formula in Eq. 2.20 is similar to the one derived by Kim and Haziza (2014). Therefore, the rest of their theoretical proof of asymptotic unbiasedness, i.e. $\widehat{\overline{y}}_{DR}-\widehat{\overline{y}}_U = O_p(n_B^{-1/2})$, holds for our modified AIPW estimator in Eq. 2.18 as well.
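Reusing the fitted propensity and outcome models from the sketches above, Eq. 2.18 reduces to a few lines of R; `N` is the known population size, and all object names remain illustrative.

```r
# Minimal sketch of the proposed DR estimator in Eq. 2.18.
m_B  <- predict(fit_y, newdata = nps)                     # m(x*; theta) on S_B
m_R  <- predict(fit_y, newdata = ref)                     # m(x*; theta) on S_R
p_B  <- predict(fit_z, newdata = nps, type = "response")  # p_i(beta_1) on S_B
y_dr <- sum((1 / nps$pi_R) * ((1 - p_B) / p_B) * (nps$y - m_B)) / N +
        sum(m_R / ref$pi_R) / N
```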

To preserve the DR property for both the point and variance estimators of $\overline{y}_{DR}$, as suggested by Kim and Haziza (2014), one can simultaneously solve the following estimating equations given $S$ to obtain the estimate of $(\beta_1,\theta)$. The aim is to cancel the first-order derivative terms in the Taylor-series expansion of $\widehat{\overline{y}}_{DR}-\widehat{\overline{y}}_U$ under QR and PM. These estimating equations are

$\dfrac{\partial}{\partial\beta_1}\left[\widehat{\overline{y}}_{DR}-\widehat{\overline{y}}_U\right] = \dfrac{1}{N}\sum_{i=1}^{n} \dfrac{Z_i}{\pi^R_i}\left[\dfrac{1}{p_i(\beta_1)}-1\right]\{y_i-m(x^*_i;\theta)\}\, x^*_i = 0$   (2.21)
$\dfrac{\partial}{\partial\theta}\left[\widehat{\overline{y}}_{DR}-\widehat{\overline{y}}_U\right] = \dfrac{1}{N}\sum_{i=1}^{n} \dfrac{1}{\pi^R_i}\left[\dfrac{Z_i}{p_i(\beta_1)}-1\right]\dot{m}(x^*_i;\theta) = 0$

where $\dot{m}$ is the derivative of $m$ with respect to $\theta$; under a linear regression model, $\dot{m}(x^*_i)=x^*_i$. Therefore, under the same regularity conditions, ignorability of $S_B$, the logistic regression model, and the additional assumption that $S_B\cap S_R=\emptyset$, one can show that the proposed DR estimator is consistent and approximately unbiased provided that either the QR or the PM model holds.

It is important to note that the system of equations in 2.21 may not have a unique solution unless the dimension of the covariates is identical in QR and PM. Therefore, the AIPW estimator of Chen et al. (2020) may not be applicable here, as our likelihood factorization suggests that conditioning on $d_i$ is necessary at least for the PM. Furthermore, when $\pi^R_i$ is known for $i\in S_B$, one can replace the $q$-dimensional $d_i$ with the one-dimensional $w^R_i$ in modeling both QR and PM. Bang and Robins (2005) show that estimators based on a linear-in-weight prediction model remain consistent.

2.5 Extension of the proposed method under a two-step Bayesian framework

A fully Bayesian approach specifies a model for the joint distribution of the selection indicator, $\delta^B_i$, and the outcome variable, $y_i$, for $i\in U$ (McCandless et al., 2009; An, 2010). This requires generating multiple synthetic populations and fitting the QR and PM models on each of them repeatedly (Little and Zheng, 2007; Zangeneh and Little, 2015), which can be computationally expensive in a Big Data setting. While joint modeling may result in good frequentist properties (Little, 2004), feedback occurs between the two models (Zigler et al., 2013). This can be controversial in the sense that PS estimates should not be informed by the outcome model (Rubin, 2007). Here, we model the PS and the outcome separately through the two-step framework proposed by Kaplan and Chen (2012). The first step fits Bayesian models to multiply impute the PS and the outcome by randomly subsampling the posterior predictive draws; the second step applies Rubin's combining rules to obtain the final point and interval estimates. This method is not only computationally efficient, since it suffices to fit the models once on the combined sample, but it also cuts the undesirable feedback between the models, since they are fitted separately. Bayesian modeling can be performed either parametrically or non-parametrically.

2.5.1 Parametric Bayes

As the first step, we employ Bayesian generalized linear models to multiply impute $\pi^B_i$ and $y_i$ for $i\in S$, and $\pi^R_i$ if it is unknown for $i\in S_B$. Under a standard Bayesian framework, a set of independent prior distributions is assigned to the model parameters, and, conditional on the observed data, the associated posterior distributions are simulated through an appropriate MCMC method, such as the Metropolis-Hastings algorithm. We propose the following steps:

Step 1: $(\gamma^T, \phi, \beta^T, \theta^T, \sigma) \sim p(\gamma)\,p(\phi)\,p(\beta)\,p(\theta)\,p(\sigma)$
Step 2: $\pi^R_i \,|\, x_i, \gamma, \phi \sim Beta\big(\phi\,[\mathrm{logit}^{-1}(\gamma^T x_i)],\ \phi\,[1-\mathrm{logit}^{-1}(\gamma^T x_i)]\big)$
Step 3: $Z_i \,|\, x_i, \beta \sim Bernoulli\big(\mathrm{logit}^{-1}\{\beta^T x_i\}\big)$
Step 4: $Y_i \,|\, x^*_i, \theta, \sigma \sim Normal(\theta^T x^*_i,\ \sigma^2)$

where $(\gamma^T,\phi)$, $\beta^T$ and $(\theta^T,\sigma)$ are the parameters associated with modeling $\pi^R_i$ through a Beta regression (Step 2), $Z_i$ through a binary logistic regression (Step 3), and $Y_i$ through a linear regression (Step 4), respectively, and $p(\cdot)$ denotes a prior density function. Note that in situations where $\pi^R_i$ is calculable for $i\in S_B$, Step 2 should be skipped, and $x_i$ should be replaced by $x^*_i$ in Step 3. Standard weak or non-informative priors can be used for the regression model parameters (Kaplan and Chen, 2012). We also note that Step 2, which is required for the estimation of $\pi^R_i$ when it is not provided directly or through the availability of $d_i$ in $S_B$, relies on a reasonably strong association between the available $x_i$ and $\pi^R_i$ in order to estimate $\pi^R_i$ accurately. We explore the effect of differing degrees of this association via simulation in Sections 3.2 and 3.3.

Suppose $\Theta^{(m)T} = \left[\gamma^{(m)T}, \phi^{(m)}, \beta^{(m)T}, \theta^{(m)T}, \sigma^{(m)}\right]$ is the $m$-th element of an $M$-sized random sample from the MCMC draws of the posterior distribution of the model parameters. Then, given that $\pi^R_i$ is known for $i\in S_B$, one can obtain the $m$-th draw of the proposed AIPW estimator as

$\overline{y}^{(m)}_{DR} = \dfrac{1}{\widehat{N}_B}\sum_{i=1}^{n_B} \dfrac{y_i - \theta^{(m)T}x^*_i}{\pi^R_i \exp[\beta^{(m)T}x^*_i]} + \dfrac{1}{\widehat{N}_R}\sum_{j=1}^{n_R} \dfrac{\theta^{(m)T}x^*_j}{\pi^R_j}$   (2.22)

where $\theta^{(m)T}x^*_j$ corresponds to an imputation of $y_j$ for the unobserved values in the probability sample. In situations where $\pi^R_i$ is unknown for $i\in S_B$, the $m$-th draw of the AIPW estimator is given by

$\overline{y}^{(m)}_{DR} = \dfrac{1}{\widehat{N}_B}\sum_{i=1}^{n_B} \left\{\dfrac{1+\exp[\gamma^{(m)T}x_i]}{\exp[\gamma^{(m)T}x_i]}\right\} \left\{\dfrac{y_i - \theta^{(m)T}x^*_i}{\exp[\beta^{(m)T}x_i]}\right\} + \dfrac{1}{\widehat{N}_R}\sum_{j=1}^{n_R} \dfrac{\theta^{(m)T}x^*_j}{\pi^R_j}.$   (2.23)

Given $\overline{y}^{(m)}_{DR}$ for all $m=1,2,\ldots,M$, Rubin's combining rule for the point estimate (Rubin, 2004) can be employed to obtain the final AIPW estimator:

$\overline{y}_{DR} = \dfrac{1}{M}\sum_{m=1}^{M} \overline{y}^{(m)}_{DR}.$   (2.24)

If at least one of the underlying models is correctly specified, we would expect that this estimator is approximately unbiased. The variance estimation under the two-step Bayesian method is discussed in Section 2.6.
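A minimal two-step sketch for the case where $\pi^R_i$ is known for $i\in S_B$ (so Step 2 is skipped), assuming the rstanarm package for posterior draws; the formulas, data frames, and $M$ are illustrative, and Eq. 2.18 is evaluated at each retained draw before applying Eq. 2.24.

```r
# Minimal two-step parametric-Bayes sketch (Section 2.5.1), assuming rstanarm.
library(rstanarm)
fit_z <- stan_glm(z ~ x1 + x2 + d1, family = binomial(), data = dat, refresh = 0)
fit_y <- stan_glm(y ~ x1 + x2 + d1, family = gaussian(), data = nps, refresh = 0)
M     <- 200
idx   <- sample(nrow(as.matrix(fit_z)), M)              # random subsample of draws
p_B   <- posterior_epred(fit_z, newdata = nps)[idx, ]   # M x n_B propensity draws
m_B   <- posterior_epred(fit_y, newdata = nps)[idx, ]   # M x n_B outcome draws
m_R   <- posterior_epred(fit_y, newdata = ref)[idx, ]   # M x n_R outcome draws
ydr_m <- sapply(seq_len(M), function(m)
  sum((1 / nps$pi_R) * ((1 - p_B[m, ]) / p_B[m, ]) * (nps$y - m_B[m, ])) / N +
  sum(m_R[m, ] / ref$pi_R) / N)                         # Eq. 2.18 per draw
y_dr  <- mean(ydr_m)                                    # Rubin's rule, Eq. 2.24
```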

2.5.2 Non-parametric Bayes

Despite the prominent feature of double robustness, for a given non-probability sample, neither the QR nor the PM working model has a known structure in practice. When both working models are invalid, the AIPW estimator will be biased, and a non-robust estimator based on PM may produce a more efficient estimate than the AIPW (Kang et al., 2007). To show the advantage of our modified estimator in Eq. 2.18 over that proposed by Chen et al. (2020), we employ Bayesian additive regression trees (BART) as a predictive tool for multiply imputing the $\pi^B_i$'s as well as the $y_i$'s in $S$. A brief introduction to BART is provided in Appendix 8.2.

Suppose BART approximates a continuous outcome variable through an implicit function $f$ as follows:

$y_i = f(x^*_i) + \epsilon_i, \quad \forall i\in S_B$   (2.25)

where $\epsilon_i \sim N(0,\sigma^2)$. Accordingly, one can train BART on $S_B$ and multiply impute $y_i$ for $i\in S_R$ using the simulated posterior predictive distribution. Regarding QR, we consider two situations. First, suppose $\pi^R_i$ is known for $i\in S_B$. Under this circumstance, it suffices to model $z_i$ on $x^*_i$ in $S$ to estimate $\pi^B_i$ for $i\in S_B$. For a binary outcome variable, BART utilizes a data augmentation technique with respect to a given link function to map $\{0,1\}$ values to $\mathbb{R}$; here we use a probit link. Suppose

$\Phi^{-1}[p(Z_i=1|x^*_i)] = h(x^*_i), \quad \forall i\in S$   (2.26)

where $\Phi^{-1}$ is the inverse CDF of the standard normal distribution. Hence, using the posterior predictive draws generated by BART in Eq. 2.26, $p(Z_i=1|x^*_i)$, and consequently $\pi^B_i$, can be multiply imputed for $i\in S_B$. For a given imputation $m$ $(m=1,2,\ldots,M)$, one can expand the DR estimator in Eq. 2.18 as

$\overline{y}^{(m)}_{DR} = \dfrac{1}{\widehat{N}_B}\sum_{i=1}^{n_B} \dfrac{1}{\pi^R_i} \left\{\dfrac{1-\Phi[h^{(m)}(x^*_i)]}{\Phi[h^{(m)}(x^*_i)]}\right\} \big\{y_i - f^{(m)}(x^*_i)\big\} + \dfrac{1}{\widehat{N}_R}\sum_{j=1}^{n_R} \dfrac{f^{(m)}(x^*_j)}{\pi^R_j}$   (2.27)

where $f^{(m)}(\cdot)$ and $h^{(m)}(\cdot)$ are the sum-of-trees functions associated with the $m$-th MCMC draw in Eqs. 2.25 and 2.26, respectively, after training BART on the combined sample.

Second, in situations where $\pi^R_i$ is not available for $i\in S_B$, we suggest applying BART to multiply impute the missing $\pi^R_i$'s in $S_B$. Since this outcome is continuous and bounded within $(0,1)$, a logit transformation of the $\pi^R_i$'s can be used as the outcome variable in BART to map the values to $\mathbb{R}$. Such a model is given by

$\log\left(\dfrac{\pi^R_i}{1-\pi^R_i}\right) = k(x_i) + \epsilon_i, \quad \forall i\in S_R$   (2.28)

where $k$ is a sum-of-trees function approximated by BART. Under this circumstance, $\overline{y}_{DR}$ based on the $m$-th draw from the posterior distribution is given by

$\overline{y}^{(m)}_{DR} = \dfrac{1}{\widehat{N}_B}\sum_{i=1}^{n_B} \left\{\dfrac{1+\exp[k^{(m)}(x_i)]}{\exp[k^{(m)}(x_i)]}\right\} \left\{\dfrac{1-\Phi[h^{(m)}(x_i)]}{\Phi[h^{(m)}(x_i)]}\right\} \big\{y_i - f^{(m)}(x^*_i)\big\} + \dfrac{1}{\widehat{N}_R}\sum_{j=1}^{n_R} \dfrac{f^{(m)}(x^*_j)}{\pi^R_j}.$   (2.29)

Having estimated $\overline{y}^{(m)}_{DR}$ for $m=1,2,\ldots,M$, one can use Rubin's combining rule (Rubin, 2004) to obtain the final point estimate as in Eq. 2.24.
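The following minimal sketch uses the wbart (continuous) and pbart (probit) interfaces of the BART R package to evaluate Eq. 2.27 draw by draw, assuming $\pi^R_i$ is known for $i\in S_B$; the matrices `X_B`, `X_R` and vectors `y_B`, `pi_R_B`, `pi_R_R` are illustrative, and both fits are assumed to retain the same (default) number of posterior draws.

```r
# Minimal non-parametric sketch of Eqs. 2.25-2.27 with the BART package.
library(BART)
X_all <- rbind(X_B, X_R)                          # x* for the combined sample S
z_all <- rep(1:0, c(nrow(X_B), nrow(X_R)))        # Z indicator in S
fit_h <- pbart(x.train = X_all, y.train = z_all, x.test = X_B)  # h(x*), Eq. 2.26
fit_f <- wbart(x.train = X_B,  y.train = y_B,  x.test = X_all)  # f(x*), Eq. 2.25
Phi_h <- fit_h$prob.test                          # draws of p(Z = 1 | x*) on S_B
f_B   <- fit_f$yhat.test[, seq_len(nrow(X_B))]    # outcome draws on S_B
f_R   <- fit_f$yhat.test[, -seq_len(nrow(X_B))]   # outcome draws on S_R
ydr_m <- sapply(seq_len(nrow(Phi_h)), function(m) {
  pi_B <- pi_R_B * Phi_h[m, ] / (1 - Phi_h[m, ])  # pseudo-probabilities per draw
  sum((y_B - f_B[m, ]) / pi_B) / sum(1 / pi_B) +  # first Hajek term of Eq. 2.27
    sum(f_R[m, ] / pi_R_R) / sum(1 / pi_R_R)      # second Hajek term of Eq. 2.27
})
y_dr <- mean(ydr_m)                               # Rubin's rule, Eq. 2.24
```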

2.6 Variance estimation

To obtain an unbiased variance estimate for the proposed DR estimator, one needs to account for three sources of uncertainty: (i) the uncertainty due to the estimated pseudo-weights in $S_B$, (ii) the uncertainty due to the predicted outcome in both $S_B$ and $S_R$, and (iii) the uncertainty due to sampling itself. In the following, we consider two scenarios:

2.6.1 Scenario I: $\pi^R_i$ is known for $i\in S_B$

In this scenario, the derivation of the asymptotic variance estimator for $\widehat{\overline{y}}_{DR}$ is straightforward and follows Chen et al. (2020). It is given by

$\widehat{Var}(\widehat{\overline{y}}_{DR}) = \widehat{V}_1 + \widehat{V}_2 - \widehat{B}(\widehat{V}_2)$   (2.30)

where $V_1 = Var(\widehat{\overline{y}}_{PM})$ under the sampling design of $S_R$, and $V_2$ is the variance of $\widehat{\overline{y}}_{DR}-\widehat{\overline{y}}_U$ under the joint sampling design of $S_R$ and the PS model. The latter can be estimated from $S_B$ as

$\widehat{V}_2 = \dfrac{1}{N^2}\sum_{i=1}^{n_B} \left[\dfrac{1-\widehat{\pi}^B_i}{(\widehat{\pi}^B_i)^2}\right] \{y_i - m(x^*_i;\widehat{\theta})\}^2.$   (2.31)

Finally, $\widehat{B}(\widehat{V}_2)$ corrects for the bias of $\widehat{V}_2$ under the PM and is given by

$\widehat{B}(\widehat{V}_2) = \dfrac{1}{N^2}\sum_{i=1}^{n} \left[\dfrac{Z_i}{\widehat{\pi}^B_i} - \dfrac{1-Z_i}{\pi^R_i}\right] \widehat{\sigma}^2_i$   (2.32)

where $\widehat{\sigma}^2_i = \widehat{Var}(y_i|x_i)$. Since the quantity in Eq. 2.32 tends to zero asymptotically under the QR model, the variance estimator in Eq. 2.30 is DR.

2.6.2 Scenario II: $\pi^R_i$ is unknown for $i\in S_B$

To estimate the variance of $\widehat{\overline{y}}_{DR}$ in Eq. 2.18 under the GLM, we employ the bootstrap repeated replication method proposed by Rao and Wu (1988). For a given replication $b$ $(b=1,2,\ldots,B)$, we draw bootstrap subsamples, $S^{(b)}_R$ and $S^{(b)}_B$, of sizes $n_R-1$ and $n_B-1$ from $S_R$ and $S_B$, respectively. The sampling weights in $S^{(b)}_R$ are updated as

$w^{(b)}_i = w_i \dfrac{n_R}{n_R-1} h_i, \quad \forall i\in S^{(b)}_R$   (2.33)

where $h_i$ is the number of times the $i$-th unit is repeated in $S^{(b)}_R$. Let $\widehat{\overline{y}}^{(b)}_{DR}$ be the DR estimate based on the $b$-th combined bootstrap sample, $S^{(b)}$. The variance estimator is given by

$\widehat{Var}(\widehat{\overline{y}}_{DR}) = \dfrac{1}{B}\sum_{b=1}^{B} \left[\widehat{\overline{y}}^{(b)}_{DR} - \overline{\overline{y}}_{DR}\right]^2$   (2.34)

where $\overline{\overline{y}}_{DR} = \sum_{b=1}^{B} \widehat{\overline{y}}^{(b)}_{DR}/B$. Note that when $S_R$ and $S_B$ are clustered, which is the case in our application, bootstrap subsamples are selected from the primary sampling units (PSUs), and $n_R$ and $n_B$ are replaced by the respective numbers of PSUs.
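A minimal R sketch of this bootstrap scheme follows; `dr_estimate()` is a hypothetical wrapper that recomputes the DR estimate on given bootstrap copies of the two samples, and the other names are illustrative.

```r
# Minimal Rao-Wu bootstrap sketch (Eqs. 2.33-2.34). Sampling rows with
# replacement carries the multiplicities h_i of Eq. 2.33 implicitly.
B <- 500
ydr_b <- replicate(B, {
  rb <- ref[sample(nrow(ref), nrow(ref) - 1, replace = TRUE), ]
  nb <- nps[sample(nrow(nps), nrow(nps) - 1, replace = TRUE), ]
  rb$w <- (1 / rb$pi_R) * nrow(ref) / (nrow(ref) - 1)  # rescaled weights, Eq. 2.33
  dr_estimate(ref = rb, nps = nb)                      # hypothetical DR wrapper
})
var_dr <- mean((ydr_b - mean(ydr_b))^2)                # Eq. 2.34
```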

To estimate the variance of $\widehat{\overline{y}}_{DR}$ under a Bayesian framework, whether parametric or non-parametric, we treat $y_i$ for $i\in S_R$, and $\pi^R_i$ and the propensity scores $e_i$ for $i\in S_B$, as missing values in Eq. 2.18, and multiply impute these quantities using the Markov chain Monte Carlo (MCMC) sequence of the posterior predictive distribution generated by BART. For $M$ randomly selected MCMC draws, one can estimate $\widehat{\overline{y}}^{(m)}_{DR}$ for $m=1,2,\ldots,M$ based on Eq. 2.18. Following Rubin's combining rule for variance estimation under multiple imputation, the final variance estimate of $\widehat{\overline{y}}_{DR}$ is

$\widehat{Var}(\widehat{\overline{y}}_{DR}) = \overline{V}_W + \left(1+\dfrac{1}{M}\right)V_B$   (2.35)

where $\overline{V}_W = \sum_{m=1}^{M}\widehat{Var}(\widehat{\overline{y}}^{(m)}_{DR})/M$, $V_B = \sum_{m=1}^{M}(\widehat{\overline{y}}^{(m)}_{DR} - \overline{\overline{y}}_{DR})^2/(M-1)$, and $\overline{\overline{y}}_{DR} = \sum_{m=1}^{M}\widehat{\overline{y}}^{(m)}_{DR}/M$. We show in Appendix 8.1 that the within-imputation component can be approximated by

$\widehat{Var}(\widehat{\overline{y}}^{(m)}_{DR}) \approx \dfrac{1}{\widehat{N}^2_B}\sum_{i=1}^{n_B} \dfrac{var(y_i)}{\left(\widehat{\pi}^B_i\right)^2} + \dfrac{1}{\widehat{N}^2_R}\, var\left(\dfrac{1}{\pi^R_i}\right) \left\{\sum_{i=1}^{n_R}\left(\widehat{y}^{(m)}_i\right)^2 + n_R\left(\dfrac{\widehat{t}_R}{\widehat{N}_R}\right)^2 - 2\sum_{i=1}^{n_R}\widehat{y}^{(m)}_i\right\}$   (2.36)

where $\widehat{t}_R = \sum_{i=1}^{n_R}\widehat{y}^{(m)}_i/\pi^R_i$. Note that when $S_R$ or $S_B$ is clustered, it is important under a Bayesian framework to fit multilevel models in order to obtain unbiased variance estimates (Zhou et al., 2020). In addition, one needs to account for the intraclass correlation across the sample units in $\widehat{Var}(\widehat{\overline{y}}^{(m)}_{DR})$ for $m=1,2,\ldots,M$. Further, one may use the extension of BART with a random intercept to properly specify the working models under a cluster sampling design (Tan et al., 2016).

3 Simulation Study

Three simulation studies are presented in this section to assess the performance of our proposed methods and associated variance estimators in terms of bias magnitude and other repeated sampling properties. In Simulation 1, we mimic the simulation design of Chen et al. (2020) to compare the proposed methods; here the probability of selection in the probability sample is a fixed linear combination of a subset of the covariates that govern selection into the non-probability sample. In Simulation 2, we separate the design variable for the probability sample from the selection covariate for the non-probability sample in order to consider different associations between these values, as well as misspecification of the functional form of the means, to examine the advantages of BART in modeling. Simulation 3 extends Simulation 2 to allow for cluster sampling in the probability sample, to better match the design of the National Household Travel Survey in the application.

3.1 Simulation I

The design of our first simulation study is inspired by the one implemented in Chen et al. (2020). For all three studies, the non-probability samples are given a random selection mechanism with unequal probabilities, but these selection probabilities are assumed unknown at the analysis stage, and the goal is to adjust for the selection bias using a parallel probability sample whose sampling mechanism is known. We conduct the simulation under both the asymptotic frequentist and the two-step Bayesian frameworks. Consider a finite population of size $N=10^6$ with $z=\{z_1,z_2,z_3,z_4\}$ a set of auxiliary variables generated as follows:

$z_1 \sim Ber(p=0.5), \quad z_2 \sim U(0,2), \quad z_3 \sim Exp(\mu=1), \quad z_4 \sim \chi^2_{(4)}$   (3.1)

and $x=\{x_1,x_2,x_3,x_4\}$ is defined as a function of $z$ as follows:

$x_1 = z_1, \quad x_2 = z_2 + 0.3z_1, \quad x_3 = z_3 + 0.2(x_1+x_2), \quad x_4 = z_4 + 0.1(x_1+x_2+x_3)$   (3.2)

Given $x$, a continuous outcome variable $y$ is defined by

$y_i = 2 + x_{1i} + x_{2i} + x_{3i} + x_{4i} + \sigma\epsilon_i$   (3.3)

where $\epsilon_i \sim N(0,1)$, and $\sigma$ is defined such that the correlation between $y_i$ and $\sum_{k=1}^{4}x_{ki}$ equals $\rho$, which takes one of the values $\{0.2, 0.5, 0.8\}$. Further, for the design of $S_B$, a set of selection probabilities is assigned to the population units through the following logistic model:

$\log\left(\dfrac{\pi^B_i}{1-\pi^B_i}\right) = \gamma_0 + 0.1x_{1i} + 0.2x_{2i} + 0.1x_{3i} + 0.2x_{4i}$

where $\gamma_0$ is determined such that $\sum_{i=1}^{N}\pi^B_i = n_B$. For $S_R$, we assume $\pi^R_i \propto \gamma_1 + z_{3i}$, where $\gamma_1$ is obtained such that $\max\{\pi^R_i\}/\min\{\pi^R_i\} = 50$. Hence, $\pi^R_i$ is assumed to be known for $i\in S_B$ as long as $z_3$ is observed in $S_B$. Using these measures of size, we repeatedly draw pairs of samples of sizes $n_R=100$ and $n_B=1{,}000$ for $S_R$ and $S_B$ from $U$ through Poisson sampling. Note that units in both $S_R$ and $S_B$ are selected independently, and $n_R \ll n_B$, as might be the case in a Big Data setting. Extensions with $n_B=100$ and $n_B=10{,}000$ for both frequentist and Bayesian methods are provided in Appendix 8.3.
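For concreteness, a minimal R sketch of this population and the two selection mechanisms follows; the calibration of $\gamma_0$ by root finding and the closed form for $\gamma_1$ are our own implementation choices rather than prescribed by the text.

```r
# Minimal sketch of the Simulation I population (Eqs. 3.1-3.3) and designs.
set.seed(1)
N <- 1e6; n_B <- 1000; n_R <- 100; rho <- 0.5
z1 <- rbinom(N, 1, 0.5); z2 <- runif(N, 0, 2); z3 <- rexp(N, 1); z4 <- rchisq(N, 4)
x1 <- z1; x2 <- z2 + 0.3 * z1
x3 <- z3 + 0.2 * (x1 + x2); x4 <- z4 + 0.1 * (x1 + x2 + x3)
sigma <- sqrt(var(x1 + x2 + x3 + x4) * (1 / rho^2 - 1))  # makes cor(y, sum(x)) = rho
y   <- 2 + x1 + x2 + x3 + x4 + sigma * rnorm(N)          # Eq. 3.3
lin <- 0.1 * x1 + 0.2 * x2 + 0.1 * x3 + 0.2 * x4
g0  <- uniroot(function(g) sum(plogis(g + lin)) - n_B, c(-20, 0))$root
pi_B <- plogis(g0 + lin)                                 # sum(pi_B) = n_B
g1   <- (max(z3) - 50 * min(z3)) / 49                    # max/min ratio of 50
pi_R <- n_R * (g1 + z3) / sum(g1 + z3)
s_B  <- rbinom(N, 1, pi_B) == 1                          # Poisson sampling draws
s_R  <- rbinom(N, 1, pi_R) == 1
```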

Once $S_B$ and $S_R$ are drawn from $U$, we assume that the $\pi^B_i$'s for $i\in S_B$ and the $y_i$'s for $i\in S_R$ are unobserved. The simulation is then iterated $K=5{,}000$ times, and the bias-adjusted mean, standard error (SE), and associated 95% confidence interval (CI) for the mean of $y$ are estimated in each iteration. Under the frequentist approach, the AIPW point estimates are obtained by simultaneously solving the estimating equations in 2.21. In addition, the proposed two-step method is used to derive the AIPW point estimates under the parametric Bayesian approach. To estimate the variance, we use the DR asymptotic method proposed by Chen et al. (2020) under the frequentist approach, and the conditional variance formula in Eq. 2.35 under the Bayesian approach. For the latter, we set flat priors on the model parameters and simulate the posteriors using 1,000 MCMC draws, after discarding an additional 1,000 draws as burn-in. We then take a random sample of size $M=200$ from the posterior draws to obtain the point and variance estimates.

To evaluate the repeated sampling properties of the competing methods, the relative bias (rBias), relative root mean squared error (rMSE), nominal coverage rate of the 95% CIs (crCI), and SE ratio (rSE) are calculated as follows:

$rBias(\widehat{\overline{y}}_{DR}) = 100 \times \dfrac{1}{K}\sum_{k=1}^{K}\left(\widehat{\overline{y}}^{(k)}_{DR} - \overline{y}_U\right)/\overline{y}_U$   (3.4)
$rMSE(\widehat{\overline{y}}_{DR}) = 100 \times \sqrt{\dfrac{1}{K}\sum_{k=1}^{K}\left(\widehat{\overline{y}}^{(k)}_{DR} - \overline{y}_U\right)^2}\Big/\overline{y}_U$   (3.5)
$crCI(\widehat{\overline{y}}_{DR}) = 100 \times \dfrac{1}{K}\sum_{k=1}^{K} I\left(\big|\widehat{\overline{y}}^{(k)}_{DR} - \overline{y}_U\big| < z_{0.975}\sqrt{var(\widehat{\overline{y}}^{(k)}_{DR})}\right)$   (3.6)
$rSE(\widehat{\overline{y}}_{DR}) = \dfrac{1}{K}\sum_{k=1}^{K}\sqrt{var(\widehat{\overline{y}}^{(k)}_{DR})}\Big/\sqrt{\dfrac{1}{K-1}\sum_{k=1}^{K}\left(\widehat{\overline{y}}^{(k)}_{DR} - \overline{\overline{y}}_{DR}\right)^2}$   (3.7)

where $\widehat{\overline{y}}^{(k)}_{DR}$ denotes the DR-adjusted sample mean from iteration $k$, $\overline{\overline{y}}_{DR} = \sum_{k=1}^{K}\widehat{\overline{y}}^{(k)}_{DR}/K$, $\overline{y}_U$ is the true finite population mean, and $var(\cdot)$ represents the variance estimate of the adjusted mean based on the sample. Finally, we test the DR property under model misspecification by removing $x_4$ from the predictors of the working model.
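These metrics translate directly into R; given vectors `est` and `se` of the $K$ point estimates and standard errors (illustrative names) and the population truth `ybar_U`:

```r
# Minimal sketch of the repeated-sampling metrics in Eqs. 3.4-3.7.
rBias <- 100 * mean(est - ybar_U) / ybar_U                  # Eq. 3.4
rMSE  <- 100 * sqrt(mean((est - ybar_U)^2)) / ybar_U        # Eq. 3.5
crCI  <- 100 * mean(abs(est - ybar_U) < qnorm(0.975) * se)  # Eq. 3.6
rSE   <- mean(se) / sd(est)                                 # Eq. 3.7
```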

Table 1: Comparing the performance of the bias adjustment methods and the associated asymptotic variance estimator under the frequentist approach in the first simulation study, for $\rho=\{0.2, 0.5, 0.8\}$

                                      ρ = 0.2                      ρ = 0.5                      ρ = 0.8
Method                    rBias    rMSE  crCI    rSE    rBias    rMSE  crCI    rSE    rBias    rMSE  crCI    rSE
Probability sample (S_R)
  Unweighted              8.528  19.248  92.6  1.009    8.647  11.065  77.4  1.018    8.682   9.719  50.9  1.020
  Fully weighted         -0.029  20.276  94.7  1.001    0.006   8.035  95.1  1.010    0.015   5.008  94.9  1.008
Non-probability sample (S_B)
  Unweighted             31.742  32.230   0.0  1.009   31.937  32.035   0.0  1.012   31.996  32.049   0.0  1.013
  Fully weighted          0.127   6.587  95.4  1.013    0.078   2.583  95.7  1.014    0.061   1.554  95.4  1.012
Non-robust adjustment
 Model specification: True
  PAPW                   -1.780   8.088  97.0  1.107   -1.906   4.734  95.7  1.103   -1.947   4.186  94.0  1.100
  IPSW                   -3.054  10.934  97.2  1.305   -3.134   8.145  95.2  1.173   -3.160   7.778  92.4  1.067
  PM                      0.490   7.577  95.2  1.007    0.190   4.668  94.6  0.991    0.095   4.204  94.6  0.985
 Model specification: False
  PAPW                   26.338  27.089   3.1  1.112   26.434  26.618   0.0  1.123   26.461  26.580   0.0  1.128
  IPSW                   28.269  28.917   0.6  1.021   28.474  28.648   0.0  1.018   28.536  28.654   0.0  1.014
  PM                     28.093  28.750   0.6  1.022   28.315  28.494   0.0  1.022   28.382  28.505   0.0  1.021
Doubly robust adjustment
 Model specification: QR–True, PM–True
  AIPW–PAPW               0.238   8.070  95.2  1.017    0.100   4.787  95.0  0.996    0.056   4.235  94.6  0.987
  AIPW–IPSW               0.105   7.861  95.1  1.019    0.053   4.737  94.8  0.996    0.036   4.222  94.6  0.987
 Model specification: QR–True, PM–False
  AIPW–PAPW               0.311   8.197  95.4  1.021    0.172   4.988  95.0  1.013    0.127   4.460  95.2  1.011
  AIPW–IPSW               0.222   7.962  95.5  1.024    0.170   4.901  95.4  1.019    0.152   4.405  95.3  1.018
 Model specification: QR–False, PM–True
  AIPW–PAPW               0.877  13.362  96.9  1.028    0.327   6.089  95.8  1.027    0.154   4.523  95.2  1.006
  AIPW–IPSW               0.609  12.532  96.6  1.025    0.232   5.842  95.5  1.022    0.113   4.464  95.3  1.003
 Model specification: QR–False, PM–False
  AIPW–PAPW              28.301  28.995   1.0  1.024   28.392  28.579   0.0  1.021   28.419  28.546   0.0  1.018
  AIPW–IPSW              28.104  28.762   0.7  1.024   28.313  28.493   0.0  1.023   28.376  28.500   0.0  1.022
  • PAPW: propensity-adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.

Table 2: Comparing the performance of the bias adjustment methods and the associated variance estimator under the two-step parametric Bayesian approach in the first simulation study for $\rho=\{0.2,0.5,0.8\}$
$\rho=0.2$ $\rho=0.5$ $\rho=0.8$
Method rBias rMSE crCI rSE rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample ($S_R$)
    Unweighted 8.528 19.248 92.6 1.009 8.647 11.065 77.4 1.018 8.682 9.719 50.9 1.020
    Fully weighted -0.029 20.276 94.7 1.001 0.006 8.035 95.1 1.010 0.015 5.008 94.9 1.008
Non-probability sample ($S_B$)
    Unweighted 31.620 32.106 0.0 1.014 31.906 32.003 0.0 1.015 31.993 32.045 0.0 1.017
    Fully weighted 0.026 6.615 95.3 1.010 0.052 2.604 95.2 1.007 0.059 1.564 95.2 1.006
Non-robust adjustment
Model specification: True
    PAPW -1.846 8.148 96.3 1.081 -1.890 4.749 96.9 1.163 -1.906 4.195 96.6 1.200
    IPSW 0.113 7.566 96.5 1.076 0.117 4.302 97.7 1.140 0.117 3.759 97.9 1.164
    PM 0.385 7.534 95.2 1.027 0.151 4.644 95.1 1.001 0.078 4.190 95.0 0.989
Model specification: False
    PAPW 26.290 27.041 2.3 1.051 26.499 26.687 0.0 1.071 26.562 26.684 0.0 1.083
    IPSW 28.151 28.784 0.5 1.038 28.446 28.612 0.0 1.025 28.535 28.647 0.0 1.015
    PM 27.981 28.641 0.8 1.040 28.291 28.472 0.0 1.025 28.384 28.510 0.0 1.015
Doubly robust adjustment
Model specification: QR–True, PM–True
    AIPW–PAPW 0.115 8.093 96.9 1.097 0.057 4.764 97.1 1.121 0.037 4.219 97.2 1.130
    AIPW–IPSW 0.009 7.803 96.6 1.083 0.019 4.704 96.7 1.106 0.020 4.206 97.0 1.114
Model specification: QR–True, PM–False
    AIPW–PAPW -0.016 7.930 97.2 1.108 -0.080 4.444 97.9 1.166 -0.098 3.842 98.1 1.193
    AIPW–IPSW -0.079 7.648 96.8 1.095 -0.074 4.411 97.7 1.151 -0.069 3.867 97.9 1.175
Model specification: QR–False, PM–True
    AIPW–PAPW 0.557 7.693 96.4 1.086 0.214 4.669 96.8 1.092 0.105 4.195 96.6 1.090
    AIPW–IPSW 0.392 7.526 96.0 1.067 0.155 4.637 96.3 1.077 0.080 4.189 96.4 1.078
Model specification: QR–False, PM–False
    AIPW–PAPW 28.167 28.864 1.4 1.096 28.359 28.549 0.0 1.082 28.416 28.548 0.0 1.068
    AIPW–IPSW 27.990 28.647 1.0 1.069 28.289 28.471 0.0 1.059 28.379 28.506 0.0 1.049
• PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.

Table 1 summarizes the results of the first simulation study under the frequentist approach. As illustrated, unweighted estimates of the population mean are biased in both $S_R$ and $S_B$. Among the non-robust estimators with a valid working model, PM consistently outperforms QR in terms of bias correction across the different values of $\rho$. When the QR model is true, PAPW performs slightly better than IPSW with respect to bias, and the rSE values indicate that IPSW tends to overestimate the variance slightly. In addition, the smaller rMSE values show that PAPW is more efficient than IPSW. For PM, both crCI and rSE reflect accurate variance estimates for all values of $\rho$. When the working model is incorrect, point estimates from both QR and PM are biased, but the variance estimates remain unbiased. These findings hold across all three values of $\rho$.

For the DR methods, estimates based on both PAPW and IPSW remain unbiased when at least one of the PM or QR models holds. The values of crCI and rSE also show that the asymptotic variance estimator is DR for both methods. Comparing the rMSE values, the AIPW estimate based on IPSW is slightly more efficient than the one based on PAPW. When both models are misspecified, point estimates are severely biased, although the variance estimates remain unbiased. Finally, both AIPW estimators show improved performance with respect to bias relative to their non-robust counterparts, especially when the QR model is misspecified.
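To fix ideas, the sketch below implements a generic Hájek-type AIPW estimator of a population mean in the spirit of Chen et al. (2020); the paper's modified estimator differs in how the pseudo-inclusion probabilities are formed (PAPW/IPSW), but those estimates plug into the same skeleton. All names are ours.

```python
import numpy as np

def aipw_mean(y_B, mhat_B, pihat_B, mhat_R, w_R):
    """Generic AIPW (doubly robust) estimate of a finite-population mean.

    y_B     : outcomes observed in the non-probability sample S_B
    mhat_B  : prediction-model fits m(x_i) for units in S_B
    pihat_B : estimated pseudo-inclusion probabilities for units in S_B
    mhat_R  : prediction-model fits m(x_i) for units in the reference sample S_R
    w_R     : design weights of the reference sample S_R
    """
    y_B, mhat_B, pihat_B = map(np.asarray, (y_B, mhat_B, pihat_B))
    mhat_R, w_R = np.asarray(mhat_R), np.asarray(w_R)
    # pseudo-weighted residual term from S_B (Hájek-normalized)
    bias_corr = np.sum((y_B - mhat_B) / pihat_B) / np.sum(1.0 / pihat_B)
    # design-weighted prediction term from S_R
    pred_term = np.sum(w_R * mhat_R) / np.sum(w_R)
    return bias_corr + pred_term
```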

For the Bayesian approach, the simulation results are displayed in Table 2. Note that we are no longer able to use the PMLE approach; instead, we apply the PAPP method assuming that $\pi^{R}_{i}$ is unknown for $i\in S_{B}$. As illustrated, PAPP outperforms all the non-robust methods with respect to bias. Surprisingly, the magnitude of bias is even smaller under the Bayesian PAPP than under the QR methods examined in the frequentist framework. In addition, estimates under the Bayesian approach are slightly more efficient than those obtained under the frequentist methods. While the variance is approximately unbiased for $\rho=0.2$, there is evidence that PM and QR increasingly underestimate and overestimate the true variance, respectively, as $\rho$ increases. Regarding the DR methods, the Bayesian and frequentist approaches yield similar results: the DR property holds for all values of $\rho$ when at least one of the working models is correctly specified.

3.2 Simulation II

In the previous simulation study, we violated the ignorability assumption in order to misspecify the working model by dropping a key auxiliary variable. Now we focus on a situation where models are misspecified with respect to the functional form of their conditional means. To this end, we introduce non-linear associations and two-way interactions in constructing the outcome variables as well as the selection probabilities. This also allows us to examine the flexibility of BART as a non-parametric method when the true functional form of the underlying models is unknown. In addition, to simulate a more realistic situation, two separate auxiliary variables are generated this time: $d$, associated with the design of $S_R$, and $x$, associated with the design of $S_B$. We allow the two variables to be correlated through a bivariate Gaussian distribution as below:

\[
\begin{pmatrix}d\\ x\end{pmatrix}\sim MVN\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\right) \tag{3.8}
\]

Note that $\rho$ controls how strongly the sampling design of $S_R$ is associated with that of $S_B$, which in turn controls the quality of our assumption that $\pi^{R}_{i}$ can be well estimated by $x_{i}$ (Step 2 in Section 2.5.1). In addition, the values of $d_{i}$ can be either observed or unobserved for $i\in S_{B}$. In this simulation, we set $\rho=0.2$, but later we check other values ranging from $0$ to $0.9$ as well.

To generate the outcome variable in $U$, we consider the following non-linear model:

\[
y_{i}=2f_{k}(x_{i})-d^{2}_{i}+0.5x_{i}d_{i}+\sigma\epsilon_{i} \tag{3.9}
\]

where $\epsilon_{i}\sim N(0,1)$, and $\sigma$ is determined such that the correlation between $y_{i}$ and $2f_{k}(x_{i})-d_{i}^{2}+0.5x_{i}d_{i}$ equals $0.5$ for $i\in U$. The function $f_{k}(\cdot)$ is assumed to take one of the following forms:

\[
SIN:\ f_{1}(x)=\sin(x)\qquad\quad EXP:\ f_{2}(x)=\exp(x/2)\qquad\quad SQR:\ f_{3}(x)=x^{2}/3 \tag{3.10}
\]
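The required value of $\sigma$ has a closed form implied by this setup: writing $\eta_{i}=2f_{k}(x_{i})-d_{i}^{2}+0.5x_{i}d_{i}$ for the systematic component, the independence of $\epsilon_{i}$ and $\eta_{i}$ gives

\[
Corr(y,\eta)=\sqrt{\frac{Var(\eta)}{Var(\eta)+\sigma^{2}}}=0.5
\;\Longrightarrow\;
\sigma=\sqrt{3\,Var(\eta)},
\]

where $Var(\eta)$ is the finite-population variance of $\eta_{i}$ over $U$.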

We then consider an informative sampling strategy with unequal probabilities of inclusion, where the selection mechanisms of $S_B$ and $S_R$ depend on $x$ and $d$, respectively. Thus, each $i\in U$ is assigned two values corresponding to the probabilities of selection into $S_R$ and $S_B$ through a logistic function as below:

\[
\begin{aligned}
\pi^{R}(d_{i})=P(\delta^{R}_{i}=1\,|\,d_{i})&=\frac{\exp\{\gamma_{0}+0.2d^{2}_{i}\}}{1+\exp\{\gamma_{0}+0.2d^{2}_{i}\}}\\[4pt]
\pi_{k}^{A}(x_{i})=P_{k}(\delta^{A}_{i}=1\,|\,x_{i})&=\frac{\exp\{\gamma_{1}+f_{k}(x_{i})\}}{1+\exp\{\gamma_{1}+f_{k}(x_{i})\}}
\end{aligned} \tag{3.11}
\]

where $\delta^{R}_{i}$ and $\delta^{A}_{i}$ are the indicators of selection into $S_{R}$ and $S_{B}$, respectively.

Associated with $S_R$ and $S_B$, independent samples of size $n_{R}=100$ and $n_{A}=1{,}000$ were selected from $U$ according to these inclusion probabilities. The model intercepts, $\gamma_{0}$ and $\gamma_{1}$ in Eq. 3.11, are obtained such that $\sum_{i=1}^{N}\pi^{R}_{i}=n_{R}$ and $\sum_{i=1}^{N}\pi^{A}_{k,i}=n_{A}$. We restrict this simulation to Bayesian analysis based on the proposed PAPW and PAPP methods, but focus on how well the non-parametric Bayesian approach performs relative to the parametric one when the true structure of both underlying models is assumed to be unknown. The rest of the simulation design is similar to that defined in Simulation I, except for the way we specify a working model: we include only the main linear effects of $x$ and $d$ in the PM model, and the main linear effect of $x$ in the QR model. BART's performance is examined under the assumption that the true functional form of both the QR and PM models is unknown, and thus only main effects are included as inputs to BART.
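Putting Eqs. 3.8–3.11 together, a minimal sketch of this data-generating process might look as follows; the population size $N$ and all variable names are our own assumptions, and the SIN form of $f_{k}$ is shown.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2021)
N, rho = 100_000, 0.2          # population size (assumed here) and design correlation
n_R, n_A = 100, 1_000          # target expected sample sizes for S_R and S_B

# Eq. 3.8: correlated design variables (d drives S_R, x drives S_B)
cov = [[1.0, rho], [rho, 1.0]]
d, x = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

f = np.sin                     # Eq. 3.10: the SIN form f_1(x) = sin(x)

# Eq. 3.9: outcome with sigma chosen so corr(y, eta) = 0.5
eta = 2 * f(x) - d**2 + 0.5 * x * d
sigma = np.sqrt(3.0) * eta.std()
y = eta + sigma * rng.standard_normal(N)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Eq. 3.11: calibrate intercepts so inclusion probabilities sum to sample sizes
gamma0 = brentq(lambda g: expit(g + 0.2 * d**2).sum() - n_R, -20, 20)
gamma1 = brentq(lambda g: expit(g + f(x)).sum() - n_A, -20, 20)
pi_R = expit(gamma0 + 0.2 * d**2)
pi_A = expit(gamma1 + f(x))

# Poisson sampling: independent Bernoulli draws with unit-level probabilities
in_S_R = rng.random(N) < pi_R
in_S_B = rng.random(N) < pi_A
```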

The findings of this simulation for the two-step Bayesian approach, with 1,000 MCMC draws and $M=200$, are presented in Table 3. Regarding the non-robust methods, both the QR and PM estimators are unbiased across the three defined functions, i.e., SIN, EXP, and SQR, as long as the working GLM is valid, with the smallest rBias associated with the PAPP method. According to the rSE values, there is evidence that PAPW and PAPP overestimate the variance, and that PM underestimates the variance to some degree, especially under the EXP and SQR scenarios. When the specified GLM is wrong, point estimates are biased for both the QR and PM methods across all three functions. However, BART produces approximately unbiased results with smaller rMSE values than GLM. In general, the PM method outperforms the QR methods under BART with respect to bias, but results based on the PAPP method are more efficient. In addition, BART tends to overestimate the variance under both the QR and PM methods.

Turning to the DR adjustment, Bayesian GLM produces unbiased results across all three defined functions if the working model of either QR or PM holds. However, the variance is slightly underestimated for the SIN function when the specified PM model is wrong, and it is overestimated for the EXP function under all model-specification scenarios. As expected, point estimates are biased when the GLM is misspecified for both QR and PM. BART, in contrast, tends to produce unbiased estimates consistently across all three functions, and the magnitudes of both rBias and rMSE are smaller for the AIPW estimator based on PAPP than for the one based on PAPW. Finally, as with the non-robust methods, the variance under BART is overestimated compared to the GLM.

Table 3: Comparing the performance of the bias adjustment methods and the associated variance estimator under the two-step Bayesian approach in the second simulation study for $\rho=0.2$
SIN EXP SQR
Model-method rBias rMSE crCI rSE rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample ($S_R$)
    Unweighted -17.210 23.109 80.0 0.999 -8.406 11.126 78.3 1.000 -17.302 20.563 65.8 1.002
    Fully weighted -0.623 17.027 94.4 0.987 -0.303 7.947 94.6 0.987 -0.675 13.219 94.0 0.975
Non-probability sample ($S_B$)
    Unweighted 33.063 33.379 0.0 1.003 40.307 40.409 0.0 1.079 49.356 49.570 0.0 1.016
    Fully weighted 0.019 6.010 95.1 1.006 0.005 2.755 94.9 1.005 0.009 3.948 95.0 0.992
Non-robust adjustment
Model specification: True
    GLM–PAPW -0.425 9.257 96.3 1.072 -0.185 4.262 98.7 1.257 -0.325 6.649 98.4 1.213
    GLM–PAPP 0.018 8.460 95.7 1.018 0.040 3.870 98.6 1.238 -0.037 5.914 98.8 1.222
    GLM–PM -0.411 9.899 94.7 0.982 -0.371 4.504 94.4 0.988 -0.762 8.115 92.5 0.947
Model specification: False
   GLM–PAPW 7.180 11.635 86.4 1.027 2.511 5.299 97.2 1.316 52.170 52.559 0.0 1.102
   GLM–PAPP 7.647 11.265 78.0 0.954 3.025 5.425 96.2 1.277 53.095 53.397 0.0 1.122
    BART–PAPW 4.035 10.078 97.0 1.217 2.811 5.129 98.4 1.472 8.356 11.082 97.2 1.468
    BART–PAPP 1.098 8.530 96.7 1.121 1.108 4.120 98.9 1.391 4.482 7.479 98.0 1.401
    GLM–PM 5.870 10.542 87.9 0.972 -6.589 9.264 82.5 0.976 48.993 49.409 0.0 0.994
    BART–PM 0.577 9.635 97.0 1.115 0.087 4.501 97.5 1.155 0.249 8.276 96.1 1.062
Doubly robust adjustment
Model specification: QR–True, PM–True
    GLM–AIPW–PAPW -0.450 9.930 95.8 1.023 -0.165 4.593 98.2 1.200 -0.458 8.116 96.5 1.089
    GLM–AIPW–PAPP -0.452 9.925 95.8 1.020 -0.162 4.592 98.1 1.193 -0.453 8.106 96.5 1.086
Model specification: QR–True, PM–False
    GLM–AIPW–PAPW -0.279 9.996 93.2 0.926 0.310 5.697 98.8 1.303 -0.338 7.128 97.5 1.154
    GLM–AIPW–PAPP -0.134 9.418 94.1 0.961 0.508 4.977 99.5 1.475 -0.275 7.376 97.6 1.152
Model specification: QR–False, PM–True
    GLM–AIPW–PAPW -0.411 10.098 96.1 1.024 -0.176 4.715 98.5 1.234 -0.771 8.122 95.5 1.057
    GLM–AIPW–PAPP -0.417 10.101 96.0 1.021 -0.173 4.705 98.4 1.229 -0.778 8.119 95.4 1.057
Model specification: QR–False, PM–False
   GLM–AIPW–PAPW 9.015 13.176 84.1 1.000 6.735 8.693 94.1 1.456 50.835 51.288 0.0 1.019
   GLM–AIPW–PAPP 9.191 12.717 84.9 1.082 6.787 8.181 96.7 1.761 51.667 52.131 0.0 1.047
   BART–AIPW–PAPW 0.425 10.071 97.9 1.184 0.122 4.689 99.3 1.407 -0.259 8.349 98.0 1.231
   BART–AIPW–PAPP -0.144 9.794 97.8 1.184 -0.100 4.541 99.3 1.405 -0.245 8.329 97.7 1.203
• PAPW: propensity adjusted probability weighting; PAPP: propensity adjusted probability prediction; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.

3.3 Simulation III

Since the non-probability sample in the application of this study is clustered, we performed a third simulation study. To this end, the hypothetical population is assumed to be clustered with $A=10^{3}$ clusters, each of size $n_{\alpha}=10^{3}$ ($N=10^{6}$). Then, three cluster-level covariates, $\{x_{1},x_{2},d\}$, are defined with the following distributions:

\[
\begin{pmatrix}d_{\alpha}\\ x_{0\alpha}\\ x_{1\alpha}\end{pmatrix}\sim MVN\left(\begin{pmatrix}0\\ 0\\ 1\end{pmatrix},\begin{pmatrix}1&-\rho/2&\rho\\ -\rho/2&1&-\rho/2\\ \rho&-\rho/2&1\end{pmatrix}\right),\qquad x_{2\alpha}=I(x_{0\alpha}>0) \tag{3.12}
\]

where $d$ denotes a design variable in $S_R$, and $\{x_{1},x_{2}\}$ describes the selection mechanism in $S_B$. Initially, we set $\rho=0.8$, but later we check other values ranging from $0$ to $0.9$ as well. Note that $\rho$ controls how strongly the sampling design of $S_R$ is associated with that of $S_B$. Furthermore, we assume that both $d$ and $x$ are observed for units of the combined sample $S$.

Again, to be able to assess BART's performance, we introduce non-linear associations with polynomial terms and two-way interactions in the construction of the outcome variables as well as the selection probabilities. Two outcome variables are studied, one continuous ($y_{c}$) and one binary ($y_{b}$), both of which depend on $\{x,d\}$ as below:

\[
y^{c}_{\alpha i}\,|\,x_{\alpha},d_{\alpha} \sim N\!\left(\mu=1+0.5x_{1\alpha}^{2}+0.4x_{1\alpha}^{3}-0.3x_{2\alpha}-0.2x_{1\alpha}x_{2\alpha}-0.1d_{\alpha}+u_{\alpha},\ \sigma^{2}=1\right) \tag{3.13}
\]
\[
y^{b}_{\alpha i}\,|\,x_{\alpha},d_{\alpha} \sim Ber\!\left(p=\frac{\exp\{-1+0.1x_{1\alpha}^{2}+0.2x_{1\alpha}^{3}-0.3x_{2\alpha}-0.4x_{1\alpha}x_{2\alpha}-0.5d_{\alpha}+u_{\alpha}\}}{1+\exp\{-1+0.1x_{1\alpha}^{2}+0.2x_{1\alpha}^{3}-0.3x_{2\alpha}-0.4x_{1\alpha}x_{2\alpha}-0.5d_{\alpha}+u_{\alpha}\}}\right) \tag{3.14}
\]

where $u_{\alpha}\sim N(0,\sigma^{2}_{u})$, and $\sigma^{2}_{u}$ is determined such that the intraclass correlation equals $0.2$ (Oman and Zucker, 2001; Hunsberger et al., 2008). For the continuous outcome, with residual variance $\sigma^{2}=1$, this amounts to solving $ICC=\sigma^{2}_{u}/(\sigma^{2}_{u}+\sigma^{2})=0.2$, which gives $\sigma^{2}_{u}=0.25$. For each $i\in U$, we then consider the following set of selection probabilities associated with the designs of $S_R$ and $S_B$:

\[
\begin{aligned}
\pi^{R}(d_{\alpha})=P(\delta^{R}_{\alpha}=1\,|\,d_{\alpha})&=\frac{\exp\{\gamma_{0}+0.5d_{\alpha}\}}{1+\exp\{\gamma_{0}+0.5d_{\alpha}\}}\\[4pt]
\pi^{B}(x_{\alpha})=P(\delta^{B}_{\alpha}=1\,|\,x_{\alpha})&=\frac{\exp\{\gamma_{1}-0.1x_{1\alpha}+0.2x_{1\alpha}^{2}+0.3x_{2\alpha}-0.4x_{1\alpha}x_{2\alpha}\}}{1+\exp\{\gamma_{1}-0.1x_{1\alpha}+0.2x_{1\alpha}^{2}+0.3x_{2\alpha}-0.4x_{1\alpha}x_{2\alpha}\}}
\end{aligned} \tag{3.15}
\]

where $\delta^{R}_{\alpha}$ and $\delta^{B}_{\alpha}$ are the indicators of selection into $S_R$ and $S_B$, respectively. Associated with $S_R$ and $S_B$, two-stage cluster samples of size $n_{R}=100$ and $n_{B}=10{,}000$ were selected from $U$ with Poisson sampling at the first stage and simple random sampling at the second stage. The sample size per cluster, $n_{\alpha}$, was $1$ and $50$ for $S_R$ and $S_B$, respectively. The model intercepts, $\gamma_{0}$ and $\gamma_{1}$ in Eq. 3.15, are obtained such that $\sum_{\alpha=1}^{A}\pi^{R}_{\alpha}=n_{R}$ and $\sum_{\alpha=1}^{A}\pi^{B}_{\alpha}=n_{B}/n_{\alpha}$.
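A minimal sketch of this two-stage design is shown below, assuming units are stored cluster-contiguously; the function and names are ours.

```python
import numpy as np

rng = np.random.default_rng(7)

def two_stage_sample(pi_cluster, cluster_size, n_per_cluster):
    """Poisson sampling of clusters, then SRS of n_per_cluster units in each.

    pi_cluster    : cluster-level inclusion probabilities (length A)
    cluster_size  : number of units per cluster (10**3 in Simulation III)
    n_per_cluster : second-stage SRS size (1 for S_R, 50 for S_B)
    Returns flat indices of sampled units in the N = A * cluster_size population.
    """
    pi_cluster = np.asarray(pi_cluster)
    sampled_units = []
    # first stage: independent Bernoulli (Poisson sampling) over clusters
    for a in np.flatnonzero(rng.random(pi_cluster.size) < pi_cluster):
        members = np.arange(a * cluster_size, (a + 1) * cluster_size)
        # second stage: simple random sampling without replacement
        sampled_units.append(rng.choice(members, size=n_per_cluster, replace=False))
    return np.concatenate(sampled_units) if sampled_units else np.empty(0, dtype=int)
```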

The rest of the simulation design is similar to that defined in Simulation II, except for the methods we use for point and variance estimation. In addition to the situation where $\pi^{R}_{i}$ is known for $i\in S_{B}$, we consider a situation where $\pi^{R}_{i}$ is unobserved for $i\in S_{B}$ and obtain the estimates based on PAPP. Furthermore, unlike Simulation I, DR estimates are obtained by separately fitting the QR and PM models, and variance estimates come from a bootstrap technique with $B=200$ replicates based on Rao and Wu (1988). Finally, under BART, Rubin's combining rules are employed to derive the point and variance estimates from the random draws of the posterior predictive distribution. As in Simulation II, we consider different scenarios of model specification. To misspecify a model, we include only the main effects in the working model; likewise, under BART, no interaction or polynomial terms are included as inputs.
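For reference, the sketch below applies the standard multiple-imputation combining rules to the posterior predictive draws; we assume the paper's variance formula (Eq. 2.35, with terms $\overline{V}_{W}$ and $V_{B}$) takes this familiar $\overline{W}+(1+1/M)B$ form. Names are ours.

```python
import numpy as np

def rubin_combine(point_draws, var_draws):
    """Rubin's combining rules over M posterior predictive draws.

    point_draws : array of M point estimates, one per posterior draw
    var_draws   : array of M within-draw variance estimates
    Returns the combined point estimate and its total variance.
    """
    point_draws, var_draws = np.asarray(point_draws), np.asarray(var_draws)
    M = point_draws.size
    theta_bar = point_draws.mean()       # combined point estimate
    W = var_draws.mean()                 # average within-draw variance
    B = point_draws.var(ddof=1)          # between-draw variance
    T = W + (1 + 1 / M) * B              # total variance
    return theta_bar, T
```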

The means of the synthesized $U$ for the outcome variables were $\overline{y}^{c}_{U}=3.39$ and $\overline{y}^{b}_{U}=0.40$. Figure 2 compares the bias magnitude and efficiency across the non-robust methods. As illustrated, point estimates from both $S_R$ and $S_B$ are biased if the true sampling weights are ignored. After adjustment, for both the continuous and binary outcomes, the bias is close to zero under both the QR and PM methods when the working model is correct. However, the lengths of the error bars reveal that the proposed PAPW/PAPP method is more efficient than IPSW. When only main effects are included in the model, all adjusted estimates are biased except those based on BART. Note that BART cannot be applied under IPSW. Further details about the simulation results for the non-robust methods are given in Appendix 8.3. We see that IPSW tends to have slightly larger magnitudes of rBias and rMSE for both $y^{c}$ and $y^{b}$. Also, rSE values close to $1$ indicate that Rao and Wu's bootstrap method of variance estimation performs well under both the QR and PM approaches. However, 95% coverage is achieved only when the working model is correct.

In Figure 3, we depict the results of the DR estimators under different permutations of model specification. One can immediately infer that AIPW produces unbiased results when either the PM or QR model holds. Moreover, in situations where the true underlying models for both PM and QR are unknown, the point estimates based on BART remain unbiased under both the PAPW and PAPP approaches. Furthermore, under the GLM, AIPW estimates based on PAPW/PAPP are slightly less biased and more efficient than those based on IPSW when the PM is incorrect. Details of the numerical results, which also compare BART with GLM in the situation where both working models are wrong, can be found in Appendix 8.3. Results showing the performance of the bootstrap variance estimator are provided in Figure 4. The crCI values are all close to the nominal level unless both working models are incorrectly specified. In that case, the BART approach still yields correct variance estimation for the continuous outcome; for the binary outcome, the variance is underestimated and anticonservative, although closer to nominal coverage than the competing methods. To conclude, when neither the PM nor the QR model is known, BART based on PAPP produces unbiased and efficient estimates with accurate variance.

As the final step, we replicate the simulation for values of $\rho$ ranging from $0$ to $0.9$ to show how stably the competing methods perform in terms of rBias and rMSE. Figure 5 depicts changes in the values of rBias and rMSE for the different adjustment methods as $\rho$ increases. Generally, rMSE decreases for all competing methods as $\rho$ increases, but for all values of $\rho$, PAPW and PAPP are less biased than IPSW. It is only when $\rho=0$ for the continuous variable that IPSW outperforms PAPW/PAPP in bias reduction. When $d$ is highly correlated with $x$, there is also evidence of better performance by PAPP than IPSW in terms of bias reduction. We believe this is mainly because a stronger association between $x$ and $d$ implies that the additional ignorability assumption under PAPP is better met, while this correlation induces a form of collinearity in IPSW, leading to a loss of efficiency. The remaining methods did not show significant changes as $\rho$ increases. Numerical values associated with Figure 5 are provided in Appendix 8.3.

Figure 2: Comparing the performance of the non-robust approaches for (a) the continuous outcome ($Y_c$) and (b) the binary outcome ($Y_b$) when the model is correctly specified. Error bars represent the 2.5% and 97.5% percentiles of the empirical distribution of bias over the simulation iterations. UW: unweighted; FW: fully weighted; PM: prediction model; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.
Figure 3: Comparing the performance of the doubly robust estimators under different model-specification scenarios for (a) the continuous outcome ($Y_c$) and (b) the binary outcome ($Y_b$). 95% CIs have been generated based on the 2.5% and 97.5% percentiles of the empirical distribution of bias over the simulation iterations. UW: unweighted; FW: fully weighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.
Figure 4: Comparing the 95% CI coverage rates for the means of (a) continuous outcome and (b) binary outcome and SE ratios for (c) continuous outcome and (d) binary outcome across different DR methods under different model specification scenarios. UW: unweighted; FW: fully weighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.
Figure 5: Comparing the rBias for the means of (a) continuous outcome and (b) binary outcome and rMSE for the means of (c) continuous outcome and (d) binary outcome across different adjustment methods and different values of ρ\rho. UW: unweighted; FW: fully weighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.

4 Strategic Highway Research Program 2 Analysis

We briefly describe SHRP2, the non-probability sample, and the NHTS, the probability sample, as well as the variables used for statistical adjustment.

4.1 Strategic Highway Research Program 2

SHRP2 is the largest naturalistic driving study conducted to date, with the primary aim of assessing how people interact with their vehicle and traffic conditions while driving (Transportation Research Board, 2013). About $A=3{,}140$ drivers aged 16–95 years were recruited from six geographically dispersed sites across the United States (Florida, Indiana, New York, North Carolina, Pennsylvania, and Washington), and over five million trips and 50 million driven miles have been recorded. The average follow-up time per person was $\overline{n}_{\alpha}=440$ days. A quasi-random approach was initially employed to select participants by random cold-calling from a pool of 17,000 pre-registered volunteers. However, because of the low success rate along with budgetary constraints, the investigators later chose to pursue voluntary recruitment. Sites were assigned one of three pre-determined sample sizes according to their population density (Campbell, 2012). The youngest and oldest age groups were oversampled because of the higher crash risk among those subgroups. Thus, one can conclude that the selection mechanism in SHRP2 is a combination of convenience and quota sampling methods. Further description of the study design and recruitment process can be found in Antin et al. (2015).

SHRP2 data are collected in multiple stages. Selected participants are initially asked to complete multiple assessment tests, including executive function and cognition, visual perception, visual-cognitive, physical and psychomotor capabilities, personality factors, sleep-related factors, general medical condition, driving knowledge, etc. In addition, demographic information such as age, gender, household income, education level, and marital status as well as vehicle characteristics such as vehicle type, model year, manufacturer, and annual mileage are gathered at the screening stage. A trip in SHRP2 is defined as the time interval during which the vehicle is operating. The in-vehicle sensors start recording kinematic information, the driver’s behaviors, and traffic events continuously as soon as the vehicle is switched on. Trip-related information such as average speed, duration, distance, and GPS trajectory coordinates are obtained by aggregating the sensor records at the trip level (Campbell, 2012).

4.2 National Household Travel Survey data

In the present study, we use data from the eighth round of the NHTS, conducted from March 2016 through May 2017, as the reference survey. The NHTS is a nationally representative survey, repeated cross-sectionally approximately every seven years. It is aimed at characterizing personal travel behaviors among the civilian, non-institutionalized population of the United States. The 2017 NHTS was a mixed-mode survey, in which households were initially recruited by mail through an address-based sampling (ABS) technique. Within the selected households, all eligible individuals aged 5 years and older were asked to report the trips they made on a randomly assigned weekday through a web-based travel log. Proxy interviews were requested for younger household members who were 15 years old or younger.

The overall sample size was 129,696, of which roughly 20% was used for national representativeness and the remaining 80% was regarded as add-ons for state-level analyses. The recruitment response rate was 30.4%, of whom 51.4% reported their trips via the travel logs (Santos et al., 2011). In the NHTS, a travel day is defined from 4:00 AM of the assigned day to 3:59 AM of the following day on a typical weekday. A trip is defined as one made by a single person using any mode of transportation. While trip distance was measured by online geocoding, the rest of the trip-related information was based on self-reports. A total of 264,234 eligible individuals aged 5 years and older took part in the study, recording 923,572 trips (McGuckin and Fucci, 2018).

4.3 Auxiliary variables and analysis plan

Because of the critical role of auxiliary variables in maintaining the ignorability assumption for the selection mechanism of the SHRP2 sample, particular attention was paid to identifying and constructing as many common variables as possible in the combined sample that are expected to govern both the selection mechanism and the outcome variables in SHRP2. However, since the SHRP2 sample is drawn from a limited geographical area, generalizing the findings to the American population of drivers required us to assume that no auxiliary variables other than those investigated in this study determine the distribution of the outcome variables. This assumption is in fact embedded in the ignorability condition for SHRP2 given the set of observed common covariates. Three distinct sets of variables were considered: (i) demographic information about the drivers, (ii) vehicle characteristics, and (iii) day-level information. These variables and their associated levels/ranges are listed in Table 4.

Our focus was on inference at the day level, so SHRP2 data were aggregated accordingly. We constructed several trip-related outcome variables, such as daily frequency of trips, daily total trip duration, daily total distance driven, mean daily trip average speed, and mean daily start time of trips, that were available in both datasets, as well as daily maximum speed, daily frequency of brakes per mile, and daily percentage of trip time with a full stop, which were available in SHRP2 only. The final sample sizes of the complete day-level datasets were $n_{B}=837{,}061$ in SHRP2 and $n_{R}=133{,}582$ in NHTS.

To make the two datasets more comparable, we filtered out all subjects in the NHTS who were not drivers, were younger than 16 years old, or used public transportation or transportation modes other than cars, SUVs, vans, or light pickup trucks. One major structural difference between the NHTS and SHRP2 was that NHTS participants' trips were recorded for only one randomly assigned weekday, while SHRP2 individuals were followed for several months or years. Therefore, to properly account for the potential intraclass correlation across sample units in SHRP2, we treated SHRP2 participants as clusters for variance estimation. For BART, we fitted a random intercept BART (Tan et al., 2016). In addition, since the $\pi^{R}_{i}$ were not observed for units of SHRP2, we employed the PAPP and IPSW methods to estimate pseudo-weights, and variance estimation under the GLM was based on the Rao and Wu bootstrap method throughout the application section.
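As an illustration of this variance estimation, below is a minimal sketch of a with-replacement cluster (driver-level) bootstrap; it is a simplified stand-in for the Rao and Wu (1988) rescaling bootstrap used in the paper, and the estimator callable and all names are ours.

```python
import numpy as np

def cluster_bootstrap_se(day_data, driver_ids, estimator, B=200, seed=0):
    """Bootstrap SE treating drivers as clusters: resample drivers with
    replacement and re-run the full estimator on the stacked day records.

    day_data   : array of day-level records (rows)
    driver_ids : driver (cluster) label for each row
    estimator  : callable mapping a data subset to a point estimate
    """
    rng = np.random.default_rng(seed)
    day_data = np.asarray(day_data)
    drivers = np.unique(driver_ids)
    rows_of = {d: np.flatnonzero(driver_ids == d) for d in drivers}
    estimates = []
    for _ in range(B):
        boot_drivers = rng.choice(drivers, size=drivers.size, replace=True)
        idx = np.concatenate([rows_of[d] for d in boot_drivers])
        estimates.append(estimator(day_data[idx]))
    return np.std(estimates, ddof=1)
```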

Table 4: List of auxiliary variables and associated levels/ranges that are used to adjust for selection bias in SHRP2
Auxiliary variables (scale) Levels/range
Demographic information
gender (female, male)
age (yrs) (16-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75+)
race (White, Black, other)
ethnicity (Hispanic, non-Hispanic)
birth country (citizen, alien)
education level (≤HS, HS completed, associate, graduate, post-graduate)
household income (0–49k, 50–99k, 100–149k, 150k+)
household size (1, 2, 3-5, 6-10, 10+)
job status (part-time, full time)
home ownership (owner, renter)
pop. size of resid. area (×1,000) (0–49, 50–200, 200–500, 500+)
Vehicle characteristics
age (yrs) (0-4, 5-9, 10-14, 15-19, 20+)
type (passenger car, van, SUV, truck)
make (American, European, Asian)
mileage (×1,000 km) (0–4, 5–9, 10–19, 20–49, 50+)
fuel type (gas, other)
Day-level information
weekend indicator of trip day {0,1}
season of trip day (winter, spring, summer, fall)

4.4 Results

According to Figure 7 of Appendix 8.4, one can visually infer that the largest discrepancies between the sample distribution of auxiliary variables in SHRP2 and that in the population stem from participants' age, race, and population size of residential area, as well as vehicle age and vehicle type. The youngest and oldest age groups were oversampled, as were Whites and non-Hispanics. In addition, we found that the proportion of urban dwellers is higher in SHRP2 than in the NHTS. In terms of vehicle characteristics, SHRP2 participants own passenger cars at a higher rate than the population average, whereas individuals with other vehicle types were underrepresented in SHRP2.

As the first step of QR, we checked whether there is any evidence of a lack of common distributional support between the two studies for the auxiliary variables. Figure 6a compares the kernel density of the estimated PS using BART across the two samples. As illustrated, a notable lack of overlap appears in the left tail of the PS distribution in SHRP2. However, owing to the huge sample size of SHRP2, we believe this does not jeopardize the positivity assumption. The available auxiliary variables are strong predictors of selection into SHRP2: the average pseudo-$R^{2}$ for BART was 73% in a 10-fold cross-validation.
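A minimal sketch of this overlap diagnostic is given below, using scikit-learn's logistic regression as a stand-in for BART; the data arrays and names are our own assumptions, and the reference sample's survey weights are omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_overlap(X_B, X_R):
    """Fit P(Z=1 | X) on the stacked samples (Z=1 for SHRP2, 0 for NHTS) and
    return estimated propensity scores for each sample, whose density overlap
    can then be inspected as in Figure 6a."""
    X = np.vstack([X_B, X_R])
    z = np.concatenate([np.ones(len(X_B)), np.zeros(len(X_R))])
    ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    return ps[: len(X_B)], ps[len(X_B):]
```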

In Figure 6b, we compare the distribution of estimated pseudo-weights across the QR methods. PAPP based on BART appears to be the only method that does not produce influential weights, while the highest variability in the estimated pseudo-weights belongs to the PAPP method under GLM. As can be seen in Figure 13 in Appendix 8.4, across the different trip-related outcome variables, the largest values of the area under the ROC curve (AUC) and of the (pseudo-)$R^{2}$ in the radar plots are associated with BART, versus classification and regression trees (CART) or GLM, when modeling $Z$ on $X$ and $Y$ on $X$, respectively. Additionally, Figure 14 in Appendix 8.4 shows how pseudo-weighting based on PAPP-BART reduces the imbalance in the distribution of $X$ in SHRP2 relative to the weighted distribution of the NHTS.

Figure 6: Comparing the distribution of (a) estimated propensity scores between SHRP2 and NHTS using BART and (b) estimated pseudo-weights in SHRP2 across the applied quasi-randomization methods

Figure 8 depicts the adjusted sample means of trip-related measures that were available in both SHRP2 and the NHTS. The methods compared here encompass PAPP, IPSW, and PM as the non-robust approaches, and AIPW with PAPP and AIPW with IPSW as the DR approaches. A comparison is also made between GLM and BART for all methods except those involving IPSW. Our results suggest that, as expected, the oversampling of younger and older drivers leads to underestimating miles driven and trip length and to overestimating the start time of the first trip of the day; other factors may affect these variables, as well as the average speed of a given drive. For three of these four variables (total trip duration, total distance driven, and start hour of daily trips), there appeared to be improvement with respect to bias, taking the NHTS weighted estimates as the benchmark, although only trip duration appears to be fully corrected. In Figure 7, we display the posterior predictive density of mean daily total distance driven under PAPP, PM, and AIPW-PAPP. Note that the narrow variance associated with the PAPP approach reflects the fact that the posterior predictive distribution under pseudo-weighting does not account for the clustering effects in SHRP2; it is in fact $\overline{V}_{W}$ in Eq. 2.35 that captures this source of uncertainty in variance estimation.

Figure 7: The posterior predictive distributions of the adjusted sample mean of daily total distance driven based on BART

Among the QR methods, we observed that PAPP based on BART gives the most accurate estimate with respect to bias for this variable. However, the relatively narrow 95% CI associated with BART may indicate that BART does not properly propagate the uncertainty in pseudo-weighting. Regarding the PM, BART seems to perform as well as GLM, but with wider uncertainty. As a consequence, the AIPW estimator performs similarly in terms of bias across the different QR methods. The AIPW estimator based on IPSW, on the other hand, seems to be more efficient than the ones based on PAPP. However, these findings are not consistent across the outcome variables.

Results for the adjusted means of some SHRP2-specific outcome variables are summarized in Figure 9. These variables consist of (a) daily maximum speed, (b) frequency of brakes per mile, and (c) percentage of trip duration during which the vehicle is fully stopped. For daily maximum speed, we take one further step and present the DR adjusted means based on IPSW-GLM and PAPP-BART by selected auxiliary variables in Figure 10. As illustrated, higher mean daily maximum speeds are associated with males, the 35–44 years age group, Blacks, high school graduates, Asian-made cars, and weekends. From the lengths of the 95% CIs, one can see that AIPW-PAPP-BART consistently produces more efficient estimates than AIPW-IPSW-GLM. Further numerical details of these findings by the auxiliary variables are provided in Tables 11–17 in Appendix 8.4.

Figure 8: Evaluation of pseudo-weights by comparing weighted estimates of trip-related outcomes between NHTS and SHRP2: (a) Mean daily total trip duration, (b) Mean daily total distance driven, (c) Mean trip average speed, and (d) Mean daily start hour of trips. The dashed line and surrounding shaded area represent weighted estimates and 95% CIs in NHTS, respectively. UW: unweighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting; NA: not applicable.
Figure 9: Adjusted estimates of some SHRP2-specific outcomes: (a) Mean daily maximum speed, (b) daily frequency of brakes per mile driven, and (c) daily percentage of stop time. UW: unweighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting.
Figure 10: Bias adjusted estimates of mean daily maximum speed (MPH) driven by (a) gender, (b) age groups, (c) race, (d) education, (e) vehicle manufacturer, and (f) weekend indicator. UW: unweighted; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting; NA: not applicable.

5 Discussion

In this study, we proposed a doubly robust (DR) adjustment method for finite population inference from non-probability samples when a well-designed probability sample is available as a benchmark. Combining the ideas of pseudo-weighting and prediction modeling, our method involves a modified version of AIPW, which is DR in the sense that estimates are consistent if either underlying model holds. More importantly, the proposed method permits a wider class of predictive tools, especially supervised algorithmic methods. To better address model misspecification, our study employed BART to multiply impute both the pseudo-inclusion probabilities and the outcome variable. We also proposed a method to estimate the variance of the DR estimator based on the posterior predictive draws simulated by BART. In a simulation study, we then assessed the repeated sampling properties of our proposed estimator. Finally, we applied it to real Big Data from a naturalistic driving study with the aim of reducing potential selection bias in estimates of finite population means.

Generally, the simulation findings revealed that our modified AIPW method produces less biased estimates than its competitors, especially when $n_{R}\ll n_{B}$. When at least one of the models, i.e., QR or PM, is correctly specified, all the DR methods generated unbiased results, though our estimator was substantially more efficient, with narrower 95% CIs. When both working models are invalid, our findings suggest that DR estimates based on the GLM can be severely biased; under BART, however, estimates remain approximately unbiased even when the true model structure associated with both QR and PM is unknown to the researcher. In contrast to the conventional IPSW estimator, the proposed estimator produces more stable results in terms of bias and efficiency across different sampling fractions and various degrees of association between the sampling designs of $S_R$ and $S_B$.

The results of the application suggest near-total removal of bias for only one of the four variables that can be estimated from the reference survey (daily total distance driven). We believe this failure originates from several sources. First and foremost, the bias observed in the final estimates is very likely confounded with measurement error, because we compared the results of sensor data with self-reported data as a benchmark. Second, there was evidence of departure from the positivity assumption in SHRP2. Studies show that even a slight lack of common support in the distribution of auxiliary variables may lead to inflated variance and aggravated bias (Hill and Su, 2013). Part of this can be attributed to our attempt to generalize the results to the general population of American drivers, while SHRP2 data were restricted to six states. Another reason might be deviation from the ignorability assumptions: the associations between the auxiliary variables and the outcome variables were relatively weak and varied across the outcomes.

Our study was not without weaknesses. First, our approach assumes the ideal situation where the $d_{i}$ are available in the non-probability sample, since this is demanded by the general theory linking the probability and non-probability samples. In practice it can be difficult to fully meet this requirement; indeed, in many practical settings it might be that only the available subset of $x_{i}^{*}$ is required to fully model selection into the non-probability sample and the outcome variable, or, alternatively, that the available components of $x_{i}^{*}$ provide a much better approximation to the true estimates than simply using the non-probability sample without correction. Second, our adjustment method assumes that the two samples are mutually exclusive. However, in many Big Data scenarios (though not the one we consider), the sampling fraction may be non-trivial, so the two samples may overlap substantially. In such a situation, it is important to check how sensitive our proposed pseudo-weighting approach is to this assumption; extensions may be plausible to account for duplicate units of the population in the pooled sample. Third, our multiple imputation variance estimator (Eq. 2.35) ignores the covariance between $\overline{V}_{W}$ and $V_{B}$ induced by the weights. This covariance is typically negative and leads to conservative inference, as seen in the modest overestimation of variance in the BART estimates in Simulations II and III. A bootstrap procedure such as that described in the simulation study of Chen et al. (2020) may be an alternative, although it is impractical in our setting given the computational demands of fitting the BART models to each bootstrap sample. Another drawback is that the combined dataset may be subject to differential measurement error in the variables. This issue is particularly acute in our SHRP2 analysis, because the definition of a trip may not be identical between the two studies: trip measures in SHRP2 are recorded by sensors, whereas in the NHTS they are based on memory and human estimation, as they are self-reported. Having such error-prone information either as the outcome or as an auxiliary variable may lead to biased results. Finally, we were unable to use the two-step Bayesian method under GLM in the application, because the clustered SHRP2 data would have required Bayesian generalized linear mixed-effects models to properly estimate the variance of the DR estimators, demanding computational resources beyond our reach. This prompted us to apply resampling techniques to the actual data instead of a fully Bayesian method.

There are a number of potential future directions for our research. First, we would like to extend the asymptotic variance estimator under PAPP to the case where $\pi_{i}^{R}$ cannot be computed for $i\in S_{B}$. Alternatively, one may be interested in developing a fully model-based approach, in which a synthetic population is created by undoing the sampling stages via a Bayesian bootstrap method, and the outcome is imputed for non-sampled units of the population (Dong et al., 2014; Zangeneh and Little, 2015; An and Little, 2008). The synthetic population idea makes it easier to incorporate the design features of the reference survey into adjustments, especially when Bayesian inference is of interest. While correcting for selection bias, one can also adjust for potential measurement error in the outcome variables if there exists a validation dataset in which both mismeasured and error-free values of the variables are observed (Kim and Tam, 2020). When combining data from multiple sources, it is also likely that auxiliary variables are subject to differential measurement error; Hong et al. (2017) propose a Bayesian approach to adjust for a related type of measurement error in a causal inference context. Also, in a Big Data setting, fitting models can be computationally demanding, and it might be worth extending divide-and-recombine techniques to the proposed DR methods. Finally, as noted by a reviewer, the basic structure of our problem (see Figure 1) approximates that tackled by “data fusion” methods, developed primarily in the computer science literature. While this literature does not appear to have directly addressed issues around sample design, it may be a useful vein of research to mine for future connections to non-probability sampling research.

6 Acknowledgement

The present study was part of the doctoral research of the first author at the Michigan Program in Survey and Data Science. The authors would like to thank all the researchers and staff involved in collecting the data of both the 2017 NHTS and SHRP2. Our gratitude also goes to Professors Katharine Abraham, Stanley Presser, and Joseph Sedransk at the University of Maryland, whose valuable comments improved the quality of this article. The findings and conclusions of this paper are those of the authors and do not necessarily represent the views of the Virginia Tech Transportation Institute (VTTI), SHRP2, the Transportation Research Board, or the National Academy of Sciences.

7 Conflict of Interest

The authors declare that there was no conflict of interest in the current research.

References

  • Abu-Nimeh et al. (2008) Abu-Nimeh, S., D. Nappa, X. Wang, and S. Nair (2008). Bayesian additive regression trees-based spam detection for enhanced email privacy. In Availability, Reliability and Security, 2008. ARES 08., pp. 1044–1051. IEEE.
  • An and Little (2008) An, H. and R. J. Little (2008). Robust model-based inference for incomplete data via penalized spline propensity prediction. Communications in Statistics–Simulation and Computation 37(9), 1718–1731.
• An (2010) An, W. (2010). Bayesian propensity score estimators: Incorporating uncertainties in propensity scores into causal inference. Sociological Methodology 40(1), 151–189.
  • Antin et al. (2015) Antin, J., K. Stulce, L. Eichelberger, and J. Hankey (2015). Naturalistic driving study: descriptive comparison of the study sample with national data. Technical report.
• Baker et al. (2013) Baker, R., J. M. Brick, N. A. Bates, M. Battaglia, M. P. Couper, J. Dever, K. J. Gile, and R. Tourangeau (2013). Summary report of the AAPOR task force on non-probability sampling. Journal of Survey Statistics and Methodology 1, 90–143.
  • Bang and Robins (2005) Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61(4), 962–973.
  • Beresewicz et al. (2018) Beresewicz, M., R. Lehtonen, F. Reis, L. Di Consiglio, and M. Karlberg (2018). An overview of methods for treating selectivity in big data sources.
  • Brick (2015) Brick, J. M. (2015). Compositional model inference. JSM Proceedings (Survey Research Methods Section), 299–307.
• Campbell (2012) Campbell, K. L. (2012). The SHRP 2 naturalistic driving study: Addressing driver performance and behavior in traffic safety. TR News (282), 30–35.
  • Cao et al. (2009) Cao, W., A. A. Tsiatis, and M. Davidian (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96(3), 723–734.
  • Cassel et al. (1976) Cassel, C. M., C. E. Särndal, and J. H. Wretman (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63(3), 615–620.
  • Chen et al. (2020) Chen, Y., P. Li, and C. Wu (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association 115, 2011–2021.
  • Chipman et al. (2007) Chipman, H. A., E. I. George, and R. E. McCulloch (2007). Bayesian ensemble learning. In Advances in neural information processing systems, pp. 265–272.
• Chipman et al. (2010) Chipman, H. A., E. I. George, R. E. McCulloch, et al. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266–298.
• Couper (2013) Couper, M. (2013). Is the sky falling? New technology, changing media, and the future of surveys. Keynote presentation at the 5th European Survey Research Association Conference, Ljubljana, Slovenia.
  • Daas et al. (2015) Daas, P. J., M. J. Puts, B. Buelens, and P. A. van den Hurk (2015). Big data as a source for official statistics. Journal of Official Statistics 31(2), 249–262.
  • Dawid (1982) Dawid, A. P. (1982). The well-calibrated bayesian. Journal of the American Statistical Association 77(379), 605–610.
  • Dong et al. (2014) Dong, Q., M. R. Elliott, and T. E. Raghunathan (2014). A nonparametric method to generate synthetic populations to adjust for complex sampling design features. Survey Methodology 40(1), 29.
  • Dutwin and Buskirk (2017) Dutwin, D. and T. D. Buskirk (2017). Apples to oranges or gala versus golden delicious? comparing data quality of nonprobability internet samples to low response rate probability samples. Public Opinion Quarterly 81(S1), 213–239.
  • Elliott and Valliant (2017) Elliott, M. R. and R. Valliant (2017). Inference for nonprobability samples. Statistical Science 32(2), 249–264.
  • Ferrari and Cribari-Neto (2004) Ferrari, S. and F. Cribari-Neto (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31(7), 799–815.
  • Franke et al. (2016) Franke, B., J.-F. Plante, R. Roscher, E.-s. A. Lee, C. Smyth, A. Hatefi, F. Chen, E. Gil, A. Schwing, A. Selvitella, et al. (2016). Statistical inference, learning and models in big data. International Statistical Review 84(3), 371–389.
  • Gelman et al. (2007) Gelman, A. et al. (2007). Struggles with survey weighting and regression modeling. Statistical Science 22(2), 153–164.
  • Green and Kern (2012) Green, D. P. and H. L. Kern (2012). Modeling heterogeneous treatment effects in survey experiments with bayesian additive regression trees. Public Opinion Quarterly 76(3), 491–511.
  • Groves (2011) Groves, R. M. (2011). Three eras of survey research. Public Opinion Quarterly 75(5), 861–871.
  • Guo et al. (2009) Guo, F., J. M. Hankey, et al. (2009). Modeling 100-car safety events: A case-based approach for analyzing naturalistic driving data. Technical report, Virginia Tech. Virginia Tech Transportation Institute.
  • Han and Wang (2013) Han, P. and L. Wang (2013). Estimation with missing data: beyond double robustness. Biometrika 100(2), 417–430.
  • Haziza and Rao (2006) Haziza, D. and J. N. Rao (2006). A nonresponse model approach to inference under imputation for missing survey data. Survey Methodology 32(1), 53.
  • Hill and Su (2013) Hill, J. and Y.-S. Su (2013). Assessing lack of common support in causal inference using bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children’s cognitive outcomes. The Annals of Applied Statistics, 1386–1420.
  • Hill et al. (2011) Hill, J., C. Weiss, and F. Zhai (2011). Challenges with propensity score strategies in a high-dimensional setting and a potential alternative. Multivariate Behavioral Research 46(3), 477–513.
  • Hong et al. (2017) Hong, H., K. E. Rudolph, and E. A. Stuart (2017). Bayesian approach for addressing differential covariate measurement error in propensity score methods. Psychometrika 82(4), 1078–1096.
  • Huisingh et al. (2019) Huisingh, C., C. Owsley, E. B. Levitan, M. R. Irvin, P. Maclennan, and G. McGwin (2019). Distracted driving and risk of crash or near-crash involvement among older drivers using naturalistic driving data with a case-crossover study design. The Journals of Gerontology. Series A, Biological Sciences and Medical Sciences 74, 550–555.
  • Hunsberger et al. (2008) Hunsberger, S., B. I. Graubard, and E. L. Korn (2008). Testing logistic regression coefficients with clustered data and few positive outcomes. Statistics in Medicine 27(8), 1305–1324.
• Japec et al. (2015) Japec, L., F. Kreuter, M. Berg, P. Biemer, P. Decker, C. Lampe, J. Lane, C. O’Neil, and A. Usher (2015). Big data in survey research: AAPOR task force report. Public Opinion Quarterly 79(4), 839–880.
  • Johnson and Smith (2017) Johnson, T. P. and T. W. Smith (2017). Big data and survey research: Supplement or substitute? In Seeing Cities Through Big Data, pp.  113–125. Springer.
  • Kang et al. (2007) Kang, J. D., J. L. Schafer, et al. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22(4), 523–539.
  • Kaplan and Chen (2012) Kaplan, D. and J. Chen (2012). A two-step bayesian approach for propensity score analysis: Simulations and case study. Psychometrika 77(3), 581–609.
  • Kern et al. (2016) Kern, H. L., E. A. Stuart, J. Hill, and D. P. Green (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness 9(1), 103–127.
  • Kim and Haziza (2014) Kim, J. K. and D. Haziza (2014). Doubly robust inference with missing data in survey sampling. Statistica Sinica 24(1), 375–394.
  • Kim and Park (2006) Kim, J. K. and H. Park (2006). Imputation using response probability. Canadian Journal of Statistics 34(1), 171–182.
  • Kim et al. (2018) Kim, J. K., S. Park, Y. Chen, and C. Wu (2018). Combining non-probability and probability survey samples through mass imputation. arXiv preprint arXiv:1812.10694.
  • Kim and Rao (2012) Kim, J. K. and J. N. Rao (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika 99(1), 85–100.
  • Kim and Tam (2020) Kim, J.-k. and S.-M. Tam (2020). Data integration by combining big data and survey sample data for finite population inference. arXiv preprint arXiv:2003.12156.
  • Kim et al. (2018) Kim, J. K., Z. Wang, Z. Zhu, and N. B. Cruze (2018). Combining survey and non-survey data for improved sub-area prediction using a multi-level model. Journal of Agricultural, Biological and Environmental Statistics 23(2), 175–189.
  • Kitchin (2015) Kitchin, R. (2015). The opportunities, challenges and risks of big data for official statistics. Statistical Journal of the IAOS 31(3), 471–481.
  • Kott (1994) Kott, P. S. (1994). A note on handling nonresponse in sample surveys. Journal of the American Statistical Association 89(426), 693–696.
  • Kott (2006) Kott, P. S. (2006). Using calibration weighting to adjust for nonresponse and coverage errors. Survey Methodology 32(2), 133–142.
  • Kott and Chang (2010) Kott, P. S. and T. Chang (2010). Using calibration weighting to adjust for nonignorable unit nonresponse. Journal of the American Statistical Association 105(491), 1265–1275.
  • Kreuter and Peng (2014) Kreuter, F. and R. D. Peng (2014). Extracting information from big data: Issues of measurement, inference and linkage. Privacy, Big Data, and the Public Good: Frameworks for Engagement, 257.
  • Lane (2016) Lane, J. (2016). Big data for public policy: The quadruple helix. Journal of Policy Analysis and Management 35(3), 708–715.
  • Lee (2006) Lee, S. (2006). Propensity score adjustment as a weighting scheme for volunteer panel web surveys. Journal of Official Statistics 22(2), 329.
  • Lee and Valliant (2009) Lee, S. and R. Valliant (2009). Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociological Methods & Research 37(3), 319–343.
  • Lenis et al. (2018) Lenis, D., B. Ackerman, and E. A. Stuart (2018). Measuring model misspecification: Application to propensity score methods with complex survey data. Computational Statistics & Data Analysis.
  • Little (2004) Little, R. J. (2004). To model or not to model? competing modes of inference for finite population sampling. Journal of the American Statistical Association 99(466), 546–556.
  • Little and Zheng (2007) Little, R. J. and H. Zheng (2007). The bayesian approach to the analysis of finite population surveys. Bayesian Statistics 8(1), 1–20.
  • Lohr and Raghunathan (2017) Lohr, S. L. and T. E. Raghunathan (2017). Combining survey data with other data sources. Statistical Science 32(2), 293–312.
  • Mayer et al. (2020) Mayer, I., E. Sverdrup, T. Gauss, J.-D. Moyer, S. Wager, J. Josse, et al. (2020). Doubly robust treatment effect estimation with missing attributes. Annals of Applied Statistics 14(3), 1409–1431.
  • McCandless et al. (2009) McCandless, L. C., P. Gustafson, and P. C. Austin (2009). Bayesian propensity score analysis for observational data. Statistics in Medicine 28(1), 94–112.
  • McConnell and Lindner (2019) McConnell, K. J. and S. Lindner (2019). Estimating treatment effects with machine learning. Health Services Research 54(6), 1273–1282.
  • McGuckin and Fucci (2018) McGuckin, N. and A. Fucci (2018). Summary of travel trends: 2017 national household travel survey (report fhwa-pl-18-019). Washington, DC: Federal Highway Administration, US Department of Transportation.
  • Meng et al. (2018) Meng, X.-L. et al. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics 12(2), 685–726.
  • Mercer (2018) Mercer, A. W. (2018). Selection Bias in Nonprobability Surveys: A Causal Inference Approach. Ph. D. thesis.
  • Miller (2017) Miller, P. V. (2017). Is there a future for surveys? Public Opinion Quarterly 81(S1), 205–212.
  • Murdoch and Detsky (2013) Murdoch, T. B. and A. S. Detsky (2013). The inevitable application of big data to health care. Journal of the American Medical Association 309(13), 1351–1352.
  • Transportation Research Board of the National Academies of Sciences (2013). The 2nd Strategic Highway Research Program Naturalistic Driving Study Dataset.
  • Oman and Zucker (2001) Oman, S. D. and D. M. Zucker (2001). Modelling and generating correlated binary variables. Biometrika 88(1), 287–290.
  • Rafei et al. (2020) Rafei, A., C. A. Flannagan, and M. R. Elliott (2020). Big data for finite population inference: Applying quasi-random approaches to naturalistic driving data using bayesian additive regression trees. Journal of Survey Statistics and Methodology 8(1), 148–180.
  • Rao (2015) Rao, J. N. (2015). Small-Area Estimation. New York: Wiley.
  • Rao and Wu (1988) Rao, J. N. and C. Wu (1988). Resampling inference with complex survey data. Journal of the American Statistical Association 83(401), 231–241.
  • Rivers (2007) Rivers, D. (2007). Sampling for web surveys. In Joint Statistical Meetings.
  • Robins et al. (1994) Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55.
  • Rubin (2004) Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys. New York: Wiley.
  • Rubin (2007) Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in Medicine 26(1), 20–36.
  • Saarela et al. (2016) Saarela, O., L. R. Belzile, and D. A. Stephens (2016). A bayesian view of doubly robust causal inference. Biometrika 103(3), 667–681.
  • Santos et al. (2011) Santos, A., N. McGuckin, H. Y. Nakamoto, D. Gray, and S. Liss (2011). Summary of travel trends: 2009 national household travel survey. Technical report.
  • Scharfstein et al. (1999) Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94(448), 1096–1120.
  • Senthilkumar et al. (2018) Senthilkumar, S., B. K. Rai, A. A. Meshram, A. Gunasekaran, and S. Chandrakumarmangalam (2018). Big data in healthcare management: A review of literature. American Journal of Theoretical and Applied Business 4(2), 57–69.
  • Smith (1983) Smith, T. (1983). On the validity of inferences from non-random samples. Journal of the Royal Statistical Society A146, 394–403.
  • Spertus and Normand (2018) Spertus, J. V. and S.-L. T. Normand (2018). Bayesian propensity scores for high-dimensional causal inference: A comparison of drug-eluting to bare-metal coronary stents. Biometrical Journal 60, 721–733.
  • Struijs et al. (2014) Struijs, P., B. Braaksma, and P. J. Daas (2014). Official statistics and big data. Big Data & Society 1(1), 1–6.
  • Tan et al. (2017) Tan, Y. V., M. R. Elliott, and C. A. Flannagan (2017). Development of a real-time prediction model of driver behavior at intersections using kinematic time series data. Accident Analysis & Prevention 106, 428–436.
  • Tan et al. (2016) Tan, Y. V., C. A. Flannagan, and M. R. Elliott (2016). Predicting human-driving behavior to help driverless vehicles drive: random intercept bayesian additive regression trees. arXiv preprint arXiv:1609.07464.
  • Tan et al. (2019) Tan, Y. V., C. A. Flannagan, and M. R. Elliott (2019). “Robust-squared” imputation models using BART. Journal of Survey Statistics and Methodology 7(4), 465–497.
  • Tan (2006) Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101(476), 1619–1637.
  • Valliant (2020) Valliant, R. (2020). Comparing alternatives for estimation from nonprobability samples. Journal of Survey Statistics and Methodology 8, 231–263.
  • Valliant and Dever (2011) Valliant, R. and J. A. Dever (2011). Estimating propensity adjustments for volunteer web surveys. Sociological Methods & Research 40(1), 105–137.
  • Wang et al. (2015) Wang, W., D. Rothschild, S. Goel, and A. Gelman (2015). Forecasting elections with non-representative polls. International Journal of Forecasting 31(3), 980–991.
  • Wendling et al. (2018) Wendling, T., K. Jung, A. Callahan, A. Schuler, N. Shah, and B. Gallego (2018). Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine.
  • Wu and Sitter (2001) Wu, C. and R. R. Sitter (2001). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association 96(453), 185–193.
  • Yang and Kim (2018) Yang, S. and J. K. Kim (2018). Integration of survey data and big observational data for finite population inference using mass imputation. arXiv preprint arXiv:1807.02817.
  • Yang et al. (2020) Yang, S., J. K. Kim, and R. Song (2020). Doubly robust inference when combining probability and non-probability samples with high dimensional data. Journal of the Royal Statistical Society B82, 445–465.
  • Zangeneh and Little (2015) Zangeneh, S. Z. and R. J. Little (2015). Bayesian inference for the finite population total from a heteroscedastic probability proportional to size sample. Journal of Survey Statistics and Methodology 3(2), 162–192.
  • Zhang and Little (2011) Zhang, G. and R. Little (2011). A comparative study of doubly robust estimators of the mean with missing data. Journal of Statistical Computation and Simulation 81(12), 2039–2058.
  • Zhou et al. (2020) Zhou, Q., C. McNeal, L. A. Copeland, J. P. Zachariah, and J. J. Song (2020). Bayesian propensity score analysis for clustered observational data. Statistical Methods & Applications 29(2), 335–355.
  • Zigler et al. (2013) Zigler, C. M., K. Watts, R. W. Yeh, Y. Wang, B. A. Coull, and F. Dominici (2013). Model feedback in bayesian propensity score estimation. Biometrics 69(1), 263–273.

8 Appendix

8.1 Theoretical proofs

Suppose there exists an infinite sequence of finite populations $U_{\nu}$ of sizes $N_{\nu}$ with $\nu=1,2,\ldots,\infty$. Corresponding to $U_{\nu}$ are a non-probability sample $S_{B,\nu}$ and a probability sample $S_{R,\nu}$, with $n_{B,\nu}$ and $n_{R,\nu}$ being the respective sample sizes. Also, let us assume that $N_{\nu}\to\infty$, $n_{B,\nu}\to\infty$ and $n_{R,\nu}\to\infty$ as $\nu\to\infty$, while $n_{B,\nu}/N_{\nu}\to f_{B}$ and $n_{R,\nu}/N_{\nu}\to f_{R}$ with $0<f_{R}<1$ and $0<f_{B}<1$. From now on, we suppress the subscript $\nu$ for notational simplicity. In order to be able to make unbiased inference based on $S_{B}$, we consider the following conditions:

  1. The set of observed auxiliary variables, $X$, fully governs the selection mechanism in $S_{B}$. This is called an ignorability condition, implying $p(\delta^{B}_{i}=1|y_{i},x_{i})=p(\delta^{B}_{i}=1|x_{i})$ for $i\in U$.

  2. $S_{B}$ actually does have a probability sampling mechanism, albeit an unknown one. This means $p(\delta^{B}_{i}=1|x_{i})>0$ for all $i\in U$.

  3. Units of $S_{R}$ and $S_{B}$ are selected independently from $U$ given the observed auxiliary variables, $X^{*}$, i.e. $\delta^{R}_{i}\perp\delta^{B}_{j}\,|\,X^{*}$ for $i\neq j$.

  4. The sampling fractions, $f_{R}$ and $f_{B}$, are small enough that the possible overlap between $S_{R}$ and $S_{B}$ is negligible, i.e. $S_{R}\cap S_{B}=\emptyset$.

  5. The true underlying models for $Y|X^{*}$, $\delta^{B}|X$, and $\delta^{R}|X$ are known.

In addition, to be able to derive the asymptotic properties of the proposed estimators, we consider the following regularity conditions, following Chen et al. (2020):

  1. For any given $x$, $\partial m(x;\theta)/\partial\theta$ exists and is continuous with respect to $\theta$, and $|\partial m(x;\theta)/\partial\theta|\leq h(x;\theta)$ for $\theta$ in a neighborhood of the true parameter value, with $\sum_{i=1}^{N}h(x_{i};\theta)=O(1)$.

  2. For any given $x$, $\partial^{2}m(x;\theta)/\partial\theta\partial\theta^{T}$ exists and is continuous with respect to $\theta$, and $\max_{j,l}|\partial^{2}m(x;\theta)/\partial\theta_{j}\partial\theta_{l}|\leq k(x;\theta)$ for $\theta$ in a neighborhood of the true parameter value, with $\sum_{i=1}^{N}k(x_{i};\theta)=O(1)$.

  3. For $u_{i}=\{x_{i},y_{i},m(x_{i};\theta)\}$, the finite population and the sampling design in $S_{R}$ satisfy $N^{-1}\sum_{i=1}^{n_{R}}u_{i}/\pi^{R}_{i}-N^{-1}\sum_{i=1}^{N}u_{i}=O_{p}(n_{R}^{-1/2})$.

  4. There exist constants $c_{1}$ and $c_{2}$ such that $0<c_{1}\leq N\pi^{B}_{i}/n_{B}\leq c_{2}$ and $0<c_{1}\leq N\pi^{R}_{i}/n_{R}\leq c_{2}$ for all $i\in U$.

  5. The finite population and the propensity scores satisfy $N^{-1}\sum_{i=1}^{N}y_{i}^{2}=O(1)$, $N^{-1}\sum_{i=1}^{N}\|x_{i}\|^{3}=O(1)$, and $N^{-1}\sum_{i=1}^{N}\pi^{B}_{i}(1-\pi^{B}_{i})x_{i}x_{i}^{T}$ is a positive definite matrix.

Note that while we assume $\pi^{R}_{i}$ is calculable for $i\in S_{B}$ throughout the proofs, extensions can be provided for situations where $\pi^{R}_{i}$ needs to be predicted for $i\in S_{B}$.

8.1.1 Asymptotic properties of the PAPW estimator

Since $\widehat{\beta}_{1}$ is the maximum likelihood estimate of $\beta_{1}$ in the logistic regression of $Z_{i}$ on $x^{*}_{i}$, it is clear that $\widehat{\beta}_{1}\overset{p}{\to}\beta_{1}$. Two immediate results are that $\widehat{\pi}^{B}_{i}\overset{p}{\to}\pi^{B}_{i}$ and $E(\widehat{\pi}^{B}_{i}|x^{*}_{i})=\pi^{B}_{i}$, where $\widehat{\pi}^{B}_{i}$ is defined as in (2.8). Now, we prove the consistency and asymptotic unbiasedness of the PAPW estimator in (2.16). To this end, we show that $\widehat{\overline{y}}_{PAPW}-\overline{y}_{U}=O_{p}(n_{B}^{-1/2})$. Consider the following set of estimating equations:

$$
\Phi_{n}(\eta)=\begin{bmatrix}n^{-1}\sum_{i=1}^{n}Z_{i}(y_{i}-\overline{y}_{U})/\pi^{B}_{i}\\ n^{-1}\sum_{i=1}^{n}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\end{bmatrix}=\begin{bmatrix}N^{-1}\sum_{i=1}^{N}\delta^{B}_{i}(y_{i}-\overline{y}_{U})/\pi^{B}_{i}\\ N^{-1}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\end{bmatrix}=0
\qquad (8.1)
$$

where $\eta=(\overline{y}_{U},\beta_{1})$.

In the following, we show that $E_{\delta^{B}}[\Phi_{n}(\widehat{\eta})|x^{*}_{i}]=0$. We start with the first component of $\Phi_{n}(\widehat{\eta})$:

$$
E_{\delta^{B}}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}}\,\Bigg|\,x^{*}_{i}\right]
=\frac{1}{N}\sum_{i=1}^{N}\frac{E_{\delta^{B}}(\delta^{B}_{i}|x^{*}_{i})(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}}
=\frac{1}{N}\sum_{i=1}^{N}\frac{\pi^{B}_{i}(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}}
=0
$$

Noting that $E_{\delta^{B}}[\Phi_{n}(\widehat{\eta})]=E_{\delta}[E_{Z}\{\Phi_{n}(\widehat{\eta})|\delta_{i}=1\}]$, for the second component we have

$$
\begin{aligned}
E_{\delta^{B}}\left[\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\,\Bigg|\,x^{*}_{i}\right]
&=E_{\delta}\left[E_{Z}\bigg\{\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\,\Big|\,\delta_{i}=1,x^{*}_{i}\bigg\}\right]\\
&=E_{\delta}\left[\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{E_{Z}(Z_{i}|\delta_{i}=1,x^{*}_{i})-p_{i}(\beta_{1})\}x^{*}_{i}\right]\\
&=E_{\delta}\left[\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{p_{i}(\beta_{1})-p_{i}(\beta_{1})\}x^{*}_{i}\right]\\
&=0
\end{aligned}
\qquad (8.2)
$$

Now, we apply a first-order Taylor approximation to $\Phi_{n}(\widehat{\eta})$ around $\eta_{1}$:

$$
\widehat{\eta}-\eta_{1}=[E\{\phi_{n}(\eta_{1})\}]^{-1}\Phi_{n}(\eta_{1})+O_{p}(n_{B}^{-1/2})
\qquad (8.3)
$$

where $\phi_{n}(\eta)=\partial\Phi_{n}(\eta)/\partial\eta$. The components of $\phi_{n}(\eta)$ are

$$
\frac{\partial}{\partial\overline{y}_{U}}\left[\frac{1}{N}\sum_{i=1}^{N}\delta^{B}_{i}\frac{(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}}\right]=-\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}}{\pi^{B}_{i}}
\qquad (8.4)
$$
$$
\frac{\partial}{\partial\beta_{1}}\left[\frac{1}{N}\sum_{i=1}^{N}\delta^{B}_{i}\frac{(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}}\right]
=\frac{\partial}{\partial\beta_{1}}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}}{\pi^{R}_{i}}\bigg\{\frac{1-p_{i}(\beta_{1})}{p_{i}(\beta_{1})}\bigg\}(y_{i}-\overline{y}_{U})\right]
=-\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}}{\pi^{B}_{i}}(y_{i}-\overline{y}_{U})x_{i}^{*T}
\qquad (8.5)
$$
$$
\frac{\partial}{\partial\beta_{1}}\left[\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\right]=-\frac{1}{N}\sum_{i=1}^{N}\delta_{i}p_{i}(\beta_{1})\{1-p_{i}(\beta_{1})\}x^{*}_{i}x_{i}^{*T}
\qquad (8.6)
$$

Therefore, we have

$$
\phi_{n}(\eta_{1})=\begin{pmatrix}-\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}}{\pi^{B}_{i}}&-\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}}{\pi^{B}_{i}}(y_{i}-\overline{y}_{U})x_{i}^{*T}\\ 0&-\frac{1}{N}\sum_{i=1}^{N}\delta_{i}p_{i}(\beta_{1})\{1-p_{i}(\beta_{1})\}x^{*}_{i}x_{i}^{*T}\end{pmatrix}
\qquad (8.7)
$$

Thus, it follows that $\widehat{\overline{y}}_{PAPW}=\overline{y}_{U}+O_{p}(n_{B}^{-1/2})$.

Now, we turn to deriving the asymptotic variance estimator for $\widehat{\overline{y}}_{PAPW}$. According to the sandwich formula, we have

$$
Var(\widehat{\eta}_{1})=\left[E\{\phi_{n}(\eta_{1})\}\right]^{-1}Var\{\Phi_{n}(\eta_{1})\}\left[E\{\phi_{n}(\eta_{1})\}^{T}\right]^{-1}+O_{p}(n_{B}^{-1})
\qquad (8.8)
$$

Given the fact that

$$
p(\delta_{i}=1|x^{*}_{i})=\frac{p(\delta^{B}_{i}=1|x^{*}_{i})}{p(Z_{i}=1|\delta_{i}=1,x^{*}_{i})}=\frac{\pi^{B}_{i}}{p_{i}(\beta_{1})}=\frac{\pi^{R}_{i}}{1-p_{i}(\beta_{1})}
\qquad (8.9)
$$

it can be shown that

$$
E\{\phi_{n}(\eta_{1})\}=\begin{pmatrix}-1&-\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\overline{y}_{U})x_{i}^{*T}\\ 0&-\frac{1}{N}\sum_{i=1}^{N}\pi^{B}_{i}\{1-p_{i}(\beta_{1})\}x^{*}_{i}x_{i}^{*T}\end{pmatrix}
\qquad (8.10)
$$

and

$$
\left[E\{\phi_{n}(\eta_{1})\}\right]^{-1}=\begin{pmatrix}-1&b^{T}\\ 0&-\left[\frac{1}{N}\sum_{i=1}^{N}\pi^{B}_{i}\{1-p_{i}(\beta_{1})\}x^{*}_{i}x_{i}^{*T}\right]^{-1}\end{pmatrix}
\qquad (8.11)
$$

where

$$
b^{T}=\bigg\{\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\overline{y}_{U})x_{i}^{*T}\bigg\}\bigg\{\frac{1}{N}\sum_{i=1}^{N}\pi^{B}_{i}\{1-p_{i}(\beta_{1})\}x^{*}_{i}x_{i}^{*T}\bigg\}^{-1}
\qquad (8.12)
$$

Now, the goal is to calculate $Var\{\Phi_{n}(\eta_{1})\}$. We know that

$$
Var_{\delta^{B}}\left(\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}}\,\Bigg|\,x^{*}_{i}\right)
=\frac{1}{N^{2}}\sum_{i=1}^{N}\frac{(y_{i}-\overline{y}_{U})^{2}}{(\pi^{B}_{i})^{2}}\pi^{B}_{i}(1-\pi^{B}_{i})
=\frac{1}{N^{2}}\sum_{i=1}^{N}\bigg\{\frac{1-\pi^{B}_{i}}{\pi^{B}_{i}}\bigg\}(y_{i}-\overline{y}_{U})^{2}
\qquad (8.13)
$$

$$
\begin{aligned}
Var_{\delta^{B}}\left(\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\,\Bigg|\,x^{*}_{i}\right)
&=E_{\delta}\left[Var_{Z}\left(\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\,\Big|\,\delta_{i}=1,x^{*}_{i}\right)\right]\\
&\quad+Var_{\delta}\left[E_{Z}\left(\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\,\Big|\,\delta_{i}=1,x^{*}_{i}\right)\right]\\
&=\frac{1}{N^{2}}E_{\delta}\left(\sum_{i=1}^{N}\delta^{2}_{i}Var_{Z}(Z_{i})x^{*}_{i}x_{i}^{*T}\,\Bigg|\,x^{*}_{i}\right)+0\\
&=\frac{1}{N^{2}}\sum_{i=1}^{N}\pi^{R}_{i}p_{i}(\beta_{1})x^{*}_{i}x_{i}^{*T}
\end{aligned}
\qquad (8.14)
$$

$$
\begin{aligned}
Cov\left(\frac{1}{N}\sum_{i=1}^{N}\frac{\delta^{B}_{i}(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}},\frac{1}{N}\sum_{i=1}^{N}\delta_{i}\{Z_{i}-p_{i}(\beta_{1})\}x^{*}_{i}\,\Bigg|\,x^{*}_{i}\right)
&=E_{\delta}\left[E_{Z}\left(\frac{1}{N^{2}}\sum_{i=1}^{N}\delta_{i}\frac{Z_{i}\{Z_{i}-p_{i}(\beta_{1})\}(y_{i}-\overline{y}_{U})}{\pi^{B}_{i}}x^{*}_{i}\,\Big|\,\delta_{i}=1,x^{*}_{i}\right)\right]\\
&=\frac{1}{N^{2}}\sum_{i=1}^{N}\{1-p_{i}(\beta_{1})\}(y_{i}-\overline{y}_{U})x^{*}_{i}
\end{aligned}
\qquad (8.15)
$$

Therefore, we have

$$
Var\{\Phi_{n}(\eta_{1})\}=\begin{pmatrix}\frac{1}{N^{2}}\sum_{i=1}^{N}\{(1-\pi^{B}_{i})/\pi^{B}_{i}\}(y_{i}-\overline{y}_{U})^{2}&\frac{1}{N^{2}}\sum_{i=1}^{N}\{1-p_{i}(\beta_{1})\}(y_{i}-\overline{y}_{U})x_{i}^{*T}\\ \frac{1}{N^{2}}\sum_{i=1}^{N}\{1-p_{i}(\beta_{1})\}(y_{i}-\overline{y}_{U})x^{*}_{i}&\frac{1}{N^{2}}\sum_{i=1}^{N}\pi^{B}_{i}\{1-p_{i}(\beta_{1})\}x^{*}_{i}x_{i}^{*T}\end{pmatrix}
\qquad (8.16)
$$

The asymptotic variance of $\widehat{\overline{y}}_{PAPW}$ is then given by

$$
Var\big\{\widehat{\overline{y}}_{PAPW}\big\}=\frac{1}{N^{2}}\sum_{i=1}^{N}\bigg\{\frac{1-\pi^{B}_{i}}{\pi^{B}_{i}}\bigg\}(y_{i}-\overline{y}_{U})^{2}-\frac{2b^{T}}{N^{2}}\sum_{i=1}^{N}\{1-p_{i}(\beta_{1})\}(y_{i}-\overline{y}_{U})x^{*}_{i}+b^{T}\left[\frac{1}{N^{2}}\sum_{i=1}^{N}\pi^{B}_{i}\{1-p_{i}(\beta_{1})\}x^{*}_{i}x_{i}^{*T}\right]b
\qquad (8.17)
$$

To obtain a variance estimate based on the observed samples $S_{B}$ and $S_{R}$, we substitute the population components with their sample-based estimates:

$$
\widehat{Var}\big\{\widehat{\overline{y}}_{PAPW}\big\}=\frac{1}{N^{2}}\sum_{i=1}^{n_{B}}\{1-\widehat{\pi}^{B}_{i}\}\left(\frac{y_{i}-\overline{y}_{U}}{\widehat{\pi}^{B}_{i}}\right)^{2}-\frac{2\widehat{b}^{T}}{N^{2}}\sum_{i=1}^{n_{B}}\{1-p_{i}(\widehat{\beta}_{1})\}\left(\frac{y_{i}-\overline{y}_{U}}{\widehat{\pi}^{B}_{i}}\right)x^{*}_{i}+\widehat{b}^{T}\left[\frac{1}{N^{2}}\sum_{i=1}^{n}p_{i}(\widehat{\beta}_{1})\{1-p_{i}(\widehat{\beta}_{1})\}x^{*}_{i}x_{i}^{*T}\right]\widehat{b}
\qquad (8.18)
$$

where

$$
\widehat{b}^{T}=\bigg\{\frac{1}{N}\sum_{i=1}^{n_{B}}\left(\frac{y_{i}-\overline{y}_{U}}{\widehat{\pi}^{B}_{i}}\right)x_{i}^{*T}\bigg\}\bigg\{\frac{1}{N}\sum_{i=1}^{n}p_{i}(\widehat{\beta}_{1})\{1-p_{i}(\widehat{\beta}_{1})\}x^{*}_{i}x_{i}^{*T}\bigg\}^{-1}
\qquad (8.19)
$$
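To make the estimation steps concrete, here is a minimal R sketch of the PAPW point estimate; this is a sketch under stated assumptions, not the authors' code. The data frames dat_B and dat_R, the covariate names x1 and x2, and the availability of $\pi^{R}_{i}$ for units of $S_{B}$ (assumed calculable throughout the proofs) are illustrative.

```r
# Minimal sketch of the PAPW point estimate (illustrative inputs).
# dat_B: non-probability sample with outcome y, covariates x1, x2, and pi_R
# dat_R: reference probability sample with the same covariates
papw_mean <- function(dat_B, dat_R) {
  dat_B$Z <- 1; dat_R$Z <- 0
  comb <- rbind(dat_B[, c("Z", "x1", "x2")], dat_R[, c("Z", "x1", "x2")])
  fit  <- glm(Z ~ x1 + x2, family = binomial, data = comb)  # logistic model for p_i(beta_1)
  p_B  <- predict(fit, newdata = dat_B, type = "response")
  pi_B <- dat_B$pi_R * p_B / (1 - p_B)  # pi_i^B = pi_i^R p_i / (1 - p_i), as in (8.9)
  sum(dat_B$y / pi_B) / sum(1 / pi_B)   # Hajek-type weighted mean
}
```

The plug-in variance in (8.18) and (8.19) can then be evaluated by substituting pi_B, p_B, and the fitted design matrix into the corresponding sums.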

8.1.2 Proof of double robustness

As discussed in Section 2.4, a doubly robust estimator remains consistent as long as at least one of the two underlying models is correctly specified. To prove the double robustness property of the AIPW estimator proposed here, let us initially assume that $\widehat{\theta}\overset{p}{\to}\theta$ if the prediction model (PM) is correctly specified, and that $\widehat{\phi}\overset{p}{\to}\phi$ and $\widehat{\beta}\overset{p}{\to}\beta$ if the pseudo-weighting (PW) model is correctly specified. Given the true probabilities of selection in $S_{B}$, we know that the HT estimator is design-unbiased for the population total, i.e.

$$
E\left(\sum_{i=1}^{n_{B}}y_{i}/\pi^{B}_{i}\right)=E\left(\sum_{i=1}^{N}\delta^{B}_{i}y_{i}/\pi^{B}_{i}\right)=\sum_{i=1}^{N}E(\delta^{B}_{i})y_{i}/\pi^{B}_{i}=\sum_{i=1}^{N}\pi^{B}_{i}y_{i}/\pi^{B}_{i}=\sum_{i=1}^{N}y_{i}=y_{U}
\qquad (8.20)
$$

The same result is obtained for $S_{R}$. Therefore,

$$
E\left(\sum_{i=1}^{n_{B}}y_{i}/\pi^{B}_{i}\right)=E\left(\sum_{i=1}^{n_{R}}y_{i}/\pi^{R}_{i}\right)=y_{U}
\qquad (8.21)
$$

Now we have

$$
\begin{aligned}
\widehat{y}_{DR}\overset{p}{\to}E(\widehat{y}_{DR})&=E\bigg\{\sum_{i=1}^{n_{B}}\frac{y_{i}-\widehat{y}_{i}}{\widehat{\pi}^{B}_{i}}+\sum_{i=1}^{n_{R}}\frac{\widehat{y}_{i}}{\pi^{R}_{i}}\bigg\}\\
&=E\bigg\{\sum_{i=1}^{n_{B}}\frac{y_{i}-\widehat{y}_{i}}{\widehat{\pi}^{B}_{i}}+\sum_{i=1}^{n_{B}}\frac{\widehat{y}_{i}}{\pi^{B}_{i}}\bigg\}\\
&=E\bigg\{\sum_{i=1}^{n_{B}}\bigg[\frac{y_{i}}{\pi^{B}_{i}}+\frac{y_{i}-\widehat{y}_{i}}{\widehat{\pi}^{B}_{i}}-\frac{y_{i}-\widehat{y}_{i}}{\pi^{B}_{i}}\bigg]\bigg\}
\end{aligned}
\qquad (8.22)
$$

$$
\begin{aligned}
\widehat{y}_{DR}\overset{p}{\to}E(\widehat{y}_{DR})&=y_{U}+E\bigg\{\sum_{i=1}^{n_{B}}(y_{i}-\widehat{y}_{i})\bigg(\frac{1}{\widehat{\pi}^{B}_{i}}-\frac{1}{\pi^{B}_{i}}\bigg)\bigg\}\\
&=y_{U}+E\bigg\{\sum_{i=1}^{n_{B}}(y_{i}-\widehat{y}_{i})\bigg(\frac{\pi^{B}_{i}}{\widehat{\pi}^{B}_{i}}-1\bigg)\bigg\}
\end{aligned}
\qquad (8.23)
$$

Under the ignorability assumption in $S_{B}$, we have $Y\perp\pi^{B}\,|\,X,\pi^{R}$. Hence

$$
\begin{aligned}
\widehat{y}_{DR}\overset{p}{\to}E(\widehat{y}_{DR})&=y_{U}+E\bigg\{\sum_{i=1}^{n_{B}}(y_{i}-\widehat{y}_{i})\bigg(\frac{\pi^{B}_{i}}{\widehat{\pi}^{B}_{i}}-1\bigg)\bigg\}\\
&=y_{U}+E\bigg\{E\bigg[\sum_{i=1}^{n_{B}}(y_{i}-\widehat{y}_{i})\bigg(\frac{\pi^{B}_{i}}{\widehat{\pi}^{B}_{i}}-1\bigg)\,\Big|\,x_{i},\pi^{R}_{i}\bigg]\bigg\}\\
&=y_{U}+E\bigg\{\sum_{i=1}^{n_{B}}E\big(y_{i}-\widehat{y}_{i}\,\big|\,x_{i},\pi^{R}_{i}\big)\,E\bigg(\frac{\pi^{B}_{i}}{\widehat{\pi}^{B}_{i}}-1\,\Big|\,x_{i},\pi^{R}_{i}\bigg)\bigg\}
\end{aligned}
\qquad (8.24)
$$

If we assume the pseudo-weighting model is correctly specified, then we expect $\widehat{\pi}^{B}_{i}\overset{p}{\to}\pi^{B}_{i}$ and

$$
E\left(\frac{\pi^{B}_{i}}{\widehat{\pi}^{B}_{i}}-1\,\Bigg|\,x_{i},\pi^{R}_{i}\right)\overset{p}{\to}\frac{\pi^{B}_{i}}{\pi^{B}_{i}}-1=0
\qquad (8.25)
$$

which implies that $\widehat{y}_{DR}\overset{p}{\to}y_{U}$ regardless of whether the PM is correctly specified. If instead the prediction model is correctly specified, then we expect $\widehat{y}_{i}\overset{p}{\to}E(y_{i}|x_{i},\pi^{R}_{i})$. Hence

$$
E\left(y_{i}-\widehat{y}_{i}\,\big|\,x_{i},\pi^{R}_{i}\right)\overset{p}{\to}0
\qquad (8.26)
$$

which means that $\widehat{y}_{DR}\overset{p}{\to}y_{U}$ even if the PW model is incorrectly specified.
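As a companion to the proof, here is a minimal R sketch of the AIPW point estimate in the Hajek form used later in (8.27); the linear prediction model and the column names (y, x1, x2, w_R) are illustrative assumptions, not the authors' implementation.

```r
# Minimal sketch of the AIPW (doubly robust) estimate (illustrative inputs).
# pi_B_hat: estimated pseudo-inclusion probabilities for units of S_B
# dat_R$w_R: design weights 1 / pi_R of the reference sample
aipw_mean <- function(dat_B, dat_R, pi_B_hat) {
  pm     <- lm(y ~ x1 + x2, data = dat_B)    # prediction model (PM)
  yhat_B <- predict(pm, newdata = dat_B)
  yhat_R <- predict(pm, newdata = dat_R)
  N_B    <- sum(1 / pi_B_hat)                # estimated population sizes
  N_R    <- sum(dat_R$w_R)
  sum((dat_B$y - yhat_B) / pi_B_hat) / N_B + # bias-correction term over S_B
    sum(yhat_R * dat_R$w_R) / N_R            # weighted prediction term over S_R
}
```

If either the propensity model behind pi_B_hat or the prediction model is correctly specified, this estimate converges to the population mean, as shown above.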

8.1.3 Variance estimation under the Bayesian approach

As discussed in Section 2.6, we use Rubin’s combining rule to estimate the variance of the AIPW estimator under the two-step Bayesian approach. The idea stems from the conditional variance formula, which involves two parts: (1) the within-imputation variance and (2) the between-imputation variance. The latter is straightforward and is obtained by taking the variance of $\widehat{\overline{y}}^{(m)}_{DR}$ across the $M$ MCMC draws. The within-imputation variance requires more attention, as one needs to account for the intraclass correlation due to clustering and use linearization techniques when dealing with a ratio estimator.

This component is calculated conditional on the observed $\widehat{y}^{(m)}_{i}$ for $i\in S$, and on $\widehat{\pi}^{B(m)}_{i}$ and $\widehat{p}^{B}_{i}$ for $i\in S_{B}$:

$$
\begin{aligned}
var\left(\widehat{\overline{y}}^{(m)}_{DR}\,\big|\,\widehat{\pi}^{B(m)}_{i},\widehat{y}^{(m)}_{i}\right)
&=var\left(\frac{1}{\widehat{N}_{B}}\sum_{i=1}^{n_{B}}\frac{y_{i}-\widehat{y}^{(m)}_{i}}{\widehat{\pi}^{B}_{i}}+\frac{1}{\widehat{N}_{R}}\sum_{i=1}^{n_{R}}\frac{\widehat{y}^{(m)}_{i}}{\pi^{R}_{i}}\,\Bigg|\,\widehat{\pi}^{B(m)}_{i},\widehat{y}^{(m)}_{i}\right)\\
&=\frac{1}{\widehat{N}^{2}_{B}}\sum_{i=1}^{n_{B}}\frac{var(y_{i})}{\big(\widehat{\pi}^{B}_{i}\big)^{2}}+var\left(\frac{1}{\widehat{N}_{R}}\sum_{i=1}^{n_{R}}\frac{\widehat{y}^{(m)}_{i}}{\pi^{R}_{i}}\,\Bigg|\,\widehat{y}^{(m)}_{i}\right)
\end{aligned}
\qquad (8.27)
$$

For the first component, which equals $var(y)\sum_{i=1}^{n_{B}}\big(\widehat{\pi}^{B}_{i}\big)^{-2}/\widehat{N}^{2}_{B}$, it suffices to estimate the variance of $y$. The second component, however, involves the variance of a ratio estimator, which requires linearization. Defining $\widehat{t}_{R}=\sum_{i=1}^{n_{R}}\widehat{y}^{(m)}_{i}/\pi^{R}_{i}$, the Taylor-series approximation of this variance is given by

$$
var\left(\frac{1}{\widehat{N}_{R}}\sum_{i=1}^{n_{R}}\frac{\widehat{y}^{(m)}_{i}}{\pi^{R}_{i}}\,\Bigg|\,\widehat{y}^{(m)}_{i}\right)
=var\left(\frac{\widehat{t}_{R}}{\widehat{N}_{R}}\,\Bigg|\,\widehat{y}^{(m)}_{i}\right)
\approx\frac{1}{\widehat{N}^{2}_{R}}\bigg\{var\left(\widehat{t}_{R}\,\big|\,\widehat{y}^{(m)}_{i}\right)+\left(\frac{\widehat{t}_{R}}{\widehat{N}_{R}}\right)^{2}var(\widehat{N}_{R})-2\left(\frac{\widehat{t}_{R}}{\widehat{N}_{R}}\right)cov\left(\widehat{t}_{R},\widehat{N}_{R}\,\big|\,\widehat{y}^{(m)}_{i}\right)\bigg\}
\qquad (8.28)
$$

Since t^R\widehat{t}_{R} depends on y^i(m)\widehat{y}^{(m)}_{i}, we have

$$
var\left(\widehat{t}_{R}\,\big|\,\widehat{y}^{(m)}_{i}\right)=\sum_{i=1}^{n_{R}}\big(\widehat{y}^{(m)}_{i}\big)^{2}var\left(\frac{1}{\pi^{R}_{i}}\right)
\qquad (8.29)
$$

$$
cov\left(\widehat{t}_{R},\widehat{N}_{R}\,\big|\,\widehat{y}^{(m)}_{i}\right)=\sum_{i=1}^{n_{R}}\widehat{y}^{(m)}_{i}var\left(\frac{1}{\pi^{R}_{i}}\right)
\qquad (8.30)
$$

Therefore, the variance of the ratio estimator is approximated by

$$
var\left(\frac{1}{\widehat{N}_{R}}\sum_{i=1}^{n_{R}}\frac{\widehat{y}^{(m)}_{i}}{\pi^{R}_{i}}\,\Bigg|\,\widehat{y}^{(m)}_{i}\right)
\approx\frac{1}{\widehat{N}^{2}_{R}}var\left(\frac{1}{\pi^{R}_{i}}\right)\bigg\{\sum_{i=1}^{n_{R}}\big(\widehat{y}^{(m)}_{i}\big)^{2}+n_{R}\left(\frac{\widehat{t}_{R}}{\widehat{N}_{R}}\right)^{2}-2\frac{\widehat{t}_{R}}{\widehat{N}_{R}}\sum_{i=1}^{n_{R}}\widehat{y}^{(m)}_{i}\bigg\}
\qquad (8.31)
$$

The final within-imputation variance is then given by

$$
var\left(\widehat{\overline{y}}^{(m)}_{DR}\,\big|\,\widehat{\pi}^{B(m)}_{i},\widehat{y}^{(m)}_{i}\right)
\approx\frac{1}{\widehat{N}^{2}_{B}}\sum_{i=1}^{n_{B}}\frac{var(y_{i})}{\big(\widehat{\pi}^{B}_{i}\big)^{2}}+\frac{1}{\widehat{N}^{2}_{R}}var\left(\frac{1}{\pi^{R}_{i}}\right)\bigg\{\sum_{i=1}^{n_{R}}\big(\widehat{y}^{(m)}_{i}\big)^{2}+n_{R}\left(\frac{\widehat{t}_{R}}{\widehat{N}_{R}}\right)^{2}-2\frac{\widehat{t}_{R}}{\widehat{N}_{R}}\sum_{i=1}^{n_{R}}\widehat{y}^{(m)}_{i}\bigg\}
\qquad (8.32)
$$

Note that in situations where either $S_{R}$ or $S_{B}$ is a clustered sample, the derivation of the within-imputation variance remains the same, but $y_{i}$, $\pi^{R}_{i}$, $\widehat{\pi}^{B(m)}_{i}$, and $\widehat{y}^{(m)}_{i}$ then represent cluster-level totals, and $n_{R}$ and $n_{B}$ are the numbers of clusters in $S_{R}$ and $S_{B}$, respectively.
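For reference, a minimal R sketch of Rubin's combining rule as used here: point_m holds the $M$ draws $\widehat{\overline{y}}^{(m)}_{DR}$ and within_m the corresponding within-imputation variances from (8.32); both inputs are hypothetical.

```r
# Minimal sketch of Rubin's combining rule (illustrative inputs).
rubin_combine <- function(point_m, within_m) {
  M <- length(point_m)
  W <- mean(within_m)                # average within-imputation variance
  B <- var(point_m)                  # between-imputation variance
  c(estimate = mean(point_m),
    variance = W + (1 + 1 / M) * B)  # total variance
}
```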

8.2 Bayesian Additive Regression Trees

BART is a flexible ensemble-of-trees method that can handle non-linear relationships as well as multi-way interaction effects. The idea of BART is based on a sum of trees, where each tree is sequentially updated on the basis of the residuals from the other trees. In a tree-based method, the variation in the response variable is explained by hierarchically splitting the sample into more homogeneous subgroups (Green and Kern, 2012). As illustrated in Figure 11, a binary-structured tree consists of a root node, a set of interior nodes, a set of terminal nodes associated with parameters, and decision rules that link these nodes (Abu-Nimeh et al., 2008).

Figure 11: Example of a binary-structured tree model

8.2.1 BART for continuous outcomes

Suppose $y=f(x)+\epsilon$, where $y\in\mathbb{R}$ is a continuous outcome, $x$ denotes an $n\times p$ matrix of covariates, and $\epsilon\sim N(0,\sigma^{2})$ is the error term. BART then approximates the outcome as

$$
y\approx\sum_{j=1}^{m}f(x,T_{j},M_{j})
\qquad (8.33)
$$

where $T_{j}$ is the $j$-th tree with $b_{j}$ terminal nodes and associated parameters $M_{j}=(\mu_{1j},\mu_{2j},\ldots,\mu_{b_{j}j})^{T}$. BART is a Bayesian approach in that it assigns prior distributions to $T$, $M$, and $\sigma$ (Chipman et al., 2010; Tan et al., 2016). Assuming an independence structure between trees, we can define the prior as follows:

$$
p[(T_{1},M_{1}),\ldots,(T_{m},M_{m}),\sigma^{-2}]=\bigg[\prod_{j=1}^{m}p(T_{j},M_{j})\bigg]p(\sigma^{-2})
\qquad (8.34)
$$

Using the multiplication law of probability, the joint distribution $p(T_{j},M_{j})$ can be written as:

$$
p(T_{j},M_{j})=p(M_{j}|T_{j})p(T_{j})=\bigg[\prod_{i=1}^{b_{j}}p(\mu_{ij}|T_{j})\bigg]p(T_{j})
\qquad (8.35)
$$

where $i=1,\ldots,b_{j}$ indexes the terminal-node parameters of tree $j$. Therefore, the joint distribution in (8.34) can be factored as below:

$$
p\left[(T_{1},M_{1}),\ldots,(T_{m},M_{m}),\sigma^{-2}\right]=\left[\prod_{j=1}^{m}\bigg\{\prod_{i=1}^{b_{j}}p(\mu_{ij}|T_{j})\bigg\}p(T_{j})\right]p(\sigma^{-2})
\qquad (8.36)
$$

As suggested by Chipman et al. (2007), the following distributions can be used for $\mu_{ij}|T_{j}$ and $\sigma^{-2}$:

$$
\mu_{ij}|T_{j}\sim N(\mu_{\mu},\sigma_{\mu}^{2})
\qquad (8.37)
$$

$$
\sigma^{-2}\sim G\Big(\frac{\nu}{2},\frac{\nu\lambda}{2}\Big)
\qquad (8.38)
$$

The prior for $T_{j}$ involves three components of the tree structure: the depth of the tree, the decision rules, and the choice of covariate at a given node. Prior specification for $T_{j}$ depends on several factors, and detailed discussions can be found in Chipman et al. (2010). Given the data, these parameters are updated through a combination of Bayesian backfitting and Gibbs sampling. The trained trees are then summed to approximate the outcome variable. Finally, $m$ is typically assumed to be fixed, but can be chosen by cross-validation.
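For illustration, a minimal sketch of fitting BART to a continuous outcome with the 'BART' package in R (used later in this appendix); x_train, y_train, and x_test are hypothetical inputs, and the settings shown are common defaults rather than the tuning used in this paper.

```r
# Minimal sketch: BART for a continuous outcome via the 'BART' package.
library(BART)
set.seed(1)
fit <- wbart(x.train = x_train, y.train = y_train, x.test = x_test,
             ntree = 200, ndpost = 1000)  # m = 200 trees, 1,000 posterior draws
yhat_mean  <- fit$yhat.test.mean          # posterior mean prediction at x_test
yhat_draws <- fit$yhat.test               # ndpost x n_test matrix of posterior draws
```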

8.2.2 BART for binary outcomes

For a binary outcome, a probit link function is usually employed, in the sense that $y$ is an indicator variable dichotomizing a normally distributed latent continuous outcome $y^{*}$ at a real value $c$, so that:

$$
y=\begin{cases}1&y^{*}>c\\ 0&y^{*}\leq c\end{cases},\qquad y^{*}\sim N(0,1)
\qquad (8.39)
$$

Therefore, the new model is given by:

$$
G(x)=\Phi^{-1}[p(y=1|x)]=\sum_{j=1}^{m}f(x,T_{j},M_{j})
\qquad (8.40)
$$

where $\Phi^{-1}[\cdot]$ is the inverse of the standard normal CDF. Since we implicitly assume $\sigma\equiv 1$, the only priors we need to specify are $p(\mu_{ij}|T_{j})$ and $p(T_{j})$. In order to draw from the posterior distribution of $T_{j}$ and $\mu_{ij}$, we need to generate the latent continuous variable $y^{*}_{k}$ given $y_{k}$. Chipman et al. (2010) recommend a data augmentation method based on the following algorithm:

$$
y_{k}^{*}=\begin{cases}\max\{N(G(x_{k}),1),0\}&\text{if }y_{k}=1\\ \min\{N(G(x_{k}),1),0\}&\text{if }y_{k}=0\end{cases}
\qquad (8.41)
$$

Since the structure of the priors is very similar to that of BART for continuous outcomes (Tan et al., 2016), we update the estimates $G(x_{k})$ after drawing samples of the $T_{j}$'s and $\mu_{ij}$'s. To apply BART in this research, we utilize the ‘BayesTree’ and ‘BART’ packages in R.
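A matching sketch for the binary case uses pbart() from the 'BART' package, which implements the probit formulation above; the inputs are again hypothetical.

```r
# Minimal sketch: probit BART for a binary outcome via pbart().
library(BART)
fit_b   <- pbart(x.train = x_train, y.train = y_bin, x.test = x_test,
                 ntree = 50, ndpost = 1000)
p_draws <- fit_b$prob.test        # ndpost x n_test posterior draws of p(y = 1 | x)
p_mean  <- fit_b$prob.test.mean   # posterior mean of Phi(G(x)) at x_test
```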

8.3 Further extensions of the simulation study

8.3.1 Simulation study I

This subsection provides additional results associated with Simulation I. Table 5 and Table 6 summarize the findings of the simulation in Section 3.1 under the frequentist approach for $n_{B}=100$ and $n_{B}=10{,}000$, respectively. We report the corresponding results under the two-step Bayesian approach in Table 7 and Table 8.

Table 5: Comparing the performance of the bias adjustment methods and associated asymptotic variance estimator under the frequentist approach in the first simulation study for $n_{R}=100$ and $n_{B}=100$
$\rho=0.3$ $\rho=0.5$ $\rho=0.8$
Method rBias rMSE crCI rSE rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample ($S_{R}$)
Unweighted 8.528 19.248 92.6 1.009 8.647 11.065 77.4 1.018 8.682 9.719 50.9 1.020
Fully weighted -0.029 20.276 94.7 1.001 0.006 8.035 95.1 1.010 0.015 5.008 94.9 1.008
Non-probability sample ($S_{B}$)
Unweighted 31.895 36.418 57.0 1.014 32.213 33.240 1.7 1.008 32.310 32.853 0.0 0.995
Fully weighted 0.171 21.078 94.8 0.996 0.247 8.265 94.9 0.999 0.268 4.994 94.2 0.995
Non-robust adjustment
Model specification: True
PAPW -1.192 23.466 95.2 1.018 -1.205 9.452 95.3 1.015 -1.211 5.982 95.8 1.007
IPSW -2.917 26.505 97.3 1.386 -3.036 12.700 97.0 1.355 -3.075 9.470 97.0 1.308
PM 0.372 20.989 94.6 0.994 0.148 8.351 94.9 0.995 0.077 5.160 95.0 0.992
Model specification: False
PAPW 27.140 33.436 75.6 1.059 27.393 28.814 16.6 1.043 27.470 28.276 2.5 1.025
IPSW 28.372 33.972 67.9 1.012 28.711 29.951 8.3 1.002 28.815 29.515 0.5 0.99
PM 28.199 33.790 68.4 1.011 28.541 29.771 8.3 1.001 28.645 29.337 0.3 0.988
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW -0.084 22.973 96.4 1.047 -0.014 8.996 96.2 1.038 0.007 5.368 95.5 1.017
AIPW–IPSW -0.184 22.449 96.3 1.046 -0.049 8.826 96.1 1.038 -0.009 5.314 95.9 1.016
Model specification: QR–True, PM–False
AIPW–PAPW -0.436 23.709 96.4 1.038 -0.286 9.866 96.6 1.062 -0.241 6.520 97.2 1.101
AIPW–IPSW -0.427 23.083 96.4 1.039 -0.227 9.570 96.6 1.070 -0.166 6.298 97.5 1.119
Model specification: QR–False, PM–True
AIPW–PAPW -0.045 29.068 97.3 1.107 0.011 11.113 96.9 1.097 0.026 6.073 96.2 1.068
AIPW–IPSW -0.194 28.208 97.5 1.104 -0.044 10.825 97.1 1.094 0.001 5.974 96.5 1.062
Model specification: QR–False, PM–False
AIPW–PAPW 28.301 34.194 71.3 1.037 28.570 29.868 10.9 1.028 28.652 29.379 0.7 1.016
AIPW–IPSW 28.178 33.806 70.4 1.035 28.525 29.764 9.4 1.025 28.631 29.326 0.5 1.013
  • PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. “Fully weighted” denotes the weighted mean when the true sampling weights are known.

Table 6: Comparing the performance of the bias adjustment methods and associated asymptotic variance estimator under the frequentist approach in the first simulation study for $n_{R}=100$ and $n_{B}=10{,}000$
$\rho=0.2$ $\rho=0.5$ $\rho=0.8$
Method rBias rMSE crCI rSE rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample ($S_{R}$)
Unweighted 8.528 19.248 92.6 1.009 8.647 11.065 77.4 1.018 8.682 9.719 50.9 1.020
Fully weighted -0.029 20.276 94.7 1.001 0.006 8.035 95.1 1.010 0.015 5.008 94.9 1.008
Non-probability sample ($S_{B}$)
Unweighted 30.014 30.066 0.0 1.008 30.197 30.207 0.0 1.019 30.252 30.257 0.0 1.033
Fully weighted 0.032 2.083 95.3 1.005 0.018 0.816 95.1 1.007 0.012 0.490 95.1 1.007
Non-robust adjustment
Model specification: True
PAPW -2.067 4.582 94.9 1.108 -2.145 4.120 92.8 1.107 -2.170 4.072 92.2 1.107
IPSW -2.618 7.717 94.5 0.958 -2.673 7.334 91.1 0.923 -2.692 7.308 90.6 0.979
PM 0.296 4.515 95.2 0.994 0.121 4.134 94.8 0.986 0.065 4.095 94.6 0.985
Model specification: False
PAPW 24.493 24.616 0.0 1.126 24.592 24.651 0.0 1.153 24.621 24.673 0.0 1.161
IPSW 26.675 26.804 0.0 0.992 26.871 26.949 0.0 0.970 26.930 27.002 0.0 0.964
PM 26.509 26.645 0.0 1.003 26.717 26.800 0.0 0.989 26.779 26.856 0.0 0.986
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW 0.180 4.633 95.1 0.994 0.080 4.162 94.8 0.986 0.047 4.104 94.7 0.985
AIPW–IPSW 0.052 4.582 95.2 0.995 0.035 4.152 94.6 0.987 0.028 4.101 94.5 0.985
Model specification: QR–True, PM–False
AIPW–PAPW 0.262 4.719 95.1 1.000 0.163 4.250 94.9 0.997 0.130 4.191 94.7 0.996
AIPW–IPSW 0.188 4.652 95.4 1.002 0.171 4.225 95.0 0.998 0.164 4.174 94.8 0.998
Model specification: QR–False, PM–True
AIPW–PAPW 1.376 8.569 94.5 0.953 0.503 4.829 95.1 0.995 0.231 4.215 95.2 0.992
AIPW–IPSW 0.864 7.648 94.7 0.948 0.322 4.643 95.3 0.990 0.152 4.182 95.0 0.989
Model specification: QR–False, PM–False
AIPW–PAPW 26.696 26.835 0.0 0.998 26.779 26.862 0.0 0.987 26.803 26.880 0.0 0.985
AIPW–IPSW 26.520 26.655 0.0 1.001 26.718 26.801 0.0 0.989 26.777 26.854 0.0 0.986
  • PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. “Fully weighted” denotes the weighted mean when the true sampling weights are known.

Table 7: Comparing the performance of the bias adjustment methods and associated variance estimator under the two-step Bayesian approach in the first simulation study for $n_{R}=100$ and $n_{B}=100$
$\rho=0.3$ $\rho=0.5$ $\rho=0.8$
Method rBias rMSE crCI rSE rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample ($S_{R}$)
Unweighted 8.528 19.248 92.6 1.009 8.647 11.065 77.4 1.018 8.682 9.719 50.9 1.020
Fully weighted -0.029 20.276 94.7 1.001 0.006 8.035 95.1 1.010 0.015 5.008 95.0 1.008
Non-probability sample ($S_{B}$)
Unweighted 32.238 36.815 56.3 1.003 32.303 33.320 1.6 1.003 32.322 32.865 0.0 0.996
Fully weighted 0.494 21.398 94.3 0.981 0.329 8.400 94.0 0.981 0.276 5.057 93.6 0.979
Non-robust adjustment
Model specification: True
PAPW -0.589 24.195 97.4 1.117 -0.755 9.795 99.0 1.326 -0.801 6.178 99.8 1.653
IPSW 1.169 22.844 97.2 1.118 1.016 9.163 98.6 1.345 0.976 5.719 99.8 1.701
PM 0.709 21.489 95.280 1.029 0.272 8.545 95.580 1.020 0.140 5.245 94.640 1.000
Model specification: False
PAPW 28.008 34.396 76.3 1.091 28.027 29.477 19.6 1.116 28.022 28.840 3.4 1.141
IPSW 29.763 35.215 70.2 1.083 29.827 31.032 10.0 1.106 29.841 30.519 0.8 1.125
PM 28.588 34.226 70.9 1.055 28.658 29.895 10.6 1.050 28.691 29.380 0.7 1.042
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW 0.320 23.802 97.8 1.154 0.125 9.306 99.1 1.357 0.067 5.493 99.9 1.731
AIPW–IPSW 0.249 22.778 97.4 1.142 0.099 8.976 99.1 1.339 0.056 5.387 99.9 1.688
Model specification: QR–True, PM–False
AIPW–PAPW 0.304 23.858 97.7 1.156 0.126 9.386 99.2 1.389 0.065 5.661 99.9 1.781
AIPW–IPSW 0.226 22.814 97.5 1.146 0.096 9.041 99.1 1.376 0.052 5.543 99.8 1.747
Model specification: QR–False, PM–True
AIPW–PAPW 0.881 22.077 96.8 1.126 0.333 8.742 98.6 1.281 0.153 5.303 99.8 1.558
AIPW–IPSW 0.762 21.483 96.6 1.103 0.290 8.554 98.4 1.251 0.135 5.246 99.7 1.509
Model specification: QR–False, PM–False
AIPW–PAPW 28.659 34.756 77.6 1.135 28.660 30.013 17.4 1.142 28.649 29.399 2.1 1.151
AIPW–IPSW 28.575 34.237 74.7 1.115 28.656 29.903 13.7 1.124 28.674 29.368 1.1 1.132
  • PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. “Fully weighted” denotes the weighted mean when the true sampling weights are known.

Table 8: Comparing the performance of the bias adjustment methods and associated variance estimator under the two-step Bayesian approach in the first simulation study for $n_{R}=100$ and $n_{B}=10{,}000$
$\rho=0.2$ $\rho=0.5$ $\rho=0.8$
Method rBias rMSE crCI rSE rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample ($S_{R}$)
Unweighted 8.528 19.248 92.6 1.009 8.647 11.065 77.4 1.018 8.682 9.719 50.9 1.020
Fully weighted -0.029 20.276 94.7 1.001 0.006 8.035 95.1 1.010 0.015 5.008 94.9 1.008
Non-probability sample ($S_{B}$)
Unweighted 30.014 30.066 0.0 1.008 30.197 30.207 0.0 1.019 30.252 30.257 0.0 1.033
Fully weighted 0.032 2.083 95.3 1.005 0.018 0.816 95.1 1.007 0.012 0.490 95.1 1.007
Non-robust adjustment
Model specification: True
PAPW -2.032 4.578 93.0 1.031 -2.106 4.111 90.9 1.032 -2.138 4.062 90.2 1.035
IPSW -0.015 4.094 95.2 1.011 -0.036 3.605 95.1 1.004 -0.042 3.547 95.2 1.002
PM 0.297 4.517 81.6 0.679 0.120 4.136 75.3 0.579 0.065 4.094 73.1 0.563
Model specification: False
PAPW 24.524 24.647 0.0 1.042 24.618 24.678 0.0 1.062 24.650 24.702 0.0 1.069
IPSW 26.406 26.518 0.0 0.982 26.602 26.662 0.0 0.940 26.663 26.717 0.0 0.931
PM 26.512 26.648 0.0 0.851 26.715 26.798 0.0 0.728 26.779 26.856 0.0 0.700
Doubly robust adjustment
Model specification: QR–True, PM–True
AIPW–PAPW 0.178 4.635 84.7 0.721 0.079 4.160 77.3 0.607 0.047 4.103 75.7 0.588
AIPW–IPSW 0.058 4.574 83.6 0.705 0.036 4.149 77.0 0.601 0.028 4.100 75.5 0.585
Model specification: QR–True, PM–False
AIPW–PAPW 0.151 4.273 94.5 0.971 0.050 3.734 93.7 0.943 0.025 3.660 93.9 0.941
AIPW–IPSW 0.106 4.245 94.4 0.966 0.083 3.767 93.7 0.945 0.075 3.712 93.7 0.941
Model specification: QR–False, PM–True
AIPW–PAPW 0.496 4.566 83.7 0.709 0.193 4.142 76.8 0.599 0.096 4.096 75.2 0.581
AIPW–IPSW 0.312 4.514 82.7 0.695 0.127 4.133 76.7 0.595 0.068 4.094 74.9 0.579
Model specification: QR–False, PM–False
AIPW–PAPW 26.709 26.849 0.0 0.893 26.786 26.869 0.0 0.751 26.808 26.885 0.0 0.717
AIPW–IPSW 26.521 26.656 0.0 0.870 26.718 26.800 0.0 0.740 26.777 26.854 0.0 0.709
  • PAPW: propensity adjusted probability weighting; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting. “Fully weighted” denotes the weighted mean when the true sampling weights are known.

8.3.2 Simulation study III

Table 9 and Table 10 exhibit the numerical results associated with the plots of Simulation III.

Table 9: Comparing the performance of the bias adjustment methods in the third simulation study for $\rho=0.8$
Continuous outcome ($Y_{c}$) Binary outcome ($Y_{b}$)
Model-method rBias rMSE crCI rSE rBias rMSE crCI rSE
Probability sample ($S_{R}$)
     Unweighted 48.705 52.900 30.7 1.015 11.304 16.881 88.2 1.022
     Fully weighted 0.080 15.400 96.2 1.025 0.131 13.858 95.3 1.026
Non-probability sample ($S_{B}$)
     Unweighted 68.309 70.415 0.0 0.156 21.763 22.794 0.5 0.181
     Fully weighted 0.137 7.581 95.7 1.023 0.074 6.512 94.7 0.990
Non-robust adjustment
Model specification: True
     GLM–PAPW 0.448 10.994 94.7 1.036 0.072 7.266 96.2 1.034
     GLM–PAPP 0.204 11.192 93.9 1.037 0.080 7.188 96.2 1.031
     GLM–IPSW 0.839 18.138 96.0 1.275 -0.838 9.458 97.3 1.116
     GLM–PM 0.110 11.157 94.2 1.015 0.055 7.401 94.4 0.995
Model specification: False
     GLM–PAPW 7.337 13.187 94.2 1.033 5.115 8.502 90.4 1.020
     GLM–PAPP 6.762 13.546 94.2 1.032 5.046 8.471 88.5 1.035
     GLM–IPSW 22.513 35.600 99.5 1.155 9.390 13.098 89.5 1.099
BART–PAPW 2.272 10.468 100.0 2.487 1.633 7.391 99.5 1.436
BART–PAPP 3.990 11.469 100.0 2.299 0.313 7.243 99.3 1.342
     GLM–PM 37.071 42.523 53.0 1.006 12.600 14.932 63.6 1.003
     BART–PM 0.286 11.581 92.7 0.996 0.594 9.102 81.2 0.688
Doubly robust adjustment
Model specification: QR–True, PM–True
     GLM–AIPW–PAPW 0.307 11.186 95.0 1.019 0.083 7.459 94.2 1.001
     GLM–AIPW–PAPP 0.295 11.187 94.5 1.019 0.089 7.439 94.0 0.998
     GLM–AIPW–IPSW 0.372 11.193 95.8 1.037 0.120 7.478 94.4 1.003
Model specification: QR–True, PM–False
     GLM–AIPW–PAPW 0.381 12.774 95.5 1.035 0.047 7.487 96.2 1.04
     GLM–AIPW–PAPP 0.424 11.934 94.7 1.041 0.155 7.275 96.0 1.032
     GLM–AIPW–IPSW -8.223 17.625 92.3 1.181 -2.842 9.086 95.2 1.047
Model specification: QR–False, PM–True
     GLM–AIPW–PAPW 0.127 11.177 94.7 1.020 0.067 7.451 94.0 0.997
     GLM–AIPW–PAPP 0.122 11.172 94.7 1.019 0.054 7.438 94.2 0.997
     GLM–AIPW–IPSW 0.117 11.167 94.8 1.020 0.055 7.433 94.0 0.998
Model specification: QR–False, PM–False
GLM–AIPW–PAPW 50.327 53.922 21.9 1.002 15.651 17.552 50.3 1.007
GLM–AIPW–PAPP 50.793 54.215 20.9 1.002 15.834 17.605 47.8 1.003
GLM–AIPW–IPSW 47.867 51.106 27.9 1.163 15.112 16.884 53.8 1.051
BART–AIPW–PAPW 0.276 11.593 94.4 1.035 0.701 9.186 81.9 0.698
BART–AIPW–PAPP 0.261 11.591 94.2 1.031 0.682 9.155 81.7 0.697
  • PAPW: propensity adjusted probability weighting; PAPP: propensity adjusted probability prediction; IPSW: inverse propensity score weighting; QR: quasi-randomization; PM: prediction model; AIPW: augmented inverse propensity weighting.

Table 10: Comparing the values of rBias and rMSE for different methods across different values of $\rho$.
Continuous outcome ($Y_{c}$) Binary outcome ($Y_{b}$)
Non-robust Doubly robust Non-robust Doubly robust
$\rho$ PAPW PAPP IPSW PM PAPW PAPP IPSW PAPW PAPP IPSW PM PAPW PAPP IPSW
rBias
0.0 0.545 0.870 -6.791 0.259 0.447 0.443 0.511 -0.186 0.128 -3.248 -0.016 -0.006 -0.005 0.004
0.1 -0.179 0.022 -4.772 -0.224 -0.215 -0.218 -0.235 -0.537 -0.220 -2.345 -0.399 -0.464 -0.475 -0.489
0.2 -0.195 0.493 -4.406 0.048 -0.161 -0.160 -0.250 -0.329 0.095 -2.071 -0.144 -0.159 -0.149 -0.172
0.3 0.288 0.668 -3.832 0.420 0.459 0.449 0.435 -0.244 0.069 -1.980 0.160 0.108 0.114 0.111
0.4 0.212 0.361 -2.332 0.425 0.237 0.254 0.233 -0.097 0.150 -1.372 -0.031 0.031 0.048 0.085
0.5 0.248 0.173 -2.257 0.175 0.227 0.216 0.239 -0.169 -0.067 -1.817 0.117 0.068 0.065 -0.010
0.6 0.286 0.516 -1.010 0.411 0.404 0.404 0.420 0.128 0.231 -1.146 0.133 0.162 0.156 0.198
0.7 0.072 -0.052 -0.217 -0.027 0.084 0.084 0.100 -0.019 0.062 -1.029 0.021 -0.001 0.009 -0.001
0.8 0.538 0.527 0.652 0.623 0.509 0.517 0.498 0.012 0.175 -1.017 0.015 0.053 0.048 0.078
0.9 0.343 0.424 2.090 0.469 0.496 0.495 0.466 0.079 0.122 -0.932 0.155 0.158 0.144 0.164
rMSE
0.0 13.916 14.724 20.949 14.934 14.964 14.934 14.994 8.702 8.702 11.773 9.214 9.214 9.214 9.214
0.1 12.979 13.640 19.378 13.760 13.790 13.790 13.850 8.443 8.188 10.746 8.699 8.699 8.699 8.699
0.2 12.297 13.220 18.877 13.161 13.220 13.220 13.250 8.237 8.237 10.811 8.494 8.494 8.494 8.494
0.3 12.187 13.049 18.132 13.019 13.049 13.019 13.049 7.859 7.859 9.887 8.113 8.113 8.113 8.366
0.4 11.823 12.392 17.330 12.452 12.511 12.511 12.511 7.910 7.654 9.951 8.165 8.165 8.165 8.165
0.5 11.745 12.101 17.647 12.190 12.190 12.190 12.190 7.576 7.576 9.849 7.829 7.829 7.829 7.829
0.6 11.691 12.145 18.748 12.085 12.115 12.115 12.115 7.673 7.673 9.975 7.929 7.929 7.929 7.929
0.7 10.927 11.166 15.567 11.196 11.226 11.226 11.226 7.332 7.080 8.849 7.332 7.585 7.332 7.585
0.8 10.769 10.918 17.888 10.918 10.918 10.918 10.948 7.359 7.105 9.389 7.359 7.612 7.359 7.612
0.9 10.951 11.042 22.084 10.981 11.012 11.012 11.012 6.935 6.935 8.990 7.192 7.192 7.192 7.192
  • NOTE: GLM has been used for prediction, and the underlying models in each method have been correctly specified.

8.4 Supplemental results of the application on SHRP2 data

Table 11: Mean daily trip duration (min) and associated 95% CIs by different covariates across DR adjustment methods
Unweighted GLM-AIPW-PAPP GLM-AIPW-PMLE BART-AIPW-PAPP
Covariate n (95%CI) (95%CI) (95%CI) (95%CI)
Total 837,061 68.94 (67.955,69.925) 71.603 (66.565,76.641) 70.058 (67.902,72.214) 69.582 (66.117,73.047)
Gender
Male 407,312 70.289 (68.809,71.77) 72.411 (63.583,81.238) 70.97 (67.971,73.97) 70.61 (66.131,75.088)
Female 429,749 67.662 (66.355,68.968) 70.79 (67.353,74.226) 69.107 (66.683,71.531) 68.522 (64.432,72.611)
Age group
16-24 311,106 70 (68.514,71.485) 72.889 (69.435,76.342) 72.318 (69.636,74.999) 71.937 (66.79,77.085)
25-34 117,758 73.889 (71.099,76.679) 72.669 (67.713,77.625) 71.562 (67.688,75.435) 72.511 (66.132,78.889)
35-44 61,908 75.4 (71.304,79.496) 71.215 (64.668,77.762) 75.72 (69.882,81.559) 71.919 (63.874,79.964)
45-54 77,903 74.666 (71.734,77.599) 71.803 (61.432,82.175) 70.437 (66.525,74.349) 73.237 (67.727,78.747)
55-64 63,891 70.823 (67.027,74.62) 66.99 (60.85,73.13) 67.054 (62.252,71.855) 67.518 (60.885,74.152)
65-74 88,762 67.122 (64.13,70.113) 84.262 (52.155,116.369) 64.475 (59.374,69.576) 64.286 (59.779,68.794)
75+ 115,733 54.103 (51.965,56.241) 49.358 (46.14,52.576) 51.359 (47.896,54.822) 51.442 (46.894,55.99)
Race
White 745,596 67.845 (66.833,68.858) 71.687 (65.246,78.128) 68.183 (65.836,70.529) 67.861 (64.386,71.336)
Black 43,109 86.294 (80.759,91.83) 74.42 (66.374,82.466) 81.587 (75.046,88.127) 79.728 (68.019,91.437)
Asian 26,265 68.723 (63.684,73.761) 66.792 (58.089,75.495) 66.777 (60.785,72.769) 65.958 (53.748,78.169)
Other 22,091 72.284 (66.895,77.674) 79.723 (69.505,89.942) 75.924 (69.729,82.118) 75.314 (63.089,87.539)
Ethnicity
Non-Hisp 808,098 68.699 (67.697,69.701) 71.999 (66.066,77.933) 69.337 (67.166,71.507) 68.555 (64.866,72.244)
Hispanic 28,963 75.681 (70.488,80.873) 72.068 (63.599,80.536) 74.545 (69.145,79.944) 75.449 (66.582,84.316)
Education
<High school 50,943 61.108 (58.134,64.083) 67.647 (58.129,77.165) 67.32 (61.588,73.051) 68.246 (56.385,80.108)
HS completed 78,045 69.025 (65.979,72.071) 86.848 (58.569,115.128) 69.752 (64.868,74.637) 70.399 (61.472,79.325)
College 237,206 68.997 (67.153,70.841) 70.312 (64.184,76.44) 70.712 (66.638,74.785) 70.896 (65.722,76.069)
Graduate 326,860 70.859 (69.188,72.529) 71.314 (68.333,74.296) 71.313 (69.073,73.554) 69.984 (65.783,74.186)
Post-grad 144,007 67.218 (64.984,69.451) 64.26 (60.143,68.377) 68.713 (64.864,72.562) 66.496 (62.395,70.597)
HH income
0-49 332,586 68.105 (66.553,69.658) 75.441 (62.136,88.745) 69.441 (65.872,73.009) 69.049 (65.13,72.968)
50-99 309,387 69.755 (68.089,71.421) 70.608 (63.639,77.578) 70.359 (67.276,73.442) 69.836 (66.552,73.12)
100-149 132,757 69.487 (66.999,71.975) 68.685 (63.743,73.626) 70.276 (66.911,73.642) 69.55 (60.835,78.265)
150+ 62,331 68.187 (65.109,71.264) 69.772 (66.389,73.154) 69.9 (66.158,73.643) 70.352 (64.31,76.394)
HH size
1 177,140 66.779 (64.452,69.106) 80.258 (54.973,105.544) 66.501 (62.817,70.186) 67.607 (63.28,71.934)
2 286,106 67.608 (65.994,69.223) 65.532 (61.489,69.574) 66.781 (63.894,69.667) 67.282 (63.371,71.193)
3 152,684 71.233 (68.836,73.631) 72.398 (66.412,78.384) 74.177 (69.507,78.848) 71.127 (67.04,75.214)
4 143,442 70.161 (67.969,72.352) 69.794 (65.273,74.315) 69.944 (66.494,73.395) 70.839 (65.417,76.261)
5+ 77,689 72.012 (68.913,75.11) 74.664 (64.68,84.648) 76.567 (71.368,81.765) 73.321 (68.479,78.163)
Urban size
<50k 34,987 67.602 (62.771,72.432) 79.22 (59.18,99.26) 65.75 (59.749,71.751) 66.109 (57.069,75.149)
50-200k 119,970 62.608 (60.337,64.879) 65.759 (61.25,70.268) 65.151 (62.164,68.138) 67.211 (61.409,73.014)
200-500k 44,578 68.576 (63.52,73.632) 87.248 (73.018,101.477) 68.884 (63.664,74.104) 69.636 (61.746,77.526)
500-1000k 276,629 68.017 (66.289,69.745) 66.524 (61.364,71.685) 68.123 (65.323,70.923) 70.338 (64.971,75.704)
1000k+ 360,897 71.928 (70.451,73.404) 70.91 (67.926,73.894) 73.567 (71.441,75.693) 72.962 (68.493,77.43)
Vehicle make
American 290,228 66.507 (64.905,68.108) 71.826 (59.917,83.734) 68.256 (65.302,71.21) 69.04 (63.968,74.113)
Asian 528,810 70.265 (69,71.53) 72.7 (69.653,75.747) 71.602 (69.436,73.768) 70.211 (66.415,74.007)
European 18,023 69.261 (63.898,74.624) 66.191 (59.703,72.679) 71.403 (65.95,76.855) 69.836 (60.506,79.166)
Vehicle type
Car 610,245 68.686 (67.539,69.834) 73.853 (65.931,81.776) 69.706 (67.4,72.012) 70.236 (66.799,73.673)
Van 27,866 69.2 (64.432,73.968) 68.389 (61.064,75.714) 73.096 (66.388,79.804) 64.905 (54.298,75.512)
SUV 158,202 68.993 (66.851,71.134) 68.424 (62.145,74.704) 69.291 (66.318,72.263) 69.469 (64.351,74.587)
Pickup 40,748 72.361 (66.713,78.008) 69.934 (59.062,80.805) 74.495 (64.949,84.04) 70.256 (58.87,81.643)
Fuel type
Gas/D 761,292 68.637 (67.61,69.664) 71.334 (66.221,76.446) 69.895 (67.66,72.131) 69.443 (65.954,72.931)
Other 75,769 71.986 (68.598,75.373) 82.674 (72.987,92.361) 77.039 (72.37,81.708) 75.696 (67.822,83.571)
Weekend
Weekday 712,411 67.671 (66.701,68.64) 70.362 (65.734,74.991) 68.72 (66.616,70.824) 68.348 (64.806,71.89)
Weekend 124,650 76.196 (75.001,77.392) 78.646 (71.128,86.164) 77.649 (75.099,80.199) 76.577 (73.08,80.074)
Table 12: Mean daily trip distance (mile) and associated 95% CIs by different covariates across DR adjustment methods
Unweighted GLM-AIPW-PAPP GLM-AIPW-PMLE BART-AIPW-PAPP
Covariate n (95%CI) (95%CI) (95%CI) (95%CI)
Total 837,061 32.418 (31.823,33.013) 33.76 (31.806,35.715) 33.39 (32.22,34.56) 32.926 (31.185,34.667)
Gender
Male 407,312 33.852 (32.963,34.741) 35.51 (32.247,38.773) 34.782 (33.146,36.418) 34.358 (32.254,36.461)
Female 429,749 31.06 (30.27,31.849) 31.901 (29.932,33.871) 31.947 (30.601,33.293) 31.428 (28.965,33.89)
Age group
16-24 311,106 32.828 (32,33.657) 34.904 (33.085,36.723) 34.864 (33.358,36.369) 34.491 (32.804,36.178)
25-34 117,758 36.246 (34.603,37.888) 36.546 (33.364,39.728) 34.841 (32.742,36.94) 35.324 (32.837,37.81)
35-44 61,908 35.958 (33.318,38.597) 31.774 (28.585,34.962) 35.067 (32.321,37.813) 33.9 (30.173,37.627)
45-54 77,903 36.103 (34.231,37.976) 35.721 (31.46,39.981) 34.301 (31.843,36.759) 34.578 (30.108,39.049)
55-64 63,891 35.037 (32.735,37.34) 33.159 (29.352,36.966) 33.138 (30.097,36.178) 32.135 (29.896,34.375)
65-74 88,762 31.548 (29.552,33.544) 33.875 (24.893,42.856) 29.948 (26.934,32.962) 29.028 (26.593,31.464)
75+ 115,733 22.269 (21.044,23.493) 20.184 (18.108,22.259) 21.037 (19.304,22.77) 21.421 (18.979,23.863)
Race
White 745,596 32.189 (31.554,32.824) 33.173 (30.862,35.484) 33.426 (32.11,34.743) 32.85 (31.118,34.582)
Black 43,109 37.275 (34.577,39.973) 35.696 (31.364,40.029) 34.187 (31.146,37.228) 34.131 (28.834,39.427)
Asian 26,265 30.638 (28.095,33.181) 29.601 (25.527,33.675) 28.311 (25.238,31.383) 28.289 (22.62,33.958)
Other 22,091 32.789 (29.699,35.879) 37.919 (30.975,44.862) 35.518 (30.553,40.484) 35.058 (27.876,42.239)
Ethnicity
Non-Hisp 808,098 32.328 (31.723,32.933) 32.852 (30.823,34.881) 33.217 (32.085,34.349) 32.362 (30.845,33.879)
Hispanic 28,963 34.935 (31.713,38.158) 36.882 (32.766,40.997) 35.126 (31.226,39.027) 36.344 (29.859,42.828)
Education
<High school 50,943 25.659 (23.905,27.412) 28.248 (24.014,32.482) 28.977 (25.986,31.967) 30.351 (22.902,37.8)
HS completed 78,045 32.04 (30.14,33.939) 36.853 (29.19,44.515) 33.596 (30.989,36.203) 33.497 (30.336,36.658)
College 237,206 31.848 (30.812,32.885) 33.666 (30.509,36.824) 33.038 (31.177,34.899) 32.956 (30.622,35.29)
Graduate 326,860 33.879 (32.85,34.908) 35.56 (32.73,38.389) 35.093 (33.48,36.706) 34.091 (31.938,36.244)
Post-grad 144,007 32.637 (31.235,34.039) 29.935 (27.6,32.269) 32.801 (30.877,34.725) 31.414 (29.555,33.273)
HH income
0-49 332,586 31.185 (30.273,32.097) 32.845 (28.788,36.901) 31.979 (30.333,33.626) 31.538 (28.932,34.144)
50-99 309,387 33.024 (32.004,34.043) 35.811 (32.755,38.866) 33.907 (32.201,35.613) 33.744 (32.011,35.476)
100-149 132,757 33.765 (32.235,35.295) 33.354 (30.228,36.479) 34.433 (32.472,36.394) 33.532 (30.736,36.329)
150+ 62,331 33.124 (31.231,35.017) 31.693 (29.605,33.781) 33.795 (31.147,36.442) 33.428 (29.886,36.97)
HH size
1 177,140 30.588 (29.231,31.945) 34.322 (27.864,40.779) 31.133 (28.899,33.366) 30.768 (28.385,33.152)
2 286,106 32.415 (31.372,33.458) 31.701 (29.362,34.039) 32.742 (30.989,34.494) 32.301 (30.95,33.651)
3 152,684 33.786 (32.452,35.12) 34.54 (30.838,38.242) 34.806 (32.647,36.966) 34.421 (31.549,37.293)
4 143,442 32.95 (31.524,34.376) 32.048 (29.257,34.84) 33.04 (31.09,34.99) 32.898 (30.012,35.784)
5+ 77,689 32.934 (31.314,34.554) 36.522 (32.397,40.647) 36.383 (33.774,38.993) 34.731 (32.103,37.359)
Urban size
<50k 34,987 36.147 (32.885,39.408) 34.93 (28.343,41.518) 34.077 (30.506,37.648) 32.945 (30.263,35.628)
50-200k 119,970 31.028 (29.388,32.668) 32.379 (29.47,35.288) 32.032 (29.692,34.372) 32.636 (30.058,35.215)
200-500k 44,578 36.416 (33.616,39.216) 44.143 (36.054,52.231) 35.585 (33.228,37.942) 36.461 (32.151,40.771)
500-1000k 276,629 31.973 (30.952,32.994) 32.453 (27.672,37.234) 31.005 (29.331,32.679) 32.781 (28.04,37.522)
1000k+ 360,897 32.366 (31.497,33.236) 32.475 (30.577,34.373) 32.371 (31.117,33.626) 32.302 (30.134,34.471)
Vehicle make
American 290,228 30.948 (29.995,31.9) 35.285 (31.317,39.254) 32.784 (31.234,34.335) 32.956 (30.909,35.004)
Asian 528,810 33.249 (32.476,34.022) 33.339 (31.554,35.124) 34.022 (32.623,35.42) 33.124 (31.156,35.092)
European 18,023 31.719 (28.896,34.542) 29.905 (26.439,33.37) 32.979 (30.006,35.951) 32.05 (27.154,36.946)
Vehicle type
Car 610,245 32.126 (31.428,32.823) 33.619 (31.253,35.985) 32.916 (31.691,34.141) 32.518 (30.426,34.611)
Van 27,866 31.212 (28.225,34.199) 29.109 (24.54,33.679) 32.682 (28.052,37.312) 31.109 (24.519,37.699)
SUV 158,202 32.848 (31.558,34.137) 33.857 (30.466,37.249) 33.15 (31.374,34.926) 32.559 (29.986,35.133)
Pickup 40,748 35.958 (32.813,39.103) 35.086 (30.375,39.798) 36.557 (32.917,40.197) 36.352 (31.036,41.668)
Fuel type
Gas/D 761,292 32.121 (31.502,32.739) 33.524 (31.522,35.526) 33.18 (32.006,34.354) 32.813 (31.021,34.605)
Other 75,769 35.409 (33.302,37.515) 43.864 (37.513,50.214) 39.259 (35.942,42.576) 37.388 (34.271,40.505)
Weekend
Weekday 712,411 31.895 (31.307,32.482) 33.181 (31.327,35.036) 32.817 (31.666,33.968) 32.41 (30.68,34.14)
Weekend 124,650 35.41 (34.689,36.132) 37.037 (34.351,39.724) 36.64 (35.09,38.19) 35.853 (33.806,37.899)
Table 13: Mean daily average trip speed (MPH) and associated 95% CIs by different covariates across DR adjustment methods
Unweighted GLM-AIPW-PAPP GLM-AIPW-PMLE BART-AIPW-PAPP
Covariate n (95%CI) (95%CI) (95%CI) (95%CI)
Total 837,061 25.03 (24.8,25.261) 25.775 (24.277,27.274) 25.562 (25.063,26.06) 25.39 (24.309,26.471)
Gender
Male 407,312 25.474 (25.137,25.811) 26.964 (24.667,29.261) 26.199 (25.578,26.82) 25.999 (24.877,27.12)
Female 429,749 24.61 (24.297,24.923) 24.463 (23.239,25.688) 24.906 (24.32,25.492) 24.76 (23.59,25.93)
Age group
16-24 311,106 25.239 (24.893,25.586) 25.876 (24.878,26.873) 25.921 (25.287,26.555) 25.831 (24.724,26.938)
25-34 117,758 26.951 (26.318,27.583) 27.242 (26.062,28.422) 26.571 (25.723,27.419) 27.07 (26.055,28.085)
35-44 61,908 26.065 (25.183,26.947) 25.045 (23.196,26.894) 25.696 (24.675,26.716) 25.516 (23.327,27.705)
45-54 77,903 26.527 (25.727,27.328) 27.731 (22.197,33.265) 26.412 (25.419,27.405) 25.665 (23.242,28.088)
55-64 63,891 26.22 (25.471,26.969) 26.075 (24.48,27.67) 26.275 (25.152,27.398) 25.525 (24.078,26.971)
65-74 88,762 23.956 (23.339,24.572) 22.601 (20.487,24.716) 23.618 (22.683,24.554) 23.216 (22.2,24.232)
75+ 115,733 21.12 (20.559,21.681) 20.545 (19.251,21.838) 21.317 (20.539,22.094) 20.728 (19.29,22.167)
Race
White 745,596 25.109 (24.863,25.354) 25.225 (24.255,26.196) 26.086 (25.565,26.608) 25.656 (24.95,26.363)
Black 43,109 24.227 (23.188,25.265) 27.223 (22.295,32.151) 23.198 (22.202,24.194) 23.433 (20.47,26.397)
Asian 26,265 24.35 (23.278,25.423) 25.203 (23.46,26.947) 23.038 (21.674,24.402) 24.508 (21.082,27.934)
Other 22,091 24.76 (23.473,26.046) 25.049 (23.008,27.091) 24.68 (23.128,26.232) 25.956 (22.168,29.744)
Ethnicity
Non-Hisp 808,098 25.039 (24.805,25.274) 25.023 (24.156,25.89) 25.674 (25.197,26.152) 25.246 (24.437,26.055)
Hispanic 28,963 24.777 (23.526,26.028) 27.808 (24.042,31.574) 25.233 (23.873,26.593) 26.284 (22.881,29.688)
Education
<High school 50,943 23.16 (22.31,24.01) 23.142 (21.364,24.919) 23.825 (22.322,25.328) 24.791 (21.638,27.943)
HS completed 78,045 25.192 (24.438,25.945) 24.851 (22.707,26.995) 26.025 (25.019,27.031) 25.165 (22.538,27.793)
College 237,206 24.506 (24.09,24.921) 25.888 (22.586,29.189) 24.988 (24.184,25.793) 24.7 (23.717,25.683)
Graduate 326,860 25.426 (25.043,25.81) 26.769 (25.764,27.775) 26.363 (25.795,26.93) 26.368 (25.077,27.659)
Post-grad 144,007 25.569 (25.027,26.11) 25.031 (23.955,26.107) 25.446 (24.775,26.117) 25.555 (24.399,26.711)
HH income
0-49 332,586 24.333 (23.956,24.709) 23.975 (22.766,25.183) 24.659 (23.851,25.467) 24.401 (22.918,25.884)
50-99 309,387 25.25 (24.878,25.623) 27.547 (24.479,30.615) 25.971 (25.316,26.627) 25.744 (24.828,26.66)
100-149 132,757 25.963 (25.411,26.515) 25.892 (24.611,27.174) 26.281 (25.569,26.994) 25.981 (24.275,27.687)
150+ 62,331 22.937 (22.564,23.31) 25.096 (21.747,28.444) 23.419 (22.632,24.206) 23.288 (22.044,24.533)
HH size
1 177,140 23.837 (23.337,24.337) 23.986 (22.176,25.797) 24.355 (23.538,25.173) 24.024 (22.597,25.452)
2 286,106 25.155 (24.746,25.563) 25.606 (24.679,26.532) 25.778 (25.128,26.428) 25.55 (24.735,26.365)
3 152,684 25.77 (25.223,26.316) 26.042 (24.785,27.3) 25.645 (24.886,26.403) 26.035 (24.807,27.262)
4 143,442 25.423 (24.895,25.952) 25.025 (23.669,26.381) 25.766 (25.015,26.517) 25.622 (24.254,26.991)
5+ 77,689 25.112 (24.48,25.745) 27.472 (22.365,32.58) 26.14 (24.981,27.3) 25.155 (23.23,27.08)
Urban size
<50k 34,987 28.437 (27.061,29.813) 25.595 (22.943,28.247) 27.951 (26.354,29.548) 27.097 (25.536,28.659)
50-200k 119,970 24.455 (23.814,25.096) 25.081 (23.851,26.31) 24.784 (23.965,25.603) 25.031 (24.155,25.907)
200-500k 44,578 27.64 (26.634,28.645) 27.073 (23.546,30.601) 27.024 (26.049,27.999) 26.931 (24.582,29.279)
500-1000k 276,629 25.758 (25.355,26.162) 26.189 (23.701,28.678) 25.05 (24.289,25.812) 25.513 (24.121,26.904)
1000k+ 360,897 20.451 (18.941,21.961) 23.557 (21.359,25.755) 21.9 (20.41,23.389) 23.488 (20.639,26.336)
Vehicle make
American 290,228 24.799 (24.402,25.195) 27.212 (24.331,30.094) 25.766 (25.047,26.485) 25.353 (24.079,26.627)
Asian 528,810 25.174 (24.884,25.464) 24.771 (23.69,25.853) 25.509 (25.004,26.015) 25.464 (24.358,26.569)
European 18,023 24.534 (23.553,25.514) 24.974 (23.083,26.866) 24.291 (22.942,25.64) 25.307 (23.285,27.329)
Vehicle type
Car 610,245 24.893 (24.622,25.164) 25.115 (24.327,25.904) 25.313 (24.794,25.832) 25.357 (24.467,26.247)
Van 27,866 23.562 (22.539,24.586) 23.064 (21.378,24.75) 23.484 (22.376,24.591) 23.527 (20.832,26.223)
SUV 158,202 25.398 (24.87,25.925) 26.495 (22.622,30.369) 25.635 (24.887,26.384) 25.008 (23.511,26.505)
Pickup 40,748 26.43 (25.484,27.375) 26.245 (23.43,29.059) 26.628 (25.453,27.804) 25.788 (23.842,27.733)
Fuel type
Gas/D 761,292 24.955 (24.711,25.199) 25.727 (24.205,27.249) 25.507 (25.005,26.01) 25.361 (24.277,26.446)
Other 75,769 25.784 (25.091,26.476) 27.804 (26.114,29.493) 27.052 (25.991,28.113) 26.676 (24.691,28.66)
Weekend
Weekday 712,411 25.077 (24.847,25.308) 25.744 (24.351,27.138) 25.598 (25.1,26.096) 25.425 (24.36,26.49)
Weekend 124,650 24.76 (24.518,25.003) 25.939 (23.843,28.034) 25.356 (24.811,25.901) 25.194 (23.987,26.401)
Table 14: Mean start time of the first daily trip (hour of day) and associated 95% CIs by different covariates across DR adjustment methods
Unweighted GLM-AIPW-PAPP GLM-AIPW-PMLE BART-AIPW-PAPP
Covariate n (95%CI) (95%CI) (95%CI) (95%CI)
Total 837,061 13.811 (13.763,13.859) 13.564 (13.391,13.737) 13.553 (13.427,13.68) 13.5 (13.364,13.636)
Gender
Male 407,312 13.824 (13.751,13.898) 13.556 (13.304,13.807) 13.572 (13.418,13.725) 13.486 (13.304,13.667)
Female 429,749 13.799 (13.736,13.861) 13.578 (13.362,13.793) 13.533 (13.386,13.681) 13.515 (13.389,13.64)
Age group
16-24 311,106 14.411 (14.354,14.468) 14.396 (14.254,14.537) 14.351 (14.218,14.485) 14.266 (14.13,14.402)
25-34 117,758 13.999 (13.891,14.106) 13.864 (13.562,14.165) 13.923 (13.734,14.112) 13.843 (13.65,14.037)
35-44 61,908 13.57 (13.399,13.741) 13.694 (13.178,14.211) 13.467 (13.164,13.77) 13.448 (13.117,13.78)
45-54 77,903 13.489 (13.368,13.61) 13.414 (13.028,13.8) 13.389 (13.187,13.592) 13.335 (13.043,13.626)
55-64 63,891 13.344 (13.185,13.503) 13.59 (13.248,13.933) 13.279 (13.004,13.555) 13.27 (12.902,13.637)
65-74 88,762 13.244 (13.091,13.397) 12.529 (12.082,12.975) 13.131 (12.874,13.388) 13.026 (12.663,13.388)
75+ 115,733 13.047 (12.9,13.193) 13.412 (13.089,13.736) 13.051 (12.815,13.287) 13.216 (12.904,13.528)
Race
White 745,596 13.77 (13.719,13.821) 13.522 (13.331,13.712) 13.506 (13.372,13.64) 13.46 (13.318,13.602)
Black 43,109 14.065 (13.887,14.242) 13.632 (13.387,13.876) 13.695 (13.49,13.901) 13.662 (13.21,14.114)
Asian 26,265 14.351 (14.135,14.567) 14.281 (13.405,15.157) 14.051 (13.522,14.58) 13.841 (13.51,14.172)
Other 22,091 14.055 (13.786,14.324) 13.349 (12.959,13.739) 13.585 (13.254,13.917) 13.492 (12.695,14.289)
Ethnicity
Non-Hisp 808,098 13.803 (13.754,13.851) 13.563 (13.359,13.767) 13.547 (13.419,13.675) 13.49 (13.362,13.617)
Hispanic 28,963 14.049 (13.81,14.288) 13.602 (13.303,13.901) 13.544 (13.278,13.81) 13.558 (13.084,14.031)
Education
<High school 50,943 14.3 (14.177,14.424) 13.601 (13.416,13.786) 13.604 (13.394,13.814) 13.513 (13.106,13.921)
HS completed 78,045 13.895 (13.73,14.06) 13.425 (12.957,13.893) 13.509 (13.236,13.781) 13.478 (13.072,13.884)
College 237,206 14.003 (13.913,14.092) 13.641 (13.36,13.922) 13.611 (13.433,13.788) 13.532 (13.339,13.725)
Graduate 326,860 13.695 (13.617,13.774) 13.68 (13.378,13.982) 13.65 (13.5,13.799) 13.558 (13.42,13.696)
Post-grad 144,007 13.539 (13.431,13.648) 13.307 (12.878,13.737) 13.369 (13.197,13.542) 13.399 (13.122,13.677)
HH income
0-49 332,586 13.891 (13.809,13.973) 13.62 (13.357,13.882) 13.641 (13.465,13.817) 13.612 (13.339,13.886)
50-99 309,387 13.745 (13.669,13.822) 13.573 (13.341,13.805) 13.55 (13.395,13.705) 13.469 (13.323,13.615)
100-149 132,757 13.777 (13.671,13.882) 13.383 (13.062,13.704) 13.424 (13.243,13.605) 13.415 (13.247,13.584)
150+ 62,331 13.531 (13.437,13.625) 13.201 (12.949,13.454) 13.457 (13.277,13.636) 13.342 (13.14,13.544)
HH size
1 177,140 13.649 (13.533,13.765) 13.337 (12.98,13.694) 13.518 (13.349,13.688) 13.489 (13.276,13.703)
2 286,106 13.6 (13.513,13.687) 13.462 (13.164,13.761) 13.469 (13.275,13.663) 13.383 (13.215,13.551)
3 152,684 14.02 (13.918,14.122) 13.718 (13.351,14.085) 13.58 (13.395,13.765) 13.5 (13.336,13.664)
4 143,442 14.033 (13.941,14.125) 13.514 (13.189,13.838) 13.64 (13.491,13.788) 13.542 (13.362,13.723)
5+ 77,689 14.138 (14.017,14.259) 13.819 (13.433,14.206) 13.581 (13.321,13.841) 13.73 (13.474,13.985)
Urban size
<50k 34,987 13.52 (13.266,13.773) 13.383 (12.795,13.972) 13.337 (12.845,13.829) 13.328 (13.031,13.625)
50-200k 119,970 13.928 (13.794,14.062) 13.747 (13.489,14.005) 13.842 (13.672,14.011) 13.698 (13.522,13.873)
200-500k 44,578 13.918 (13.705,14.13) 13.518 (12.933,14.103) 13.817 (13.571,14.064) 13.66 (13.276,14.045)
500-1000k 276,629 13.759 (13.678,13.84) 13.395 (13.213,13.576) 13.564 (13.45,13.679) 13.503 (13.305,13.7)
1000k+ 360,897 14.286 (13.859,14.713) 13.451 (12.893,14.01) 13.654 (13.317,13.992) 13.385 (12.683,14.087)
Vehicle make
American 290,228 13.8 (13.714,13.886) 13.339 (13.067,13.61) 13.455 (13.258,13.651) 13.426 (13.221,13.631)
Asian 528,810 13.799 (13.741,13.858) 13.646 (13.485,13.808) 13.627 (13.517,13.736) 13.561 (13.43,13.692)
European 18,023 14.337 (14.094,14.58) 14.19 (13.492,14.888) 13.684 (13.334,14.034) 13.552 (13.171,13.933)
Vehicle type
Car 610,245 13.849 (13.792,13.906) 13.649 (13.445,13.854) 13.658 (13.53,13.787) 13.611 (13.476,13.747)
Van 27,866 13.588 (13.336,13.84) 13.414 (13.064,13.764) 13.472 (13.172,13.772) 13.514 (13.168,13.861)
SUV 158,202 13.773 (13.675,13.871) 13.623 (13.279,13.966) 13.497 (13.336,13.657) 13.528 (13.264,13.793)
Pickup 40,748 13.714 (13.536,13.893) 13.725 (13.41,14.04) 13.544 (13.305,13.783) 13.502 (13.079,13.925)
Fuel type
Gas/D 761,292 13.841 (13.791,13.891) 13.565 (13.389,13.741) 13.556 (13.428,13.685) 13.498 (13.36,13.636)
Other 75,769 13.51 (13.338,13.683) 13.525 (13.217,13.833) 13.463 (13.254,13.672) 13.56 (13.264,13.856)
Weekend
Weekday 712,411 13.824 (13.775,13.872) 13.576 (13.408,13.744) 13.558 (13.431,13.684) 13.502 (13.364,13.64)
Weekend 124,650 13.74 (13.685,13.794) 13.496 (13.22,13.773) 13.531 (13.397,13.664) 13.486 (13.334,13.637)
Table 15: Mean daily maximum speed (MPH) and associated 95% CIs by different covariates across DR adjustment methods
Unweighted GLM-AIPW-PAPP GLM-AIPW-PMLE BART-AIPW-PAPP
Covariate n (95%CI) (95%CI) (95%CI) (95%CI)
Total 837,061 59.808 (59.467,60.149) 61.547 (59.717,63.377) 60.447 (59.833,61.062) 59.947 (58.623,61.27)
Gender
Male 407,312 60.187 (59.706,60.669) 62.687 (59.483,65.89) 60.847 (60.045,61.649) 60.677 (58.953,62.402)
Female 429,749 59.448 (58.969,59.928) 60.28 (59.195,61.366) 60.023 (59.205,60.84) 59.193 (58.034,60.353)
Age group
16-24 311,106 61.475 (60.97,61.98) 62.484 (61.235,63.733) 62.212 (61.442,62.981) 62.078 (60.788,63.368)
25-34 117,758 63.41 (62.567,64.253) 62.907 (61.082,64.733) 62.359 (61.171,63.546) 62.373 (60.598,64.148)
35-44 61,908 62.617 (61.358,63.878) 62.761 (60.791,64.731) 63.986 (62.235,65.737) 62.039 (59.911,64.166)
45-54 77,903 60.872 (59.853,61.89) 64.943 (58.295,71.591) 60.117 (58.688,61.545) 59.738 (57.435,62.041)
55-64 63,891 59.478 (58.406,60.55) 59.666 (57.225,62.107) 59.611 (57.872,61.35) 58.797 (56.332,61.262)
65-74 88,762 55.91 (55.068,56.753) 56.915 (55.026,58.805) 55.693 (54.645,56.742) 55.5 (53.256,57.744)
75+ 115,733 52.613 (51.8,53.426) 52.602 (50.493,54.71) 52.88 (51.871,53.889) 52.262 (50.92,53.604)
Race
White 745,596 59.449 (59.092,59.806) 60.312 (59.478,61.145) 60.254 (59.581,60.929) 59.586 (58.217,60.954)
Black 43,109 64.628 (62.991,66.264) 68.229 (60.656,75.802) 62.22 (60.3,64.14) 62.152 (58.85,65.453)
Asian 26,265 61.08 (59.323,62.836) 60.573 (58.414,62.733) 59.081 (56.873,61.288) 59.464 (55.818,63.11)
Other 22,091 61.008 (59.099,62.917) 60.986 (58.585,63.387) 59.968 (58.45,61.485) 60.988 (55.744,66.231)
Ethnicity
Non-Hisp 808,098 59.718 (59.37,60.066) 60.221 (59.395,61.048) 60.232 (59.595,60.869) 59.528 (58.485,60.571)
Hispanic 28,963 62.31 (60.707,63.914) 66.308 (60.971,71.645) 61.976 (60.418,63.533) 62.437 (58.69,66.184)
Education
<High school 50,943 58.103 (56.954,59.251) 59.199 (57.013,61.386) 58.949 (57.325,60.572) 60.162 (57.18,63.145)
HS completed 78,045 59.865 (58.83,60.901) 61.116 (59.657,62.576) 60.812 (59.457,62.166) 60.382 (57.673,63.092)
College 237,206 59.874 (59.25,60.497) 62.623 (58.482,66.764) 59.982 (58.9,61.065) 59.491 (57.762,61.219)
Graduate 326,860 60.185 (59.62,60.751) 61.474 (60.049,62.898) 61.405 (60.379,62.43) 60.718 (59.63,61.807)
Post-grad 144,007 59.414 (58.566,60.262) 59.809 (58.019,61.598) 60.113 (58.801,61.426) 59.221 (57.561,60.881)
HH income
0-49 332,586 59.127 (58.575,59.68) 59.271 (57.864,60.679) 59.263 (58.317,60.208) 58.757 (57.339,60.175)
50-99 309,387 60.031 (59.461,60.6) 64.22 (60.289,68.151) 60.94 (60.026,61.853) 60.508 (58.903,62.113)
100-149 132,757 60.663 (59.901,61.425) 60.409 (58.784,62.035) 61.507 (60.192,62.822) 60.611 (59.169,62.052)
150+ 62,331 60.513 (59.305,61.721) 61.386 (59.529,63.244) 60.484 (58.966,62.004) 60.549 (58.951,62.147)
HH size
1 177,140 57.902 (57.123,58.682) 58.973 (57.29,60.655) 58.243 (57.246,59.24) 57.958 (56.619,59.298)
2 286,106 59.02 (58.421,59.619) 60.033 (58.585,61.48) 59.371 (58.389,60.353) 59.371 (57.722,61.019)
3 152,684 61.35 (60.569,62.132) 60.399 (58.548,62.25) 61.373 (59.998,62.748) 60.541 (58.797,62.285)
4 143,442 61.214 (60.476,61.951) 61.592 (60.031,63.152) 61.488 (60.377,62.599) 61.068 (59.616,62.52)
5+ 77,689 61.428 (60.57,62.286) 66.669 (60.605,72.733) 62.759 (60.967,64.551) 60.989 (58.922,63.057)
Urban size
<50k 34,987 60.422 (58.928,61.917) 61.118 (58.986,63.251) 59.964 (58.006,61.921) 59.665 (57.238,62.093)
50-200k 119,970 56.12 (55.22,57.021) 57.621 (55.544,59.698) 57.162 (56.071,58.254) 57.313 (55.844,58.782)
200-500k 44,578 62.847 (61.27,64.423) 64.07 (60.75,67.39) 62.507 (60.978,64.037) 62.711 (60.338,65.086)
500-1000k 276,629 60.193 (59.62,60.766) 61.073 (57.067,65.078) 59.886 (58.928,60.846) 60.296 (58.82,61.773)
1000k+ 360,897 60.303 (59.8,60.807) 62.075 (59.415,64.736) 60.647 (59.977,61.317) 60.287 (59.099,61.475)
Vehicle make
American 290,228 59.36 (58.773,59.946) 63.24 (59.63,66.85) 60.218 (59.339,61.096) 59.877 (58.058,61.695)
Asian 528,810 60.013 (59.588,60.438) 60.796 (59.891,61.701) 60.751 (60.095,61.407) 60.203 (59.036,61.371)
European 18,023 61.016 (59.074,62.958) 58.842 (56.171,61.514) 58.984 (56.944,61.025) 59.049 (55.175,62.922)
Vehicle type
Car 610,245 59.744 (59.338,60.149) 60.92 (60.083,61.757) 60.44 (59.765,61.115) 60.119 (58.854,61.383)
Van 27,866 57.722 (56.154,59.289) 58.36 (55.812,60.907) 58.674 (56.088,61.26) 58.263 (55.586,60.94)
SUV 158,202 60.093 (59.36,60.825) 62.613 (57.431,67.795) 60.444 (59.297,61.59) 59.511 (57.678,61.345)
Pickup 40,748 61.092 (59.557,62.627) 62.97 (61.201,64.739) 61.359 (59.346,63.371) 60.707 (57.51,63.905)
Fuel type
Gas/D 761,292 59.878 (59.516,60.239) 61.537 (59.67,63.404) 60.473 (59.842,61.105) 59.937 (58.565,61.309)
Other 75,769 59.105 (58.131,60.079) 61.685 (58.505,64.865) 61.082 (59.975,62.189) 60.645 (58.056,63.234)
Weekend
Weekday 712,411 59.684 (59.344,60.023) 61.322 (59.663,62.982) 60.295 (59.687,60.902) 59.801 (58.483,61.119)
Weekend 124,650 60.517 (60.151,60.883) 62.809 (60.044,65.575) 61.312 (60.601,62.023) 60.768 (59.333,62.204)
Table 16: Mean daily number of brake applications per mile driven and associated 95% CIs by different covariates across DR adjustment methods
Unweighted GLM-AIPW-PAPP GLM-AIPW-PMLE BART-AIPW-PAPP
Covariate n (95%CI) (95%CI) (95%CI) (95%CI)
Total 837,061 4.499 (4.387,4.611) 4.356 (3.887,4.825) 4.644 (4.345,4.942) 4.426 (3.984,4.867)
Gender
Male 407,312 4.415 (4.247,4.583) 3.835 (3.139,4.531) 4.456 (4.129,4.784) 4.345 (3.789,4.902)
Female 429,749 4.579 (4.43,4.728) 4.957 (4.518,5.396) 4.825 (4.471,5.179) 4.508 (4.04,4.977)
Age group
16-24 311,106 4.283 (4.114,4.451) 4.368 (3.735,5.001) 4.417 (4.068,4.766) 4.173 (3.777,4.57)
25-34 117,758 4.085 (3.819,4.351) 4.201 (3.59,4.812) 4.609 (4.084,5.133) 3.984 (3.288,4.681)
35-44 61,908 4.422 (4.052,4.792) 4.575 (3.779,5.371) 4.643 (4.05,5.236) 4.574 (3.905,5.243)
45-54 77,903 4.14 (3.846,4.435) 3.764 (2.22,5.308) 4.174 (3.719,4.628) 3.997 (3.359,4.636)
55-64 63,891 4.565 (4.136,4.995) 5.003 (4.102,5.904) 4.589 (4.042,5.136) 4.62 (3.365,5.875)
65-74 88,762 4.801 (4.41,5.193) 3.522 (1.749,5.296) 4.799 (4.195,5.402) 4.596 (3.372,5.821)
75+ 115,733 5.518 (5.147,5.888) 6.902 (5.723,8.081) 5.857 (5.25,6.464) 6.44 (5.548,7.332)
Race
White 745,596 4.521 (4.401,4.641) 4.518 (4.018,5.018) 4.583 (4.272,4.895) 4.402 (4,4.805)
Black 43,109 4.366 (3.885,4.848) 3.807 (2.117,5.496) 5.074 (4.442,5.706) 5.053 (4.117,5.988)
Asian 26,265 4.256 (3.675,4.837) 4.349 (3.393,5.304) 5.222 (4.061,6.383) 4.483 (3.405,5.561)
Other 22,091 4.319 (3.732,4.907) 4.197 (3.011,5.382) 4.438 (3.574,5.302) 3.705 (2.167,5.243)
Ethnicity
Non-Hisp 808,098 4.488 (4.375,4.601) 4.56 (4.117,5.003) 4.597 (4.323,4.871) 4.442 (4.051,4.832)
Hispanic 28,963 4.813 (4.037,5.588) 3.842 (2.609,5.075) 4.84 (3.782,5.898) 4.3 (2.899,5.701)
Education
<High school 50,943 4.942 (4.522,5.362) 4.779 (3.58,5.979) 5.209 (4.526,5.893) 5.204 (3.91,6.498)
HS completed 78,045 4.163 (3.8,4.526) 3.495 (1.656,5.335) 4.241 (3.662,4.819) 4.179 (3.275,5.084)
College 237,206 4.561 (4.347,4.775) 4.466 (3.609,5.323) 4.713 (4.269,5.158) 4.604 (4.062,5.145)
Graduate 326,860 4.347 (4.165,4.528) 4.174 (3.765,4.583) 4.564 (4.201,4.927) 4.17 (3.59,4.749)
Post-grad 144,007 4.769 (4.509,5.029) 5.031 (4.514,5.548) 4.788 (4.387,5.19) 4.541 (3.857,5.224)
HH income
0-49 332,586 4.542 (4.353,4.731) 4.639 (3.669,5.608) 4.728 (4.325,5.131) 4.752 (4.157,5.347)
50-99 309,387 4.386 (4.207,4.566) 3.75 (2.847,4.653) 4.482 (4.098,4.866) 4.196 (3.682,4.71)
100-149 132,757 4.579 (4.32,4.838) 4.8 (4.301,5.3) 4.743 (4.228,5.258) 4.564 (3.933,5.194)
150+ 62,331 4.66 (4.265,5.056) 4.662 (3.942,5.383) 4.721 (4.213,5.229) 4 (3.02,4.98)
HH size
1 177,140 4.644 (4.38,4.908) 4.13 (2.528,5.733) 4.852 (4.438,5.265) 4.658 (4.058,5.258)
2 286,106 4.674 (4.46,4.888) 4.855 (4.259,5.451) 4.782 (4.302,5.262) 4.515 (3.92,5.111)
3 152,684 4.328 (4.097,4.558) 4.425 (3.937,4.913) 4.593 (4.123,5.064) 4.321 (3.785,4.857)
4 143,442 4.294 (4.064,4.524) 4.356 (3.762,4.95) 4.475 (4.051,4.9) 4.201 (3.71,4.692)
5+ 77,689 4.241 (3.949,4.533) 3.851 (2.313,5.388) 4.396 (3.893,4.899) 4.459 (4.017,4.901)
Urban size
<50k 34,987 4.051 (3.567,4.535) 4.313 (2.858,5.768) 4.293 (3.57,5.015) 4.24 (3.746,4.734)
50-200k 119,970 4.789 (4.49,5.089) 4.921 (4.454,5.389) 4.761 (4.265,5.257) 4.696 (4.02,5.372)
200-500k 44,578 4.241 (3.825,4.657) 4.177 (3.147,5.206) 4.567 (4.075,5.059) 4.231 (3.049,5.413)
500-1000k 276,629 3.969 (3.793,4.145) 3.977 (3.342,4.612) 4.21 (3.867,4.552) 4.171 (3.549,4.793)
1000k+ 360,897 4.884 (4.703,5.066) 4.539 (3.985,5.093) 4.948 (4.599,5.297) 4.626 (4.025,5.227)
Vehicle make
American 290,228 4.762 (4.548,4.975) 4.239 (3.381,5.097) 5.007 (4.549,5.466) 4.914 (4.347,5.482)
Asian 528,810 4.392 (4.261,4.524) 4.569 (4.244,4.894) 4.451 (4.181,4.721) 4.206 (3.78,4.632)
European 18,023 3.401 (2.946,3.856) 3.161 (2.359,3.963) 3.664 (3.006,4.322) 2.898 (0.958,4.837)
Vehicle type
Car 610,245 4.504 (4.368,4.641) 4.281 (3.65,4.913) 4.564 (4.225,4.903) 4.248 (3.785,4.711)
Van 27,866 4.435 (4.064,4.806) 5.298 (4.287,6.31) 4.855 (4.41,5.3) 4.569 (3.345,5.792)
SUV 158,202 4.351 (4.148,4.555) 4.222 (3.078,5.365) 4.56 (4.215,4.904) 4.381 (3.73,5.032)
Pickup 40,748 5.043 (4.391,5.696) 4.611 (3.881,5.34) 5.022 (4.255,5.789) 5.226 (4.061,6.391)
Fuel type
Gas/D 761,292 4.435 (4.319,4.551) 4.345 (3.865,4.825) 4.649 (4.34,4.959) 4.426 (3.992,4.859)
Other 75,769 5.145 (4.737,5.553) 4.718 (3.688,5.747) 4.934 (4.308,5.561) 4.388 (3.347,5.429)
Weekend
Weekday 712,411 4.492 (4.379,4.605) 4.365 (3.91,4.82) 4.639 (4.339,4.938) 4.413 (3.963,4.864)
Weekend 124,650 4.54 (4.427,4.654) 4.305 (3.738,4.871) 4.675 (4.374,4.975) 4.497 (4.073,4.921)
Table 17: Mean daily percentage of time stopped and associated 95% CIs by different covariates across DR adjustment methods; the BART-AIPW-PAPP column is reported on a proportion (0-1) scale
Unweighted GLM-AIPW-PAPP GLM-AIPW-PMLE BART-AIPW-PAPP
Covariate n (95%CI) (95%CI) (95%CI) (95%CI)
Total 837,061 25.518 (25.202,25.834) 25.515 (24.043,26.987) 24.949 (24.217,25.681) 0.251 (0.242,0.26)
Gender
Male 407,312 24.618 (24.157,25.079) 24.048 (21.863,26.234) 24.06 (23.158,24.961) 0.242 (0.231,0.252)
Female 429,749 26.371 (25.945,26.797) 27.107 (25.226,28.988) 25.873 (24.968,26.779) 0.261 (0.25,0.271)
Age group
16-24 311,106 26.713 (26.221,27.204) 26.551 (25.177,27.925) 25.913 (25.109,26.716) 0.258 (0.245,0.271)
25-34 117,758 25.199 (24.385,26.014) 24.178 (22.653,25.704) 25.013 (23.946,26.08) 0.247 (0.232,0.262)
35-44 61,908 25.575 (24.528,26.621) 27.6 (23.466,31.735) 26.828 (25.017,28.639) 0.265 (0.247,0.284)
45-54 77,903 23.406 (22.257,24.555) 24.908 (20.144,29.672) 22.926 (21.57,24.281) 0.239 (0.22,0.258)
55-64 63,891 22.879 (21.906,23.852) 22.949 (21.035,24.862) 23.408 (21.72,25.095) 0.235 (0.211,0.259)
65-74 88,762 24.425 (23.448,25.402) 27.099 (23.644,30.554) 24.739 (23.395,26.084) 0.262 (0.246,0.279)
75+ 115,733 26.315 (25.367,27.264) 27.682 (25.755,29.609) 26.185 (25.039,27.332) 0.28 (0.264,0.296)
Race
White 745,596 25.216 (24.882,25.55) 25.693 (24.172,27.213) 24.071 (23.292,24.849) 0.245 (0.237,0.253)
Black 43,109 29.711 (28.421,31.001) 27.666 (25.29,30.042) 29.697 (28.071,31.324) 0.294 (0.267,0.32)
Asian 26,265 25.989 (24.582,27.396) 23.631 (20.325,26.937) 26.098 (24.013,28.184) 0.252 (0.216,0.289)
Other 22,091 26.955 (24.952,28.958) 26.442 (23.616,29.269) 26.484 (23.923,29.045) 0.252 (0.21,0.294)
Ethnicity
Non-Hisp 808,098 25.438 (25.118,25.758) 25.622 (24.061,27.183) 24.532 (23.79,25.274) 0.251 (0.242,0.26)
Hispanic 28,963 27.746 (25.946,29.546) 26.345 (24.033,28.657) 27.102 (25.369,28.836) 0.251 (0.224,0.277)
Education
<High school 50,943 27.86 (26.587,29.134) 28.96 (26.302,31.618) 27.223 (25.326,29.119) 0.262 (0.233,0.292)
HS completed 78,045 26.136 (25.022,27.249) 28.155 (25.07,31.241) 25.534 (24.179,26.889) 0.263 (0.24,0.287)
College 237,206 26.881 (26.288,27.474) 27.043 (24.085,30.001) 26.128 (24.88,27.377) 0.264 (0.253,0.276)
Graduate 326,860 24.96 (24.472,25.448) 22.543 (21.102,23.983) 23.625 (22.803,24.447) 0.236 (0.218,0.254)
Post-grad 144,007 23.375 (22.656,24.094) 24.105 (22.558,25.651) 23.845 (22.697,24.992) 0.237 (0.219,0.254)
HH income
0-49 332,586 26.578 (26.059,27.098) 27.376 (25.379,29.374) 26.485 (25.414,27.557) 0.265 (0.254,0.276)
50-99 309,387 25.205 (24.717,25.694) 24.47 (21.599,27.341) 24.537 (23.514,25.559) 0.246 (0.232,0.26)
100-149 132,757 24.218 (23.441,24.996) 24.214 (22.662,25.766) 23.707 (22.615,24.799) 0.241 (0.226,0.256)
150+ 62,331 24.176 (22.962,25.39) 25.297 (20.664,29.931) 23.96 (22.384,25.536) 0.243 (0.223,0.264)
HH size
1 177,140 26.133 (25.442,26.823) 26.166 (22.88,29.452) 25.409 (24.247,26.571) 0.257 (0.247,0.266)
2 286,106 24.412 (23.871,24.953) 24.289 (23.042,25.537) 23.693 (22.808,24.578) 0.244 (0.234,0.254)
3 152,684 25.507 (24.748,26.267) 24.228 (21.958,26.498) 25.177 (23.648,26.705) 0.243 (0.231,0.256)
4 143,442 26.236 (25.522,26.951) 26.843 (23.645,30.042) 25.755 (24.608,26.901) 0.259 (0.244,0.273)
5+ 77,689 26.882 (25.884,27.88) 27.356 (22.56,32.152) 25.719 (24.458,26.98) 0.263 (0.244,0.282)
Urban size
<50k 34,987 20.874 (19.097,22.651) 26.022 (21.85,30.194) 21.679 (19.496,23.862) 0.228 (0.21,0.247)
50-200k 119,970 23.798 (22.902,24.694) 23.87 (22.364,25.376) 24.287 (22.881,25.692) 0.239 (0.222,0.257)
200-500k 44,578 22.435 (21.355,23.515) 24.487 (18.513,30.461) 23.436 (22.035,24.837) 0.236 (0.209,0.263)
500-1000k 276,629 25.334 (24.785,25.882) 25.695 (24.531,26.86) 26.128 (25.326,26.93) 0.263 (0.249,0.278)
1000k+ 360,897 27.062 (26.615,27.508) 26.883 (25.813,27.953) 27.43 (26.73,28.131) 0.271 (0.26,0.283)
Vehicle make
American 290,228 26.28 (25.728,26.831) 24.962 (22.25,27.673) 24.957 (23.906,26.007) 0.255 (0.244,0.265)
Asian 528,810 25.075 (24.682,25.468) 26.283 (24.65,27.917) 24.94 (24.125,25.755) 0.249 (0.239,0.258)
European 18,023 26.233 (24.82,27.646) 23.294 (19.732,26.857) 25.298 (22.99,27.607) 0.243 (0.21,0.276)
Vehicle type
Car 610,245 25.632 (25.267,25.997) 25.738 (24.14,27.335) 25.054 (24.321,25.788) 0.25 (0.24,0.261)
Van 27,866 27.205 (25.523,28.887) 28.585 (24.943,32.227) 28.5 (25.898,31.103) 0.277 (0.237,0.316)
SUV 158,202 25.274 (24.518,26.031) 25.441 (22.085,28.796) 25.053 (23.917,26.189) 0.254 (0.241,0.267)
Pickup 40,748 23.596 (22.247,24.945) 23.906 (20.704,27.107) 23.839 (21.462,26.217) 0.239 (0.211,0.267)
Fuel type
Gas/D 761,292 25.777 (25.444,26.11) 25.638 (24.128,27.147) 25.059 (24.318,25.801) 0.252 (0.243,0.261)
Other 75,769 22.913 (21.982,23.844) 20.192 (18.427,21.956) 22.057 (20.348,23.765) 0.216 (0.195,0.237)
Weekend
Weekday 712,411 25.395 (25.079,25.712) 25.465 (24.053,26.876) 24.83 (24.105,25.554) 0.25 (0.241,0.259)
Weekend 124,650 26.218 (25.892,26.545) 25.812 (23.822,27.803) 25.623 (24.805,26.442) 0.258 (0.248,0.268)
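
For readers who wish to reproduce the layout of Tables 12-17, the sketch below shows how a pseudo-weighted subgroup mean and a normal-approximation 95% interval could be computed. It is a minimal illustration only: the column names (daily_distance, pseudo_weight, vehicle_type) are hypothetical, the data are synthetic, and the intervals in the tables come from the AIPW/posterior variance estimators described in the text rather than from this simple linearization.

```python
import numpy as np
import pandas as pd

def weighted_mean_ci(y, w, z=1.96):
    """Hajek (pseudo-weighted) mean of y with a normal-approximation CI."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    mean = np.sum(w * y) / np.sum(w)
    # Linearized, with-replacement variance approximation for the ratio mean
    u = w * (y - mean) / np.sum(w)
    n = len(y)
    se = np.sqrt(n / (n - 1) * np.sum(u**2))
    return mean, mean - z * se, mean + z * se

# Synthetic stand-in for an adjusted analysis file (hypothetical names)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "daily_distance": rng.lognormal(3.3, 0.6, 2000),              # miles
    "pseudo_weight": rng.gamma(2.0, 1.0, 2000),                   # DR weight
    "vehicle_type": rng.choice(["Car", "Van", "SUV", "Pickup"], 2000),
})
for level, sub in df.groupby("vehicle_type"):
    m, lo, hi = weighted_mean_ci(sub["daily_distance"], sub["pseudo_weight"])
    print(f"{level}: {m:.3f} ({lo:.3f},{hi:.3f})")
```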

Figure 12: Comparing the distribution of common auxiliary variables in SHRP2 with weighted NHTS

Figure 13: Comparing the performance of BART vs. GLM in estimating propensity scores and predicting selected trip-related outcomes. The radar plot on the right displays the (pseudo-)R² values for BART and GLM. AUC: area under the curve; CART: classification and regression trees
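
In the spirit of Figure 13, the contrast between a parametric propensity model and a flexible tree-based one can be illustrated on synthetic data. In the sketch below, a gradient-boosted classifier stands in for BART (BART itself is not available in scikit-learn), the selection mechanism is deliberately non-linear so that a main-effects logistic GLM is misspecified, and all names and data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 5))
# Non-linear selection mechanism that a main-effects logit misspecifies
eta = 0.8 * X[:, 0] ** 2 - 1.2 * np.abs(X[:, 1]) + 0.5 * X[:, 2]
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

glm = LogisticRegression(max_iter=1000).fit(X, z)   # parametric benchmark
gbm = GradientBoostingClassifier().fit(X, z)        # flexible stand-in for BART

print("GLM AUC:          ", roc_auc_score(z, glm.predict_proba(X)[:, 1]))
print("Tree-ensemble AUC:", roc_auc_score(z, gbm.predict_proba(X)[:, 1]))
```

On data like these, the tree ensemble typically attains a higher AUC because it captures the quadratic and absolute-value terms that the main-effects logit cannot.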

Figure 14: Comparing the distribution of common auxiliary variables in pseudo-weighted SHRP2 (PAPP-BART) with weighted NHTS
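
The pseudo-weights behind Figure 14 arise from a quasi-randomization step. The sketch below shows a generic inverse-odds version of that idea, not the exact PAPP estimator developed in this paper: a classifier is fit to the combined samples with the reference units carrying their design weights, and the fitted propensities for the non-probability units are converted into pseudo-weights. A gradient-boosted classifier again stands in for BART, and every name and number is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
n_np, n_ref = 5000, 1000
X_np = rng.normal(0.4, 1.0, size=(n_np, 3))     # non-probability (SHRP2-like)
X_ref = rng.normal(0.0, 1.0, size=(n_ref, 3))   # reference survey (NHTS-like)
w_ref = rng.uniform(50.0, 150.0, size=n_ref)    # reference design weights

X = np.vstack([X_np, X_ref])
z = np.r_[np.ones(n_np), np.zeros(n_ref)]       # 1 = non-probability unit
sw = np.r_[np.ones(n_np), w_ref]                # weight reference by design

clf = GradientBoostingClassifier().fit(X, z, sample_weight=sw)
p = clf.predict_proba(X_np)[:, 1]
pseudo_w = (1.0 - p) / p                        # inverse-odds pseudo-weights
pseudo_w *= w_ref.sum() / pseudo_w.sum()        # rescale to population size
print(pseudo_w[:5])
```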