This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Improved Inference for Respondent-Driven Sampling Data with Application to HIV Prevalence Estimation

Krista J. Gile
Nuffield College
Krista J. Gile is Postdoctoral Prize Research Fellow, Nuffield College, Oxford. The author would like to thank Mark S. Handcock of the University of Washington for numerous helpful suggestions, Tessie Caballero of COPRESIDA and Tetyana Saluk of AIDS Alliance for the use of their countries’ datasets and Lisa G. Johnston of Tulane University for her help in suggesting, obtaining and assisting in the final cleaning of these datasets. Direct correspondence to Krista J. Gile, Nuffield College, New Road, Oxford OX1 1NF, United Kingdom; e-mail krista.gile@nuffield.ox.ac.uk
Abstract

Respondent-driven sampling is a form of link-tracing network sampling, which is widely used to study hard-to-reach populations, often to estimate population proportions. Previous treatments of this process have used a with-replacement approximation, which we show induces bias in estimates for large sample fractions and differential network connectedness by characteristic of interest.

We present a treatment of respondent-driven sampling as a successive sampling process. Unlike existing representations, our approach respects the essential without-replacement feature of the process, while converging to an existing with-replacement representation for small sample fractions, and to the sample mean for a full-population sample.

We present a successive-sampling based estimator for population means based on respondent-driven sampling data, and demonstrate its superior performance when the size of the hidden population is known. We present sensitivity analyses for unknown population sizes. In addition, we note that like other existing estimators, our new estimator is subject to bias induced by the selection of the initial sample. Using data collected among three populations in two countries, we illustrate the application of this approach to populations with varying characteristics. We conclude that the successive sampling estimator improves on existing estimators, and can also be used as a diagnostic tool when population size is not known.

Key Words: link-tracing, PPSWOR, successive sampling, Markov chain, network sampling, hidden population sampling, snowball sampling, social networks

1 Background

Researchers and government officials are often interested in characteristics of human populations for which there are no practicable sampling frames for direct sampling. In some such hidden populations, members are connected through social networks. A common approach is to collect a sample using a variant of link-tracing (Thompson, 2006a, b), such as a snowball sample (Goodman, 1961), where subsequent sample members are selected based on their relations with previously sampled individuals. When the initial sample is not a probability sample, this approach does not result in a probability sample. However, most available alternatives, (Muhib et al., 2001; Peterson et al., 2008; Watters and Biernacki, 1989) also fail to produce a probability sample of the population.

Respondent-Driven Sampling (RDS, introduced by Heckathorn 1997, 2002, see also Salganik and Heckathorn, 2004, Volz and Heckathorn 2008) is a recently introduced variant of link-tracing sampling which increases the ease of sampling and produces samples which, it is argued, approach probability samples as sampling progresses. The lack of satisfactory alternatives has spurned a strong demand for RDS (Johnston et al., 2008; Lansky et al., 2007).

RDS presents two main innovations: a sampling design and a corresponding approach to estimation. The former is highly effective in many settings (Lansky et al., 2007); the respondent-driven design relies on the respondents at each wave to select the next wave by distributing uniquely identified coupons to others in the target population, who can choose to return the coupons to enroll in the study. Thus, the sampling exploits the network of social relations while also avoiding the confidentiality concerns associated with recording the names of contacts.

The key innovation for estimation is that through many waves of sampling, the dependence of the final sample on the initial sample is reduced, allowing researchers more confidence in making approximate probability statements about the resulting samples. This insight allows for statistical inference in settings where the initial sample is typically selected by a convenience mechanism. Although current inference is likely superior to the alternative non-probability methods, existing methods are sensitive to deviations from many assumptions (Goel and Salganik, 2009; Gile and Handcock, 2010; Neely, 2009). This paper offers a modification of the existing theoretical formulation of respondent-driven sampling, and corresponding inference to address what we find to be a serious conceptual weakness of existing work: the known inaccuracy of the with-replacement approximation to the sampling process.

In the next section, we begin by introducing current RDS estimation, particularly the estimator introduced by Volz and Heckathorn (2008), and illustrate the sensitivity of this estimator to the with-replacement sampling assumption in cases of substantial sample fractions. In Sections 3 and 4, we then introduce a new model for RDS sampling based on successive sampling, and introduce a new estimator based on that model. In Section 5, we use a simulation study to illustrate the superior performance of the new estimator. We also include sensitivity analyses concerning inaccurate estimation of the size of the target population and the characteristics of the initial sample. In Section 6, we apply our estimator to data collected in three populations of drug users and men who have sex with men. We conclude with a broader discussion of the method and its limitations.

2 Previous Approaches to Estimation

The basic ideas underlying estimation from RDS data are clever and important. They allow for something like valid statistical inference, in a sampling setting where the target population cannot be effectively reached using a traditional sampling frame. The original article, (Heckathorn, 1997) made very strong assumptions about the sampling procedure so as to assume that the sample proportions were representative of the population proportions. Salganik and Heckathorn (2004) introduced a Markov chain argument for population mixing, and proposed an estimator based on equating the number of cross-relations between pairs of sub-populations of interest, based on the referral patterns of each group. This estimator is currently in wide use, and is implemented in the standard RDS analysis software (Volz et al., 2007). Volz and Heckathorn (2008) connect RDS estimation to mainstream survey sampling through the use of a generalized Horvitz-Thompson estimator form. This estimator relies on the estimation of the inclusion probabilities of the sampled units, πi\pi_{i}. Based on an argument for treating the sample as independent draws from the stationary distribution of a random walk on the nodes of an undirected graph, Volz and Heckathorn (2008) approximate the sampling probabilities as proportional to nodal degree, did_{i}, or number of incident edges in the graph. This estimator avoids the problem of potentially unknown population size NN by using the generalized Horvitz-Thompson form, normalizing by an estimator of NN as follows:

μ^VH=i=1N𝐒i𝐳i𝐝ii=1N𝐒i𝐝i,\displaystyle\hat{\mu}_{\rm VH}=\frac{\sum_{i=1}^{N}\frac{{\bf S}_{i}{\bf z}_{i}}{{\bf d}_{i}}}{\sum_{i=1}^{N}\frac{{\bf S}_{i}}{{\bf d}_{i}}}, (1)

where 𝐒i=1{\bf S}_{i}=1 indicates that the ithi^{th} unit has been selected for sampling, and 𝐒i=0{\bf S}_{i}=0 indicates it has not been selected, and 𝐳i{\bf z}_{i} represents the variable of interest measured on the ithi^{th} unit.

We refer to this approach as the Volz-Heckathorn (VH) estimator. Gile and Handcock (2010) illustrate that this estimator consistently out-performs the Salganik-Heckathorn estimator, and we believe it is the most principled estimator currently available for RDS. For these reasons, we use the VH estimator as the standard for comparison in this paper.

Many critical assumptions required to justify this estimator are explored in Gile and Handcock (2010) and listed in Table 2. In this paper, we focus on the reliance on a with-replacement sampling model. We present an estimator based on an alternative sampling model reflecting the without-replacement nature of the sampling process.

3 Successive Sampling for RDS

The Volz-Heckathorn estimator requires many waves of sampling to justify its reliance on a stationary distribution. In practice the number of waves is small (almost always fewer than 20, and often 5 or fewer). Also, we wish to consider a without-replacement process for which stationarity does not apply. It is therefore instructive to consider a special case when stationarity is not necessary.

Consider a graph formed in the following manner: Begin with NN vertices, designated by indices 1:N1:N. Assign to each vertex a number of edge-ends according to arbitrary fixed degree distribution =1,2,K\mathbb{N}=\mathbb{N}_{1},\mathbb{N}_{2},\ldots\mathbb{N}_{K}, where j\mathbb{N}_{j} is the number of nodes of degree jj, 11 and KK are the minimum and maximum degrees, respectively, and subject to the constraint that twice the number of edges, 2E=k1Kkk2E=\sum_{k\in 1\ldots K}k\mathbb{N}_{k}, is even. Now select pairs of edge-ends completely at random and assign an edge connecting each pair. This procedure results in a variant of the so-called configuration model for networks, a popular null model for networks, especially in the physics literature (Molloy and Reed, 1995). Note that this formulation does allow loops, or links to oneself as well as multiple edges between the same pair of vertices, although the rate of these events decreases with increasing population size for fixed maximum degree KK, such that several authors have suggested they are negligible for K<(𝐝¯N)1/2K<(\bar{{\bf d}}N)^{1/2} or K<N1/2K<N^{1/2}, where 𝐝¯\bar{{\bf d}} is the mean degree of the network (Chung and Lu, 2002; Burda and Krzywicki, 2003; Boguna et al., 2004; Catanzaro et al., 2005; Foster et al., 2007).

Now consider a random walk 𝐆{\bf G}^{*} on a set of vertices with degrees given by 𝐝1,𝐝2,,𝐝N{\bf d}_{1},{\bf d}_{2},\ldots,{\bf d}_{N}, such that i𝕀(𝐝i=k)=k\sum_{i}\mathbb{I}({\bf d}_{i}=k)=\mathbb{N}_{k}, where 𝕀(A)\mathbb{I}(A) is the indicator function on AA. Volz and Heckathorn (2008) consider the stationary distribution of this random walk for a fixed graph. Instead, we consider the transition probabilities of the corresponding walk over the distribution of all networks of fixed degree distribution constructed as above, in which the jthj^{th} node visited, 𝐆j{\bf G}^{*}_{j}, is selected from the distribution of possible edges from node 𝐆j1{\bf G}^{*}_{j-1}. The transition probabilities are then given by:

P(𝐆j=𝐠j|𝐆1,𝐆2,𝐆j1=𝐠1,𝐠2,𝐠j1)={𝐝𝐠j2E1𝐠j𝐠j1𝐝𝐠j12E1𝐠j=𝐠j1,\displaystyle P({\bf G}^{*}_{j}={\bf g}^{*}_{j}|{\bf G}^{*}_{1},{\bf G}^{*}_{2},\ldots{\bf G}^{*}_{j-1}={\bf g}^{*}_{1},{\bf g}^{*}_{2},\ldots{\bf g}^{*}_{j-1})=\left\{\begin{array}[]{cl}\frac{{\bf d}_{{\bf g}^{*}_{j}}}{2E-1}&{\bf g}^{*}_{j}\neq{\bf g}^{*}_{j-1}\\ \\ \frac{{\bf d}_{{\bf g}^{*}_{j}}-1}{2E-1}&{\bf g}^{*}_{j}={\bf g}^{*}_{j-1},\end{array}\right. (5)

where this probability is taken over the space of all possible configuration model graphs of given degree distribution, as well as over the steps of the random walk.

This procedure results in stationary distribution proportional to 𝐝i{\bf d}_{i}, and, in fact, selection probabilities at each step very nearly proportional to 𝐝i{\bf d}_{i}. Thus, for a network structure given by such a configuration model, the Volz-Heckathorn estimator constitutes a Hansen and Hurwitz (1943) estimator, without further requirement of sufficient waves for convergence.

Now consider the corresponding self-avoiding random walk 𝐆{\bf G}. In this procedure,

P(𝐆j=𝐠j|𝐆1,𝐆2,𝐆j1=𝐠1,𝐠2,𝐠j1)={𝐝𝐠j2Ei=1j1𝐝𝐠i𝐠j𝐠1𝐠j10𝐠j𝐠1𝐠j1.\displaystyle P({\bf G}_{j}={\bf g}_{j}|{\bf G}_{1},{\bf G}_{2},\ldots{\bf G}_{j-1}={\bf g}_{1},{\bf g}_{2},\ldots{\bf g}_{j-1})=\left\{\begin{array}[]{cl}\frac{{\bf d}_{{\bf g}_{j}}}{2E-\sum_{i=1}^{j-1}{\bf d}_{{\bf g}_{i}}}&{\bf g}_{j}\notin{\bf g}_{1}\ldots{\bf g}_{j-1}\\ 0&{\bf g}_{j}\in{\bf g}_{1}\ldots{\bf g}_{j-1}.\end{array}\right. (8)

This sampling procedure is mathematically equivalent to successive sampling (SS) or probability proportional to size without replacement sampling (PPSWOR), dating back to Yates and Grundy (1953), and typically defined by the following sampling process:

  • Begin with a population of NN units, denoted by indices 1N1\ldots N with varying sizes represented by 𝐝1,𝐝2,𝐝N{\bf d}_{1},{\bf d}_{2},\ldots{\bf d}_{N}, with i=1N𝐝i=2E\sum_{i=1}^{N}{\bf d}_{i}=2E, for total edges EE.

  • Sample the first unit 𝐆1{\bf G}_{1} from the full population {1N}\{1\ldots N\} with probability proportional to size 𝐝i{\bf d}_{i}.

  • Select each subsequent unit with probability proportional to size from among the remaining units, such that conditional sampling probabilities are given by (8).

In the survey sampling literature, mostly based on the work of Raj (1956) and Murthy (1957), this sampling design is referred to as probability proportional to size without replacement (PPSWOR), and typically used in instances where the desired probability proportional to size (PPS) design is not feasible. In such cases, the sizes of population units are all known, the sampling design is implemented, and the main interest is in estimating the population total or mean of a variable measured on sampled units. The analytical properties of this procedure are quite difficult, as suggested by more recent work including Rao et al. (1991) and Kochar and Korwar (2001). In fact, even the marginal unit sampling probabilities are not available in closed form.

In the geological discovery literature, successive sampling is not a purposive design, but an approximation to a non-designed sampling process. Important work in this area includes Andreatta and Kaufman (1986), Nair and Wang (1989), and Bickel et al. (1992). These authors address the case of oil field discovery, in which successive fields are discovered with probabilities proportional to some measure of their size, typically taken to be their volume. This literature does not assume the full population of sizes to be known, and typically takes some function of the sizes, such as the sum of the sizes of undiscovered reserves, as the object of inference. Our application of successive sampling also assumes the unobserved sizes to be unknown, however as in Raj (1956) and Murthy (1957), our object is to use these sizes to estimate the population characteristics on another variable measured on all sampled units. To do so, we both develop a novel algorithm and leverage more recent work by Fattorini (2006).

4 Estimation of population means from RDS samples based on successive sampling

Under successive sampling, for population of sizes given by \mathbb{N} and sample size nn, there is a function f𝝅(k;n,)f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}) mapping the size kk of a unit to its sampling probability 𝝅i{\mbox{\boldmath$\pi$}}_{i}. Our proposed estimator is based on estimated sampling weights, which are based on the 𝝅\pi given by the successive sampling procedure, applied to nodal sampling units and sizes given by nodal degrees. There are three key challenges for this approach. The mapping depends on first, the known population size NN and, second, the degree distribution, \mathbb{N}, neither of which is known in the general RDS case. Finally, given \mathbb{N} and the sample size nn, the mapping is not explicit. The lack of explicit mapping has been addressed by Fattorini (2006), who suggests estimating the mapping by simulation. For known \mathbb{N} and given nn, he simulates the successive sampling procedure, then estimates the inclusion probability 𝝅i{\mbox{\boldmath$\pi$}}_{i} associated with unit ii by:

𝝅~i=𝐔i+1M+1,\displaystyle\tilde{{\mbox{\boldmath$\pi$}}}_{i}=\frac{{\bf U}_{i}+1}{M+1}, (9)

where 𝐔i{\bf U}_{i} is the number of times unit ii is sampled in the MM trials. He proposes using these estimated probabilities in the standard Horvitz-Thompson estimator:

j:𝐒j=1𝐳j𝝅~j.\displaystyle\sum_{j:{\bf S}_{j}=1}\frac{{\bf z}_{j}}{\tilde{{\mbox{\boldmath$\pi$}}}_{j}}. (10)

In most of this paper, we assume the population size, NN, is known. We evaluate the sensitivity of our results to that assumption in Section 5.2, and in the examples in Section 6.

Given the known population size, we present a novel approach to estimating the degree distribution \mathbb{N} jointly with inclusion probabilities when degrees are only observed for sampled nodes. This procedure can be applied beyond the RDS setting whenever the population distribution of unit sizes corresponding to a sample collected through successive sampling is unknown.

Our approach iteratively estimates the population distribution \mathbb{N} and the mapping f𝝅(k;n,):k𝝅f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}):k\to{\mbox{\boldmath$\pi$}}. This approach relies on two key points:

  • For known population of degrees \mathbb{N}, the mapping f𝝅(k;n,)f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}) can be estimated by simulation in a form similar to (9).

  • For known mapping f𝝅(k;n,)f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}), the number (or proportion) of the population of degree kk can be estimated using a form similar to (10).

We leverage these two points to propose the following procedure for the estimation of population mean μ\mu in the case of nodal degrees observed only for sampled units.

Let E[;n,]\mathop{\rm E\,}\nolimits[\cdot;n,\mathbb{N}] denote expectation with respect to a sample of size nn sampled by successive sampling from a population with degree counts ={1,2,,K}.\mathbb{N}=\{\mathbb{N}_{1},\mathbb{N}_{2},\ldots,\mathbb{N}_{K}\}. Then:

E[𝐕k;n,]=kf𝝅(k;n,)k=1,,K\displaystyle\mathop{\rm E\,}\nolimits[{\bf V}_{k};n,\mathbb{N}]=\mathbb{N}_{k}f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N})~~~~~~~~~k=1,\ldots,K (11)

where 𝐕k{\bf V}_{k} is the random variable representing the number of sample units with degree kk, f𝝅(k;n,)=E[𝐒j:𝐝j=k;n,]f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N})=\mathop{\rm E\,}\nolimits[{\bf S}_{j}:{\bf d}_{j}=k;n,\mathbb{N}] is the (common) inclusion probability of a node of degree kk, and KK is the maximum degree, K<NK<N. This suggests first order moment equations for the unknown true :\mathbb{N}:

E[𝐕k;n,]=𝐯kk=1,,K\displaystyle\mathop{\rm E\,}\nolimits[{\bf V}_{k};n,\mathbb{N}]={\bf v}_{k}~~~~~~~~~k=1,\ldots,K (12)

where 𝐯k{\bf v}_{k} is the observed number of sample units with degree kk.

The algorithm is then:

  1. 1.

    Initial estimate:

    f𝝅(k;n,0)=kNl=1Kvll,\displaystyle f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}^{0})=\frac{k}{N}\sum_{l=1}^{K}\frac{v_{l}}{l}, (13)

    that is f𝝅(k;n,0)f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}^{0}) proportional to kk.

  2. 2.

    For i=1,,ri=1,\ldots,r iterate the following steps:

    1. (a)

      Estimate the population distribution of degrees:

      ki=N𝐯kf𝝅(k;n,i1)l=1K𝐯lf𝝅(l;n,i1)k=1,,K\displaystyle{\mathbb{N}_{k}^{i}=N\cdot\frac{\frac{{\bf v}_{k}}{f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}^{i-1})}}{\sum_{l=1}^{K}\frac{{\bf v}_{l}}{f_{{\mbox{\boldmath$\pi$}}}(l;n,\mathbb{N}^{i-1})}}}~~~~~~~~~~~k=1,\ldots,K (14)

      where 𝐯k{\bf v}_{k} is the observed number of sample units with degree kk.

    2. (b)

      Estimate inclusion probabilities:

      1. i.

        Simulate MM successive sampling samples of size nn from a population with composition i.\mathbb{N}^{i}.

      2. ii.

        Estimate the inclusion probabilities:

        f𝝅(k;n,i)=E[𝐕k;n,i]ki𝐔k+1Mki+1,\displaystyle f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}^{i})=\frac{\mathop{\rm E\,}\nolimits[{\bf V}_{k};n,\mathbb{N}^{i}]}{\mathbb{N}_{k}^{i}}\approx\frac{{\bf U}_{k}+1}{M\cdot\mathbb{N}_{k}^{i}+1}, (15)

        where 𝐔k{\bf U}_{k} is the total number of observed units of size kk in the MM simulations.

  3. 3.

    Estimate the population distribution of degrees and the corresponding inclusion probabilities via, respectively: ^=r{\hat{\mathbb{N}}}=\mathbb{N}^{r} and π^()=f𝝅(;n,r).{\hat{\pi}}(\cdot)=f_{{\mbox{\boldmath$\pi$}}}(\cdot;n,\mathbb{N}^{r}).

  4. 4.

    Use the resulting mapping π^{\hat{\pi}} to estimate μ\mu via the generalized Horvitz-Thompson estimator:

    μ^SS=j=1N𝐒j𝐳jπ^(𝐝j)j=1N𝐒jπ^(𝐝j).\displaystyle\hat{\mu}_{SS}=\frac{\sum_{j=1}^{N}\frac{{\bf S}_{j}{\bf z}_{j}}{{\hat{\pi}}({\bf d}_{j})}}{\sum_{j=1}^{N}\frac{{\bf S}_{j}}{{\hat{\pi}}({\bf d}_{j})}}. (16)

For computational efficiency, most simulations in this paper were conducted with M=500M=500 and r=3r=3, with good results. In general, we recommend at least M=2000M=2000 and r=3r=3, and we have used these parameters for the simulations of the standard error procedure in the supplemental materials, in the application to real data, and in the extension to N>1000N>1000 in the discussion. Estimation time scales with sample size, population size, and MM. In our simulations, with N=1000N=1000 and M=500M=500, estimates require about 1.51.5 seconds on a personal computer, increasing to about 66 seconds when M=2000M=2000. In practice, these parameters can be adjusted for desired precision in the solution to (12). Higher values of MM are particularly helpful for more dispersed degree distributions and larger population sizes.

Simulations have shown the results of this procedure to be at least as good as those provided by a first order asymptotic approximation provided by Andreatta and Kaufman (1986). This approach is novel and its theoretical properties are not well understood. It is appealing is to consider the algorithm as a variant of an EM algorithm, such as an ECM algorithm (Meng and Rubin, 1993), with the unobserved part of the degree sequence, dn+1,dNd_{n+1},\ldots d_{N} as the latent variable. Unfortunately, in the design-based frame, these values also fully determine the unknown parameters Nk=i=1N𝕀(di=k)N_{k}=\sum_{i=1}^{N}{\mathbb{I}}(d_{i}=k), resulting in a degenerate likelihood form.

4.1 Between Infinite Population and Full Population Sample

Consider the limiting case of the moment equation (12) where N=nN=n. In this case, this equation is only satisfied for the degree distribution given by, ^k=𝐯k\hat{\mathbb{N}}_{k}={\bf v}_{k}, resulting in π^(𝐝j)=1j{\hat{\pi}}({\bf d}_{j})=1~\forall~j, and μ^SS=μ^\hat{\mu}_{SS}=\hat{\mu}, the sample mean.

Now consider the limit as NN\to\infty, for fixed nn and fixed maximum degree. Then the step-wise selection probabilities for an unsampled node approach values proportional to degree:

𝐝ij𝐝jj:𝐒j=1𝐝j𝐩i0,\displaystyle\frac{{\bf d}_{i}}{\sum_{j}{\bf d}_{j}-\sum_{j:{\bf S}_{j}=1}{\bf d}_{j}}-{\bf p}_{i}\to 0, (17)

where 𝐩i=𝐝ij𝐝j{\bf p}_{i}=\frac{{\bf d}_{i}}{\sum_{j}{\bf d}_{j}}. Then P(Binom(n,𝐩i)>1)0P(\rm{Binom}(n,{\bf p}_{i})>1)\to 0, such that

𝝅i𝝅j𝐩i𝐩j=𝐝i𝐝j,\displaystyle\frac{{\mbox{\boldmath$\pi$}}_{i}}{{\mbox{\boldmath$\pi$}}_{j}}\to\frac{{\bf p}_{i}}{{\bf p}_{j}}=\frac{{\bf d}_{i}}{{\bf d}_{j}}, (18)

for overall inclusion probabilities 𝝅\pi, and step-wise selection probabilities 𝐩{\bf p}. Therefore, μ^SSμ^VH\hat{\mu}_{SS}\to\hat{\mu}_{\rm VH}.

In either limit, NnN\to n or NN\to\infty, μ^SS\hat{\mu}_{SS} approaches an existing estimator. Thus, it retains the professed limiting properties of μ^VH\hat{\mu}_{\rm VH}, such as robustness to bias based on the initial sample, while retaining the favorable finite population characteristics of the sample mean in the case of a large sample fraction. In the next section, we use simulation studies to argue that for n<N<n<N<\infty, the proposed estimator appropriately mediates between these two, and therefore out-performs both.

5 Comparing the New and Existing Estimators:
A Simulation Study

Our simulation study is designed to highlight the treatment of without-replacement sampling in the case of a large sample fraction. To increase the realism of the study, we chose parameters to match the characteristics of the pilot data from the CDC surveillance program (Abdul-Quader et al., 2006) wherever possible. The general procedure was as follows:

  • 1000 networks are simulated under each test condition.

  • An RDS sample is simulated from each sampled network.

  • RDS estimators are computed from each sample.

Because the CDC’s surveillance system aims for a sample size of 500, and many RDS studies approach exhaustion of their populations of interest (Johnston, 2009; CDC, 2008), we fix all sample sizes at 500. We also consider a mean degree of 7, close to the mean of the pilot data from the CDC study (Abdul-Quader et al., 2006).

We assign a discoverable class to each member of the simulated population. In reference to studies designed to estimate the prevalence of infectious disease, we refer to this characteristic as infection status, assigning the infected status (𝐳i=1{\bf z}_{i}=1) to 20% of simulated population members in each simulation. Note that μ^SS\hat{\mu}_{SS} could also be applied to a continuous or categorical variable.

We consider networks sampled from models from the Exponential-family Random Graph Model (ERGM) class (Snijders et al., 2006). Here the relations 𝐲{\bf y} are represented as a realization of the random variable 𝐘{\bf Y} with distribution:

Pη(𝐘=𝐲|𝐱)=exp{𝜼𝐠(𝐲,𝐱)κ(𝜼,𝐱)}𝐲𝒴,\displaystyle P_{\eta}({\bf Y}={\bf y}|{\bf x})=\exp\{{\mbox{\boldmath$\eta$}}{\cdot}{\bf g}({\bf y},{\bf x})-\kappa({\mbox{\boldmath$\eta$}},{\bf x})\}\quad\quad{\bf y}\in{\cal Y}, (19)

where 𝐱{\bf x} are nodal or dyadic covariates, 𝐠(𝐲,𝐱){\bf g}({\bf y},{\bf x}) is a pp-vector of network statistics, 𝜼p{\mbox{\boldmath$\eta$}}\in\mathbb{R}^{p} is the parameter vector, 𝒴{\cal Y} is the set of all possible undirected graphs, and exp{κ(𝜼,𝐱)}=𝐮𝒴exp{𝜼𝐠(𝐮,𝐱)}\exp\{\kappa({\mbox{\boldmath$\eta$}},{\bf x})\}=\sum_{{\bf u}\in{\cal Y}}\exp\{{\mbox{\boldmath$\eta$}}{\cdot}{\bf g}({\bf u},{\bf x})\} is the normalizing constant (Barndorff-Nielsen, 1978). The structure of the networks represented is determined by the choice of 𝐠(𝐲,𝐱){\bf g}({\bf y},{\bf x}).

In this study, we vary the structure of the networks in three ways: First, we consider populations of sizes (i.e. numbers of nodes) 1000, 835, 715, 625, 555, and 525, such that samples of size 500 constitute about 50%, 60%, 70%, 80%, 90%, and 95% of the target population. We focus on this range of sample fractions because it highlights settings in which we might expect to see finite population biases such as those μ^SS\hat{\mu}_{SS} is intended to address. Another critical feature is the activity ratio ww, equal to the ratio of the mean degree of infected nodes to the mean degree of uninfected nodes,

w=i=1Ndizii=1Nzii=1N(1zi)i=1Ndi(1zi).\displaystyle w=\frac{\sum_{i=1}^{N}d_{i}z_{i}}{\sum_{i=1}^{N}z_{i}}\frac{\sum_{i=1}^{N}(1-z_{i})}{\sum_{i=1}^{N}d_{i}(1-z_{i})}. (20)

This measures the differential tendency for the groups to be socially connected in the population. We consider w{0.5,0.8,1,1.1,1.4,1.8,2.5,3}w\in\{0.5,0.8,1,1.1,1.4,1.8,2.5,3\}.

We also induce homophily on infection status in these simulations, parameterized as the relative probability of an edge between two infected nodes, and an edge between an infected and an uninfected node. This is an intuitive parameterization, also used in Gile and Handcock (2010), but differs from that used in other analyses of RDS. Except in Section 5.3, the edge probability between the two infected nodes is fixed at five times that of the mixed dyad. For w=1w=1, this, along with the 20% infected, implies that an edge between two uninfected nodes is twice as likely as an edge in a mixed dyad. We consider several other levels of homophily, and in Section 5.3 we show that this feature has important implications for the bias induced by biased initial samples, however for the case of initial samples at random with respect to 𝐳{\bf z}, as in most of the simulations presented here, bias was not affected by level of homophily, although increasing homophily does increase the variance of both μ^SS\hat{\mu}_{SS} and μ^VH\hat{\mu}_{\rm VH}, and is therefore important to consider in standard error estimation.

These features are represented in the ERGM by choosing network statistics to represent the mean degree, the activity ratio ww, and homophily (based on using the infected  status as a nodal covariate). These three values were specified using the nodemix statnet model term, which includes three parameters corresponding to the three cells of the mixing matrix on infection. The parameter 𝜼\eta was chosen so the expected values of the statistics were equal to the values given above (van Duijn et al., 2009). Samples from the resulting models were taken using the statnet R package (Handcock et al., 2003, 2008).

The RDS sampling mechanism is again designed to mimic that of the CDC’s pilot study. Ten initial sample nodes were chosen for each sample, selected sequentially with probability proportional to degree, (i.e. by successive sampling). In the sensitivity analysis, we also consider initial sample selection regimes dependent on 𝐳{\bf z}. Subsequent sample waves were selected without-replacement by sampling up to two nodes at random from among the un-sampled alters of each sampled node. Exactly two alters were sampled whenever possible. This process typically resulted in the sampling of four complete waves and part of a fifth wave, stopping when a sample size of 500 was attained.

We augment our basic results with two sub-studies evaluating the sensitivity of the estimator μ^SS\hat{\mu}_{SS} to assumptions not required for μ^VH\hat{\mu}_{\rm VH}: the accuracy of the assumed population size N^\hat{N} and the dependence on the initial sample.

5.1 Results

The Volz-Heckathorn estimator exhibits substantial bias in cases of non-unity activity ratio, and more so for larger sample fractions. This result was noted in Gile and Handcock (2010), and is illustrated in Figure 1(a). The bias can be understood as follows: in the case of higher mean degree among infected nodes (w>1)(w>1), and large sample fraction, the higher-degree infected samples will be down-weighted proportional to their degrees. The true without-replacement sampling probabilities are closer to uniform than would be suggested by the proportional-to-degree estimates, such that these higher-degree nodes are excessively down-weighted, leading to negative bias in the estimated proportion infected. The corresponding mappings from degree to sampling weight are illustrated in Figure 2.

Refer to caption
(a) Bias of μ^VH\hat{\mu}_{VH}
Refer to caption
(b) Bias of μ^SS\hat{\mu}_{SS}
Figure 1: Bias of the Volz-Heckathorn and Successive Sampling estimators from samples of size 500 constituting about 50%, 60%, 70%, 80%, 90%, and 95% of the population, for varying activity ratio (ww). The same samples were used for both estimators.
Refer to caption
Figure 2: Estimated mappings from nodal degree to inclusion probability for successive sampling samples of size 500 constituting about 50%, 60%, 70%, 80%, 90%, 95%, and 100% of the population, for simulated network degree distributions with w=1w=1, along with the proportional mapping assumed by the Volz-Heckathorn estimator (for 50% sample), indicated by \propto degree. Note that given the simulated degree distribution, the proportional mapping requires probabilities greater than 1 to attain the desired sample size.

The SS estimator, however, is not subject to this type of bias, as illustrated in Figure 1(b). This is the main contribution of the proposed estimatior. In this plot, the bias is negligible, with the exception of the case in which w=0.5w=0.5. At least two factors, related to the strong homophily and the smaller size of the infected group contribute to this exception. First, the homophily-induced dependence implies the initial sample has greater influence on the final sample than in standard successive sampling. As this sample is selected first, its selection probabilities are closer to proportional to degree than the final successive sampling inclusion probabilities, contributing to a resulting sample slightly over-representing high-degree uninfected nodes. Furthermore, because the infected nodes have low degrees and high homophily, they more often fail to produce both possible recruits. The relative group sizes contribute to the difference in the magnitude of this effect for w<1w<1 and w>1w>1.

The variance of μ^SS\hat{\mu}_{SS} is also consistently lower than that of μ^VH\hat{\mu}_{\rm VH}, combining with the lower bias to yield substantially lower mean squared error, often massively so. The mean squared error of μ^SS\hat{\mu}_{SS} is less than that of μ^VH\hat{\mu}_{\rm VH} for the full parameter space of Figure 1. The combination of bias and variance effects is visible in Figures 3(a) and 3(b).

5.2 Sensitivity to Population Size Estimate

It is important to note that the performance of μ^SS\hat{\mu}_{SS} in Figure 1(b) is dependent on knowledge of the true population size NN. This is often unrealistic in practice. Therefore, we evaluate the performance of μ^SS\hat{\mu}_{SS} in the case of over and under estimation of NN. In particular, we consider small (N^s\hat{N}_{s}) and large (N^l\hat{N}_{l}) estimates of NN given by:

N^s=NNn2,N^l=N+Nn2.\displaystyle\hat{N}_{s}=N-\frac{N-n}{2},~~~\hat{N}_{l}=N+\frac{N-n}{2}. (21)

Figure 3 depicts the results of these simulations for the case of w=1.4w=1.4. Each sub-plot gives the distribution of the estimator over 1000 samples from each of the 6 population proportions. The four sub-plots represent μ^VH\hat{\mu}_{\rm VH} (top left), μ^SS(N)\hat{\mu}_{SS}(N) (top right), μ^SS(N^s)\hat{\mu}_{SS}(\hat{N}_{s}) (bottom left) and μ^SS(N^l)\hat{\mu}_{SS}(\hat{N}_{l}) (bottom right).

Refer to caption
(a) μ^VH\hat{\mu}_{VH}
Refer to caption
(b) μ^SS\hat{\mu}_{SS}
Refer to caption
(c) μ^SS(N^s<N)\hat{\mu}_{SS}(\hat{N}_{s}<N)
Refer to caption
(d) μ^SS(N^l>N)\hat{\mu}_{SS}(\hat{N}_{l}>N)
Figure 3: μ^VH\hat{\mu}_{\rm VH} and μ^SS\hat{\mu}_{SS} from samples of size 500 constituting about 50%, 60%, 70%, 80%, 90%, and 95% of the population. Initial samples selected independent of infection status 𝐳{\bf z}. Activity ratio (ww) fixed at 1.4. The first plots depict μ^VH\hat{\mu}_{\rm VH}, and μ^SS\hat{\mu}_{SS}. The last two depict μ^SS\hat{\mu}_{SS} when the number of nodes is under and over estimated, respectively. The true value is 0.200.20. The same samples were used for each estimator.

In this case, the underestimation of NN results in a small positive bias, due to the over-estimation of the curvature of f𝝅(k;n,)f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}). A small negative bias is also present in the case of N^l\hat{N}_{l}, due to the under-estimation of the curvature or f𝝅(k;n,)f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}). In both cases, the bias induced by inaccurate N^\hat{N} is less than the bias of μ^VH\hat{\mu}_{\rm VH}, and the resulting new estimators clearly out-perform the existing estimator.

Figure 4 presents mean point estimates for other values of activity ratio ww. The middle set of plots, w=0.8w=0.8, corresponds to a case of moderately lower mean degree among infected nodes. In this case, the biases introduced by under and over estimation of NN change sign, but remain small in magnitude. The remaining plots illustrate the greater bias possible for extreme values of ww. While the biases become larger, they are generally smaller than those exhibited by μ^VH\hat{\mu}_{\rm VH}, with the exception for the two cases where the negative bias of μ^SS\hat{\mu}_{SS} due to small ww, as described in Section 5.1 is compounded by negative bias due to N^<N\hat{N}<N (w=0.5,0.8w=0.5,0.8, 50%50\% sample). The mean squared error of μ^SS\hat{\mu}_{SS} is still smaller than that of μ^VH\hat{\mu}_{\rm VH} in these cases. Although the performance of μ^SS\hat{\mu}_{SS} is less robust to N^\hat{N} in cases of extreme ww, μ^VH\hat{\mu}_{\rm VH} also performs more poorly in these cases.

Refer to caption
Figure 4: Mean prevalence estimates of four types: μ^VH\hat{\mu}_{\rm VH} (solid circles), μ^SS\hat{\mu}_{SS} (circles), μ^SS(N^s<N)\hat{\mu}_{SS}(\hat{N}_{s}<N) (down-triangles), and μ^SS(N^l>N)\hat{\mu}_{SS}(\hat{N}_{l}>N) (up-triangles), considered for 50%, 70%, and 90% samples, and differential activity w=0.5,0.8,2.5w=0.5,0.8,2.5. The true value is 0.200.20.

The behavior we see in these plots is typical of other simulations not shown here. With the more extreme values of activity ratio ww, the bias induced by the inaccurate estimation of NN is increased, but typically not so much as to cause performance of μ^SS\hat{\mu}_{SS} worse than that of μ^VH\hat{\mu}_{\rm VH}.

5.3 Sensitivity to Initial Sample and Homophily

The estimator μ^VH\hat{\mu}_{\rm VH} requires sufficiently many sample waves to overcome the effect of an initial convenience sample and step-wise sampling probabilities not proportional to degree due to the particular network structure. The estimator μ^SS\hat{\mu}_{SS}, not based on a stationary distribution, does not allow for such an argument. We therefore argue only that it is not worse than μ^VH\hat{\mu}_{\rm VH} in this respect. In Section 5.1, we illustrate that in the simulated networks, bias induced by deviations from the configuration model are is no worse for μ^SS\hat{\mu}_{SS} than for μ^VH\hat{\mu}_{\rm VH}. We focus here on bias induced by the initial sample. Gile and Handcock (2010) show that μ^VH\hat{\mu}_{\rm VH} exhibits considerable bias in the case of a biased initial sample and network homophily. We expect μ^SS\hat{\mu}_{SS} to exhibit similar sensitivity, and here compare its performance to that of μ^VH\hat{\mu}_{\rm VH}.

We consider three regimes for the selection of the initial sample: all uninfected, random with respect to infection, and all infected. We also consider five levels of homophily, measured by the ratio RR defined by:

R=Probability of an infected-infected tieProbability of an infected-uninfected tie,R=1,2,3,5,13.\displaystyle R=\frac{\textrm{Probability of an {{\elevenrm``}}infected-infected{\elevenrm"}\ tie}}{\textrm{Probability of an {{\elevenrm``}}infected-uninfected{\elevenrm"}\ tie}},~~~~R=1,2,3,5,13. (22)

The standard level of homophily used in this paper corresponds to R=5R=5. To present a comparison most favorable to μ^VH\hat{\mu}_{\rm VH}, we treat a population size of 1000, (a 50%50\% sample), and activity ratio w=1w=1.

We consider 1000 samples from each homophily and sampling scenario, and summarize performance in terms of absolute bias, variance, and mean squared error, the last of which is depicted in Table 1. We find that the variance of μ^VH\hat{\mu}_{\rm VH} always exceeds that of μ^SS\hat{\mu}_{SS}, as does bias in all but a few cases of small differences. The MSE of μ^VH\hat{\mu}_{\rm VH} always exceeds that of μ^SS\hat{\mu}_{SS}. Table 1 illustrates this relation for the case of an all-infected initial sample. The patterns for the other two sampling conditions are similar.

Table 1: Comparison of mean squared error for μ^VH\hat{\mu}_{\rm VH} and μ^SS\hat{\mu}_{SS} with all infected initial sample and five levels of homophily. MSE(μ^SS)<MSE(μ^VH)MSE(\hat{\mu}_{SS})<MSE(\hat{\mu}_{\rm VH}) for all conditions.
Homophily (RR) 1 2 3 5 13
MSE μ^VH\hat{\mu}_{\rm VH} 0.00029 0.00036 0.00054 0.00150 0.01525
MSE μ^SS\hat{\mu}_{SS} 0.00026 0.00032 0.00052 0.00140 0.01449
Efficiency: MSE μ^VH\hat{\mu}_{\rm VH}/MSE μ^SS\hat{\mu}_{SS} 1.12 1.12 1.04 1.07 1.05

6 Application to HIV Prevalence in High-Risk
Populations

6.1 Background

The United Nations requires countries to measure and monitor key indicators related to their HIV epidemics (UNAIDS and World Health Organization, 2007; UNAIDS, 2008). In particular, countries with epidemics concentrated in high-risk groups are required to report on several features of key populations such as (injecting) drug users ((I)DU), men who have sex with men (MSM), and sex workers (SW). Such features include HIV prevalence, risk behaviors, and population sizes. Because these populations are typically hard-to-reach, many countries rely on respondent-driven sampling to estimate HIV prevalence and risk behaviors. In this section, we evaluate data collected on IDU and MSM in cities from two different countries in 2007 and 2008.

Refer to caption
(a) East Europe IDU
Refer to caption
(b) Caribbean DU
Refer to caption
(c) Caribbean MSM
Figure 5: Estimated HIV prevalence in three populations according to μ^SS\hat{\mu}_{SS}, μ^VH\hat{\mu}_{\rm VH}, and the sample mean, for various population size estimates. Dotted lines represent 1 standard error above and below μ^SS\hat{\mu}_{SS}, according to the estimator in the supplemental materials. Vertical bars on (a) represent exogenously estimated population size and on (c) represent rule-of-thumb 1% to 3% MSM in the population.

6.2 Injecting Drug Users in an East European City

The first example is based on a 2007 survey of IDU in major cities in a former Soviet block country. The HIV epidemic in this country is largely driven by IDU, where prevalence levels are over 50% in many cities, and RDS is used to study this population, as well as MSM and SW in several larger cities. This country also invests heavily in producing estimates of the size of the hidden population through scale-up and multiplier methods (Kruglov et al., 2008; UNAIDS/WHO, 2003). We focus here on the results for one city. Because in very large populations we expect μ^SS\hat{\mu}_{SS} to be nearly identical to μ^VH\hat{\mu}_{\rm VH}, we consider a city with one of the smaller estimates for the size of the hidden population. In this city, the number of IDU is estimated at 1200, with confidence interval 1100-1400. The RDS sample began with 6 initial samples, and distributed 3 coupons per respondent. The sample size was 175, all with full degree and HIV data. The two longest sample chains ended at wave 9, with 4 total respondents from wave 9.

We estimate the HIV prevalence using μ^SS\hat{\mu}_{SS} for several population sizes including 1100, 1200, and 1400, and also report the corresponding estimates of μ^VH\hat{\mu}_{\rm VH} and the sample mean. These results are summarized in Figure 5(a). We find that the prevalence estimate based on μ^SS\hat{\mu}_{SS} is very close to that based on μ^VH\hat{\mu}_{\rm VH} for population sizes in this range, and well within the uncertainty of the estimate, according to the estimator in the supplemental materials. This pattern is consistent across the cities we have considered in this country. This is partly due to the policy in this country of using RDS only in cities with larger estimated population sizes, opting for strategies of institutional sampling or attempted complete enumeration in areas with smaller populations.

6.3 Drug Users in a Caribbean City

The second example is taken from a small Caribbean country, which has conducted RDS studies of drug users (DU), MSM, and SW in four main cities in 2008 (note that this study did not limit participation to injecting drug users). We focus here on the study of DU in one of the cities with smaller total population. In this study, there were 7 initial samples, resulting in a sample of size 301, of which we include here only the 285 with full degree and HIV information. Again three coupons were distributed to most respondents, although this number was reduced as the sample size approached the target of 300. The two longest sample chains reached wave 11, with 6 respondents from this wave.

In this city, the number of DU is unknown. Therefore, we use the successive sampling estimator to produce a sensitivity analysis for the effect of population size on the HIV prevalence estimate. The results of this analysis are shown in Figure 5(b). Here the point estimate varies from 6.2% for a population size of 301 to 9.6% for a population size of 6000, with μ^VH=9.7%\hat{\mu}_{\rm VH}=9.7\%. In absolute and relative terms, this is more variation than in the previous example, (3.4% or about 15% of μ^VH\hat{\mu}_{\rm VH}). Although these differences are still well within the uncertainty of the estimator, in many cases RDS point estimates are important in themselves. It is also possible that a sensitivity analysis such as this one will be of interest with respect to a particular prevalence threshold, such at the 5%5\% threshold on which UNAIDS bases national epidemic classifications. In this example, the point estimate of prevalence is above 5%5\% for all population sizes, while a nominal 90% confidence interval includes 5% for all population sizes.

6.4 Men who have Sex with Men in a Caribbean City

Finally, we consider a population of MSM in the same Caribbean city. This study design was very similar to the corresponding DU study, with 7 initial samples, and a maximum of 3 coupons distributed. Here, the sample size was 270, with complete information for 269. One chain in this sample reached wave 12, with two respondents in that wave.

Figure 5(c) illustrates the sensitivity of the prevalence estimate to the number of MSM in this city. For comparison, the horizontal axis corresponds to the same population sizes as in 5(b), although here it is labeled in proportions. The prevalence estimate changes by about 1.6 percentage points over this range of possible population sizes (9.5% for 301 or .2% of the population are MSM, to 7.9% for 6000 or 4.2% of the population are MSM) . There are no additional studies available to provide population size estimates, however this problem is less severe for populations of MSM, as a rule-of-thumb is often applied, estimating the number of MSM at about 1%-3% of the general population, corresponding to the two vertical bars on Figure 5(c). In this case, however, further information is available. This sample did not reach its desired sample size (300), because no more coupons were returned, and the research team was not able to find additional MSM to sample. Thus the sample neared exhaustion of the portion of the population available for sampling, suggesting that although the population of MSM might be 1-3% of the city population, a smaller portion of those may be connected to the giant component of the social network of MSM and willing to participate in such a study. Therefore, if we restrict our inference to the reachable portion of the target population, the relevant population size is likely much closer to the sample size, suggesting a value of the μ^SS\hat{\mu}_{SS} estimate over 1.5 percentage ponts higher than μ^VH=7.8%\hat{\mu}_{\rm VH}=7.8\%. Note that among the 12 RDS studies in this country, three samples exhibited this near exhaustion behavior.

It is also of interest to note that in this example, unlike the previous two, μ^SS\hat{\mu}_{SS} is consistently higher than μ^VH\hat{\mu}_{\rm VH}. This result is consistent with the expectation that μ^SS\hat{\mu}_{SS} is typically between μ^VH\hat{\mu}_{\rm VH} and the sample mean. In this case, the degree-based weights of μ^VH\hat{\mu}_{\rm VH} reduced the overall weight given to infected nodes, and the more moderate weights used in μ^SS\hat{\mu}_{SS} reduced this effect.

7 Discussion

The key insight of this paper is the recognition that the true mapping f𝝅(k;n,)f_{{\mbox{\boldmath$\pi$}}}(k;n,\mathbb{N}) from nodal degree 𝐝i{\bf d}_{i} to inclusion probability, 𝝅i{\mbox{\boldmath$\pi$}}_{i} under Respondent-Driven Sampling is better approximated by successive sampling than by the linear mapping assumed by Volz and Heckathorn (2008). In addition, we introduce a novel approach to estimating the unit size (or degree) distribution in the population, based on the population size and sizes of observed units under successive sampling. Combining this insight and estimation strategy, we present a new estimator for population proportions based on an RDS sample.

The contribution of this new estimator is illustrated in Figure 6. In cases with no correlation between degree and quantity of interest (w=0w=0), and initial sample selected at random with respect to infection status, the naive sample mean, μ^\hat{\mu} is an unbiased estimator of the population mean, as are existing estimators such as μ^VH\hat{\mu}_{\rm VH} and the proposed estimator μ^SS\hat{\mu}_{SS}. This is illustrated in the first three columns of the figure.

However, when the variable of interest is related to nodal degree, classes with higher degrees are over-represented in the population mean, resulting in the positive bias in the sample mean μ^\hat{\mu} in the second three columns of Figure 6. This bias shrinks as the sample fraction increases. The existing estimator μ^VH\hat{\mu}_{\rm VH} adjusts reasonably well for this effect when the sample fraction is small, however for larger sample fractions, this estimator over-compensates for the effect of degree distribution, resulting in bias opposite that of μ^\hat{\mu}. The contribution of the proposed estimator, μ^SS\hat{\mu}_{SS} is that it correctly adjusts for the joint effects of varying degree distributions and large sample fractions.

The last three columns of Figure 6 illustrate a shortcoming of all three of these estimators. In this case, all initial samples, or seeds, are selected from among the infected nodes. This results in increased positive bias in all three of these estimators. All three estimators are also subject to other sources of bias discussed in Gile and Handcock (2010), including bias induced by the systematically biased passing of RDS coupons.

Refer to caption
Figure 6: Mean prevalence estimates of three types: μ^VH\hat{\mu}_{\rm VH} (solid circles), μ^SS\hat{\mu}_{SS} (circles), and sample mean μ^\hat{\mu} (crosses), considered for 50%, 70%, and 90% samples, and activity ratio w=1,1.8w=1,1.8, and for initial sample (seeds) selected at random with respect to infection and all infected. None of the estimators exhibit bias in the face of homophily with w=1w=1. When w1w\neq 1, μ^VH\hat{\mu}_{\rm VH} and μ^\hat{\mu} exhibit bias, with magnitude sensitive to sample fraction. All estimators exhibit bias in the case of biased initial sample.

Because RDS is in wide usage, and often in cases where the sample fraction is large, this new estimator may improve estimation in many contexts. Its applicability is limited, however, by the requirement that the population size NN is known. Figures 3 and 4 illustrate that inaccurate estimates of NN can introduce bias into μ^SS\hat{\mu}_{SS}, although this new estimator still out-performs μ^VH\hat{\mu}_{\rm VH} with the level of inaccuracy considered in this study.

While most of our simulation study has focused on sample fractions consistent with the finite population effects that μ^SS\hat{\mu}_{SS} is intended to address, it is of interest to note that μ^SS\hat{\mu}_{SS} is nearly identical to μ^VH\hat{\mu}_{\rm VH} for small sample fractions. We have replicated much of the simulation study here with population size N=10,000N=10,000. We find no significant differences between μ^SS\hat{\mu}_{SS} and μ^VH\hat{\mu}_{\rm VH} in these simulations, over a full range of values of ww. When using inaccurate estimates of NN as in (21), we found only one case of significant difference between μ^VH\hat{\mu}_{\rm VH} and μ^SS\hat{\mu}_{SS}, corresponding to a dramatic under-estimate of population size, Ns^=5250,N=10,000\hat{N_{s}}=5250,N=10,000, and for high activity ratio, w=3w=3. In this case, μ^SS\hat{\mu}_{SS} had bias 0.01390.0139 and μ^VH\hat{\mu}_{\rm VH} had bias 0.01050.0105. The difference in mean squared error was not significant. Thus, the proposed estimator is helpful in correcting for finite population biases when they exist, and sensitive to the estimated population size in these cases. In cases where finite population effects are not present, the proposed estimator nearly coincides with existing estimator μ^VH\hat{\mu}_{\rm VH}, and, unless the inaccurate population size estimate is small enough to transition to the region of finite population sensitivity, μ^SS\hat{\mu}_{SS} is insensitive to population size estimates in these cases.

Known population size NN and random mixing are key assumptions of the estimator μ^SS\hat{\mu}_{SS}. Additional assumptions required by μ^SS\hat{\mu}_{SS} are listed in Table 2. The assumptions with strikethrough are necessary for μ^VH\hat{\mu}_{\rm VH} but not for μ^SS\hat{\mu}_{SS}, and the italic text indicates additional assumptions required by μ^SS\hat{\mu}_{SS} but not μ^VH\hat{\mu}_{\rm VH}. In these terms, the key contribution of μ^SS\hat{\mu}_{SS} is to remove the dependence on a known inaccurate random walk model for estimating sampling probabilities. In return, μ^SS\hat{\mu}_{SS} sacrifices the theoretical robustness to the initial sample promised by the Markov chain model, although this robustness was not truly present in μ^VH\hat{\mu}_{\rm VH} either, and the proposed estimator performs no worse than the former in this respect (Table 1). More critically, μ^SS\hat{\mu}_{SS} relies on the assumption of known population size. The estimator μ^SS\hat{\mu}_{SS} is also sensitive to the degree of non-random mixing, or homophily in the network, although no more sensitive than μ^VH\hat{\mu}_{\rm VH} (Table 1), as well as to other assumptions such as accurate self-reported degree and an undirected network. Gile and Handcock (2010) illustrate the sensitivity of μ^VH\hat{\mu}_{\rm VH} to deviations from some other assumptions in Table 2. We do not expect μ^SS\hat{\mu}_{SS} to be more robust to such deviations than μ^VH\hat{\mu}_{\rm VH}.

Table 2: Assumptions of μ^SS\hat{\mu}_{SS}. Assumptions with strikethrough apply to μ^VH\hat{\mu}_{\rm VH} but not μ^SS\hat{\mu}_{SS}. Assumptions in the italic apply to μ^SS\hat{\mu}_{SS} but not μ^VH\hat{\mu}_{\rm VH}.
Network Structure Sampling Assumptions
Assumptions
Random Walk Population size large (N>>nN>>n) Sampling with replacement
Model Single non-branching chain
Remove Initial Homophily weak enough Sufficiently many sample waves
Sample Connected graph Initial sample unbiased
Dependence
To Estimate All ties reciprocated Degree accurately measured
Probabilities Known population size NN Random referral

Our application to HIV prevalence estimation in Section 6 illustrates several uses for μ^SS\hat{\mu}_{SS}. First, when the population size is known, μ^SS\hat{\mu}_{SS} provides a better estimate of population proportions than the other available methods. This can also be done when only a range of population sizes is known. In particular, in the case of MSM, officials are often willing to assume that the population ranges between 1% and 3%. When no information is available on population size, μ^SS\hat{\mu}_{SS} can be used to perform a sensitivity analysis. Finally, other information from the sampling process, such as the exhaustion of the population available for sampling, may suggest that the sample fraction is large. Additional information may also be gathered by asking respondents about their exposure to others who have been sampled.

Intuition suggests that in the case of a large sample fraction, an RDS estimator should negotiate between the infinite-population assumption of the Volz-Heckathorn estimator and the full-population sample assumption of the naive sample mean. In this paper, we provide an estimator based on a successive sampling model that appropriately negotiates between these two extremes. Whenever the hidden population size is known, it will be preferable to use this estimator. When the hidden population size is not known, this estimator still provides a helpful diagnostic check on the other available estimators.

Beyond the estimator itself, this paper contributes a new theoretical framework for understanding the sampling process in respondent-driven sampling. Previous understandings have relied on a with-replacement or infinite population assumption, an assumption known to be inaccurate, and critically so in some populations. The introduction of a sampling model appropriately accommodating the true without-replacement nature of the sampling process opens new possibilities for future research on respondent-driven sampling. We intend to make code available for these procedures in the R package RDS on CRAN.

References

  • Abdul-Quader et al. (2006) Abu S. Abdul-Quader, Douglas D. Heckathorn, Courtney McKnight, Heidi Bramson, Chris Nemeth, Keith Sabin, Kathleen Gallagher, and Don C. Des Jarlais. Effectiveness of respondent-driven sampling for recruiting drug users in New York City: Findings from a pilot study. Journal of Urban Health, 83:459–476, 2006.
  • Andreatta and Kaufman (1986) Giovanni Andreatta and Gordon M. Kaufman. Estimation of finite population properties when sampling is without replacement and proportional to magnitude. Journal of the American Statistical Association, 81(395):657–666, 1986.
  • Barndorff-Nielsen (1978) Ole E. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, New York, 1978.
  • Bickel et al. (1992) Peter J. Bickel, Vijayan N. Nair, and Paul C. C. Wang. Nonparametric inference under biased sampling from a finite population. The Annals of Statistics, 20(2):853–878, 1992.
  • Boguna et al. (2004) Marian Boguna, Romualdo Pastor-Satorras, and Alessandro Vespignani. Cut-offs and finite size effects in scale-free networks. European Physical Journal B, 38(2):205–209, 2004.
  • Burda and Krzywicki (2003) Zdzislaw Burda and Andre Krzywicki. Uncorrelated random networks. Physical Review E, 67(4):046118, April 2003.
  • Catanzaro et al. (2005) Michele Catanzaro, Marian Boguna, and Romualdo Pastor-Satorras. Generation of uncorrelated random scale-free networks. Physical Review E, 71(2, Part 2), 2005.
  • CDC (2008) CDC. Consultation on the analysis of data collected through respondent driven sampling. Discussion session, February 12-14, 2008. Atlanta, GA, Keith Sabin, Organizer.
  • Chung and Lu (2002) Fan Chung and Linyuan Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6:125–145, 2002.
  • Fattorini (2006) Lorenzo Fattorini. Applying the Horvitz-Thompson criterion in complex designs: A computer-intensive perspective for estimating inclusion probabilities. Biometrika, 93(2):269–278, 2006.
  • Foster et al. (2007) Jacob G. Foster, David V. Foster, Peter Grassberger, and Maya Paczuski. Link and subgraph likelihoods in random undirected networks with fixed and partially fixed degree sequences. Physical Review E, 76(4, Part 2), 2007.
  • Gile and Handcock (2010) Krista J. Gile and Mark S. Handcock. Respondent-driven sampling: An assessment of current methodology. Sociological Methodology, 40:forthcoming, 2010. URL http://www3.interscience.wiley.com/journal/120121482/issue.
  • Goel and Salganik (2009) Sharad Goel and Mathew J. Salganik. Respondent driven sampling as Markov Chain Monte Carlo. Statistics in Medicine, 28:2202–2229, 2009.
  • Goodman (1961) Leo A Goodman. Snowball sampling. Annals of Mathematical Statistics, 32:148–170, 1961.
  • Handcock et al. (2003) Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris. statnet: Software Tools for the Statistical Modeling of Network Data. Statnet Project http://statnetproject.org/, Seattle, WA, 2003. URL http://CRAN.R-project.org/package=statnet. R package version 2.0.
  • Handcock et al. (2008) Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris. ergm: A package to fit, simulate and diagnose exponential-family models for networks. Journal of Statistical Software, 24(3), 2008. URL http://www.jstatsoft.org/v24/i03/.
  • Hansen and Hurwitz (1943) Morris H. Hansen and William N. Hurwitz. On the theory of sampling from finite populations. The Annals of Mathematical Statistics, 14(4):333–362, 1943.
  • Heckathorn (1997) Douglas D. Heckathorn. Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems, 44:174–199, 1997.
  • Heckathorn (2002) Douglas D. Heckathorn. Respondent-driven sampling II: Deriving valid population estimates from chain-referral samples of hidden populations. Social Problems, 49(1):11–34, 2002.
  • Johnston (2009) Lisa Johnston. personal communication, 2009.
  • Johnston et al. (2008) Lisa G. Johnston, Moshen Malekinejad, Carl Kendall, Irene M. Iuppa, and George W. Rutherford. Implementation challenges to using respondent-driven sampling methodology for HIV biological and behavioral surveillance: Field experiences in international settings. AIDS and Behavior, 12:131–141, 2008.
  • Kochar and Korwar (2001) Subhash C. Kochar and Ramesh Korwar. On random sampling without replacement from a finite population. Annals of the Institute of Statistical Mathematics, 53(3):631–646, 2001.
  • Kruglov et al. (2008) Y. V. Kruglov, Y. V. Kobyshcha, T. Salyuk, O. Varetska, A. Shakarishvili, and V. P. Saldanha. The most severe HIV epidemic in Europe: Ukraine’s national HIV prevalence estimates for 2007. Sexually Transmitted Infections, 84(Suppl 1):i37–41, 2008.
  • Lansky et al. (2007) Amy Lansky, Abu S. Abdul-Quader, Melissa Cribbin, Tricia Hall, Teresa J. Finlayson, Richard S. Garfein, Lillian S. Lin, and Patrick S. Sullivan. Developing an HIV behavioral surveillance system for injecting drug users: the National HIV Behavioral Surveillance System. Technical Report Public Health Reports 2007; 122 Suppl 1: 48-55 17354527, Division of HIV/AIDS Prevention, National Center for HIV, STD, and TB Prevention, Centers for Disease Control and Prevention, 2007.
  • Meng and Rubin (1993) Xiao-Li Meng and Donald B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267–278, 1993.
  • Molloy and Reed (1995) Michael S. Molloy and Bruce A. Reed. A critical point for random graphs with a given degree sequence. Random Structures and Algorithms, 6:161–179, 1995.
  • Muhib et al. (2001) Farzana B. Muhib, Lillian S. Lins, Ann Stueve, Robin L. Miller, Wesley L. Ford, Wayne D. Johnson, and Philip J. Smith. A venue-based method for sampling hard-to-reach populations. Technical Report Public Health Reports 2001; 116 Suppl 1: 216222, Association of Schools of Public Health, Washington, DC, 2001.
  • Murthy (1957) M. N. Murthy. Ordered and unordered estimators in sampling without replacement. Sankhya: The Indian Journal of Statistics, 18:379–390, 1957.
  • Nair and Wang (1989) Vijayan N. Nair and Paul C. C. Wang. Maximum likelihood estimation under a successive sampling discovery model. Technometrics, 31(4):423–436, 1989.
  • Neely (2009) W. Whipple Neely. Bayesian methods for data from respondent driven sampling. Dissertation in-progress, Department of Statistics, University of Wisconsin, Madison, 2009.
  • Peterson et al. (2008) James A. Peterson, Heather S. Reisinger, Robert P. Schwartz, Shannon G. Mitchell, Sharon M. Kelly, Barry S. Brown, and Michael H. Agar. Targeted sampling in drug abuse research: A review and case study. Field Methods, 20(2):155–170, May 2008.
  • Raj (1956) Des Raj. Some estimators in sampling with varying probabilities without replacement. Journal of the American Statistical Association, 51(274):269–284, 1956.
  • Rao et al. (1991) T. J. Rao, S. Sengupta, and B. K. Sinha. Some order relations between selection and inclusion probabilities for PPSWOR sampling scheme. Metrika, 38:335–343, 1991.
  • Salganik and Heckathorn (2004) Matthew J. Salganik and Douglas D. Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34:193–239, 2004.
  • Snijders et al. (2006) Tom A. B. Snijders, Philippa Pattison, Garry L. Robins, and Mark S. Handcock. New specifications for exponential random graph models. Sociological Methodology, 36:99–153, 2006.
  • Thompson (2006a) Steven K. Thompson. Adaptive web sampling. Biometrics, 62(4):1224–1234, 2006a.
  • Thompson (2006b) Steven K. Thompson. Targeted random walk designs. Survey Methodology, 32(1):11–24, 2006b.
  • UNAIDS (2008) UNAIDS. 2008 Guidance and Specifications for Additional Reecommended Indicators. Technical report, UNAIDS - Joint United Nations Programme on HIV/AIDS, 2008. URL http://www.unaids.org.
  • UNAIDS and World Health Organization (2007) UNAIDS and World Health Organization. Guidelines on Construction of Core Indicators. 2008 Reporting. Technical report, UNAIDS - Joint United Nations Programme on HIV/AIDS, 2007. URL http://www.unaids.org.
  • UNAIDS/WHO (2003) UNAIDS/WHO. Estimating the size of populations at risk for HIV: Issues and methods. Technical report, UNAIDS/WHO Working Group on HIV/AIDS, 2003. URL http://www.unaids.org.
  • van Duijn et al. (2009) Marijtje A. J. van Duijn, Mark S. Handcock, and Krista J. Gile. A framework for the comparison of maximum pseudo likelihood and maximum likelihood estimation of exponential family random graph models. Social Networks, 31:52–62, 2009.
  • Volz and Heckathorn (2008) Erik Volz and Douglas D. Heckathorn. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics, 24(1):79–97, 2008.
  • Volz et al. (2007) Erik Volz, Cyprian Wejnert, Ismail Degani, and Douglas D. Heckathorn. Respondent-Driven Sampling Analysis Tool (RDSAT) Version 5.6, 2007. Cornell University, Ithaca, NY.
  • Watters and Biernacki (1989) John K. Watters and Patrick Biernacki. Targeted sampling: Options for the study of hidden populations. Social Problems, 36(4):416–430, 1989.
  • Yates and Grundy (1953) Frank Yates and P. Michael Grundy. Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society. Series B (Methodological), 15(2):253–261, 1953.