
DISCount: Counting in Large Image Collections with Detector-Based Importance Sampling

Gustavo Perez     Subhransu Maji     Daniel Sheldon
Manning College of Information & Computer Sciences
University of Massachusetts, Amherst
{gperezsarabi, smaji, sheldon}@cs.umass.edu
Abstract

Many modern applications use computer vision to detect and count objects in massive image collections. However, when the detection task is very difficult or in the presence of domain shifts, the counts may be inaccurate even with significant investments in training data and model development. We propose DISCount, a detector-based importance sampling framework for counting in large image collections that integrates an imperfect detector with human-in-the-loop screening to produce unbiased estimates of counts. We propose techniques for solving counting problems over multiple spatial or temporal regions using a small number of screened samples and for estimating confidence intervals. This enables end-users to stop screening when estimates are sufficiently accurate, which is often the goal in a scientific study. On the technical side, we develop variance reduction techniques based on control variates and prove the (conditional) unbiasedness of the estimators. DISCount leads to a 9-12$\times$ reduction in labeling costs over naive screening for the tasks we consider, such as counting birds in radar imagery or estimating damaged buildings in satellite imagery, and also surpasses alternative covariate-based screening approaches in efficiency.

*Equal advising contribution.

1 Introduction

Many modern applications use computer vision to detect and count objects in massive image collections. For example, we are interested in applications that involve counting bird roosts in radar images and damaged buildings in satellite images. The image collections are too massive for humans to solve these tasks in the available time. Therefore, a common approach is to train a computer vision detection model and run it exhaustively on the images.

The task is interesting because the goal is not to generalize, but to achieve the scientific counting goal with sufficient accuracy for a fixed image collection. The best use of human effort is unclear: it could be used for model development, labeling training data, or even directly solving the counting task! A particular challenge occurs when the detection task is very difficult, so the accuracy of counts made on the entire collection is questionable even with huge investments in training data and model development. Some works resort to human screening of the detector outputs Norouzzadeh et al. (2018); Nurkarim and Wijayanto (2023); Perez et al. (2022), which saves time compared to manual counting but is still very labor intensive.

These considerations motivate statistical approaches to counting. Instead of screening the detector outputs for all images, a human can “spot-check” some images to estimate accuracy, and, more importantly, use statistical techniques to obtain unbiased estimates of counts across unscreened images. In a related context, Meng et al. (2022) proposed IS-count, which uses importance sampling to estimate total counts across a collection when (satellite) images are expensive to obtain by using spatial covariates to sample a subset of images.

We contribute counting methods for large image collections that build on IS-count in several ways. First, we work in a different model where images are freely available and it is possible to train a detector to run on all images, but the detector is not reliable enough for the final counting task, or its reliability is unknown. We contribute human-in-the-loop methods for count estimation using the detector to construct a proposal distribution, as seen in Fig. 1. Second, we consider solving multiple counting problems—for example, over disjoint or overlapping spatial or temporal regions—simultaneously, which is very common in practice. We contribute a novel sampling approach to obtain simultaneous estimates, prove their (conditional) unbiasedness, and show that the approach allocates samples to regions in a way that approximates the optimal allocation for minimizing variance. Third, we design confidence intervals, which are important practically to know how much human effort is needed. Fourth, we use variance reduction techniques based on control variates.

Our method produces unbiased estimates and confidence intervals with reduced error compared to covariate-based methods. In addition, the labeling effort is further reduced with DISCount as we only have to verify detector predictions instead of producing annotations from scratch. On our tasks, DISCount leads to a 9-12$\times$ reduction in labeling costs over naive screening and a 6-8$\times$ reduction over IS-Count. Finally, we show that solving multiple counting problems jointly can be done more efficiently than solving them separately, demonstrating a more efficient use of samples.


Figure 1: $k$-DISCount uses detector-based importance sampling to screen counts and solve multiple counting problems. (a) Geographical regions where we want to estimate counts of damaged buildings. (b) Outputs of a damaged building detector on satellite imagery, which can be used to estimate counts $g(s)$ for each tile (shown as dots). (c) Tiles selected for human screening to obtain true counts $f(s)$, from which counts for all regions are jointly estimated by $k$-DISCount. (d) Our experiments show that DISCount outperforms naive (MC) and covariate-based sampling (IS-Count Meng et al. (2022)).

2 Related Work

Computer vision techniques have been deployed for counting in numerous applications where exhaustive human labeling is expensive due to the sheer volume of imagery involved. This includes areas such as detecting animals in camera trap imagery Norouzzadeh et al. (2018); Tuia et al. (2022), counting buildings, cars, and other structures in satellite images Nurkarim and Wijayanto (2023); Cavender-Bares et al. (2022); Burke et al. (2021); Leitloff et al. (2010), species monitoring in citizen science platforms Tuia et al. (2022); Van Horn et al. (2018), and monitoring traffic in videos Won (2020); Coifman et al. (1998), as well as various applications in medicine, science, and engineering. For many applications, the cost of training an accurate model is considerably less than that of meticulously labeling the entire dataset. Even with a less accurate model, human-in-the-loop recognition strategies have been proposed to reduce annotation costs by integrating human validation with noisy predictions Branson et al. (2010); Wah et al. (2014).

Our approach is related to work in active learning Settles (2009) and semi-supervised learning Chapelle et al. (2009), where the goal is to reduce human labeling effort to learn models that generalize on i.i.d. held out data. While these approaches reduce the cost of labels on training data, they often rely on large labeled test sets to estimate the performance of the model, which can be impractical. Active testing Nguyen et al. (2018); Kossen et al. (2021) aims to reduce the cost of model evaluation by providing a statistical estimate of the performance using a small number of labeled examples. Unlike traditional learning where the goal is performance on held out data, the goal of active testing is to estimate performance on a fixed dataset. Similarly, our goal is to estimate the counts on a fixed dataset, but different from active testing we are interested in estimates of the true counts and not the model’s performance. In particular, we want unbiased estimates of counts even when the detector is unreliable. Importantly, since generalization is not the goal, overfitting to the dataset statistics may lead to more accurate estimates.

Statistical estimation has been widely used to conduct surveys (e.g., estimating population demographics, polling, etc.) Cochran (1977). In IS-Count Meng et al. (2022), the authors propose an importance sampling approach to estimate counts in large image collections using humans in the loop. They showed that one can count the number of buildings at the continental scale by sampling a small number of regions based on covariates such as population density and annotating those regions, thereby reducing the cost of obtaining high-resolution satellite imagery and human labels. However, for many applications the dataset is readily available and running the detector is cost effective, but human screening is expensive. To address this, we propose using the detector to guide the screening process and demonstrate that this significantly reduces error rates in count estimation given a fixed amount of human effort. Furthermore, for some applications, screening the outputs of a detector can be significantly faster than annotating from scratch, leading to additional savings.

An interesting question is how best to utilize human screening effort when counting on a dataset. For example, labels might be used to improve the detector, to measure performance on the deployed dataset, or, as in our work, to derive a statistical estimate of the counts. Our work is motivated by problems where improving the detector might require significant effort, but counts from the detector are correlated with true counts and can be used as a proposal distribution for sampling.

3 DISCount: Detector-based IS-Count

Consider a counting problem in a discrete domain $\Omega$ (usually spatiotemporal) with elements $s\in\Omega$ that represent a single unit such as an image, grid cell, or day of year. For each $s$ there is a ground truth “count” $f(s)\geq 0$, which can be any non-negative measurement, such as the number or total size of all objects in an image. A human can label the underlying images for any $s$ to obtain $f(s)$.

Define $F(S)=\sum_{s\in S}f(s)$ to be the cumulative count for a region $S$. We wish to estimate the total counts $F(S_{1}),\ldots,F(S_{k})$ for $k$ different subsets $S_{1},\ldots,S_{k}\subseteq\Omega$, or regions, while using human effort as efficiently as possible. The regions represent different geographic divisions or time ranges and may overlap. For example, in the roost detection problem we want to estimate cumulative counts of birds for each day of the year, while disaster-relief planners want to estimate building damage across different geographical units such as towns, counties, and states. Assume without loss of generality that $\bigcup_{i=1}^{k}S_{i}=\Omega$; otherwise the domain can be restricted so this is true.

We will next present our methods; derivations and proofs of all results are found in the appendix.

3.1 Single-Region Estimators

Consider first the problem of estimating the total count $F(S)$ for a single region $S$. Meng et al. (2022) studied this problem in the context of satellite imagery, with the goal of minimizing the cost of purchasing satellite images to obtain an accurate estimate.

Simple Monte Carlo Meng et al. (2022)

This is a baseline based on simple Monte Carlo sampling. Write $F(S)=\sum_{s\in S}f(s)=|S|\cdot\mathbb{E}_{s\sim\text{Unif}(S)}[f(s)]$. Then the following estimator, which draws $n$ random samples uniformly in $S$ to estimate the total, is unbiased:

$$\hat{F}_{\text{MC}}(S)=|S|\cdot\frac{1}{n}\sum_{i=1}^{n}f(s_{i}),\quad s_{i}\sim\text{Unif}(S).$$

IS-Count Meng et al. (2022)

Meng et al. then proposed an estimator based on importance sampling Owen (2013). Instead of sampling uniformly, the method samples from a proposal distribution $q$ that is cheap to compute for all $s\in S$. For example, to count buildings in US satellite imagery, the proposal distribution could use maps of artificial light intensity, which are freely available. The importance sampling estimator is:

$$\hat{F}_{\text{IS}}(S)=\frac{1}{n}\sum_{i=1}^{n}\frac{f(s_{i})}{q(s_{i})},\quad s_{i}\sim q.$$

DISCount

IS-Count assumes images are costly to obtain, which motivates using external covariates for the proposal distribution. However, in many scientific tasks the images are readily available, and the key cost is that of human supervision. In this case it is possible to train a detection model and run it on all images to produce an approximate count $g(s)$ for each $s$. Define $G(S)=\sum_{s\in S}g(s)$ to be the approximate detector-based count for region $S$. We propose the detector-based IS-Count (“DISCount”) estimator, which uses the proposal distribution proportional to $g$ on region $S$, i.e., with density $\bar{g}_{S}(s)=g(s)\mathbb{I}[s\in S]/G(S)$. The importance-sampling estimator then specializes to:

$$\hat{F}_{\text{DIS}}(S)=G(S)\cdot\frac{1}{n}\sum_{i=1}^{n}\frac{f(s_{i})}{g(s_{i})},\quad s_{i}\sim\bar{g}_{S}.$$

To interpret DISCount, let $w_{i}=f(s_{i})/g(s_{i})$ be the (importance) weight: the ratio of the true count to the detector-based count for the $i$th sample $s_{i}$. DISCount reweights the detector-based total count $G(S)$ by the average weight $\bar{w}=\frac{1}{n}\sum_{i=1}^{n}w_{i}$, which can be viewed as a correction factor based on the detector's tendency to over- or under-count, on average, across all of $S$.
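As a concrete sketch, the estimator can be implemented in a few lines of NumPy. This is our own minimal illustration under simplifying assumptions (sampling with replacement, a callable `f_labels` standing in for the human screening step, and a dict `g` of strictly positive detector counts), not the authors' released code:

```python
import numpy as np

def discount_estimate(f_labels, g, region, n, rng=None):
    """Detector-based importance sampling (DIS) estimate of F(S).

    f_labels : callable s -> true count f(s), obtained by human screening
    g        : dict mapping each unit s to its detector count g(s) > 0
    region   : list of units s in S
    n        : number of samples to screen
    """
    rng = np.random.default_rng(rng)
    g_vals = np.array([g[s] for s in region], dtype=float)
    G = g_vals.sum()                      # detector-based total G(S)
    p = g_vals / G                        # proposal proportional to g on S
    idx = rng.choice(len(region), size=n, p=p)
    # importance weights w_i = f(s_i) / g(s_i)
    w = np.array([f_labels(region[i]) / g[region[i]] for i in idx])
    return G * w.mean()                   # F_hat = G(S) * mean(w)
```

If the detector over- or under-counts by a constant factor, every weight equals that factor and the estimate is exact regardless of which units are screened.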

DISCount is unbiased as long as $\bar{g}_{S}(s)>0$ for all $s\in S$ such that $f(s)>0$. Henceforth, we assume detector counts are pre-processed if needed so that $g(s)>0$ for all relevant units, for example, by adding a small amount to each count.

3.2 $k$-DISCount

We now return to the multiple region counting problem. A naive approach would be to run DISCount separately for each region. However, this is suboptimal. First, it allocates samples equally to each region, regardless of their size or predicted count. Intuitively, we want to allocate more effort to regions with higher predicted counts. Second, if regions overlap it is wasteful to repeatedly draw samples from each one to solve the estimation problems separately.

$k$-DISCount

We propose estimators based on $n$ samples drawn from all of $\Omega$ with probability proportional to $g$. Then, we can estimate $F(S)$ for any region using only the samples from $S$. Specifically, the $k$-DISCount estimator is

$$\hat{F}_{k\text{DIS}}(S)=\begin{cases}G(S)\cdot\bar{w}(S)&n(S)>0\\0&n(S)=0\end{cases},\qquad s_{i}\sim\bar{g}_{\Omega},$$

where $n(S)=|\{i:s_{i}\in S\}|$ is the number of samples in region $S$ and $\bar{w}(S)=\frac{1}{n(S)}\sum_{i:s_{i}\in S}w_{i}$ is the average importance weight for region $S$.
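A minimal sketch of this joint procedure, assuming a callable `f_labels` for human screening and a dict `g` of positive detector counts over $\Omega$ (our own illustration, not the paper's code):

```python
import numpy as np

def k_discount(f_labels, g, regions, n, rng=None):
    """k-DIS: draw n samples from all of Omega proportional to g, then
    estimate F(S) for every region S using only the samples that fall in S.

    f_labels : callable s -> true count f(s) (human-screened)
    g        : dict unit -> detector count g(s) > 0 over Omega
    regions  : list of sets of units (regions may overlap)
    """
    rng = np.random.default_rng(rng)
    units = list(g)
    g_vals = np.array([g[s] for s in units], dtype=float)
    p = g_vals / g_vals.sum()
    idx = rng.choice(len(units), size=n, p=p)
    samples = [units[i] for i in idx]
    weights = np.array([f_labels(s) / g[s] for s in samples])

    estimates = []
    for S in regions:
        in_S = np.array([s in S for s in samples])
        G_S = sum(g[s] for s in S)
        if in_S.any():
            estimates.append(G_S * weights[in_S].mean())
        else:
            estimates.append(0.0)  # n(S) = 0: the estimator reports 0
    return estimates
```

Note that a single batch of screened samples is reused across all regions, including overlapping ones, rather than drawing a fresh batch per region.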

Claim 1.

The $k$-DISCount estimator $\hat{F}_{k\text{DIS}}(S)$ is conditionally unbiased given at least one sample in region $S$. That is, $\mathbb{E}[\hat{F}_{k\text{DIS}}(S)\mid n(S)>0]=F(S)$.

The unconditional bias can also be analyzed (see Appendix). Overall, the bias has negligible practical impact: it occurs only when the sample size $n(S)$ is zero, an event that is both observable and has probability $(1-p(S))^{n}$, where $p(S)=G(S)/G(\Omega)$, which decays exponentially in $n$.

In terms of variance, $k$-DISCount behaves similarly to DISCount run on each region $S$ with sample size equal to $\mathbb{E}[n(S)]=np(S)$. To first order, both approaches have variance $\frac{G(S)^{2}\cdot\sigma^{2}(S)}{np(S)}$, where $\sigma^{2}(S)$ is the importance-weight variance. In the case of disjoint regions, running DISCount on each region is the same as stratified importance sampling across the regions, and the allocation of $np(S)$ samples to region $S$ is optimal in the following sense:

Claim 2.

Suppose $S_{1},\ldots,S_{k}$ partition $\Omega$ and the importance weight variance $\sigma^{2}(S_{i})=\sigma^{2}$ is constant across regions. Assume DISCount is run on each region $S_{i}$ with $n_{i}$ samples. Given a total budget of $n$ samples, the sample sizes that minimize $\sum_{i=1}^{k}\operatorname{Var}(\hat{F}_{\text{DIS}}(S_{i}))$ are given by $n_{i}=np(S_{i})=nG(S_{i})/G(\Omega)$.

The analysis uses reasoning similar to the Neyman allocation for stratified sampling Cochran (1977), and shows that $k$-DISCount approximates the optimal allocation of samples to (disjoint) regions under the stated assumptions. One key difference is that $k$-DISCount draws samples from all of $\Omega$ and then assigns them to regions, which is called “post-stratification” in the sampling literature Cochran (1977). An exact variance analysis in the Appendix reveals that, if the expected sample size $np(S)$ for a region is very small, $k$-DISCount may have up to 30% “excess” variance compared to stratification due to the random sample size, but the excess variance disappears quickly and both approaches have the same asymptotic variance. A second key difference from stratification is that regions can overlap; $k$-DISCount's approach of sampling from all of $\Omega$ and then assigning samples to regions extends cleanly to this setting.

3.3 Control Variates

Control variates are functions $h(s)$ whose totals $H(S)=\sum_{s\in S}h(s)$ are known in advance; they can be combined with importance sampling using the following estimator:

$$\hat{F}_{k\text{DIS}cv}(S)=\begin{cases}G(S)\cdot\bar{w}_{h}(S)+H(S)&n(S)>0\\0&n(S)=0\end{cases},\qquad s_{i}\sim\bar{g}_{\Omega},$$

where $\bar{w}_{h}(S)=\frac{1}{n(S)}\sum_{i:s_{i}\in S}w_{h,i}$ and $w_{h,i}=(f(s_{i})-h(s_{i}))/g(s_{i})$. It is clear that $\hat{F}_{k\text{DIS}cv}(S)$ has the same expectation as $\hat{F}_{k\text{DIS}}(S)$, but it may have lower variance if $f$ and $h$ are sufficiently correlated Owen (2013). For bird counting, estimated counts from previous years could be used as control variates to improve count estimates, since migration is periodic (see experiments in § 4 for details).
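A sketch of the control-variate estimator for a single region, assuming samples have already been drawn from all of $\Omega$ proportional to $g$, and hypothetical callables `f_labels` and `h` (e.g., human-screened counts and a previous year's estimates); this is our own illustration:

```python
import numpy as np

def discount_cv(f_labels, h, g, S, samples):
    """Control-variate k-DIS estimate of F(S) for one region.

    f_labels : callable s -> true count f(s) (human-screened)
    h        : callable control variate with known total H(S)
    g        : dict unit -> detector count g(s) > 0
    S        : set of units in the region
    samples  : units drawn from Omega with probability proportional to g
    """
    in_S = [s for s in samples if s in S]
    if not in_S:
        return 0.0                       # n(S) = 0 case
    G_S = sum(g[s] for s in S)
    H_S = sum(h(s) for s in S)           # known total of the control variate
    # centered importance weights w_{h,i} = (f(s_i) - h(s_i)) / g(s_i)
    w_h = np.array([(f_labels(s) - h(s)) / g[s] for s in in_S])
    return G_S * w_h.mean() + H_S
```

The closer $h$ tracks $f$, the smaller the centered weights $w_{h,i}$ and hence the variance of the estimate.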

3.4 Confidence intervals

Confidence intervals for $k$-DISCount can be constructed in a way similar to standard importance sampling. For a region $S$, first estimate the importance weight variance $\sigma^{2}(S)$ as:

$$\hat{\sigma}^{2}(S)=\frac{1}{n(S)}\sum_{i:s_{i}\in S}\left(\frac{f(s_{i})}{g(s_{i})}-\frac{\hat{F}_{k\text{DIS}}(S)}{G(S)}\right)^{2}.$$

An approximate $1-\alpha$ confidence interval is then given by $\hat{F}_{k\text{DIS}}(S)\pm z_{\alpha/2}\cdot G(S)\cdot\hat{\sigma}(S)/\sqrt{n(S)}$, where $z_{\gamma}$ is the $1-\gamma$ quantile of the standard normal distribution, e.g., $z_{0.025}=1.96$ for a 95% confidence interval. The theoretical justification is subtle due to scaling by the random sample size $n(S)$. It is based on the following asymptotic result, proved in the Appendix.

Claim 3.

The $k$-DISCount estimator with scaling factor $G(S)\hat{\sigma}(S)/\sqrt{n(S)}$ is asymptotically normal; that is, the distribution of $\frac{\hat{F}_{k\text{DIS}}(S)-F(S)}{G(S)\cdot\hat{\sigma}(S)/\sqrt{n(S)}}$ converges to $\mathcal{N}(0,1)$ as $n\to\infty$.

In preliminary experiments we observed that for small expected sample sizes the importance weight variance $\sigma^{2}(S)$ can be underestimated, leading to intervals that are too small. As an alternative, we propose a practical heuristic for smaller sample sizes in which $\hat{\sigma}^{2}(\Omega)$ is used instead of $\hat{\sigma}^{2}(S)$; that is, all samples are used to estimate the variability of importance weights for each region $S$.
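The interval computation itself is straightforward; a minimal sketch for one region, given the screened values for the $n(S)$ samples that fell in $S$ (our own illustration, with $z=1.96$ for a 95% interval):

```python
import numpy as np

def discount_ci(f_vals, g_vals, G_S, z=1.96):
    """Approximate confidence interval for the k-DIS estimate on a region S.

    f_vals, g_vals : arrays of f(s_i) and g(s_i) for the n(S) samples in S
    G_S            : detector-based total G(S)
    z              : normal quantile (1.96 gives a 95% interval)
    """
    w = f_vals / g_vals                   # importance weights
    n_S = len(w)
    est = G_S * w.mean()                  # point estimate F_hat
    # plug-in estimate of the importance-weight variance sigma^2(S);
    # est / G_S is exactly the mean weight, matching the formula above
    sigma2 = np.mean((w - est / G_S) ** 2)
    half = z * G_S * np.sqrt(sigma2) / np.sqrt(n_S)
    return est - half, est + half
```

The $\sigma(\Omega)$ heuristic would simply replace `sigma2` with the same quantity computed from all samples in $\Omega$.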

4 Experimental Setup

In this section we describe the counting tasks and detection models (§ 4.1-4.2) and the evaluation metrics (§ 4.3) that we use to evaluate different counting methods. We focus on two applications: counting roosting birds in weather radar images and counting damaged buildings in satellite images of a region struck by a natural disaster.

4.1 Counting Roosting Birds from Weather Radar

Many species of birds and bats congregate in large numbers at nighttime or daytime roosting locations. Their departures from these “roosts” are often visible in weather radar, from which it is possible to estimate their numbers Winkler (2006); Buler et al. (2012); Horn and Kunz (2008). The US “NEXRAD” weather radar network nex (2023) has collected data for 30 years from 143+ stations and provides an unprecedented opportunity to study long-term and wide-scale biological phenomena such as roosts Rosenberg et al. (2019); Sánchez-Bayo and Wyckhuys (2019). However, the sheer volume of radar scans (>250M) prevents manual analysis and motivates computer vision approaches Chilson et al. (2018); Lin et al. (2019); Cheng et al. (2020); Perez et al. (2022). Unfortunately, the best computer vision models Perez et al. (2022); Cheng et al. (2020) for detecting roosts have average precision only around 50% and are not accurate enough for fully automated scientific analysis, despite using state-of-the-art methods such as Faster R-CNNs Ren et al. (2015) and training on thousands of human annotations. The complexity of the task suggests that substantial labeling and model development effort would be needed to improve accuracy, which may be impractical.

Figure 2: Count estimates with confidence intervals for two station years (i.e., KGRB 2020 and KBUF 2010) using different numbers of samples.

Previous work Belotti et al. (2023); Deng et al. (2023) used a roost detector combined with manual screening of the detections to analyze more than 600,000 radar scans spanning a dozen stations in the Great Lakes region of the US to reveal patterns of bird migration over two decades. The vetting of nearly 64,000 detections was orders of magnitude faster than manual labeling, yet still required a substantial 184 hours of manual effort. Scaling to the entire US network would require at least an order of magnitude more effort, thus motivating a statistical approach.

We use the exhaustively screened detections from the Great Lakes analysis in Belotti et al. (2023); Deng et al. (2023) to systematically analyze the efficiency of sampling-based counting. The data is organized into domains $\Omega^{\texttt{sta,yr}}$ corresponding to 12 stations and 20 years (see Fig. 7 in Appendix B); the domains are disjoint and treated separately. Counts are collected for each day $s$ by running the detector on all radar scans for that day to detect and track roost signatures and then mapping detections to bird counts using the measured radar “reflectivity” within the tracks. For the approximate count $g(s)$ we use the automatically detected tracks, while for the true count $f(s)$ we use the manually screened and corrected tracks. For a single domain, i.e., each station-year, we divide a complete roosting season into temporal regions in three different scenarios: (1) estimating bird counts up to each day in the roosting season (i.e., regions are nested prefixes of days in the entire season), (2) estimating bird counts up to the end of each quarter (i.e., regions are nested prefixes of quarters in the entire season), and (3) estimating each quarter's count (each region is one quarter). We measure error using the fully screened data and average errors across all domains and regions. Fig. 2 shows the counts and confidence intervals estimated using $k$-DISCount for the first scenario on two station-years.

4.2 Counting Damaged Buildings from Satellite Images

Building damage assessment from satellite images Kim and Yoon (2018); Deng and Wang (2022) is often used to plan humanitarian response after a natural disaster strikes. However, the performance of computer vision models degrades when applied to new regions and disaster types. Our approach can be used to quickly vet the data produced by the detector to correctly estimate counts in these scenarios.

We use the building damage detection model by DIUx-xView, the winner of the xView2 challenge xvi. The model is based on U-Net Ronneberger et al. (2015) to detect buildings in the pre-disaster image, followed by a “siamese network” that compares pre- and post-disaster images to estimate damage. The model is trained on the xBD dataset Gupta et al. (2019), which contains building and damage annotations spanning multiple geographical regions and disaster types (e.g., earthquake, hurricane, tsunami). While the dataset contains four levels of damage (0: no-damage, 1: minor-damage, 2: major-damage, and 3: destroyed), in this work we combine all damage levels (classes 1-3) into a single “damage” class.

We consider the Palu Tsunami from 2018; the data consists of 113 high-resolution satellite images labeled with 31,394 buildings and their damage levels. We run the model on each tile $s$ to estimate the number of damaged buildings $g(s)$, while the ground-truth number of damaged buildings is used as $f(s)$. Our goal is to estimate the cumulative damaged building count in sub-regions expanding from the area with the most damaged buildings, as shown in Fig. 9 in Appendix C. To define the sub-regions, we sort all $m$ images by their distance from the epicenter (defined as the image tile with the most damaged buildings) and then divide them into chunks or “annuli” $A_{1},\ldots,A_{7}$ of size $m/7$. The task is to estimate the cumulative count for the union $S_{j}=\bigcup_{i=1}^{j}A_{i}$ of the first $j$ chunks for $j$ from 1 to 7.

4.3 Evaluation

We measure the fractional error between the true and estimated counts, averaged over all regions $S_{1},\ldots,S_{k}\subseteq\Omega$ in a domain, as:

$$\texttt{Error}(\Omega)=\frac{1}{k}\sum_{i=1}^{k}\frac{|F(S_{i})-\hat{F}(S_{i})|}{F(\Omega)}.$$
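This metric is simple to compute; a minimal sketch (our own illustration):

```python
import numpy as np

def avg_fractional_error(F_true, F_hat, F_omega):
    """Error(Omega): mean of |F(S_i) - F_hat(S_i)| over regions,
    normalized by the domain total F(Omega)."""
    F_true = np.asarray(F_true, dtype=float)
    F_hat = np.asarray(F_hat, dtype=float)
    return float(np.mean(np.abs(F_true - F_hat)) / F_omega)
```

Normalizing by the domain total $F(\Omega)$ (rather than per-region totals) keeps errors on small regions from dominating the average.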

For the bird counting task, for any given definition of regions within one station-year $\Omega$ (i.e., cumulative days or quarters as defined in § 4.1), we report the error averaged across all station-years corresponding to 12 stations and $\approx 20$ years. For the damaged building counting problem there is only a single domain corresponding to the Palu Tsunami region. In addition, we calculate the average confidence interval width normalized by $F(\Omega)$. We run 1000 trials and plot average metrics $\pm 1.96\times$ std. error over the trials. We also evaluate confidence interval coverage, which is the fraction of confidence intervals that contain the true count over all domains, regions, and trials.

Figure 3: Detector-based sampling. Estimation error of damaged building counts in the Palu Tsunami region from the xBD dataset (left) and counting roosting birds from the Great Lakes radar stations in the US from NEXRAD data (right). We get lower error with DISCount compared to IS-Count and simple Monte Carlo sampling (MC). The labeling effort is further reduced with DISCount since the user is not required to label an image from scratch but only to verify outputs from the detector (see § 5 for details). The estimation errors are averaged over 1000 runs.

5 Results

In this section we compare detector-based to covariate-based sampling, show reductions in labeling effort, demonstrate the advantages of estimating multiple counts jointly, and present confidence interval and control variate results.

Detector-based sampling reduces error

We first compare DISCount (detector-based sampling) to IS-Count and simple Monte Carlo sampling for estimating $F(\Omega)$, that is, the total count of birds in a complete roosting season for a given station-year, or of damaged buildings in the entire disaster region. Fig. 3 shows the error rate as a function of the number of labeled samples (i.e., the number of distinct $s_{i}$ sampled, since each $s$ is labeled at most once). In the buildings application, a sample refers to an image tile of size $1024\times 1024$ pixels, while for the birds a sample refers to a single day.

Using the detector directly without any screening results in high error rates: roughly 136% and 149% for estimating the total count on the damaged building and bird counting tasks, respectively. Meng et al. (2022) show the advantages of using importance sampling with screening to produce count estimates with base covariates as opposed to simple Monte Carlo sampling (MC vs. IS-Count). For the bird counting task, we construct a non-detector covariate $g_{\text{IS}}$ by fitting a spline to $f(s)$ with 10% of the days from an arbitrarily selected station-year pair (station KBUF in 2001). For the damaged building counting task, the covariate $g_{\text{IS}}$ is the true count of all buildings (regardless of damage) obtained using the labels provided with the xBD dataset.

Covariate-based sampling (IS-Count) leads to significant savings over simple Monte Carlo sampling (MC), but DISCount provides further improvements. In particular, to obtain an error rate of 20%, DISCount requires $\approx 1.6\times$ fewer samples than IS-Count and $\approx 3\times$ fewer samples than MC for both counting problems.

Figure 4: Solving multiple counting problems jointly. Estimation error of counting damaged buildings in the Palu Tsunami region from the xBD dataset (left) and counting roosting birds from the Great Lakes radar stations in the US from NEXRAD data (right). We compare solving the counting problems jointly ($k$-DISCount) against solving them separately (DISCount), using 10 samples per region in both cases. The estimation errors are averaged over 1000 runs.

Screening leads to a further reduction in labeling effort

DISCount alleviates the need for users to annotate an image from scratch, such as identifying an object and drawing a bounding box around it. Instead, users only need to verify the detector's output, which tends to be a quicker process. In a study by Su et al. (2012) on the ImageNet dataset Deng et al. (2009), the median time to draw a bounding box was found to be 25.5 seconds, whereas verification took only 9.0 seconds (this matches the screening time of $\approx$10s per bounding box in Deng et al. (2023); Belotti et al. (2023)). The right side of Fig. 3 presents the earlier plots with the x-axis scaled by labeling effort, computed as $100\cdot c\cdot n/|\Omega|$, where $n$ denotes the number of screened samples and $c\in[0,1]$ represents the fraction of time relative to labeling from scratch. For instance, the labeling effort is 100% when all elements must be labeled from scratch ($c=1$ and $n=|\Omega|$). For DISCount, we estimate $c_{\text{DIS}}=9.0/(25.5+9.0)=0.26$, since annotating from scratch requires both drawing and verification, while screening requires only verification. To achieve the same 20% error rate, DISCount requires $6\times$ less effort than IS-Count and $9\times$ less effort than MC for the bird counting task, and $8\times$ less effort than IS-Count and $12\times$ less effort than MC for building counting.
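The effort calculation above can be sketched as follows (our own illustration of the arithmetic):

```python
def labeling_effort(n, domain_size, c):
    """Labeling effort (%) = 100 * c * n / |Omega|, where c in [0, 1] is
    the per-sample time relative to labeling from scratch."""
    return 100.0 * c * n / domain_size

# Screening only requires verification (9.0 s), while labeling from scratch
# requires drawing plus verification (25.5 s + 9.0 s):
c_dis = 9.0 / (25.5 + 9.0)  # ~0.26
```

For example, screening 10% of the domain with `c_dis` costs about 2.6% of the effort of labeling everything from scratch.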

Multiple counts can be estimated efficiently ($k$-DISCount)

To solve multiple counting problems, we compared $k$-DISCount to using DISCount separately on each region. For bird counting, the task was to estimate four quarterly counts (cumulative or individual) as described in § 4.1. For $k$-DISCount, we sampled $n=40$ days from the complete season to estimate the counts simultaneously. For DISCount, we solved each of the four problems separately using $n/4=10$ samples per region, for the same total number of samples. For building damage counting, the task was to estimate seven cumulative counts as described in § 4.2. For $k$-DISCount, we used $n=70$ images sampled from the entire domain, while for DISCount we used $n/7=10$ sampled images per region.

Fig. 4 shows that solving multiple counting problems jointly ($k$-DISCount) is better than solving them separately (DISCount). For the cumulative tasks, $k$-DISCount makes much more effective use of samples from overlapping regions. For single-quarter bird counts, $k$-DISCount has slightly higher error in Q1 and Q4 and lower error in Q2 and Q3. This can be understood in terms of sample allocation: $k$-DISCount allocates samples in proportion to predicted counts, which provides more samples and better accuracy in Q2-Q3, when many more roosts appear, and approximates the optimal allocation of Claim 2. DISCount allocates samples equally, so it has slightly lower error for the smaller Q1 and Q4 counts. In contrast, for building counting, $k$-DISCount has lower error even for the smallest region R1, since this region has the most damaged buildings and thus receives more samples than under DISCount. Fig. 5 (left) shows that $k$-DISCount outperforms simple Monte Carlo (adapted to multiple regions similarly to $k$-DISCount) for estimating cumulative daily bird counts as in Fig. 2.

Figure 5: Control variates and confidence intervals on bird counting. We compare simple Monte Carlo (MC), calibration with isotonic regression, and variations of $k$-DISCount that include control variates (-cv) and improved variance estimates ($-\sigma(\Omega)$). (left) Error rates using $k$-DISCount are significantly smaller than MC and calibration. (middle) Width of the confidence intervals. (right) Coverage of the confidence intervals. The error and the confidence interval width are slightly reduced when control variates are used, while the coverage is maintained. Furthermore, $k$-DISCount-cv-$\sigma(\Omega)$ improves the coverage. The results are averaged over all station-years and over 1,000 runs.

Confidence intervals

We measure the width and coverage of the estimated confidence intervals (CIs) as a function of the number of samples for cumulative daily bird counting; see examples in Fig. 2. We compare the CIs of $k$-DISCount, $k$-DISCount-cv (control variates), $k$-DISCount-cv-$\sigma(\Omega)$ (using all samples to estimate variance), and simple Monte Carlo sampling in Fig. 5. When using control variates, the error rate and the CI width are slightly reduced while the coverage stays the same. CI coverage is lower than the nominal 95% for all methods, but it increases with sample size and is substantially improved by $k$-DISCount-cv-$\sigma(\Omega)$, which achieves up to $\approx 80\%$ coverage. Importance weight distributions can be heavily right-skewed, so the variance is easily underestimated Hesterberg (1996).

DISCount improves over a calibration baseline

We implement a calibration baseline where the counts are estimated as $\hat{F}_{\text{CAL}}(S)=\sum_{s\in S}\hat{\phi}(g(s))$, where $\hat{\phi}$ is an isotonic regression model between the predicted and true counts, trained for each station using 15 uniformly selected samples from one year at that station. Results are shown as the straight line in Fig. 5 (left). DISCount outperforms calibration with fewer than 10 samples per station, suggesting that a simple calibration approach has difficulty generalizing across years.
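A minimal, self-contained sketch of such a calibration baseline (synthetic data; we use a simple pool-adjacent-violators routine in place of a library isotonic-regression implementation, and all names are illustrative):

```python
# Calibration baseline F_CAL(S) = sum_s phi(g(s)): fit an isotonic map phi
# from detector counts g(s) to true counts f(s) on a few screened samples.
import numpy as np

def pava(y):
    """Pool-adjacent-violators: best nondecreasing fit to y (unit weights)."""
    blocks = [[v, 1] for v in y]              # [block mean, block size]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:   # violation: merge blocks
            m0, w0 = blocks[i]
            m1, w1 = blocks[i + 1]
            blocks[i] = [(m0 * w0 + m1 * w1) / (w0 + w1), w0 + w1]
            del blocks[i + 1]
            i = max(i - 1, 0)                 # merged block may violate on the left
        else:
            i += 1
    return np.concatenate([[m] * w for m, w in blocks])

rng = np.random.default_rng(0)
true15 = np.sort(rng.gamma(2.0, 50.0, size=15))              # 15 screened true counts
det15 = 0.6 * true15 + rng.normal(0.0, 5.0, size=15)         # biased detector counts

order = np.argsort(det15)
x, y = det15[order], pava(true15[order])                     # isotonic (g, f) pairs

def phi(g):
    """Calibrated count for detector output g (piecewise linear between knots)."""
    return np.interp(g, x, y)

def f_cal(detector_counts):
    """Estimate a regional count by summing calibrated detector counts."""
    return float(np.sum(phi(detector_counts)))
```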

Control variates ($k$-DISCount-cv)

We perform experiments adding control variates to $k$-DISCount on the roosting bird counting problem. We use the calibrated detector counts $\hat{\phi}(g(s))$ defined above as the control variate for each station-year. Fig. 5 shows that control variates reduce the confidence interval width (middle: $k$-DISCount vs. $k$-DISCount-cv) without hurting coverage (right). In addition, the error of the estimate is slightly reduced, as shown in Fig. 5 (left). Note that this is achieved with only a marginal increase in labeling effort.

6 Discussion and Conclusion

We contribute methods for counting in large image collections with a detection model. When the task is complex and the detector is imperfect, allocating human effort to estimating the scientific quantity directly can be more efficient than improving the detector; for instance, performance gains from adding more training data may be marginal for a mature model. Our proposed solution produces accurate and unbiased estimates with a significant reduction in labeling costs over naive and covariate-based screening approaches. We demonstrate this on two real-world open problems where data screening remains necessary despite large investments in model development. Our approach is limited by the availability of a good detector, and confidence interval coverage is slightly low; possible improvements include bootstrapping or corrections based on importance-sampling diagnostics Hesterberg (1996).

7 Acknowledgements

We thank Wenlong Zhao for the deployment of the roost detector, Maria Belotti, Yuting Deng, and our Colorado State University AeroEco Lab collaborators for providing the screened data of the Great Lakes radar stations, and Yunfei Luo for facilitating the building detections on the Palu Tsunami region. This work was supported by the National Science Foundation award #2017756.

References

  • Norouzzadeh et al. [2018] Mohammad Sadegh Norouzzadeh, Anh Nguyen, Margaret Kosmala, Alexandra Swanson, Meredith S Palmer, Craig Packer, and Jeff Clune. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences, 115(25):E5716–E5725, 2018.
  • Nurkarim and Wijayanto [2023] W. Nurkarim and A.W. Wijayanto. Building footprint extraction and counting on very high-resolution satellite imagery using object detection deep learning framework. Earth Sci Inform, 2023.
  • Perez et al. [2022] Gustavo Perez, Wenlong Zhao, Zezhou Cheng, Maria Carolina T. D. Belotti, Yuting Deng, Victoria F. Simons, Elske Tielens, Jeffrey F. Kelly, Kyle G. Horton, Subhransu Maji, and Daniel Sheldon. Using spatio-temporal information in weather radar data to detect and track communal bird roosts. bioRxiv, 2022. doi: 10.1101/2022.10.28.513761. URL https://www.biorxiv.org/content/early/2022/10/31/2022.10.28.513761.
  • Meng et al. [2022] Chenlin Meng, Enci Liu, Willie Neiswanger, Jiaming Song, Marshall Burke, David Lobell, and Stefano Ermon. Is-count: Large-scale object counting from satellite images with covariate-based importance sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • Tuia et al. [2022] Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis, Mackenzie W Mathis, Frank van Langevelde, Tilo Burghardt, et al. Perspectives in machine learning for wildlife conservation. Nature communications, 13(1):792, 2022.
  • Cavender-Bares et al. [2022] Jeannine Cavender-Bares, Fabian D Schneider, Maria João Santos, Amanda Armstrong, Ana Carnaval, Kyla M Dahlin, Lola Fatoyinbo, George C Hurtt, David Schimel, Philip A Townsend, et al. Integrating remote sensing with ecology and evolution to advance biodiversity conservation. Nature Ecology & Evolution, 6(5):506–519, 2022.
  • Burke et al. [2021] Marshall Burke, Anne Driscoll, David B Lobell, and Stefano Ermon. Using satellite imagery to understand and promote sustainable development. Science, 371(6535):eabe8628, 2021.
  • Leitloff et al. [2010] Jens Leitloff, Stefan Hinz, and Uwe Stilla. Vehicle detection in very high resolution satellite images of city areas. IEEE transactions on Geoscience and remote sensing, 48(7):2795–2806, 2010.
  • Van Horn et al. [2018] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Won [2020] Myounggyu Won. Intelligent traffic monitoring systems for vehicle classification: A survey. IEEE Access, 8:73340–73358, 2020.
  • Coifman et al. [1998] Benjamin Coifman, David Beymer, Philip McLauchlan, and Jitendra Malik. A real-time computer vision system for vehicle tracking and traffic surveillance. Transportation Research Part C: Emerging Technologies, 6(4):271–288, 1998.
  • Branson et al. [2010] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual recognition with humans in the loop. In European Conference on Computer Vision (ECCV), 2010.
  • Wah et al. [2014] Catherine Wah, Grant Van Horn, Steve Branson, Subhransu Maji, Pietro Perona, and Serge Belongie. Similarity comparisons for interactive fine-grained categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Settles [2009] Burr Settles. Active learning literature survey. 2009.
  • Chapelle et al. [2009] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
  • Nguyen et al. [2018] Phuc Nguyen, Deva Ramanan, and Charless Fowlkes. Active testing: An efficient and robust framework for estimating accuracy. In International Conference on Machine Learning (ICML), 2018.
  • Kossen et al. [2021] Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth. Active testing: Sample-efficient model evaluation. In International Conference on Machine Learning, 2021.
  • Cochran [1977] William G Cochran. Sampling techniques. John Wiley & Sons, 1977.
  • Owen [2013] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
  • Winkler [2006] David W. Winkler. Roosts and migrations of swallows. Hornero, 21(2):85–97, 2006.
  • Buler et al. [2012] Jeffrey J Buler, Lori A Randall, Joseph P Fleskes, Wylie C Barrow Jr, Tianna Bogart, and Daria Kluver. Mapping wintering waterfowl distributions using weather surveillance radar. PloS one, 7(7):e41571, 2012.
  • Horn and Kunz [2008] Jason W Horn and Thomas H Kunz. Analyzing NEXRAD doppler radar images to assess nightly dispersal patterns and population trends in Brazilian free-tailed bats (Tadarida brasiliensis). Integrative and Comparative Biology, 48(1):24–39, 2008.
  • nex [2023] Next generation weather radar (NEXRAD). National Centers for Environmental Information (NCEI), Feb 2023. URL https://www.ncei.noaa.gov/products/radar/next-generation-weather-radar.
  • Rosenberg et al. [2019] Kenneth V. Rosenberg, Adriaan M. Dokter, Peter J. Blancher, John R. Sauer, Adam C. Smith, Paul A. Smith, Jessica C. Stanton, Arvind Panjabi, Laura Helft, Michael Parr, and Peter P. Marra. Decline of the north american avifauna. Science, 366(6461):120–124, 2019. doi: 10.1126/science.aaw1313.
  • Sánchez-Bayo and Wyckhuys [2019] Francisco Sánchez-Bayo and Kris AG Wyckhuys. Worldwide decline of the entomofauna: A review of its drivers. Biological conservation, 232:8–27, 2019.
  • Chilson et al. [2018] Carmen Chilson, Katherine Avery, Amy McGovern, Eli Bridge, Daniel Sheldon, and Jeffrey Kelly. Automated detection of bird roosts using NEXRAD radar data and convolutional neural networks. Remote Sensing in Ecology and Conservation, 2018.
  • Lin et al. [2019] Tsung-Yu Lin, Kevin Winner, Garrett Bernstein, Abhay Mittal, Adriaan M Dokter, Kyle G Horton, Cecilia Nilsson, Benjamin M Van Doren, Andrew Farnsworth, Frank A La Sorte, et al. Mistnet: Measuring historical bird migration in the us using archived weather radar data and convolutional neural networks. Methods in Ecology and Evolution, 10(11):1908–1922, 2019.
  • Cheng et al. [2020] Z. Cheng, S. Gabriel, P. Bhambhani, D. Sheldon, S. Maji, A. V, and D. Winkler. Detecting and tracking communal bird roosts in weather radar data. Association for the Advancement of Artificial Intelligence (AAAI), 2020.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 2015.
  • Belotti et al. [2023] Maria Carolina TD Belotti, Yuting Deng, Wenlong Zhao, Victoria F Simons, Zezhou Cheng, Gustavo Perez, Elske Tielens, Subhransu Maji, Daniel Sheldon, Jeffrey F Kelly, et al. Long-term analysis of persistence and size of swallow and martin roosts in the us great lakes. Remote Sensing in Ecology and Conservation, 2023.
  • Deng et al. [2023] Yuting Deng, Maria Carolina TD Belotti, Wenlong Zhao, Zezhou Cheng, Gustavo Perez, Elske Tielens, Victoria F Simons, Daniel R Sheldon, Subhransu Maji, Jeffrey F Kelly, et al. Quantifying long-term phenological patterns of aerial insectivores roosting in the great lakes region using weather surveillance radar. Global Change Biology, 29(5):1407–1419, 2023.
  • Kim and Yoon [2018] KeumJi Kim and SeongHwan Yoon. Assessment of building damage risk by natural disasters in south korea using decision tree analysis. Sustainability, 10(4), 2018. ISSN 2071-1050. doi: 10.3390/su10041072. URL https://www.mdpi.com/2071-1050/10/4/1072.
  • Deng and Wang [2022] Liwei Deng and Yue Wang. Post-disaster building damage assessment based on improved u-net. Scientific Reports, 12(1):2045–2322, 2022.
  • [34] DIUx-xView. Diux-xview/xview2_first_place: 1st place solution for "xview2: Assess building damage" challenge. URL https://github.com/DIUx-xView/xView2_first_place.
  • [35] The xview2 ai challenge. URL https://www.ibm.com/cloud/blog/the-xview2-ai-challenge.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
  • Gupta et al. [2019] Ritwik Gupta, Richard Hosfelt, Sandra Sajeev, Nirav Patel, Bryce Goodman, Jigar Doshi, Eric Heim, Howie Choset, and Matthew Gaston. Creating xBD: A Dataset for Assessing Building Damage from Satellite Imagery. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
  • Su et al. [2012] Hao Su, Jia Deng, and Li Fei-Fei. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • Hesterberg [1996] Tim C Hesterberg. Estimates and confidence intervals for importance sampling sensitivity analysis. Mathematical and computer modelling, 23(8-9):79–85, 1996.
  • Žnidarič [2009] Marko Žnidarič. Asymptotic expansion for inverse moments of binomial and poisson distributions. The Open Mathematics, Statistics and Probability Journal, 1(1), 2009.
  • van der Vaart and Wellner [1996] Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.

Appendix A Derivations

A.1 IS-Count

Take $p(s)=1/|S|$ and $\tilde{f}(s)=|S|f(s)$, so that $\mathbb{E}_p[\tilde{f}(s)]=\sum_{s\in S}\frac{1}{|S|}\,|S|\,f(s)=F(S)$. Importance sampling with proposal $q$ gives

\begin{align*}
F(S) &= \mathbb{E}_p[\tilde{f}(s)] \\
&= \mathbb{E}_q\left[\frac{p(s)}{q(s)}\tilde{f}(s)\right] \\
&= \mathbb{E}_q\left[\frac{1/|S|}{q(s)}\,|S|\,f(s)\right] \\
&= \mathbb{E}_q\left[\frac{f(s)}{q(s)}\right].
\end{align*}
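This identity is easy to check numerically on a toy discrete domain (all names and values below are illustrative):

```python
# Verify E_q[f(s)/q(s)] = F(S) = sum_s f(s) by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
f = rng.random(100)              # f(s): per-element counts
q = rng.random(100) + 0.1        # unnormalized proposal
q = q / q.sum()                  # proposal distribution q over S

F_true = f.sum()                 # F(S) = sum_s f(s)
idx = rng.choice(100, size=200_000, p=q)      # s_i ~ q
F_hat = float(np.mean(f[idx] / q[idx]))       # importance-sampling estimate
```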

A.2 DISCount

Take $q=\bar{g}_S$ in IS-Count; then

\begin{align*}
F(S) &= \mathbb{E}_q\left[\frac{f(s)}{q(s)}\right] \\
&= \mathbb{E}_{\bar{g}_S}\left[\frac{f(s)}{g(s)/G(S)}\right] \\
&= G(S)\cdot\mathbb{E}_{\bar{g}_S}\left[\frac{f(s)}{g(s)}\right].
\end{align*}
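A minimal sketch of the resulting estimator on synthetic data (the variable names and the toy detector are ours, for illustration only):

```python
# DISCount on a toy problem: sample s_i ~ g_bar_S = g/G(S) and average
# G(S) * f(s_i)/g(s_i). Here f is the (unknown) true count per element
# and g is the detector count.
import numpy as np

rng = np.random.default_rng(1)
f = rng.poisson(20.0, size=50).astype(float)   # true counts per element
g = 0.7 * f + rng.random(50) + 0.5             # imperfect detector counts
G = g.sum()                                    # G(S)

def discount_estimate(n, rng):
    idx = rng.choice(len(g), size=n, p=g / G)  # s_i ~ g_bar_S
    return G * float(np.mean(f[idx] / g[idx]))

F_hat = discount_estimate(5_000, rng)
```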

A.3 $k$-DISCount

Proof of Claim 1.

For any $m>0$ we have

\begin{align*}
\mathbb{E}\left[\hat{F}_{k\text{DIS}}(S)\,\big|\,n(S)=m\right] &= \mathbb{E}\left[G(S)\cdot\frac{1}{m}\sum_{i=1}^{n}w_i\cdot\mathbb{I}[s_i\in S]\,\bigg|\,n(S)=m\right] \\
&= G(S)\cdot\frac{1}{m}\sum_{i=1}^{n}\mathbb{E}\left[w_i\cdot\mathbb{I}[s_i\in S]\,\big|\,n(S)=m\right] \\
&= G(S)\cdot\frac{1}{m}\sum_{i=1}^{n}\Pr[s_i\in S\mid n(S)=m]\cdot\mathbb{E}\left[w_i\mid s_i\in S,\,n(S)=m\right] \\
&= G(S)\cdot\frac{1}{m}\sum_{i=1}^{n}\Pr[s_i\in S\mid n(S)=m]\cdot\mathbb{E}\left[w_i\mid s_i\in S\right] \\
&= G(S)\cdot\frac{1}{m}\sum_{i=1}^{n}\frac{m}{n}\cdot\mathbb{E}\left[w_i\mid s_i\in S\right] \\
&= G(S)\cdot\frac{1}{m}\sum_{i=1}^{n}\frac{m}{n}\cdot\frac{F(S)}{G(S)} \\
&= F(S).
\end{align*}

In the third line, we used the fact that $\mathbb{E}\big[h(X)\cdot\mathbb{I}[X\in A]\big]=\Pr[X\in A]\cdot\mathbb{E}[h(X)\mid X\in A]$ for any random variable $X$ and event $A$ (see Lemma 1 below). In the fourth line we used the fact that $s_i$ is conditionally independent of $n(S)$ given $s_i\in S$, since $n(S)=\mathbb{I}[s_i\in S]+\sum_{j\neq i}\mathbb{I}[s_j\in S]$ and the latter sum is independent of $s_i$. In the fifth line we used the fact that $\Pr[s_i\in S\mid n(S)=m]=\frac{m}{n}$, because $n(S)=\sum_{j=1}^{n}\mathbb{I}[s_j\in S]$ and the terms in the sum are exchangeable. In the sixth line we computed the conditional expectation as follows, using the fact that the conditional probability of $s_i$ given $s_i\in S$ is $g(s_i)/G(S)$:

$$\mathbb{E}\left[w_i\mid s_i\in S\right]=\mathbb{E}\left[\frac{f(s_i)}{g(s_i)}\,\bigg|\,s_i\in S\right]=\sum_{s\in S}\frac{g(s)}{G(S)}\cdot\frac{f(s)}{g(s)}=\frac{1}{G(S)}\sum_{s\in S}f(s)=\frac{F(S)}{G(S)}.$$

The unconditional bias of $k$-DISCount can also be analyzed:

Claim 4.

Let $p(S)=\Pr[s_i\in S]=G(S)/G(\Omega)$. The bias of the $k$-DISCount estimator is $\mathbb{E}[\hat{F}_{k\text{DIS}}(S)]-F(S)=-\big(1-p(S)\big)^{n}\cdot F(S)$.

In particular, the bias decays exponentially with $n$ and quickly becomes negligible: its relative magnitude is at most $\epsilon$ for $n\geq\log(1/\epsilon)/\log(1/r)$, where $r=1-p(S)$. Further, the bias is easily computable from the detector counts and is therefore known prior to sampling, and the event that leads to a biased estimate ($n(S)=0$) is observed after sampling. All these factors make bias a very minor concern. (The $k$-DISCount estimator can be debiased by dividing by $u=1-(1-p(S))^{n}<1$; however, this leads to higher overall error: if $n(S)=0$ the estimator is unchanged, and conditioned on the event $n(S)>0$ the estimator becomes biased and has higher variance by a factor of $1/u^{2}>1$.)

Proof of Claim 4.

Using Claim 1, we compute the unconditional expectation as

\begin{align*}
\mathbb{E}\left[\hat{F}_{k\text{DIS}}(S)\right] &= \Pr[n(S)=0]\cdot\mathbb{E}\left[\hat{F}_{k\text{DIS}}(S)\mid n(S)=0\right]+\Pr[n(S)>0]\cdot\mathbb{E}\left[\hat{F}_{k\text{DIS}}(S)\mid n(S)>0\right] \\
&= \Pr[n(S)>0]\cdot F(S) \\
&= \left(1-\left(1-\frac{G(S)}{G(\Omega)}\right)^{n}\right)\cdot F(S).
\end{align*}

In the final line, $1-G(S)/G(\Omega)$ is the probability that $s_i\notin S$ for a single $i$, and $(1-G(S)/G(\Omega))^{n}=\Pr[n(S)=0]$ is the probability that $s_i\notin S$ for all $i$. Rearranging gives the result. ∎
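The bias formula can be checked by simulation (toy data; in this sketch, $S$ is taken to be the first eight elements of $\Omega$):

```python
# Empirically check E[F_hat_kDIS(S)] = (1 - (1 - G(S)/G(Omega))^n) * F(S).
import numpy as np

rng = np.random.default_rng(2)
f = rng.poisson(10.0, size=30).astype(float)   # true counts over Omega
g = f + 1.0                                    # strictly positive detector counts
k = 8                                          # region S = first k elements
G_om, G_S, F_S = g.sum(), g[:k].sum(), f[:k].sum()

n, runs = 5, 20_000
est = np.zeros(runs)
for t in range(runs):
    idx = rng.choice(len(g), size=n, p=g / G_om)   # s_i ~ g_bar_Omega
    hits = idx[idx < k]                            # samples landing in S
    if hits.size:                                  # n(S) > 0; else estimate stays 0
        est[t] = G_S * np.mean(f[hits] / g[hits])

expected = (1.0 - (1.0 - G_S / G_om) ** n) * F_S   # Claim 4's expectation
```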

Lemma 1.

$\mathbb{E}\big[h(X)\cdot\mathbb{I}[X\in A]\big]=\Pr[X\in A]\cdot\mathbb{E}[h(X)\mid X\in A]$ for any random variable $X$ and event $A$.

Proof.

Observe

\begin{align*}
\mathbb{E}\big[h(X)\cdot\mathbb{I}[X\in A]\big] &= \sum_{x}\Pr[X=x]\,h(x)\,\mathbb{I}[x\in A] \\
&= \sum_{x}\Pr[X=x,\,X\in A]\,h(x) \\
&= \Pr[X\in A]\sum_{x}\Pr[X=x\mid X\in A]\,h(x) \\
&= \Pr[X\in A]\cdot\mathbb{E}[h(X)\mid X\in A].
\end{align*}

A.4 Optimal allocation of samples for DISCount to disjoint regions

Proof of Claim 2.

The proof is similar to that of Theorem 5.6 in Cochran [1977]. We prove the claim for $k=2$; the proof generalizes to larger $k$ in the obvious way. The variance of DISCount on $S_i$ is

$$\operatorname{Var}(\hat{F}_{\text{DIS}}(S_i))=\frac{G(S_i)^2\cdot\sigma^2(S_i)}{n_i}=\sigma^2\,\frac{G(S_i)^2}{n_i},$$

assuming a common $\sigma^2(S_i)=\sigma^2$ across regions. We want to minimize $\sum_i\operatorname{Var}(\hat{F}_{\text{DIS}}(S_i))$, which for $k=2$ is proportional to

$$V=\frac{G(S_1)^2}{n_1}+\frac{G(S_2)^2}{n_2}.$$

By the Cauchy-Schwarz inequality, for any $n_1,n_2>0$,

$$Vn=\left(\frac{G(S_1)^2}{n_1}+\frac{G(S_2)^2}{n_2}\right)(n_1+n_2)\geq\left(G(S_1)+G(S_2)\right)^2.$$

If we substitute $n_i=G(S_i)/Z$ for any $Z>0$ on the left side of the inequality and simplify, the inequality becomes tight, so the minimum is achieved. We further require $\sum_i n_i=n$, so we choose $Z$ such that

$$n_i=n\cdot\frac{G(S_i)}{G(\Omega)}=n\,p(S_i).$$
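A quick numeric check of this allocation, under the same equal-variance assumption (the region totals below are illustrative):

```python
# Proportional allocation n_i = n * G(S_i)/G(Omega) minimizes
# sum_i G(S_i)^2 / n_i; compare with an equal-allocation baseline.
import numpy as np

G = np.array([10.0, 40.0, 50.0])     # detector totals for disjoint regions
n = 100.0

n_opt = n * G / G.sum()              # optimal (proportional) allocation
n_eq = np.full(len(G), n / len(G))   # equal allocation

V_opt = float(np.sum(G ** 2 / n_opt))   # equals (sum_i G_i)^2 / n
V_eq = float(np.sum(G ** 2 / n_eq))
```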

A.5 $k$-DISCount variance

Recall that $p(S)=\Pr[s_i\in S]=G(S)/G(\Omega)$ is the probability of a sample landing in $S$ under the sampling distribution $\bar{g}_\Omega$. Define

$$\sigma^2(S)=\operatorname{Var}(f(s_i)/g(s_i)\mid s_i\in S)=\sum_{s\in S}\frac{g(s)}{G(S)}\cdot\left(\frac{f(s)}{g(s)}-\frac{F(S)}{G(S)}\right)^2$$

to be the variance of the importance weight for $s_i\sim\bar{g}_S$.

Claim 5.

Let $r=1-p(S)$. The variance of the $k$-DISCount estimator is given by

$$\operatorname{Var}(\hat{F}_{k\text{DIS}}(S))=G(S)^2\cdot\sigma^2(S)\cdot(1-r^n)\cdot\mathbb{E}\left[\frac{1}{n(S)}\,\bigg|\,n(S)>0\right]+F(S)^2\cdot r^n\cdot(1-r^n),$$

where $(1-r^n)\,\mathbb{E}\left[1/n(S)\mid n(S)>0\right]=\sum_{j=1}^{n}(1/j)\cdot\text{Binomial}\big(j;n,p(S)\big)$.

The second term in the variance arises from the possibility that no samples land in $S$; it decays exponentially in $n$ and is negligible compared to the first term. The first term can be compared to the variance $G(S)^2\cdot\sigma^2(S)\cdot\frac{1}{m}$ of importance sampling with exactly $m$ samples allocated to $S$ and the proposal distribution $\bar{g}_S$, i.e., DISCount. Because the sample size $n(S)$ is random, the correct scaling factor for $k$-DISCount is $(1-r^n)\,\mathbb{E}[1/n(S)\mid n(S)>0]$, which turns out to be asymptotically equivalent to $1/(np(S))$, i.e., DISCount with a sample size of $m=np(S)=\mathbb{E}[n(S)]$; see Claim 6 below. We find that for a small expected sample size (around 4) there can be up to 30% "excess variance" due to the randomness in the number of samples (see Figure 6), but this disappears quickly with larger expected sample size.
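The scaling factor, and hence the excess-variance ratio plotted in Figure 6, can be computed directly from the binomial sum in Claim 5; a small sketch:

```python
# (1 - r^n) * E[1/n(S) | n(S) > 0] = sum_{j=1}^n (1/j) C(n,j) p^j (1-p)^(n-j);
# its ratio to 1/(n*p) is the excess variance of k-DISCount relative to
# DISCount with a fixed sample size of m = n*p.
from math import comb

def kdis_scale(n, p):
    return sum((1.0 / j) * comb(n, j) * p**j * (1.0 - p) ** (n - j)
               for j in range(1, n + 1))

excess = kdis_scale(100, 0.2) / (1.0 / (100 * 0.2))   # approaches 1 as n*p grows
```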

Claim 6.

Let $\hat{F}_{k\text{DIS},n}$ and $\hat{F}_{\text{DIS},m}$ be the $k$-DISCount and DISCount estimators with sample sizes $n$ and $m$, respectively. The asymptotic variance of $k$-DISCount is given by

$$\lim_{n\to\infty}n\operatorname{Var}(\hat{F}_{k\text{DIS},n}(S))=G(S)^2\cdot\sigma^2(S)/p(S).$$

This is asymptotically equivalent to DISCount with sample size $m=\mathbb{E}[n(S)]=np(S)$. That is,

$$\lim_{n\to\infty}\frac{\operatorname{Var}(\hat{F}_{k\text{DIS},n}(S))}{\operatorname{Var}(\hat{F}_{\text{DIS},\lceil np(S)\rceil}(S))}=1.$$
Figure 6: Ratio of the variance of $k$-DISCount with expected sample size $np(S)$ to that of DISCount with $np(S)$ samples; computed using the formula from Claim 5 (first term only).
Proof of Claim 5.

By the law of total variance,

$$\operatorname{Var}(\hat{F}_{k\text{DIS}}(S))=\mathbb{E}\big[\operatorname{Var}(\hat{F}_{k\text{DIS}}(S)\mid n(S))\big]+\operatorname{Var}\big(\mathbb{E}[\hat{F}_{k\text{DIS}}(S)\mid n(S)]\big). \quad (1)$$

We will treat each term in Eq. 1 separately. For the first term, from the definition of $k$-DISCount we see

$$\operatorname{Var}\big(\hat{F}_{k\text{DIS}}(S)\mid n(S)\big)=\begin{cases}0 & n(S)=0\\ G(S)^2\cdot\frac{\sigma^2(S)}{n(S)} & n(S)>0.\end{cases}$$

Therefore

\begin{align*}
\mathbb{E}\big[\operatorname{Var}(\hat{F}_{k\text{DIS}}(S)\mid n(S))\big] &= G(S)^2\cdot\Pr[n(S)>0]\cdot\mathbb{E}\left[\frac{\sigma^2(S)}{n(S)}\,\bigg|\,n(S)>0\right] \\
&= G(S)^2\cdot\sigma^2(S)\cdot(1-r^n)\cdot\mathbb{E}\left[\frac{1}{n(S)}\,\bigg|\,n(S)>0\right].
\end{align*}

In the last line, we used the fact that $n(S)\sim\text{Binomial}(n,p(S))$, so $\Pr[n(S)>0]=1-r^n$ where $r=1-p(S)$. The summation for $(1-r^n)\,\mathbb{E}[1/n(S)\mid n(S)>0]$ follows from the same fact.

For the second term in Eq. (1), from the definition of $k$-DISCount and conditional unbiasedness (Claim 1), we have

$$\mathbb{E}[\hat{F}_{k\text{DIS}}(S)\mid n(S)]=\begin{cases}0 & \text{if } n(S)=0\\ F(S) & \text{if } n(S)>0,\end{cases}$$

which is equal in distribution to $F(S)\cdot\text{Bernoulli}(1-r^n)$.

The variance is therefore

$$\operatorname{Var}\big(\mathbb{E}[\hat{F}_{k\text{DIS}}(S)\mid n(S)]\big)=F(S)^2\cdot r^n\cdot(1-r^n).$$

Putting the two terms together yields the result. ∎

Proof of Claim 6.

By Claim 5 we have

$$\lim_{n\to\infty}n\operatorname{Var}(\hat{F}_{k\text{DIS},n}(S))=\lim_{n\to\infty}n\cdot G(S)^2\cdot\sigma^2(S)\cdot(1-r^n)\cdot\mathbb{E}\left[\frac{1}{n(S)}\,\bigg|\,n(S)>0\right]+\lim_{n\to\infty}n\cdot F(S)^2\cdot r^n\cdot(1-r^n). \quad (2)$$

The second limit on the right side is zero, because $nr^n\to 0$ as $n\to\infty$ (recall that $r<1$) and the other factors are bounded. We will show that the first limit on the right side equals $G(S)^2\cdot\sigma^2(S)/p(S)$, which proves the first part of the result. The asymptotic expansion of Žnidarič [2009] (Corollary 3) states that

$$(1-r^n)\,\mathbb{E}[1/n(S)\mid n(S)>0]=\frac{1}{np(S)}+\mathcal{O}\left(\frac{1}{(np(S))^2}\right).$$

Using this expansion in the limit gives:

\begin{align*}
\lim_{n\to\infty}n\cdot G(S)^2\cdot\sigma^2(S)\cdot(1-r^n)\cdot\mathbb{E}\left[\frac{1}{n(S)}\,\bigg|\,n(S)>0\right] &= \lim_{n\to\infty}G(S)^2\cdot\sigma^2(S)/p(S)\cdot(1+\mathcal{O}(1/n)) \\
&= G(S)^2\cdot\sigma^2(S)/p(S).
\end{align*}

The variance of DISCount with sample size $m$ is $\operatorname{Var}(\hat{F}_{\text{DIS},m}(S))=G(S)^2\cdot\sigma^2(S)/m$. Setting $m=\lceil np(S)\rceil$ and using the second-to-last line above, we have

\begin{align*}
\lim_{n\to\infty}\frac{\operatorname{Var}(\hat{F}_{k\text{DIS},n}(S))}{\operatorname{Var}(\hat{F}_{\text{DIS},\lceil np(S)\rceil}(S))} &= \lim_{n\to\infty}\frac{n\operatorname{Var}(\hat{F}_{k\text{DIS},n}(S))}{n\operatorname{Var}(\hat{F}_{\text{DIS},\lceil np(S)\rceil}(S))} \\
&= \lim_{n\to\infty}\frac{G(S)^2\cdot\sigma^2(S)/p(S)\cdot(1+\mathcal{O}(1/n))}{n\cdot G(S)^2\cdot\sigma^2(S)/\lceil np(S)\rceil} \\
&= 1.
\end{align*}

A.6 Control Variates

Recall that with control variates the weight is redefined as

$$w_i=(f(s_i)-h(s_i))/g(s_i).$$

The expectation of the weight given $s_i\in S$ is

$$\mathbb{E}[w_i\mid s_i\in S]=\sum_{s\in S}\frac{g(s)}{G(S)}\,\frac{f(s)-h(s)}{g(s)}=\frac{F(S)-H(S)}{G(S)}.$$

Therefore

$$\mathbb{E}[\bar{w}_{cv}(S)\mid n(S)>0]=\frac{F(S)-H(S)}{G(S)},$$

and hence

$$\mathbb{E}[\hat{F}_{k\text{DIS-cv}}(S)\mid n(S)>0]=G(S)\cdot\frac{F(S)-H(S)}{G(S)}+H(S)=F(S).$$
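A toy sketch of these control-variate weights (all data synthetic; $h$ stands in for the calibrated detector counts used as the control variate in the experiments):

```python
# Control variates: w_i = (f(s_i) - h(s_i)) / g(s_i) and
# F_hat = G(S) * mean(w_i) + H(S). When h approximates f well, the weights
# have much lower variance than the plain weights f(s_i)/g(s_i).
import numpy as np

rng = np.random.default_rng(3)
f = rng.poisson(20.0, size=50).astype(float)   # true counts
g = 0.5 * f + 5.0 + rng.random(50) * 5.0       # noisy detector counts
h = f + rng.normal(0.0, 0.3, size=50)          # good approximation of f
G, H = g.sum(), h.sum()

idx = rng.choice(50, size=2_000, p=g / G)      # s_i ~ g_bar_S
w_plain = f[idx] / g[idx]
w_cv = (f[idx] - h[idx]) / g[idx]

F_plain = G * float(w_plain.mean())            # plain DISCount estimate
F_cv = G * float(w_cv.mean()) + H              # control-variate estimate
```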

A.7 Confidence Intervals

Proof of Claim 3.

Let $w_1,w_2,\ldots$ be an i.i.d. sequence of importance weights for samples in $S$, i.e., $w_i=f(s_i)/g(s_i)$ for $s_i\sim\bar{g}_S$. Each weight $w_i$ has mean $F(S)/G(S)$ and variance $\sigma^2(S)$. Let $\bar{\omega}_n=\frac{1}{n}\sum_{i=1}^{n}w_i$. By the central limit theorem,

$$\sqrt{n}\,\big(\bar{\omega}_n-F(S)/G(S)\big)\stackrel{D}{\longrightarrow}\mathcal{N}(0,\sigma^2(S)).$$

Recall that $\hat{F}_{k\text{DIS},n}(S)=G(S)\cdot\bar{w}_n(S)$, where $\bar{w}_n(S)$ is the average of the importance weights for samples that land in $S$ when drawn from all of $\Omega$ (for clarity, in this proof we add sample-size subscripts to all relevant quantities). It is easy to see that $\bar{w}_n(S)$ is equal in distribution to $\bar{\omega}_{n(S)}$, where $n(S)\sim\text{Binomial}(n,p(S))$ and $n(S)$ is independent of the sequence of importance weights; this follows from first choosing the number of samples that land in $S$ and then choosing their locations conditioned on being in $S$. From Theorem 3.5.1 of van der Vaart and Wellner [1996] (with $N_n=n(S)$ and $c_n=np(S)$) it then follows that

$$\sqrt{n(S)}\,\big(\bar{w}_n(S)-F(S)/G(S)\big)\stackrel{D}{\longrightarrow}\mathcal{N}(0,\sigma^2(S)).$$

Rearranging yields

$$\frac{\hat{F}_{k\text{DIS}}(S)-F(S)}{G(S)/\sqrt{n(S)}}\stackrel{D}{\longrightarrow}\mathcal{N}(0,\sigma^2(S)).$$

After dividing by $\hat{\sigma}_n(S)$, the result follows from Slutsky's lemma if $\hat{\sigma}^2_n(S)\stackrel{P}{\longrightarrow}\sigma^2(S)$, which follows from a similar application of Theorem 3.5.1 of van der Vaart and Wellner [1996]. ∎
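The resulting interval is the usual normal one; a minimal sketch on toy data (all names ours):

```python
# 95% CI from Claim 3: F_hat +/- 1.96 * G(S) * sigma_hat / sqrt(n(S)),
# where sigma_hat is the sample standard deviation of the importance weights.
import numpy as np

rng = np.random.default_rng(4)
f = rng.poisson(15.0, size=40).astype(float)   # true counts
g = 0.8 * f + rng.random(40) + 0.5             # detector counts
G = g.sum()

idx = rng.choice(40, size=400, p=g / G)        # the n(S) samples that landed in S
w = f[idx] / g[idx]                            # importance weights
F_hat = G * float(w.mean())
half = 1.96 * G * float(w.std(ddof=1)) / np.sqrt(len(w))   # CI half-width
ci = (F_hat - half, F_hat + half)
```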

Appendix B Counting tasks

Figure 7: Counting roosting birds in radar images. (a) The US weather radar network has collected data for 30 years from 143+ stations and provides an unprecedented opportunity to study long-term and wide-scale biological phenomena such as roosts. (b) Counts are collected for each day $s$ by running the detector on all radar scans for that day to detect and track roost signatures, and then mapping to bird counts using the radar "reflectivity" within the tracks. The figure shows two scans from the KDTX station (Detroit, MI) on the same day, along with roost detections, which appear as expanding rings. By tracking these detections across a day, one can estimate the number of birds in each roost. (c) Cumulative bird counts over the complete roosting season, obtained by aggregating counts across all tracked roosts and days. (bottom) Examples of true bird counts (blue) and detector counts (red) during a roosting season for stations KGRB 2020 and KTYX 2015.
Figure 8: Counting damaged buildings in satellite images. Building damage assessment from satellite images Kim and Yoon [2018], Deng and Wang [2022] is often used to plan humanitarian response after a natural disaster strikes. (a) We consider the Palu Tsunami from 2018; the data consists of 113 high-resolution satellite images labeled with 31,394 buildings and their damage levels. (b) Counts are collected per tile using before- and after-disaster satellite images. Colors indicate different levels of damage (e.g., red: "destroyed"). (c) Damaged building counts per sub-region.

Appendix C Palu Tsunami regions

Figure 9: (left) Distance from the area with the most damaged buildings. (right) Regions defined for damaged building counting (see § 4.2).