
Scale-Aware Crowd Count Network with Annotation Error Correction

Yi-Kuan Hsieh1, Jun-Wei Hsieh1, Xin li2, Ming-Ching Chang2, Yu-Chee Tseng1
Abstract

Traditional crowd counting networks suffer from information loss when feature maps are downsized through pooling layers, leading to inaccurate counts of crowds at a distance. Existing methods often assume correct annotations during training, disregarding the impact of noisy annotations, especially in crowded scenes. Furthermore, the use of a fixed Gaussian kernel fails to account for the varying pixel distribution with respect to the camera distance. To overcome these challenges, we propose a Scale-Aware Crowd Counting Network (SACC-Net) that introduces a “scale-aware” architecture with the capability of correcting noisy annotations. For the first time, we simultaneously model labeling errors (mean) and scale variations (variance) with spatially-varying Gaussian distributions to produce fine-grained heat maps for crowd counting. Furthermore, the proposed adaptive Gaussian kernel variance enables the model to learn dynamically with a low-rank approximation, leading to improved convergence efficiency with comparable accuracy. The performance of SACC-Net is extensively evaluated on four public datasets: UCF-QNRF, UCF CC 50, NWPU, and ShanghaiTech Parts A and B. Experimental results demonstrate that SACC-Net outperforms all state-of-the-art methods, validating its effectiveness in achieving superior crowd counting accuracy.

Introduction

Crowd counting is an increasingly important technique in computer vision with applications in public safety and crowd behavior analysis (Li et al. 2021; Gao et al. 2020). Over the years, many CNN-based crowd counting methods have been developed that predict crowd density maps from a given image (Li, Zhang, and Chen 2018; Xu et al. 2019; Bai et al. 2020; Ma et al. 2019; Xiong et al. 2019; Varior et al. 2019; Jiang et al. 2020; Thanasutives et al. 2021; Zhu et al. 2019). The total number of people in the image is then calculated by summing the predicted values over the density map. Earlier approaches passed the image directly through a backbone and used its last layer to predict the density map. However, most existing methods do not properly account for the scale problem in crowd counting: people at the far end of the scene appear smaller than those at the near end. As a result, existing counting methods have difficulty generating fine-grained density maps that accurately count people at the far end of an input image once it has passed through the pooling layers.


Figure 1: Modeling uncertainty for the task of crowd counting. (a) Inaccurate annotations lead to a biased mean (red dots deviate from the centers of human faces). (b) Different camera distances lead to a distribution of head sizes (characterized by the variance $\beta$), which is positively skewed.


Figure 2: Details of the proposed Scale-Aware Crowd Counting Network (SACC-Net) architecture for scale-aware crowd counting. With a new scale-aware loss function, it outperforms all SoTA methods on four popular crowd counting datasets.

Moreover, many of these methods require precise annotations from which a density map can be constructed using the L2 norm (Li, Zhang, and Chen 2018; Wan and Chan 2019; Cao et al. 2018) or the Bayesian Loss (BL) (Ma et al. 2019). Unfortunately, even for human annotators, annotation errors are inevitable because ground-truth labeling can vary from subject to subject. As illustrated in Fig. 1, accurately pinpointing the center of each individual’s head in an image is nontrivial and can pose technical challenges, particularly for people who appear small or distant. In crowded scenes, the distance from a person to the camera varies widely: individuals at a far distance might occupy only a few pixels in the image, making annotation more challenging and unreliable. Therefore, treating all pixels equally, as in the Bayesian loss (BL), is likely to hurt the accuracy of crowd counting. To the best of our knowledge, how to handle both scale variations and annotation errors in crowd counting remains an open problem.

The motivation for this work is to improve the accuracy of crowd counting by addressing not only the scale truncation problem (caused by pooling operations) but also the problem of annotation errors across scales. The Feature Pyramid (FP) can capture an object’s visual features from coarse to fine scales and has become a standard component of most State-of-The-Art (SoTA) object counting frameworks (Li, Zhang, and Chen 2018; Xu et al. 2019; Bai et al. 2020; Ma et al. 2019; Xiong et al. 2019; Varior et al. 2019; Jiang et al. 2020; Thanasutives et al. 2021; Zhu et al. 2019). However, the adopted pooling operations scale the feature maps in the FP to $\frac{1}{2}$, $\frac{1}{4}$, or $\frac{1}{8}$ of the input size, and this scale truncation causes small objects to disappear. To tackle this problem, we propose a novel Synthetic Fusion Module (SFM) to scale the feature maps to $\frac{1}{2}$, $\frac{1}{3}$, $\frac{1}{4}$, $\frac{1}{6}$, etc., so that a smoother scale space can be obtained for fitting the ground truth, whose scale changes continuously. We also propose an Intra-block Fusion Module (IFM) that fuses all feature layers within the same convolution block, so that more fine-grained information can be sent to the decoder for effective crowd counting. Finally, most existing crowd counting architectures (Li, Zhang, and Chen 2018; Xu et al. 2019; Bai et al. 2020; Xiong et al. 2019; Varior et al. 2019; Jiang et al. 2020) cannot meet the speed requirements of real-time crowd counting. Our architecture can easily be converted to a lightweight version with real-time efficiency and comparable accuracy.

To address the problem of annotation errors, we propose a novel scale-aware loss function that simultaneously considers annotation noise, head-to-head correlation, and adjustment of variances at different scales. In (Wan and Chan 2020), a multivariate Gaussian distribution was used to address the annotation problem; however, the model is fixed for all objects regardless of their sizes. In real images, the sizes of human heads vary with position. Thus, we argue that annotation correction should be scale-aware and capable of adapting to changes in head size. To model the correlation between pixels at different scales, we derive a multivariate Gaussian distribution with a full covariance matrix at each scale. To speed up computation, we adopt a low-rank approximation method. Finally, our scale-aware loss function is designed to correct human annotation errors, so that our trained model achieves SoTA performance in crowd counting. Our new architecture, the Scale-Aware Crowd Counting Network (SACC-Net) shown in Fig. 2, is built on VGG-19 and trained with a new loss function featuring scale-aware annotation error correction, achieving SoTA performance on four popular crowd counting datasets. The main contributions of this paper are summarized as follows:

  • We propose SACC-Net that integrates information across layers and corrects annotation errors across scales to achieve SoTA crowd counting performance.

  • Based on the observation that the distribution of head sizes in an image is generally skewed, we create a new scale-aware density model to handle counting annotation errors while addressing scale variations. A new scale-aware loss function is proposed to simultaneously model scale variations and annotation errors, such that fine-grained heat maps for crowd counting are produced.

  • An SFM is proposed to generate a smoother scale space to deal with the problem of scale truncation.

  • An IFM is proposed to fuse all feature layers within the same convolution block to generate finer-grained information for more effective crowd counting.

Related Works

Scale variations in crowd counting: One critical challenge of crowd counting based on summing density maps is the scale variation due to the varying distances between the camera and the targets. To improve generalizability, (Zhang et al. 2015) proposed a CNN architecture based on a switching strategy that alternates between optimizing density estimation and count estimation. In the multi-column CNN (Zhang et al. 2016), each column uses a different combination of convolution kernels to extract multi-scale features. However, (Li, Zhang, and Chen 2018) shows that similar features are often learned in each column of this network; therefore, the model cannot be trained efficiently as the layers become deeper. In (Li, Zhang, and Chen 2018), multi-scale features are obtained using VGG16 with convolutions of different dilation rates. Instead of using different convolution kernel sizes in each layer, a multi-branch strategy is used in (Varior et al. 2019) to choose convolution filters of a fixed size while extracting multi-scale features across layers. To avoid repeatedly computing convolutional features, multi-resolution feature maps are generated by dividing a dense region into sub-regions in (Xiong et al. 2019). In (Liu, Salzmann, and Fua 2019), scale variation is handled by encoding multi-scale contextual information into the regression model. In (Jiang et al. 2020), a density attention network generates attention masks that focus on particular scales. A densely connected architecture is used in (Miao et al. 2020) to maintain multi-scale information.

The point-wise (dotted) annotation is widely used in most crowd counting datasets to represent each object in the image. Since no size information is included, the resulting annotation deviation and the performance evaluation are profoundly affected compared to bounding-box annotations. To this end, (Zhang et al. 2016) calculates the average distance from each head to its three nearest neighbors and uses it to estimate the head size as the Gaussian standard deviation. Synthetic crowd scenes can be generated together with annotations in (Wang and Breckon 2019). In (Cheng et al. 2022), various locally connected Gaussian kernels are used to replace the original convolution filter.

Loss Function: Traditionally, density estimation-based crowd counting approaches used the pixel-wise Mean Square Error (MSE) loss for training. More recently, alternative loss functions have been developed to address the limitations of the MSE loss. For example, (Jiang et al. 2019) used a combinatorial loss including spatial abstraction and spatial correlation terms to reduce the annotation deviation. The Bayesian loss in (Ma et al. 2019) leverages a density contribution probability model to mitigate the impact of deviation, though false positives still cannot be successfully reduced. The DM-count loss in (Wang et al. 2020) measures the similarity between the predicted and ground-truth density maps. In (Wan and Chan 2020), a multivariate Gaussian distribution-based loss function considers annotation noise and correlation, but the design is not scale-aware. In practice, annotation pixel shifts or errors may not affect the counting of large objects but can significantly degrade the counting of small objects. We argue that the loss function used to correct such annotation errors should be scale-aware.

The Proposed Architecture and Method

Density Map Generation

Traditional methods cast the counting task as a density regression problem (Lempitsky and Zisserman 2010; Sindagi and Patel 2017; Wan and Chan 2020). For a given image $\mathcal{I}$ with $N$ people to count, let $\mathbf{H}_{i}$ denote the true position of the $i^{th}$ person. For any pixel location $x$ in the image $\mathcal{I}$, we model the crowd density $y$ at $x$ as a Gaussian kernel centered at each annotation point. Let $\beta$ denote the annotation variance of the Gaussian kernel and $\sum_{i=1}^{N}\mathcal{N}(x\,|\,\mu,\Sigma)$ be the Probability Density Function (PDF) of a multivariate Gaussian with mean $\mu$ and covariance matrix $\Sigma$. We calculate the squared Mahalanobis distance as $\left\|x\right\|^{2}_{\Sigma}=X^{T}\Sigma^{-1}X$, where $X$ is the feature vector of $x$ extracted from a network backbone. The crowd density $y$ at position $x$ is calculated as:

y(x)=i=1N𝒩(x|Hi,βI)=i=1N12πβexp(xHiβI22),y(x)=\sum_{i=1}^{N}\mathcal{N}(x|{\mathrm{\textbf{H}}}_{i},\beta\mathrm{\textbf{I}})=\sum_{i=1}^{N}\frac{1}{\sqrt{2\pi}\beta}exp(-\frac{||x-{\mathrm{\textbf{H}}}_{i}||_{\beta\mathrm{\textbf{I}}}^{2}}{2}), (1)

For all annotated head positions $\{\mathbf{H}_{i}\}^{N}_{i=1}$ in the image $\mathcal{I}$, the density map $y$ is estimated by learning a regressor $f(\mathcal{I})$ based on the L2 loss $\mathcal{L}(y,f(\mathcal{I}))=\left\|y-f(\mathcal{I})\right\|^{2}$ or a Bayesian loss (Wan and Chan 2020). The crowd count is then the sum of the map $y$ over all pixels in $\mathcal{I}$.
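For concreteness, the following minimal NumPy sketch builds such a fixed-$\beta$ density map from point annotations. It is an illustration only: the function name is ours, and it uses the standard 2D normalization $1/(2\pi\beta)$ so that each kernel integrates to one.

```python
import numpy as np

def build_density_map(heads, height, width, beta=8.3):
    """Build a fixed-variance density map (Eq. (1)): one isotropic Gaussian
    kernel of variance beta per annotated head position."""
    ys, xs = np.mgrid[0:height, 0:width]               # pixel coordinate grids
    density = np.zeros((height, width), dtype=np.float64)
    for hy, hx in heads:
        sq_dist = (ys - hy) ** 2 + (xs - hx) ** 2       # ||x - H_i||^2
        # 2D normalization 1/(2*pi*beta) so that each kernel integrates to one
        density += np.exp(-sq_dist / (2.0 * beta)) / (2.0 * np.pi * beta)
    return density

# Three annotated heads in a 64x64 image; the map sums to roughly 3.
demo = build_density_map(np.array([[10, 12], [30, 40], [50, 20]]), 64, 64)
print(demo.sum())
```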


Figure 3: The Synthetic Fusion Module (SFM) produces a smoother scaling space for heatmap generation.


Figure 4: Comparison between (a) convolution block in DenseNet and (b) our Intra-block Fusion Module (IFM).

Scale-Aware Crowd Counting Network

We next provide an overview of the Scale-Aware Crowd Counting Network (SACC-Net) architecture that generates the density map $y(x)$, as shown in Fig. 2. In most CNN backbones, pooling or convolution with stride 2 down-samples the image by half and produces feature maps at $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$, and so on of the input size. We hypothesize that such a scale gap is too large, which causes the feature fusion across layers to be uneven. To address this issue, we propose a novel Synthetic Fusion Module (SFM) in Fig. 3 to generate new synthetic layers between the original layers, such that an improved set of density maps can be produced for accurate crowd counting. We further propose an Intra-block Fusion Module (IFM) in Fig. 4 to allow all feature layers within the same convolution block to be fused, such that more fine-grained information can be sent to the decoder for effective crowd counting. At the end of the SACC-Net, we adopt the ASPP (Chen et al. 2017) and CAN (Liu, Salzmann, and Fua 2019) modules, which leverage atrous convolutions with different rates to extract multi-scale features for accurate counting.

The Synthetic Fusion Module (SFM) creates various synthetic layers between the original layers to scale the prediction maps to $\frac{1}{2}$, $\frac{1}{3}$, $\frac{1}{4}$, $\frac{1}{6}$, and $\frac{1}{8}$, as in Fig. 3. This provides denser samples in the scale space to fit the ground truth, whose scale changes continuously. A synthetic layer generated by SFM can take either two or three inputs depending on its position in the SACC-Net, as shown by the brown blocks in Fig. 2. SFM first linearly scales the inputs and then merges them via a $1\times 1$ convolution; the result is then fused using a $3\times 3$ convolution. SFM thus synthesizes a new feature layer from two adjacent original layers, which yields a smoother scaling space for crowd counting. Details of the linear down-sampling and up-sampling operations and their time complexities are discussed in the supplementary.
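A minimal PyTorch sketch of one such synthetic layer is given below; it assumes bilinear interpolation for the linear rescaling step, and the class name, channel widths, and two-input example are illustrative rather than the exact configuration used in SACC-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntheticFusion(nn.Module):
    """Synthesize one new feature layer from two (or three) adjacent layers:
    linearly rescale each input to the target size, merge the concatenation
    with a 1x1 convolution, then fuse the result with a 3x3 convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.merge = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, feats, target_size):
        resized = [F.interpolate(f, size=target_size, mode='bilinear',
                                 align_corners=False) for f in feats]
        return self.fuse(self.merge(torch.cat(resized, dim=1)))

# e.g., synthesize a ~1/3-scale layer from the 1/2- and 1/4-scale layers
sfm = SyntheticFusion(in_channels=[128, 256], out_channels=128)
f_half, f_quarter = torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32)
f_third = sfm([f_half, f_quarter], target_size=(43, 43))  # for a 128x128 input
```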

Intra-block Fusion Module (IFM): In most CNN architectures such as VGG, convolutions are performed sequentially to extract features, and only the feature maps from the last layer of a convolution block are sent to the next module. In contrast, we argue that not only the last layer but all layers within a convolution block can provide fine-grained features for generating an accurate density map. To reflect this idea, the proposed IFM differs in design from the structure of DenseNet (Huang et al. 2017). The convolution block of DenseNet in Fig. 4(a) uses a fully connected structure to link all layers, which might cause difficulties in training, such as complicated back-propagation, excessive RAM usage during model training, and inefficiency during inference. Fig. 4(b) shows the structure of our IFM, which uses fewer connections than DenseNet to generate the required feature maps. IFM has three advantages over DenseNet: (1) IFM requires less memory, as it uses a $1\times 1$ convolution to directly obtain the output. (2) IFM can obtain more representative features by aggregating the information of all layers. (3) IFM contains fewer parameters and is more efficient than DenseNet. The time complexity of IFM is analyzed in the supplementary.
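The sketch below illustrates one plausible reading of IFM in PyTorch, assuming the block's intermediate feature maps are concatenated and aggregated by a single $1\times 1$ convolution; the number of layers and channel width are placeholders.

```python
import torch
import torch.nn as nn

class IntraBlockFusion(nn.Module):
    """Run the block's convolutions sequentially, keep every intermediate
    feature map, and aggregate all of them with one 1x1 convolution."""
    def __init__(self, channels, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(num_layers)])
        # a single 1x1 conv fuses the concatenation of all intermediate outputs
        self.aggregate = nn.Conv2d(channels * num_layers, channels, kernel_size=1)

    def forward(self, x):
        outputs = []
        for layer in self.layers:          # sequential convs, as in VGG
            x = layer(x)
            outputs.append(x)              # but every layer's output is kept
        return self.aggregate(torch.cat(outputs, dim=1))

print(IntraBlockFusion(64)(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]
```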

Light-Weight Architecture: Density-based crowd counting methods can achieve good counting accuracy, but existing methods are too inefficient for real-time applications. The processing of feature maps in SACC-Net follows two routes, as shown in the Conv-2-1 block in Fig. 2: one branch goes through a VGG block, while the other goes directly to the convolution block through two simple convolutions (see Fig. 2 in the supplementary). This separation balances the computation load of each layer and reduces the memory load. In each convolution block (e.g., the green block “Conv-n-x” in Fig. 2), only half of the channels, instead of all channels, are sent to the next convolution block (e.g., the block “Conv-(n+1)-x” in Fig. 2). This design greatly reduces the number of model parameters while maintaining stable accuracy. Additional details regarding this lightweight design are provided in the supplementary.

Scale-Aware Annotation Noise

We aim to deal with the uncertainty in manually labeled annotations, as shown in Fig. 1(a). Our assumption is that point-wise head annotations inevitably come with annotation errors. These annotation errors cause the crowd density $y$ in Eq. (1) to be estimated incorrectly. We next derive a solution to address this.

Let $\tilde{\mathbf{H}}_{i}$ denote the annotated head position of the $i^{th}$ person with potential annotation error, and let $\varepsilon_{i}$ denote its annotation noise, i.e., $\tilde{\mathbf{H}}_{i}=\mathbf{H}_{i}+\varepsilon_{i}$. We assume the annotation noise is independent and identically distributed (i.i.d.), $\varepsilon_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,\alpha\mathbf{I})$, where $\alpha$ is an annotation variance parameter. Considering annotation noise, we model the density $\mathbb{D}(x)$ at location $x$ as:

$$\mathbb{D}(x)=\sum_{i=1}^{N}\mathcal{N}(x\,|\,\tilde{\mathbf{H}}_{i},\beta\mathbf{I})=\sum_{i=1}^{N}\mathcal{N}(x\,|\,\mathbf{H}_{i}+\varepsilon_{i},\beta\mathbf{I})=\sum_{i=1}^{N}\mathcal{N}(q_{i}\,|\,\varepsilon_{i},\beta\mathbf{I})\cong\sum_{i=1}^{N}\phi_{i}, \qquad (2)$$

where $\beta$ is defined in Eq. (1), $q_{i}=x-\mathbf{H}_{i}$ denotes the position difference between the $i^{th}$ annotation and $x$, and $\phi_{i}$ denotes the individual Gaussian kernel for the $i^{th}$ annotation. In the literature, (Wan and Chan 2020) did not distinguish the range of annotation errors between small and large objects; they proposed a fixed-scale model using the NoiseCC loss to rectify the annotation noise. One key problem for all SoTA methods (Lempitsky and Zisserman 2010; Sindagi and Patel 2017; Wan and Chan 2020) regarding Eq. (2) is that a fixed, constant $\beta$ is used to model the crowd density $\mathbb{D}(x)$ around each head position. As mentioned above, heads appear in different sizes according to their distances to the observing camera. To this end, our design makes $\beta$ scale-aware and adaptive to the size of each head appearing in the image.

Adaptive Gaussian Kernel

As shown in Fig. 1, the distribution of $\beta$ is positively skewed (small heads occupy a larger proportion). Based on this observation, this section proposes a scale-adaptive Gaussian model for heat map generation and annotation error correction. Assume that $S$ scales are used to model a head with annotation errors. Then, Eq. (2) can be rewritten as:

$$\mathbb{D}(x)=\sum_{i=1}^{N}\sum_{s=1}^{S}w_{s}\mathcal{N}(q_{i}\,|\,\varepsilon_{i},\beta_{s}\mathbf{I})\cong\sum_{s=1}^{S}w_{s}\sum_{i=1}^{N}\phi_{i}^{s}, \qquad (3)$$

where the density of a head is modeled with a Gaussian mixture model and $\sum_{s=1}^{S}w_{s}=1$. In addition, $\phi_{i}^{s}=\mathcal{N}(q_{i}\,|\,\varepsilon_{i},\beta_{s}\mathbf{I})$, i.e., the Gaussian kernel placed at the $i^{th}$ annotation at scale $s$, parameterized by the annotation error $\varepsilon_{i}$ and the variance $\beta_{s}$. Let $\mathbb{D}_{s}=w_{s}\sum_{i=1}^{N}\phi_{i}^{s}$, and let $\mathcal{I}_{s}$ denote the scaled-down version of $\mathcal{I}$ at scale $s$. For all pixels $x_{j}$ in $\mathcal{I}_{s}$, a multivariate random variable for the density map $\mathbb{D}(x)$ at scale $s$ can be constructed as:

$$\mathbb{D}_{s}=[\mathbb{D}_{s}(x_{1}),\cdots,\mathbb{D}_{s}(x_{j}),\cdots,\mathbb{D}_{s}(x_{J_{s}})], \qquad (4)$$

where $J_{s}$ is the number of pixels in $\mathcal{I}_{s}$.
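To make Eq. (3) concrete, the NumPy sketch below evaluates the scale-aware density at a single pixel with the annotation noise set to its mean ($\varepsilon_{i}=0$); the $\beta_{s}$ and $w_{s}$ values are placeholders, not the values learned or estimated in our experiments.

```python
import numpy as np

def gaussian2d(q, var):
    """Isotropic 2D Gaussian N(q | 0, var*I) evaluated at offsets q, shape (N, 2)."""
    return np.exp(-(q ** 2).sum(-1) / (2.0 * var)) / (2.0 * np.pi * var)

def mixture_density(x, heads, betas, weights):
    """Scale-aware density D(x) of Eq. (3) with epsilon_i = 0 for illustration."""
    q = x[None, :] - heads                      # q_i = x - H_i for every head
    return sum(w * gaussian2d(q, beta).sum()    # inner sum over heads, outer over scales
               for w, beta in zip(weights, betas))

heads = np.array([[10.0, 12.0], [30.0, 40.0]])
betas = [8.3, 4.15, 2.075]                      # placeholder beta_s values
weights = [0.5, 0.3, 0.2]                       # placeholder w_s, summing to 1
print(mixture_density(np.array([11.0, 12.0]), heads, betas, weights))
```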

Scale-aware probability distribution

To calculate $\mathbb{D}_{s}$ in closed form, we approximate it by a Gaussian, i.e., $\hat{p}(\mathbb{D}_{s})\sim\mathcal{N}(\mathbb{D}_{s}\,|\,\mu_{s},\sigma^{2}_{s})$ with the scale-aware mean $\mu_{s}$ and variance $\sigma^{2}_{s}$. The mean $\mu_{s}$ is calculated as:

$$\mu_{s}=\mathbb{E}[\mathbb{D}_{s}]=\mathbb{E}\Big[w_{s}\sum_{i=1}^{N}\mathcal{N}(q_{i}\,|\,\varepsilon_{i},\beta_{s}\mathbf{I})\Big]=w_{s}\sum_{i=1}^{N}\mathcal{N}(q_{i}\,|\,0,(\alpha+\beta_{s})\mathbf{I})=\sum_{i=1}^{N}\mu_{i}^{s}, \qquad (5)$$

where $\mu_{i}^{s}=w_{s}\mathcal{N}(q_{i}\,|\,0,(\alpha+\beta_{s})\mathbf{I})$ and the annotation error $\varepsilon_{i}\sim\mathcal{N}(0,\alpha\mathbf{I})$. The variance $\Sigma^{2}_{s}$ is given by:

$$\Sigma^{2}_{s}=\mathrm{var}(\mathbb{D}_{s})=\mathbb{E}[\mathbb{D}_{s}^{2}]-\mathbb{E}[\mathbb{D}_{s}]^{2}\cong\sum_{i=1}^{N}\Big[\frac{w_{s}^{2}}{4\pi\beta_{s}}\mathcal{N}(q_{i}\,|\,0,(\beta_{s}/2+\alpha)\mathbf{I})-(\mu_{i}^{s})^{2}\Big]. \qquad (6)$$
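A minimal NumPy sketch of the per-pixel moments in Eqs. (5) and (6); the function and variable names are ours, and the inputs are toy placeholders.

```python
import numpy as np

def gaussian2d(q, var):
    return np.exp(-(q ** 2).sum(-1) / (2.0 * var)) / (2.0 * np.pi * var)

def scale_aware_moments(x, heads, w_s, beta_s, alpha):
    """Mean (Eq. (5)) and variance (Eq. (6)) of D_s(x) under annotation noise
    epsilon_i ~ N(0, alpha*I)."""
    q = x[None, :] - heads                                    # q_i = x - H_i
    mu_i = w_s * gaussian2d(q, alpha + beta_s)                # mu_i^s
    var_i = (w_s ** 2) / (4.0 * np.pi * beta_s) \
        * gaussian2d(q, beta_s / 2.0 + alpha) - mu_i ** 2     # per-head variance term
    return mu_i.sum(), var_i.sum()

heads = np.array([[10.0, 12.0], [30.0, 40.0]])
print(scale_aware_moments(np.array([11.0, 12.0]), heads,
                          w_s=0.5, beta_s=8.3, alpha=1.0))
```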

Gaussian approximation to the scale-aware joint likelihood $\mathbf{D}_{s}$

Next, the covariance term $\mathrm{Cov}(\mathbb{D}_{s}(x_{j}),\mathbb{D}_{s}(x_{k}))$ between locations $x_{j}$ and $x_{k}$ needs to be calculated. We model it by a multivariate Gaussian approximation of the joint likelihood $\mathbf{D}_{s}$ at scale $s$. Let $q_{i}(x_{j})=x_{j}-\tilde{\mathbf{H}}_{i}$ be the difference between the spatial location of the $i^{th}$ annotation and the location of pixel $x_{j}$. Based on Eq. (3), the density value $\mathbb{D}_{s}(x_{j})$ is calculated as:

$$\mathbb{D}_{s}(x_{j})=w_{s}\sum_{i=1}^{N}\mathcal{N}(q_{i}(x_{j})\,|\,\varepsilon_{i},\beta_{s}\mathbf{I})=w_{s}\sum_{i=1}^{N}\phi_{i}^{s}(x_{j}), \qquad (7)$$

where $\phi_{i}^{s}(x_{j})=\mathcal{N}(q_{i}(x_{j})\,|\,\varepsilon_{i},\beta_{s}\mathbf{I})$ and the annotation noise $\varepsilon_{i}$ is the same random variable across all $\phi_{i}^{s}(x_{j})$. Define the Gaussian approximation to $\mathbf{D}_{s}$ as $\hat{p}(\mathbf{D}_{s})=\mathcal{N}(\mathbf{D}_{s}\,|\,\mu_{s},\Sigma_{s})$, where $\mu_{s}$ and $\Sigma_{s}$ are defined in Eqs. (5) and (6). The $j^{th}$ entry of $\mu_{s}$ is $\mathbb{E}[\mathbb{D}_{s}(x_{j})]=\sum_{i=1}^{N}\mu_{i}^{s}(x_{j})$ from Eq. (5). The diagonal of the scale-aware covariance matrix is $\boldsymbol{\Sigma}_{x_{j},x_{j}}^{s}=\mathrm{Var}(\mathbb{D}_{s}(x_{j}))$. The covariance term is then:

$$\boldsymbol{\Sigma}_{x_{j},x_{k}}^{s}=\mathrm{Cov}(\mathbb{D}_{s}(x_{j}),\mathbb{D}_{s}(x_{k}))=\sum_{i=1}^{N}\big[w_{s}^{2}\Omega_{i}^{s}(x_{j},x_{k})-\mu_{i}^{s}(x_{j})\mu_{i}^{s}(x_{k})\big], \qquad (8)$$

where $\Omega_{i}^{s}(x_{j},x_{k})=\mathbb{E}[\phi_{i}^{s}(x_{j})\phi_{i}^{s}(x_{k})]$.

Low-rank Approximation using SVD

Since the dimension of $\boldsymbol{\Sigma}^{s}$ is huge, i.e., $J_{s}\times J_{s}$, this section derives its low-rank approximation using its non-zero rows and columns to improve efficiency. Let $\hat{\boldsymbol{\Sigma}}^{s}$ denote the approximation to $\boldsymbol{\Sigma}^{s}$ obtained using the Singular Value Decomposition (SVD):

$$\hat{\boldsymbol{\Sigma}}^{s}\cong\boldsymbol{\Sigma}^{s}=\mathbf{U}^{s}\mathbf{C}_{L}^{s}{\mathbf{V}^{s}}^{T}, \qquad (9)$$

where $\mathbf{U}^{s}$ is a $J_{s}\times J_{s}$ orthogonal matrix, $\mathbf{C}_{L}^{s}$ is a nonnegative $J_{s}\times J_{s}$ diagonal matrix with diagonal entries sorted from high to low, and ${\mathbf{V}^{s}}^{T}$ is a $J_{s}\times J_{s}$ orthogonal matrix. Let $v_{j}^{s}=\boldsymbol{\Sigma}_{x_{j},x_{j}}^{s}$. To obtain the low-rank approximation, the pixels $x_{j}$ are first sorted by $v_{j}^{s}$. Then, the top-$M$ pixels whose cumulative percentage of variance exceeds 0.8, i.e.,

$$\frac{\sum_{j=1}^{M}v_{j}^{s}}{\sum_{j=1}^{J_{s}}v_{j}^{s}}>0.8, \qquad (10)$$

are selected from $\mathcal{I}_{s}$ for this low-rank approximation. Let the set of indices of the top-$M$ pixels be denoted by $L=\{l_{1},l_{2},\dots,l_{m},\dots,l_{M}\}$; only the elements in $L$ are then used to approximate $\boldsymbol{\Sigma}^{s}$. Approximating the matrix $\boldsymbol{\Sigma}^{s}$ by a rank-$M$ matrix requires representing $\boldsymbol{\Sigma}^{s}$ as a sum of components ordered by their importance. SVD lends itself to this task by expressing $\boldsymbol{\Sigma}^{s}$ as a sum of rank-1 matrices weighted by the corresponding singular values; namely, $\boldsymbol{\Sigma}^{s}=\mathbf{U}^{s}\mathbf{C}_{L}^{s}{\mathbf{V}^{s}}^{T}$ is equivalent to:

$$\boldsymbol{\Sigma}^{s}=\sum^{J_{s}}_{i=1}c_{i}^{s}\,\mathbf{u}^{s}_{i}{\mathbf{v}_{i}^{s}}^{T}, \qquad (11)$$

where $s=1,\dots,S$ indexes the scale, $c_{i}^{s}$ is the $i^{th}$ singular value, and $\mathbf{u}^{s}_{i},\mathbf{v}^{s}_{i}$ are the corresponding left and right singular vectors. A natural idea is to keep only the top-$M$ terms on the right-hand side of Eq. (11). That is, for $\boldsymbol{\Sigma}^{s}$ as in Eq. (11) and a target rank $M$, the proposed rank-$M$ approximation is:

$$\hat{\boldsymbol{\Sigma}}^{s}\cong\sum^{M}_{i=1}c_{i}^{s}\,\mathbf{u}^{s}_{i}{\mathbf{v}_{i}^{s}}^{T}, \qquad (12)$$

where the singular values are sorted in descending order ($c_{1}^{s}\geq c_{2}^{s}\geq\cdots\geq c_{J_{s}}^{s}\geq 0$), and $\mathbf{u}^{s}_{i},\mathbf{v}^{s}_{i}$ denote the $i^{th}$ left and right singular vectors. With $\hat{\boldsymbol{\Sigma}}^{s}$, the rank-$M$ approximate negative log-likelihood function is:

$$-\log\hat{p}(\mathbf{D}_{s})=-\log\mathcal{N}(\mathbf{D}_{s}\,|\,\mu_{s},\hat{\boldsymbol{\Sigma}}^{s})\propto\|\mathbf{D}_{s}-\mu_{s}\|^{2}_{\hat{\boldsymbol{\Sigma}}^{s}}. \qquad (13)$$

Computing the right-hand sides of Eqs. (12) and (13) takes $O(M^{3})$ and $O(M^{2})$ time, respectively, in contrast to the $O(J_{s}^{3})$ and $O(J_{s}^{2})$ required to compute the original matrix $\boldsymbol{\Sigma}^{s}$ and the distance $\|\mathbf{D}_{s}-\mu_{s}\|^{2}_{\boldsymbol{\Sigma}^{s}}$.
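A NumPy sketch of the rank-$M$ truncation and the resulting Mahalanobis term (Eqs. (12)-(13)). For brevity, $M$ is chosen here by cumulative singular-value energy rather than by the pixel-variance criterion of Eq. (10); the 0.8 threshold mirrors the text, but the function name and toy inputs are illustrative.

```python
import numpy as np

def rank_m_mahalanobis(d_bar, sigma, energy=0.8):
    """Rank-M SVD approximation of sigma (Eq. (12)) and the corresponding
    squared Mahalanobis distance ||D_s - mu_s||^2 (Eq. (13)).

    d_bar : residual D_s - mu_s, shape (J,)
    sigma : full covariance matrix, shape (J, J)
    """
    u, c, vt = np.linalg.svd(sigma)                  # singular values sorted high -> low
    m = int(np.searchsorted(np.cumsum(c) / c.sum(), energy)) + 1
    # work in the rank-M subspace instead of inverting the full J x J matrix
    right = vt[:m] @ d_bar                           # V_M^T d_bar, shape (M,)
    left = u[:, :m].T @ d_bar                        # U_M^T d_bar, shape (M,)
    return float(right @ (left / c[:m])), m

rng = np.random.default_rng(0)
a = rng.standard_normal((50, 8))
sigma = a @ a.T + 1e-3 * np.eye(50)                  # a (nearly) low-rank covariance
print(rank_m_mahalanobis(rng.standard_normal(50), sigma))
```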

Regularization and the Final Loss Term

The Gaussian approximation to $\mathbf{D}_{s}$ can be obtained from Eqs. (12) and (13). To ensure that the predicted density near each annotation sums to 1, we define the regularizer $\mathcal{R}_{i}^{s}$ for the $i^{th}$ annotation point as:

$$\mathcal{R}_{i}^{s}=\Big|\sum_{j}\mathbb{D}_{s}(x_{j})\frac{\phi_{i}^{s}(x_{j})}{\sum_{i=1}^{N}\phi_{i}^{s}(x_{j})}-1\Big|, \qquad (14)$$

where $\mathbb{D}_{s}(x_{j})$ is the $j^{th}$ entry of $\mathbf{D}_{s}$. Let $\bar{\mathbf{D}}_{s}=\mathbf{D}_{s}-\mu_{s}$. The final loss function is then:

$$\mathcal{L}=\sum_{s=1}^{S}\bar{\mathbf{D}}_{s}^{T}(\hat{\boldsymbol{\Sigma}}^{s})^{-1}\bar{\mathbf{D}}_{s}+\sum_{s=1}^{S}\sum_{i=1}^{N}\mathcal{R}_{i}^{s}. \qquad (15)$$
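A single-scale NumPy sketch of Eqs. (14)-(15), assuming the (pseudo-)inverse of the rank-$M$ covariance has been precomputed; in Eq. (15) this term is summed over all $S$ scales. Names and the toy inputs are illustrative.

```python
import numpy as np

def count_regularizer(d_s, phi):
    """Sum of R_i^s in Eq. (14): the soft count that the predicted density
    assigns to each annotation should be close to one.

    d_s : predicted density at all pixels, shape (J,)
    phi : per-annotation kernels phi_i^s(x_j), shape (N, J)
    """
    posterior = phi / phi.sum(axis=0, keepdims=True)        # phi_i / sum_i phi_i
    per_head_count = (posterior * d_s[None, :]).sum(axis=1)
    return np.abs(per_head_count - 1.0).sum()

def loss_single_scale(d_s, mu_s, sigma_hat_inv, phi):
    """One scale's contribution to Eq. (15): Mahalanobis term + regularizer."""
    d_bar = d_s - mu_s
    return d_bar @ sigma_hat_inv @ d_bar + count_regularizer(d_s, phi)

rng = np.random.default_rng(0)
d_s, mu_s = rng.random(100), rng.random(100)
phi = rng.random((4, 100)) + 1e-6                           # 4 annotations, 100 pixels
print(loss_single_scale(d_s, mu_s, np.eye(100), phi))
```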

Experimental Results

We evaluate our crowd counting method and compare it with 14 SoTA methods on four public datasets: UCF-QNRF (Idrees et al. 2018), UCF CC 50 (Idrees et al. 2013), NWPU-Crowd (Wang et al. 2021), and ShanghaiTech Parts A and B (Zhang et al. 2016).

Model Training Parameters

The backbone of our method is pre-trained on ImageNet (Deng et al. 2009), and training uses the Adam optimizer. Since the image dimensions in the used datasets differ, patches of a fixed size are cropped at random locations and then randomly flipped horizontally with probability 0.5 for data augmentation. The learning rates used during training are $10^{-5}$, $10^{-5}$, $10^{-5}$, and $10^{-4}$ for the UCF-QNRF, UCF CC 50, NWPU, and ShanghaiTech datasets, respectively. To stabilize the training loss, we use batch sizes of 10, 10, 15, and 10, respectively. All parameters used in the training stage are listed in Table 1. Similar to other SoTA methods (Li, Zhang, and Chen 2018; Xu et al. 2019; Bai et al. 2020; Xiong et al. 2019; Varior et al. 2019; Jiang et al. 2020; Thanasutives et al. 2021; Zhu et al. 2019), the mean absolute error (MAE) and the mean squared error (MSE) are used to evaluate the performance of our architecture.

Parameter Settings for $\beta_{s}$ and $w_{s}$

As shown in Fig. 1(a), Fig. 5(a), and Fig. 6(a), the number of heads in a crowd decreases with the head size $h$. This distribution $P_{head}(h)$ is positively skewed and can easily be obtained by accumulating statistics from the training data. In Eq. (1), the variance parameter $\beta$ is proportional to the head size. We set the mean of $h$ as the initial value of $\beta_{1}$, i.e., $\beta_{1}=\sum_{h}hP_{head}(h)$. In a CNN backbone such as VGG19, each pooling operation reduces the feature map size by half and thus also reduces the head size in the feature map. Given $\beta_{s}$, the value of $\beta_{s+1}$ is obtained recursively as $\beta_{s+1}=\beta_{s}/2$. Our analysis yields $\beta_{1}$ to be approximately 8.3. This subsampling also makes small heads eventually disappear. We then set $w_{s}$ to $P_{head}(\beta_{(S+1-s)})$, where $S$ is the largest scale used to model $\mathbb{D}(x)$ in Eq. (3); we set $S=3$. After normalization, $\sum_{s=1}^{S}w_{s}=1$.
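A NumPy sketch of this parameter setting, assuming $P_{head}(h)$ is read off a histogram of head sizes accumulated from the training data; the interpolation used to evaluate $P_{head}$ at $\beta_{S+1-s}$, the function name, and the toy head-size sample are all illustrative.

```python
import numpy as np

def scale_parameters(head_sizes, S=3):
    """Derive beta_s (kernel variances) and w_s (scale weights) from the
    positively skewed head-size distribution of the training data."""
    sizes, counts = np.unique(np.round(head_sizes), return_counts=True)
    p_head = counts / counts.sum()                    # empirical P_head(h)
    betas = [float((sizes * p_head).sum())]           # beta_1 = mean head size
    for _ in range(S - 1):
        betas.append(betas[-1] / 2.0)                 # beta_{s+1} = beta_s / 2 (pooling)
    # w_s = P_head(beta_{S+1-s}), evaluated by interpolation, then normalized
    w = np.array([np.interp(betas[S - s], sizes, p_head) for s in range(1, S + 1)])
    return betas, (w / w.sum()).tolist()

# skewed toy head sizes with mean ~8, standing in for the real training statistics
betas, weights = scale_parameters(np.random.gamma(shape=2.0, scale=4.0, size=5000))
print(betas, weights)
```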

Table 1: Detailed parameters used for training.
Dataset learning rate batch size crop size
UCF-QNRF 1e-5 12 512×512
UCF CC 50 1e-5 10 512×512
NWPU 1e-5 8 512×512
ShanghaiTech 1e-4 12 512×512

Performance Comparisons w.r.t. Loss Functions



Figure 5: Visualization of crowd counting heatmaps generated using different loss functions. (a) Input image (GT = 855, ShanghaiTech Part A). (b) MSE loss. (c) Bayesian loss. (d) Our scale-aware loss. Our scale-aware loss generates a sharper and semantically more meaningful heatmap than the others (objectively, its count is also the closest to the GT).
Table 2: Accuracy comparisons among different loss functions with various backbones on UCF-QNRF.

Loss      | VGG19 MAE | VGG19 MSE | CSRNet MAE | CSRNet MSE | MCNN MAE | MCNN MSE
L2        | 98.7      | 176.1     | 110.6      | 190.1      | 186.4    | 283.6
BL        | 88.8      | 154.8     | 107.5      | 184.3      | 190.6    | 272.3
NoiseCC   | 85.8      | 150.6     | 96.5       | 163.3      | 177.4    | 259.0
DM-count  | 85.6      | 148.3     | 103.6      | 180.6      | 176.1    | 263.3
Gen-loss  | 84.3      | 147.5     | 92.0       | 165.7      | 142.8    | 227.9
Ours      | 83.47     | 140.34    | 90.83      | 150.67     | 134.52   | 213.71

We compare the proposed loss function against the L2 loss, BL (Ma et al. 2019), NoiseCC (Wan and Chan 2020), DM-count (Wang et al. 2020), and the generalized loss (Wan, Liu, and Chan 2021) under different backbones to evaluate its effectiveness. Table 2 shows the evaluation results. Clearly, our proposed scale-aware loss function outperforms the other SoTA loss functions under various backbones. Since human head sizes differ, the same annotation error affects the accuracy of crowd counting to different degrees. Although NoiseCC (Wan and Chan 2020) pointed out that annotation noise affects the accuracy of crowd counting, that work did not address the scaling issue. Our scale-aware loss function handles it properly and outperforms the other losses on UCF-QNRF.

Comparisons with SoTA Methods

Table 3: Performance comparisons among different SoTA crowd counting methods.

Methods | Venue | UCF-QNRF MAE, MSE | NWPU MAE, MSE | S.H.Tech-A MAE, MSE | S.H.Tech-B MAE, MSE | UCF CC 50 MAE, MSE
CSRNet | CVPR'18 | -, - | 121.3, 522.7 | 68.2, 115.0 | 10.3, 16.0 | 266.1, 397.5
CAN | CVPR'19 | 107, 183 | -, - | 62.3, 100.0 | 7.8, 12.2 | 212.2, 243.7
S-DCNet | ICCV'19 | 104.4, 176.1 | -, - | 58.3, 95.0 | 6.7, 10.7 | 204.2, 301.3
SANet | ECCV'18 | -, - | 190.6, 491.4 | 67.0, 104.5 | 8.4, 13.6 | 258.4, 334.9
BL | ICCV'19 | 88.7, 154.8 | 105.4, 454.2 | 62.8, 101.8 | 7.7, 12.7 | 229.3, 308.2
SFANet | - | 100.8, 174.5 | -, - | 59.8, 99.3 | 6.9, 10.9 | -, -
DM-Count | NeurIPS'20 | 85.6, 148.3 | 88.4, 498.0 | 59.7, 95.7 | 7.4, 11.8 | 211.0, 291.5
RPnet | CVPR'15 | -, - | -, - | 61.2, 96.9 | 8.1, 11.6 | -, -
AMSNet | ECCV'20 | 101.8, 163.2 | -, - | 56.7, 93.4 | 6.7, 10.2 | 208.4, 297.3
M-SFANet | ICPR'21 | 85.6, 151.23 | -, - | 59.69, 95.66 | 6.38, 10.22 | 162.33, 276.76
TEDnet | CVPR'19 | 113.0, 188.0 | -, - | 64.2, 109.1 | 8.2, 12.8 | 249.4, 354.5
P2PNet | ICCV'21 | 85.32, 154.5 | 77.44, 362 | 52.74, 85.06 | 6.25, 9.9 | 172.72, 256.18
GauNet | CVPR'22 | 81.6, 153.7 | -, - | 54.8, 89.1 | 6.2, 9.9 | 186.3, 256.5
MAN | CVPR'22 | 77.3/83.4, 131.5/146 | 76.5/76.6, 323.0/465.4 | 56.8, 90.3 | -, - | -, -
SACC-Net (BL loss) | - | 85.42, 145.44 | 86.72, 442.9 | 55.28, 90.37 | 6.5, 10.68 | 167.48, 235.41
SACC-Net (our loss) | - | 77.12, 124.25 | 75.52, 349.73 | 52.19, 76.63 | 6.16, 9.71 | 150.66, 187.89

Some scores were produced by running the original source code provided by the authors.

To further evaluate the performance of our proposed method, 14 SoTA methods are compared: CSRNet (Li, Zhang, and Chen 2018), CAN (Liu, Salzmann, and Fua 2019), S-DCNet (Xiong et al. 2019), SANet (Cao et al. 2018), BL (Ma et al. 2019), SFANet (Zhu et al. 2019), DM-Count (Wang et al. 2020), RPnet (Zhang et al. 2015), AMSNet (Hu et al. 2020), M-SFANet (Thanasutives et al. 2021), TEDnet (Jiang et al. 2019), P2PNet (Song et al. 2021), GauNet (Cheng et al. 2022), and MAN (Lin et al. 2022). Table 3 shows the comparative results on four benchmark datasets. Clearly, our method achieves the best MAE on all of these datasets, especially on large-scale datasets such as UCF-QNRF, NWPU-Crowd, and ShanghaiTech Part A. As for the MSE metric, our method outperforms all SoTA methods except MAN (Lin et al. 2022).

Table 4: Ablation study of SACC-Net for running SFM+IFM at different density scales on UCF-QNRF.

SFM+IFM | Scale1 | Scale2 | Scale3 | UCF-QNRF MAE | MSE
Per-configuration results (MAE, MSE): 85.45, 145.74; 84.07, 135.63; 82.42, 130.04; 83.81, 140.19; 82.71, 130.29; 77.12, 124.25

Ablation Studies


Figure 6: Visualization results of the SACC-Net (b) without and (c) with intra-fusion, where (a) shows the input image with GT = 429 from ShanghaiTech Part A; (c) is better than (b) both subjectively and objectively.

To demonstrate the effectiveness of our fusion approach, we conducted an ablation study on how the addition of “fusion” and the number of scales improve crowd counting accuracy. Most objects in the UCF-QNRF dataset are smaller than those in other datasets, so UCF-QNRF is adopted here to fairly evaluate the effect of the proposed fusion module. Table 4 shows that using our fusion module is significantly better than not using it. For example, our SACC-Net with this module reduces the error significantly from 85.42 to 77.12 in MAE and from 145.44 to 124.25 in MSE on the UCF-QNRF dataset.

We also evaluated the effect of the number of scales on crowd counting accuracy. The five pooling layers in VGG19 scale the original image down to a $\frac{1}{32}\times\frac{1}{32}$ ratio. The feature map in the last layer cannot provide enough information to calculate the required covariance matrix, while the first layer is too primitive for crowd counting. Since three layers provide the best performance, we set $S$ to three in Eq. (3). Table 4 shows accuracy comparisons among combinations of the three scales (corresponding to layers 2, 3, and 4). The three-scale scale-aware loss function significantly improves the accuracy of crowd counting on the UCF-QNRF dataset, especially in the MAE metric.

Visualization results of heat maps: Fig. 5 shows the visualization results for different loss functions. The ground-truth head count in (a) is 855. The heat maps generated by the MSE loss and the Bayesian loss (Ma et al. 2019) are visualized in (b) and (c), with predicted counts of 772.69 and 901.30, respectively. Clearly, the MSE loss performs better than the BL method. (d) shows the visualization generated by our scale-aware loss, with a predicted count of 884.1. Compared to (b) and (c), our loss function generates a more detailed heat map for heads, since annotation errors are taken into account in crowd modeling. NoiseCC (Wan and Chan 2020) also points out that annotation noise affects the accuracy of crowd counting. Fig. 6 shows the visualization results generated by our method without and with IFM. The more detailed heat map of smaller heads in (c) leads to better crowd counting accuracy, which justifies the effectiveness of IFM. The SFM can generate various synthetic layers to construct better density maps for crowd counting. Refer to the supplementary for detailed performance evaluations of our lightweight model.

Conclusions

We presented the scale-aware SACC-Net and a new loss function that addresses annotation noise with respect to scale to improve crowd counting. The proposed SFM handles the scale truncation problem and generates a smoother scale space so that large objects can be accurately counted. The IFM fuses all feature layers within the same convolution block to generate finer-grained information for counting small objects. SACC-Net is lightweight, efficient, and accurate. We also evaluated the effects of the annotation error variance $\alpha$ and the Gaussian kernel variance $\beta$ on MAE. SACC-Net outperforms other SoTA methods on four datasets.

Future work includes the automatic selection of the parameters $\alpha$ and $\beta_{s}$ through data-driven learning. Further lightweight improvements could also enable SACC-Net to be deployed directly on drones.

References

  • Bai et al. (2020) Bai, S.; He, Z.; Qiao, Y.; Hu, H.; Wu, W.; and Yan, J. 2020. Adaptive dilated network with self-correction supervision for counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4594–4603.
  • Cao et al. (2018) Cao, X.; Wang, Z.; Zhao, Y.; and Su, F. 2018. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European conference on computer vision (ECCV), 734–750.
  • Chen et al. (2017) Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4): 834–848.
  • Cheng et al. (2022) Cheng, Z.-Q.; Dai, Q.; Li, H.; Song, J.; Wu, X.; and Hauptmann, A. G. 2022. Rethinking Spatial Invariance of Convolutional Networks for Object Counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19638–19648.
  • Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
  • Gao et al. (2020) Gao, G.; Gao, J.; Liu, Q.; Wang, Q.; and Wang, Y. 2020. Cnn-based density estimation and crowd counting: A survey. arXiv preprint arXiv:2003.12783.
  • Hu et al. (2020) Hu, Y.; Jiang, X.; Liu, X.; Zhang, B.; Han, J.-g.; Cao, X.; and Doermann, D. 2020. Nas-count: Counting-by-density with neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), 748–765.
  • Huang et al. (2017) Huang, G.; Liu, Z.; Maaten, L. V. D.; and Weinberger, K. Q. 2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2261–2269.
  • Idrees et al. (2013) Idrees, H.; Saleemi, I.; Seibert, C.; and Shah, M. 2013. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2547–2554.
  • Idrees et al. (2018) Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; and Shah, M. 2018. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), 532–546.
  • Jiang et al. (2019) Jiang, X.; Xiao, Z.; Zhang, B.; Zhen, X.; Cao, X.; Doermann, D.; and Shao, L. 2019. Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6133–6142.
  • Jiang et al. (2020) Jiang, X.; Zhang, L.; Xu, M.; Zhang, T.; Lv, P.; Zhou, B.; Yang, X.; and Pang, Y. 2020. Attention scaling for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4706–4715.
  • Lempitsky and Zisserman (2010) Lempitsky, V.; and Zisserman, A. 2010. Learning to count objects in images. Advances in neural information processing systems, 23: 1324–1332.
  • Li et al. (2021) Li, B.; Huang, H.; Zhang, A.; Liu, P.; and Liu, C. 2021. Approaches on crowd counting and density estimation: a review. Pattern Analysis and Applications, 24: 853–874.
  • Li, Zhang, and Chen (2018) Li, Y.; Zhang, X.; and Chen, D. 2018. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1091–1100.
  • Lin et al. (2022) Lin, H.; Ma, Z.; Ji, R.; Wang, Y.; and Hong, X. 2022. Boosting Crowd Counting via Multifaceted Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19628–19637.
  • Liu, Salzmann, and Fua (2019) Liu, W.; Salzmann, M.; and Fua, P. 2019. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5099–5108.
  • Ma et al. (2019) Ma, Z.; Wei, X.; Hong, X.; and Gong, Y. 2019. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6142–6151.
  • Miao et al. (2020) Miao, Y.; Lin, Z.; Ding, G.; and Han, J. 2020. Shallow feature based dense attention network for crowd counting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 11765–11772.
  • Sindagi and Patel (2017) Sindagi, V. A.; and Patel, V. M. 2017. Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Song et al. (2021) Song, Q.; Wang, C.; Jiang, Z.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; and Wu, Y. 2021. Rethinking counting and localization in crowds: A purely point-based framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3365–3374.
  • Thanasutives et al. (2021) Thanasutives, P.; Fukui, K.-i.; Numao, M.; and Kijsirikul, B. 2021. Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting. In 2020 25th International Conference on Pattern Recognition (ICPR), 2382–2389. IEEE.
  • Varior et al. (2019) Varior, R. R.; Shuai, B.; Tighe, J.; and Modolo, D. 2019. Multi-scale attention network for crowd counting. arXiv preprint arXiv:1901.06026.
  • Wan and Chan (2019) Wan, J.; and Chan, A. 2019. Adaptive density map generation for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1130–1139.
  • Wan and Chan (2020) Wan, J.; and Chan, A. 2020. Modeling noisy annotations for crowd counting. Advances in Neural Information Processing Systems, 33: 3386–3396.
  • Wan, Liu, and Chan (2021) Wan, J.; Liu, Z.; and Chan, A. B. 2021. A Generalized Loss Function for Crowd Counting and Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1974–1983.
  • Wang et al. (2020) Wang, B.; Liu, H.; Samaras, D.; and Hoai, M. 2020. Distribution matching for crowd counting. arXiv preprint arXiv:2009.13077.
  • Wang and Breckon (2019) Wang, Q.; and Breckon, T. P. 2019. Crowd Counting via Segmentation Guided Attention Networks and Curriculum Loss. arXiv preprint arXiv:1911.07990.
  • Wang et al. (2021) Wang, Q.; Gao, J.; Lin, W.; and Li, X. 2021. NWPU-crowd: A large-scale benchmark for crowd counting. IEEE transactions on Pattern Analysis and Machine Intelligence, 43(6): 2141–2149.
  • Xiong et al. (2019) Xiong, H.; Lu, H.; Liu, C.; Liu, L.; Cao, Z.; and Shen, C. 2019. From open set to closed set: Counting objects by spatial divide-and-conquer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8362–8371.
  • Xu et al. (2019) Xu, C.; Liang, D.; Xu, Y.; Bai, S.; Zhan, W.; Bai, X.; and Tomizuka, M. 2019. Autoscale: learning to scale for crowd counting. arXiv preprint arXiv:1912.09632.
  • Zhang et al. (2015) Zhang, C.; Li, H.; Wang, X.; and Yang, X. 2015. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 833–841.
  • Zhang et al. (2016) Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; and Ma, Y. 2016. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 589–597.
  • Zhu et al. (2019) Zhu, L.; Zhao, Z.; Lu, C.; Lin, Y.; Peng, Y.; and Yao, T. 2019. Dual path multi-scale fusion networks with attention for crowd counting. arXiv preprint arXiv:1902.01115.