

Single-Frame based Deep View Synchronization for Unsynchronized Multi-Camera Surveillance

Qi Zhang and Antoni B. Chan
Qi Zhang is with the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong Province, China. Antoni B. Chan is with the Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Hong Kong SAR, China. E-mail: qi.zhang.opt@gmail.com, abchan@cityu.edu.hk. Manuscript received xxx; revised August xxx.
Abstract

Multi-camera surveillance has been an active research topic for understanding and modeling scenes. Compared to a single camera, multiple cameras provide a larger field-of-view and more object cues, and the related applications include multi-view counting, multi-view tracking, 3D pose estimation, and 3D reconstruction. When designing models for these multi-camera tasks, it is usually assumed that all cameras are temporally synchronized. However, this assumption is not always valid, especially for multi-camera systems with network transmission delay and low frame rates due to limited network bandwidth, resulting in desynchronization of the captured frames across cameras. To handle unsynchronized multi-cameras, in this paper we propose a synchronization model that works in conjunction with existing DNN-based multi-view models, thus avoiding a redesign of the whole model. We consider two variants of the model, based on where in the pipeline the synchronization occurs: scene-level synchronization and camera-level synchronization. The view synchronization step and the task-specific view fusion and prediction step are unified in the same framework and trained in an end-to-end fashion. Our view synchronization models are applied to different DNN-based multi-camera vision tasks under the unsynchronized setting, including multi-view counting and 3D pose estimation, and achieve good performance compared to baselines.

I Introduction

Compared to single cameras, multi-camera networks allow better understanding and modeling of the 3D world through denser sampling of information in a 3D scene [1]. Multi-camera vision tasks have been a popular research field, especially deep-learning-based tasks such as 3D pose estimation from multiple 2D observations [2, 3], 3D reconstruction [4, 5], multi-view tracking [6, 7, 8], multi-view crowd counting [9], and ReID [10, 11, 12, 13, 14]. Usually, it is assumed that the multiple cameras are temporally synchronized when designing DNN models, i.e., all cameras capture images at the same time point. However, the synchronization assumption for multi-camera systems may not always be valid in practical applications for a variety of reasons, such as dropped camera frames due to limited network bandwidth or system resources, network transmission delays, etc. Other examples of situations where camera synchronization is not possible include: 1) using images captured from different camera systems; 2) using images from social media to reconstruct the crowd at an event; 3) performing 3D reconstruction of a dynamic scene using video from a drone. Thus, handling unsynchronized multi-cameras is an important issue for the adoption and practical usage of multi-view computer vision.

Figure 1: Two variants of the main pipeline for unsynchronized multi-view prediction tasks: (top) scene-level synchronization is performed after the projection on the scene-level feature representations; (bottom) camera-level synchronization is performed on the camera-view feature maps before projection.
Figure 2: A general multi-view pipeline consists of several stages: camera-view feature extraction, feature projection, multi-view feature fusion to obtain a scene-level representation, and prediction. (a) Scene-level synchronization performs the synchronization after the projection. The unsynchronized projected features from the reference view and other views are concatenated to predict the motion flow, which is then used to warp the other views’ projected features to match those of the reference view. (b) Camera-level synchronization performs the synchronization before the projection. The unsynchronized camera-view features from the reference view and other views are matched together to predict the motion flow, which is used to warp features from other camera views to the reference view.

There are several possible ways to fix the problem of unsynchronized cameras. The first is a hardware-based solution that synchronizes the capture times, e.g., by improving network bandwidth or by using a central clock to trigger capture on all cameras in the multi-camera network. However, this increases the cost and overhead of the system, and is not possible when bandwidth is limited. The second is to capture image sequences from each camera, and then synchronize the images afterwards by determining the frame offset between cameras. The fineness of the synchronization depends on the frame rate of the image sequences. However, this approach is not effective when acquiring high frame-rate image sequences is not possible due to limited bandwidth and storage space, or when the frame latency between cameras is random. The final approach is to modify the multi-view model itself to handle unsynchronized images, especially for low-frame-rate multi-camera systems or random frame latency between cameras, e.g., by introducing new assumptions or relaxing the original constraints under the unsynchronized setting. Existing approaches for handling unsynchronized multi-cameras are largely based on optimization frameworks [15, 16], but are not directly applicable to DNN-based multi-view methods, which have seen recent successes in tracking [6, 7], 3D pose estimation [2], and crowd counting [9, 17].

In this paper, we propose a synchronization model that operates in conjunction with existing DNN-based multi-view models, using single frames from each camera to deal with low-frame-rate unsynchronized multi-camera systems or random frame latency between cameras. Our proposed model first synchronizes the other views to a reference view using a differentiable module, and then the synchronized multi-view features are fused and decoded to obtain the task-oriented output. As illustrated in Fig. 1, the synchronization can occur either after the camera-to-scene (2D-to-3D) projection (Fig. 1 top) or before the projection (Fig. 1 bottom). Thus, to fully explore these options, we consider two variants of our model that perform synchronization at different stages in the pipeline (see Fig. 2): 1) scene-level synchronization performs the synchronization after projecting the camera features to their 3D scene representation; 2) camera-level synchronization performs the synchronization between camera views first, and then projects the synchronized 2D feature maps to their 3D representations. In both cases, the motion flow between the cameras' feature maps is estimated and then used to warp the feature maps to align with the reference view (either at the scene level or the camera level). With both variants, the view synchronization and the multi-view fusion are unified in the same framework and trained in an end-to-end fashion. In this way, the original DNN-based multi-view model can be adapted to work in the unsynchronized setting by adding the view synchronization module, thus avoiding the need to design a new model. Furthermore, the synchronization module only relies on content-based image matching and camera geometry, and is thus widely applicable to many DNN-based multi-view tasks, such as crowd counting, tracking, 3D pose estimation, and 3D reconstruction.

In summary, the contributions of this paper are 3-fold:

  • We propose an end-to-end trainable framework to handle the issue of unsynchronized multi-camera images in DNNs-based multi-camera vision tasks. To the best of our knowledge, this is the first study on DNNs-based single-frame synchronization of multi-view cameras.

  • We propose two synchronization modules, scene-level synchronization and camera-view level synchronization, which are based on image-based content matching that is guided by epipolar geometry. The synchronization modules can be applied to many different DNNs-based multi-view tasks.

  • We conduct experiments on multi-view counting and 3D pose estimation from unsynchronized images, demonstrating the efficacy of our approach.

The remainder of this paper is organized as follows. We review related works in Section II. In Section III, we propose our single-frame multi-camera synchronization methods, and in Section IV we present experiments on two applications: multi-view crowd counting and multi-view 3D human pose estimation. Finally, Section V concludes the paper.

II Related Work

In this section, we review DNN-based methods on synchronized multi-view images and unsynchronized multi-view video tasks, as well as traditional multi-view video synchronization methods. We then review DNN-based image matching and flow estimation methods.

II-A DNN-based synchronized multi-camera tasks

Multi-camera surveillance based on DNNs has been an active research area. By utilizing multi-view cues and the strong mapping power of DNNs, many DNN models have been proposed to solve multi-view surveillance tasks, such as multi-view tracking and detection [18, 6, 7], crowd counting [9], 3D reconstruction [4, 5, 19, 20], and 3D human pose estimation [2, 21, 22, 23, 24]. [4] proposed a deep-learning 3D reconstruction framework with differentiable feature projection and unprojection steps. [10] proposed collaborative ensemble learning for ReID with a middle-level sharable two-stream network. [2] proposed volumetric aggregation of feature maps for 3D pose estimation. The DNN pipelines used for these multi-camera tasks can generally be divided into three stages: single-view feature extraction, multi-view fusion to obtain a scene-level representation, and prediction. Furthermore, all these DNN-based methods assume that the input multi-view images are synchronized, which is not always possible in real multi-camera surveillance systems or in multi-view data from disparate sources (e.g., crowd-sourced images). Therefore, relaxing the synchronization assumption allows more practical applications of multi-camera vision tasks in the real world.

II-B Tasks on unsynchronized multi-camera video

Only a few works have considered computer vision tasks on unsynchronized multi-camera video. [15] posed the estimation of 3D structure observed by multiple unsynchronized video cameras as a dictionary learning problem. [16] proposed a multi-camera motion segmentation method for unsynchronized videos by combining shape and dynamical information. [25] proposed a method for estimating 3D human pose from multi-view videos captured by unsynchronized and uncalibrated cameras, utilizing the projections of joints as corresponding points. [26] presented a method for simultaneously estimating camera geometry and time shift from video sequences of multiple unsynchronized cameras using minimal correspondence sets. [27] addressed the problem of aligning unsynchronized camera views with low and/or variable frame rates, using the intersections of corresponding object trajectories to match views.

Note that all these methods assume that videos or image sequences are available to perform the synchronization. In contrast, our framework, which is motivated by practical low-fps systems, solves a harder problem in which only a single image is available from each camera view, i.e., there is no temporal information available. Furthermore, these methods pose frame synchronization as optimization problems that are applicable only to their particular multi-view task, and cannot be directly applied to DNN-based multi-view models. In contrast, we propose a synchronization module that can be broadly applied to many DNN-based multi-camera models, enabling their use with unsynchronized inputs.

II-C Traditional methods for multi-view video synchronization

Traditional synchronization methods usually serve as a preprocessing step for multi-camera surveillance tasks. Except for audio-based synchronization such as [28], most traditional camera synchronization methods rely on videos or image sequences and hand-crafted features for camera alignment/synchronization [29, 30, 31, 32, 33]. Typical approaches recover the temporal offset by matching features extracted from the videos, e.g., space-time feature trajectories [34, 35, 36], image features [37], low-level temporal signals based on fundamental matrices [38], silhouette motion [39], and relative object motion [40]. The accuracy of feature matching is improved using epipolar geometry [37, 39] and rank constraints [35]. [34] proposed matching space-time feature trajectories instead of feature points to reduce the search space. [29] proposed an iterative procedure to achieve alignment in space and time under a homography assumption in the spatial domain. [37] utilized image feature correspondences and epipolar geometry to find corresponding frame indices, and computed the relative frame rate and offset by fitting a 2D line to the index correspondences. [36] estimated the frame-accurate offset by analysing the trajectories and matching their characteristic time patterns. [38] presented an online synchronization method that relies on video sequences with a known fundamental matrix to compute low-level temporal signals for matching. [35] proposed a rank constraint on corresponding points in two views to measure the similarity between trajectories, avoiding the noise sensitivity of the fundamental matrix. [39] proposed a RANSAC-based algorithm that computes the epipolar geometry and synchronization of a pair of cameras from the motion of silhouettes in videos. [32] estimated candidate synchronization parameters via the Hough transform and refined them using non-linear optimization. [33] relied on correlating the space-time interest point distribution over time between videos, which represents events with high variation in both space and time. [40] synchronized two independently moving cameras via the relative motion between objects and known camera intrinsics.

The main disadvantages of these traditional camera synchronization methods are: 1) videos or image sequences are required, which might not be available in practical multi-camera systems with limited network bandwidth and storage; 2) a fixed frame rate for the cameras is usually assumed, which means random frame dropping cannot be handled (except by [38]); 3) feature matching is based on hand-crafted features, which lack representation ability, or on known image correspondences, which require extra manual annotations and may not always be available. Compared with these methods, we consider a more practical and difficult setting: only single frames and no videos (no temporal information) are available, which means these traditional video-based methods are not suitable solutions. These traditional methods perform image content matching using hand-crafted features and traditional matching algorithms, whereas our method uses DNN-based image matching. Because we also assume that only single frames are available, our method additionally requires DNN-based motion estimation to estimate a frame's features after synchronization. Finally, our synchronization module is end-to-end trainable with existing multi-view DNNs and thus avoids redesigning the whole DNN model to handle unsynchronized multi-cameras.

II-D DNN-based image matching and flow estimation

Image matching and optical flow estimation both involve estimating image-to-image correspondences, which is related to frame synchronization of multiple views. We mainly review DNN-based image matching [41, 42, 43] and optical flow estimation methods [44, 45, 46], which inspire us to solve the unsynchronized multi-camera problem in a DNN-based fashion. DNN flow [47] proposed an image matching method based on a DNN feature pyramid in a coarse-to-fine optimization manner. FlowNet [48] predicted optical flow with DNNs using feature concatenation and correlation. SpyNet [49] combined a classical spatial-pyramid formulation with deep learning and estimated large motions in a coarse-to-fine manner, warping one image to the other at each pyramid level using the current flow estimate and computing an update to the flow. [41] addressed the image correspondence problem using a convolutional neural network architecture that mimics classic image matching algorithms. PWC-Net [50] used a feature pyramid in which one image's feature map is warped to the other at each scale, guided by the upsampled optical flow estimated at the previous scale. [51] proposed a single network to jointly learn spatiotemporal correspondence for stereo matching and flow estimation.

Our method is related to DNN-based image matching and optical flow estimation, but there are significant differences: 1) typical image/geometric matching involves either only a camera view-angle transformation (e.g., camera relative pose estimation, stereo matching) or only a small time change in the same view (optical flow estimation), while both factors appear in our problem, which makes it harder; 2) image/geometric matching is directly supervised by the correspondence between two images, while in our problem the multi-view fusion ground truth in the 3D world is used as the supervisory signal; 3) the 2D-to-3D projection causes ambiguity for multi-view feature fusion, which also causes difficulties for view synchronization.

III Single-Frame DNNs Multi-Camera Synchronization

In this section we propose our single-frame synchronization model for DNN-based multi-view models. The temporal offset between cameras can be either a constant latency for each camera (the same offset over time) or a random latency (random offsets over time). Similar to most multi-view methods [7, 17, 20, 2], we assume that the cameras are static and that the cameras' intrinsic and extrinsic parameters are known. The main idea of our method is to choose one camera view as the reference view, and then use the view synchronization model to warp the other camera views to be synchronized with the reference view. The synchronization model should be general enough to handle both constant and random latencies between cameras, in order to work under the various conditions that cause desynchronization.

DNN models for multi-camera surveillance tasks typically consist of three stages (see Fig. 1): 1) single-view feature extraction, which extracts features from each input camera view; 2) multi-view feature projection and fusion, where a fixed differentiable projection layer first projects the single-view features onto the 3D coordinate map, and the projected multi-view features are then fused together to form the scene-level representation; 3) prediction, where the decoder predicts the final result on the 3D coordinate map, such as ground-plane density maps [9] or a 3D reconstruction [4]. The projection layer depends on the application task, and our framework can generally handle any differentiable projection layer. For example, for multi-view counting [9], the projection maps the 2D camera view to the 3D scene plane at the average person height (assuming all camera pixels fall on the same height plane), while for 3D pose estimation [2], the projection copies features along a view-ray in the 3D grid, assuming an unknown height for each camera-view pixel.

In Fig. 2, we take multi-view crowd counting [9] as an example to show the pipeline of the proposed single-frame view synchronization model. In the multi-view fusion model, we denote the input multi-view frames as $\{I_i^{t_0}\}_{i=0}^{n-1}$, where $i$ denotes the camera-view index, $n$ is the number of input camera views, and the superscript $t_0$ indicates that the frames are all captured at the same time point $t_0$, corresponding to the synchronized multi-camera setup. After being fed into the single-view feature extractor $F$, the extracted features are denoted as

F_i^{t_0} = F(I_i^{t_0}), \quad i \in \{0, 1, \ldots, n-1\}.   (1)

For multi-view counting [9], the projection ${\cal P}$ maps the 2D camera view to the 3D scene plane at the average person height. After the projection layer ${\cal P}$, the projected multi-view features are

{\cal F}_i^{t_0} = {\cal P}(F_i^{t_0}), \quad i \in \{0, 1, \ldots, n-1\}.   (2)

We use $U$ to denote the fusion operation (e.g., concatenation, max-pooling) of the projected multi-view features, so the fused feature is $U({\cal F}_0^{t_0}, \ldots, {\cal F}_{n-1}^{t_0})$. Finally, the decoder $D$ is applied to obtain the final prediction $V_p$,

V_p = D(U({\cal F}_0^{t_0}, \ldots, {\cal F}_{n-1}^{t_0}))   (3)
    = D(U({\cal P}(F_0^{t_0}), \ldots, {\cal P}(F_{n-1}^{t_0}))).
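To make the three-stage pipeline concrete, the following PyTorch sketch mirrors Eqs. (1)-(3); the module names and the concatenation fusion are illustrative assumptions, and the projection is abstracted as an arbitrary differentiable callable rather than the paper's specific projection layer.

```python
import torch
import torch.nn as nn

class SyncedMultiViewPipeline(nn.Module):
    """Illustrative three-stage multi-view pipeline of Eqs. (1)-(3):
    per-view feature extraction F, fixed projection P, fusion U (here channel
    concatenation), and decoder D."""

    def __init__(self, feat_net: nn.Module, project, decoder: nn.Module):
        super().__init__()
        self.feat_net = feat_net   # F: shared single-view feature extractor
        self.project = project     # P: fixed differentiable 2D-to-3D projection
        self.decoder = decoder     # D: scene-level prediction head

    def forward(self, frames):
        # frames: list of n synchronized camera-view images, each [B, C, H, W]
        feats = [self.feat_net(I) for I in frames]      # Eq. (1)
        proj = [self.project(f) for f in feats]         # Eq. (2)
        fused = torch.cat(proj, dim=1)                  # U: concatenation fusion
        return self.decoder(fused)                      # Eq. (3)
```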

However, when the input multi-camera frames are not synchronized, denoted as $\{I_i^{t_i}\}_{i=0}^{n-1}$, the capture time for the $i$-th view $t_i \neq t_0$. Thus, we need to synchronize the camera views first using the view synchronization model.

The view synchronization model can be embedded into one of the first two stages, synchronizing either the extracted single-view features $\{F_i^{t_i}\}$ or the projected features $\{{\cal F}_i^{t_i}\}$, without the need to design a new architecture. Thus, we propose two variants of the synchronization model: 1) scene-level synchronization, where the projected features $\{{\cal F}_i^{t_i}\}$ from different camera views are synchronized during multi-camera feature fusion; and 2) camera-level synchronization, where the camera-view features $\{F_i^{t_i}\}$ are synchronized before projection and fusion. We present the details of the two synchronization models next. Note that we first consider the case when both synchronized and unsynchronized multi-view images are available for training (but not in the testing stage). We then extend this to the case when only unsynchronized training images are available.

III-A Scene-level synchronization

Scene-level synchronization works by synchronizing the multi-camera features after the projection stage of the multi-view pipeline. The pipeline for scene-level synchronization is shown in Fig. 2(a).

III-A1 Synchronization module

Without loss of generality, we choose one view (denoted as view 0) as the reference view; the other views are to be synchronized to this reference view. We first assume that synchronized frame pairs are available in the training stage. The frames are $I_0^{t_0}$ from reference view 0 captured at reference time $t_0$, and $I_i^{t_0}$ and $I_i^{t_i}$ from view $i$ ($i \in \{1, 2, \ldots, n-1\}$) captured at times $t_0$ and $t_i$. Note that the frames $(I_0^{t_0}, I_i^{t_0})$ are synchronized, while $(I_0^{t_0}, I_i^{t_i})$ are not.

The synchronization module consists of the following stages. First, the camera-frame feature maps $(F_0^{t_0}, F_i^{t_0}, F_i^{t_i})$ (of both synchronized and unsynchronized frames) are extracted and projected to the 3D world space, resulting in the projected feature maps $({\cal F}_0^{t_0}, {\cal F}_i^{t_0}, {\cal F}_i^{t_i})$. Second, synchronization is performed between the reference view 0 and each other view $i$. The projected feature map ${\cal F}_0^{t_0}$ from the reference view is concatenated with the projected feature map ${\cal F}_i^{t_i}$ from view $i$, and then fed into a motion flow estimation network ${\cal M}_s$ to predict the scene-level motion flow $w_i$ between view $i$ at time $t_i$ and the reference view at time $t_0$:

w_i = {\cal M}_s(\mathrm{Cat}({\cal F}_0^{t_0}, {\cal F}_i^{t_i})), \quad i \in \{1, \ldots, n-1\},   (4)

where $\mathrm{Cat}$ is the concatenation operation. The projected feature map ${\cal F}_i^{t_i}$ from view $i$ is then synchronized with the reference view at time $t_0$ using a warping transformation ${\cal W}$ guided by $w_i$,

\hat{{\cal F}}_i^{t_0} = {\cal W}(w_i, {\cal F}_i^{t_i}), \quad i \in \{1, \ldots, n-1\},   (5)

where $\hat{{\cal F}}_i^{t_0}$ are the warped projected features of the $i$-th view synchronized to time $t_0$. Note that the warping ${\cal W}$ only applies a spatial shift to the feature map ${\cal F}_i^{t_i}$, i.e., it only changes the feature locations and does not change the feature values. Finally, the reference-view features ${\cal F}_0^{t_0}$ and the estimated warped features of the other views $\{\hat{{\cal F}}_i^{t_0}\}$ are fused and decoded to obtain the final scene-level prediction $V_p$:

V_p = D(U({\cal F}_0^{t_0}, \hat{{\cal F}}_1^{t_0}, \ldots, \hat{{\cal F}}_{n-1}^{t_0}))   (6)
    = D(U({\cal F}_0^{t_0}, {\cal W}(w_1, {\cal F}_1^{t_1}), \ldots, {\cal W}(w_{n-1}, {\cal F}_{n-1}^{t_{n-1}}))).   (7)

In the testing stage, only the unsynchronized frames $(I_0^{t_0}, I_i^{t_i})$ are available, and the forward operations related to frame $I_i^{t_0}$ are removed from the network.
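A minimal PyTorch sketch of the scene-level synchronization step (Eqs. (4)-(5)): a small convolutional network stands in for the motion flow estimator ${\cal M}_s$ (its layer sizes are illustrative, not the paper's exact architecture), and the warp ${\cal W}$ is implemented with the differentiable grid_sample resampler so that only feature locations are shifted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """Spatially warp `feat` [B, C, H, W] using a 2-channel flow [B, 2, H, W]
    of pixel offsets, via the differentiable image resampler (grid_sample).
    Only feature locations change; feature values are untouched."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # sampling locations in pixels
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack((2.0 * grid_x / (W - 1) - 1.0,      # normalize to [-1, 1]
                        2.0 * grid_y / (H - 1) - 1.0), dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)

class SceneLevelSync(nn.Module):
    """Sketch of Eqs. (4)-(5): predict the scene-level motion flow from the
    concatenated projected features, then warp the unsynchronized view."""
    def __init__(self, channels):
        super().__init__()
        self.flow_net = nn.Sequential(                     # stands in for M_s
            nn.Conv2d(2 * channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 3, padding=1))

    def forward(self, proj_ref, proj_i):
        flow = self.flow_net(torch.cat([proj_ref, proj_i], dim=1))  # w_i, Eq. (4)
        return warp_by_flow(proj_i, flow), flow                     # Eq. (5)
```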

III-A2 Training loss

Two losses are used in the training stage. The first is a task-specific prediction loss $\ell_p$ between the scene-level prediction $V_p$ and the ground truth $V_{gt}$. For example, for multi-view crowd counting, $\ell_p$ is the mean-square error, and $V_p, V_{gt}$ are the predicted and ground-truth scene-level density maps. The second loss is on the multi-view feature synchronization in the multi-view fusion stage. Since synchronized frame pairs are available during training, the feature warping loss $\ell_{\cal W}$ encourages the warped features to be similar to the features of the original synchronized frame of view $i$,

\ell_{\cal W}(w_i, {\cal F}_i^{t_0}, {\cal F}_i^{t_i}) = \mathrm{mse}({\cal F}_i^{t_0}, \hat{{\cal F}}_i^{t_0})   (8)
                                                       = \mathrm{mse}({\cal F}_i^{t_0}, {\cal W}(w_i, {\cal F}_i^{t_i})),

where $\mathrm{mse}$ is the mean-square error loss. Note that the warping ${\cal W}$ only applies a spatial shift, and thus the minimization of the warping loss $\ell_{\cal W}$ in (8) is based on feature alignment via the scene-level motion flow $w_i$ and not on global feature value changes (e.g., color correction). Finally, the training loss combines the task loss and the warping loss summed over all non-reference views,

\ell = \ell_p(V_p, V_{gt}) + \gamma \sum_{i=1}^{n-1} \ell_{\cal W}(w_i, {\cal F}_i^{t_0}, {\cal F}_i^{t_i}),   (9)

where $\gamma$ is a hyperparameter.
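A sketch of the combined training objective of Eq. (9), assuming the mean-square-error task loss used for multi-view counting; variable names are illustrative, and the synchronized-frame features are only available during training, as noted above.

```python
import torch.nn.functional as F

def scene_level_sync_loss(V_p, V_gt, warped_proj, synced_proj, gamma=1.0):
    """Sketch of Eq. (9): task loss plus the feature warping loss of Eq. (8)
    summed over the non-reference views.
    warped_proj[i] is W(w_i, F_i^{t_i}); synced_proj[i] is F_i^{t_0}."""
    loss = F.mse_loss(V_p, V_gt)                              # \ell_p
    for warped, synced in zip(warped_proj, synced_proj):
        loss = loss + gamma * F.mse_loss(warped, synced)      # \ell_W
    return loss
```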

III-B Camera view-level synchronization

The height of each image pixel in 3D space is unknown, and thus the projection operation in multi-camera DNN models [9, 17, 2] will either project each pixel to the same assumed height level [9] (causing distortion when the true pixel height is different), or to multiple height levels [17, 2] (duplicating features along the view ray). These projections cause the features to stretch along the view ray in the 3D scene, which makes their synchronization more difficult due to their imprecise (stretched) and ambiguous (duplicated) nature. Therefore, we also consider synchronization between camera-view features before the projection. The pipeline for camera-level synchronization is presented in Fig. 2(b).

III-B1 Synchronization model

The view synchronization model is applied to each view separately. The camera-view features $(F_0^{t_0}, F_i^{t_i})$ from the unsynchronized reference view and view $i$ are first passed through a matching module (see below) and then fed into the motion flow estimation network ${\cal M}_c$ to predict the camera-view motion flow $w_i$ for view $i$. The warping transformation $W$ guided by $w_i$ then warps the camera-view features $F_i^{t_i}$ of view $i$ to be synchronized with the reference view at time $t_0$,

\hat{F}_i^{t_0} = W(w_i, F_i^{t_i}), \quad i \in \{1, \ldots, n-1\},   (10)

where $\hat{F}_i^{t_0}$ are the warped camera-view features of view $i$ captured at time $t_i$, synchronized to reference view 0 captured at time $t_0$. Finally, the reference and warped camera views are projected,

{\cal F}_0^{t_0} = {\cal P}(F_0^{t_0}), \quad \hat{{\cal F}}_i^{t_0} = {\cal P}(\hat{F}_i^{t_0}), \quad i \in \{1, \ldots, n-1\},   (11)

and then fused and decoded to obtain the scene-level prediction $V_p$:

V_p = D(U({\cal F}_0^{t_0}, \hat{{\cal F}}_1^{t_0}, \ldots, \hat{{\cal F}}_{n-1}^{t_0}))   (12)
    = D(U({\cal P}(F_0^{t_0}), {\cal P}(\hat{F}_1^{t_0}), \ldots, {\cal P}(\hat{F}_{n-1}^{t_0}))).   (13)

In the testing stage, only the unsynchronized frames $(I_0^{t_0}, I_i^{t_i})$ are available, and the forward operations related to frame $I_i^{t_0}$ are removed from the network.
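The sketch below highlights the ordering difference from scene-level synchronization: warping happens on the 2D camera-view features before the projection (Eqs. (10)-(13)). All callables (feature extractor, matching-plus-flow network, warp, projection, decoder) are placeholders rather than the paper's exact modules.

```python
import torch

def camera_level_forward(frames, feat_net, flow_from_match, warp, project, decoder):
    """Sketch of Eqs. (10)-(13): each non-reference view's 2D features are warped
    to the reference time *before* the 2D-to-3D projection and fusion."""
    feats = [feat_net(I) for I in frames]        # unsynchronized camera-view features
    ref = feats[0]                               # reference view 0 at time t_0
    warped = [warp(flow_from_match(ref, f), f) for f in feats[1:]]   # Eq. (10)
    proj = [project(ref)] + [project(f) for f in warped]             # Eq. (11)
    return decoder(torch.cat(proj, dim=1))                           # Eqs. (12)-(13)
```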

III-B2 Matching module

We propose three methods to match features for predicting the camera-view motion flow. The first method concatenates the features $(F_0^{t_0}, F_i^{t_i})$ and feeds them into the motion flow estimation network ${\cal M}_c$ to predict the motion flow $w_i$:

w_i = {\cal M}_c(\mathrm{Cat}(F_0^{t_0}, F_i^{t_i})), \quad i \in \{1, \ldots, n-1\}.   (14)

The second method builds a correlation map $C_i$ between the features at each pair of spatial locations in $F_0^{t_0}$ and $F_i^{t_i}$,

C_i((x,y),(x',y')) = F_0^{t_0}(x,y)^T F_i^{t_i}(x',y'),   (15)

which is then fed into the motion flow estimation network ${\cal M}_c$ to predict the motion flow $w_i$:

w_i = {\cal M}_c(C_i), \quad i \in \{1, \ldots, n-1\}.   (16)

The third method incorporates camera geometry into the correlation map to suppress false matches. If both cameras were synchronized at $t_0$, then according to multi-view geometry, each spatial location in view 0 must match a location in view $i$ on its corresponding epipolar line (Fig. 3a). Thus, in the synchronized setting, detected matches that are not on the epipolar line can be rejected as false matches. In our unsynchronized setting, the matched location in view $i$ remains on the epipolar line only when its corresponding feature/object does not move between times $t_0$ and $t_i$. To handle the case where the feature moves, we assume that a matched feature in view $i$ moves according to a Gaussian motion model with standard deviation $\sigma$ (Fig. 3b). With the epipolar line and motion model, we build a weighting mask with high weights on locations with high probability of containing the matched features, and vice versa. Specifically, we set the mask $M_i((x,y),(x',y')) = 1$ if $(x',y')$ is on the epipolar line induced by $(x,y)$, and 0 otherwise, and then convolve it with a 2D Gaussian with standard deviation $\sigma$ (Fig. 3c). We then apply the weight mask $M_i$ to the correlation map, $\tilde{C}_i = M_i \odot C_i$, which suppresses false matches that are not consistent with the scene geometry and motion model. Thus, the motion flow $w_i$ is

w_i = {\cal M}_c(\tilde{C}_i) = {\cal M}_c(M_i \odot C_i), \quad i \in \{1, \ldots, n-1\}.   (17)
Figure 3: Epipolar-guided weights. (a) In the synchronized setting, given the point $(x,y)$ in view 0, the matched point $(x',y')$ in view $i$ must lie on the epipolar line $l_{xy}$. (b) In the unsynchronized setting, we assume a Gaussian motion model of the matched feature location from time $t_0$ to $t_i$. (c) An epipolar-guided weight mask is used to bias the feature matching towards high-probability regions according to the motion model.
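A sketch of the correlation-based matching of Eq. (15) and the epipolar-guided weighting of Eq. (17). For the weights, this sketch evaluates a Gaussian of the point-to-epipolar-line distance, which approximates the paper's binary-mask-then-Gaussian-blur construction; the fundamental matrix `Fmat` (computable from the known intrinsics and extrinsics) and the feature-map resolution are assumed inputs.

```python
import torch

def correlation_map(feat_ref, feat_other):
    """Dense correlation of Eq. (15): dot product between the feature vector at
    every reference-view location and the feature vector at every other-view
    location. Output: [B, H*W (reference locations), H, W (other-view locations)]."""
    B, C, H, W = feat_ref.shape
    ref = feat_ref.flatten(2)                       # [B, C, H*W]
    oth = feat_other.flatten(2)                     # [B, C, H*W]
    corr = torch.einsum("bcm,bcn->bmn", ref, oth)   # [B, H*W, H*W]
    return corr.view(B, H * W, H, W)

def epipolar_weight_mask(Fmat, H, W, sigma=3.0):
    """Epipolar-guided weights M_i (a sketch): each reference pixel weights the
    other-view pixels by a Gaussian of their distance to its epipolar line l = F x.
    Fmat is the 3x3 fundamental matrix mapping reference-view points to lines."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)  # [H*W, 3]
    lines = pts @ Fmat.T                            # epipolar line coefficients (a, b, c)
    a, b, c = lines[:, 0:1], lines[:, 1:2], lines[:, 2:3]
    # distance of every other-view pixel to every reference pixel's epipolar line
    dist = (a * pts[:, 0] + b * pts[:, 1] + c).abs() / \
           (a ** 2 + b ** 2).sqrt().clamp(min=1e-8)
    return torch.exp(-0.5 * (dist / sigma) ** 2).view(1, H * W, H, W)

# Weighted correlation fed to the flow estimation network, Eq. (17):
#   corr = correlation_map(F0_t0, Fi_ti)                         # C_i
#   w_i  = flow_net(epipolar_weight_mask(Fmat, H, W) * corr)     # M_i (.) C_i
```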

III-B3 Multi-scale architecture

Multi-scale feature extractors are used in multi-camera tasks like crowd counting [9], or to refine the final prediction via multi-scale prediction fusion [51, 50]. Therefore, we next show how to incorporate multi-scale feature extractors into our camera-level synchronization model. (No extra steps are needed to incorporate multi-scale features with scene-level synchronization, because that synchronization occurs after the feature projection.) Instead of performing the view synchronization at each scale separately, the motion flow estimate of the neighboring scale is fused to refine the current scale's estimate (see Fig. 4). In particular, let there be $m$ scales in the multi-scale architecture and let $j$ denote one scale in the range $\{1, 2, \ldots, m\}$, with $m$ the largest scale. The multi-scale predicted motion flows are fused as follows, with a code sketch after Fig. 4.

  • When $j=1$ (the smallest scale), the correlation map $C_i^{(1)}$ of scale 1 is fed into the motion flow estimation net to predict the motion flow $w_i^{(1)}$ for scale 1.

  • For scales $j>1$, the difference between the correlation map $C_i^{(j)}$ and the upsampled correlation map of the previous scale, $\mathrm{up}(C_i^{(j-1)})$, is first fed into the motion flow estimation net to predict the residual motion flow between the two scales, denoted as $\tilde{w}_i^{(j)}$.

  • The refined motion flow of scale $j$ is

    w_i^{(j)} = \mathrm{up}(w_i^{(j-1)}) + \tilde{w}_i^{(j)}.   (18)
Figure 4: Multi-scale estimation of motion flow.
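A sketch of the coarse-to-fine fusion of Eq. (18), assuming correlation maps at neighboring scales have matching channel dimensions so that the difference in the second step is well defined; the per-scale flow networks are placeholders.

```python
import torch.nn.functional as F

def multiscale_flow(corr_pyramid, flow_nets):
    """Coarse-to-fine motion flow estimation (Sec. III-B3). `corr_pyramid` lists
    correlation maps from the smallest scale (j=1) to the largest (j=m);
    `flow_nets` are the per-scale flow estimation networks."""
    flow = None
    for j, (corr, net) in enumerate(zip(corr_pyramid, flow_nets)):
        if j == 0:
            flow = net(corr)                       # w_i^(1) at the smallest scale
        else:
            size = corr.shape[-2:]
            corr_up = F.interpolate(corr_pyramid[j - 1], size=size,
                                    mode="bilinear", align_corners=False)
            residual = net(corr - corr_up)         # residual flow, tilde{w}_i^(j)
            flow_up = F.interpolate(flow, size=size,
                                    mode="bilinear", align_corners=False)
            flow = flow_up + residual              # Eq. (18); rescale flow_up if the
                                                   # flow is expressed in pixels
    return flow
```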

III-B4 Training loss

Similar to scene-level synchronization, a combination of two losses (scene-level prediction and feature synchronization) is used in the training stage. The scene-level prediction loss is the same as before. The feature synchronization loss encourages the warped camera-view features at each scale to match the features of the original synchronized frame,

\ell_W = \mathrm{mse}(F_i^{t_0,(j)}, \hat{F}_i^{t_0,(j)})   (19)
       = \mathrm{mse}(F_i^{t_0,(j)}, W(w_i^{(j)}, F_i^{t_i,(j)})).   (20)

Similar to scene-level synchronization, the warping function $W$ only applies a spatial shift, and thus the minimization of $\ell_W$ in (20) is based on feature alignment rather than feature value changes. Finally, the training loss is the combination of the prediction loss and the synchronization loss summed over all non-reference views and scales,

\ell = \ell_p(V_p, V_{gt}) + \gamma \sum_{i=1}^{n-1} \sum_{j=1}^{m} \ell_W(w_i^{(j)}, F_i^{t_0,(j)}, F_i^{t_i,(j)}),   (21)

where $\gamma$ is a hyperparameter.

III-C Training with only unsynchronized frames

In the previous models, we assumed that both synchronized and unsynchronized multi-camera frames are available during training. For more practical applications, we also consider the case when only unsynchronized multi-view frames are available for training. In this case, for scene-level synchronization, the feature warping loss $\ell_{\cal W}$ is replaced with a similarity loss $\ell_s$ on the projected features, which indirectly encourages synchronization of the projected multi-view features,

\ell_s = \mathrm{mean}(1 - \cos({\cal F}_0^{t_0}, {\cal W}(w_i, {\cal F}_i^{t_i}))),   (22)

where $\cos$ is the cosine similarity between feature maps (along the channel dimension), and $\mathrm{mean}$ is the mean over all spatial locations. Similarly, for camera-level synchronization, the feature warping loss $\ell_W$ is replaced by the similarity loss $\ell_s$ on the projected features. Note that the similarity loss $\ell_s$ is applied after the projection; thus the warping function only needs to predict the residual motion in the camera view, i.e., the object motion over time, so as to align the projected features.
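A minimal sketch of the similarity loss in Eq. (22), using the channel-wise cosine similarity of the projected feature maps; variable names are illustrative.

```python
import torch.nn.functional as F

def projected_similarity_loss(proj_ref, warped_proj_i):
    """Similarity loss of Eq. (22) for training with only unsynchronized frames:
    one minus the channel-wise cosine similarity between the reference view's
    projected features and the warped projected features of view i, averaged
    over all spatial locations."""
    cos = F.cosine_similarity(proj_ref, warped_proj_i, dim=1)   # [B, H, W]
    return (1.0 - cos).mean()
```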

IV Experiments

We validate the effectiveness of the proposed view synchronization model on two unsynchronized multi-view tasks: multi-view crowd counting and multi-view 3D human pose estimation.

IV-A Implementation details

The synchronization model consists of two parts: a motion estimation network and a feature warping layer. The input to the motion estimation network is the unsynchronized multi-view features (the concatenation of the projected features) for scene-level synchronization, or the matching result of the 2D camera-view features for camera-level synchronization, and the output is a 2-channel motion flow. The layer settings of the motion estimation network are shown in Table I; a code sketch follows the tables below. The feature warping layer warps the features from the other views to align with the reference view, guided by the estimated motion flow; it is based on the image resampler from the Spatial Transformer layer in [52].

The synchronized multi-view model consists of a feature extraction module, a projection module, and a multi-view prediction module. For the multi-view counting model [9], Table II shows the settings of the feature extraction and multi-view prediction modules. For the 3D pose estimation model [2], the feature extraction module consists of a ResNet-152 network, a series of transposed convolution layers, and a 1×1 convolution layer to predict joint heatmaps [53], and V2V-PoseNet [54], which is based on an hourglass network [55], is used for multi-view prediction.

TABLE I: The layer settings for the motion estimation net in the view synchronization module. The filter dimensions are output channels, input channels, and filter size ($w \times h$).
Layer Filter
conv 1 128 × n × 5 × 5
conv 2 128 × 128 × 5 × 5
conv 3 64 × 128 × 5 × 5
conv 4 64 × 64 × 5 × 5
conv 5 32 × 64 × 5 × 5
conv 6 2 × 32 × 5 × 5
TABLE II: The model settings of the synchronized multi-view counting model [9], consisting of feature extraction and multi-view prediction. The filter dimensions are output channels, input channels, and filter size ($w \times h$).
Feature extraction
Layer Filter
conv 1 16 × 1 × 5 × 5
conv 2 16 × 16 × 5 × 5
pooling 2 × 2
conv 3 32 × 16 × 5 × 5
conv 4 32 × 32 × 5 × 5
pooling 2 × 2
conv 5 64 × 32 × 5 × 5
conv 6 32 × 64 × 5 × 5
conv 7 1 × 32 × 5 × 5
Prediction
Layer Filter
concat -
conv 1 64 × n × 5 × 5
conv 2 32 × 64 × 5 × 5
conv 3 1 × 32 × 5 × 5
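A sketch of the motion estimation network following the layer settings of Table I; the ReLU activations between convolutions and the "same" padding are assumptions, since the table only specifies the filter shapes.

```python
import torch.nn as nn

def motion_estimation_net(in_channels):
    """Motion estimation network per Table I (a sketch). The input channel count
    depends on the matching module; the output is a 2-channel motion flow."""
    def block(cin, cout):
        return [nn.Conv2d(cin, cout, kernel_size=5, padding=2), nn.ReLU(inplace=True)]
    layers = (block(in_channels, 128) + block(128, 128) + block(128, 64) +
              block(64, 64) + block(64, 32))
    layers.append(nn.Conv2d(32, 2, kernel_size=5, padding=2))   # 2-channel flow output
    return nn.Sequential(*layers)
```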

IV-B Experiment setup

We test four versions of our synchronization model: scene-level synchronization (denoted as SLS), and camera-level synchronization using concatenation, correlation, or correlation with epipolar-guided weights (denoted as CLS-cat, CLS-cor, CLS-epi) for the matching module (Section III-B2). The synchronization models are trained with the multi-view DNNs introduced in each application later.

We consider two training scenarios: 1) both synchronized and unsynchronized training data are available; 2) only unsynchronized training data is available, which is the more difficult setting. For the first scenario, we compare against two baseline methods: BaseS trains the DNN only on the synchronized data; BaseSU fine-tunes the BaseS model using the unsynchronized training data (using the full training set). For the second scenario, BaseU trains the DNN directly on the unsynchronized data. Note that traditional synchronization methods [29, 30, 31, 32, 33] are based on videos (temporal information) and assume high-fps cameras with fixed frame rates, which are unavailable in our problem setting. Thus, traditional and video-based synchronization methods are not suitable for comparison.

To test the proposed method, we first create unsynchronized multi-view datasets from existing multi-view datasets (the specific datasets are introduced with each application later). In particular, suppose the frame sequence in the reference view is captured at times $t_0 + k\Delta t$, where $\Delta t$ is the time offset between neighboring frames, $k \in \{0, \ldots, N-1\}$, and $N$ is the number of frames. For view $i$, the unsynchronized frames are captured at times $t_0 + k\Delta t + \delta_{i,k}$, where $\delta_{i,k}$ is the desynchronization time offset between view $i$ and the reference view. We consider two settings of the desynchronization offset (see the sketch below). The first is a constant latency for each view, $\delta_{i,k} = \tau_i$, for some constant value $\tau_i$. The second is random latency, where the offset for each frame and view is randomly sampled from a uniform distribution, $\delta_{i,k} \sim U(-\kappa_i, \kappa_i)$. Finally, since the synchronization is with the reference view, the ground-truth labels for the multi-view task correspond to the times of the reference view, $t_0 + k\Delta t$.
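A sketch of this desynchronization protocol; `tau` and `kappa` are the per-view constant-latency and random-latency parameters (indexed by view, with view 0 unused), and the function only generates capture times, with the selection of the nearest source-video frame omitted.

```python
import numpy as np

def desync_capture_times(num_frames, num_views, dt, tau=None, kappa=None,
                         t0=0.0, seed=0):
    """Capture times for the desynchronized setting (Sec. IV-B, a sketch).
    The reference view (view 0) is captured at t0 + k*dt; every other view i is
    offset either by a constant latency tau[i] or by a random latency drawn from
    U(-kappa[i], kappa[i]) independently for every frame."""
    rng = np.random.default_rng(seed)
    k = np.arange(num_frames)
    times = np.tile(t0 + k * dt, (num_views, 1)).astype(float)   # [views, frames]
    for i in range(1, num_views):
        if tau is not None:                   # constant latency
            times[i] += tau[i]
        elif kappa is not None:               # random latency per frame
            times[i] += rng.uniform(-kappa[i], kappa[i], size=num_frames)
    return times
```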

TABLE III: Unsynchronized multi-view counting: experiment results for a training set with both synchronized and unsynchronized frames. Two desynchronization settings are tested: constant latency and random latency. The evaluation metrics are MAE and NAE.
PETS2009 CityStreet
constant random constant random
loss model MAE NAE MAE NAE MAE NAE MAE NAE
$\ell_p$ BaseS 7.21 0.200 4.58 0.139 9.07 0.108 8.86 0.107
BaseSU 4.36 0.137 4.30 0.140 9.02 0.106 8.82 0.108
$\ell_p, \ell_W$ SLS 4.49 0.145 4.91 0.154 8.23 0.102 8.02 0.101
CLS-cat 4.18 0.130 4.85 0.150 8.82 0.111 8.57 0.108
CLS-cor 4.13 0.135 4.03 0.128 8.03 0.099 7.99 0.098
CLS-epi 3.95 0.130 4.09 0.129 8.05 0.100 7.93 0.096
TABLE IV: Unsynchronized multi-view counting: experiment results for training set with only unsynchronized frames, under constant and random latency.
PETS2009 CityStreet
constant random constant random
loss model MAE NAE MAE NAE MAE NAE MAE NAE
$\ell_p$ BaseU 6.18 0.187 6.22 0.192 10.22 0.134 9.35 0.121
$\ell_p, \ell_s$ SLS 5.37 0.178 4.82 0.150 8.50 0.105 8.33 0.100
CLS-cat 6.00 0.186 6.08 0.189 8.48 0.102 9.17 0.110
CLS-cor 4.18 0.136 4.34 0.136 8.02 0.098 7.77 0.093
CLS-epi 4.25 0.135 4.77 0.144 8.04 0.095 7.70 0.094
$\ell_p$ SLS 7.13 0.226 5.30 0.162 8.77 0.107 8.45 0.107
CLS-cat 6.30 0.194 5.98 0.184 8.28 0.098 9.15 0.108
CLS-cor 4.25 0.138 4.49 0.141 8.20 0.099 8.10 0.102
CLS-epi 4.27 0.135 4.53 0.143 8.16 0.097 7.86 0.096
TABLE V: Unsynchronized multi-view counting: experiment results for training set with only unsynchronized frames, under constant and random latency and using ground-truth calculated from unsynchronized multi-view frames.
PETS2009 CityStreet
constant random constant random
loss model MAE NAE MAE NAE MAE NAE MAE NAE
$\ell_p$ BaseU 14.89 0.458 10.95 0.484 10.96 0.146 11.30 0.149
$\ell_p, \ell_s$ SLS 6.80 0.229 6.58 0.283 9.18 0.111 9.49 0.117
CLS-cat 7.41 0.237 6.10 0.237 9.72 0.130 9.69 0.129
CLS-cor 5.91 0.201 5.93 0.240 8.55 0.106 8.31 0.107
CLS-epi 5.72 0.184 4.80 0.187 8.32 0.104 8.05 0.102
$\ell_p$ SLS 7.85 0.274 7.22 0.313 9.31 0.109 8.91 0.108
CLS-cat 7.52 0.240 6.20 0.243 8.48 0.107 9.85 0.121
CLS-cor 6.98 0.244 6.26 0.282 8.03 0.099 8.24 0.107
CLS-epi 6.80 0.229 5.18 0.200 8.23 0.102 8.16 0.103

IV-C Unsynchronized multi-view counting

We first apply our synchronization model to an unsynchronized multi-view counting system, where bandwidth is assumed to be limited and the frame latency between cameras may be fixed or random. Here we adopt the multi-view multi-scale fusion model (MVMS) from [9], a state-of-the-art DNN for multi-view counting. We embed the synchronization models into the MVMS model to handle the unsynchronized multi-view frames for crowd counting.

IV-C1 Datasets and metric

Two multi-view counting datasets used in [9], PETS2009 [56] and CityStreet [9], are selected and desynchronized for the experiments.

PETS2009 contains 3 views (cameras 1, 2, and 3), and the first camera view is chosen as the reference view. The input image resolution ($w \times h$) is 384×288, and the ground-truth scene-level density map resolution is 152×177. There are 825 multi-view frames for training and 514 frames for testing. The frame rate of PETS2009 is 7 fps ($\Delta t = 1/7$ s). For constant frame latency, $\tau_i \in \{5\,\mathrm{s}, -5\,\mathrm{s}\}$ is used for cameras 2 and 3, and $\kappa_i = 5$ s is used for random latency.

CityStreet, proposed in [9], consists of 3 views (cameras 1, 3, and 4), and camera 1 is chosen as the reference view. The input image resolution is 676×380 and the ground-truth density map resolution is 160×192. There are 500 multi-view frames; the first 300 are used for training and the remaining 200 for testing. The frame rate of CityStreet is 1 fps ($\Delta t = 1$ s; we obtained the higher-fps version from the dataset authors). For constant latency, $\tau_i \in \{3\,\mathrm{s}, -3\,\mathrm{s}\}$ is used for cameras 3 and 4, and $\kappa_i = 3$ s for random latency.

Following [9], the mean absolute error (MAE) and normalized absolute error (NAE) of the predicted counts on the test set are used as the evaluation metrics:

\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{c}_i - c_i|,   (23)
\mathrm{NAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{c}_i - c_i| / c_i,   (24)

where $c_i$ is the ground-truth count, $\hat{c}_i$ is the predicted count, and $N$ is the number of test images.
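For reference, the metrics of Eqs. (23)-(24) can be computed as follows.

```python
import numpy as np

def counting_errors(pred_counts, gt_counts):
    """MAE and NAE of Eqs. (23)-(24) over the test set."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.mean(np.abs(pred - gt))
    nae = np.mean(np.abs(pred - gt) / gt)
    return mae, nae
```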

IV-C2 Results for training with synced and unsynced frames

The experiment results when training with both synchronized and unsynchronized frames are shown in Table III. The hyperparameter $\gamma = 1$ is used for the feature warping loss. On both datasets, our camera-level synchronization methods, CLS-cor and CLS-epi, perform better than the other methods, including the baselines, demonstrating the efficacy of our approach. Scene-level synchronization (SLS) performs worse than the camera-level synchronization methods (CLS), due to the ambiguity of the projected multi-view features. Furthermore, after projection to the ground plane, the crowd movement between frames $I_i^{t_0}$ and $I_i^{t_i}$ is less salient due to the low resolution of the ground-plane feature map. CLS-cat performs worst among the CLS methods because simple concatenation of features cannot capture the image correspondence between different views needed to estimate the motion flow. Finally, the two baselines (BaseS and BaseSU) perform badly on CityStreet because of the larger scene with larger crowd movement between neighboring frames (due to the lower frame rate).

IV-C3 Results for training with only unsynchronized frames

The experiment results when training with only unsynchronized frames (a more practical real-world case) are shown in Table IV. Since synchronized frames are not available, the MVMS model weights are trained from scratch using only unsynchronized data. Our models are trained with the similarity loss $\ell_s$ (with hyperparameter $\gamma = 1000$), which encourages alignment of the projected multi-view features. Generally, without synchronized frames in the training stage, the counting error increases for each method. Nonetheless, the proposed camera-level synchronization models CLS-cor and CLS-epi perform much better than the baseline BaseU. CLS-cor and CLS-epi trained on only unsynchronized data also perform better than (on CityStreet) or on par with (on PETS2009) the baseline BaseSU, which uses both synchronized and unsynchronized training data. These two results demonstrate the efficacy of our synchronization model when only unsynchronized training data is available. Finally, the error of almost all synchronization models increases on both datasets when training without the similarity loss ($\ell_p$ in Table IV). This demonstrates the effectiveness of using $\ell_s$ to align the multi-view features during training.

TABLE VI: Ablation study on the multi-scale architecture of the proposed methods for multi-view counting on the CityStreet dataset. The top rows show performance when training with synchronized and unsynchronized frames using the feature warping loss $\ell_W$. The bottom rows show training with only unsynchronized frames using the feature similarity loss $\ell_s$.
Loss/Training data Method Multi-scale Single-scale
MAE NAE MAE NAE
SLS 8.02 0.101 8.31 0.100
$\ell_p, \ell_W$ / CLS-cat 8.57 0.108 8.77 0.102
sync and unsync CLS-cor 7.99 0.098 8.25 0.099
CLS-epi 7.93 0.096 8.12 0.098
SLS 8.33 0.100 8.95 0.112
$\ell_p, \ell_s$ CLS-cat 9.17 0.110 9.54 0.116
unsync CLS-cor 7.77 0.093 8.62 0.111
CLS-epi 7.70 0.094 8.59 0.110

IV-C4 Results for using ground-truth from unsynchronized multi-view images

In the previous experiments (training with only unsynchronized frames, see Sec. IV-C3), the ground truth corresponds (is synchronized) to the frames of the reference view. We also perform experiments where the ground-truth scene-level density maps are calculated from the unsynchronized multi-view images. Specifically, we project the same person's image coordinates from each unsynchronized view to the world plane and use the average of the projection results as the ground-truth person location on the ground. Then, we use the obtained person location map to generate the scene-level density map.

The results for training with only unsynchronized frames and ground truth computed from the unsynchronized multi-view images are shown in Table V. From the table, we find that the proposed CLS-cor and CLS-epi achieve better performance than the other methods, with CLS-epi performing best, and that performance is further improved by adding the similarity loss $\ell_s$.

TABLE VII: The mean and variance of the feature map values before and after feature warping for views 2 and 3 of the CityStreet dataset.
Method view 2 view 3
before warping 0.686 ± 1.006 0.777 ± 0.704
after warping 0.670 ± 0.976 0.761 ± 0.692

IV-C5 Ablation study on the multi-scale architecture

We next present an ablation study on the multi-scale architecture for multi-view counting in Table VI. Generally, the multi-scale architecture performs better than the single-scale architecture, and the proposed CLS-cor/CLS-epi perform better than SLS or CLS-cat under both single-scale and multi-scale architectures, and under both training paradigms (synced and unsynced, or only unsynced).

TABLE VIII: Comparison of methods on the CityStreet dataset trained with only unsynchronized frames, under constant and random latency. Entries are MAE/NAE.
Method constant random
Color correction 8.90/0.108 8.64/0.100
SLS 8.50/0.105 8.33/0.100
CLS-cat 8.48/0.102 9.17/0.110
CLS-cor 8.02/0.098 7.77/0.093
CLS-epi 8.04/0.095 7.70/0.094

IV-C6 Ablation study on color correction

The feature warping module only applies a spatial shift to the features of the unsynced views, i.e., it does not change the values (e.g., color) of the unsynced features (see Eqs. 5 and 10). To demonstrate this, we calculate the average statistics (mean and variance) of the feature maps before and after feature warping for views 2 and 3 of CityStreet, and present the results in Table VII. The statistics of the feature maps do not change much after feature warping, and thus the performance improvement from the feature warping module is not due to color correction (feature value changes).

We further perform an ablation study to show that image color correction by itself cannot solve the frame desynchronization problem. On the CityStreet dataset, we add a learnable "color correction" layer to the baseline model (MVMS [9]), comprising an extra 1×1 convolution layer (32 channels) in the branches of the other camera views before the projection and fusion step. The results are denoted as "color correction" in Table VIII. The error when using "color correction" is worse than that of the proposed SLS, CLS-cor, and CLS-epi. The reason is that the desynchronization issue comes from the capture-time difference between camera views, which is better addressed by spatially shifting features rather than by color correction (changing feature values).

IV-C7 Model size and running speed comparison

TABLE IX: Comparison of the number of model parameters and running speed of the baseline methods BaseS/BaseSU/BaseU and the proposed SLS, CLS-cat, CLS-cor, and CLS-epi for multi-view counting on the CityStreet dataset. The input resolution for the correlation step of the camera-view synchronization module is 160×95.
Method Params FPS
BaseS/BaseSU/BaseU 853.4K 21.9
SLS 3.7M 8.3
CLS-cat 3.7M 8.9
CLS-cor 37.3M 7.2
CLS-epi 37.3M 3.6

We present the model size (number of parameters) and running speed of the baseline methods and the proposed SLS, CLS-cat, CLS-cor, and CLS-epi in Table IX. The input resolution for the correlation step of the camera-view synchronization module is 160×95. All models are tested on the CityStreet dataset with an Nvidia 1080Ti GPU. The baseline methods (BaseS, BaseSU, and BaseU) do not use view synchronization modules, so their model sizes are smaller and running speeds are faster. The proposed CLS-cor and CLS-epi methods contain the correlation module, and thus have more parameters than SLS or CLS-cat. CLS-epi is slower than CLS-cor due to the extra multiplication step with the epipolar weights.

Figure 5: Examples of unsynchronized multi-view crowd counting on PETS2009 (top) and CityStreet (bottom). The left shows the input multi-view frames; note that the synchronized frames (in the dotted box) are not used when training with only unsynchronized frames. The input unsynchronized frames are randomly selected around the synchronized frames. For each dataset, the result of training with synchronized and unsynchronized frames ($\ell_p$ and $\ell_W$) is in row 1, the result of training only with unsynchronized frames ($\ell_p$) is in row 2, and the result of training with unsynchronized frames and the similarity loss between projected features ($\ell_p$ and $\ell_s$) is in row 3. Generally, 1) the proposed synchronization methods CLS-epi and CLS-cor predict density maps of better quality than the comparison methods; 2) all methods achieve better performance when synchronized frames are available in training; 3) when training only with unsynchronized frames, enforcing the similarity loss $\ell_s$ helps improve performance.

IV-C8 Visualization results

Example results are shown in Fig. 5. Generally, the proposed synchronization methods CLS-epi and CLS-cor predict density maps of better quality, e.g., in the red-box regions of the figure, where the comparison methods tend to over-count because the same person is counted multiple times across unsynchronized frames. Furthermore, we observe that the predicted density maps have better quality when synchronized frames are available, compared to training with only unsynchronized frames. Finally, the predictions improve when the similarity loss is enforced while training with only unsynchronized frames, e.g., for CLS-epi and CLS-cor on PETS2009.

IV-D Unsynchronized 3D human pose estimation

We next apply our synchronization model to the unsynchronized 3D human pose estimation task. The DNN model for 3D human pose estimation is adopted from [2], which proposed two learnable triangulation methods for multi-view 3D human pose estimation from multiple 2D views: algebraic triangulation and volumetric aggregation. Here we use volumetric aggregation (softmax aggregation) as the multi-view fusion DNN in the experiments.

TABLE X: Unsynchronized 3D human pose estimation: experiment results with random latency. For ‘CLS-cor’ and ‘CLS-epi’, the consistency loss hyperparameter is $\gamma=0.01$. The evaluation metrics are MPJPE and absolute position MPJPE (left/right).
Latency 8/50 s 32/50 s 64/50 s
BaseS 62.8/59.2 78.6/78.2 151.1/151.5
BaseSU 26.5/27.8 49.9/50.1 69.4/69.2
BaseU 37.3/38.9 50.9/50.6 71.0/70.7
CLS-cor ($\gamma$=0) 25.8/26.9 36.5/36.7 56.6/56.9
CLS-cor 25.8/27.0 38.2/38.7 46.8/47.1
CLS-epi 25.7/26.8 37.6/37.8 45.7/45.6
TABLE XI: Detailed performance for unsynchronized 3D human pose estimation with random latency $\kappa_i = 8/50$ s. The evaluation metric is MPJPE.
Pose BaseS BaseSU BaseU CLS-cor ($\gamma$=0) CLS-cor CLS-epi
Directions 42.8 29.3 34.3 26.1 25.8 26.1
Discussion 60.7 28.4 38.8 27.3 26.7 27.0
Eating 60.7 26.4 28.8 23.9 24.0 23.4
Greeting 63.8 19.7 32.3 25.3 24.3 25.1
PhoneCall 52.2 25.7 31.0 24.7 24.5 24.4
Posing 49.7 22.0 27.6 24.1 24.0 24.0
Purchases 67.5 24.4 52.5 28.7 27.4 28.8
Sitting 33.2 22.6 36.6 23.8 24.0 24.0
SittingDown 37.4 25.7 66.6 25.9 26.8 27.2
Smoking 42.2 25.7 31.2 24.8 24.3 24.4
TakingPhoto 59.9 24.3 44.2 28.2 27.9 27.2
Waiting 44.3 19.5 35.8 23.2 23.8 24.2
Walking 161.1 31.9 32.1 27.0 30.2 27.8
WalkingDogs 91.5 34.2 54.8 30.1 30.1 29.8
WalkingTogether 126.8 33.9 31.8 25.5 26.8 25.5
Average 62.8 26.5 37.3 25.8 25.8 25.7
TABLE XII: Detailed performance for unsynchronized 3D human pose estimation with random latency $\kappa_i = 32/50$ s. The evaluation metric is MPJPE.
Pose BaseS BaseSU BaseU CLS-cor ($\gamma$=0) CLS-cor CLS-epi
Directions 46.2 48.7 65.1 42.9 42.5 43.9
Discussion 75.6 53.6 55.2 38.9 41.0 41.6
Eating 64.5 39.1 40.2 32.5 32.7 30.8
Greeting 71.5 48.5 55.4 35.7 38.1 36.8
PhoneCall 64.5 43.6 43.0 33.9 35.2 35.1
Posing 49.3 42.1 43.3 32.7 33.3 30.8
Purchases 111.5 50.9 48.7 35.9 42.4 40.4
Sitting 55.2 46.0 43.7 33.6 33.8 34.7
SittingDown 108.3 79.3 64.9 36.8 41.8 42.8
Smoking 54.5 44.3 44.0 35.5 35.9 35.7
TakingPhoto 87.9 57.0 58.6 39.3 43.0 41.3
Waiting 64.3 45.6 47.3 35.5 33.7 35.0
Walking 150.6 47.6 48.1 34.2 37.1 34.2
WalkingDogs 123.1 66.2 67.5 44.5 49.2 49.1
WalkingTogether 125.5 50.3 52.7 36.9 38.5 34.9
Average 78.6 49.9 50.9 36.5 38.2 37.6
TABLE XIII: Detailed performance for unsynchronized 3D human pose estimation with random latency $\kappa_i = 64/50$ s. The evaluation metric is MPJPE.
Pose BaseS BaseSU BaseU CLS-cor ($\gamma$=0) CLS-cor CLS-epi
Directions 99.2 83.2 76.3 70.3 64.8 66.5
Discussion 144.1 72.0 67.5 57.3 48.2 48.4
Eating 138.2 55.3 63.2 44.7 40.4 37.9
Greeting 181.3 68.0 74.1 54.8 46.9 46.3
PhoneCall 138.1 58.8 61.1 49.2 40.7 40.5
Posing 121.7 53.5 50.2 42.1 36.6 36.3
Purchases 155.7 69.0 62.4 58.1 47.0 50.6
Sitting 74.0 64.2 67.8 55.2 41.1 39.6
SittingDown 103.8 112.3 140.7 89.8 54.6 50.6
Smoking 112.7 58.7 60.3 49.2 41.7 41.7
TakingPhoto 166.8 76.8 79.6 64.5 57.7 53.5
Waiting 120.2 62.7 61.1 51.1 42.0 42.6
Walking 301.1 66.3 69.2 49.7 44.0 41.7
WalkingDogs 219.2 95.9 91.6 77.0 62.7 55.5
WalkingTogether 302.7 67.2 68.9 54.5 43.5 42.5
Average 151.1 69.4 71.0 56.6 46.8 45.7
TABLE XIV: Unsynchronized 3D human pose estimation: CLS-epi experiment results with different values of the hyperparameter $\gamma$. The evaluation metric is MPJPE.
$\gamma$ 0.005 0.01 0.02
$\kappa_i = 8/50$ s 25.6 25.7 26.0
$\kappa_i = 32/50$ s 38.3 37.6 37.9
$\kappa_i = 64/50$ s 51.7 45.7 46.8

IV-D1 Datasets and Metrics

We use the Human3.6M [57] dataset, which consists of 3.6 million frames from 4 synchronized 50 Hz digital cameras along with 3D pose annotations. We follow the preprocessing step recommended for [57] (https://github.com/anibali/h36m-fetch, accessed Oct. 10, 2019), sampling one of every 64 frames ($\Delta t = 64/50$) for the test set and one of every 4 frames ($\Delta t = 4/50$) for the training set. The first camera view is always used as the reference view (if the first camera view is missing, the second one is used). We test desynchronization via random frame latency with $\kappa_i \in \{8/50, 32/50, 64/50\}$ seconds, and only use unsynchronized data for training. Following [2], Mean Per Joint Position Error (MPJPE) and absolute position MPJPE are used as the evaluation metrics. In training, the single-view backbone uses the pretrained weights from the original 3D pose estimation model. The baseline methods BaseS, BaseSU and BaseU are compared with our proposed camera-view synchronization models CLS-cor and CLS-epi.
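For clarity, a minimal NumPy sketch of the MPJPE metric is given below; the root-joint index and array layout are assumptions for illustration, not the exact evaluation code of [2].

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray, align_root: bool = True) -> float:
    """Mean Per Joint Position Error, in the units of the inputs (e.g., mm).

    pred, gt: (N, J, 3) arrays of predicted and ground-truth 3D joints.
    align_root=True subtracts the root joint (assumed index 0) from both poses;
    align_root=False evaluates in absolute scene coordinates
    (the "absolute position MPJPE" reported in the tables).
    """
    if align_root:
        pred = pred - pred[:, :1, :]
        gt = gt - gt[:, :1, :]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Example with placeholder data: 10 frames, 17 joints.
err = mpjpe(np.random.rand(10, 17, 3), np.random.rand(10, 17, 3))
```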

Refer to caption
Figure 6: Examples of unsynchronized 3D pose estimation (Walking Dogs). The first row shows the input unsynchronized multi-view frames, and the top labels indicate the unsynchronized frame latency (in seconds). The remaining rows show the ground-truth key-joints and the predicted results. Blue lines are the 2D key-joints projected from the 3D poses, and the synchronized frames are used for better visualization. CLS-epi achieves the best performance among all methods, especially for the prediction of the arms in view 0.
Refer to caption
Figure 7: Examples of unsynchronized 3D pose estimation (Greeting). Blue lines are the 2D key-joints projected from 3D poses, and the synchronized frames are used for better visualization. CLS-epi achieves the best performance.

IV-D2 Experiment results

The experiment results are presented in Table X. The original 3D pose estimation baselines (BaseS, BaseSU and BaseU) do not perform well under the unsynchronized test condition, especially under large latencies (e.g., 64/50 s). Our camera-view synchronization methods perform better than the baseline methods, with the performance gap increasing as the latency increases. Using the similarity loss $\ell_s$ improves the performance of our models, and adding epipolar-guided weights suppresses false matches and further reduces the error. The detailed performance for each pose type under different frame latency settings is shown in Tables XI, XII and XIII. From the tables, we find that the proposed methods perform especially well on poses with larger movement between unsynchronized frames, e.g., Walking, WalkingDogs and WalkingTogether.

IV-D3 Ablation study on $\gamma$ for 3D pose estimation

The ablation study on the hyperparameter $\gamma$ for CLS-epi on 3D pose estimation is presented in Table XIV. In general, $\gamma=0.01$ achieves better performance than the other values.

IV-D4 Model size and running speed comparison

We present the model sizes and running speeds of our proposed models and the baselines for 3D pose estimation in Table XV. The input resolution for the correlation step of the camera-view synchronization module is 48×48. As the original synchronized 3D pose estimation model [2] is already very large, the running speeds of the proposed models CLS-cor and CLS-epi are similar to those of the baseline methods BaseS/BaseSU/BaseU.

TABLE XV: Model size (number of parameters) and running speed of the baseline methods BaseS/BaseSU/BaseU and the proposed CLS-cor and CLS-epi for 3D pose estimation on the Human3.6M dataset. The input resolution for the correlation step of the camera-view synchronization module is 48×48.
Method Params. FPS
BaseS/BaseSU/BaseU 80.6M 3.7
CLS-cor 86.3M 3.4
CLS-epi 86.3M 3.0

IV-D5 Visualization results

Visualization results of unsynchronized 3D pose estimation are presented in Figs. 6 and 7. In each figure, the first row shows the input unsynchronized multi-view frames, and the top labels indicate the unsynchronized frame latency. Rows 2-8 show the 2D key-joints projected from the 3D poses of the ground truth, BaseS, BaseSU, BaseU, CLS-cor ($\gamma$=0), CLS-cor and CLS-epi, respectively, where synchronized frames are displayed for better visualization. In Fig. 6, BaseU fails on the unsynchronized input, and CLS-epi achieves the best performance among all methods, especially in the prediction of the arms in view 1. In Fig. 7, CLS-epi also achieves the best performance among all comparison methods.

V Conclusion

In this paper, we focus on the issue of unsynchronized cameras in DNN-based multi-view computer vision tasks. We propose two view synchronization models based on single frames (not videos) from each view: scene-level synchronization and camera-level synchronization. The two models are trained and evaluated under two training settings (with or without synchronized frame pairs), and a similarity loss on the projected multi-view features is proposed to boost performance when synchronized training pairs are not available. Furthermore, to show their generality to different conditions of desynchronization, the proposed models are tested with desynchronization based on both constant and random latency. Finally, the proposed models are applied to unsynchronized multi-view counting and unsynchronized 3D human pose estimation, and achieve better performance than the baseline methods. Overall, the camera-level synchronization model using correlation and epipolar weights (CLS-epi) performs best among the proposed models.

In addition to unsynchronized multi-camera crowd counting and 3D pose estimation, the proposed method can also be applied to other multi-camera vision tasks, such as multi-camera detection [7] and multi-camera tracking [18]. In these tasks, the cameras may also be unsynchronized due to the lack of a synchronization clock or limited network bandwidth. As these DNN models [7, 18] generally follow the three-stage pipeline (single-view feature extraction, multi-view projection and fusion, and prediction), our proposed synchronization modules can be inserted to adapt them to unsynchronized frames, as sketched below.
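The sketch below illustrates this insertion point in a generic three-stage multi-view model; all component modules (backbone, synchronization, projection/fusion, prediction head) are placeholders supplied by the host task, not the exact implementations used in this paper.

```python
import torch.nn as nn

class UnsyncMultiViewModel(nn.Module):
    """Generic three-stage multi-view pipeline (single-view feature extraction,
    multi-view projection and fusion, prediction) with a camera-level
    synchronization module inserted before projection."""

    def __init__(self, backbone, sync_module, project_fuse, head):
        super().__init__()
        self.backbone = backbone          # per-view 2D feature extractor
        self.sync = sync_module           # warps other views' features to the reference time
        self.project_fuse = project_fuse  # projection to a common frame + multi-view fusion
        self.head = head                  # task-specific prediction (counting, pose, tracking, ...)

    def forward(self, frames):
        # frames: list of per-view image tensors; view 0 is the reference view.
        feats = [self.backbone(f) for f in frames]
        ref = feats[0]
        # Synchronize each non-reference view's features to the reference time.
        synced = [ref] + [self.sync(ref, f) for f in feats[1:]]
        return self.head(self.project_fuse(synced))
```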

In our current model, image content matching is used for view synchronization, while the 2D-to-3D projection for multi-view fusion relies on known camera parameters. The multi-camera surveillance tasks themselves require known calibration for better multi-view fusion. Note that our proposed view synchronization module based on correlation maps (CLS-cor) does not require camera calibration, owing to its single-frame basis, and still achieves good performance. When calibrations are provided, the epipolar constraint can be utilized to achieve better results (CLS-epi). In future work, the 2D-to-3D projection in the original multi-view models could be replaced with camera self-calibration modules, which would allow the model to handle unsynchronized and uncalibrated multi-cameras.

VI Acknowledgements

This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. [T32-101/15-R] and CityU 11212518).

References

  • [1] H. Aghajan and A. Cavallaro, Multi-camera networks: principles and applications.   Academic press, 2009.
  • [2] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov, “Learnable triangulation of human pose,” in ICCV, 2019.
  • [3] E. Remelli, S. Han, S. Honari, P. Fua, and R. Wang, “Lightweight multi-view 3d pose estimation through camera-disentangled representation,” in CVPR, 2020.
  • [4] A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,” in NIPS, 2017, pp. 365–376.
  • [5] P.-H. Huang, K. Matzen, and et al., “Deepmvs: Learning multi-view stereopsis,” in CVPR, 2018, pp. 2821–2830.
  • [6] P. Baqué, F. Fleuret, and P. Fua, “Deep occlusion reasoning for multi-camera multi-target detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 271–279.
  • [7] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret, “Wildtrack: A multi-camera hd dataset for dense unscripted pedestrian detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5030–5039.
  • [8] L. Chen, H. Ai, R. Chen, Z. Zhuang, and S. Liu, “Cross-view tracking for multi-human 3d pose estimation at over 100 fps,” in CVPR, 2020.
  • [9] Q. Zhang and A. B. Chan, “Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8297–8306.
  • [10] M. Ye, X. Lan, Q. Leng, and J. Shen, “Cross-modality person re-identification via modality-aware collaborative ensemble learning,” IEEE Transactions on Image Processing, vol. 29, pp. 9387–9399, 2020.
  • [11] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person re-identification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3702–3712.
  • [12] M. Ye and P. C. Yuen, “Purifynet: A robust person re-identification model with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2655–2666, 2020.
  • [13] M. Ye, J. Shen, X. Zhang, P. C. Yuen, and S.-F. Chang, “Augmentation invariant and instance spreading feature for softmax embedding,” IEEE transactions on pattern analysis and machine intelligence, 2020.
  • [14] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang, “Unsupervised domain adaptive re-identification: Theory and practice,” Pattern Recognition, 2018.
  • [15] E. Zheng, D. Ji, E. Dunn, and J.-M. Frahm, “Sparse dynamic 3d reconstruction from unsynchronized videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4435–4443.
  • [16] X. Zhang, B. Ozbay, M. Sznaier, and O. Camps, “Dynamics enhanced multi-camera motion segmentation from unsynchronized videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4668–4676.
  • [17] Q. Zhang and A. B. Chan, “3d crowd counting via multi-view fusion with 3d gaussian kernels,” in AAAI Conference on Artificial Intelligence, 2020.
  • [18] Y. He, J. Han, W. Yu, X. Hong, and Y. Gong, “City-scale multi-camera vehicle tracking by semantic attribute parsing and cross-camera tracklet matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2020.
  • [19] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified approach for single and multi-view 3d object reconstruction,” in ECCV.   Springer, 2016, pp. 628–644.
  • [20] H. Xie, H. Yao, X. Sun, S. Zhou, and S. Zhang, “Pix2vox: Context-aware 3d reconstruction from single and multi-view images,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [21] M. Kocabas, S. Karagoz, and E. Akbas, “Self-supervised learning of 3d human pose using multi-view geometry,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1077–1086.
  • [22] C.-H. Chen, A. Tyagi, A. Agrawal, D. Drover, S. Stojanov, and J. M. Rehg, “Unsupervised 3d pose estimation with geometric self-supervision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5714–5724.
  • [23] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh, “Panoptic studio: A massively multiview system for social motion capture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3334–3342.
  • [24] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, “Harvesting multiple views for marker-less 3d human pose annotations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6988–6997.
  • [25] K. Takahashi, D. Mikami, M. Isogawa, and H. Kimata, “Human pose as calibration pattern; 3d human pose estimation with multiple unsynchronized and uncalibrated cameras,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1775–1782.
  • [26] C. Albl, Z. Kukelova, A. Fitzgibbon, J. Heller, M. Smid, and T. Pajdla, “On the two-view geometry of unsynchronized cameras,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4847–4856.
  • [27] T. Kuo, S. Sunderrajan, and B. Manjunath, “Camera alignment using trajectory intersections in unsynchronized videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1121–1128.
  • [28] N. Hasler, B. Rosenhahn, T. Thormahlen, M. Wand, and H. P. Seidel, “Markerless motion capture with unsynchronized moving cameras,” in Computer Vision and Pattern Recognition, 2009.
  • [29] C. Dai, Y. Zheng, and L. Xin, “Subframe video synchronization via 3d phase correlation,” in ICIP, 2006.
  • [30] F. Padua, R. Carceroni, G. Santos, and K. Kutulakos, “Linear sequence-to-sequence alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 304–320, 2008.
  • [31] C. Lei and Y.-H. Yang, “Tri-focal tensor-based multiple video synchronization with subframe optimization,” IEEE Transactions on Image Processing, vol. 15, no. 9, pp. 2473–2480, 2006.
  • [32] P. A. Tresadern and I. D. Reid, “Video synchronization from human motion using rank constraints,” Computer Vision and Image Understanding, vol. 113, no. 8, pp. 891–906, 2009.
  • [33] J. Yan and M. Pollefeys, “Video synchronization via space-time interest point distribution,” in Advanced Concepts for Intelligent Vision Systems, vol. 1, 2004, pp. 12–21.
  • [34] Y. Caspi, D. Simakov, and M. Irani, “Feature-based sequence-to-sequence matching,” International Journal of Computer Vision, vol. 68, no. 1, pp. 53–64, 2006.
  • [35] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood, “View-invariant alignment and matching of video sequences,” in Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003, pp. 939–945.
  • [36] B. Meyer, T. Stich, M. A. Magnor, and M. Pollefeys, “Subframe temporal alignment of non-stationary cameras.” in BMVC, 2008, pp. 1–10.
  • [37] E. Imre and A. Hilton, “Through-the-lens synchronisation for heterogeneous camera networks.” in BMVC, 2012, pp. 1–11.
  • [38] D. Pundik and Y. Moses, “Video synchronization using temporal signals from epipolar lines,” in ECCV, 2010.
  • [39] S. N. Sinha and M. Pollefeys, “Synchronization and calibration of camera networks from silhouettes,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, 2004.
  • [40] T. Gaspar, P. Oliveira, and P. Favaro, “Synchronization of two independently moving cameras without feature correspondences,” 2014.
  • [41] I. Rocco, R. Arandjelovic, and J. Sivic, “Convolutional neural network architecture for geometric matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6148–6157.
  • [42] S. Phillips and K. Daniilidis, “All graphs lead to rome: Learning geometric and cycle-consistent representations with graph convolutional networks,” arXiv preprint arXiv:1901.02078, 2019.
  • [43] H. Altwaijry, E. Trulls, J. Hays, P. Fua, and S. Belongie, “Learning to match aerial images with deep attentive architectures,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3539–3547.
  • [44] T.-W. Hui, X. Tang, and C. Change Loy, “Liteflownet: A lightweight convolutional neural network for optical flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8981–8989.
  • [45] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2462–2470.
  • [46] M. Bai, W. Luo, K. Kundu, and R. Urtasun, “Exploiting semantic information and deep matching for optical flow,” in European Conference on Computer Vision.   Springer, 2016, pp. 154–170.
  • [47] W. Yu, K. Yang, Y. Bai, H. Yao, and Y. Rui, “Dnn flow: Dnn feature pyramid based image matching.” in BMVC, 2014.
  • [48] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2758–2766.
  • [49] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4161–4170.
  • [50] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
  • [51] H.-Y. Lai, Y.-H. Tsai, and W.-C. Chiu, “Bridging stereo matching and optical flow via spatiotemporal correspondence,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1890–1899.
  • [52] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
  • [53] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in ECCV, 2018.
  • [54] G. Moon, J. Y. Chang, and K. M. Lee, “V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map,” in CVPR, 2018, pp. 5079–5088.
  • [55] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision, 2016, pp. 483–499.
  • [56] J. Ferryman and A. Shahrokni, “Pets2009: Dataset and challenge,” in IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2009, pp. 1–6.
  • [57] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2013.