
Bidirectionally Deformable Motion Modulation For
Video-based Human Pose Transfer

Wing-Yin Yu, Lai-Man Po, Ray C.C. Cheung, Yuzhi Zhao, Yu Xue, Kun Li
Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
{wingyinyu8, yzzhao2, yuxue22, kunli25}-c@my.cityu.edu.hk, {eelmpo, r.cheung}@cityu.edu.hk
Abstract

Video-based human pose transfer is a video-to-video generation task that animates a plain source human image based on a series of target human poses. Considering the difficulties in transferring highly structural patterns on the garments and discontinuous poses, existing methods often generate unsatisfactory results such as distorted textures and flickering artifacts. To address these issues, we propose a novel Deformable Motion Modulation (DMM) that utilizes geometric kernel offset with adaptive weight modulation to simultaneously perform feature alignment and style transfer. Different from normal style modulation used in style transfer, the proposed modulation mechanism adaptively reconstructs smoothed frames from style codes according to the object shape through an irregular receptive field of view. To enhance the spatio-temporal consistency, we leverage bidirectional propagation to extract the hidden motion information from a warped image sequence generated by noisy poses. The proposed feature propagation significantly enhances the motion prediction ability by forward and backward propagation. Both quantitative and qualitative experimental results demonstrate superiority over the state-of-the-arts in terms of image fidelity and visual continuity. The source code is publicly available at github.com/rocketappslab/bdmm.

Figure 1: Examples of a synthesized video clip based on some noisy poses. Existing methods such as GFLA [35] and Impersonator++ [25] fail to generate realistic videos due to spatio-temporally discontinuous poses and highly structural texture misalignment, while our method generates highly plausible textures with seamless transitions between consecutive frames. Please zoom in for more details.

1 Introduction

Video-based human pose transfer is the task of animating a plain source image according to a series of desired postures. It is challenging due to spatio-temporally discontinuous poses and highly structural texture misalignment, as depicted in Figure 1. In this paper, we aim to tackle these problems with an end-to-end generative model, enabling applications in various domains including person re-identification [49], fashion recommendation [13, 20], and virtual try-on [8, 33, 42].

Existing works fall into three categories for solving the spatial misalignment problem: prior generation [28, 45, 7, 27, 46], attention modules [34, 52, 39], and flow warping [35, 48]. These methods suffer from side effects such as spatially misaligned content, blurry visual quality, and unreliable flow prediction. Some methods [24, 22, 25] proposed to obtain the spatial transformation flow by computing vertex matching in a 3D neural rendering process. The main advantage is that more texture details of the source image are preserved. However, the generative networks struggle to render new content for occluded regions since flows in such regions are not accurate.

Temporal coherence is the main determinant of animated sequences with smooth human gesture movements. Unlike most generative tasks such as inpainting or super-resolution, the conditional inputs of the sequence in this task are noisy, because existing third-party human pose extractors [2, 21, 10] fail to extract accurate pose labels from the video frames. This increases the difficulty of predicting the temporal correspondence needed to generate a smooth sequence of frames, especially for highly structural patterns on garments and occluded regions. In general, previous works [24, 35, 25] mainly use recurrent neural networks to solve this problem by taking the previously generated result as the input of the current time step. However, the perceptual quality is still unsatisfactory due to the limited temporal receptive field. We observe that relying solely on unidirectional hidden states in recurrent units to interpolate the missing content is insufficient. This motivates us to utilize all the frames within the mini-batch to stabilize the temporal statistics of the generated sequence.

To alleviate the aforementioned problems, we propose a novel modulation mechanism, Deformable Motion Modulation (DMM), incorporated with bidirectional recurrent feature propagation to perform spatio-temporal affine transformation and style transfer simultaneously. It is designed with three major components: a motion offset, a motion mask, and a modulated style weight. To strengthen temporal consistency, the motion offset and mask estimate the local geometric transformation based on the features of two spatially misaligned adjacent frames, where the features come from both the forward and backward propagation branches. The bidirectional feature propagation encapsulates the temporal information of the entire sequence so that a long-range temporal correspondence, from the forward flow to the backward flow, can be captured at the current time. To retain more semantic details of the source image when processing the coarsely aligned features, the style weights are modulated by style codes extracted from the source image. The corresponding affine transformation is enhanced with augmented spatio-temporal sampling offsets, which produce a dynamic receptive field of view to track semantics and thereby synthesize a sequence of plausible and smooth video frames. The main contributions of this work are summarized as follows:

  • We propose a novel Deformable Motion Modulation that utilizes geometric kernel offset with adaptive weight modulation to perform spatio-temporal affine transformation and style transfer simultaneously;

  • We design a bidirectional recurrent feature propagation on coarsely warped images to generate target images on top of noisy poses, so that a long-range temporal correspondence of the sequence can be captured at the current time;

  • We demonstrate the superiority of our method in both quantitative and qualitative experimental results with a significant enhancement in perceptual quality in terms of visual fidelity and temporal consistency.

2 Related Work

Human Pose Transfer. Recent research in image-based human pose transfer can be categorized as prior-based, attention-based, and flow-based. Initial methods [28, 45] proposed prior-based generative models that combine the generated results with residual priors. In addition to residual maps, some solutions [7, 27] pre-generate the target parsing maps in order to enhance the semantic correspondence. Yu et al. [46] also introduced an edge prior to reconstruct the fragile high-frequency characteristics of garments. Although these priors are tailor-made to reconstruct details of the source image, inaccurately generated priors limit the ability to synthesize new content, especially under large occlusion variations. Some attention-based methods compute dense correspondences in feature space via activated pose attention [52] and spatial attention [34, 39]. Although these attentional operations achieve better scores on some quantitative evaluation metrics such as FID, the qualitative visualizations show a blurry effect on the generated images due to insufficient texture and shape guidance. In view of this problem, flow-based methods [22, 35, 48] warp the features of the source image by estimating the pose correspondence. Although they can preserve the characteristics of the source image, unreliable optical flow prediction is a bottleneck for transferring complex texture patterns.

Apart from spatial transformation, video-based human pose transfer has the additional challenge of maintaining temporal consistency. Current approaches [24, 35, 25] employ unidirectional forward propagation in recurrent networks to extract hidden temporal information. However, this is insufficient to produce a spatio-temporally smooth sequence because noisy poses cannot be reliably detected at certain time steps. To address this issue, Ren et al. [35] used a convolutional network to preprocess the 2D skeletons by transferring knowledge from 3D pose estimation in advance. Due to the domain gap between different datasets, reducing the number of key points in the heatmap limits the ability of flow prediction. Without training an extra network to perform noisy pose recovery, our method is still able to generate temporally coherent videos transferred from source images.

Video-to-video Generation. With the success of conditional Generative Adversarial Networks (cGANs) [9, 31], video-to-video models convert semantic input videos to photorealistic videos. Wang et al. [43] introduced a sequential generative model to extract feature correlations from adjacent frames. Due to its weak spatial transformation ability, it fails to produce plausible images. Siarohin et al. [37, 36] suggested simulating the motion directly from the driving images by using zeroth-order and first-order Taylor series expansions to estimate the transformation flow. However, this sacrifices the controllability of generating images for arbitrary poses because of domain gaps.

Deformable Convolutional Networks. To address the limited geometric transformation capability of Convolutional Neural Networks (CNNs) [19], Deformable Convolutional Networks (DCNs) [5, 51] learn kernel offsets by augmenting the spatial sampling locations. Deformable alignment regressed from flow-guided features has demonstrated effective spatial transformation capabilities in several generative tasks, including image inpainting [23] and image super-resolution [40, 4]. Inspired by these works, we are motivated to enhance the style transfer ability and the temporal coherence by modulating affine transformations from the source image.

Figure 2: Illustration of the proposed Deformable Motion Modulation (DMM) module. The motion offset and motion mask are parametrized by the output of the coarsely warped features $f_{i-1}$ in the forward branch or $b_{i+1}$ (skipped for simplicity) in the backward branch, the output of the previous layer $\widetilde{x}_i^{l-1}$ at time $i$, and the affine transformation based on $I_s$.
Figure 3: Overview of the proposed model. We use a bidirectional propagation mechanism to manipulate the coarsely spatially-aligned sequence rendered by vertex matching. The pose is encoded by a Structural Encoder with a self-recurrent convolution unit to capture structural guidance. The generator decoder progressively synthesizes target images by fusing features from the forward and backward propagation branches via the proposed Deformable Motion Modulation (DMM) block and the source style code extracted by a Style Encoder.

3 Methodology

To begin with, we define the notation used in this paper. We are given a source person image $I_s$, the corresponding source pose $P_s$, and a sequence of spatially arbitrary target poses $P_{1:M}$, where $M$ is the total number of frames in the sequence. The goal of video-based human pose transfer is to animate $I_s$ according to $P_{1:M}$ with the desired movements, including free-form view angles, postures, and body shapes. The proposed end-to-end recurrent generative model $\mathcal{G}$ can be formulated as $\hat{I}_{1:M} = \mathcal{G}(I_s, P_s, P_{1:M})$.

3.1 Deformable Motion Modulation (DMM)

The major challenge of video-based pose transfer is to maintain the spatio-temporally misaligned characteristics of $I_s$ while synthesizing unseen content according to the target poses. In this subsection, we introduce a new modulation mechanism, Deformable Motion Modulation (DMM), which synthesizes continuous frame sequences by modulating the affine transform of $I_s$ with an augmentation of spatio-temporal sampling locations. It aims to estimate local geometric transformations on an initially aligned feature space so that the smoothness of the propagated features in the forward and backward branches is enhanced. We design the proposed DMM with three components, namely the motion offset, the motion mask, and the style weight, inspired by the success of Deformable Convolution Networks (DCN) [5, 51] and StyleGANv2 [14, 15]. As depicted in Figure 2, we parametrize them by the output of the coarsely warped features $f_{i-1}$ in the forward branch or $b_{i+1}$ in the backward branch, the output of the previous layer $\widetilde{x}_i^{l-1}$ at time $i$, and the source style code from $I_s$. We first recall the standard convolution, formulated as

\widetilde{f}_i(p) = \sum_{k=1}^{K} w_k \cdot f_i(p + p_k) + bias, \quad (1)

where $K$ is the number of sampling locations of a kernel, $\widetilde{f}_i(p)$ is the convolution result of the input feature $f_i$ at position $p$, $p_k$ is the pre-defined kernel offset, and $w_k$ is the sampled weight. To equip the convolution with modulation and an irregular receptive field of view, we formulate our proposed DMM as

\widetilde{f}_i(p) = \sum_{k=1}^{K} w_k'' \cdot m_{i\rightarrow i-1}(p) \cdot f_i\left(p + p_k + o_{i\rightarrow i-1}(p)\right), \quad (2)

where $o_{i\rightarrow i-1} \in \mathbb{R}^{2K}$ and $m_{i\rightarrow i-1} \in \mathbb{R}^{K}$ are the learnable sampling offsets and the non-negative modulation scalars of the kernel at location $p$, regressed from the geometric relationship between the propagated features $f_i$ or $b_i$ and the previous generation layer $\widetilde{x}_i^{l-1}$, and $w_k''$ are the stylized weights modulated by the incoming statistics of the style code extracted from $I_s$. More specifically, $w_k''$ is responsible for manipulating the style transfer together with the motion mask $m_{i\rightarrow i-1}$ so that a long-range spatio-temporal correspondence of the sequence can be captured at the current time. This is achieved by computing the weights with demodulation [15], which is expressed as

w'_{jhk} = A_j \cdot w_{jhk}, \quad (3)
w''_{jhk} = w'_{jhk} / \sqrt{\sum_{j,k} {w'_{jhk}}^{2} + \epsilon}, \quad (4)

where $w_{jhk}$ denotes the weight connecting the $j$-th input feature map and the $h$-th output feature map at the $k$-th kernel sampling location, i.e., $w_k \subset w_{jhk}$, $A_j$ is the $j$-th scalar of the source style vector, $w'_{jhk}$ estimates the affine transformation based on the statistics of the incoming style code, and $\epsilon$ is a small constant to avoid numerical error. The demodulation preserves the semantic details of the source image well, while unseen content can be interpolated by considering the forward and backward propagation features. The augmented spatio-temporal sampling offsets also produce dynamic receptive fields of view to track the semantics of interest so that a sequence of plausible and smooth video frames can be synthesized.
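As a concrete illustration, the sketch below shows how Equations 2-4 could be realized in PyTorch with torchvision's modulated deformable convolution. The layer shapes, the sigmoid on the motion mask, and the use of a single style-modulated kernel per mini-batch (reasonable here because every frame of a sequence shares the same source image $I_s$) are assumptions of this sketch, not the authors' exact implementation.

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

def modulate_demodulate(weight, style, eps=1e-8):
    # Eq. (3): scale the kernel by the per-input-channel style scalars A_j.
    # weight: (C_out, C_in, k, k), style: (C_in,) extracted from I_s.
    w = weight * style.view(1, -1, 1, 1)
    # Eq. (4): demodulate each output filter to unit norm.
    demod = torch.rsqrt((w ** 2).sum(dim=(1, 2, 3), keepdim=True) + eps)
    return w * demod

class DeformableMotionModulation(nn.Module):
    """Sketch of Eq. (2): the motion offset o and motion mask m are regressed
    from the propagated feature f_i (or b_i) concatenated with the previous
    generator-layer output, and the kernel is stylized into w''."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)
        # 2K channels for the offsets, K channels for the mask.
        self.offset_mask = nn.Conv2d(2 * channels, 3 * k * k, 3, padding=1)

    def forward(self, f_i, x_prev_layer, style):
        om = self.offset_mask(torch.cat([f_i, x_prev_layer], dim=1))
        offset = om[:, : 2 * self.k ** 2]
        mask = torch.sigmoid(om[:, 2 * self.k ** 2 :])   # non-negative scalar m
        w = modulate_demodulate(self.weight, style)      # stylized weights w''
        return deform_conv2d(f_i, offset, w, mask=mask, padding=self.k // 2)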

3.2 Bidirectional Recurrent Propagation

It is challenging to produce stable and smooth videos by relying only on the current pose to generate the target person image, because the poses extracted by third-party human skeleton extractors [2, 21, 10] are discontinuous and noisy. We introduce a simple bidirectional propagation mechanism to interpolate the probability of the missing structural guidance from both forward and backward propagation.

Mesh Flow. We define the transformation flow between $P_s$ and $P_i$ as $F_{i\rightarrow s} \in \mathbb{R}^{H\times W\times 2}$, where $H$ and $W$ are the height and width of the generated image, and $P_s$ and $P_i$ are the source pose and the target pose at time $i$. Following previous work [22, 25], we apply SPIN [18] as the 3D human pose and shape estimator to predict parametric representations by fitting RGB images to the differentiable SMPL model [26]. The SMPL representation consists of three major elements: a weak-perspective camera vector $C \in \mathbb{R}^{3}$, a pose vector $\theta \in \mathbb{R}^{72}$, and a shape vector $\beta \in \mathbb{R}^{10}$. It parametrizes a triangulated mesh to produce the explicit pose representation by computing the corresponding $SMPL(\theta, \beta) \in \mathbb{R}^{6890\times 3}$. By utilizing the Neural Mesh Renderer (NMR) [16] and the SMPL mesh, we obtain the corresponding visible vertices of triangulated faces $V \in \mathbb{R}^{13776\times 3\times 2}$ between the source and target mesh, and the weight index map of the source mesh $W \in \mathbb{R}^{H\times W\times 3}$. Therefore, we can compute $F_{i\rightarrow s}$ by matching the correspondence between the source $W$ and $V$. The detailed computation is given in the Supplementary - Mesh Flow Computation.

Bidirectional propagation. Once we obtain the transformation flow $F_{i\rightarrow s}$, we perform feature propagation to extract the latent temporal information in a recurrent manner. We leverage a bidirectional propagation mechanism to manipulate the coarsely spatially-aligned sequence before feeding it into the generator. As shown in Figure 3, the pre-warped frames are formulated as

x_{i-N:i+N} = warp\left(F_{i\rightarrow s}\left(P_s, P_{1:M}\right), I_s\right), \quad (5)

where $N = M/2$. We use a shared 2D CNN encoder to independently extract features of $x_{i-N:i+N}$ in the forward branch $\mathcal{F}$ and the backward branch $\mathcal{B}$. With the recurrent propagation, the extracted features at time $i$ encapsulate the spatio-temporal information across the entire input sequence in the feature space. The temporal forward features and backward features computed at time $i$ are represented as

f_i = conv\left(\mathcal{F}(x_i) \circledcirc f_{i-1}\right), \quad (6)
b_i = conv\left(\mathcal{B}(x_i) \circledcirc b_{i+1}\right), \quad (7)

where $\mathcal{F}(x_i)$ and $\mathcal{B}(x_i)$ denote the feature maps of the forward encoder $\mathcal{F}$ and the backward encoder $\mathcal{B}$, respectively, and $\circledcirc$ denotes the concatenation operator. With the recurrent features from forward and backward propagation, the model expands its field of view across the whole input sequence so that a more robust spatio-temporal consistency is captured during the generation process. Moreover, outliers of the noisy input pose at time $i$ can also be interpolated from the warped features $x_{i-N:i-1}$ and $x_{i+1:i+N}$. With the assistance of Equation 2, the probability of the generated result can be formulated as

q\left(x_i \mid I_s\right) = \prod_{i-N}^{i} q\left(f_i \mid f_{i-1}\right) + \prod_{i}^{i+N} q\left(b_i \mid b_{i+1}\right). \quad (8)

The combination in $q(x_i \mid I_s)$ provides a substantial positive gain to the network when synthesizing new content by feature interpolation.
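The following sketch illustrates Equations 5-7: the source image is backward-warped by the mesh flow, and a shared encoder plus a fusion convolution propagate features forward and backward through the warped sequence. The normalized-coordinate convention of the flow, the zero initialization of the recurrent state, and the module names are assumptions introduced for illustration.

import torch
import torch.nn.functional as F

def warp(flow_i_to_s, source_img):
    # Eq. (5): backward-warp I_s by F_{i->s}; assumes the flow stores source
    # sampling coordinates normalized to [-1, 1] with shape (N, H, W, 2).
    return F.grid_sample(source_img, flow_i_to_s, align_corners=True)

def bidirectional_propagate(warped_seq, enc_f, enc_b, conv_f, conv_b):
    """Eqs. (6)-(7): recurrent forward/backward propagation over the coarsely
    warped frames x_{i-N:i+N}; warped_seq is a list of (N, C, H, W) tensors.
    conv_f / conv_b fuse the concatenated features back to the encoder width."""
    T = len(warped_seq)
    feats_f, feats_b = [None] * T, [None] * T

    prev = None
    for i in range(T):                                   # forward branch F
        e = enc_f(warped_seq[i])
        prev = torch.zeros_like(e) if prev is None else prev
        prev = conv_f(torch.cat([e, prev], dim=1))       # f_i = conv(F(x_i) (+) f_{i-1})
        feats_f[i] = prev

    nxt = None
    for i in reversed(range(T)):                         # backward branch B
        e = enc_b(warped_seq[i])
        nxt = torch.zeros_like(e) if nxt is None else nxt
        nxt = conv_b(torch.cat([e, nxt], dim=1))         # b_i = conv(B(x_i) (+) b_{i+1})
        feats_b[i] = nxt

    return feats_f, feats_b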

3.3 Objective Loss Function

Following training strategies similar to current pose transfer frameworks [35, 25], the final objective of our model is composed of six terms: a spatial adversarial loss $\mathcal{L}_{adv}$, a spatio-temporal adversarial loss $\mathcal{L}_{temp}$, an appearance loss $\mathcal{L}_{l1}$, a perceptual loss $\mathcal{L}_{per}$, a style loss $\mathcal{L}_{gram}$, and a contextual loss $\mathcal{L}_{cx}$, as follows:

\mathcal{L}_{full} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{temp}\mathcal{L}_{temp} + \lambda_{l1}\mathcal{L}_{l1} + \lambda_{per}\mathcal{L}_{per} + \lambda_{gram}\mathcal{L}_{gram} + \lambda_{cx}\mathcal{L}_{cx}, \quad (9)

where $\lambda_{adv}$, $\lambda_{temp}$, $\lambda_{l1}$, $\lambda_{per}$, $\lambda_{gram}$, and $\lambda_{cx}$ are hyperparameters that balance the terms during optimization.
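A minimal sketch of Equation 9 as a weighted sum is given below; the dictionary-based interface is an illustrative assumption, while the weight values are the ones reported in Section 4.1.

def total_loss(losses, weights):
    # Eq. (9): weighted sum of the six loss terms.
    return sum(weights[k] * losses[k] for k in ('adv', 'temp', 'l1', 'per', 'gram', 'cx'))

# Hyperparameter values reported in the training strategy (Section 4.1).
loss_weights = {'adv': 5, 'temp': 5, 'l1': 2, 'per': 500, 'gram': 0.5, 'cx': 0.1}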

Spatial adversarial loss. We utilize the traditional generative adversarial loss [9, 29] $\mathcal{L}_{adv}$ to mimic the distribution of the training set with a convolutional discriminator $D_s$. It is formulated as:

\mathcal{L}_{adv} = \mathbb{E}\left[\log\left(D_s\left(I_s, I_i\right)\right) + \log\left(1 - D_s\left(I_s, \hat{I}_i\right)\right)\right], \quad (10)

where $(I_s, I_i) \in \mathbb{I}_{real}$ and $\hat{I}_i \in \mathbb{I}_{fake}$ denote samples from the distributions of real and generated person images, respectively, and $i \in 1 \ldots M$ indexes the frames of an input batch.
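The sketch below implements the logarithmic form of Equation 10 with binary cross-entropy on logits for numerical stability; conditioning the discriminator by channel-wise concatenation of $I_s$ is an assumption of this sketch, and the paper also cites the least-squares variant [29].

import torch
import torch.nn.functional as F

def d_adv_loss(D_s, I_s, I_real, I_fake):
    # Discriminator side of Eq. (10): real pairs -> 1, generated pairs -> 0.
    real_logits = D_s(torch.cat([I_s, I_real], dim=1))
    fake_logits = D_s(torch.cat([I_s, I_fake.detach()], dim=1))
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def g_adv_loss(D_s, I_s, I_fake):
    # Generator side (non-saturating form): try to label generated pairs as real.
    fake_logits = D_s(torch.cat([I_s, I_fake], dim=1))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))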

Temporal adversarial loss. Similar to $\mathcal{L}_{adv}$, the temporal adversarial loss $\mathcal{L}_{temp}$ optimizes the temporal consistency across the time and feature channels of a mini-batch with a 3D CNN discriminator $D_t$.

Appearance loss. To enforce discriminative pixel-level supervision, we employ a pixel-wise L1 loss that guides the synthesis of a photo-realistic appearance with respect to the ground-truth image.

Perceptual loss. To minimize the distance in feature space, we apply a standard perceptual loss [12]. It computes the L1 difference at a selected layer $\ell = Conv1\_2$ of a VGG-19 [38] model $\theta_\ell(\cdot)$ pre-trained on ImageNet [6]. It is defined as

\mathcal{L}_{per} = \sum_{C_\ell H_\ell W_\ell} \left\| \theta_\ell(\hat{I}_i) - \theta_\ell(I_i) \right\|_1, \quad (11)

where $C_\ell$ is the number of channels, and $H_\ell$ and $W_\ell$ are the height and width of the feature maps at layer $\ell$, respectively.
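A possible PyTorch realization of Equation 11 is sketched below; the exact torchvision slicing for Conv1_2 and whether the feature is taken before or after the ReLU are assumptions, as the paper does not specify them.

import torch
import torchvision.models as models

class PerceptualLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # VGG-19 pre-trained on ImageNet (torchvision >= 0.13 weights API).
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
        self.slice = torch.nn.Sequential(*list(vgg.children())[:3])  # up to conv1_2
        for p in self.slice.parameters():
            p.requires_grad = False

    def forward(self, fake, real):
        # Inputs are assumed to be ImageNet-normalized RGB tensors.
        return torch.nn.functional.l1_loss(self.slice(fake), self.slice(real))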

Style loss. In addition to the perceptual loss, which minimizes the L1 distance in feature space, we further compute the Gram matrices of the activated feature maps at the selected layers and penalize their difference:

\mathcal{L}_{gram} = \sum_{C_\ell H_\ell W_\ell} \left\| Gram(\theta_\ell(\hat{I}_i)) - Gram(\theta_\ell(I_i)) \right\|_1, \quad (12)

where the used layers are the same as in perceptual loss.
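The Gram matrix term can be sketched as follows; the normalization by the number of feature entries is a common convention and an assumption here.

import torch

def gram_matrix(feat):
    # feat: (N, C, H, W) -> (N, C, C) channel-correlation matrix.
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feat_fake, feat_real):
    # Eq. (12): L1 distance between Gram matrices of VGG features.
    return torch.nn.functional.l1_loss(gram_matrix(feat_fake), gram_matrix(feat_real))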

Contextual loss. To maximize the similarities between two non-aligned images in context space, we utilize the contextual loss  [30] to allow spatial alignment according to contextual correspondence during the deformation process.

\mathcal{L}_{cx} = -\sum_{C_\ell H_\ell W_\ell} \log\left[ CX\left( \theta_\ell(\hat{I}_i), \theta_\ell(I_i) \right) \right], \quad (13)

where $\ell = relu\{3\_2, 4\_2\}$ are layers of the pre-trained VGG-19 model $\theta(\cdot)$, and $CX(\cdot)$ is the similarity measure defined in [30].
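For reference, a compact sketch of the CX measure following Mechrez et al. [30] is given below; the bandwidth h, the epsilon values, and the mean-centering step are taken from that paper and should be treated as assumptions rather than the authors' exact settings.

import torch
import torch.nn.functional as F

def contextual_similarity(feat_x, feat_y, h=0.5, eps=1e-5):
    # feat_x, feat_y: (N, C, H, W) VGG features of the generated / target frames.
    n, c, _, _ = feat_x.shape
    x = feat_x.view(n, c, -1)                               # (N, C, HW)
    y = feat_y.view(n, c, -1)
    y_mu = y.mean(dim=2, keepdim=True)
    x, y = x - y_mu, y - y_mu                               # center by target mean
    x = F.normalize(x, dim=1)
    y = F.normalize(y, dim=1)
    cos = torch.bmm(x.transpose(1, 2), y)                   # (N, HW_x, HW_y)
    d = 1.0 - cos                                           # cosine distance
    d_norm = d / (d.min(dim=2, keepdim=True).values + eps)  # relative distance
    w = torch.exp((1.0 - d_norm) / h)
    cx_ij = w / w.sum(dim=2, keepdim=True)                  # row-normalized affinity
    return cx_ij.max(dim=1).values.mean(dim=1)              # CX(X, Y) per sample

def contextual_loss(feat_fake, feat_real):
    # Eq. (13): negative log of the contextual similarity.
    return -torch.log(contextual_similarity(feat_fake, feat_real) + 1e-5).mean()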

Models | FashionVideo: SSIM↑ PSNR↑ L1↓ FID↓ LPIPS↓ FVD-Train128f↓ FVD-Test128f↓ | iPER: SSIM↑ PSNR↑ L1↓ FID↓ LPIPS↓ FVD-Train128f↓ FVD-Test128f↓
GFLA [35] | 0.892 21.309 0.0459 16.308 0.0922 195.205±3.036 256.430±7.459 | 0.797 20.898 0.085 25.075 0.149 684.101±11.215 796.112±37.071
Impersonator++ [25] | 0.873 21.434 0.0502 22.363 0.0761 197.668±2.309 175.663±5.857 | 0.755 18.689 0.103 33.629 0.173 714.519±13.813 742.394±30.208
DPTN [48] | 0.907 23.996 0.0335 15.342 0.0603 215.078±2.252 206.345±6.522 | 0.742 17.997 0.110 34.204 0.209 1003.598±14.715 1143.603±33.631
NTED [34] | 0.890 22.025 0.0425 14.263 0.0728 278.854±3.505 324.128±7.753 | 0.771 19.320 0.091 20.164 0.162 784.509±12.908 916.489±46.471
Ours | 0.918 24.071 0.0302 14.083 0.0478 168.275±2.564 148.253±6.781 | 0.803 21.797 0.0724 22.291 0.120 500.226±11.670 536.084±29.200
Table 1: Quantitative comparisons with some state-of-the-art methods on the FashionVideo and iPER benchmarks. The best scores are highlighted in bold format.
Figure 4: Qualitative comparisons of pose transfer with some state-of-the-art methods on DanceFashion and iPER benchmarks. Please zoom in for more details.

4 Experiments and Results

4.1 Implementations

Dataset. We conducted experiments on two publicly available high-resolution video datasets for video-based human pose transfer: FashionVideo [47] and iPER [24]. Both are collected in a human-centric manner with diverse garments, poses, viewpoints, and occlusion scenarios. FashionVideo consists of 600 videos with around 350 frames per video, partitioned into 500 videos for training and 100 videos for testing; it is captured with a static camera against a clean white background. The iPER dataset contains 206 videos with roughly 1100 frames each, with 164 videos for training and 42 videos for testing. Different focal lengths and genders are included to capture various poses and views against indoor and natural backgrounds.

Evaluation metrics. To evaluate structural similarity, we use the SSIM [44] index, which compares local means and covariances. PSNR measures the ratio between the maximum possible pixel value and the mean squared error, and the L1 distance represents pixel-wise fidelity. We also employ two perceptual metrics: Fréchet Inception Distance (FID) [11] and Learned Perceptual Image Patch Similarity (LPIPS) [50]. FID measures the disparity between the feature distributions of the generated images and the training images by computing the Fréchet distance between them, while LPIPS measures the perceptual distance between generated samples and real samples in a deep feature space. To measure temporal coherence, we utilize the Fréchet Video Distance (FVD) [41], which extracts spatio-temporal features with a pre-trained I3D [3] network. It considers a distribution over the entire video, thereby avoiding the drawbacks of frame-level metrics. The term "FVD-Train128f" denotes the protocol of computing FVD on 128 randomly selected consecutive frames per sequence between the training set and the generated images, averaged over 50 iterations; "FVD-Test128f" is defined likewise on the testing set.
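As an example of how the perceptual metric can be computed in practice, the snippet below uses the official lpips package; the choice of the AlexNet backbone and the image range convention follow that package's documentation and are assumptions about the evaluation setup, not a statement of the authors' exact protocol.

import torch
import lpips

# LPIPS expects RGB tensors in [-1, 1] with shape (N, 3, H, W).
metric = lpips.LPIPS(net='alex')        # a 'vgg' backbone is also available
fake = torch.rand(1, 3, 256, 256) * 2 - 1
real = torch.rand(1, 3, 256, 256) * 2 - 1
distance = metric(fake, real)           # lower is better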

Training strategy. We implement the proposed method in PyTorch. We adopt the Adam [17] optimizer with momentum terms $\beta_1 = 0.5$ and $\beta_2 = 0.999$ to train our model for 50,000 iterations in total, with a learning rate of $10^{-4}$. To keep the original aspect ratio of the images, we resize the video frames to 256×256 using a thumbnail approach. The negative slope of LeakyReLU [32] is set to 0.2. The weighting hyperparameters $\lambda_{adv}$, $\lambda_{temp}$, $\lambda_{l1}$, $\lambda_{per}$, $\lambda_{gram}$, and $\lambda_{cx}$ are set to 5, 5, 2, 500, 0.5, and 0.1, respectively. All models are trained and tested on a server with four NVIDIA GeForce RTX 2080 Ti GPUs, each with 11GB of memory.
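The optimizer settings above translate directly into PyTorch; the placeholder module standing in for the full generator is, of course, an assumption.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)            # placeholder for the full generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
activation = nn.LeakyReLU(negative_slope=0.2)    # negative slope used in the network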

Figure 5: Qualitative comparisons with the state-of-the-art methods on transferred results conditioned on some noisy poses. Note that the input poses are evenly sampled from a random video clip. Please zoom in for more details.

4.2 Comparison with SOTAs

To demonstrate the superiority of the proposed model, we compare our model with several state-of-the-art approaches including GFLA  [35], Impersonator++  [25], DPTN  [48], and NTED  [34].

Quantitative Comparison. As shown in Table 1, our model achieves the best results on most evaluation metrics. The large margin of improvement in the FVD score indicates the best performance of our method in terms of spatio-temporal consistency, and reflects the merits of the proposed bidirectionally deformable motion modulation in modulating long-range motion sequences with minimal discontinuity. Our model achieves the best results on the image-based perceptual metrics on the challenging FashionVideo dataset. For images with natural backgrounds, such as those in the iPER dataset, our model also obtains highly competitive performance. These results indicate that our model has better style transfer and video synthesis ability than current methods.

Models | SSIM↑ PSNR↑ L1↓ FID↓ LPIPS↓ FVD-Train128f↓ FVD-Test128f↓
Deformable Motion Modulation
w/o DMM | 0.892 21.802 0.0435 16.636 0.0935 200.363±2.992 267.476±7.093
w/o DCN | 0.916 23.849 0.0317 14.358 0.0484 187.390±2.617 167.529±7.566
w/o Style weight | 0.914 23.483 0.0330 14.497 0.0513 176.107±2.745 161.549±7.245
w/o Feature concat | 0.911 23.529 0.0338 14.933 0.0523 199.483±2.224 172.651±8.439
w/o Motion mask | 0.912 23.492 0.0343 14.554 0.0519 191.984±2.336 176.882±7.004
Bidirectional Propagation
w/o Forward propagation | 0.914 23.715 0.0327 15.794 0.0510 208.354±2.833 188.869±9.678
w/o Backward propagation | 0.908 23.179 0.0349 14.345 0.0538 171.649±2.289 156.469±6.671
w/o Recurrent structural flow | 0.910 23.440 0.0337 15.951 0.0527 202.555±2.712 199.854±7.192
Ours | 0.918 24.071 0.0302 14.083 0.0478 168.275±2.564 148.253±6.781
Table 2: Quantitative comparisons of ablation study on the FashionVideo benchmark. The best scores are highlighted in bold format.

Qualitative Comparison. Apart from the quantitative comparison, we also conduct a comprehensive qualitative evaluation against the state-of-the-arts. We illustrate generated results with various poses in Figure 4. We demonstrate a wide variety of viewpoints, including the front view, left side of the body, right side of the body, and back view, on the FashionVideo dataset (rows 1-4). These results highlight the superiority of our method in transferring facial characteristics and complex garment textures across different viewpoints. To evaluate synthesis quality in natural backgrounds, we present generated results for uncommon gestures on the iPER dataset (rows 5-8). They show that our method handles arbitrary poses, shapes, and backgrounds with minimal artifacts compared with the others. This benefits from the irregular field of view constructed by the deformable motion offsets, which effectively activates multi-scale features.

As a video-based solution, our method generates temporally coherent sequences conditioned on noisy poses without pre-processing, as shown in Figure 5. In general, much of the structural guidance is hampered by statistical outliers, especially in occluded scenarios. This leads to incomplete shapes and artifacts in the generated images, even when recurrent neural networks are applied as in [24, 35, 25]. With the proposed bidirectional modulation mechanism, our method synthesizes smooth sequences with high-fidelity transfer effects.

4.3 Ablation Study

Figure 6: Qualitative analysis of ablation study. The red arrow indicates the major difference among the variants. Please zoom in for more details.

Deformable Motion Modulation. The proposed DMM synthesizes continuous frame sequences by modulating the 1D style code of the source image with an augmentation of spatio-temporal sampling locations. We use a sum operation to fuse the features extracted from the bidirectional propagation. Compared to the model w/o DMM, our model achieves superior results on all evaluation metrics in Table 2. The lack of a style modulation mechanism leads to failed style transfer results, despite the simple appearance style of the source image, as shown in Figure 6 (a). Moreover, based on the result of the model w/o DCN, we observe a positive gain in synthesizing new content when the receptive field of view of the convolution is expanded: it achieves better FID and LPIPS scores for image-based perception, and the improvement in FVD demonstrates the importance of capturing temporal information from adjacent frames. Furthermore, we compare against the model w/o Style weight. The modulated style weight is an important component for performing the affine transformation that maps style codes onto structural poses. As depicted in Figure 6 (b), the generated images lack style consistency due to poor generalization in style transfer. This verifies that our proposed DMM benefits the fusion of style statistics and thus minimizes the distribution gap between real-world and synthesized images.

Forward / Backward Propagation. The proposed bidirectional propagation mechanism interpolates the probability of missing structural guidance from both forward and backward propagation flows in order to enhance temporal consistency. The evaluation results in Table 2 show that both branches contribute effectively to generating realistic images and maintaining temporal coherence between adjacent frames. In addition, the qualitative results in Figure 6 (c-d) demonstrate that the forward and backward propagation preserve more structural shape and appearance details. Both comparisons verify the efficacy of the proposed bidirectional propagation flow.

4.4 Visualization of DMM

The proposed DMM uses geometric kernel offsets to transform the regular receptive field of view into irregular shapes [5, 51]. To investigate its effectiveness, we visualize the behavior of the DMM module in feature space.

Figure 7: Demonstration of the regions of interest for DMM. We highlight the activated regions of the estimated motion offsets and motion masks for both forward and backward propagation. Please zoom in for more details.
Figure 8: Demonstration of the motion offsets applied to the activated units of DMM. The green and red points represent the activation units for the corresponding augmented sampling locations. Please zoom in for more details.

Region of Interest. The region of interest for DMM highlights the global area with effective motion offsets and motion masks. As demonstrated in Figure 7, we plot the kernel offsets as a form of optical flow following [1] so that the activated regions of interest in each propagation branch can be observed. The visualizations of the motion mask also highlight the activated magnitudes along with the motion offsets. It is reasonable that the motion offsets and masks are not aligned between the forward and backward branches, because they are designed to capture temporal information from two different sequences. Based on the global shape of the offset regions and masks in both forward and backward propagation, we can clearly identify the human body shape with a predictable movement, and regions with more semantic information have higher density. The activated regions provide geometric guidance for the network to modulate the style code extracted from the source image.
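For reference, kernel offsets can be rendered as a color-coded flow field in the spirit of [1]; below is a hedged sketch using an HSV encoding, where the (dy, dx) channel interleaving of the offset tensor is an assumption.

import numpy as np
import cv2

def offsets_to_flow_image(offsets):
    """Visualize averaged 2D kernel offsets (2K, H, W) as a color-coded flow
    image: hue encodes direction, value encodes magnitude."""
    dy = offsets[0::2].mean(axis=0).astype(np.float32)   # assumed (dy, dx) interleaving
    dx = offsets[1::2].mean(axis=0).astype(np.float32)
    mag, ang = cv2.cartToPolar(dx, dy, angleInDegrees=True)
    hsv = np.zeros((*dx.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (ang / 2).astype(np.uint8)             # OpenCV hue range is [0, 180)
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)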

Activated Unit. The success of the proposed DMM relies on the augmentation of spatio-temporal sampling locations. We visualize the behavior of the deformable filters in Figure 8, where the green and red points highlight the activation units and the corresponding augmented sampling locations. It is evident that the sampling locations depend on the activated units. This confirms that the proposed DMM module produces a dynamic receptive field of view that keeps track of the semantics of interest, enabling the synthesis of a sequence of high-quality and smooth video frames.

5 Conclusion

In this paper, we present a novel end-to-end framework for video-based human pose transfer. The proposed Deformable Motion Modulation (DMM) employs geometric kernel offsets with adaptive weight modulation to perform spatio-temporal alignment and style transfer concurrently, and bidirectional propagation is employed to strengthen temporal coherence. Comprehensive experimental results show that our method effectively handles spatial misalignment of complex structural patterns as well as noisy poses. Our framework demonstrates excellent synthesis ability in human pose video generation and holds strong potential for further research and industrial development.

References

  • [1] Simon Baker, Daniel Scharstein, JP Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. A database and evaluation methodology for optical flow. International journal of computer vision, 92(1):1–31, 2011.
  • [2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
  • [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [4] Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5972–5981, 2022.
  • [5] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [7] Haoye Dong, Xiaodan Liang, Ke Gong, Hanjiang Lai, Jia Zhu, and Jian Yin. Soft-gated warping-gan for pose-guided person image synthesis. Advances in neural information processing systems, 31, 2018.
  • [8] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9026–9035, 2019.
  • [9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [10] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018.
  • [11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [13] Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE international conference on data mining (ICDM), pages 207–216. IEEE, 2017.
  • [14] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • [15] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • [16] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3907–3916, 2018.
  • [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [18] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2252–2261, 2019.
  • [19] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
  • [20] Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. Comparative deep learning of hybrid representations for image recommendations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2545–2553, 2016.
  • [21] Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10863–10872, 2019.
  • [22] Yining Li, Chen Huang, and Chen Change Loy. Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3693–3702, 2019.
  • [23] Zhen Li, Cheng-Ze Lu, Jianhua Qin, Chun-Le Guo, and Ming-Ming Cheng. Towards an end-to-end framework for flow-guided video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17562–17571, 2022.
  • [24] Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. Liquid warping gan: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5904–5913, 2019.
  • [25] Wen Liu, Zhixin Piao, Zhi Tu, Wenhan Luo, Lin Ma, and Shenghua Gao. Liquid warping gan with attention: A unified framework for human image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [26] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
  • [27] Zhengyao Lv, Xiaoming Li, Xin Li, Fu Li, Tianwei Lin, Dongliang He, and Wangmeng Zuo. Learning semantic person image generation by region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10806–10815, 2021.
  • [28] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. Advances in neural information processing systems, 30, 2017.
  • [29] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2017.
  • [30] Roey Mechrez, Itamar Talmi, and Lihi Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In Proceedings of the European conference on computer vision (ECCV), pages 768–783, 2018.
  • [31] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [32] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Icml, 2010.
  • [33] Assaf Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, and Sharon Alpert. Image based virtual try-on network from unpaired data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5184–5193, 2020.
  • [34] Yurui Ren, Xiaoqing Fan, Ge Li, Shan Liu, and Thomas H Li. Neural texture extraction and distribution for controllable person image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13535–13544, 2022.
  • [35] Yurui Ren, Ge Li, Shan Liu, and Thomas H Li. Deep spatial transformation for pose-guided person image generation and animation. IEEE Transactions on Image Processing, 29:8622–8635, 2020.
  • [36] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2377–2386, 2019.
  • [37] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in Neural Information Processing Systems, 32, 2019.
  • [38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [39] Hao Tang, Song Bai, Li Zhang, Philip HS Torr, and Nicu Sebe. Xinggan for person image generation. In European Conference on Computer Vision, pages 717–734. Springer, 2020.
  • [40] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3360–3369, 2020.
  • [41] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  • [42] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European conference on computer vision (ECCV), pages 589–604, 2018.
  • [43] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. Few-shot video-to-video synthesis. arXiv preprint arXiv:1910.12713, 2019.
  • [44] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [45] Lingbo Yang, Pan Wang, Xinfeng Zhang, Shanshe Wang, Zhanning Gao, Peiran Ren, Xuansong Xie, Siwei Ma, and Wen Gao. Region-adaptive texture enhancement for detailed person image synthesis. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
  • [46] Wing-Yin Yu, Lai-Man Po, Yuzhi Zhao, Jingjing Xiong, and Kin-Wai Lau. Spatial content alignment for pose transfer. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2021.
  • [47] Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. Dwnet: Dense warp-based network for pose-guided human video generation. arXiv preprint arXiv:1910.09139, 2019.
  • [48] Pengze Zhang, Lingxiao Yang, Jian-Huang Lai, and Xiaohua Xie. Exploring dual-task correlation for pose guided person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7713–7722, 2022.
  • [49] Quan Zhang, Jianhuang Lai, Zhanxiang Feng, and Xiaohua Xie. Seeing like a human: Asynchronous learning with dynamic progressive refinement for person re-identification. IEEE Transactions on Image Processing, 31:352–365, 2021.
  • [50] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [51] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9308–9316, 2019.
  • [52] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2347–2356, 2019.