
Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions

Zipeng Ye (yezp17@mails.tsinghua.edu.cn), Zhiyao Sun (sunzy21@mails.tsinghua.edu.cn), Yu-Hui Wen (wenyh1616@tsinghua.edu.cn), Yanan Sun (sunyn20@mails.tsinghua.edu.cn), Tian Lv (lvt18@mails.tsinghua.edu.cn), Tsinghua University, Beijing, China; Ran Yi (ranyi@sjtu.edu.cn), Shanghai Jiao Tong University, Shanghai, China; and Yong-Jin Liu (liuyongjin@tsinghua.edu.cn), Tsinghua University, Beijing, China
Abstract.

Recently, talking-face video generation has received considerable attention. So far, most methods generate results with neutral expressions, or with expressions that are implicitly determined by neural networks in an uncontrollable way. In this paper, we propose a method to generate talking-face videos with continuously controllable expressions in real time. Our method is based on an important observation: in contrast to facial geometry of moderate resolution, most expression information lies in textures. We therefore make use of neural textures to generate high-quality talking-face videos and design a novel neural network that generates neural textures for image frames (which we call dynamic neural textures) based on the input expression and a continuous intensity expression coding (CIEC). Our method uses 3DMM as the 3D model to sample the dynamic neural textures. Since the 3DMM does not cover the teeth area, we propose a teeth submodule to complete the missing teeth details. Results and an ablation study show the effectiveness of our method in generating high-quality talking-face videos with continuously controllable expressions. We also set up four baseline methods by combining existing representative methods and compare them with our method. Experimental results, including a user study, show that our method achieves the best performance.

talking-face, controllable expressions, dynamic neural textures
CCS Concepts: Information systems → Multimedia content creation; Computing methodologies → Appearance and texture representations; Computing methodologies → Neural networks; Computing methodologies → Image-based rendering

1. Introduction

Recently, face image editing with controllable characteristics, such as identity, expression and age, has attracted considerable attention (e.g., (Karras et al., 2019; Richardson et al., 2021; Shen et al., 2020)). Editing these characteristics in videos is even more challenging due to the constraint of inter-frame continuity, and very few works exist. For example, a facial expression editing method is proposed in (Ma and Deng, 2019) which transforms the expressions in talking-face videos into two types: happy and sad. In this method, a lip shape correction and a smoothing post-process are required to retain lip synchronization with the source audio and temporal smoothness. In this paper, we address the challenging problem of generating talking-face videos with continuously controllable expressions, given an audio track and a sequence of expression classes and intensities; see Fig. 1 for an illustration. Different from (Ma and Deng, 2019), we consider talking-face video generation with diverse styles of expressions whose intensities are continuously controllable, while simultaneously maintaining lip synchronization.

Refer to caption
Figure 1. We propose dynamic neural textures to generate talking-face videos with continuously controllable expressions. The continuous expression information (class label and intensity) can be specified by the user or automatically extracted from the input audio (Tao et al., 2018). Our method switches between different expressions at zero intensity, i.e., the neutral expression.

Many methods have been proposed for synthesizing talking-face videos (e.g., (Thies et al., 2020; Wen et al., 2020; Prajwal et al., 2020)). For example, a class of audio-driven talking-face video generation methods (Wen et al., 2020; Thies et al., 2020; Yi et al., 2020; Wu et al., 2021; Ji et al., 2021) connect the visual and auditory modalities by using intermediate 3D face models to achieve 3D consistency and temporal stability. So far, most methods output talking-face videos with expressions whose types and intensities cannot be explicitly controlled by users (Wen et al., 2020; Thies et al., 2020; Yi et al., 2020; Wu et al., 2021). An exception is (Ji et al., 2021), which can generate talking-face videos with controllable expressions by controlling the intermediate 3D morphable model (3DMM) (Blanz and Vetter, 1999). Given an input audio, all the above-mentioned methods learn to predict the expression parameters that drive the 3D face animation and then use neural networks to generate photo-realistic videos, where the expression parameters jointly encode the lip motions and expressions of the 3D face. As a result, unsynchronized lip motions and inaccurate expressions occur in most existing methods.

In this paper, we propose a method called dynamic neural textures for generating talking-face videos with continuously controllable expressions and lip synchronization. Here, continuously controllable expressions mean that the intensity level of different types of expressions can be continuously changed. Our method is based on an important observation: given a 3D face geometry of moderate resolution (e.g., 3DMM), textures rather than geometry carry most of the information describing expressions; although 3D geometry of very high resolution can contain more expression information, it is costly to obtain. Meanwhile, low-frequency vertex colors cannot provide enough expression information. Therefore, we choose texture maps, which contain fine details in high-resolution images, to control expressions.

Texture images can be obtained from photo-realistic rendering techniques, including traditional physically based rendering and recently proposed neural rendering (Tewari et al., 2020). However, the traditional rendering pipeline requires high-quality 3D models. Our work considers neural rendering because it is able to synthesize photo-realistic images by using a geometric representation of moderate resolution. In particular, we pay attention to neural textures (Thies et al., 2019), which are learnable high-dimensional feature maps, instead of simple RGB values, to encode high-level appearance information. However, neural textures use fixed textures which cannot represent different expressions and usually lead to unnatural results (Olszewski et al., 2017; Nagano et al., 2018). In this paper, we propose a dynamic neural texture method, which generates different textures for different expressions.

Our proposed dynamic neural textures are (1) inferred from the expression input and (2) independent of geometry, so that we can decouple lip motion (represented in geometry) from expressions (represented in textures). The significant difference from static neural textures is that dynamic neural textures depend on the input expression. Thus, dynamic neural textures are expressive enough to synthesize expressions with continuous intensities. We propose an audio submodule to generate 3D face animation (represented by 3DMM parameters) from the input audio, where the 3D face is used to sample the dynamic neural textures from texture space into screen space to generate photo-realistic frames. One challenge in our method is how to decouple lip motions from expressions. Our idea is to use the dynamic neural textures to represent and control expressions, and to use the 3D face geometry to represent and control lip motions. To achieve this decoupling, the 3D face geometry should remain neutral (note that neural textures only need coarse geometry); in other words, the 3D face does not contain expression information. Accordingly, we propose a decoupling network to transfer 3D faces with different expressions to a neutral face.

We also pay attention to the texture quality in the teeth area, which is essential for high-quality talking-face videos. Our method represents the face geometry by the 3DMM model. However, the 3DMM cannot provide information in the teeth area. To address this problem, we propose a teeth submodule to complete the missing texture information in the teeth area. The teeth submodule focuses on the teeth area of the sampled neural textures by an affine transformation and uses a CNN to infer the missing information. Our results show that using the teeth submodule improves the quality of the teeth area.

In summary, the main contributions of our work include:

  • We propose novel dynamic neural textures to generate talking-face videos with continuously controllable expressions in real time, without using training data with continuous intensity values.

  • We propose a decoupling network to transfer 3D faces with expressions to neutral faces.

  • We propose a teeth submodule to complete the missing information in the teeth area for achieving fine talking-face details and realistic textures.

2. Related Works

2.1. Audio-Driven Talking-Face Video Generation

Audio-driven talking-face video generation is a typical task with multi-modal input, which uses audio to drive a specified face (represented by either a face photo or a video) and generates a new talking-face video. Deep neural network models have been developed for this task. (Song et al., 2019) proposes a conditional recurrent generation network that takes an audio clip and an image as input. (Chung et al., 2017) uses two CNNs to extract features from the audio and the photo, respectively. (Zhou et al., 2019) uses an auto-encoder to disentangle subject-related information and speech-related information. (Prajwal et al., 2020) uses a pre-trained lip-sync discriminator to correct lip-sync errors, which improves the lip synchronization of the generated results. Other methods use 2D facial landmarks (Chen et al., 2019; Zhou et al., 2020) or 3D face models (Wen et al., 2020; Thies et al., 2020; Yi et al., 2020; Wu et al., 2021; Ji et al., 2021) to bridge the gap between the audio and visual domains, which achieves better stability and inter-frame continuity. Compared with 2D facial landmarks, 3D face models better handle large head-pose changes and provide dense guidance.

Refer to caption
Figure 2. The illustration of the proposed dynamic neural texture method.

2.2. Expression Editing

Expression editing aims to modify the human expressions in photos or videos while keeping their identities and other characteristics unchanged, which is a special kind of image translation task. Recently, general-purpose image translation methods (Choi et al., 2018; Richardson et al., 2021) have been successfully applied in this area. However, these methods cannot completely decouple expressions from other attributes, such as lip motion and pose, which are important in talking-face videos, and thus cannot generate high-quality results. Some methods (Ma and Deng, 2019; Geng et al., 2019; Wu and Lu, 2020; Ding et al., 2018) are specially designed for expression editing. (Ding et al., 2018) designs an expression controller module to encode expressions as real-valued intensity vectors, while the other methods only consider discrete intensity levels of expressions. However, (Ding et al., 2018) cannot be used to edit talking-face videos, because it cannot maintain lip synchronization with speech. In recent years, several methods (Karras et al., 2017; Cudeiro et al., 2019) have been proposed for generating 3D face animations with controllable expressions. Although 3D face animations can be finely controlled with high quality, they are still far from photo-realistic and are easily distinguishable from real videos.

2.3. Neural Rendering

Neural rendering (Tewari et al., 2020; Thies et al., 2019; Mildenhall et al., 2020) is a novel rendering technology utilizing neural networks. In contrast to traditional rendering that uses empirical or physically-based models, neural rendering takes full advantage of neural networks and can achieve more realistic results. One important class of neural rendering technologies is neural textures (Thies et al., 2019), which apply learned textures to 3D meshes to represent a scene. Neural textures have been introduced into generating high-quality talking-face videos (Thies et al., 2019, 2020). However, existing methods only use static textures, which cannot model time-varying expressions. Inspired by dynamic textures (Olszewski et al., 2017; Nagano et al., 2018), which are per-frame textures, we overcome this limitation and propose dynamic neural textures, which are independent of geometry so that lip motion and expressions can be decoupled.

3. Our Method

3.1. Overview

Our work is based on the observation that textures carry most expression information, which is explained in Sec. 3.2. The pipeline of the proposed dynamic neural texture method is illustrated in Fig. 2, which contains four submodules: the dynamic neural texture submodule (Sec. 3.3), the audio submodule (Sec. 3.4), the teeth submodule (Sec. 3.6) and the neural rendering submodule (Sec. 3.7). The input to our system is a driving audio, a sequence of expressions and a sequence of background frames (or only one background frame). Our system outputs a talking-face video with continuously controllable expressions, in which each video frame is generated from a background frame, a driving audio segment and a continuous intensity expression coding (CIEC) vector. We use two submodules, the audio submodule and the dynamic neural texture submodule, to decouple lip motions from expressions. The former generates 3D face animation (represented by 3DMM parameters) from the input audio, and the latter obtains dynamic neural textures from the CIEC, which fuses features of the input expressions. In Sec. 3.5, we present the details of how the generated 3D faces are used to sample the dynamic neural textures. Finally, we blend the facial area into the background (Sec. 3.7) by using a CNN to extract features from the background frames and feeding the background features together with the complete facial features into a U-Net with residual blocks, which simultaneously generates a color facial image and an attention mask.

Refer to caption
Figure 3. Illustration of the five input settings; each input setting has an indicator showing whether it contains high-frequency textures. Rd means rendered and vc means vertex color.
Table 1. Cross entropy (CE) losses of five inputs. The numbers in brackets are ranks. Rd means rendered and vc means vertex color.
Groups | Inputs | CE of level \downarrow | CE of type \downarrow
with textures | Frames | 0.0338 (1) | 0.0019 (3)
with textures | Rd with textures | 0.0373 (2) | 0.0016 (2)
with textures | Textures | 0.0383 (3) | 0.0012 (1)
w/o textures | Rd w/o color | 0.2048 (5) | 0.0326 (4)
w/o textures | Rd with vc | 0.1697 (4) | 0.0438 (5)

3.2. Where Are Expressions?

We use a 3DMM (with 35,709 vertices in our experiments), which has been widely used in talking-face video generation (Wen et al., 2020; Yi et al., 2020), to represent the face geometry and texture in an image. Expression is one characteristic of face information and thus can be captured in both geometry and texture. To explore which component captures most of the expression information, we design an experiment. We train classification models with the same architecture (ResNet-50 (He et al., 2016)) for five different input settings (shown in Fig. 3), i.e., video frames (Frames), images rendered without color (Rd w/o color), images rendered with vertex color (Rd with vc), images rendered with texture maps (Rd with textures), and texture maps (Textures). The geometry and vertex color are reconstructed from a frame by WM3DR (Zhang et al., 2021), and the textures are sampled from the frame. In this experiment, we use all frontal-view videos of an actor from the MEAD dataset (Wang et al., 2020), which contain expressions of eight types and three intensity levels. We divide them into training and testing sets randomly using 10-fold cross-validation. For each input setting, two classification models are trained, one for expression types and one for intensity levels.

The five inputs (shown in Fig. 3) can be divided into two groups according to whether they include high-frequency textures: 1) the group with textures (i.e., Frames, Rd with textures and Textures); 2) the group without textures (i.e., Rd w/o color and Rd with vc). The cross entropy loss values for classifying expression types and intensity levels are shown in Table 1. The loss values of the group without textures are significantly higher than those of the other group. This indicates that, given face geometry of moderate resolution (represented by 3DMM), most expression information lies in the textures. Therefore, textures are essential for describing expressions in talking-face video generation. Neural rendering using neural textures has been proposed for generating high-quality talking-face videos from a coarse geometry representation (Thies et al., 2019, 2020). However, these methods use static textures, which are not suitable for generating talking-face videos with time-varying expressions of different types and intensities. In computer graphics, dynamic textures have been proposed for dynamic 3D avatars (Nagano et al., 2018), using different textures for different expressions. Inspired by the success of dynamic textures, we propose a novel method called dynamic neural textures to generate photo-realistic talking-face videos with continuously controllable expressions.
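
As a concrete illustration of this diagnostic experiment, the following sketch builds the two classifiers (one for the 8 expression types, one for the 3 intensity levels) used for each input setting; the ResNet-50 backbone and cross entropy loss follow the text above, while training hyper-parameters are omitted and any specific values would be assumptions.

import torch.nn as nn
import torchvision.models as models

def make_expression_classifier(num_classes: int) -> nn.Module:
    # ResNet-50 backbone with the final fully connected layer resized to the task
    net = models.resnet50(weights=None)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

# One classifier per target, trained separately for each of the five input settings
type_classifier = make_expression_classifier(num_classes=8)   # 8 expression types
level_classifier = make_expression_classifier(num_classes=3)  # 3 intensity levels
criterion = nn.CrossEntropyLoss()  # the CE values reported in Table 1 come from this loss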

3.3. Dynamic Neural Textures

Neural textures (Thies et al., 2019, 2020) are a set of learnable feature maps in texture space, which are used in a neural rendering step. Intuitively, the sampled neural textures can be regarded as feature maps in screen space, and the neural rendering step is a neural network that transforms these feature maps into a photo-realistic frame. The feature maps and the neural rendering network are trained in an end-to-end manner.

Existing neural textures use static texture information during inference, which we refer to as static neural textures. Static neural textures are able to generate talking-face videos by (1) controlling the geometry of the 3D face model, whose parameters are inferred from an input audio sequence, and (2) using the 3D face model to sample the neural textures from texture space into screen space by a UV map. Different from the static neural textures which are fixed during inference, our proposed dynamic neural textures are variable and depend on the input expression. Therefore, dynamic neural textures are more expressive for performing different expression types and intensities.

Interpretation from the perspective of set approximation. A static neural texture generates talking-face videos with a fixed expression. Different static neural textures can therefore represent different expressions, and dynamic neural textures can be regarded as an approximation of a set of static neural textures. Inspired by dynamic convolution kernels (Ye et al., 2022), we can understand dynamic neural textures in the following way. Denote by \mathcal{E} the space of all expressions. For a fixed expression t\in\mathcal{E}, we can learn a static neural texture to represent it. For different expressions sampled in \mathcal{E}, we learn different static neural textures to represent them, which yields a finite set of static neural textures. Our method, on the other hand, infers the dynamic neural textures from different expressions, and these textures play the same role as the set of static neural textures. However, a set of different static neural textures can only represent discretely sampled expressions in \mathcal{E}. In comparison, dynamic neural textures can represent expressions in a continuous space and thus provide a better tool for continuously controlling expressions.

Continuous Intensity Expression Coding. For controlling expressions in talking-face videos continuously, we propose continuous intensity expression coding (CIEC), which is a continuous version of one-hot encoding. To describe C types of expressions (excluding the neutral expression), we define the CIEC as a C-dimensional vector. Similar to one-hot encoding, the CIEC allows at most one non-zero element. Each dimension characterizes an expression type and the zero vector represents the neutral expression. The value of the non-zero element represents the intensity of the corresponding expression. In our experiment, the intensity is normalized to [0,1], where 0 indicates the neutral expression and 1 indicates the highest intensity of that expression. In the MEAD dataset (Wang et al., 2020), there are three levels (level 1, 2 and 3) for each expression type. We map level 1 to 0.33, level 2 to 0.67 and level 3 to 1.
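
To make the coding concrete, below is a minimal sketch of how a CIEC vector could be constructed; the ordering of the seven non-neutral MEAD expression types and the helper name are assumptions for illustration.

import numpy as np

# Hypothetical ordering of the C = 7 non-neutral expression types in MEAD
EXPRESSIONS = ["angry", "contempt", "disgusted", "fear", "happy", "sad", "surprised"]
LEVEL_TO_INTENSITY = {1: 0.33, 2: 0.67, 3: 1.0}  # discrete MEAD levels -> continuous intensity

def ciec(expression: str, intensity: float) -> np.ndarray:
    """Return a C-dimensional vector with at most one non-zero entry in [0, 1]."""
    code = np.zeros(len(EXPRESSIONS), dtype=np.float32)
    if expression != "neutral":  # the zero vector encodes the neutral expression
        code[EXPRESSIONS.index(expression)] = float(np.clip(intensity, 0.0, 1.0))
    return code

print(ciec("happy", LEVEL_TO_INTENSITY[2]))  # -> [0. 0. 0. 0. 0.67 0. 0.]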

We infer dynamic neural textures from the input CIEC vector (as shown in the bottom row of Fig. 2). Since texture space is well aligned and each position in texture space has a fixed (or similar) semantic meaning, e.g., a pixel in the mouth region always corresponds to the mouth, blending a set of neural texture bases (which are learnable parameters) is a direct and effective way to obtain dynamic neural textures. We propose a transcoding network (implemented as a fully connected network) to infer a vector of weights from the CIEC, and use these weights to linearly combine the neural texture bases. We implement the dynamic neural texture bases as a linear layer that follows the transcoding network, so the transcoding network and the texture bases together form a fully connected network. Due to the continuity of the fully connected network, we obtain continuous dynamic neural textures and can thus continuously control the intensity levels of expressions.
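
A minimal sketch of this idea is given below: a small fully connected transcoder maps the CIEC to blending weights, and the texture bases are learnable parameters combined linearly. The number of bases, channels, texture resolution and hidden width are illustrative assumptions, not values from the paper.

import torch
import torch.nn as nn

class DynamicNeuralTexture(nn.Module):
    def __init__(self, ciec_dim=7, num_bases=8, channels=16, res=256):
        super().__init__()
        # transcoding network: CIEC -> blending weights over the texture bases
        self.transcoder = nn.Sequential(
            nn.Linear(ciec_dim, 64), nn.ReLU(),
            nn.Linear(64, num_bases),
        )
        # learnable neural texture bases, one (channels, res, res) feature map each
        self.bases = nn.Parameter(0.01 * torch.randn(num_bases, channels, res, res))

    def forward(self, ciec):                      # ciec: (B, ciec_dim)
        w = self.transcoder(ciec)                 # (B, num_bases) blending weights
        # linear combination of the bases -> one dynamic neural texture per sample
        return torch.einsum("bk,kchw->bchw", w, self.bases)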

Dynamic neural textures are used in the neural rendering submodule. In particular, we use the audio submodule to generate 3D face animation (represented by 3DMM parameters) from the input audio and then use the 3D face geometry with its parameterization (UV coordinates) to sample the dynamic neural textures from texture space into screen space, from which photo-realistic frames are generated.

3.4. Audio Submodule

The audio submodule generates 3D face animation (represented by 3DMM parameters) from the input audio, where the 3D face is used to sample the dynamic neural textures from texture space into screen space (in which photo-realistic frames are generated). We use the following three steps to achieve this (as shown in the middle row of Fig. 2): (1) we use Wav2Lip (Prajwal et al., 2020) to generate talking-face videos from the input audio and the input frames, (2) we use WM3DR (Zhang et al., 2021) to reconstruct 3D faces from the generated talking-face videos and thus obtain a 3D face animation, and (3) we use a decoupling network to transfer the 3D faces with expressions to a neutral face. In these steps, the challenge is how to decouple lip motions from expressions. Our solution is to use the dynamic neural textures to represent and control expressions, and to use the 3D face geometry to represent and control lip motions. Accordingly, the 3D face geometry should not provide expression information, which means the 3D face should be neutral. We therefore use the decoupling network to transfer 3D faces with different expressions to a neutral face, which is used later in the dynamic neural texture submodule.

Decoupling Network. We train a decoupling network to transfer 3D faces with different expressions to neutral faces while keeping other attributes, including lip motion and identity. The task is to project a 3D face into the subspace of neutral faces along a direction that preserves the other attributes. This task is similar to expression editing of 3D faces, but with only one target direction. The 3D faces are represented by 3DMM parameters, and the parameters of our 3DMM form a 257-dimensional vector with five components \{\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\delta},\boldsymbol{\gamma},\mathbf{p}\}\in\mathbb{R}^{257}, where \boldsymbol{\alpha}\in\mathbb{R}^{80} is the identity component, \boldsymbol{\beta}\in\mathbb{R}^{64} is the expression component, \boldsymbol{\delta}\in\mathbb{R}^{80} is the texture component, \boldsymbol{\gamma}\in\mathbb{R}^{27} is the illumination component, and \mathbf{p}\in\mathbb{R}^{6} is the pose component. In the decoupling network, we only consider the expression component and fix the other components. The decoupling network can thus be regarded as a mapping function f_{d}:\mathbb{R}^{64}\to\mathbb{R}^{64}. We implement the decoupling network as a fully connected network due to its good capacity for fitting vector mappings.
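
The following is a minimal sketch of such a mapping over the 64-dimensional 3DMM expression coefficients; the hidden width, depth and activation are assumptions, since the paper only states that f_d is a fully connected network.

import torch.nn as nn

class DecouplingNetwork(nn.Module):
    """f_d: R^64 -> R^64, mapping expression coefficients to the neutral-face subspace."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, dim),
        )

    def forward(self, beta):   # beta: (B, 64) 3DMM expression coefficients
        return self.net(beta)  # neutralized expression coefficients, same shape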

Refer to caption
Figure 4. Illustration of the teeth submodule.

3.5. Texture Mapping

In our pipeline, we use the 3D face geometry with its parameterization (UV coordinates) to sample the dynamic neural textures from texture space into screen space by rasterization. Since all faces represented by 3DMM parameters share the same topology, the natural parameterization of the mean face of our 3DMM can be used for all faces. We use deferred rendering to implement the texture mapping: (1) we rasterize the UV coordinates to obtain a UV map in screen space, and (2) we use the UV map to sample the dynamic neural textures from texture space into screen space.
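
A minimal sketch of step (2) is shown below, assuming the UV map has already been rasterized (e.g., with PyTorch3D) and stores per-pixel UV coordinates in [0,1]; the function name and tensor layout are assumptions.

import torch.nn.functional as F

def sample_texture(neural_texture, uv_map):
    """Deferred sampling of a dynamic neural texture into screen space.

    neural_texture: (B, C, H_t, W_t) feature maps in texture space.
    uv_map:         (B, H, W, 2) rasterized per-pixel UV coordinates in [0, 1]
                    (background pixels can be masked out separately).
    Returns screen-space feature maps of shape (B, C, H, W).
    """
    grid = uv_map * 2.0 - 1.0  # grid_sample expects coordinates in [-1, 1]
    return F.grid_sample(neural_texture, grid, mode="bilinear", align_corners=False)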

3.6. Teeth Submodule

The 3DMM does not contain teeth information, so the sampled neural textures have no information about the teeth. Since teeth are important for a high-quality talking-face video, we propose a teeth submodule to complete the missing texture information in the teeth area. As shown in Fig. 4, we use an affine transformation to focus on the teeth area and a CNN to complete the teeth features. To align the teeth area, the affine transformation translates the center of the teeth area to the center of the image. The completion CNN is a U-Net with residual blocks. Since the teeth are related to the expression, we also input the CIEC to the CNN: we rearrange the CIEC with a fully connected network and concatenate the rearranged CIEC with the image. The completed teeth features are then transformed back by the inverse affine transformation and concatenated with the sampled neural textures.
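
The sketch below illustrates this focus-complete-unfocus flow with PyTorch warping primitives; how the affine matrix theta is obtained from the projected 3DMM mouth region, the completion network, and the rearranged-CIEC feature map are assumptions passed in from outside.

import torch
import torch.nn.functional as F

def teeth_submodule(features, ciec_features, theta, completion_net):
    """features: (B, C, H, W) sampled neural textures; ciec_features: (B, C', H, W)
    rearranged CIEC; theta: (B, 2, 3) affine that centers the teeth area."""
    B, C, H, W = features.shape
    grid = F.affine_grid(theta, (B, C, H, W), align_corners=False)
    focused = F.grid_sample(features, grid, align_corners=False)        # center the teeth area
    completed = completion_net(torch.cat([focused, ciec_features], 1))  # U-Net completion
    # inverse affine (x = A^{-1} y - A^{-1} t) to put the completed features back in place
    A, t = theta[:, :, :2], theta[:, :, 2:]
    A_inv = torch.inverse(A)
    theta_inv = torch.cat([A_inv, -A_inv @ t], dim=2)
    grid_inv = F.affine_grid(theta_inv, completed.shape, align_corners=False)
    teeth_features = F.grid_sample(completed, grid_inv, align_corners=False)
    return torch.cat([features, teeth_features], dim=1)  # complete facial features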

3.7. Neural Rendering Submodule

The neural rendering submodule in our pipeline (shown in the top row of Fig. 2) uses the complete facial features (including the sampled dynamic neural textures and the teeth features) to generate photo-realistic talking-face video frames. Since the 3DMM does not contain hair or background, we make use of the hair and background information in the input frame. We mask out the facial area and teeth area of the input frame to obtain a background frame, where the facial and teeth areas are the projections of the 3DMM reconstructed from the frame. We then use a CNN to extract background features from the masked background frames, concatenate them with the complete facial features and feed them into a U-Net with residual blocks for blending. The output of this step contains a colored facial image and an attention mask, and we use the attention mask to blend the colored facial image with the background frame.

Blending. The attention mask \alpha_{i} is a grayscale image and the facial image F_{i} is a color image, where i indicates the i-th frame. Denoting by B_{i} the i-th background frame, the i-th synthetic frame (final output) I_{i}^{\prime} is calculated as:

(1) I_{i}^{\prime}=B_{i}\otimes(1-\alpha_{i})+F_{i}\otimes\alpha_{i},

where \otimes is pixel-wise multiplication.
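
Eq. (1) amounts to a per-pixel alpha blend; a minimal sketch, assuming the mask is already in [0,1] and broadcast over the color channels:

def blend(background, face, alpha):
    """background, face: (B, 3, H, W) tensors; alpha: (B, 1, H, W) attention mask in [0, 1]."""
    return background * (1.0 - alpha) + face * alpha  # Eq. (1), pixel-wise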

4. Training Details

Multiple network structures are used in our pipeline (shown in Fig. 2). The networks of Wav2Lip (Prajwal et al., 2020) and WM3DR (Zhang et al., 2021) are pre-trained models. We pre-train the decoupling network in the audio submodule as a preprocessing step, so all networks in the audio submodule are pre-trained. We train all the other networks in our pipeline end-to-end, including the networks of the dynamic neural texture submodule, the teeth submodule and the neural rendering submodule.

Refer to caption
Figure 5. Results with different types and sampled intensity levels of expressions. The levels 0.5, 1.5 and 2.5 do not appear in the training set. The results are consistent with the input expression types and intensity levels. More results are presented in the demo video.

4.1. Pre-training the Decoupling Network

We present the decoupling network in Sec. 3.4. This network projects a 3D face to the subspace of neutral faces while keeping the lip motion. It is a mapping f_{d}:\mathbb{R}^{64}\to\mathbb{R}^{64} implemented by a fully connected network. We reconstruct 3D faces from the MEAD dataset (Wang et al., 2020), which contains talking-face videos with different expressions (including neutral expressions), and use them to train the decoupling network. We use three loss terms to train the decoupling network, i.e., an adversarial loss, a neutral loss and a landmarks loss.

Inspired by GANs (Goodfellow et al., 2014), we use a discriminator D to learn the subspace of neutral faces and to ensure that the output of the decoupling network is a neutral face. The adversarial loss term is:

(2) L_{adv}(f_{d},D)=\mathbb{E}_{F_{1}\sim\mathcal{F}_{n},F_{2}\sim\mathcal{F}}\big(\log D(F_{1})+\log(1-D(f_{d}(F_{2})))\big),

where \mathcal{F} is the space of 3D faces and \mathcal{F}_{n} is the subspace of neutral faces.

The decoupling network should not change a neutral face: if the input is a neutral face, the output should be the same as the input. Accordingly, we design the neutral loss term as:

(3) L_{neutral}(f_{d})=\mathbb{E}_{F\sim\mathcal{F}_{n}}\big(\lVert f_{d}(F)-F\rVert_{1}\big).

Furthermore, we use the mouth landmarks of the input and output faces to constrain the lip motions to remain unchanged. Accordingly, we design the landmarks loss as:

(4) L_{landmarks}(f_{d})=\mathbb{E}_{F\sim\mathcal{F}}\big(\lVert LM(f_{d}(F))-LM(F)\rVert_{2}\big),

where LM(\cdot) extracts the mouth landmarks of a 3D face.

The overall loss function is in the following form:

(5) L_{total}(f_{d},D)=L_{adv}(f_{d},D)+\lambda_{1}L_{neutral}(f_{d})+\lambda_{2}L_{landmarks}(f_{d}),

where \lambda_{1} and \lambda_{2} are weights balancing the multiple objectives. For all experiments, we set \lambda_{1}=1 and \lambda_{2}=50000.
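
For clarity, the sketch below assembles Eqs. (2)-(5) for one batch of 3DMM expression coefficients; the discriminator D, the differentiable landmark extractor LM, and the alternating generator/discriminator update scheme are assumptions left outside the function.

import torch
import torch.nn.functional as F

def decoupling_total_loss(f_d, D, LM, beta_neutral, beta_any, lambda1=1.0, lambda2=5e4):
    """beta_neutral: coefficients of neutral faces (F_1); beta_any: coefficients of
    arbitrary faces (F_2). f_d minimizes this total loss while D maximizes the
    adversarial term, as in a standard GAN."""
    out = f_d(beta_any)
    eps = 1e-8  # numerical stability inside the logs
    l_adv = (torch.log(D(beta_neutral) + eps)
             + torch.log(1.0 - D(out) + eps)).mean()           # Eq. (2)
    l_neutral = F.l1_loss(f_d(beta_neutral), beta_neutral)      # Eq. (3)
    l_lmk = (LM(out) - LM(beta_any)).norm(dim=-1).mean()        # Eq. (4)
    return l_adv + lambda1 * l_neutral + lambda2 * l_lmk        # Eq. (5)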

4.2. Training Our Model End-to-end

We train our model end-to-end in a supervised way. Our model generates talking-face videos frame by frame. The input to the model g(B,F,E) contains a background frame B, a 3D face F and a CIEC vector E, where the background frame is obtained from the ground-truth frame (i.e., a frame from the MEAD dataset, not generated from the audio by Wav2Lip), and the 3D face is obtained from the ground-truth frame by 3D face reconstruction (Zhang et al., 2021) followed by the decoupling network. During training, we therefore skip Wav2Lip in the audio submodule and use the ground-truth frame.

Our model generates a photo-realistic frame, which should be the same as the ground-truth frame. To model this constraint, we use a perceptual loss based on a pre-trained VGG-19 network (Simonyan and Zisserman, 2014) \mathcal{V}:

(6) L_{vgg}(g,I_{gt},E_{gt})=\lVert\mathcal{V}(g(B_{gt},f_{d}(F_{gt}),E_{gt}))-\mathcal{V}(I_{gt})\rVert_{1},

where I_{gt} is the ground-truth frame, F_{gt} is the face reconstructed from I_{gt}, f_{d} is the pre-trained decoupling network, B_{gt} is the background of I_{gt}, and E_{gt} is the expression type and intensity label of I_{gt}.
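
A minimal sketch of the perceptual term in Eq. (6) is given below; using torchvision's VGG-19 up to an intermediate activation is an assumption, since the paper only specifies a pre-trained VGG-19.

import torch
import torch.nn as nn
import torchvision.models as models

class VGGPerceptualLoss(nn.Module):
    def __init__(self, layer_index=21):  # features up to relu4_1, an assumed choice
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
        self.features = nn.Sequential(*list(vgg.children())[:layer_index]).eval()
        for p in self.features.parameters():
            p.requires_grad = False  # the VGG feature extractor stays frozen

    def forward(self, generated, target):  # (B, 3, H, W), ImageNet-normalized images
        return torch.abs(self.features(generated) - self.features(target)).mean()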

5. Experiments

5.1. Implementation Details

We implemented our method and the baseline methods with PyTorch (Paszke et al., 2017) and PyTorch3D (Ravi et al., 2020). We trained and tested the model on a server with an NVIDIA Tesla A100 GPU. We use the Adam solver with \beta_{1}=0.9, \beta_{2}=0.999 and a learning rate of 1e-5 to optimize our model.

Training Set. We train a model for a specific person by using a set of videos of this person with different expression types and intensity levels. To this end, we use the MEAD dataset (Wang et al., 2020), which has eight expression types (neutral, angry, contempt, disgusted, fear, happy, sad, surprised) and three intensity levels (level 1, 2, 3) for each actor. To train a model for a person in MEAD, we use all frontal-view video clips of the person, about 600-700 clips and 30-40 minutes in total. The length of each clip ranges from 1 to 7 seconds. Each video clip has an expression type label and an intensity level label.

Running Time. For each batch of 8 frames, our method takes 245 ms on average, i.e., about 31 ms per frame, which is over 30 fps. Therefore, our method can generate talking-face videos with continuously controllable expressions in real time.

5.2. Study on Expression Intensity Levels

We generate results by fixing an input audio segment and varying the expression types and intensity levels, as shown in Fig. 5. We use all 8 expression types in the training set and 6 sampled intensity levels, i.e., 0.5, 1, 1.5, 2, 2.5 and 3, where levels 0.5, 1.5 and 2.5 do not appear in the training set. All the results are consistent with the specified expression types and intensity levels. More results with continuously varying expressions are presented in the demo video. These results demonstrate that our method can continuously control the expressions of the generated results.

Refer to caption
Figure 6. The perceptual scores for expression intensity of randomly sampled videos, where seven colors represent the seven types of expressions and three shapes represent the three identities. The scatter points lying close to y=x shows that our method performs well.
Refer to caption
Figure 7. Results and the qualitative comparison of the ablation study methods, baseline methods, EVP (Ji et al., 2021) and our method. The left includes the ablation study methods and the right includes existing methods, where all the methods use the same set of inputs.
Table 2. Metrics for expressions and lip motions of different methods, i.e., the ablation study methods, the baseline methods, EVP (Ji et al., 2021) and our method. The results show that our method achieves a good balance between the various criteria.
Methods | PSNR \uparrow | SSIM \uparrow | CE of intensity levels \downarrow | CE of expression types \downarrow | LMD \downarrow | CSS \uparrow
Wav2Lip (Prajwal et al., 2020) + StarGAN (Choi et al., 2018) | 27.24 | 0.73 | 0.26 | 0.81 | 0.86 | 5.65
Wav2Lip (Prajwal et al., 2020) + ExprGAN (Ding et al., 2018) | 28.93 | 0.79 | 0.48 | 0.023 | 0.44 | 4.38
EVP (Ji et al., 2021) | 29.53 | 0.71 | 0.27 | 0.36 | 0.49 | 4.16
Ours w/o DNT | 29.64 | 0.89 | 0.034 | 0.23 | 0.45 | 6.07
Ours w/o DeNet | 29.99 | 0.87 | 0.054 | 0.013 | 0.44 | 4.19
Ours w/o TS | 29.71 | 0.90 | 0.034 | 0.27 | 0.43 | 5.79
Ours | 30.39 | 0.91 | 0.029 | 0.18 | 0.39 | 5.81

Perceptual Study. We design a perceptual study to validate whether the expressions in the generated results are consistent with the input expression labels. We ask participants to sort a set of videos with the same expression type according to intensity level. For each expression type, we randomly sample m=4 intensities in [0,1] and use the same audio to generate m videos. Participants drag and sort the videos according to intensity level, from weak to strong. Based on the collected rankings, we define a perceptual intensity score to measure the perceived expression intensity of a video. Denote the sampled m intensities from low to high as x=\{x_{1},\dots,x_{m}\}. The order matrix M_{i} of the i-th participant is an m\times m matrix where each row and column has exactly one element with value 1 and the remaining elements are 0; the element at (i,j) being 1 means that the participant placed the video with intensity x_{i} at rank j. The order matrix is the identity matrix if the participant's order matches the order of the sampled intensities. The average order matrix of all k participants is M=\sum_{i=1}^{k}M_{i}/k, and the perceptual intensity scores of x are s=Mx. The scores equal the sampled intensities if M is the identity matrix, i.e., if all participants' orders match the order of the sampled intensities. We recruited 16 participants, and each of them sorted 21 groups of videos covering 3 identities and 7 types of expressions; each group contains m=4 generated videos with m different input intensity labels. The perceptual intensity scores of the videos are shown in Fig. 6, where the scores are close to the input CIEC, demonstrating that the expression intensities in the generated results are consistent with the input expression intensity labels.
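
A small sketch of the score computation, with rankings encoded so that rankings[p][i] = j means participant p placed the video generated with intensity x[i] at rank j (0-based); this encoding convention is an assumption for illustration.

import numpy as np

def perceptual_intensity_scores(rankings, x):
    """Average the per-participant order matrices and return s = Mx."""
    m = len(x)
    M = np.zeros((m, m))
    for r in rankings:
        Mi = np.zeros((m, m))
        Mi[np.arange(m), r] = 1.0  # one-hot order matrix of this participant
        M += Mi
    M /= len(rankings)             # average order matrix
    return M @ np.asarray(x, dtype=float)  # equals x when M is the identity

# Two participants: a perfect ordering and one with the two strongest videos swapped
print(perceptual_intensity_scores([[0, 1, 2, 3], [0, 1, 3, 2]], [0.1, 0.4, 0.6, 0.9]))
# -> [0.1, 0.4, 0.75, 0.75]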

5.3. Metrics for Evaluation

We use the following metrics to evaluate the generation results of our method and the baseline methods. The video clips in the MEAD dataset (Wang et al., 2020) are used to train and test the methods. We use the first 80% of each video clip as the training set and the last 20% as the testing set for computing the metrics.

Metrics for Video Quality. We use Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Metrics (SSIM) to evaluate the quality of the generated videos.

Metrics for Expressions. We use two classification losses for expressions (i.e., loss of intensity level and loss of expression type) to evaluate the expressions of the generated videos. The two losses are the cross entropy values obtained by the ResNet-50 classifiers (described in Sec. 3.2) for intensity levels and expression types.

Metrics for Lip Synchronization. We use landmarks distance (LMD) (Chen et al., 2018) and confidence score of synchronization (CSS) (Chung and Zisserman, 2016) to measure the lip synchronization of the generated videos, where the averaged CSS of the ground truth in the test set is 6.71.

5.4. Ablation Study

To validate the effectiveness of the dynamic neural textures, the decoupling network and the teeth submodule, we compare the results generated by our method with and without the dynamic neural textures (DNT), the decoupling network (DeNet) or the teeth submodule (TS). The method without dynamic neural textures (ours w/o DNT) uses static neural textures instead of DNT and uses the 3D face geometry to control the expressions; it also skips the decoupling network, because the decoupling network and the dynamic neural textures must work together. The method without the decoupling network (ours w/o DeNet) skips the decoupling network and thus uses both geometry and textures to control the expressions. The method without the teeth submodule (ours w/o TS) skips the teeth submodule and directly inputs the sampled neural textures to the neural rendering submodule. We compare our full method with these variants. The qualitative comparison is shown in Fig. 7 and the supplementary demo video, and the quantitative comparison is summarized in Table 2. The comparisons show that the dynamic neural textures and the decoupling network (i.e., using 3D faces and textures to represent lip motions and expressions, respectively) are essential for generating talking-face videos with continuously controllable expressions. Moreover, the teeth submodule helps generate talking-face videos with a high-quality mouth region.

5.5. Comparison with Existing Methods

Baseline Methods. We propose two baseline methods by combining two steps. The first step generates talking-face videos with a neutral expression and the second step transfers the expression of each frame from neutral to the target expression. We select Wav2Lip (Prajwal et al., 2020) for the first step, which is a state-of-the-art method for generating audio-driven talking-face videos. For the second step, we select StarGAN (Choi et al., 2018) and ExprGAN (Ding et al., 2018), where StarGAN is a state-of-the-art method for multi-domain image translation and ExprGAN is specially designed for expression editing. Note that StarGAN uses discrete intensity levels while ExprGAN uses continuous intensity levels.

EVP (Ji et al., 2021) is a state-of-the-art method for emotional talking-face generation, and we also compare it with our method.

The quantitative comparison is summarized in Table 2. The qualitative comparison is shown in Fig. 7 and the demo video in the supplementary material. The results in the demo video show that the baseline methods generate low-quality results and cannot completely decouple expressions from lip motions and head poses. The results of EVP (Ji et al., 2021) show that it produces inaccurate lip synchronization and cannot decouple expressions from lip motions, because it uses 3D faces to represent both lip motions and expressions. By using 3D faces and DNT, our method can effectively decouple expressions from lip motions, and can control the head pose well through the 3D face.

6. Conclusion

In this paper, we propose dynamic neural textures to generate talking-face videos with continuously controllable expressions in real time. To decouple lip motions from expressions, we propose a decoupling network. We also propose a teeth submodule to complete the missing information in the teeth area, which the 3D face model does not cover. Quantitative and qualitative results show that our method can generate high-quality results with continuously controllable expressions.

References

  • Blanz and Vetter (1999) Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). 187–194.
  • Chen et al. (2018) Lele Chen, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. 2018. Lip Movements Generation at a Glance. In Proceedings of the European Conference on Computer Vision (ECCV). 538–553.
  • Chen et al. (2019) Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7832–7841.
  • Choi et al. (2018) Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8789–8797.
  • Chung et al. (2017) Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that?. In British Machine Vision Conference (BMVC).
  • Chung and Zisserman (2016) Joon Son Chung and Andrew Zisserman. 2016. Out of time: automated lip sync in the wild. In Asian conference on computer vision. Springer, 251–263.
  • Cudeiro et al. (2019) Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. 2019. Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10101–10111.
  • Ding et al. (2018) Hui Ding, Kumar Sricharan, and Rama Chellappa. 2018. Exprgan: Facial expression editing with controllable expression intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Geng et al. (2019) Zhenglin Geng, Chen Cao, and Sergey Tulyakov. 2019. 3d guided fine-grained face manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9821–9830.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS). 2672–2680.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
  • Ji et al. (2021) Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. 2021. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14080–14089.
  • Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410.
  • Ma and Deng (2019) Luming Ma and Zhigang Deng. 2019. Real-Time Facial Expression Transformation for Monocular RGB Video. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 470–481.
  • Mildenhall et al. (2020) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2020. Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision. Springer, 405–421.
  • Nagano et al. (2018) Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. 2018. paGAN: real-time avatars using dynamic textures. ACM Transactions on Graphics (TOG) 37, 6 (2018), 1–12.
  • Olszewski et al. (2017) Kyle Olszewski, Zimo Li, Chao Yang, Yi Zhou, Ronald Yu, Zeng Huang, Sitao Xiang, Shunsuke Saito, Pushmeet Kohli, and Hao Li. 2017. Realistic dynamic facial textures from a single image using gans. In Proceedings of the IEEE International Conference on Computer Vision. 5429–5438.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques.
  • Prajwal et al. (2020) K. R. Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C. V. Jawahar. 2020. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In The 28th ACM International Conference on Multimedia (MM). 484–492.
  • Ravi et al. (2020) Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. 2020. Accelerating 3D Deep Learning with PyTorch3D. arXiv:2007.08501 (2020).
  • Richardson et al. (2021) Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2021. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2287–2296.
  • Shen et al. (2020) Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. 2020. Interfacegan: Interpreting the disentangled face representation learned by gans. IEEE transactions on pattern analysis and machine intelligence (2020).
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Song et al. (2019) Yang Song, Jingwen Zhu, Dawei Li, Andy Wang, and Hairong Qi. 2019. Talking Face Generation by Conditional Recurrent Adversarial Network. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI). 919–925.
  • Tao et al. (2018) Fei Tao, Gang Liu, and Qingen Zhao. 2018. An ensemble framework of voice-based emotion recognition system for films and TV programs. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6209–6213.
  • Tewari et al. (2020) Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. 2020. State of the art on neural rendering. In Computer Graphics Forum, Vol. 39. Wiley Online Library, 701–727.
  • Thies et al. (2020) Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. 2020. Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision. Springer, 716–731.
  • Thies et al. (2019) Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–12.
  • Wang et al. (2020) Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision. Springer, 700–717.
  • Wen et al. (2020) Xin Wen, Miao Wang, Christian Richardt, Ze-Yin Chen, and Shi-Min Hu. 2020. Photorealistic Audio-driven Video Portraits. IEEE Transactions on Visualization and Computer Graphics 26, 12 (2020), 3457–3466.
  • Wu et al. (2021) Haozhe Wu, Jia Jia, Haoyu Wang, Yishun Dou, Chao Duan, and Qingshan Deng. 2021. Imitating arbitrary talking style for realistic audio-driven talking face synthesis. In Proceedings of the 29th ACM International Conference on Multimedia. 1478–1486.
  • Wu and Lu (2020) Rongliang Wu and Shijian Lu. 2020. Leed: Label-free expression editing via disentanglement. In European Conference on Computer Vision. Springer, 781–798.
  • Ye et al. (2022) Zipeng Ye, Mengfei Xia, Ran Yi, Juyong Zhang, Yu-Kun Lai, Xuwei Huang, Guoxin Zhang, and Yong-jin Liu. 2022. Audio-Driven Talking Face Video Generation with Dynamic Convolution Kernels. IEEE Transactions on Multimedia (2022), 1–1. https://doi.org/10.1109/TMM.2022.3142387
  • Yi et al. (2020) Ran Yi, Zipeng Ye, Juyong Zhang, Hujun Bao, and Yong-Jin Liu. 2020. Audio-driven Talking Face Video Generation with Natural Head Pose. arXiv preprint arXiv:2002.10137 (2020).
  • Zhang et al. (2021) Jialiang Zhang, Lixiang Lin, Jianke Zhu, and Steven CH Hoi. 2021. Weakly-Supervised Multi-Face 3D Reconstruction. arXiv preprint arXiv:2101.02000 (2021).
  • Zhou et al. (2019) Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2019. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI). 9299–9306.
  • Zhou et al. (2020) Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. MakeltTalk: speaker-aware talking-head animation. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–15.