
Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis from Monocular Image

Yu Deng  Baoyuan Wang  Heung-Yeung Shum
Xiaobing.AI
Abstract

A key challenge for novel view synthesis of monocular portrait images is 3D consistency under continuous pose variations. Most existing methods rely on 2D generative models, which often leads to obvious 3D inconsistency artifacts. We present a 3D-consistent novel view synthesis approach for monocular portrait images based on a recently proposed 3D-aware GAN, namely Generative Radiance Manifolds (GRAM) [14], which has shown strong 3D consistency in multiview image generation of virtual subjects via the radiance manifolds representation. However, simply learning an encoder to map a real image into the latent space of GRAM can only reconstruct coarse radiance manifolds without faithful fine details, while improving the reconstruction fidelity via instance-specific optimization is time-consuming. We introduce a novel detail manifolds reconstructor to learn 3D-consistent fine details on the radiance manifolds from monocular images, and combine them with the coarse radiance manifolds for high-fidelity reconstruction. The 3D priors derived from the coarse radiance manifolds are used to regulate the learned details to ensure reasonable synthesized results at novel views. Trained on in-the-wild 2D images, our method achieves high-fidelity and 3D-consistent portrait synthesis, largely outperforming the prior art. Project page: https://yudeng.github.io/GRAMInverter

Refer to caption
Figure 1: Novel view synthesis results by our method. It generates novel views of a portrait image with high fidelity and strong 3D consistency via a single forward pass (e.g., see the bangs and wrinkles). Best viewed with zoom-in.

1 Introduction

Synthesizing photorealistic portrait images of a person from an arbitrary viewpoint is an important task that can benefit diverse downstream applications such as virtual avatar creation and immersive online communication. Thanks to the rapid progress of 2D Generative Adversarial Networks (GANs) [19, 28, 29], people can now generate high-quality portraits at desired views given only monocular images as input, via a simple invert-then-edit strategy that combines GAN inversion [1, 63, 49] with latent space editing [52, 22, 13, 61]. However, existing 2D GAN-based methods still fall short in applications that require strict 3D consistency (e.g., VR and AR). Due to the non-physical rendering process of 2D CNN-based generators, their synthesized images under pose changes usually exhibit certain kinds of multiview inconsistency, such as geometry distortions [3, 5] and texture sticking or flickering [27, 66]. These artifacts may not be noticeable in a single static image but are easily perceived by human eyes under continuous image variations.

Recently, an emerging group of 3D-aware GANs [50, 11, 20, 10, 14, 51] target image generation with 3D pose disentanglement. By incorporating the Neural Radiance Field (NeRF) [36] and its variants into the adversarial learning process of GANs, they can produce realistic images with strong 3D consistency across different views, given only a set of monocular images as training data. As a result, 3D-aware GANs have shown greater potential than 2D GANs for pose manipulation of portraits. However, even though 3D-aware GANs are capable of generating 3D-consistent portraits of virtual subjects, leveraging them for real-image pose editing remains challenging. To obtain faithful reconstructions of real images, most existing methods [11, 14, 10, 58, 57, 31, 79] resort to time-consuming, instance-specific optimization to invert the given images into the latent space of a pre-trained 3D-aware GAN, which is hard to scale up. Simply applying encoder-based 3D-aware GAN inversion [8, 57] often fails to preserve fine details in the original image.

In this paper, we propose GRAMInverter, a novel approach for high-fidelity and 3D-consistent novel view synthesis of monocular portraits via a single forward pass. Our method is built upon the recent GRAM [14], which synthesizes high-quality virtual images with strong 3D consistency via the radiance manifolds representation [14]. Nevertheless, GRAM suffers from the same lack-of-fidelity issue when combined with a general encoder-based GAN inversion approach [63]. The main reason is that the obtained semantically meaningful low-dimensional latent code cannot faithfully encode the fine details of the input, as also indicated by recent 2D GAN inversion methods [63, 66].

To tackle this problem, our motivation is to further learn high-frequency details in 3D space and combine them with the coarse radiance manifolds obtained from the general encoder-based inversion of GRAM, so as to achieve faithful reconstruction and 3D-consistent view synthesis. A straightforward way to achieve this is to extract a high-resolution 3D voxel from the input image and combine it with the coarse radiance manifolds. However, the high memory cost of such a voxel is prohibitive on modern GPUs. Taking advantage of the radiance manifolds representation of GRAM, we instead learn high-resolution detail manifolds rather than a memory-consuming 3D voxel. We introduce a novel detail manifolds reconstructor to extract detail manifolds from the input images. It leverages manifold super-resolution [72] to predict high-resolution detail manifolds from a low-resolution feature voxel, which can be effectively achieved by a set of memory-efficient 2D convolution blocks. The resulting high-resolution detail manifolds still maintain strict 3D consistency since they lie in 3D space. We also propose dedicated losses that regulate the detail manifolds via 3D priors derived from the coarse radiance manifolds, to ensure reasonable novel view results.

Another contribution of our method is an efficiency improvement upon the memory- and time-consuming GRAM, without which it would be difficult to integrate GRAM into our GAN inversion framework. We replace the original MLP-based radiance generator [14] in GRAM with a StyleGAN2 [29]-based tri-plane generator proposed by [10]. The efficient GRAM requires only 1/4 of the memory with a 7× speed-up, without sacrificing image generation quality or 3D consistency.

We train our method on the FFHQ dataset [28] and conduct multiple experiments to demonstrate its advantages for pose control of portrait images. Once trained, GRAMInverter takes a monocular image as input and predicts its radiance manifolds representation for novel view synthesis at 3 FPS on a single GPU. The generated novel views well preserve the fine details of the original image with strong 3D consistency, outperforming prior art by a large margin. We believe our method takes a solid step towards efficient 3D-aware content creation for real applications.

2 Related Work

3D-aware generative model.

Learned with monocular 2D images, 3D-aware GANs [39, 50, 40, 11, 74, 14, 10, 20, 43, 72, 56, 51, 18] achieve an explicit disentanglement of camera pose by introducing underlying 3D representations. Earlier works [39, 59, 53] utilize voxel or mesh as the intermediate representation. Later works [50, 11, 74, 43, 10, 14, 81] leverage NeRF [36] and its variants [42, 65, 76, 14] to achieve more strict 3D consistency. Among them, methods that directly render their 3D representations for image synthesis achieve the best multiview consistency [11, 14, 56, 51, 18]. We propose a novel approach for high-quality pose editing of given portraits based on GRAM [14], which is a recent 3D-aware GAN with state-of-the-art multiview consistency.

Refer to caption
Figure 2: Overview of the GRAMInverter. An input portrait image goes through two stages to obtain the final radiance manifolds for novel view synthesis. The first general inversion stage maps the input image to the latent space of a pre-trained efficient GRAM to obtain coarse radiance manifolds. The second detail-specific stage then extracts detail feature manifolds from the input image and combines them with the coarse results for high-fidelity image synthesis. See the text for more details.

GAN inversion.

GAN inversion aims to map a given real image into the latent space of a pre-trained generator for image reconstruction and manipulation. Numerous methods [85, 6, 1, 84, 29, 48, 63, 4, 49, 66, 75] try to find a latent code that faithfully reconstructs the given image while falling inside a semantically meaningful latent space that supports reasonable editing. They either adopt an optimization-based approach [85, 1, 2, 29], introduce an extra image encoder [48, 63, 4], or utilize a hybrid of the two [6, 84]. Nevertheless, recent studies [86, 63, 49, 66] reveal that it is difficult to achieve high-fidelity reconstruction and artifact-free editing simultaneously given only a low-bitrate latent code as the representation. As a result, several methods further fine-tune the pre-trained generator [49] or allow more detailed features from the input images to leak into the generator during inversion [66, 75].

While the above methods target the inversion of 2D GANs, inverting a given image with a 3D-aware GAN shares a similar spirit. An advantage of 3D-aware GAN inversion compared to its 2D counterpart is the natural disentanglement of 3D pose: once inverted, novel view synthesis can be achieved without further latent space exploration [52, 22]. The majority of existing 3D-aware GAN inversion methods [11, 14, 10, 31, 58, 57, 68, 62] leverage optimization-based or hybrid approaches for faithful reconstruction, which are time-consuming and hard to scale up. A recent method [8] explores GAN inversion with a single forward pass of an encoder, yet it struggles to preserve fine image details. Our proposed method is also an encoder-based inversion approach, and it yields high-quality reconstruction and novel view synthesis thanks to our novel design.

Pose editing of monocular portraits.

Editing the camera pose of a monocular portrait for novel view synthesis is a longstanding task that has witnessed the emergence of diverse methods. Some of them [9, 88, 87, 73, 70, 44] achieve pose editing by first conducting 3D reconstruction and then rendering the obtained mesh at novel views. Due to imperfect reconstruction, they often have difficulty handling non-face regions and parts unseen at the input view. Others [69, 54, 55, 41, 78, 67, 82, 47, 16, 17] generate novel views in a face-reenactment paradigm, where warping flows are often learned from video data to transform a source image to a target viewpoint. These methods may encounter geometry distortions at novel views due to the lack of an explicit 3D constraint. More recently, plenty of works [52, 3, 61, 13, 34, 25, 8, 58, 57] have studied pose editing of an image by inverting it into a prior model such as a GAN. With the strong prior embedded in a pre-trained generator, synthesizing novel views of a given portrait can be achieved without any 3D or video data during training. Among them, methods using 3D-aware GANs [14, 10, 8, 58, 57] show better 3D consistency under pose variations. Our method is also based on a 3D-aware GAN and largely improves efficiency, reconstruction quality, and 3D consistency.

3 Approach

Given a monocular portrait image $\hat{I}$, we aim to synthesize its novel views at arbitrary camera viewpoints by leveraging the prior knowledge of a pre-trained 3D-aware GAN, as shown in Fig. 2. To guarantee high-quality and 3D-consistent novel view synthesis, we adopt GRAM [14] as our underlying image generator and design an efficient version of it that requires much less computation and memory, so that it can be incorporated into our whole framework (Sec. 3.1). With the efficient GRAM, we first utilize a general encoder-based GAN inversion to reconstruct coarse radiance manifolds from the input image (Sec. 3.2). We then introduce a detail-specific reconstruction stage that learns high-resolution detail manifolds which cannot be well captured by the coarse result, via our proposed detail manifolds reconstructor (Sec. 3.3). Multiple losses are enforced to regulate the predicted detail manifolds and ensure reasonable synthesized results at novel views, by leveraging 3D priors derived from the coarse radiance manifolds (Sec. 3.4). We describe each part in detail below.

3.1 Efficient Generative Radiance Manifolds

We start with a brief review of the original GRAM proposed in [14]. The core of GRAM is its underlying radiance manifolds representation, which regulates radiance field learning on a set of surface manifolds in 3D space instead of predicting it in the whole volumetric space as done by [36]. The surface manifolds are defined as a set of iso-surfaces $\{\mathcal{S}_i\}$ in a 3D scalar field represented by a light-weight MLP called the manifold predictor $\mathcal{M}$:

\mathcal{M}:\bm{x}\in\mathbb{R}^{3}\rightarrow s\in\mathbb{R},\quad \mathcal{S}_{i}=\{\bm{x}\,|\,\mathcal{M}(\bm{x})=l_{i}\},   (1)

where $\{l_i\}$ are $N$ predefined scalar levels. During image generation, only the intersections $\{\bm{x}_i\}$ between a viewing ray $\bm{r}$ and the surface manifolds are sent into an MLP-based radiance generator $\Phi$ for radiance prediction:

\Phi:(\bm{z},\bm{x}_{i})\in\mathbb{R}^{d_{z}}\times\mathbb{R}^{3}\rightarrow(\bm{c},\alpha)\in\mathbb{R}^{4},   (2)

where $\bm{z}$ is a latent code determining the radiance, $\bm{c}$ is the color, and $\alpha$ is the occupancy. The final color of each ray can be computed via manifold rendering [14, 83]:

C(\bm{r})=\sum_{i=1}^{N}\prod_{j<i}\big(1-\alpha(\bm{x}_{j})\big)\,\alpha(\bm{x}_{i})\,\bm{c}(\bm{x}_{i}).   (3)
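To make the compositing in Eq. (3) concrete, below is a minimal PyTorch sketch of manifold rendering; the tensor layout (rays × $N$ front-to-back sorted intersections) is our assumption, not part of the original implementation.

```python
import torch

def render_manifolds(color: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Composite per-intersection radiance into pixel colors, Eq. (3).

    color: (R, N, 3) RGB at the N ray-manifold intersections, sorted front-to-back.
    alpha: (R, N)    occupancy at the same intersections.
    Returns (R, 3) composited ray colors.
    """
    # Accumulated transparency T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1.
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1
    )
    weights = transmittance * alpha                      # (R, N)
    return (weights.unsqueeze(-1) * color).sum(dim=1)    # (R, 3)


# Toy usage: 4 rays, 24 intersections (e.g., one per surface manifold).
rgb = torch.rand(4, 24, 3)
occ = torch.rand(4, 24)
print(render_manifolds(rgb, occ).shape)  # torch.Size([4, 3])
```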

The high memory cost of GRAM lies in its MLP-based radiance generator $\Phi$, which requires millions of forward steps to generate a single image. Inspired by the recent EG3D [10], we substitute the original radiance generator with a tri-plane generator [10] based on the StyleGAN2 structure [29]. Its efficient coarse-to-fine structure reduces memory and computation costs by a large margin. With the new radiance generator, the color and occupancy of points on the surface manifolds are obtained by first generating tri-plane features with a 2D CNN $\Psi$, and then conducting tri-plane sampling and sending the sampled features into a small MLP-based decoder $m$ as done in [10]. Note that although we adopt the tri-plane generator from EG3D to improve efficiency, we do not use its 2D super-resolution module but keep strictly to the radiance manifolds representation. This maintains the strong 3D consistency brought by manifold rendering. In addition, we calculate ray-manifold intersections at 1/4 of the final image resolution to further speed up image generation (see Sec. A.3 for details).
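The tri-plane query described above can be sketched as follows; the feature dimension, plane resolution, decoder width, and the choice to sum the three plane features are illustrative assumptions, with the StyleGAN2-based generator $\Psi$ treated as a black box that outputs the three feature planes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneDecoder(nn.Module):
    """Samples tri-plane features at 3D points and decodes color + occupancy."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Softplus(),
                                 nn.Linear(hidden, 4))  # (r, g, b, alpha)

    def forward(self, planes, pts):
        # planes: (3, C, H, W) features for the xy, xz, yz planes (output of Psi).
        # pts:    (M, 3) query points on the surface manifolds, in [-1, 1]^3.
        xy, xz, yz = pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]
        feats = 0
        for plane, coords in zip(planes, (xy, xz, yz)):
            grid = coords.view(1, -1, 1, 2)                       # (1, M, 1, 2)
            sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                    mode='bilinear', align_corners=False)
            feats = feats + sampled.squeeze(0).squeeze(-1).t()    # (M, C), summed over planes
        out = self.mlp(feats)
        color = torch.sigmoid(out[:, :3])
        alpha = torch.sigmoid(out[:, 3])
        return color, alpha

# Toy usage with random planes standing in for Psi(w+).
planes = torch.randn(3, 32, 64, 64)
pts = torch.rand(1000, 3) * 2 - 1
color, alpha = TriPlaneDecoder()(planes, pts)
```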

The efficient GRAM serves as a strong prior for generating realistic multiview images of virtual subjects. By combining it with our two-stage manifolds reconstruction method, we achieve high-quality novel view synthesis of real portraits, as described in the following sections.

3.2 General Inversion Stage

Given a pre-trained efficient GRAM, trained under a typical 3D-aware GAN paradigm [14], we first introduce an image inverter $E_w$ that maps a given image to the latent space of the efficient GRAM, as shown in Fig. 2. Inspired by previous StyleGAN-based inversion methods [1, 63, 66], we invert the given image into a latent code $\bm{w}^+=[\bm{w}_1,\bm{w}_2,...,\bm{w}_L]$ in the $\mathcal{W}^+$ space [1] of the tri-plane generator $\Psi$ for a proper trade-off between inversion fidelity and pose editing quality, where $L$ is the number of layers in $\Psi$'s synthesis sub-network. We leverage the e4e encoder [63] as the backbone of $E_w$. Given $\bm{w}^+$, we obtain coarse radiance manifolds $\Phi(\bm{w}^+,\{\mathcal{S}_i\})=m\circ\Psi(\bm{w}^+,\{\mathcal{S}_i\})$ via Eqs. (1) and (2), and further obtain a coarse inversion image $I_w$ by rendering the radiance manifolds at the input viewpoint $\hat{\bm{\theta}}$ via Eq. (3), where $\hat{\bm{\theta}}$ is estimated by an off-the-shelf 3D face reconstruction method [15].

We fix the pre-trained efficient GRAM and learn the image inverter $E_w$ following the training process of [63], except that we replace the adversarial loss in [63] with a simple L2 loss between the predicted $\bm{w}^+$ and the average latent code of the $\mathcal{W}^+$ space. To further improve reconstruction fidelity, we then fix the trained $E_w$ and finetune the efficient GRAM via the pivotal tuning strategy [49] using all training images. Details of the above training processes can be found in Sec. A.4.
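The forward path of the general inversion stage can be summarized by the sketch below; all module names and signatures are placeholders for the components described above rather than an exact API.

```python
import torch

@torch.no_grad()
def coarse_inversion(image, encoder, triplane_generator, decoder,
                     manifold_points, render_fn):
    """General inversion stage: image -> w+ -> coarse radiance manifolds -> I_w.

    image:              (1, 3, H, W) aligned input portrait.
    encoder:            e4e-style inverter E_w producing a W+ code.
    triplane_generator: StyleGAN2-based generator Psi, conditioned on w+.
    decoder:            small MLP m mapping sampled features to (color, occupancy).
    manifold_points:    (R, N, 3) ray-manifold intersections for the estimated pose theta_hat.
    render_fn:          manifold rendering, Eq. (3).
    """
    w_plus = encoder(image)                       # (1, L, D) latent code in W+ (D assumed, e.g. 512)
    planes = triplane_generator(w_plus)           # (3, C, H_p, W_p) tri-plane features
    pts = manifold_points.reshape(-1, 3)
    color, alpha = decoder(planes, pts)           # per-intersection radiance
    R, N, _ = manifold_points.shape
    coarse_image = render_fn(color.view(R, N, 3), alpha.view(R, N))
    return w_plus, coarse_image
```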

After training, the general inversion stage can already synthesize reasonable multiview images of the input, yet it cannot faithfully preserve fine details, making the inverted result look less like the original image (see Fig. 6). Therefore, we introduce a detail-specific stage for faithful detail reconstruction, as described below.

3.3 Detail-Specific Reconstruction Stage

The detail-specific stage aims to extract fine details from the input image that cannot be well described by the coarse radiance manifolds, so as to improve reconstruction fidelity. The intuition is to learn high-resolution details in 3D space so that their combination with the coarse radiance manifolds still maintains strong 3D consistency under pose variations. To achieve this goal, we design a detail manifolds reconstructor consisting of two modules: a detail encoder $E_{detail}$ that extracts a low-resolution feature voxel from the input image, and a super-resolution module $\mathcal{U}$ that predicts high-resolution detail manifolds from the low-resolution voxel.

Refer to caption
Figure 3: The detail encoder $E_{detail}$ extracts a camera space feature voxel from an input image. The voxel corresponds to a quadrangular frustum in 3D space bounded by the near and far planes of the camera and the outermost viewing rays.

Specifically, $E_{detail}$ takes an image $\hat{I}$ as input (in practice, we find that concatenating $\hat{I}$ with an extra difference map $\Delta=\hat{I}-I_{w}$ as input yields better reconstruction quality, where $I_w$ is the inversion image obtained by the general inversion stage) and predicts a camera space feature voxel as shown in Fig. 3:

E_{detail}:\hat{I}\in\mathbb{R}^{H\times W\times 3}\rightarrow V\in\mathbb{R}^{H_{lr}\times W_{lr}\times D_{lr}\times d_{V}},   (4)

where $d_V$ is the feature dimension. We implement $E_{detail}$ as a 3D U-Net with skip connections to extract both global geometry structure and local fine textures from the input image. We refer readers to Sec. A.2 for the detailed network structure.

The feature voxel is defined in camera space instead of world space (i.e., the space where the tri-plane features of the efficient GRAM are defined), as it is easier for $E_{detail}$ to extract image-aligned features than to learn transformed world-space features (see Sec. 4.3). Given the feature voxel $V$, we obtain the corresponding feature $f^{lr}\in\mathbb{R}^{d_V}$ for a point $\bm{x}\in\mathbb{R}^{3}$ in world space via:

f^{lr}={\rm grid\_sample}\big(V,\,{\rm world2cam}(\bm{x})\big),   (5)

where ${\rm grid\_sample}$ is a tri-linear interpolation function, and ${\rm world2cam}$ is the transformation from world space to camera space.
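Eq. (5) maps directly onto PyTorch's grid_sample. In the sketch below, the world-to-camera transform and the normalization of camera-space points into the frustum-aligned voxel (including the depth range and the unit-focal pinhole assumption) are simplified assumptions about the actual world2cam mapping, not the paper's exact camera model.

```python
import torch
import torch.nn.functional as F

def sample_detail_features(voxel, pts_world, R, t, depth_range=(0.88, 1.12)):
    """Query low-resolution detail features f^lr for world-space points, Eq. (5).

    voxel:     (1, C, D, H, W) camera-space feature voxel V from E_detail.
    pts_world: (M, 3) world-space points on the surface manifolds.
    R, t:      world-to-camera rotation (3, 3) and translation (3,).
    Returns    (M, C) tri-linearly interpolated features.
    """
    pts_cam = pts_world @ R.t() + t                     # world2cam (camera assumed to look down +z)
    z = pts_cam[:, 2:3].clamp(min=1e-6)
    # Normalize to [-1, 1]^3: x, y by perspective division (unit focal scaling assumed),
    # z linearly within the assumed near/far depth range.
    xy = pts_cam[:, :2] / z
    z_norm = 2 * (z - depth_range[0]) / (depth_range[1] - depth_range[0]) - 1
    grid = torch.cat([xy, z_norm], dim=1).view(1, -1, 1, 1, 3)   # (1, M, 1, 1, 3)
    feats = F.grid_sample(voxel, grid, mode='bilinear', align_corners=False)
    return feats.view(voxel.shape[1], -1).t()           # (M, C)
```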

Nevertheless, since $V$ is a low-resolution voxel, directly combining it with the feature manifolds obtained from the general inversion stage leads to a blurry inversion result, while predicting a high-resolution voxel (e.g., $256^3$) instead causes unaffordable memory cost. Inspired by [72], we take advantage of our radiance manifolds representation to obtain high-resolution detail manifolds from the low-resolution voxel via manifold super-resolution. Specifically, we first obtain low-resolution detail manifolds $f^{lr}(\{\mathcal{S}_i\})$ by querying features from $V$ via Eq. (5) for a low-resolution grid of points on the surface manifolds $\{\mathcal{S}_i\}$. We then flatten each manifold into a low-resolution feature map $F_i^{lr}$ and send it to the super-resolution module $\mathcal{U}$ to obtain a high-resolution feature map $F_i^{hr}=\mathcal{U}(F_i^{lr})$, where $\mathcal{U}$ is a simple 2D CNN with 4 convolution blocks and 2 bilinear upsampling blocks. Finally, we obtain the high-resolution detail manifolds $f^{hr}(\{\mathcal{S}_i\})$ by re-projecting each flattened feature map $F_i^{hr}$ back onto the surface manifolds. Since the super-resolution is conducted on 3D-space surface manifolds, 3D consistency across different views is naturally maintained. Note that although manifold super-resolution was proposed in [72], that work does not use it in a reconstruction scenario but to generate random fine details; by contrast, we utilize it for faithful reconstruction of the high-frequency details in the original image.
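A minimal sketch of the manifold super-resolution step: each surface manifold's flattened low-resolution feature map is upsampled 4× by a small 2D CNN with 4 convolutions and 2 bilinear upsampling steps, matching the description above; channel widths and activations are assumptions.

```python
import torch
import torch.nn as nn

class ManifoldSuperRes(nn.Module):
    """2D CNN U: flattened low-res manifold features -> high-res features (4x)."""
    def __init__(self, channels=32):
        super().__init__()
        def up_block(c):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                nn.Conv2d(c, c, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(c, c, 3, padding=1), nn.LeakyReLU(0.2))
        # Two upsampling blocks, four convolutions in total.
        self.net = nn.Sequential(up_block(channels), up_block(channels))

    def forward(self, f_lr):
        # f_lr: (num_manifolds, C, h, w), one flattened feature map per surface manifold.
        return self.net(f_lr)            # (num_manifolds, C, 4h, 4w)

# Toy usage: 24 manifolds, 32-dim features, 64x64 -> 256x256.
f_lr = torch.randn(24, 32, 64, 64)
f_hr = ManifoldSuperRes()(f_lr)
print(f_hr.shape)  # torch.Size([24, 32, 256, 256])
```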

Given $f^{hr}(\{\mathcal{S}_i\})$ from the detail-specific stage, we add it to the coarse feature manifolds $\Psi(\bm{w}^+,\{\mathcal{S}_i\})$ from the general inversion stage, and send each feature point on the manifolds to the MLP-based decoder $m$ to obtain the final radiance manifolds, as shown in Fig. 2. The final inversion image $I$ can then be obtained similarly via manifold rendering at the input view $\hat{\bm{\theta}}$. Novel views can also be easily generated by rendering with an arbitrary camera pose $\bm{\theta}$.
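Putting the two stages together, the fusion and final rendering can be sketched as follows; decoder_mlp and render_fn are the same placeholder components used in the earlier sketches.

```python
import torch

def fuse_and_render(coarse_feats, detail_feats, decoder_mlp, render_fn):
    """Combine coarse and detail feature manifolds and render the final image.

    coarse_feats: (R, N, C) features sampled from Psi(w+) at the ray-manifold intersections.
    detail_feats: (R, N, C) high-resolution detail features f^hr at the same points.
    decoder_mlp:  the small MLP m mapping a C-dim feature to (r, g, b, alpha).
    render_fn:    manifold rendering, Eq. (3).
    """
    fused = coarse_feats + detail_feats          # element-wise addition on the manifolds
    out = decoder_mlp(fused)                     # (R, N, 4)
    color = torch.sigmoid(out[..., :3])
    alpha = torch.sigmoid(out[..., 3])
    return render_fn(color, alpha)               # (R, 3) final pixel colors
```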

3.4 Detail Manifolds Learning

We fix the image inverter $E_w$ and the efficient GRAM from the general inversion stage, and learn the detail manifolds reconstructor with the following losses.

Refer to caption
Figure 4: Visualization of the novel view regularization.

Image reconstruction loss.

A multi-level reconstruction loss is applied between the final inversion image $I$ and the input image $\hat{I}$:

\mathcal{L}_{r}=\|I-\hat{I}\|^{2}+{\rm LPIPS}(I,\hat{I})+\big(1-\langle f_{id}(I),f_{id}(\hat{I})\rangle\big),   (6)

where ${\rm LPIPS}(\cdot,\cdot)$ is the perceptual loss defined by [80], and $f_{id}$ is a pre-trained face recognition network [12].
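A sketch of Eq. (6) using the lpips package for the perceptual term; id_net stands in for a pre-trained ArcFace-style recognition network [12], whose exact interface we do not assume.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual distance [80]

def reconstruction_loss(pred, target, id_net):
    """Eq. (6): pixel L2 + LPIPS + identity cosine term. Images in [-1, 1], (B, 3, H, W)."""
    l2 = F.mse_loss(pred, target)
    perceptual = lpips_fn(pred, target).mean()
    # Identity term: 1 - cosine similarity of face recognition embeddings.
    e_pred = F.normalize(id_net(pred), dim=-1)
    e_target = F.normalize(id_net(target), dim=-1)
    identity = (1.0 - (e_pred * e_target).sum(dim=-1)).mean()
    return l2 + perceptual + identity
```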

Novel view regularization.

The reconstruction loss guarantees a faithful inversion result at the input viewpoint, yet artifacts can still occur when rendering the radiance manifolds at other views. We therefore design a regularization term to ensure reasonable novel view synthesis results:

\mathcal{L}_{nv}={\rm LPIPS}\big(M\odot I(\bm{\theta}),\,M\odot I_{w}(\bm{\theta})\big),   (7)

where $I(\bm{\theta})$ and $I_w(\bm{\theta})$ are the final and coarse inversion images rendered at a novel view $\bm{\theta}$, respectively, $\odot$ is element-wise multiplication, and $M$ is a binary mask:

M(u,v)=\mathbb{I}\big(-\bm{r}(\hat{\bm{\theta}})\cdot N_{w}(\bm{\theta})(u,v)<\tau\big).   (8)

Here $(u,v)$ is the image-space coordinate, $\mathbb{I}$ is the indicator function, $\bm{r}(\hat{\bm{\theta}})$ is the camera look-at direction of the input image $\hat{I}$, $N_w(\bm{\theta})$ is the surface normal map of $I_w(\bm{\theta})$, and $\tau$ is a scalar threshold. The intuition behind this regularization is that, at novel views, the details of regions unobserved in the input image should stay close to the coarse inversion result, since the coarse inversion image is more reasonable at new views thanks to the priors from the pre-trained GRAM. The normal map $N_w(\bm{\theta})$ in Eq. (8) can be efficiently calculated via:

N_{w}(\bm{\theta})(u,v)=-\frac{1}{\eta}\sum_{i=1}^{N}T(\alpha(\bm{x}_{i}))\,\alpha(\bm{x}_{i})\,\frac{\partial\alpha(\bm{x}_{i})}{\partial\bm{x}_{i}},   (9)

where $\bm{x}_i$ are the intersections along the ray corresponding to $(u,v)$, $T(\alpha(\bm{x}_i))=\prod_{j<i}(1-\alpha(\bm{x}_j))$ is the accumulated transparency, and $\eta$ is a normalizing scalar. The partial gradient $\partial\alpha(\bm{x}_i)/\partial\bm{x}_i$ can be easily computed by backpropagating through the MLP-based decoder $m$. Visualizations of the normal map $N_w(\bm{\theta})$ and the binary mask $M$ are shown in Fig. 4.
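The normal map of Eq. (9), the visibility mask of Eq. (8), and the masked LPIPS term of Eq. (7) can be sketched as follows; tensor shapes, the normalization constant $\eta$, and the use of autograd through the occupancy decoder are our assumptions about a reasonable implementation.

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net='vgg')

def normal_map(pts, alpha_fn, eta=1.0):
    """Eq. (9): surface normals from occupancy gradients along each ray.

    pts:      (R, N, 3) ray-manifold intersections at the novel view theta.
    alpha_fn: maps (R, N, 3) points to occupancy (R, N) via the MLP decoder m.
    """
    pts = pts.detach().requires_grad_(True)
    alpha = alpha_fn(pts)                                                 # (R, N)
    grad = torch.autograd.grad(alpha.sum(), pts, create_graph=False)[0]   # d(alpha)/dx, (R, N, 3)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1 - alpha[:, :-1]], dim=1), dim=1)
    weights = (trans * alpha).unsqueeze(-1)                               # (R, N, 1)
    return -(weights * grad).sum(dim=1) / eta                             # (R, 3)

def novel_view_loss(I_final, I_coarse, normals, lookat_dir, tau=0.0):
    """Eqs. (7)-(8): masked LPIPS between final and coarse renderings at a novel view.

    I_final, I_coarse: (1, 3, H, W); normals: (H*W, 3); lookat_dir: (3,) of the input view.
    """
    mask = ((-lookat_dir * normals).sum(dim=-1) < tau).float()            # Eq. (8)
    M = mask.view(1, 1, I_final.shape[2], I_final.shape[3])
    return lpips_fn(M * I_final, M * I_coarse).mean()
```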

Depth regularization.

We further enforce a depth regularization on the high-resolution detail manifolds $f^{hr}(\{\mathcal{S}_i\})$ to ensure that details are predicted near the geometry surface:

\mathcal{L}_{depth}=\begin{cases}\lambda\|f^{hr}(\bm{x}_{i})\|^{2}, & |z(\bm{x}_{i})-z_{surf}|>\epsilon\\ 0, & |z(\bm{x}_{i})-z_{surf}|\leq\epsilon\end{cases},   (10)

where $\bm{x}_i$ are the intersections along the viewing rays at the input viewpoint $\hat{\bm{\theta}}$, $z(\bm{x}_i)$ is the depth of $\bm{x}_i$, $z_{surf}=\sum_{i=1}^{N}T(\alpha(\bm{x}_i))\alpha(\bm{x}_i)z(\bm{x}_i)$ is the depth of the approximated surface, and $\epsilon$ is a threshold. This regularization ensures correct parallax for the learned details (see Sec. 4.3).
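A sketch of Eq. (10); the weight $\lambda$ and threshold $\epsilon$ are hyperparameters whose values are not given in the text.

```python
import torch

def depth_regularization(f_hr, alpha, z, lam=1.0, eps=0.05):
    """Eq. (10): penalize detail features predicted far from the surface.

    f_hr:  (R, N, C) high-resolution detail features at the intersections (input view).
    alpha: (R, N)    occupancy at the intersections, sorted front-to-back.
    z:     (R, N)    depth of each intersection along its ray.
    """
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1 - alpha[:, :-1]], dim=1), dim=1)
    weights = trans * alpha
    z_surf = (weights * z).sum(dim=1, keepdim=True)      # approximated surface depth
    off_surface = (z - z_surf).abs() > eps               # points away from the surface
    penalty = (f_hr ** 2).sum(dim=-1)                    # ||f^hr(x_i)||^2
    return lam * (penalty * off_surface.float()).mean()
```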

4 Experiments

Implementation details.

We train our method on the FFHQ [28] dataset at 256×256 resolution and test it on the CelebA-HQ [26] dataset. All images are pre-processed following the procedure in [14]. The camera pose of each input image is estimated by the face reconstruction method of [15]. We train our models on 4 NVIDIA Tesla V100 GPUs with 32GB memory. The whole training process takes around 6 days: training the efficient GRAM takes 2 days, training the image inverter and finetuning the efficient GRAM take 2 days and 1 day respectively, and training the detail manifolds reconstructor takes 1 day. See Sec. A for more details.

4.1 Novel View Synthesis Results

Figure 1 shows the novel view synthesis results of our method given different portrait images. Our method well preserves fine details (e.g., hair bangs, wrinkles, moles) of the input images and produces 3D-consistent novel views. The whole inversion and novel view synthesis process runs at 3 FPS on a V100 GPU without specialized acceleration, which largely improves efficiency over previous optimization-based 3D-aware GAN inversions. With the manifold caching technique in [14], we can further increase the free-view rendering speed to 180 FPS. More results are in Sec. B.

4.2 Comparison with Prior Arts

Comparison with GRAM.

We first compare our efficient version of GRAM with the original one [14]. We measure image generation quality by the Fréchet Inception Distance (FID) [23] between 20K randomly generated images and 20K sampled real images. 3D consistency is measured by the reconstruction quality of NeuS [65] (i.e., PSNRmv and SSIMmv) on multiview images of 50 generated instances following [72]. As shown in Tab. 1, our efficient GRAM largely reduces the memory cost and increases the inference speed over the original, without sacrificing image generation quality or 3D consistency, by introducing the StyleGAN2-based radiance generator and the efficient intersection calculation strategy. This improvement enables our GRAMInverter; otherwise it would be difficult, if not impossible, to leverage the memory-consuming GRAM for encoder-based GAN inversion. We also list the performance of the state-of-the-art EG3D [10] as a reference. Although EG3D has better image quality, it sacrifices 3D consistency, which we argue is a key factor for 3D-aware generation.

Table 1: Comparison between efficient GRAM and GRAM [14]. *: Inference on a Tesla V100 GPU with a batch size of 1.
Methods | Memory* ↓ | FPS* ↑ | FID ↓ | PSNRmv ↑ | SSIMmv ↑
EG3D | 2.8G | 20 | 6.02 | 34.0 | 0.928
GRAM | 12G | 2 | 15.0 | 38.0 | 0.966
Ours | 3.3G | 14 | 14.2 | 37.6 | 0.969
Table 2: Quantitative comparison with existing portrait editing methods. See the text for details.
(Inversion fidelity: PSNR, SSIM, LPIPS, ID, FID. Novel view quality: IDnv, FIDnv. 3D consistency: PSNRmv, SSIMmv.)
Methods | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ID ↑ | FID ↓ | IDnv ↑ | FIDnv ↓ | PSNRmv ↑ | SSIMmv ↑
PIRenderer [47] | - | - | - | - | - | 0.476 | 42.64 | 36.72 | 0.958
Face-vid2vid [67] | - | - | - | - | - | 0.416 | 41.76 | 36.10 | 0.942
e4e [63] + InterFaceGAN [52] | 19.23 | 0.451 | 0.213 | 0.706 | 35.92 | 0.489 | 38.04 | 34.29 | 0.909
HFGI [66] + InterFaceGAN [52] | 22.30 | 0.579 | 0.135 | 0.827 | 26.41 | 0.516 | 45.23 | 33.99 | 0.917
HFGI [66] + StyleHEAT [77] | 22.30 | 0.579 | 0.135 | 0.827 | 26.41 | 0.457 | 58.33 | 35.93 | 0.951
pix2NeRF [8] | 16.95 | 0.394 | 0.452 | 0.466 | 108.3 | 0.378 | 115.6 | 50.66 | 0.997
IDE-3D (encoder) [57] | 16.73 | 0.382 | 0.290 | 0.393 | 51.51 | 0.324 | 47.56 | 37.57 | 0.950
Ours | 21.51 | 0.650 | 0.127 | 0.936 | 28.17 | 0.635 | 36.02 | 39.53 | 0.974
Refer to caption
Figure 5: Pose editing comparison. Texture images with smoothly tilted strips indicate better 3D consistency. Best viewed with zoom-in.
Refer to caption
Figure 6: Visual comparison between different alternatives. Our final solution yields the best result. Best viewed with zoom-in.
Table 3: Ablation study of our proposed framework.
Methods | PSNR ↑ | LPIPS ↓ | ID ↑ | FIDnv ↓
General | 17.68 | 0.265 | 0.648 | 44.43
General - pretrain | 17.45 | 0.280 | 0.472 | 52.12
General + finetune | 18.00 | 0.254 | 0.678 | 43.46
Detail (Ours) | 21.51 | 0.127 | 0.936 | 36.02
Detail - world2cam | 19.96 | 0.171 | 0.908 | 38.74
Detail - superres | 21.06 | 0.165 | 0.926 | 38.45
Detail - $\mathcal{L}_{nv}$ | 22.32 | 0.106 | 0.949 | 36.49
Detail - mask | 19.08 | 0.211 | 0.840 | 43.25
Detail - $\mathcal{L}_{depth}$ | 23.68 | 0.094 | 0.960 | 35.60

Comparison with pose editing methods.

We compare with existing methods that achieve 3D pose editing of a given portrait via a single forward pass, including 2D GAN inversion-based methods: e4e [63]+InterFaceGAN [52], HFGI [66]+InterFaceGAN, and HFGI+StyleHEAT [77]; 3D-aware GAN inversion-based methods: pix2NeRF [8] and IDE-3D [57]; and face reenactment methods: PIRenderer [47] and Face-vid2vid [67].

We first evaluate inversion fidelity among the GAN inversion-based methods (Tab. 2). We report PSNR, SSIM, LPIPS, identity similarity (ID) measured by the cosine distance of face recognition features [64], and FID. All metrics are calculated between the first 1K images of CelebA-HQ and their corresponding inverted results. Since different methods may generate images at different resolutions and alignments, we pre-process all results following [14] and resize them to 256×256 for a fair comparison. As shown, our method significantly outperforms other 3D-aware GAN inversion methods across all metrics. We also exceed the StyleGAN2-based inversion method e4e and achieve results comparable to the state-of-the-art HFGI.

We further compare our method with other approaches on pose editing of portrait images. We generate novel views (see Sec. A.7 for details) of the 1K test images using different methods and evaluate their identity similarity and FID to the original inputs in Tab. 2. Higher IDnv and lower FIDnv indicate that a method better preserves identity and image quality while changing the camera pose. Our method yields the best result among all competitors. We also surpass PIRenderer and Face-vid2vid, which require video data for training, whereas ours is trained merely on monocular in-the-wild images. Figure 5 shows a visual comparison.

Finally, we measure the 3D consistency of all methods under continuous variation of the camera viewpoint. Following [72], for each method we generate 30 images under different views for 50 test instances in CelebA-HQ, and measure the multiview reconstruction quality of NeuS on them (i.e., PSNRmv and SSIMmv). In theory, better 3D consistency across different views reduces the learning difficulty of NeuS, leading to higher PSNR and SSIM. Table 2 shows that our method has the second-best 3D consistency among all methods, while the best one (i.e., pix2NeRF) generates over-smooth images of low quality, as shown in Fig. 5 and indicated by its high FID score in Tab. 2. Our method outperforms IDE-3D, which utilizes a 2D super-resolution module in its 3D-aware GAN that lowers 3D consistency to some extent. Nevertheless, all 3D-aware GAN-based methods yield better 3D consistency than the 2D methods, indicating the importance of 3D-aware GANs for pose editing of images. A further comparison with the full pipeline of IDE-3D, which includes an extra optimization step, is in Sec. B.2.

Figure 5 further shows the visual comparison of 3D consistency, where we draw the stacked texture image of a fixed horizontal line segment during continuous camera movement following [72]. Methods with strong 3D consistency will result in texture images with smoothly tilted strips, while methods with low 3D consistency produce twisted textures (i.e. geometry distortions and texture flickering issues) or vertical lines (i.e. texture sticking issues). Our method clearly produces a more reasonable texture image compared to the others. See Sec. B.2 for more results.

4.3 Ablation Study

We conduct an ablation study to validate the efficacy of our proposed framework and report the results in Tab. 3 and Fig. 6. All metrics are calculated similarly as in Sec. 4.2.

Inversion stage.

Table 3 shows the performance of different stages. General stands for the general inversion stage without a finetuned generator. General - pretrain denotes learning the efficient GRAM together with $E_w$ instead of pre-training it via the 3D-aware GAN framework. General + finetune denotes finetuning the pre-trained efficient GRAM as described in Sec. 3.2, and Detail denotes our final approach with detail-specific reconstruction. As shown, the general stage alone cannot produce faithful reconstruction results, whether or not the efficient GRAM is further finetuned. By contrast, introducing the detail-specific reconstruction significantly improves inversion fidelity over the previous stage without sacrificing novel view quality. Learning the efficient GRAM together with the encoder without pre-training leads to a significant performance drop, indicating the importance of leveraging the prior knowledge of a pre-trained 3D-aware GAN.

Network architecture.

We ablate the architecture of the detail manifolds reconstructor. As shown in Tab. 3 and Fig. 6, learning the low-resolution detail voxel in world space instead of camera space (- world2cam) harms reconstruction fidelity, and removing the super-resolution module for high-resolution manifold prediction (- superres) leads to blurry inversion results.

Regularization.

We further validate the proposed regularizations for detail manifolds learning. As shown in Tab. 3 and Fig. 6, removing the novel view regularization (- $\mathcal{L}_{nv}$) causes obvious artifacts at new views and increases FIDnv, though it improves reconstruction quality at the input viewpoint. Simply enforcing $\mathcal{L}_{nv}$ without the normal-aware mask (- mask) damages fine texture preservation in visible regions. Finally, although learning without the depth regularization (- $\mathcal{L}_{depth}$) yields better metrics, we find that it cannot well preserve certain fine details at novel views due to the incorrect parallax caused by depth errors (e.g., the mole in Fig. 6). We conjecture that such dynamic artifacts can hardly be captured by the current feature extractor [60] used for FID computation.

5 Conclusions

We presented GRAMInverter, a novel approach for high-fidelity and 3D-consistent portrait synthesis from monocular images via a single forward pass. The core idea is to learn a detail manifolds reconstructor that predicts 3D-consistent fine details on the radiance manifolds from an input image, and to combine them with the coarse radiance manifolds obtained via an encoder-based inversion of the pre-trained GRAM. Extensive experiments demonstrate our superior results over previous works. We believe our method paves a new way for efficient 3D-aware portrait creation.

Limitations and future works.

Our GRAMInverter has several limitations. Based on the radiance manifolds representation, it produces layered artifacts at large viewing angles. It cannot well handle occlusions caused by hands and other accessories. Its performance is also affected by the training data and it may produce inferior results for out-of-distribution inputs. Besides, it does not support editing attributes beyond the camera viewpoint, as done in previous 2D GAN inversions. Better 3D representations and inversion strategies should be explored to alleviate these problems.

References

  • [1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In IEEE/CVF International Conference on Computer Vision, pages 4432–4441, 2019.
  • [2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8296–8305, 2020.
  • [3] Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics, 40(3):1–21, 2021.
  • [4] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6711–6720, 2021.
  • [5] Yuval Alaluf, Or Patashnik, Zongze Wu, Asif Zamir, Eli Shechtman, Dani Lischinski, and Daniel Cohen-Or. Third time’s the charm? image and video editing with stylegan3. arXiv preprint arXiv:2201.13433, 2022.
  • [6] David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a gan cannot generate. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4502–4511, 2019.
  • [7] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In IEEE International Conference on Computer Vision, pages 1021–1030, 2017.
  • [8] Shengqu Cai, Anton Obukhov, Dengxin Dai, and Luc Van Gool. Pix2nerf: Unsupervised conditional p-gan for single image to neural radiance fields translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3981–3990, 2022.
  • [9] Menglei Chai, Linjie Luo, Kalyan Sunkavalli, Nathan Carr, Sunil Hadap, and Kun Zhou. High-quality hair modeling from a single portrait photo. ACM Transactions on Graphics (TOG), 34(6):1–10, 2015.
  • [10] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • [11] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021.
  • [12] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
  • [13] Yu Deng, Jiaolong Yang, Dong Chen, Fang Wen, and Xin Tong. Disentangled and controllable face image generation via 3D imitative-contrastive learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5154–5163, 2020.
  • [14] Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In IEEE Computer Vision and Pattern Recognition, 2022.
  • [15] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [16] Michail Christos Doukas, Stefanos Zafeiriou, and Viktoriia Sharmanska. Headgan: One-shot neural head synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14398–14407, 2021.
  • [17] Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. Megaportraits: One-shot megapixel neural head avatars. arXiv preprint arXiv:2207.07621, 2022.
  • [18] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. In Advances In Neural Information Processing Systems, 2022.
  • [19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  • [20] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
  • [21] Yuan-Chen Guo, Di Kang, Linchao Bao, Yu He, and Song-Hai Zhang. Nerfren: Neural radiance fields with reflections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18409–18418, 2022.
  • [22] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in Neural Information Processing Systems, 33:9841–9850, 2020.
  • [23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [24] Fangzhou Hong, Zhaoxi Chen, Yushi Lan, Liang Pan, and Ziwei Liu. Eva3d: Compositional 3d human generation from 2d image collections. arXiv preprint arXiv:2210.04888, 2022.
  • [25] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20374–20384, 2022.
  • [26] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [27] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34:852–863, 2021.
  • [28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • [29] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
  • [30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • [31] Connor Z Lin, David B Lindell, Eric R Chan, and Gordon Wetzstein. 3d gan inversion for controllable portrait image animation. arXiv preprint arXiv:2203.13441, 2022.
  • [32] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [33] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • [34] BR Mallikarjun, Ayush Tewari, Abdallah Dib, Tim Weyrich, Bernd Bickel, Hans Peter Seidel, Hanspeter Pfister, Wojciech Matusik, Louis Chevallier, Mohamed A Elgharib, et al. Photoapp: Photorealistic appearance editing of head portraits. ACM Transactions on Graphics, 40(4), 2021.
  • [35] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pages 3481–3490, 2018.
  • [36] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020.
  • [37] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989, 2022.
  • [38] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
  • [39] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. Hologan: Unsupervised learning of 3D representations from natural images. In IEEE/CVF International Conference on Computer Vision, pages 7588–7597, 2019.
  • [40] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
  • [41] Yuval Nirkin, Yosi Keller, and Tal Hassner. Fsgan: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7184–7193, 2019.
  • [42] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In IEEE/CVF International Conference on Computer Vision, 2021.
  • [43] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. Stylesdf: High-resolution 3d-consistent image and geometry generation. In IEEE/CVF International Conference on Computer Vision, 2022.
  • [44] Xingang Pan, Bo Dai, Ziwei Liu, Chen Change Loy, and Ping Luo. Do 2d gans know 3d shape? unsupervised 3d shape reconstruction from 2d image gans. arXiv preprint arXiv:2011.00844, 2020.
  • [45] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
  • [46] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 296–301, 2009.
  • [47] Yurui Ren, Ge Li, Yuanqi Chen, Thomas H Li, and Shan Liu. Pirenderer: Controllable portrait image generation via semantic neural rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13759–13768, 2021.
  • [48] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2287–2296, 2021.
  • [49] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
  • [50] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. In Advances in Neural Information Processing Systems, 2020.
  • [51] Katja Schwarz, Axel Sauer, Michael Niemeyer, Yiyi Liao, and Andreas Geiger. Voxgraf: Fast 3d-aware image synthesis with sparse voxel grids. arXiv preprint arXiv:2206.07695, 2022.
  • [52] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In IEEE/CVF conference on computer vision and pattern recognition, pages 9243–9252, 2020.
  • [53] Yichun Shi, Divyansh Aggarwal, and Anil K Jain. Lifting 2d stylegan for 3d-aware face generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6258–6266, 2021.
  • [54] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2377–2386, 2019.
  • [55] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in Neural Information Processing Systems, 32, 2019.
  • [56] Ivan Skorokhodov, Sergey Tulyakov, Yiqun Wang, and Peter Wonka. Epigraf: Rethinking training of 3d gans. arXiv preprint arXiv:2206.10535, 2022.
  • [57] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. arXiv preprint arXiv:2205.15517, 2022.
  • [58] Jingxiang Sun, Xuan Wang, Yong Zhang, Xiaoyu Li, Qi Zhang, Yebin Liu, and Jue Wang. Fenerf: Face editing in neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7672–7682, 2022.
  • [59] Attila Szabó, Givi Meishvili, and Paolo Favaro. Unsupervised generative 3d shape learning from natural images. arXiv preprint arXiv:1910.00287, 2019.
  • [60] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [61] Ayush Tewari, Mohamed Elgharib, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. Pie: Portrait image embedding for semantic control. ACM Transactions on Graphics (TOG), 39(6):1–14, 2020.
  • [62] Ayush Tewari, Xingang Pan, Ohad Fried, Maneesh Agrawala, Christian Theobalt, et al. Disentangled3d: Learning a 3d generative model with disentangled geometry and appearance from monocular images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1516–1525, 2022.
  • [63] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG), 40(4):1–14, 2021.
  • [64] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018.
  • [65] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
  • [66] Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity gan inversion for image attribute editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11379–11388, 2022.
  • [67] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021.
  • [68] Ziyu Wang, Yu Deng, Jiaolong Yang, Jingyi Yu, and Xin Tong. Generative deformable radiance fields for disentangled image synthesis of topology-varying objects. arXiv preprint arXiv:2209.04183, 2022.
  • [69] Olivia Wiles, A Koepke, and Andrew Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European conference on computer vision (ECCV), pages 670–686, 2018.
  • [70] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–10, 2020.
  • [71] Yue Wu, Yu Deng, Jiaolong Yang, Fangyun Wei, Qifeng Chen, and Xin Tong. Anifacegan: Animatable 3d-aware face image generation for video avatars. arXiv preprint arXiv:2210.06465, 2022.
  • [72] Jianfeng Xiang, Jiaolong Yang, Yu Deng, and Xin Tong. Gram-hd: 3d-consistent image generation at high resolution with generative radiance manifolds. arXiv preprint arXiv:2206.07255, 2022.
  • [73] Sicheng Xu, Jiaolong Yang, Dong Chen, Fang Wen, Yu Deng, Yunde Jia, and Xin Tong. Deep 3d portrait from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7710–7720, 2020.
  • [74] Xudong Xu, Xingang Pan, Dahua Lin, and Bo Dai. Generative occupancy fields for 3d surface-aware image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
  • [75] Xu Yao, Alasdair Newson, Yann Gousseau, and Pierre Hellier. Feature-style encoder for style-based gan inversion. arXiv e-prints, pages arXiv–2202, 2022.
  • [76] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  • [77] Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. Styleheat: One-shot high-resolution editable talking face generation via pretrained stylegan. arXiv preprint arXiv:2203.04036, 2022.
  • [78] Egor Zakharov, Aleksei Ivakhnenko, Aliaksandra Shysheya, and Victor Lempitsky. Fast bi-layer neural synthesis of one-shot realistic head avatars. In European Conference on Computer Vision, pages 524–540. Springer, 2020.
  • [79] Jichao Zhang, Aliaksandr Siarohin, Yahui Liu, Hao Tang, Nicu Sebe, and Wei Wang. Training and tuning generative neural radiance fields for attribute-conditional 3d-aware face generation. arXiv preprint arXiv:2208.12550, 2022.
  • [80] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  • [81] Xiaoming Zhao, Fangchang Ma, David Güera, Zhile Ren, Alexander G Schwing, and Alex Colburn. Generative multiplane images: Making a 2d gan 3d-aware. arXiv preprint arXiv:2207.10642, 2022.
  • [82] Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4176–4186, 2021.
  • [83] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics, 37(4):1–12, 2018.
  • [84] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. In European conference on computer vision, pages 592–608. Springer, 2020.
  • [85] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European conference on computer vision, pages 597–613. Springer, 2016.
  • [86] Peihao Zhu, Rameen Abdal, Yipeng Qin, John Femiani, and Peter Wonka. Improved stylegan embedding: Where are the good latents? arXiv preprint arXiv:2012.09036, 2020.
  • [87] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face alignment across large poses: A 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 146–155, 2016.
  • [88] Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li. High-fidelity pose and expression normalization for face recognition in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 787–796, 2015.

Supplementary Material

Appendix A More Implementation Details

A.1 Data Preparation

We align all images in FFHQ [28] and CelebA-HQ [26] using detected facial landmarks, following [14]. Specifically, we first use an off-the-shelf landmark detector [7] to extract 5 facial landmarks for each image. Then, we resize and crop the images by solving a least-squares problem between the detected landmarks and canonical 3D landmarks from the average shape of a 3D face model [46]. Camera poses of the images are extracted using a 3D face reconstruction model [15].
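As an illustration of this alignment step, the sketch below solves for an isotropic scale and translation between detected landmarks and template positions in the least-squares sense; the coordinates, names, and the restriction to scale-plus-translation (with the template positions standing in for the projected canonical 3D landmarks) are our own assumptions and not the released preprocessing code.

```python
import numpy as np

def solve_scale_translation(src_pts, dst_pts):
    """Least-squares scale s and translation t such that s * src + t ~ dst.

    src_pts, dst_pts: (N, 2) arrays of corresponding 2D landmark positions.
    Returns (s, t) with t of shape (2,).
    """
    src_mean, dst_mean = src_pts.mean(axis=0), dst_pts.mean(axis=0)
    src_c, dst_c = src_pts - src_mean, dst_pts - dst_mean
    # Closed-form least-squares solution for an isotropic scale (no rotation).
    s = (src_c * dst_c).sum() / (src_c ** 2).sum()
    t = dst_mean - s * src_mean
    return s, t

# Hypothetical usage: map 5 detected landmarks onto template positions, then
# derive the resize/crop parameters of the original image from (s, t).
detected = np.array([[120., 150.], [200., 148.], [160., 200.],
                     [130., 250.], [195., 252.]])   # illustrative detections
template = np.array([[ 85.,  95.], [170.,  95.], [128., 140.],
                     [ 95., 190.], [162., 190.]])   # illustrative template
s, t = solve_scale_translation(detected, template)
print("scale:", s, "translation:", t)
```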

A.2 Network Structure

The structure of the detail manifolds reconstructor is shown in Fig. V. It consists of two sub-networks: a detail encoder $E_{detail}$ and a super-resolution module $\mathcal{U}$.

Detail encoder $E_{detail}$.

The detail encoder receives the concatenation of the input image $\hat{I}$ and the difference map $\hat{I}-I_{w}$, and predicts a low-resolution feature voxel $V$ (see Fig. V (a)). It consists of several 2D downsampling blocks, followed by a 2D convolution that projects the low-resolution 2D feature map into a 3D voxel. A 3D U-Net structure with skip connections is then applied, followed by several 3D resblocks to obtain the final low-resolution feature voxel $V$.
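To make this layout concrete, here is a heavily simplified PyTorch sketch of $E_{detail}$; the channel counts, block counts, voxel depth, and activation choices are placeholders of our own and do not reflect the actual configuration.

```python
import torch
import torch.nn as nn

class DetailEncoderSketch(nn.Module):
    """Schematic E_detail: 2D downsampling -> 2D-to-3D projection
    -> 3D U-Net with a skip -> 3D resblock -> low-resolution feature voxel V."""

    def __init__(self, in_ch=6, feat_2d=64, voxel_ch=32, voxel_depth=16):
        super().__init__()
        # 2D downsampling blocks on the concatenated input (image + difference map).
        self.down2d = nn.Sequential(
            nn.Conv2d(in_ch, feat_2d, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_2d, feat_2d, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(feat_2d, feat_2d, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # 2D convolution that reshapes the 2D feature map into a 3D feature voxel.
        self.to_voxel = nn.Conv2d(feat_2d, voxel_ch * voxel_depth, 1)
        self.voxel_ch, self.voxel_depth = voxel_ch, voxel_depth
        # Stand-ins for the 3D U-Net and the 3D residual blocks.
        self.unet3d = nn.Sequential(
            nn.Conv3d(voxel_ch, voxel_ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(voxel_ch, voxel_ch, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.res3d = nn.Conv3d(voxel_ch, voxel_ch, 3, padding=1)

    def forward(self, image, diff_map):
        x = torch.cat([image, diff_map], dim=1)            # (B, 6, H, W)
        f2d = self.down2d(x)                               # (B, C2d, H/8, W/8)
        v = self.to_voxel(f2d)                             # (B, C*D, H/8, W/8)
        b, _, h, w = v.shape
        v = v.view(b, self.voxel_ch, self.voxel_depth, h, w)  # (B, C, D, H/8, W/8)
        v = v + self.unet3d(v)                             # simplified skip connection
        return v + self.res3d(v)                           # low-res feature voxel V
```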

Super-resolution module $\mathcal{U}$.

The super-resolution module takes the low-resolution feature map $F_{i}^{lr}$ derived from each low-resolution feature manifold as input, and produces a high-resolution feature map $F_{i}^{hr}$ which is later projected back onto the surface manifolds (see Fig. V (b)). It consists of two upsampling blocks, each containing two 2D convolutions.
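A corresponding sketch of $\mathcal{U}$ is given below; the channel count and the use of bilinear upsampling are illustrative assumptions on our part.

```python
import torch.nn as nn

class SuperResolutionSketch(nn.Module):
    """Schematic U: two x2 upsampling blocks, each with two 2D convolutions,
    mapping a low-resolution manifold feature map F_i^lr to a high-resolution F_i^hr."""

    def __init__(self, ch=32):
        super().__init__()
        def block():
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            )
        self.blocks = nn.Sequential(block(), block())

    def forward(self, f_lr):
        return self.blocks(f_lr)   # 4x spatial upsampling in total
```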

A.3 Intersection Calculation Details

The efficient GRAM requires calculating ray-manifold intersections for manifold rendering following [14]. To accelerate this process, we calculate ray-manifold intersections at 1/4 of the final image resolution, as depicted in Fig. I.

Specifically, we first generate viewing rays at a resolution of 64×64, and calculate their intersections with each surface manifold produced by the manifold predictor $\mathcal{M}$ following [14]. Then, we upsample the obtained low-resolution intersection grid on each manifold via bilinear interpolation to obtain dense intersections at the final resolution (i.e. 256×256). In this way, only the low-resolution intersections obtained in the first step require forwarding the manifold predictor, which largely reduces the computation cost compared to directly calculating intersections at the final resolution. Since the learned surface manifolds for human faces have small curvature and are nearly planar in local regions (see the illustration in [14]), the intersections obtained via bilinear upsampling are close to the ground truth and have a minor influence on the final synthesis results.
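As a small illustration, the bilinear upsampling step can be written as a single interpolation call; storing the low-resolution intersections of one manifold as a (B, 3, 64, 64) tensor of 3D points is our own layout choice for this sketch.

```python
import torch
import torch.nn.functional as F

def upsample_intersections(low_res_pts, out_size=256):
    """Bilinearly upsample a grid of ray-manifold intersection points.

    low_res_pts: (B, 3, 64, 64) tensor of xyz intersections computed with the
                 manifold predictor at 1/4 of the final image resolution.
    Returns a (B, 3, out_size, out_size) tensor of dense approximate intersections.
    """
    return F.interpolate(low_res_pts, size=(out_size, out_size),
                         mode='bilinear', align_corners=True)

# Only the 64x64 grid requires forwarding the manifold predictor; the dense grid
# is obtained purely by interpolation, which is reasonable because the learned
# surface manifolds are nearly planar within local regions.
dense = upsample_intersections(torch.randn(1, 3, 64, 64))
print(dense.shape)  # torch.Size([1, 3, 256, 256])
```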

Refer to caption
Figure I: Illustration of our intersection calculation. Ray-manifold intersections are first calculated at 1/4 of the final image resolution (blue dots), and then go through bilinear upsampling to obtain dense intersections at the final resolution (gray dots).

A.4 More Training Details

Pretraining efficient GRAM.

We follow [14] to train the efficient GRAM on the FFHQ dataset at a resolution of 256×256. During training, we randomly sample a latent code $\bm{z}$ from the normal distribution and a camera pose $\bm{\theta}$ from the estimated distribution of the training data, and send them to the efficient GRAM to generate corresponding images. The manifold predictor $\mathcal{M}$ is initialized following [14]. The tri-plane generator $\Psi$ and the MLP-based decoder $m$ are initialized following [10]. The synthesized images, together with randomly sampled real images from the training data, are sent to an extra discriminator [29] for loss computation. We adopt the non-saturating GAN loss with R1 regularization [35] to learn the efficient GRAM and the discriminator. We also enforce the pose regularization in [14] to ensure that the learned geometries are reasonable.

We use the Adam optimizer [30] with $\beta_{1}=0$ and $\beta_{2}=0.99$. The learning rates are set to 2.5e-3 for the tri-plane generator and the MLP-based decoder, 2e-5 for the manifold predictor, and 1e-3 for the discriminator. The loss weights for the R1 regularization and the pose regularization are set to 10 and 30, respectively. We train the efficient GRAM for 150K iterations with a batch size of 32, which took 2 days on 4 NVIDIA Tesla V100 GPUs with 32GB memory.
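For reference, below is a minimal sketch of the adversarial objective of this stage, i.e. the non-saturating GAN loss with R1 regularization; the generator/discriminator interfaces, the batch reduction, and the handling of the conventional 1/2 factor in R1 are our own assumptions, and the pose regularization of [14] is omitted.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_images, fake_images, lambda_r1=10.0):
    """Non-saturating GAN loss for D plus the R1 gradient penalty on real images."""
    real_images = real_images.detach().requires_grad_(True)
    real_logits = D(real_images)
    fake_logits = D(fake_images.detach())
    adv = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    # R1 regularization: squared gradient norm of D at real samples.
    grad, = torch.autograd.grad(real_logits.sum(), real_images, create_graph=True)
    r1 = grad.flatten(1).pow(2).sum(dim=1).mean()
    # Whether the conventional 1/2 factor is folded into the weight of 10 is an
    # implementation detail we assume here.
    return adv + 0.5 * lambda_r1 * r1

def generator_loss(D, fake_images):
    """Non-saturating GAN loss for the generator (the efficient GRAM)."""
    return F.softplus(-D(fake_images)).mean()
```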

General inversion stage.

During this stage, we fix the pre-trained efficient GRAM and learn the image inverter $E_{w}$. The image inverter is initialized following [63]. We adopt the multi-level reconstruction loss $\mathcal{L}_{r}$ in Eq. (6) for faithful image reconstruction, and the minimal variation loss $\mathcal{L}_{d-reg}$ proposed in [63] to ensure that the vectors $\bm{w}_{i}$, $i=1,\dots,L$, in the predicted $\bm{w}^{+}$ latent code are close to each other. Besides, we apply a regularization on the predicted latent code $\bm{w}^{+}$ to ensure that it falls in a semantically meaningful latent space:

$\mathcal{L}_{\bm{w}^{+}} = \lVert \bm{w}^{+} - \bar{\bm{w}}^{+} \rVert^{2}$,   (I)

where $\bar{\bm{w}}^{+}$ is the average latent code of the $\mathcal{W}^{+}$ space, computed using 10K randomly sampled $\bm{z}$.

We use the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. The initial learning rate for the image inverter is 3e-4, and decreases to 6e-5 after 100K iterations. The balancing weights for the three terms in $\mathcal{L}_{r}$ are set to 1e-2, 1, and 4e-2, respectively. The weights for $\mathcal{L}_{d-reg}$ and $\mathcal{L}_{\bm{w}^{+}}$ are 1e-3 and 1e-4, respectively. The network is trained for 150K iterations with a batch size of 32, which took 2 days on 4 NVIDIA Tesla V100 GPUs.
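As a compact illustration of how these terms could be combined, the sketch below implements Eq. (I) and a weighted sum of the stage's losses with the weights listed above; the reduction used in Eq. (I) and the assumption that the three weights multiply the three components of $\mathcal{L}_{r}$ are ours.

```python
import torch

def wplus_regularization(w_plus, w_bar_plus):
    """Eq. (I): squared distance between the predicted w+ code and the average
    W+ latent code (summed over dimensions, averaged over the batch; this
    particular reduction is our own assumption)."""
    return (w_plus - w_bar_plus).pow(2).sum(dim=-1).mean()

def general_inversion_loss(lr_terms, l_dreg, l_wplus,
                           lr_weights=(1e-2, 1.0, 4e-2),
                           w_dreg=1e-3, w_wplus=1e-4):
    """Weighted sum of the general inversion stage losses. `lr_terms` stands for
    the three components of the multi-level reconstruction loss L_r (Eq. (6) of
    the main paper), assumed to be computed elsewhere."""
    l_r = sum(w * t for w, t in zip(lr_weights, lr_terms))
    return l_r + w_dreg * l_dreg + w_wplus * l_wplus

# Hypothetical usage with random tensors standing in for the real quantities.
w_plus = torch.randn(4, 14, 512)     # illustrative batch of w+ codes
w_bar = torch.zeros(14, 512)         # illustrative average latent code
print(wplus_regularization(w_plus, w_bar))
```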

General inversion stage - finetuning.

After the image inverter is learned, we further finetune the efficient GRAM for better image reconstruction. We only finetune the tri-plane generator $\Psi$ and the MLP-based decoder $m$, and leave the manifold predictor $\mathcal{M}$ unchanged. The two networks are learned following [49]. Specifically, we adopt the multi-level reconstruction loss $\mathcal{L}_{r}$ in Eq. (6). In addition, we leverage the locality regularization $\mathcal{L}_{R}$ proposed in [49] to ensure that images synthesized by the finetuned efficient GRAM stay close to those of the original one at randomly sampled locations in the latent space. Different from [49], we finetune the efficient GRAM on the whole training set instead of using only a single image.

During training, the Adam optimizer is also applied with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. The learning rate for the efficient GRAM is 1e-3. The balancing weights for $\mathcal{L}_{r}$ are similar to those in the above stage, and the weight for $\mathcal{L}_{R}$ is set to 0.5. We use a batch size of 16 and finetune the efficient GRAM for 100K iterations. The whole process took 1 day on 4 NVIDIA Tesla V100 GPUs.

Detail-specific reconstruction stage.

Finally, we fix the image inverter as well as the efficient GRAM learned from the previous stages, and learn the detail manifolds reconstructor via the losses proposed in Sec. 3.4. We set the balancing weights for $\mathcal{L}_{r}$ following the above stages. The loss weights for the novel view regularization $\mathcal{L}_{nv}$ and the depth regularization $\mathcal{L}_{depth}$ are set to 4 and 2e-4, respectively. We use the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$, and set the learning rate for the detail manifolds reconstructor to 3e-4. We use a batch size of 8 and train the whole pipeline for 60K iterations. It took 1 day on 4 NVIDIA Tesla V100 GPUs.

A.5 Baseline Implementation Details

PIRenderer.

PIRenderer [47] is a face-reenactment method learned on video data. It leverages a 3D Morphable Model (3DMM) [46, 15] as guidance and learns a 2D warping flow to drive a source image with target motions. It supports intuitive control of a given image by directly modifying the input 3DMM parameters of the network. We use the officially released code and model trained on the VoxCeleb [38] dataset (https://github.com/RenYurui/PIRender) in our experiments, and achieve pose editing of an image by modifying the input 3D pose parameters.

Face-vid2vid.

Face-vid2vid [67] is also a face-reenactment method learned on video data. It extracts 3D keypoints from an image and derives 3D warping flows from them to transfer the 3D features of a source image to a target position. By using a single frame as both the source and the target, and applying a 3D rotation to the extracted 3D keypoints of the target, it can also achieve intuitive control over the 3D pose of a given portrait image. Since the official code and model are unavailable, we use a re-implementation trained on the VoxCeleb dataset (https://github.com/zhanglonghao1992/One-Shot_Free-View_Neural_Talking_Head_Synthesis) for our experiments.

e4e.

e4e [63] is an encoder-based StyleGAN2 [29] inversion method. Its encoder adopts a feature-pyramid structure [32] and predicts a vector in StyleGAN2's $\mathcal{W}^{+}$ space for a given image. By editing the predicted latent code towards a certain direction and sending the modified code to the pre-trained StyleGAN2, it can achieve pose control of the given image. We use the officially released code and model trained on FFHQ (https://github.com/omertov/encoder4editing) to carry out our experiments.

HFGI.

HFGI [66] is also an encoder-based StyleGAN2 inversion method. It builds upon e4e and extracts extra feature maps from a given image as substitutions for the original feature maps within StyleGAN2. As a result, it achieves more faithful inversion results compared to e4e. We use its officially released code and model trained on FFHQ (https://github.com/Tengfei-Wang/HFGI) in our experiments.

InterFaceGAN.

InterFaceGAN [52] is a latent space editing method for StyleGAN [28] and StyleGAN2. It learns binary classification boundaries of multiple image attributes for latent vectors in StyleGAN's $\mathcal{W}^{+}$ space. By moving a latent code along the direction perpendicular to such a boundary, it can change the corresponding attribute of a synthesized image. It can also be combined with GAN inversion methods such as e4e and HFGI for real image editing. Since the officially released model only contains boundaries for StyleGAN, we use the model provided by [66] for StyleGAN2-based pose editing.

StyleHEAT.

StyleHEAT [77] is also a latent space editing method for StyleGAN2, targeting talking head synthesis. Different from InterFaceGAN, it modifies the latent feature maps within StyleGAN2 instead of the $\mathcal{W}^{+}$ space latent vector. It learns 2D warping flows for the feature maps with the help of video data and the guidance of a 3DMM, similarly to PIRenderer. It also supports direct 3D pose editing of a given image by modifying the 3D pose parameters used for generating the warping flow. We use the officially released code and model trained on VoxCeleb (https://github.com/FeiiYin/StyleHEAT/) in our experiments.

pix2NeRF.

pix2NeRF [8] is an encoder-based 3D-aware GAN inversion method based on pi-GAN [11]. It simultaneously learns an image encoder and a 3D-aware image generator to reconstruct a NeRF [36] representation from a given image for novel view synthesis. We adopt its officially released code (https://github.com/primecai/Pix2NeRF) in our experiments. Since the official model is trained on the CelebA [33] dataset, which overlaps with our test set, we re-train it on the FFHQ dataset for a fair comparison.

IDE-3D.

IDE-3D [57] is a 3D-aware GAN aiming for 3D-consistent portrait synthesis with interactive control. Its image generator is based on the tri-plane generator and the 2D super-resolution module proposed in [10]. It also achieves disentangled editing of real images by introducing a hybrid GAN inversion scheme, where it first learns an image encoder to map a given image into the latent space of the pre-trained generator, and then leverages instance-specific optimization [49] to further improve the reconstruction fidelity. We use its official model trained on FFHQ (https://github.com/MrTornado24/IDE-3D) in our experiments. Moreover, we only use the inversion results from its encoder instead of those from the further optimization step for a fair comparison. A comparison with its full pipeline, including the optimization step, is provided in Tab. IV and Fig. II.

A.6 Visualization Details

Visualization results in this paper are rendered with yaw angles ranging from $-0.4$ rad to $0.4$ rad. The pitch angles are identical to those of the input images, which are estimated via the face reconstruction method of [15]. The roll angles are set to zero.

A.7 Novel View Experiment Details

We describe more details about the novel view synthesis comparison presented in Sec. 4.2. Specifically, we generate novel views of the first 1K test images in the CelebA-HQ dataset using different methods to calculate the metrics (i.e. ID$_{nv}$ and FID$_{nv}$ in Tab. 2). We randomly sample the yaw angle within the range $[-0.5,-0.4]\cup[0.4,0.5]$ rad, and set the pitch and roll identical to those of the original input. To ensure that the novel view images have a large pose difference from the input, we multiply the sampled yaw angle by $-1$ if its absolute difference with that of the input is smaller than 0.3. For all methods, we use the same 1K sampled yaw angles to generate novel view images for a fair comparison.
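The sampling rule above can be summarized by the short sketch below; the function name and RNG handling are our own.

```python
import numpy as np

def sample_novel_view_yaw(input_yaw, rng=None):
    """Sample a novel-view yaw (rad) from [-0.5, -0.4] U [0.4, 0.5] and flip its
    sign if it lies within 0.3 rad of the input yaw, as described above."""
    if rng is None:
        rng = np.random.default_rng()
    yaw = rng.uniform(0.4, 0.5) * rng.choice([-1.0, 1.0])
    if abs(yaw - input_yaw) < 0.3:
        yaw = -yaw
    return yaw

# The same 1K sampled yaws are reused for every method so that FID_nv and ID_nv
# are computed on identical target poses; pitch and roll stay those of the input.
rng = np.random.default_rng(0)
print([round(sample_novel_view_yaw(0.1, rng), 3) for _ in range(5)])
```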

Table IV: Comparison with the full pipeline of IDE-3D, which contains an extra optimization step.
Methods         PSNR ↑   LPIPS ↓   ID_nv ↑   PSNR_mv ↑   Time (s) ↓
IDE-3D (full)   24.43    0.092     0.507     37.10       100
Ours            21.57    0.123     0.645     39.53       0.3
Refer to caption
Figure II: Comparison with the full pipeline of IDE-3D.

Appendix B More Results

B.1 Novel View Synthesis Results

Figures VI, VII, and VIII show more novel view synthesis results by our method on the CelebA-HQ test data. Figure IX further shows novel view synthesis results on in-the-wild images. Our method can generate realistic novel views with high fidelity and strong 3D consistency for diverse subjects. Please see the project page for animations.

B.2 Comparisons with the Prior Art

Figures X and XI show more comparisons between our method and previous methods. Our method well preserves fine details in the original images and produces novel views with stricter 3D consistency than the others. Please see the project page for animations.

We further compare with the full inversion pipeline of IDE-3D, which adopts the EG3D structure and leverages optimization-based inversion (i.e. encoder-based initialization followed by pivotal tuning [49]). The results and a visual example are shown in Tab. IV and Fig. II. The PSNR, LPIPS, and ID$_{nv}$ are calculated on the first 100 instances of CelebA-HQ, and the PSNR$_{mv}$ on 50 instances. Our method performs slightly worse than this state-of-the-art optimization-based method on image reconstruction quality, but produces better novel view results and 3D consistency, and has a dramatically faster inference speed.

B.3 More Applications

Dolly zoom effect.

Since our method is based on GRAM [14], which leverages the radiance manifolds representation, we can explicitly move the camera towards or away from a subject and adjust the camera field of view accordingly so that the size of the portrait in the synthesized image stays constant. In this way, we can generate a sequence of images under different levels of perspective distortion, which is known as the dolly zoom effect (https://en.wikipedia.org/wiki/Dolly_zoom). This can hardly be achieved by 2D GAN-based face editing methods without explicit camera modeling. Examples of this effect generated by our method are shown in Fig. XII. Animations can be found on the project page.
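Concretely, the field of view can be adjusted so that the visible width at the subject's depth stays fixed as the camera distance changes; the sketch below illustrates this relation with purely illustrative FoV and distance values.

```python
import numpy as np

def dolly_zoom_fov(base_fov_deg, base_distance, new_distance):
    """Field of view (deg) that keeps the apparent subject size constant when the
    camera moves from base_distance to new_distance: the visible width at the
    subject's depth, 2 * d * tan(fov / 2), is held fixed."""
    half = np.deg2rad(base_fov_deg) / 2.0
    width = 2.0 * base_distance * np.tan(half)
    return np.rad2deg(2.0 * np.arctan(width / (2.0 * new_distance)))

# Example sweep: the camera moves away while the FoV narrows to compensate,
# producing the characteristic change in perspective distortion.
for d in [0.8, 1.0, 1.2, 1.5]:
    print(d, round(dolly_zoom_fov(12.0, 1.0, d), 2))
```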

Refer to caption
Figure III: Limitations of our method. It can produce layered artifacts under large viewing angles. We also observe ghosting artifacts for certain subjects, where the background contains the appearance of ears. In addition, it cannot well handle occlusions and out-of-distribution data.
Refer to caption
Figure IV: More comparisons between rendering with or without the background plane. Best viewed with zoom-in.

3D-consistent editing.

Our method can also be applied to 3D-consistent interactive portrait editing thanks to its ability to preserve fine image details. Specifically, given a real portrait image, we can draw arbitrary patterns on it and send the edited result to our GRAMInverter for reconstruction and novel view synthesis. As shown in Fig. XIII, our method well preserves the drawn patterns on the input images and generates their 3D-consistent novel views bearing these patterns. Corresponding animations are on the project page.

Appendix C Limitations and Future Works

We thoroughly discuss the limitations of our method and possible future solutions to improve it.

Our method adopts the radiance manifolds representation. Although it helps us synthesize novel views with strong 3D consistency, it can produce layered artifacts at large viewing angles, as shown in Fig. III. This artifact could be alleviated to some extent by using more profile images as training data. It could also be reduced by leveraging alternative 3D representations, such as recently proposed efficient NeRF variants [37, 56, 24]. However, it remains unclear how to effectively incorporate these representations for high-quality and efficient novel view synthesis of monocular portraits.

We also observe ghosting artifacts in some cases where the background contains the appearance of ears. The major cause is that the background plane and the foreground subject share the same tri-plane generator, so they may have similar appearance patterns in some regions. Some floating points can also be observed around the silhouette, mainly due to the wrong parallax caused by the inaccurate coarse depth (geometry) estimated in the general inversion stage. These problems can be alleviated by rendering only the foreground subject, as shown in Fig. III, or by using an extra image generator to synthesize the background. We show more comparisons with and without the background in Fig. IV. Clearly, removing the background largely reduces the layered artifacts and the floating points.

Besides, our method cannot well handle occlusions and tends to interpret them as textures clinging to the face, as shown in Fig. III. One possible solution is to leverage an extra face segmentation network to mask out the occluded regions and let the model focus only on reconstructing the portrait region. Our method can also produce inferior results for out-of-distribution inputs with large poses and abnormal lighting. The synthesized images may also exhibit a global color shift compared to the input in certain cases. We believe these problems can be mitigated by training on larger-scale datasets with carefully tuned loss weights. In addition, our method cannot well handle complex lighting effects, such as specular reflectance, when varying the camera pose. More dedicated 3D representations [21] are required to tackle this problem.

Finally, our method does not support editing of attributes such as expression, since the learned details are aligned with the original input image. This problem could be tackled by introducing a distortion-aware detail reconstructor, similarly to recent 2D GAN inversion methods [66], or by leveraging a 3D representation that handles dynamic changes [45, 71]. We leave these explorations as future work.

Appendix D Ethics Consideration

The goal of this paper is efficient large-scale virtual avatar creation. It does not intend to create misleading or deceptive content. However, it could still be misused for impersonating humans. In particular, the 3D-consistent synthesized portraits might be used to fool 3D face recognition systems that rely on multiview consistency. We condemn any attempt to create such harmful content. Currently, the portraits synthesized by our method contain certain visual artifacts that can be identified by humans and by some deepfake detection methods. We encourage applying this method to the learning of more advanced forgery detection approaches to avoid potential misuse.

Refer to caption
Figure V: Network structure of the detail manifolds reconstructor. It consists of a detail encoder $E_{detail}$ and a super-resolution module $\mathcal{U}$.
Refer to caption
Figure VI: More novel view synthesis results on CelebA-HQ by our method. Best viewed with zoom-in.
Refer to caption
Figure VII: More novel view synthesis results on CelebA-HQ by our method. Best viewed with zoom-in.
Refer to caption
Figure VIII: More novel view synthesis results on CelebA-HQ by our method. Best viewed with zoom-in.
Refer to caption
Figure IX: More novel view synthesis results on in-the-wild images. Best viewed with zoom-in.
Refer to caption
Figure X: More pose editing comparisons. Best viewed with zoom-in and see the animations.
Refer to caption
Figure XI: More pose editing comparisons. Best viewed with zoom-in and see the animations.
Refer to caption
Figure XII: Dolly zoom effect of the given portraits produced by our method. See the project page for animations.
Refer to caption
Figure XIII: 3D-consistent portrait editing results by our method. See the project page for animations.