
Towards Geometric-Photometric Joint Alignment for Facial Mesh Registration

Xizhi Wang (yvonnewxz826@163.com), Yaxiong Wang (wangyx15@stu.xjtu.edu.cn), Mengjian Li (limengjian@zhejianglab.org)
Abstract

This paper presents a Geometric-Photometric Joint Alignment (GPJA) method, which aligns discrete human expressions at pixel-level accuracy by combining geometric and photometric information. Common practices for registering human heads typically involve aligning landmarks with facial template meshes using geometry processing approaches, but often overlook dense pixel-level photometric consistency. This oversight leads to inconsistent texture parametrization across different expressions, hindering the creation of topologically consistent head meshes widely used in movies and games. GPJA overcomes this limitation by leveraging differentiable rendering to align vertices with target expressions, achieving joint alignment in both geometry and photometric appearance automatically, without requiring semantic annotation or pre-aligned meshes for training. It features a holistic rendering alignment mechanism and a multiscale regularized optimization for robust convergence on large deformations. The method utilizes derivatives at vertex positions for supervision and employs a gradient-based algorithm that guarantees smoothness and avoids topological artifacts during the geometry evolution. Experimental results demonstrate faithful alignment under various expressions, surpassing conventional non-rigid ICP-based methods and a state-of-the-art deep learning based method. In practice, our method generates meshes of the same subject across diverse expressions, all with the same texture parametrization. This consistency benefits face animation, re-parametrization, and other batch operations for face modeling and applications with enhanced efficiency.

keywords: Geometry registration, Facial performance capture, Face modeling
journal: Computers & Graphics
Affiliations: Zhejiang Lab, Kechuang Avenue, Yuhang District, Hangzhou 311121, Zhejiang, China; Hefei University of Technology, No. 193 Tunxi Road, Hefei 230009, Anhui, China

1 Introduction

Figure 1: Given multiview images and (a) the textured template mesh, we propose a novel method GPJA based on differentiable rendering to achieve geometric and photometric alignment jointly for facial meshes. (b) The aligned meshes are rendered with the shared texture map as the template in (a). (c) The zoomed renderings (yellow boxes) of eyes and mouths demonstrate photometric alignment with the reference images (black boxes).

Nowadays, professional studios in industry and academia commonly use synchronized multiview stereo setups for facial scanning [1, 2, 3], ensuring high-fidelity results in controlled settings. These setups aim to generate topology-consistent meshes for different subjects with various facial expressions. Typically, conventional pipelines [4] involve constructing raw scans from multiview images, followed by manual processes like marker point tracking, clean-up, or key-framing [5], which are labor-intensive and time-consuming, limiting their application in the film, gaming, and AR/VR industries. To enable automatic registration, geometry-based methods have been widely employed [6, 7, 8, 9, 10]. However, these methods primarily focus on geometric alignment and fail to ensure photometric consistency at the dense pixel level. To remedy this issue, this paper aims to achieve joint alignment in terms of geometry and photometric appearance. To this end, two challenges need to be addressed.

The first challenge is to establish a proper deformation field to guide the alignment process, especially for challenging areas such as the mouth and eyes. Previous attempts have been made to construct correspondences from landmarks or optical flow to aid photometric alignment [4, 6]. However, the offset vectors obtained through these methods often introduce errors. Moreover, since they are extracted from 2D images, these vectors are insufficient for guiding the deformation of 3D geometry. The method of [11] employs an implicit volumetric representation to combine shape and appearance recovery for realistic rendering, but it lacks explicit geometry constraints. In response to this challenge, we propose a differentiable rendering [12, 13] based registration framework to generate topology-consistent facial meshes from multiview images. In particular, our approach includes a Holistic Rendering Alignment (HRA) which incorporates constraints from color, depth, and surface normals, facilitating alignment through automatic differentiation without explicit correspondence computation.

The second challenge for facial mesh registration is generating faithful output meshes while preserving the topological structure. Aligning discrete facial expressions involves large-step geometry deformation, which is susceptible to topological artifacts. Previous works addressed this with specialized constraints [8, 10], but at the cost of a complex pipeline. Other approaches [14, 15, 7, 16] utilize parametric models to fit the facial images for mesh creation. Although these methods are highly efficient, they struggle with diverse identities and complex expressions due to their limited expressive capabilities. To overcome this, we resort to a multiscale regularized optimization that combines a modified gradient descent algorithm [17] with a coarse-to-fine remeshing scheme. Starting with the coarsest template, the mesh is tessellated periodically while the vertices are updated with a robust regularized geometry optimization driven by the constraints collected from HRA. Our multiscale regularized approach is simple to implement and ensures smoothness and robust convergence, without the need for extensive training data.

We validate our method through experiments on seven subjects from the FaceScape [18] dataset, covering diverse facial expressions, as shown in Fig. 1. The joint alignment is examined by both geometric and image metrics, demonstrating the effectiveness of our approach. The aligned meshes produced by our method are of high quality, free from topological errors, and accurately warped even in challenging regions like mouths and eyes.

Our contributions are summarized as follows:

1. A simple and novel method named GPJA that achieves joint alignment in geometry and photometric appearance at the dense pixel level for facial meshes.

2. A holistic rendering alignment mechanism based on differentiable rendering that effectively generates the deformation field for joint alignment, without any semantic annotation.

3. A multiscale regularized optimization strategy that ensures robust convergence and produces high-quality, topologically consistent aligned meshes.

2 Related Work

Our research focuses on the registration of facial meshes for topology-consistent geometry on discrete expressions. This section provides a literature review relevant to our study.

Geometry Processing Methods. Non-rigid registration is a well-established technique in geometry processing for warping a template mesh to raw scans [19, 20, 21, 22]. The Iterative Closest Point (ICP) algorithm is a commonly used framework [6, 20, 23] for registration. With a template based on 3D Morphable Models (3DMMs) of strong geometric priors [6, 24, 7, 8, 9, 18, 25, 10] as initialization, ICP minimizes the error between landmarks on the template and the scans, resulting in a rough alignment. Then, a fine-tuning stage searches for valid correspondences in the spatial neighborhood and warps the template leveraging data fidelity and smoothness terms. Previous works have explored regularization terms [26, 27] and correspondences [28, 29] in ICP-based algorithms. In addition to registration, there are also face tracking methods [30, 31] that rely on coefficient regression of rigged blendshapes, which are primarily used for expression transfer rather than high-fidelity reconstruction. Overall, these methods fall short of pixel-level photometric consistency, which our method achieves.

To achieve photometric alignment, industry-standard methods often involve re-topologizing frame by frame using professional software such as Wrap4D [32], which requires significant time and resources [33]. For automatic pipelines, researchers have explored incorporating optical flow [34, 35, 36] into the fine-tuning stage of facial mesh registration. However, optical flow alone is inadequate for handling significant differences between the source template and target scans, as well as occlusion changes around the eyes and mouth [37]. The situation is even worse in discrete-expression scenarios due to the lack of temporal coherence and the large deformations involved. In contrast, our method effectively handles significant deformation, visibility changes, and asymmetry.

Another challenge in facial mesh registration is maintaining smooth contours in facial features [38], which is difficult due to occlusions and color changes. Previous approaches have used user-guided methods [33] or contour extraction [37] to address these challenges, but they either involve lengthy workflows or specific treatments, which hinder efficient topology-consistent mesh creation. On the other hand, our method achieves pixel-level alignment without any semantic annotation.

Deep Learning Based Methods. Recent advancements in deep learning approaches for generating topology-consistent meshes include ToFu [39], which predicts probabilistic distributions of vertices on the face mesh to reconstruct registered face geometry. TEMPEH [40] enhances ToFu with a transformer-based architecture, while NPHM [41] models head geometry using a neural field representation that parametrizes shape and expressions in disentangled latent spaces. Although these methods prioritize geometric alignment, they do not guarantee rigorous photometric consistency. ReFA [42] introduces a recurrent network operating in the texture space to predict positions and texture maps. Although these deep learning methods represent progress in facial mesh registration, they require a substantial amount of registered data processed with classical ICP-based algorithms. Besides, their performance on out-of-domain expressions may be limited by insufficient training data. In our work, no additional training data is needed except for a semi-automatically created template mesh with textures.

In addition to mesh-based representations, implicit volumetric representations have gained popularity in reconstruction. Several studies have extended NeRF [43] and 3D Gaussian Splatting [44] for dynamic face reconstruction [45, 46, 47, 48, 49]. The pipeline typically begins with explicit parametric models, followed by estimating a deformation field represented by multi-layer perceptrons. Finally, a volumetric renderer is used to generate densities and colors. However, volumetric rendering-based methods lack supervision for aligning mesh-represented geometry and generally do not produce production-ready geometries despite decent rendering results [42].

3 Preliminaries

Before discussing the details of our methodology, we first introduce differentiable rendering as background information to provide context for our approach.

Given a 3D scene containing mesh-based geometries, lights, materials, textures, cameras, etc., a renderer that synthesizes a 2D image $\boldsymbol{P}$ at each screen pixel $(x,y)$ can be formulated as:

\boldsymbol{P}(x,y)=F(\boldsymbol{x}\mid\Theta),  (1)

where the function $F(\cdot)$ represents the rendering process, encompassing various computations such as shading, interpolation, projection, and anti-aliasing. The output $\boldsymbol{P}$ of this function can be RGB colors, normals, depths, or label images. In our scenario, the parameter to be optimized is the positions of the mesh vertices, denoted by $\boldsymbol{x}\in\mathbb{R}^{n\times 3}$. $\Theta$ symbolizes a set of scene parameters known in advance, including camera poses, lighting, texture color, and other relevant factors.

Differentiable rendering augments renderers by providing additional derivatives with respect to certain scene parameters, which is a valuable tool for inverse problems [13, 50]. The objective of inverse rendering is to recover specific scene parameters through gradient-based optimization on a scalar loss function $\mathcal{L}$, usually defined as the sum of pixel-wise differences between the rendered images $\boldsymbol{P}_{j}$ and the reference images $\boldsymbol{P}_{j}^{r}\in\mathbb{R}^{w\times h\times c}$ at the $j$-th camera pose across $v$ views. In this work, we adopt the $L_{1}$-norm for loss functions.

\mathcal{L}\left(\boldsymbol{x}\mid\Theta,\boldsymbol{P}^{r}\right)=\sum_{j}^{v}\left|F(\boldsymbol{x}\mid\Theta_{j})-\boldsymbol{P}_{j}^{r}\right|.  (2)

In our joint alignment registration setting, the derivative $\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}$ obtained through differentiable rendering plays a key role, as it guides the template mesh to fit the target expressions. This process avoids the explicit computation of correspondences, such as optical flow, facial landmarks, or other semantic information.
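As a concrete illustration, the sketch below sums the per-view L1 differences of Eq. (2) and lets automatic differentiation supply the vertex derivative. It assumes a hypothetical differentiable renderer `render(vertices, camera)` (e.g., one built on a rasterizer such as Nvdiffrast); the function name and tensor layout are placeholders rather than our exact implementation.

```python
import torch

def photometric_loss(vertices, cameras, reference_images, render):
    """Sum of per-view L1 differences between rendered and reference images (Eq. 2)."""
    loss = 0.0
    for cam, ref in zip(cameras, reference_images):
        rendered = render(vertices, cam)            # differentiable (H, W, C) image
        loss = loss + (rendered - ref).abs().sum()  # L1 photometric difference
    return loss

# vertices = template_vertices.clone().requires_grad_(True)
# loss = photometric_loss(vertices, cameras, refs, render)
# loss.backward()  # vertices.grad now holds dL/dx, used to drive the deformation
```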

4 Joint Alignment of Facial Meshes

Figure 2: Illustration of our proposed GPJA. With the provided textured template, scans, and camera poses, the HRA constraints $\mathcal{L}_{HRA}$ back-propagate derivatives at each iteration to guide the warping of the template. The regularized optimization is built on a multiscale scheme with periodic tessellation, and the vertices are updated with a robust modified gradient descent algorithm.

Our work addresses the challenge of registering the same subject across diverse facial expressions. Fig. 2 depicts the pipeline of our method. Starting with a textured template mesh and raw scans, each iteration computes deformation from HRA and optimizes the vertices through multiscale regularized optimization. As a result, the outputs of GPJA share a joint alignment across all expressions within the subject.

4.1 Holistic Rendering Alignment

Motivation. While geometric alignment has been well studied, previous photometric alignment approaches mainly consider sparse facial landmarks, resulting in two main limitations: coarse alignment lacking fine details, and unreliability under occlusions and extreme expressions, as shown in Fig. 3(a). Other photometric algorithms [35, 4] attempt to learn deformation from optical flow obtained on the 2D images, but are prone to false correspondences, as shown in Fig. 3(b), and suffer from ambiguities when lifted to 3D space.

(a) Common errors with landmarks
(b) Common errors with optical flow
Figure 3: Errors are indicated with arrows: (a) Landmark [51] errors under asymmetry, occlusion, and extreme expressions, and (b) Optical flow [52] inaccuracies due to significant visibility changes.

Instead of patching landmark- or optical-flow-based correspondences, we leverage differentiable rendering, which automatically computes derivatives $\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}$ to guide the deformation of the template mesh $\boldsymbol{T}$ into both geometric and photometric alignment. This approach eliminates correspondence errors from facial landmarks [6] or optical flow offset vectors [35]. Furthermore, differentiable rendering operates directly in 3D space, effectively handling occlusions in the eye and mouth regions.

With the above consideration, we propose a Holistic Rendering Alignment (HRA) mechanism, incorporating multiple cues with the aid of differentiable rendering. HRA collects constraints from three different aspects, i.e., color, depth, and surface normals:

\mathcal{L}_{HRA}=\mathcal{L}_{color}+\mathcal{L}_{depth}+\mathcal{L}_{normal}.  (3)

Color Constraint. $\mathcal{L}_{color}$ seeks to impose constraints for photometric alignment by comparing the rendered image with the observed multiview color images. In practice, we found that the deformation around the inner lip is sometimes affected by occlusion changes, resulting in inaccurate lip contours. To address this issue, we deliberately exclude the interior mouth from the color constraint by a masking strategy.

As illustrated in Fig. 4, a binary mask image $\boldsymbol{B}\in\left\{0,1\right\}^{t\times t\times 3}$ is manually created in accordance with the color texture $\boldsymbol{C}_{\boldsymbol{T}}\in\mathbb{R}^{t\times t\times 3}$ of the given template $\boldsymbol{T}$. The mask image labels the interior mouth region, which corresponds to the mouth socket of $\boldsymbol{T}$.

Figure 4: Illustration of the masking strategy. Using the same parametrization as (a) the texture map, the interior mouth is manually masked out, resulting in (b) a binary mask image. By the shading operation $F_{S}(\boldsymbol{x}\mid\boldsymbol{\Pi}_{j},\boldsymbol{B})$, (c) the mouth socket is masked out from the color constraint.

The color constraint is defined as the summation of the element-wise absolute difference between the rendered image and the reference image, weighted by the mask.

\mathcal{L}_{color}\left(\boldsymbol{x}\right)=\sum_{j}^{v}\left|\left(F_{S}\left(\boldsymbol{x}\mid\boldsymbol{\Pi}_{j},\boldsymbol{S},\boldsymbol{C}_{\boldsymbol{T}}\right)-\boldsymbol{P}_{j}^{r}\right)\odot F_{S}\left(\boldsymbol{x}\mid\boldsymbol{\Pi}_{j},\boldsymbol{B}\right)\right|,  (4)

where the shading function $F_{S}:\mathbb{R}^{n\times 3}\rightarrow\mathbb{R}^{w\times h\times 3}$ renders diffuse objects with the lighting estimated from the capture setup in spherical harmonics form $\boldsymbol{S}\in\mathbb{R}^{9\times 3}$ [53], $\boldsymbol{\Pi}_{j}\in\mathbb{PL}\left(3\right)$ denotes the projection matrix of the $j$-th camera, and $\odot$ denotes element-wise multiplication. By synthesizing a binary image using $F_{S}(\boldsymbol{x}\mid\boldsymbol{\Pi}_{j},\boldsymbol{B})$, where the interior mouth is assigned zeros and the rest ones, Equation 4 masks out the interior mouth from the color constraint, yielding faithful lip contours. Experimental results confirm the effectiveness of this mask operation.
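A minimal sketch of this masked color constraint is given below, assuming a hypothetical shading helper `shade` that plays the role of $F_{S}$ (rendering a texture under the estimated spherical-harmonics lighting); the signature and names are illustrative rather than the exact interface of our implementation.

```python
def color_constraint(vertices, cameras, refs, shade, sh_light, color_tex, mask_tex):
    """Per-view masked L1 color loss; the mouth interior is weighted to zero (Eq. 4)."""
    loss = 0.0
    for cam, ref in zip(cameras, refs):
        rendered = shade(vertices, cam, color_tex, sh_light)  # F_S with the template texture C_T
        mouth_mask = shade(vertices, cam, mask_tex)           # F_S with the binary mask B (lighting omitted so the mask stays binary)
        loss = loss + ((rendered - ref) * mouth_mask).abs().sum()
    return loss
```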

Depth Constraint. $\mathcal{L}_{depth}$ enforces geometric alignment. The color images observed in real-world scenes exhibit inconsistencies due to variations in shading and image formation across views and expressions. Therefore, the derivatives of the color constraint inevitably introduce bias and noise, disturbing accurate alignment. By including the depth term in HRA, we provide strong supervision for preserving geometric fidelity during registration, ensuring robust outputs.

The depth constraint measures the depth disparity between the deformed template and the target scan:

\mathcal{L}_{depth}\left(\boldsymbol{x}\right)=\sum_{j}^{v}\left|F_{D}\left(\boldsymbol{x}\mid\boldsymbol{\Pi}_{j}\right)-F_{D}\left(\tilde{\boldsymbol{x}}\mid\boldsymbol{\Pi}_{j}\right)\right|,  (5)

where $F_{D}:\mathbb{R}^{n\times 3}\rightarrow\mathbb{R}^{w\times h}$ represents the depth rendering operation [54] based on the perspective projection function, and $\tilde{\boldsymbol{x}}\in\mathbb{R}^{m\times 3}$ denotes the vertex coordinates of the scan with $m$ vertices.

Normal Constraint. $\mathcal{L}_{normal}$ assists with fidelity preservation. While the color and depth terms establish guidance for overall alignment, discrepancies in color tones and texture-less areas can cause artifacts on aligned meshes. The normal constraint compensates for these issues, leading to improved alignment and sharper details with fewer vertices.

Specifically, the normal constraint penalizes the disparity of surface normals between the deformed template and the target scan:

\mathcal{L}_{normal}\left(\boldsymbol{x}\right)=\sum_{j}^{v}\left|F_{N}\left(\boldsymbol{x}\mid\boldsymbol{\Pi}_{j}\right)-F_{N}\left(\tilde{\boldsymbol{x}}\mid\boldsymbol{\Pi}_{j}\right)\right|,  (6)

where $F_{N}:\mathbb{R}^{n\times 3}\rightarrow\mathbb{R}^{w\times h\times 3}$ represents the process of computing and projecting surface normals, as implemented in deferred shading [55].

HRA benefits from differentiable rendering to obtain derivatives, replacing the explicitly computed correspondences from previous approaches [28, 56]. By incorporating multiple cues, HRA ensures joint alignment of geometry and photometric appearances.
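The assembly of the full HRA objective of Eq. (3) can be sketched as follows; `color_loss`, `render_depth`, and `render_normal` stand in for the masked color term and the rasterized $F_{D}$ and $F_{N}$, and are assumed names rather than the actual API.

```python
def hra_loss(vertices, scan_vertices, cameras, color_loss, render_depth, render_normal):
    """Assemble L_HRA = L_color + L_depth + L_normal (Eq. 3) from per-view renderings."""
    loss = color_loss(vertices)                        # masked photometric term, Eq. (4)
    for cam in cameras:
        d_src = render_depth(vertices, cam)            # F_D of the deforming template
        d_tgt = render_depth(scan_vertices, cam)       # F_D of the target scan (fixed)
        n_src = render_normal(vertices, cam)           # F_N of the deforming template
        n_tgt = render_normal(scan_vertices, cam)      # F_N of the target scan (fixed)
        loss = loss + (d_src - d_tgt).abs().sum()      # depth term, Eq. (5)
        loss = loss + (n_src - n_tgt).abs().sum()      # normal term, Eq. (6)
    return loss
```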

4.2 Multiscale Regularized Optimization

The HRA mechanism steers deformation towards joint alignment, yet its derivative vectors are noisy because of shading and imaging variations. Since the derivatives lack regularization, applying them directly as update steps to each vertex could lead to topological errors [17, 57]. A common way to tackle this is adding a regularization term [28, 56], such as the ARAP energy [58] or the Laplacian differential representation [19]. However, these solutions make it difficult to tune the regularization weight for outputs containing both smooth and non-smooth regions [17], and to implement a robust solution scheme for the non-linear optimization [59]. To overcome these challenges, we propose a multiscale regularized optimization for generating high-quality aligned meshes.

Vertex Optimization. We follow the work of [17] to update vertices iteratively. It suggests that second-order optimization such as Newton's method is better suited to smooth geometry evolution, and that the computationally expensive Hessian matrix can be replaced by a re-parametrization of $\boldsymbol{x}$ through the introduced variables $\boldsymbol{\mu}$, ensuring the smoothness of the recovered $\boldsymbol{x}$:

\boldsymbol{x}=\left(\boldsymbol{I}+\lambda\boldsymbol{L}\right)^{-1}\boldsymbol{\mu},  (7)
\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\eta\frac{\partial\boldsymbol{x}}{\partial\boldsymbol{\mu}}\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}},  (8)

where $\eta>0$ is the learning rate, $\boldsymbol{I}\in\mathbb{R}^{n\times n}$ denotes the identity matrix, and $\lambda>0$ is the regularization weight. $\boldsymbol{L}\in\mathbb{R}^{n\times n}$ is a discrete Laplacian operator defined on a mesh $\mathcal{M}=\left(\mathcal{X},\mathcal{E}\right)$ with $n$ vertices $\mathcal{X}$ and $m$ edges $\mathcal{E}$:

\boldsymbol{L}_{ij}=\begin{cases}-w_{ij}, & \text{if}\ (i,j)\in\mathcal{E}\\ \sum_{(i,k)\in\mathcal{E}}w_{ik}, & \text{if}\ i=j\\ 0, & \text{otherwise},\end{cases}  (9)

where $w_{ij}$ is the cotangent weight described in [60].

Combining Equations 7 and 8, the update formula for the vertices at each iteration is:

\boldsymbol{x}\leftarrow\boldsymbol{x}-\eta\left(\boldsymbol{I}+\lambda\boldsymbol{L}\right)^{-2}\frac{\partial\mathcal{L}_{HRA}}{\partial\boldsymbol{x}}.  (10)
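A minimal sketch of this update, in the spirit of the re-parametrized descent of [17], is shown below. It assumes the cotangent Laplacian of Eq. (9) has already been assembled as a SciPy sparse matrix; details such as re-using the factorization across iterations are simplified for clarity.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def regularized_step(x, grad, L, lam, eta):
    """Eq. (10): x <- x - eta * (I + lam * L)^(-2) * dL_HRA/dx.

    x, grad: (n, 3) arrays; L: (n, n) sparse cotangent Laplacian (Eq. 9).
    """
    n = x.shape[0]
    A = (sp.identity(n, format="csc") + lam * L).tocsc()
    lu = splu(A)                                # in practice, factorize once per remeshing level
    return x - eta * lu.solve(lu.solve(grad))   # applying the inverse twice realizes the squared term
```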

Multiscale Learning. The coarse-to-fine multiscale learning scheme, based on the tessellation technique [61], periodically decreases the average edge length of the triangle mesh while retaining the shape and the texture parametrization. The multiscale scheme allows for parameter adjustment at each scale to capture fine details without distorting the topology. In particular, the Laplacian matrix $\boldsymbol{L}$ is updated for each tessellation step as the topology changes. The template mesh $\boldsymbol{T}$, initially at the coarsest level, undergoes more rigid deformations with a higher regularization parameter $\lambda$ to fit the overall target expressions. As the mesh is tessellated to finer scales, $\lambda$ is decreased to capture more details.

The multiscale regularized optimization offers several advantages: (1) it produces high-quality meshes effectively with significantly reduced distortion and self-intersection artifacts; (2) it converges robustly without additional training data or priors other than the textured template. The iterative optimization process can be observed in the supplementary videos.
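Putting the pieces together, the coarse-to-fine loop can be sketched as below. The helpers `tessellate`, `cotangent_laplacian`, and `hra_gradient` are assumed names, and the schedule (four levels with a decreasing $\lambda$) mirrors the description in Section 5 rather than reproducing our exact code.

```python
def multiscale_align(x, faces, hra_gradient, regularized_step, tessellate, cotangent_laplacian,
                     lambdas=(200, 120, 80, 50), iters_per_level=375, eta=1e-2):
    """Coarse-to-fine optimization: refine the mesh and shrink lambda at each level."""
    for level, lam in enumerate(lambdas):
        if level > 0:
            x, faces = tessellate(x, faces)     # refine the mesh before each finer level
        L = cotangent_laplacian(x, faces)       # topology changed, so rebuild the Laplacian
        for _ in range(iters_per_level):
            grad = hra_gradient(x, faces)       # dL_HRA/dx from differentiable rendering
            x = regularized_step(x, grad, L, lam, eta)
    return x, faces
```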

Figure 5: The pipeline of the textured template mesh creation. (a) A genus-0 mesh in the shape of a bust that approximately overlaps the scan undergoes GPJA as in Fig. 2, producing (b) a densely tessellated mesh accompanied by the reconstructed texture map $\boldsymbol{C}_{\boldsymbol{T}}$, which is then decimated to construct (c) a coarse template $\boldsymbol{T}$, while still preserving the same texture map.

4.3 Textured Template Mesh Creation

To adapt to the proposed pipeline, we construct a reliable textured template mesh which is used throughout the joint alignment for different expressions per subject. As illustrated in Fig. 5, we deliberately choose the mouth-open expression of each subject for reconstructing the template mesh, because the mouth socket is crucial for accommodating flexible movements during deformation. Initially, we manually sculpt a coarse genus-0 mesh with pre-designed texture parametrization, resembling a bust sculpture. This genus-0 surface is appropriate for representing head geometry, as the two share the same topological structure.

The genus-0 bust mesh is then processed through the GPJA pipeline as initialization. At this stage, only the depth constraint in HRA is applied. After this depth-guided reconstruction, a tessellated mesh is established. The geometry is then fixed, and the texture color map $\boldsymbol{C}_{\boldsymbol{T}}$ is updated through the color constraint via gradient descent, as shown in Fig. 5(b). Finally, the tessellated mesh is decimated using seam-aware simplification [62] while preserving the texture parametrization, resulting in the textured template mesh $\boldsymbol{T}$ used as the startup mesh in Fig. 2.

Although the template is subject-specific, all expressions share a common texture space, which enables batch re-parametrization. This offers a practical way to obtain topology-consistent meshes for large datasets.

5 Experiments and Analysis

Experiment Setup. Since GPJA is an early exploration of semantic annotation-free photometric alignment, many existing public datasets (e.g., LYHM [25] and NPHM [41]) are unsuitable, as they primarily consist of 3D scans or registered meshes rather than original images. We also found FaMoS [40] inappropriate due to its sparse, down-sampled RGB views and subjects with noticeable facial markers. Following our investigation, the FaceScape dataset emerged as the most fitting benchmark, with high-resolution images from dense viewpoints and uniform lighting conditions for discrete facial expressions.

To thoroughly assess GPJA's capability, seven subjects, including four publishable ones (Subjects 122, 212, 340, and 344), are chosen, and we deliberately select 10 highly distinct expressions for each subject. The selection process eliminates redundant and similar expressions. Additionally, we prioritize expressions with significant deformation and occlusion compared to the neutral expression, where landmark detection is less accurate, as shown in Fig. 3. This process ensures that the chosen expressions cover a range of challenging scenarios and effectively evaluate the algorithm's performance. Six to eight images, covering frontal and side views, are used as references.

GPJA is implemented with Nvdiffrast [12] as the differentiable renderer and LargeSteps [17] as the optimizer. The rendering uses the Lambertian reflection model under uniform lighting, which is a reasonable simplification of FaceScape's acquisition environment. The multiscale optimization involves remeshing 4 times, increasing the vertex count from 16K to 250K. The learning rate $\eta$ is set to one-tenth of the face length. For the first two levels, the regularization parameter $\lambda$ is set to 200 and 120, while for the remaining levels, it is set to 80 and 50. Gradients of only one constraint are back-propagated per iteration, with the three constraints iterated in rotation. Convergence is achieved within 1500 iterations per expression, taking approximately 15 minutes on a single NVIDIA RTX 3080 graphics card.
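The rotation of the three constraints can be sketched as follows; `color_loss`, `depth_loss`, and `normal_loss` are the hypothetical per-term functions from Section 4.1, and `regularized_step` is a tensor analogue of the update sketched in Section 4.2.

```python
import torch

def optimize_expression(vertices, constraints, regularized_step, L, lam, eta, n_iters=1500):
    """Cycle the HRA terms, back-propagating only one constraint per iteration."""
    for it in range(n_iters):
        vertices.requires_grad_(True)
        loss = constraints[it % 3](vertices)         # rotate: color, depth, normal, color, ...
        grad, = torch.autograd.grad(loss, vertices)  # derivative at vertex positions only
        with torch.no_grad():
            vertices = regularized_step(vertices, grad, L, lam, eta)  # Eq. (10) update
    return vertices

# constraints = [color_loss, depth_loss, normal_loss]
# aligned = optimize_expression(template_vertices, constraints, regularized_step, L, lam, eta)
```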

Metrics. As GPJA achieves joint alignment, we evaluate the method’s effectiveness from both geometric and photometric perspectives. Geometric alignment is assessed using raw face scans as ground truth, with L1-Chamfer distance and normal consistency reported. The distance is computed via the point-mesh approach to reduce the influence of vertex counts. Moreover, we calculate the F-Score with thresholds of 0.5mm and 1.0mm for a statistical error analysis. For photometric alignment, multiview reference images are used as ground truth, while rendered images are generated from aligned meshes with the common template’s texture map at the same camera poses. Standard image metrics such as PSNR, SSIM, and LPIPS are employed for evaluation.
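For reference, the F-Score can be computed from pre-evaluated point-to-mesh distances as in the short sketch below; the distance arrays and the helper name are assumptions for illustration.

```python
import numpy as np

def f_score(d_pred_to_gt, d_gt_to_pred, threshold=0.5):
    """F-Score at a distance threshold (mm) from point-to-mesh distances in both directions."""
    precision = np.mean(d_pred_to_gt < threshold)   # predicted points close to the GT surface
    recall = np.mean(d_gt_to_pred < threshold)      # GT points close to the predicted surface
    return 2 * precision * recall / (precision + recall + 1e-8)
```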

Table 1: Geometric evaluations on 10 expressions per subject in millimeter. C.D. stands for L1-Chamfer distance, N.C. represents normal consistency, while F@0.5 and F@1.0 are F-Scores at thresholds of 0.5mm and 1.0mm.
           Subject 7       Subject 32      Subject 122     Subject 212     Subject 340     Subject 344     Subject 350
           C.D.↓  N.C.↑    C.D.↓  N.C.↑    C.D.↓  N.C.↑    C.D.↓  N.C.↑    C.D.↓  N.C.↑    C.D.↓  N.C.↑    C.D.↓  N.C.↑
NICP       0.477  0.910    0.355  0.942    0.404  0.943    0.614  0.915    0.383  0.944    0.340  0.936    0.428  0.912
NPHM       0.347  0.958    0.377  0.970    0.325  0.968    0.429  0.949    0.285  0.973    0.338  0.957    0.489  0.937
GPJA       0.229  0.981    0.147  0.992    0.264  0.984    0.180  0.980    0.215  0.981    0.224  0.975    0.239  0.970
           F@0.5↑ F@1.0↑   F@0.5↑ F@1.0↑   F@0.5↑ F@1.0↑   F@0.5↑ F@1.0↑   F@0.5↑ F@1.0↑   F@0.5↑ F@1.0↑   F@0.5↑ F@1.0↑
NICP       0.708  0.852    0.800  0.918    0.738  0.901    0.549  0.799    0.743  0.913    0.792  0.922    0.746  0.879
NPHM       0.814  0.923    0.778  0.925    0.825  0.945    0.731  0.899    0.858  0.956    0.814  0.938    0.682  0.872
GPJA       0.909  0.960    0.957  0.975    0.880  0.959    0.921  0.963    0.912  0.961    0.912  0.960    0.891  0.957
Figure 6: Visualization of the geometric errors. GPJA outputs meshes with higher vertex counts than NPHM and FaceScape TU. To mitigate the effect of vertex counts, we use the point (of GT)-to-mesh distance. The error distributions show that GPJA achieves lower geometric errors compared to NPHM and FaceScape TU.

Result Analysis on Geometric Alignment. We compare GPJA against two registration methods. The first is the FaceScape topologically uniform (TU) meshes obtained through NICP [18], a variant of the standard ICP registration method in the 3D face domain [57]. The second is the state-of-the-art deep learning method NPHM [41], which is trained on a dataset comprising 87 subjects and 23 different facial expressions.

Quantitative and qualitative comparisons of geometric alignment are presented in Table 1 and Fig. 6. Table 1 shows that GPJA outperforms NPHM and NICP across all subjects, achieving the lowest L1-Chamfer distance and highest normal consistency, indicating superior geometric alignment. GPJA also consistently achieves higher F-Scores at both thresholds, with F-Scores@0.5mm particularly emphasizing its statistical superiority.

As depicted in Fig. 6, the error visualization exhibits the distribution of errors across the face, demonstrating that our method achieves higher fidelity, even in challenging regions such as the lips and eyes. The primary limitation of NPHM is its lack of fidelity in capturing details. This deficiency is particularly evident in errors concentrated around the mouth of subject 212 and the wrinkles of subject 344. Fig. 6 also reveals two main drawbacks in FaceScape TU meshes. First, subject 122's lip-funneler expression, where the eyes should be closed, incorrectly shows the opposite. This is a common issue resulting from inaccurate landmarks in the ICP-based method. Second, the face rim areas (foreheads and cheeks) in TU meshes show higher geometric errors due to excessive conformation to the template shape. Additionally, both NPHM and TU meshes incorrectly stitch non-facial areas like the skull, ears, and neck using templates. In contrast, GPJA aligns these regions more accurately, showcasing its robustness in capturing full heads.

Result Analysis on Photometric Alignment. Previous registration methods neglect photometric evaluations due to a lack of consideration for joint alignment; hence we provide qualitative evaluations for comparison.

Figure 7: Evidences of photometric misalignment on FaceScape TU meshes.
Figure 8: Evidences of photometric misalignment in NPHM. The lip contours on the NPHM mesh do not align with the GT images.

Two key pieces of evidence illustrate the lack of photometric alignment in FaceScape, as shown in Fig. 7: (a) the texture maps for the same subject exhibit discrepancies across different expressions, signifying the absence of a unified parametrization; (b) deviations are observed between the contours of the lips on the texture maps and those on the meshes, revealing misalignment between the geometry and texture. Similar issues are also present in NPHM meshes, although they are not accompanied by texture coordinates. Fig. 8 reveals that NPHM generates false lips that ought to be invisible, which implies the method's insufficient ability to align based solely on geometric features.

Table 2: Photometric evaluations on 10 expressions per subject. The image metrics are comparable to the photo-realistic results by the NeRF-style reconstruction NeP [63].
PSNR↑ SSIM↑ LPIPS↓
Subject 7 24.75 0.7810 0.06868
Subject 32 20.79 0.6621 0.07516
Subject 122 23.56 0.7443 0.07081
Subject 212 25.51 0.7699 0.09368
Subject 340 24.10 0.7378 0.06802
Subject 344 23.08 0.7569 0.06791
Subject 350 25.38 0.7320 0.07857
Average 23.88 0.7406 0.07469
Figure 9: Visualization of photometric errors of GPJA. The expressions of each subject are rendered using the shared color texture shown in the first row, and the $L_{1}$ distance is computed by overlapping the rendered images with the ground truth.

In contrast, GPJA ensures photometric consistency by applying the shared color texture to match reference images of different expressions. A comparison between the ground truth and our rendered images in Fig. 9 convincingly shows the photometric alignment achieved by our registration method. The last row of Fig. 9 illustrates that the discrepancy between the ground truth and the rendered images is primarily attributed to variations in skin tone across facial expressions. Notably, as depicted in Fig. 10, the rendered images exhibit pixel-level consistency in characteristic areas such as the mouth, eyes, and mole features across various facial expressions, despite occlusion changes. Moreover, Table 2 provides the image metrics for our experiment. The quantitative results are superior to the PSNR of 23.61, SSIM of 0.6460, and LPIPS of 0.09677 achieved by the NeRF-style pipeline NeP [63] (tested on the first 100 subjects of FaceScape), which produces photo-realistic reconstruction through per-frame color generation without alignment.

Figure 10: Close-up views of GT and rendered images. Checkerboard patterns are used to demonstrate the photometric consistency of moles and freckles, as indicated by arrows. Zooming in on this figure is recommended.
Figure 11: Ablations with and without the normal constraint. The normal constraint in HRA significantly enhances details.
Figure 12: Ablations with and without the mask operation. Masking out the interior mouth avoids disturbance around the contour.
Table 3: Ablation studies demonstrating the effects of each constraint from HRA on four publishable subjects. The geometric error used is the point-to-mesh distance. The geometric performance degrades when any of the three constraints is removed, validating the contributions of each constraint in HRA.
Geometric error↓ PSNR↑ SSIM↑ LPIPS↓
GPJA 0.167 24.06 0.7523 0.07511
Ablation
w/o $\mathcal{L}_{normal}$  0.377  24.49  0.7498  0.07578
w/o $\mathcal{L}_{depth}$   0.455  23.67  0.7535  0.07544
w/o $\mathcal{L}_{color}$   0.447  22.04  0.7364  0.07820

Ablations. The HRA mechanism plays a crucial role in correctly warping the template into joint alignment. To validate HRA’s effectiveness, we conduct two ablation experiments to confirm its benefits.

We first validate the contribution of each constraint in the HRA mechanism, as supported by Table 3. When all constraints are utilized, the geometric error is minimized. As a visual example, Fig. 11 reveals that the normal term significantly contributes to the sharpness of details and alleviates incorrect bumps on texture-less regions like the cheeks. However, Table 3 shows comparable image metrics across the ablations. We speculate that, with the color constraint guiding the deformation, geometric distortion cannot manifest itself in the color renderings.

In the second ablation, we study the effectiveness of the masking strategy for the color constraint. We synthesize the label image for the mouth socket using $F_{S}(\boldsymbol{x}\mid\boldsymbol{\Pi}_{j},\boldsymbol{B})$ and overlap it with the ground truth to examine the contours around the inner mouth. Fig. 12 illustrates that when the inner mouth is not masked out, the vertices around it are disturbed and distorted. Hence, intentionally masking out the inner mouth facilitates correct tracing of the mouth contours.

Applications. The core strength of GPJA lies in its ability to achieve joint alignment with a shared texture space across expressions. This consistency enables efficient batch processing for a variety of applications, such as animation and re-parametrization, as demonstrated below.

The meshes generated by GPJA for the same subject share a common parametrization, though the vertex sampling varies. We perform remeshing on these meshes via a common triangulation on the texture space. The remeshed results can then be linearly interpolated for face animation as presented in Fig. 13 and the supplementary video.
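Because the remeshed expressions share one triangulation, the blending itself reduces to a linear combination of vertex arrays, as in the sketch below (variable names are illustrative).

```python
import numpy as np

def interpolate_expression(verts_a, verts_b, t):
    """Linear blend of two aligned expressions on a common triangulation; t in [0, 1]."""
    return (1.0 - t) * verts_a + t * verts_b

# frames = [interpolate_expression(v_neutral, v_smile, t) for t in np.linspace(0.0, 1.0, 30)]
```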

When an artist-edited, UV-unwrapped facial mesh is available, as illustrated in Fig. 14(a), the texture embeddings of the input meshes yield a continuous map $\varphi_{b}\varphi_{a}^{-1}$ from the texture coordinates of the GPJA-generated meshes to those of the artist-edited mesh. Thereafter, the interpolation operator is applied directly to propagate the artist-edited parametrization across all expressions, as shown in Fig. 14(b).

Figure 13: Face animation via vertex interpolation.
Figure 14: (a) Computation of maps on the texture spaces. (b) Re-parametrization results.

6 Conclusion

We propose an innovative geometric-photometric joint alignment approach for facial mesh registration through the utilization of differentiable rendering techniques, demonstrating robust performance under various facial expressions with visibility changes. Unlike previous methods, our semantic annotation-free approach does not require marker point tracking or pre-aligned meshes for training. It is fully automatic and can be executed on a consumer GPU. Experiments show that our method achieves high geometric accuracy, surpassing conventional ICP-based techniques and the state-of-the-art method NPHM. We also validate the photometric alignment by comparing rendered images with captured multiview images, demonstrating pixel-level alignment in key facial areas, including the eyes, mouth, nostrils, and even freckles.

Our method currently adopts a relatively simple lighting and reflection model, which limits its performance under varying lighting conditions and skin highlights. Additionally, features such as teeth and the tongue may occasionally lead to inaccuracies in aligning the mouth contours. Future improvements will focus on optimizing efficiency, enhancing the rendering function to simulate more complex effects, and implementing cross-subject alignment based on genus-0 properties.

References