
Neural Volumetric Blendshapes: Computationally Efficient Physics-Based Facial Blendshapes

Nicolas Wagner, Ulrich Schwanecke, and Mario Botsch
Abstract

Computationally weak systems and demanding graphical applications still mostly depend on linear blendshape models for facial animation. Artifacts such as self-intersections, loss of volume, or missing soft-tissue elasticity are typically avoided only through comprehensively designed blendshape rigs, for which hundreds of blendshapes have to be manually created or scanned, an effort that is very costly to scale to many characters. Non-linear physics-based animation systems provide an alternative that avoids most artifacts by construction. However, they are cumbersome to implement and require immense computational effort at runtime. We propose neural volumetric blendshapes, an approach that runs in real time on consumer-grade CPUs and combines the advantages of physics-based simulations with handling that is as effortless and fast as that of linear blendshapes. To this end, we present a neural network that efficiently approximates volumetric simulations and generalizes across human identities as well as facial expressions; in the minimal setting, it requires only a single neutral face mesh as input. Along with the design of the network, we introduce a pipeline for the challenging creation of anatomically and physically plausible training data. Part of this pipeline is a novel layered head model that densely positions the biomechanical anatomy within a skin surface while avoiding intersections. We evaluate the fidelity of all parts of the data generation pipeline as well as the accuracy and efficiency of the network. Upon publication, the trained models and associated code will be released.


Figure 1: a) Exemplary results of our neural volumetric blendshapes (brown) compared to linear blendshapes (blue). Among other effects, volume preservation, more detailed wrinkles, and the avoidance of self-intersections result in more realistic and anatomically plausible facial animations. b) An example fit of the novel layered head model. The model encapsulates the skin, the skull, and the muscles with wraps for which we present a data-driven fitting algorithm. The space between the wraps can be canonically tetrahedralized.

1 Introduction

At present, research in the field of head avatars and facial animation is mainly concerned with obtaining photorealistic results through neural networks [13, 24, 4], which require computationally rich systems, comprehensive per-person training data, and time-consuming individualization. What currently falls short, however, is the inclusion of less capable hardware setups and of efficient training pipelines that avoid extensive data collection. Here, various adaptations of linear blendshape models paired with deformation transfer [54, 9] and example-based facial rigging [38] are still the usual means in production. Although linear facial models have been intensively researched and improved over the past decades, there are still known shortcomings like physically implausible distortions, loss of volume, anatomically impossible expressions, missing volumetric elasticity, or self-intersections. Physics-based simulations have been proposed that overcome most artifacts of linear blendshapes [27, 26, 18, 16], but they are usually laborious to handle and computationally expensive. Hybrid models that try to combine the best of both worlds are either not sophisticated enough in the quality of the simulated physical properties [6] or still too inefficient to be used on slower devices [26].

A promising approach in the latter category is physics-based volumetric blendshapes [26]. These can be animated with physical plausibility, anatomical constraints can be taken into account, self-intersections can be prevented, and the control is identical to linear blendshapes. Although the level of detail is slightly sacrificed in comparison to other physics simulations [18, 27], volumetric blendshapes can still only be used at low frame rates. We improve on this approach with neural volumetric blendshapes, which approximate the involved calculations of physical and anatomical constraints with an efficient and lightweight neural network. Thereby, realtime inference of physics-based non-linear blendshapes on consumer-grade hardware becomes possible, with only a slight computational overhead compared to linear blendshapes. The principal challenge we solve in this work is the creation of training data for neural volumetric blendshapes that reflects the previously discussed animation advantages and facilitates a generalization across different human identities. Thus, in contrast to other recent work that approximates physics-based facial simulations by neural networks [53, 55, 16], we avoid time-consuming individualization for a straightforward deployment. In addition, our facial animation method does not require sequences of optical scans or manually crafted facial animations. Instead, the network trained in this work induces an anatomically plausible deformation transfer such that our system is instantly applicable to neutral head surfaces or on top of arbitrary linear blendshape rigs.

The aforementioned data generation pipeline is unique in that, to the best of our knowledge, there are no other comprehensive datasets that relate a broad range of head shapes in diverse facial expressions to the underlying anatomical and physical characteristics. To curate such a dataset for the first time, we bring together multiple data sources such as CT data to reflect the anatomy of heads, 3D reconstructions of images in the wild to collect diverse head shapes, or facial expressions in the form of recorded blendshape weights from dyadic conversational situations. The result is a dataset of heads with neutral and non-neutral expressions represented by a novel standardized layered head model, relating skin surface displacements with the underlying biomechanical volumetric deformations and transformations of muscles, skull bones, and soft tissue.

The key novelties and contributions we present in this paper can be summarized as follows:

  • A novel layered head model (LHM) representing the skin surface as well as the entire biomechanical anatomy.

  • A data-driven procedure for fitting the LHM to neutral skin surfaces.

  • An inverse physics-based simulation which fits the LHM to skin surfaces of non-neutral facial expressions.

  • A novel neural network design that approximates physics-based simulations to efficiently implement neural volumetric blendshapes.

  • A pipeline for creating training data of the neural network that among other parts includes the LHM and the associated fitting procedures.

2 Related Work

2.1 Personalized Anatomical Models

Algorithms that create personalized anatomical models can essentially be distinguished according to two paradigms: heuristic-based and data-driven. Considering heuristic-based approaches, Anatomy Transfer [2] applies a space warp to a template anatomical structure to fit a target skin surface. The skull and other bones are only deformed by an affine transformation. A similar idea is proposed by Gilles et al. [23]. While they also implement a statistical validation of bone shapes, the statistics are collected from artificially deformed bones. In [26, 30], an inverse physics simulation was used to reconstruct anatomical structures from multiple 3D expression scans. Saito et al. [45] simulate the growth of soft tissue, muscles, and bones. A musculoskeletal biomechanical model is fitted from sparse measurements in [48] but not qualitatively evaluated.

There are only a few data-driven approaches because combined datasets of surface scans and MRI, CT, or DXA images are hard to obtain for various reasons (e.g. data privacy or unnecessary radiation exposure). The recent work OSSO [32] predicts full-body skeletons from 2000 DXA images that do not carry precise 3D information. Further, bones are positioned within a body by predicting only three anchor points per bone group, and intersections between skin and skull are not avoided. A model that prevents skin-skull intersections and also considers muscles is based on fitting encapsulating wraps instead of the anatomy itself [34]. However, the wraps are positioned not with an accurate algorithm based on medical imaging but with a BMI (body mass index) regressor [41]. A much more accurate model, restricted to faces, was developed by Achenbach et al. [1]. Here, CT scans are combined with optical scans by a multilinear model (MLM) which can map from skulls to faces and vice versa. As before, no self-intersections are prevented and only bones are fitted. Building on the data from [1] and following the idea of a layered body model [34], we create a statistical layered head model including musculature that avoids self-intersections.

2.2 Physics-Based Facial Animation

A variety of techniques for animating faces have been developed in the past [25, 12, 56, 43]. Data-driven models [36, 26, 37], which have recently been significantly improved by deep learning [57, 21, 13, 20, 51, 4], are certainly dominant. Due to their simplicity and speed, linear blendshapes [36] are still most commonly used in demanding applications and whenever no computationally rich hardware is available. Physics-based models have been developed for a long time [50] and avoid artifacts like implausible contortions and self-intersections, but due to their complexity and computational effort, they are rarely used. Hybrid approaches add surface-based physics to linear blendshapes for more detailed facial expressions [6, 7, 16]. However, by construction they cannot model volumetric effects.

The pioneering work of Sifakis et al. [50] is the first fully physics-based facial animation. The simulation is conducted on a personalized tetrahedron mesh, which can only be of limited resolution because a dense optimization problem has to be solved. With Phace [27], this problem was overcome by an improved physics simulation. An art-directed muscle model [17, 5, 18] additionally represents muscles as B-splines and allows control of expressions via trajectories of spline control points. A purely inverse model for determining the physical properties of faces was proposed in [29]. Neural soft-tissue dynamics [46, 14] extend the SMPL (Skinned Multi-Person Linear) model proposed in [40] with secondary motion. Recently, [53, 55, 16] adapted neural soft-body dynamics to learn the physical properties of a particular person. However, these approaches must be retrained for new subjects and are slow at inference. With volumetric blendshapes [26], a hybrid approach has been presented that combines the structure of linear blendshapes with physical and anatomical plausibility. We extend this work to neural volumetric blendshapes to make physical plausibility realtime-capable while maintaining the control structure of standard linear blendshapes.

3 Method

The foundation of our neural volumetric blendshapes approach (Sections 3.2 and 3.3) is a novel layered head representation (Section 3.1). Starting from there, we design a physics-based facial animation model (Section 3.4) and distill it into a defining dataset (Section 3.5), on which a neural network that makes the animation model realtime-capable is trained.

3.1 Layered Head Model

Figure 2: All components of the layered head model template \mathcal{T}: skin S_{\mathcal{T}}, skin wrap \hat{S}_{\mathcal{T}}, muscles M_{\mathcal{T}}, muscle wrap \hat{M}_{\mathcal{T}}, skull B_{\mathcal{T}}, and skull wrap \hat{B}_{\mathcal{T}}. Each wrap is an abstract and simplified representation of a more complex structure.

We represent a head \mathcal{H}=\rho_{\mathcal{H}}\left(\mathcal{T}\right) with neutral expression through a component-wise transformation (see Section 3.5 for details) of a layered head model (LHM) template

\mathcal{T}=\left(S_{\mathcal{T}},B_{\mathcal{T}},M_{\mathcal{T}},\hat{S}_{\mathcal{T}},\hat{B}_{\mathcal{T}},\hat{M}_{\mathcal{T}}\right),    (1)

that consists of six triangle meshes. S_{\mathcal{T}} describes the skin surface including the eyes, the mouth cavity, and the tongue; B_{\mathcal{T}} the surface of all skull bones and teeth; M_{\mathcal{T}} the surface of all muscles and the cartilage of the ears and nose. \hat{S}_{\mathcal{T}} is the skin layer, i.e. a closed wrap enveloping S_{\mathcal{T}}, \hat{B}_{\mathcal{T}} the skull layer that envelopes B_{\mathcal{T}}, and \hat{M}_{\mathcal{T}} the muscle layer that envelopes M_{\mathcal{T}}. Other anatomical structures are omitted for simplicity. The template structures S_{\mathcal{T}}, B_{\mathcal{T}}, and M_{\mathcal{T}} were designed by an experienced digital artist. The skin, skull, and muscle layers \hat{S}_{\mathcal{T}}, \hat{B}_{\mathcal{T}}, and \hat{M}_{\mathcal{T}} have the same triangulation and were generated by shrink-wrapping a sphere as close as possible to the corresponding surfaces without intersections. The complete template is shown in Figure 2.

The skull and muscle layers are further massaged such that the quads of all prisms that can be spanned between corresponding faces of the skin, muscle, and skull layer are as rectangular as possible while preserving the original geometries. To that end, we determine for the skull layer

\arg\min_{X}\; w_{\mathrm{rect}}E_{\mathrm{rect}}\left(X,\hat{S}_{\mathcal{T}}\right)+w_{\mathrm{dist}}E_{\mathrm{dist}}\left(X,\hat{B}_{\mathcal{T}}\right),    (2)

within the efficient projective dynamics [11] optimization framework, initializing with X=\hat{B}_{\mathcal{T}}. Here, E_{\mathrm{dist}} is the two-sided Hausdorff distance to the non-massaged shape and

E_{\mathrm{rect}}\left(X,\hat{S}_{\mathcal{T}}\right)=\sum_{(x_{i}^{0},x_{i}^{1})\in X}\left(90^{\circ}-\angle\left(x_{i}^{0}x_{i}^{1}\hat{s}_{i}^{1}\right)\right)^{2}+\left(90^{\circ}-\angle\left(x_{i}^{1}x_{i}^{0}\hat{s}_{i}^{0}\right)\right)^{2}+\left(90^{\circ}-\angle\left(\hat{s}_{i}^{0}\hat{s}_{i}^{1}x_{i}^{1}\right)\right)^{2}+\left(90^{\circ}-\angle\left(\hat{s}_{i}^{1}\hat{s}_{i}^{0}x_{i}^{0}\right)\right)^{2},    (3)

induces the rectangular prism shapes. After the optimization we set \hat{B}_{\mathcal{T}}=X. The same optimization is run for the muscle layer.

Mesh           | S_{\mathcal{T}} | B_{\mathcal{T}} | M_{\mathcal{T}} | \hat{S}_{\mathcal{T}} | \hat{B}_{\mathcal{T}} | \hat{M}_{\mathcal{T}} | \mathbb{S}_{\mathcal{T}} | \mathbb{M}_{\mathcal{T}}
#Vertices      | 21875 | 14572 | 16388 | 7826  | 7826  | 7826  | 49852  | –
#Faces / #Tets | 42738 | 28856 | 32370 | 15648 | 15648 | 15648 | 123429 | 73681
Table 1: Cardinality of each component of the template LHM \mathcal{T}. By subdividing the wrap meshes or the layer prisms, the resolution of the template tetrahedron meshes can easily be adjusted.

The wrapping layers of the LHM provide at least two significant advantages. On the one hand, they offer a simplified representation of the skin surface, the musculature, and the skull, which we exploit in Section 3.5 for determining the LHM template transformation \rho_{\mathcal{H}}. On the other hand, they can be used for a topologically and semantically consistent tetrahedralization of the head volume by splitting the prisms between the layers canonically into tetrahedrons.

More precisely, the construction of the LHM also defines a soft tissue tetrahedron mesh \mathbb{S}_{\mathcal{T}} (i.e. between the skin and the muscle layer) and a muscle tissue tetrahedron mesh \mathbb{M}_{\mathcal{T}} (i.e. between the muscle and the skull layer). The massage of the muscle and skull layers ensures nicely shaped tets with minimal shearing. Further, \mathbb{S}_{\mathcal{T}} can be fine-tuned by removing all vertices and connected tets outside of S_{\mathcal{T}}, adding the vertices of S_{\mathcal{T}}, and connecting them via Delaunay tetrahedralization. The complexities of all template components are given in Table 1.
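As an illustration of this canonical split, the following sketch (helper and variable names are ours, assuming both layers share one triangulation and are stored in a common vertex buffer with a fixed index offset) decomposes every prism spanned between corresponding triangles into three tetrahedra using the standard staircase pattern:

```python
import numpy as np

def prisms_to_tets(faces, top_offset):
    """Split each prism spanned between corresponding triangles of two layers
    (same triangulation) into three tetrahedra.

    faces:      (F, 3) triangle indices into the bottom layer (e.g. skull wrap)
    top_offset: index offset of the top layer (e.g. muscle wrap), i.e. the top
                vertex corresponding to bottom vertex v has index v + top_offset
    Returns a (3F, 4) integer array of tetrahedron indices.
    """
    tets = []
    for a, b, c in faces:
        A, B, C = a + top_offset, b + top_offset, c + top_offset
        # staircase decomposition of the prism (a, b, c | A, B, C)
        tets.append([a, b, c, A])
        tets.append([b, c, A, B])
        tets.append([c, A, B, C])
    return np.asarray(tets)
```

In the actual construction, a globally consistent diagonal rule across neighboring prisms is additionally required so that shared quad faces are split identically.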

3.2 Linear & Volumetric Blendshapes

Building on the LHM representation, we can now introduce neural volumetric blendshapes. For this, the classical concept of linear blendshapes is reviewed first. Thereupon, we derive volumetric blendshapes and the involved physics-based simulations that are in general not real time capable. Finally, we introduce neural volumetric blendshapes (Section 3.3) as an efficient and fast alternative.

For a specific head \mathcal{H}=\left(S_{\mathcal{H}},B_{\mathcal{H}},M_{\mathcal{H}},\hat{S}_{\mathcal{H}},\hat{B}_{\mathcal{H}},\hat{M}_{\mathcal{H}}\right), a linear blendshape model consists of n surface blendshapes

\left\{S_{\mathcal{H}}^{i}\right\}_{i=1}^{n}    (4)

and determines an unknown facial expression S_{\mathcal{H}}^{\mathrm{exp}} as the linear interpolation

S_{\mathcal{H}}^{\mathrm{exp}}=\sum_{i}w_{\mathrm{exp}}^{i}S_{\mathcal{H}}^{i}.    (5)

The w_{\mathrm{exp}}^{i} are the blending weights and determine the share of each blendshape in the expression.
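For reference, Equation (5) amounts to a simple weighted sum over the stacked blendshape surfaces; a minimal sketch with hypothetical array shapes:

```python
import numpy as np

def blend_linear(blendshapes, weights):
    """Linear blendshape evaluation following Eq. (5).

    blendshapes: (n, V, 3) stacked expression surfaces S_H^i
    weights:     (n,)      blending weights w_exp^i
    Returns the blended skin surface S_H^exp of shape (V, 3).
    """
    return np.einsum("i,ivk->vk", weights, blendshapes)
```

In delta-based rigs, the weighted per-vertex offsets would be added to the neutral surface instead.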

The corresponding set of volumetric blendshapes can be defined as

\left\{V^{i}_{\mathcal{H}}\right\}_{i=1}^{n}=\left\{\left(\nabla\mathbb{S}_{\mathcal{H}}^{i},\nabla\mathbb{M}_{\mathcal{H}}^{i},B_{\mathcal{H}}^{i}\right)\right\}_{i=1}^{n},    (6)

where \nabla\mathbb{S}_{\mathcal{H}}^{i} and \nabla\mathbb{M}_{\mathcal{H}}^{i} describe soft and muscle tissue deformations as vectors of 3\times 3 deformation gradients, and B_{\mathcal{H}}^{i} describes the skull after rigid motion of the jaw and cranium. An anatomically plausible inverse physics model \phi^{\dagger} (see Section 3.4 for details) relates the linear rig to the volumetric rig as

V^{i}_{\mathcal{H}}=\phi^{\dagger}\left(S_{\mathcal{H}}^{i},\mathbb{S}_{\mathcal{H}},\mathbb{M}_{\mathcal{H}},B_{\mathcal{H}}\right).    (7)

In words, \phi^{\dagger} determines the biomechanical volumetric deformations of \mathcal{H} that cause the skin surface of a facial expression. Vice versa, a forward physics model \phi acts as a left inverse to \phi^{\dagger} and maps volumetric deformations back to surface deformations as

\bar{S}_{\mathcal{H}}^{i}=\phi\left(V^{i}_{\mathcal{H}}\right).    (8)

In reality, S_{\mathcal{H}}^{i} is often not an anatomically legal facial expression with respect to \phi^{\dagger}. As a consequence, \phi^{\dagger} is built such that \bar{S}_{\mathcal{H}}^{i} is an anatomically reachable expression which, in the Euclidean sense, is close but not necessarily equal to S_{\mathcal{H}}^{i}.

The major advantage of a volumetric blendshape model is that unknown facial expressions calculated as

\bar{S}_{\mathcal{H}}^{\mathrm{exp}}=\phi\left(\bigoplus_{i}w_{\mathrm{exp}}^{i}\otimes V^{i}_{\mathcal{H}}\right)    (9)

can be made anatomically more plausible and thus more realistic, provided a suitable choice of interpolation (and extrapolation) functions \bigoplus and \otimes. Here, the essential requirement on \bigoplus and \otimes is the biologically necessary volume preservation of the deformation gradients. The common approach would be to separately interpolate the stretch and the rotation components of the deformation gradients [26, 49] and the positions of the skull bones. For the stretch components, volume-preserving interpolation methods [3, 28] have been proposed; for the rotation components, quaternion interpolation satisfies volume preservation by construction. However, to the best of our knowledge, it is yet to be discussed how both components should be extrapolated for facial animations if the blending weights do not form a convex combination. In this work, we therefore use a novel hybrid approach that calculates facial expressions as

\bar{S}_{\mathcal{H}}^{\mathrm{exp}}=\phi\left(\phi^{\dagger}\left(\sum_{i}w_{\mathrm{exp}}^{i}\bar{S}_{\mathcal{H}}^{i},\mathbb{S}_{\mathcal{H}},\mathbb{M}_{\mathcal{H}},B_{\mathcal{H}}\right)\right).    (10)

This way, the inter- and extrapolation capabilities of linear surface blendshapes can be used while maintaining the advantages of anatomical plausibility through \phi^{\dagger}.
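A minimal sketch of this hybrid evaluation, with phi_dagger and phi standing in for the projective dynamics solves of Section 3.4 (function names and signatures are placeholders, not the actual implementation):

```python
def volumetric_expression(weights, surface_blendshapes, S_tets, M_tets, skull,
                          phi_dagger, phi):
    """Hybrid volumetric blendshape evaluation following Eq. (10)."""
    # 1. linear blending of the surface blendshapes (Eq. (10) blends the plausible S_bar_H^i)
    S_exp = sum(w * S for w, S in zip(weights, surface_blendshapes))
    # 2. inverse solve: volumetric deformations that best explain S_exp
    V_exp = phi_dagger(S_exp, S_tets, M_tets, skull)
    # 3. forward solve: anatomically plausible skin surface from the volumetric state
    return phi(V_exp)
```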

3.3 Neural Volumetric Blendshapes

Figure 3: An overview of the neural volumetric blendshape model and how novel expressions are determined. For the fully connected layers (FC), the input size, the output size, as well as the activation function are stated. Please note that f is designed as an adapter such that the part left of the dotted line can be amended.

Regardless of the construction of \bigoplus and \otimes, the calculation of facial expressions is in general not realtime-capable for high-resolution volumetric blendshapes and sophisticated physics models \phi and \phi^{\dagger}. We therefore present the neural volumetric blendshape model that, like the linear blendshape model (4), consists of a set of (not necessarily anatomically plausible) expression surfaces \{S_{\mathcal{H}}^{i}\}_{i=1}^{n}. However, unknown anatomically plausible expressions are formed by a neural network f that is trained such that

\bar{S}_{\mathcal{H}}^{\mathrm{exp}}-\sum_{i}w_{\mathrm{exp}}^{i}S_{\mathcal{H}}^{i}\approx f\left(\sum_{i}w_{\mathrm{exp}}^{i}S_{\mathcal{H}}^{i}-S_{\mathcal{H}},\,S_{\mathcal{H}}\right).    (11)

The structure of Equation (11) enables us to implement f efficiently as fully connected networks. More precisely, both inputs to f, the neutral surface S_{\mathcal{H}} and the vector of differences to the linear interpolation result S_{\mathcal{H}}^{\mathrm{exp}}=\sum_{i}w_{\mathrm{exp}}^{i}S_{\mathcal{H}}^{i}, are tokenized with the help of respective encoders. Subsequently, the tokens are processed by a decoder that outputs the vertex-wise deformations from S_{\mathcal{H}}^{\mathrm{exp}} to the anatomically plausible expression \bar{S}_{\mathcal{H}}^{\mathrm{exp}}.

The inputs and outputs of f are justified as follows. Considering Equation (10), which is approximated by f, it seems reasonable to use the same inputs and only learn the evaluation of \phi(\phi^{\dagger}(\cdot)). However, the accompanying tetrahedron meshes are significantly higher-dimensional than the corresponding surface meshes and would severely slow down the inference speed of f. We therefore expect f to implicitly learn the linking between linear and volumetric blendshapes as shown in Equation (7) as well as the fitting of the LHM. Since we demonstrate in Section 3.5 how the LHM can be fitted given only the neutral surface, and since the linking does not need any further inputs, the neutral surface carries sufficient information to omit the tetrahedron meshes. For the second input, the difference vector, the blending weights or the linear interpolation result could alternatively be inserted. Inputting only the weights, however, would considerably limit the flexibility of f because it does not allow the underlying surface blendshapes to be changed after training. Inputting the linear interpolation result, on the other hand, exhibits disadvantages at training time: the target deltas are mostly Gaussian distributed and can therefore be learned more easily by a neural network [39]. In the same spirit, the output of f is chosen to be the difference to the anatomically plausible expression.
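The following PyTorch sketch illustrates this encoder/decoder structure; the layer widths, token sizes, and activations are hypothetical placeholders and do not reproduce the exact configuration of Figure 3:

```python
import torch
import torch.nn as nn

class NeuralVolumetricBlendshapes(nn.Module):
    """Minimal sketch of f (Eq. 11) with hypothetical layer sizes."""

    def __init__(self, n_vertices, token_dim=256):
        super().__init__()
        in_dim = 3 * n_vertices
        # encoder tokenizing the difference to the linear interpolation result
        self.enc_delta = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(), nn.Linear(1024, token_dim))
        # encoder tokenizing the neutral skin surface S_H
        self.enc_neutral = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(), nn.Linear(1024, token_dim))
        # decoder predicting per-vertex corrections towards the plausible expression
        self.dec = nn.Sequential(
            nn.Linear(2 * token_dim, 1024), nn.ReLU(), nn.Linear(1024, in_dim))

    def forward(self, delta, neutral):
        # delta, neutral: (B, V, 3); output: (B, V, 3) correction added to S_H^exp
        tokens = torch.cat([self.enc_delta(delta.flatten(1)),
                            self.enc_neutral(neutral.flatten(1))], dim=-1)
        return self.dec(tokens).view_as(delta)
```

A plausible expression is then obtained as S_exp + f(S_exp - S_neutral, S_neutral), mirroring Equation (11).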

We evaluated alternatives to fully connected networks such as set transformers [35], convolutional networks on geometry images, graph neural networks [47], or implicit architectures [42], but all have exhibited substantially slower inference speeds while reaching a similar accuracy.

Our design of f is not only fast, running in only 8 ms per frame on a consumer-grade Intel i5 12600K, but also straightforward to deploy. A single neutral surface of a head is sufficient, on which deformation transfer [9] can be applied. Building on this, an anatomically plausible animation can be performed with f as shown in Figure 3. Further, no volumetric information has to be processed in complex simulation frameworks anymore. Thereby, and because only simple fully connected layers are used, the model is portable and easy to use on many different (computationally weak) devices.

Next, the construction of \phi and \phi^{\dagger} is described, and subsequently a pipeline for creating a dataset to learn f is laid out.

Figure 4: a) The training scheme of the skin-to-skull layer distance regressor D which is used in the layered head model fitting. D is trained on the SKULLS dataset [22] that relates optical skin scans to MRI measurements. b) Procedural overview of the layered head model fitting algorithm. Starting from only a neutral head scan, the five other components are positioned. Blue strokes indicate inputs, green strokes outputs.

3.4 Physics-Based Simulations

We realize the anatomically plausible inverse physics model \phi^{\dagger} and the left inverse \phi as projective dynamics energies E_{\phi^{\dagger}} and E_{\phi}, respectively. Starting with \phi^{\dagger}, separate terms for soft tissue, muscle tissue, the skin, the skull, and auxiliary components are applied. Considering the soft tissue \mathbb{S}_{\mathcal{H}}, we closely follow the model of [27] and impose

E_{\mathbb{S}_{\mathcal{H}}}=w_{\mathrm{vol}}\sum_{t\in\mathbb{S}_{\mathcal{H}}}E_{\mathrm{vol}}(t)+w_{\mathrm{str}}\sum_{t\in\mathbb{S}_{\mathcal{H}}}E_{\mathrm{str}}(t),    (12)

which for each tetrahedron t penalizes changes of volume

E_{\mathrm{vol}}(t)=\left(\det(\mathrm{F}(t))-1\right)^{2}    (13)

and strain

E_{\mathrm{str}}(t)=\min_{R\in SO(3)}\left\|\mathrm{F}(t)-R\right\|^{2}_{F}.    (14)

\mathrm{F}(t) denotes the deformation gradient of a tetrahedron t, R\in SO(3) the optimal rotation, and \left\|\cdot\right\|_{F} the Frobenius norm.
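For concreteness, a small sketch of how both per-tetrahedron energies can be evaluated, with the deformation gradient computed from the rest and deformed edge matrices and the closest rotation obtained from an SVD-based polar decomposition (helper names are ours):

```python
import numpy as np

def deformation_gradient(rest, deformed):
    """F(t) for one tetrahedron; rest and deformed are (4, 3) vertex arrays."""
    D_rest = (rest[1:] - rest[0]).T        # columns: rest edge vectors
    D_def = (deformed[1:] - deformed[0]).T  # columns: deformed edge vectors
    return D_def @ np.linalg.inv(D_rest)

def E_vol(F):
    """Volume change penalty, Eq. (13)."""
    return (np.linalg.det(F) - 1.0) ** 2

def E_str(F):
    """Strain penalty, Eq. (14): squared Frobenius distance to the closest rotation."""
    U, _, Vt = np.linalg.svd(F)
    if np.linalg.det(U @ Vt) < 0:           # guard against reflections
        U[:, -1] *= -1
    R = U @ Vt
    return np.sum((F - R) ** 2)
```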

To reflect the biological structure of the skin, we additionally formulate a dedicated strain energy

E_{S_{\mathcal{H}}}=\sum_{t\in S_{\mathcal{H}}}E_{\mathrm{str}}(t)    (15)

on each triangle t of the skin.

For the muscle tets \mathbb{M}_{\mathcal{H}}, we follow the arguments of [29] that capturing fiber directions for tetrahedralized muscles is in general too restrictive. Hence, only a volume-preservation term

E_{\mathbb{M}_{\mathcal{H}}}=w_{\mathrm{vol}}\sum_{t\in\mathbb{M}_{\mathcal{H}}}E_{\mathrm{vol}}(t)    (16)

is applied. The weight wvolw_{\mathrm{vol}} is set sufficiently high such that the volume-preservation is almost a hard constraint.

The skull is not tetrahedralized as it is assumed to be non-deformable even though it is rigidly movable. The non-deformability of the skull is represented by

E_{B_{\mathcal{H}}}=\sum_{t\in B_{\mathcal{H}}}E_{\mathrm{str}}(t)+\sum_{x\in B_{\mathcal{H}}}E_{\mathrm{curv}}(x,B_{\mathcal{H}}),    (17)

i.e. a strain energy E_{\mathrm{str}} on the triangles and a mean curvature regularization

E_{\mathrm{curv}}(x,B_{\mathcal{H}})=A_{x}\left\|\Delta x-R\,\Delta b_{x}\right\|^{2}    (18)

on the vertices of the skull. The matrix R\in SO(3) denotes the optimal rotation keeping the vertex Laplacian \Delta x as close as possible to its initial value \Delta b_{x}. The vertex Laplacian is discretized using the cotangent weights and the Voronoi areas A_{x} [10]. We do not model the non-deformability as a rigidity constraint due to the significantly higher computational burden. Like for the volume constraints, the non-deformability of the skull is weighted to be an almost hard constraint.

To connect the muscle tets as well as the eyes to the skull, connecting tets are introduced. For the muscle tets, each skull vertex connects to the closest three vertices in \mathbb{M}_{\mathcal{H}} to form a connecting tet. For the eyes, connecting tets are formed by connecting each eye vertex to the three closest vertices in B_{\mathcal{H}}. On these connecting tets, the energy E_{\mathrm{con}} with the same constraints as in Equation (12) is imposed as an almost hard constraint. By this design, the jaw and the cranium are moved independently of each other through muscle activations, while the eyes remain rigid and move only with the cranium.
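A minimal sketch of how such connecting tets could be assembled with a nearest-neighbor query (assuming SciPy is available; helper names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def connecting_tets(source_vertices, target_vertices):
    """One connecting tetrahedron per source vertex, formed with its three
    nearest target vertices (skull vertex -> muscle mesh, eye vertex -> skull)."""
    tree = cKDTree(target_vertices)
    _, nearest3 = tree.query(source_vertices, k=3)           # (N, 3) indices
    source_ids = np.arange(len(source_vertices))[:, None]    # (N, 1)
    # indices refer to two separate vertex buffers; an offset has to be added
    # once both buffers are merged into one simulation mesh
    return np.concatenate([source_ids, nearest3], axis=1)    # (N, 4)
```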

Finally, the energy

E_{\mathrm{inv}}=\sum_{x\in S_{\mathcal{H}}}E_{\mathrm{tar}}\left(x,S^{\mathrm{exp}}_{\mathcal{H}}\right)    (19)

of soft Dirichlet constraints

E_{\mathrm{tar}}\left(x,S^{\mathrm{exp}}_{\mathcal{H}}\right)=\left\|x-s_{x}\right\|^{2},    (20)

is added, attracting each vertex x of the skin surface S_{\mathcal{H}} to the corresponding vertex s_{x} from the target expression S^{\mathrm{exp}}_{\mathcal{H}}. As previously mentioned, the expression S^{\mathrm{exp}}_{\mathcal{H}} might be anatomically implausible. As a countermeasure, we can impose a maximum strain by balancing w_{\mathrm{str}} with w_{\mathrm{tar}}. Thus, together with the almost hard constraints and by the construction of projective dynamics, \phi^{\dagger} always results in a plausible expression \bar{S}^{\mathrm{exp}}_{\mathcal{H}} close to the target. The weighted sum of the aforementioned energies gives the total energy

E_{\phi^{\dagger}}=w_{\mathbb{S}_{\mathcal{H}}}E_{\mathbb{S}_{\mathcal{H}}}+w_{\mathbb{M}_{\mathcal{H}}}E_{\mathbb{M}_{\mathcal{H}}}+w_{B_{\mathcal{H}}}E_{B_{\mathcal{H}}}+w_{S_{\mathcal{H}}}E_{S_{\mathcal{H}}}+w_{\mathrm{con}}E_{\mathrm{con}}+w_{\mathrm{inv}}E_{\mathrm{inv}}    (21)

of the backward model ϕ\phi^{\dagger}.

The forward model \phi is considerably simpler in structure and is realized as the energy

E_{\phi}=\sum_{t\in\mathbb{S}_{\mathcal{H}}}E_{\mathrm{dg}}\left(t,\nabla\mathbb{S}_{\mathcal{H}}^{\mathrm{exp}}\right)+\sum_{t\in\mathbb{M}_{\mathcal{H}}}E_{\mathrm{dg}}\left(t,\nabla\mathbb{M}_{\mathcal{H}}^{\mathrm{exp}}\right)+w_{\mathrm{tar}}\sum_{x\in B_{\mathcal{H}}}E_{\mathrm{tar}}\left(x,B^{\mathrm{exp}}_{\mathcal{H}}\right)+w_{\mathrm{con}}E_{\mathrm{con}}    (22)

that aims to achieve given deformation gradients \left(\nabla\mathbb{S}_{\mathcal{H}}^{\mathrm{exp}},\nabla\mathbb{M}_{\mathcal{H}}^{\mathrm{exp}}\right) while matching the positions of the targeted skull B^{\mathrm{exp}}_{\mathcal{H}}. Here,

E_{\mathrm{dg}}(t,\nabla\mathbb{T})=\min_{R\in SO(3)}\left\|\mathrm{F}(t)-R\,T_{t}\right\|^{2}_{F}    (23)

is similar to Equation (14) and attracts the deformation of each tetrahedron t to a corresponding target deformation gradient T_{t}\in\nabla\mathbb{T}.

For both models, we resolve self-intersections between colliding lips or teeth in a subsequent projective dynamics update. In this second update, colliding vertices are resolved as in [33]. The distinctive feature here is that no collision gaps can occur after resolving the self-intersections.

3.5 Generation of Training Data

To approximate Equation (10) with f, a defining training dataset is required in the first place. By the construction of f, this training dataset must consist of instances that relate diverse facial expressions created through linear blendshapes to the corresponding anatomically plausible surfaces. This dataset must also cover a variety of distinct head shapes to train f to be as generally applicable as \phi and \phi^{\dagger}.

In the following, we describe a pipeline for creating instances of such a dataset, which can be roughly divided into two high-level steps. First, in order to evaluate Equation (10) on a reasonable domain, it is necessary to sample a head \mathcal{H} from an extensive head model and to determine the corresponding neutral soft and muscle tissue tetrahedron meshes \mathbb{S}_{\mathcal{H}} and \mathbb{M}_{\mathcal{H}}. Second, S_{\mathcal{H}} has to be deformed to an expression S_{\mathcal{H}}^{\mathrm{exp}} and mapped to the anatomically plausible \bar{S}_{\mathcal{H}}^{\mathrm{exp}}. A more detailed algorithmic description is given in Algorithm 1.

Sampling Head Shapes We start the first part of the pipeline by randomly drawing a neutral skin surface S_{\mathcal{H}} from DECA [20], one of the most comprehensive high-resolution face models currently available. More specifically, we randomly draw an image from the Flickr-Faces-HQ [31] dataset and let DECA determine the corresponding neutral head shape. Further, a precomputed mapping is applied to adapt the DECA topology to our template.

Algorithm 1 Data Generation
Head Sampling and LHM Fitting
1. a) Draw a random image I_{\mathcal{H}} of a head \mathcal{H} from the Flickr-Faces-HQ dataset.
1. b) Calculate the skin surface S_{\mathcal{H}}=\mathit{DECA}\left(I_{\mathcal{H}}\right) with neutral expression parameters.
1. c) Find the LHM transformation \rho_{\mathcal{H}} from S_{\mathcal{H}} and build up the tetrahedron meshes \mathbb{S}_{\mathcal{H}} and \mathbb{M}_{\mathcal{H}}.
Expression Sampling and Simulation
2. a) Create ARKit blendshapes \{S_{\mathcal{H}}^{i}\}_{i=1}^{52} from S_{\mathcal{H}} with deformation transfer.
2. b) Sample weights w_{\mathrm{exp}}=\{w^{i}_{\mathrm{exp}}\}_{i=1}^{52} from dyadic recordings and calculate S_{\mathcal{H}}^{\mathrm{exp}}=\sum_{i=1}^{52}w^{i}_{\mathrm{exp}}S_{\mathcal{H}}^{i}.
2. c) Get \bar{S}_{\mathcal{H}}^{\mathrm{exp}} from \phi\left(\phi^{\dagger}\left(S_{\mathcal{H}}^{\mathrm{exp}},\mathbb{S}_{\mathcal{H}},\mathbb{M}_{\mathcal{H}},B_{\mathcal{H}}\right)\right).

Fitting the LHM Next, the template LHM \mathcal{T} is aligned with the skin surface S_{\mathcal{H}} by finding \rho_{\mathcal{H}}, which maps each of the template components individually. For this, we rely on a hybrid approach that is largely data-driven but also based on heuristics that ensure anatomical plausibility and avoid intersections.

As the first of the remaining five template meshes, we fit the skin layer by setting

\hat{S}_{\mathcal{H}}=\mathrm{rbf}_{S_{\mathcal{T}}\rightarrow S_{\mathcal{H}}}\left(\hat{S}_{\mathcal{T}}\right).    (24)

The rbf function is a space warp based on triharmonic radial basis functions [8] that is computed to displace the template skin surface S_{\mathcal{T}} onto the target S_{\mathcal{H}} and is then applied to the template skin layer. By construction, the skin layer is warped in a semantically consistent way and sticks close to the targeted skin surface.
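A minimal sketch of such a warp, assuming a plain triharmonic kernel \varphi(r)=r^{3} and omitting the linear polynomial term and the efficient solvers of [8]:

```python
import numpy as np

def rbf_warp(src, dst, query, reg=1e-8):
    """Triharmonic RBF space warp fitted on the correspondence src -> dst
    and evaluated at the query points (all arrays of shape (N or M, 3))."""
    def kernel(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1) ** 3

    K = kernel(src, src) + reg * np.eye(len(src))   # regularized interpolation system
    coeffs = np.linalg.solve(K, dst - src)          # displacement coefficients
    return query + kernel(query, src) @ coeffs      # displaced query points
```

In the notation of Equation (24), the skin layer fit corresponds to evaluating the warp fitted on the vertices of S_{\mathcal{T}} and S_{\mathcal{H}} at the vertices of \hat{S}_{\mathcal{T}}.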

Next, we fit the skull layer \hat{B}_{\mathcal{H}} by invoking a linear regressor D that predicts the distances from the vertices of \hat{S}_{\mathcal{H}} to the corresponding vertices of \hat{B}_{\mathcal{H}}, and by subsequently minimizing with projective dynamics

\arg\min_{X}\; w_{\mathrm{rect}}E_{\mathrm{rect}}\left(X,\hat{S}_{\mathcal{T}}\right)+w_{\mathrm{dist_{2}}}E_{\mathrm{dist_{2}}}\left(X,\hat{S}_{\mathcal{H}},D\left(\hat{S}_{\mathcal{H}}\right)\right)+w_{\mathrm{curv}}E_{\mathrm{curv}}\left(X,\hat{B}_{\mathcal{T}}\right).    (25)

Here,

E_{\mathrm{dist_{2}}}\left(X,\hat{S}_{\mathcal{H}},D\left(\hat{S}_{\mathcal{H}}\right)\right)=\sum_{x\in X}\left(\left\|x-s_{x}\right\|-d_{x}\right)^{2}    (26)

ensures that for each vertex x\in X the predicted distance d_{x}\in D(\hat{S}_{\mathcal{H}}) is adhered to. Apart from E_{\mathrm{dist_{2}}}, the same regularizing terms as in Equations (3) and (18) are used. The optimization is initialized with X=\hat{S}_{\mathcal{H}}-D(\hat{S}_{\mathcal{H}})\cdot n(\hat{S}_{\mathcal{H}}), where n(\hat{S}_{\mathcal{H}}) are the area-weighted vertex normals. D is trained on the dataset of [22] (SKULLS) that relates MRI skull measurements to skin surface scans. To ease the learning task, we learn the regressor between PCAs of the skin layers and the skin-to-skull-layer distances. Predicted distances are set to a minimum value if they fall below a threshold, thus avoiding skin-skull intersections and numerical issues in downstream physics-based simulations. In Figure 4a, the training of the linear regressor is visualized.
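A sketch of how D could be trained, assuming scikit-learn is available; the number of PCA components and the minimum distance are hypothetical values, not the ones used in our pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def train_distance_regressor(skin_layers, distances, n_components=30, d_min=1.5):
    """Train D on SKULLS-like data.

    skin_layers: (K, V, 3) fitted skin layers
    distances:   (K, V)    per-vertex skin-to-skull-layer distances (mm)
    Returns a function mapping one skin layer to clamped per-vertex distances.
    """
    X = skin_layers.reshape(len(skin_layers), -1)
    pca_skin = PCA(n_components).fit(X)
    pca_dist = PCA(n_components).fit(distances)
    reg = LinearRegression().fit(pca_skin.transform(X), pca_dist.transform(distances))

    def D(skin_layer):
        z = pca_skin.transform(skin_layer.reshape(1, -1))
        d = pca_dist.inverse_transform(reg.predict(z))[0]
        return np.maximum(d, d_min)   # clamp to avoid skin-skull intersections
    return D
```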

The muscle layer \hat{M}_{\mathcal{H}} is fitted by positioning its vertices at the same absolute distances between the corresponding skin and skull layer vertices as in the template, and only passing on a small relative amount w_{\mathrm{rel}} of the distance changes compared to the template. This approach assumes that the muscle mass in the facial area is only moderately affected by body weight and skull size.

The skull mesh is placed by setting

B_{\mathcal{H}}=\mathrm{rbf}_{\hat{B}_{\mathcal{T}}\rightarrow\hat{B}_{\mathcal{H}}}\left(B_{\mathcal{T}}\right).    (27)

The properties of the RBF space warp ensure that the skull mesh remains within the skull layer if the layer is of sufficient resolution. The muscle mesh could be placed in a similar fashion but is not needed in our pipeline any further.

Finally, the soft and muscle tissue tetrahedron meshes \mathbb{S}_{\mathcal{H}} and \mathbb{M}_{\mathcal{H}} can be constructed as described in Section 3.1. On average, the complete fitting pipeline takes only about 500 ms for one instance on an Intel i5 12600K processor. Figure 4b visualizes the overall fitting process.

Sampling Expressions In the second part of the pipeline, the actual training instance \left(S_{\mathcal{H}}^{\mathrm{exp}},\bar{S}_{\mathcal{H}}^{\mathrm{exp}}\right) is created. Beginning with deformation transfer [9] to transfer ARKit (https://developer.apple.com/) surface-based blendshapes to S_{\mathcal{H}}, we create the expression S_{\mathcal{H}}^{\mathrm{exp}} by linearly blending the blendshapes with weights obtained from eight roughly ten-minute-long dyadic conversations recorded with a custom iOS app.

Training Instance Finally, \bar{S}_{\mathcal{H}}^{\mathrm{exp}}=\phi\left(\phi^{\dagger}\left(S_{\mathcal{H}}^{\mathrm{exp}},\mathbb{S}_{\mathcal{H}},\mathbb{M}_{\mathcal{H}},B_{\mathcal{H}}\right)\right) can be computed. Computing one training instance takes approximately 40 seconds on an AMD Threadripper Pro 3995WX.

4 Experiments

Before demonstrating the accuracy and efficiency (Section 4.3) of neural volumetric blendshapes, we first evaluate the fitting precision of the LHM (Section 4.1) as well as the quality of the proposed physical models (Section 4.2).

4.1 LHM Fitting

Figure 5: The per-vertex mean L2-error of our LHM fitting algorithm in a leave-one-out validation on SKULLS [22]. Larger errors only appear in the back part of the face and have only a minor impact on facial animations.

The fitting of the LHM is mainly composed of the data-driven positioning of the skull layer and the subsequent heuristic fitting of the muscle layer. We evaluate the crucial fitting of the skull layer with the SKULLS [22] dataset. Since this dataset consists of 43 instances only, a leave-one-out validation is performed in which the vertex-wise L2 errors are measured. The results are compared to the multilinear model as originally used for SKULLS in [1].

Figure 6: Exemplary fits of the LHM components skull wrap, muscle wrap, and skull. The skulls are fitted with a dense position regressor. Here, collisions between skull and skin are avoided, which increases the numerical stability of downstream physics-based simulations.

Both models cannot achieve medical-grade positioning, with mean errors between approximately 2 mm and 4 mm. The MLM achieves a higher precision, with a mean error of 1.98 mm, than our approach, which mispositions the skull by 3.83 mm on average. However, the MLM cannot prevent collisions that might crash physics-based simulations. Also, our fitting algorithm produces large errors only in regions that are of less importance for facial simulations, as can be seen in Figure 5. The errors are predominantly distributed in the back area of the skull, since here the rectangularity constraints of our fitting procedure can presumably no longer be aligned well with the skin layer. The following section demonstrates in downstream physics-based simulations that the prediction quality in the frontal face region is sufficiently adequate for detailed facial animations. Figure 6 displays fitting examples.

4.2 Physics-Based Simulation

Figure 7: Exemplary results of the proposed physics-based facial animation models. In the top row, muscle deformation gradients have been extracted with \phi^{\dagger} and reapplied using \phi to the corresponding neutral head. In the bottom row, the deformation gradients are reapplied to a different identity.

w_{\mathbb{S}_{\mathcal{H}}} | w_{S_{\mathcal{H}}} | w_{\mathrm{str}} | w_{\mathrm{vol}} | w_{\mathrm{con}} | w_{\mathrm{inv}}
1 | 10 | 10^{2} | 10^{3} | 10^{3} | 10
w_{\mathbb{M}_{\mathcal{H}}} | w_{B_{\mathcal{H}}} | w_{\mathrm{curv}} | w_{\mathrm{rect}} | w_{\mathrm{dist}} | w_{\mathrm{dist_{2}}}
1 | 10 | 0.1 | 1.0 | 10.0 | 10.0
Table 2: The weights used to implement \phi, \phi^{\dagger}, and the LHM fitting.
Figure 8: Exemplary test results of our neural volumetric blendshapes for a variety of different head shapes. Besides resolving self-collisions, the physical and anatomical constraints anchored in our approach lead to more realistic and immersive expressions while being computationally efficient.

To investigate the quality of \phi^{\dagger} and \phi, we perform an experiment on nearly 300 high-resolution optical facial expression scans from the proprietary 3DScanstore (https://www.3dscanstore.com) dataset. First, we fit the LHM to each of the scans using \phi^{\dagger} and store only the resulting muscle deformation gradients and bone movements. Then, in turn, we apply these to the neutral scans of the corresponding individuals through \phi and simulate the soft tissue as \phi^{\dagger} does. Only if our physics models and the preceding LHM fitting operate with reasonable precision do we expect a low L2 loss between the original expression and the simulated expression as well as visually appealing results.

Across all expression scans, we observe an average L2 reconstruction error of only 1.6 mm. The reproduction quality of the physical models is also reflected in the visual results of Figure 7. Further, the accuracy of the LHM fitting investigated in the previous section is again underlined. Additionally, we inspected expression retargeting by applying the extracted muscle and bone transformations to other identities. Again, visually appealing examples can be found in Figure 7. The weights used to realize \phi, \phi^{\dagger}, and the LHM fitting in this and the following experiments are stated in Table 2.

4.3 Neural Volumetric Blendshapes

Figure 9: Exemplary results of our collision resolution. Collisions are not only avoided but also simulated in a physically plausible manner.

To train and evaluate f, we assemble a dataset of 50000 training and test instances by using the pipeline from Section 3.5. For this, we sample 10000 different head shapes and compute 5 different expressions per head. For training, the Adam optimizer performs 200k update steps with a learning rate of 0.0001. The learning rate is linearly decreased to 0.00005 over the course of training, and a batch size of 128 is used. In total, the training specifications result in an approximate runtime of 8 hours on an NVIDIA A6000. The comparatively short training time can be explained by the training data being less noisy than that usually encountered for image-based deep learning models and by the efficient network design. The identities are split 90% for training and 10% for testing.
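A minimal sketch of this training setup (dataset handling and names are placeholders; the model corresponds to the sketch of f in Section 3.3):

```python
import torch

def train_f(model, loader, steps=200_000, lr_start=1e-4, lr_end=5e-5, device="cuda"):
    """Adam, 200k steps, learning rate linearly decayed from 1e-4 to 5e-5,
    batch size set in the loader (128), L2 loss on the per-vertex deltas."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr_start)
    sched = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1.0, end_factor=lr_end / lr_start, total_iters=steps)
    batches = iter(loader)
    for _ in range(steps):
        try:
            neutral, delta_in, delta_target = next(batches)
        except StopIteration:
            batches = iter(loader)
            neutral, delta_in, delta_target = next(batches)
        neutral, delta_in, delta_target = (
            t.to(device) for t in (neutral, delta_in, delta_target))
        loss = torch.nn.functional.mse_loss(model(delta_in, neutral), delta_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```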

First, we evaluate the efficiency of f. The initial runtime of \phi and \phi^{\dagger} in Equation (10) with 6 global projective dynamics updates as in [26] is about 950 ms on an AMD Threadripper Pro 3995WX workstation with 128 threads (implemented with ShapeOp [19]). Our implementation of f only takes 8 ms on a consumer-grade Intel i5 12600K (implemented with PyTorch, https://pytorch.org). Moreover, a forward pass on an NVIDIA RTX3090 is calculated in less than one millisecond. Thus, neural volumetric blendshapes are suitable for realtime applications on weaker hardware setups and also offer advantages when many facial animations are to be executed in parallel on a GPU.

The time advantage alone is not useful if f is not an adequate approximation. We evaluate the approximation quality by measuring the mean L2 error on the test dataset. More precisely, we randomly generate 5 training-test splits of the generated dataset and report the average of the test losses after training on each training set. We achieve an average test error of only 0.13 mm, meaning that we successfully reach the approximation target of Equation (11) and that f generalizes well across head shapes and facial expressions. Another open question is the temporal consistency of f, which is inherent in physics-based simulations. To this end, we invite the reader to watch the attached video in the supplementary material. Finally, the visual results of neural volumetric blendshapes shown in Figure 8 exhibit the same improvements towards more realistic facial expressions as the much slower physics-based simulations before: the resolution of self-intersections, the simulation of wrinkles, the adherence to anatomical boundaries, and the realism-enhancing constraint of volume preservation. Close-up images of resolved self-intersections can be seen in Figure 9. The only other approach that, to the best of our knowledge, can animate detailed faces at this speed, requires only a neutral surface as input, and allows for semantically consistent control is linear blendshapes paired with a variant of deformation transfer [9, 39, 15]. As Figure 8 also shows, neural volumetric blendshapes produce far more immersive facial animations.

5 Conclusion

In this work, we presented neural volumetric blendshapes, an efficient realization of physics-based facial animations even on consumer-grade hardware. Our approach yields more realistic results than the commonly used linear blendshapes, since artifacts such as volume-loss, self-intersections, or anatomical flaws are avoided and effects like volumetric elasticity and wrinkles are added. It is also convenient to use, since existing surface-based blendshape rigs can be extended into anatomically and physically plausible animations with almost no integration overhead while keeping the actuation as before.

We aim to improve neural volumetric blendshapes in at least two directions. On the one hand, with an even more accurate anatomical model that represents, e.g., the trachea and esophagus more precisely. On the other hand, recent results [44] show that contact deformations can also be learned efficiently; since people touch their faces dozens of times a day [52], adding contact handling for more realistic gestures may improve the immersion significantly.

References

  • [1] Jascha Achenbach et al. “A multilinear model for bidirectional craniofacial reconstruction” In Proceedings of the Eurographics Workshop on Visual Computing for Biology and Medicine, 2018, pp. 67–76
  • [2] Dicko Ali-Hamadi et al. “Anatomy transfer” In ACM transactions on graphics (TOG) 32.6 ACM New York, NY, USA, 2013, pp. 1–8
  • [3] Vincent Arsigny, Pierre Fillard, Xavier Pennec and Nicholas Ayache “Geometric means in a novel vector space structure on symmetric positive-definite matrices” In SIAM journal on matrix analysis and applications 29.1 SIAM, 2007, pp. 328–347
  • [4] ShahRukh Athar et al. “RigNeRF: Fully Controllable Neural 3D Portraits” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20364–20373
  • [5] Michael Bao, Matthew Cong, Stéphane Grabli and Ronald Fedkiw “High-quality face capture using anatomical muscles” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10802–10811
  • [6] Vincent Barrielle, Nicolas Stoiber and Cédric Cagniart “Blendforces: A dynamic framework for facial animation” In Computer Graphics Forum 35.2, 2016, pp. 341–352 Wiley Online Library
  • [7] Bernd Bickel et al. “Pose-Space Animation and Transfer of Facial Details.” In Symposium on Computer Animation, 2008, pp. 57–66
  • [8] Mario Botsch and Leif Kobbelt “Real-time shape editing using radial basis functions” In Computer graphics forum 24.3, 2005, pp. 611–621 Blackwell Publishing, Inc Oxford, UKBoston, USA
  • [9] Mario Botsch, Robert Sumner, Mark Pauly and Markus Gross “Deformation transfer for detail-preserving surface editing” In Vision, Modeling & Visualization, 2006, pp. 357–364 Citeseer
  • [10] Mario Botsch et al. “Polygon mesh processing” CRC press, 2010
  • [11] Sofien Bouaziz et al. “Projective dynamics: Fusing constraint projections for fast simulation” In ACM transactions on graphics (TOG) 33.4 ACM New York, NY, USA, 2014, pp. 1–11
  • [12] Derek Bradley, Wolfgang Heidrich, Tiberiu Popa and Alla Sheffer “High resolution passive facial performance capture” In ACM SIGGRAPH 2010 papers, 2010, pp. 1–10
  • [13] Chen Cao et al. “Authentic volumetric avatars from a phone scan” In ACM Transactions on Graphics (TOG) 41.4 ACM New York, NY, USA, 2022, pp. 1–19
  • [14] Dan Casas and Miguel A Otaduy “Learning nonlinear soft-tissue dynamics for interactive avatars” In Proceedings of the ACM on Computer Graphics and Interactive Techniques 1.1 ACM New York, NY, USA, 2018, pp. 1–15
  • [15] Prashanth Chandran, Loïc Ciccone, Markus Gross and Derek Bradley “Local anatomically-constrained facial performance retargeting” In ACM Transactions on Graphics (TOG) 41.4 ACM New York, NY, USA, 2022, pp. 1–14
  • [16] Byungkuk Choi et al. “Animatomy: an Animator-centric, Anatomically Inspired System for 3D Facial Modeling, Animation and Transfer” In SIGGRAPH Asia 2022 Conference Papers, 2022, pp. 1–9
  • [17] Matthew Cong and Ronald Fedkiw “Muscle-based facial retargeting with anatomical constraints” In ACM SIGGRAPH 2019 Talks, 2019, pp. 1–2
  • [18] Matthew Deying Cong “Art-directed muscle simulation for high-end facial animation” Stanford University, 2016
  • [19] Mario Deuss et al. “ShapeOp—a robust and extensible geometric modelling paradigm” In Modelling Behaviour Springer, 2015, pp. 505–515
  • [20] Yao Feng, Haiwen Feng, Michael J Black and Timo Bolkart “Learning an animatable detailed 3D face model from in-the-wild images” In ACM Transactions on Graphics (ToG) 40.4 ACM New York, NY, USA, 2021, pp. 1–13
  • [21] Stephan J Garbin et al. “VolTeMorph: Realtime, Controllable and Generalisable Animation of Volumetric Representations” In arXiv preprint arXiv:2208.00949, 2022
  • [22] Thomas Gietzen et al. “A method for automatic forensic facial reconstruction based on dense statistics of soft tissue thickness” In PloS one 14.1 Public Library of Science San Francisco, CA USA, 2019, pp. e0210257
  • [23] Benjamin Gilles, Lionel Reveret and Dinesh K Pai “Creating and animating subject-specific anatomical models” In Computer Graphics Forum 29.8, 2010, pp. 2340–2351 Wiley Online Library
  • [24] Philip-William Grassal et al. “Neural head avatars from monocular RGB videos” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18653–18664
  • [25] Alexandru Eugen Ichim, Sofien Bouaziz and Mark Pauly “Dynamic 3D avatar creation from hand-held video input” In ACM Transactions on Graphics (ToG) 34.4 ACM New York, NY, USA, 2015, pp. 1–14
  • [26] Alexandru Eugen Ichim, Ladislav Kavan, Merlin Nimier-David and Mark Pauly “Building and animating user-specific volumetric face rigs.” In Symposium on Computer Animation, 2016, pp. 107–117
  • [27] Alexandru-Eugen Ichim, Petr Kadleček, Ladislav Kavan and Mark Pauly “Phace: Physics-based face modeling and animation” In ACM Transactions on Graphics (TOG) 36.4 ACM New York, NY, USA, 2017, pp. 1–14
  • [28] Sungkyu Jung, Armin Schwartzman and David Groisser “Scaling-rotation distance and interpolation of symmetric positive-definite matrices” In SIAM Journal on Matrix Analysis and Applications 36.3 SIAM, 2015, pp. 1180–1201
  • [29] Petr Kadleček and Ladislav Kavan “Building accurate physics-based face models from data” In Proceedings of the ACM on Computer Graphics and Interactive Techniques 2.2 ACM New York, NY, USA, 2019, pp. 1–16
  • [30] Petr Kadleček et al. “Reconstructing personalized anatomical models for physics-based body animation” In ACM Transactions on Graphics (TOG) 35.6 ACM New York, NY, USA, 2016, pp. 1–13
  • [31] Tero Karras, Samuli Laine and Timo Aila “A style-based generator architecture for generative adversarial networks” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410
  • [32] Marilyn Keller, Silvia Zuffi, Michael J Black and Sergi Pujades “OSSO: Obtaining Skeletal Shape from Outside” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20492–20501
  • [33] Martin Komaritzan and Mario Botsch “Projective skinning” In Proceedings of the ACM on Computer Graphics and Interactive Techniques 1.1 ACM New York, NY, USA, 2018, pp. 1–19
  • [34] Martin Komaritzan, Stephan Wenninger and Mario Botsch “Inside Humans: Creating a Simple Layered Anatomical Model from Human Surface Scans” In Frontiers in Virtual Reality 2 Frontiers Media SA, 2021, pp. 694244
  • [35] Juho Lee et al. “Set transformer: A framework for attention-based permutation-invariant neural networks” In International conference on machine learning, 2019, pp. 3744–3753 PMLR
  • [36] John P Lewis et al. “Practice and theory of blendshape facial models.” In Eurographics (State of the Art Reports) 1.8, 2014, pp. 2
  • [37] John P Lewis, Jonathan Mooser, Zhigang Deng and Ulrich Neumann “Reducing blendshape interference by selected motion attenuation” In Proceedings of the 2005 symposium on Interactive 3D graphics and games, 2005, pp. 25–29
  • [38] Hao Li, Thibaut Weise and Mark Pauly “Example-based facial rigging” In Acm transactions on graphics (tog) 29.4 ACM New York, NY, USA, 2010, pp. 1–6
  • [39] Jiaman Li et al. “Dynamic facial asset and rig generation from a single scan.” In ACM Trans. Graph. 39.6, 2020, pp. 215–1
  • [40] Matthew Loper et al. “SMPL: A Skinned Multi-Person Linear Model” In ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34.6 ACM, 2015, pp. 248:1–248:16
  • [41] Nadia Maalin et al. “Beyond BMI for self-estimates of body size and shape: A new method for developing stimuli correctly calibrated for body composition” In Behavior Research Methods 53.3 Springer, 2021, pp. 1308–1321
  • [42] Ben Mildenhall et al. “Nerf: Representing scenes as neural radiance fields for view synthesis” In Communications of the ACM 65.1 ACM New York, NY, USA, 2021, pp. 99–106
  • [43] Frederic I Parke “Control parameterization for facial animation” In Computer Animation’91, 1991, pp. 3–14 Springer
  • [44] Cristian Romero, Dan Casas, Maurizio M Chiaramonte and Miguel A Otaduy “Contact-centric deformation learning” In ACM Transactions on Graphics (TOG) 41.4 ACM New York, NY, USA, 2022, pp. 1–11
  • [45] Shunsuke Saito, Zi-Ye Zhou and Ladislav Kavan “Computational bodybuilding: Anatomically-based modeling of human bodies” In ACM Transactions on Graphics (TOG) 34.4 ACM New York, NY, USA, 2015, pp. 1–12
  • [46] Igor Santesteban, Elena Garces, Miguel A Otaduy and Dan Casas “SoftSMPL: Data-driven Modeling of Nonlinear Soft-tissue Dynamics for Parametric Humans” In Computer Graphics Forum 39.2, 2020, pp. 65–75 Wiley Online Library
  • [47] Franco Scarselli et al. “The graph neural network model” In IEEE transactions on neural networks 20.1 IEEE, 2008, pp. 61–80
  • [48] Robert Schleicher et al. “BASH: Biomechanical Animated Skinned Human for Visualization of Kinematics and Muscle Activity.” In VISIGRAPP (1: GRAPP), 2021, pp. 25–36
  • [49] Ken Shoemake and Tom Duff “Matrix animation and polar decomposition” In Graphics Interface 92, 1992, pp. 258–264
  • [50] Eftychios Sifakis, Igor Neverov and Ronald Fedkiw “Automatic determination of facial muscle activations from sparse motion capture marker data” In ACM SIGGRAPH 2005 Papers, 2005, pp. 417–425
  • [51] Steven L Song, Weiqi Shi and Michael Reed “Accurate face rig approximation with deep differential subspace reconstruction” In ACM Transactions on Graphics (TOG) 39.4 ACM New York, NY, USA, 2020, pp. 34–1
  • [52] Jente L Spille, Martin Grunwald, Sven Martin and Stephanie M Mueller “Stop touching your face! A systematic review of triggers, characteristics, regulatory functions and neuro-physiology of facial self touch” In Neuroscience & Biobehavioral Reviews 128 Elsevier, 2021, pp. 102–116
  • [53] Sangeetha Grama Srinivasan et al. “Learning active quasistatic physics-based models from data” In ACM Transactions on Graphics (TOG) 40.4 ACM New York, NY, USA, 2021, pp. 1–14
  • [54] Robert W Sumner and Jovan Popović “Deformation transfer for triangle meshes” In ACM Transactions on graphics (TOG) 23.3 ACM New York, NY, USA, 2004, pp. 399–405
  • [55] Lingchen Yang et al. “Implicit neural representation for physics-driven actuated soft bodies” In ACM Transactions on Graphics (TOG) 41.4 ACM New York, NY, USA, 2022, pp. 1–10
  • [56] Li Zhang, Noah Snavely, Brian Curless and Steven M Seitz “Spacetime faces: High-resolution capture for modeling and animation” In Data-Driven 3D Facial Animation Springer, 2008, pp. 248–276
  • [57] Yufeng Zheng et al. “Im avatar: Implicit morphable head avatars from videos” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13545–13555