VASA-Rig: Audio-Driven 3D Facial Animation with 'Live' Mood Dynamics in Virtual Reality

Shanghai Jiao Tong University & Microsoft Research Asia
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2025

Results Showcase

Uncontrolled Generation

Gaze Control

Emotion Control

Abstract

Figure: Visualization of emotional expressions.
Figure: Facial expressions with gaze control.
Figure: MetaHumans with identical expressions.

Audio-driven 3D facial animation is crucial for enhancing the metaverse's realism, immersion, and interactivity. Most existing methods focus on generating highly realistic, lively 2D talking-head videos by leveraging extensive 2D video datasets; however, these approaches operate in pixel space and are not easily adaptable to 3D environments. We present VASA-Rig, which achieves significant advances in the realism of lip-audio synchronization, facial dynamics, and head movements. In particular, we introduce a novel rig-parameter-based emotional talking-face dataset and propose the Latents2Rig model, which transforms 2D facial animations into 3D. Unlike mesh-based models, VASA-Rig outputs rig parameters, instantiated in this paper as 174 MetaHuman rig parameters, making it more suitable for integration into industry-standard pipelines. Extensive experimental results demonstrate that our approach significantly outperforms existing state-of-the-art methods in terms of both realism and accuracy.

Key Contributions

  • VASA-Rig Framework: Audio-driven 3D facial animation with vivid expression and natural gaze/head dynamics.
  • New Emotional Dataset: Over 910,000 paired frames of 2D video and rig parameters for MetaHuman-centric training.
  • Latents2Rig Model: A mapping network that adapts 2D facial animation latents to 3D rig control space.
  • Industry Integration: Direct 174-parameter rig output that is lightweight and production-friendly.

How It Works: The VASA-Rig Pipeline

Figure: Overview of VASA-Rig.

Our framework has three stages:

  • Audio2Latents maps audio (with gaze and head-distance controls) to motion latents;
  • Latents2Rig converts per-frame latents to 174-dimensional MetaHuman rig parameters;
  • Rig2Animation applies those parameters in Unreal Engine 5 for real-time, audio-synchronized, emotionally rich 3D animation.
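The three stages above can be sketched end-to-end. The code below is a data-flow illustration only: the module shapes, latent size, and audio-feature dimension are our assumptions (the released system uses learned networks), but the 174-dimensional rig output per frame matches the pipeline description.

```python
import numpy as np

RIG_DIM = 174      # MetaHuman rig parameters per frame (from the paper)
LATENT_DIM = 64    # hypothetical motion-latent size (not specified here)

def audio2latents(audio_feats, gaze, head_distance):
    """Stage 1 (stand-in): map per-frame audio features plus gaze and
    head-distance controls to motion latents. A random projection stands
    in for the learned Audio2Latents network."""
    controls = np.concatenate([gaze, head_distance], axis=1)    # (T, 3)
    x = np.concatenate([audio_feats, controls], axis=1)         # (T, F + 3)
    W = np.random.default_rng(0).standard_normal((x.shape[1], LATENT_DIM))
    return np.tanh(x @ W)                                       # (T, LATENT_DIM)

def latents2rig(latents):
    """Stage 2 (stand-in): per-frame mapping from motion latents to the
    174 MetaHuman rig parameters."""
    W = np.random.default_rng(1).standard_normal((latents.shape[1], RIG_DIM))
    return latents @ W                                          # (T, RIG_DIM)

# Stage 3 (Rig2Animation) streams the (T, 174) array into Unreal Engine 5;
# no UE code is shown here.
T = 120                                 # frames
audio_feats = np.random.rand(T, 80)     # e.g. 80-dim mel features (assumed)
gaze = np.zeros((T, 2))                 # yaw/pitch gaze control
head_distance = np.ones((T, 1))         # head-distance control
rig = latents2rig(audio2latents(audio_feats, gaze, head_distance))
print(rig.shape)                        # (120, 174)
```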

Large-Scale Emotional Dataset

To address the lack of public 3D talking-face data, we built a large-scale emotional dataset with over 4 hours (276+ minutes) of footage and around 910,000 frames of paired 2D video and rig parameters.

  • Actor Capture: Professional artists mapped performances to MetaHuman rigs for high emotional fidelity.
  • Live Link Capture: Live Link Face on iPhone with ARKit blendshapes for scalable collection.
  • Diverse Rendering: Captured data is rendered with multiple characters (e.g., Ada, Emanuel, Emory, Tori) for cross-identity generalization.

* For data privacy reasons, we show only 2D videos rendered using MetaHuman.
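Conceptually, each sample in such a dataset couples a rendered frame with the rig parameters and audio window that produced it. A minimal sketch, with field names and dimensions of our own choosing (the paper does not specify a schema):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PairedFrame:
    """One paired training sample: a rendered 2D frame aligned with the
    rig parameters that produced it. Field names are illustrative."""
    image: np.ndarray   # (H, W, 3) rendered MetaHuman frame
    rig: np.ndarray     # (174,) MetaHuman rig parameters
    audio: np.ndarray   # audio features for this frame's window (size assumed)
    character: str      # render identity, e.g. "Ada" or "Tori"
    emotion: str        # emotion label of the captured performance

sample = PairedFrame(
    image=np.zeros((256, 256, 3), dtype=np.uint8),
    rig=np.zeros(174, dtype=np.float32),
    audio=np.zeros(80, dtype=np.float32),
    character="Ada",
    emotion="happy",
)
print(sample.rig.shape)   # (174,)
```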

Quantitative Evaluation

We evaluate VASA-Rig against state-of-the-art methods using SyncNet metrics (LSE-D and LSE-C) for lip synchronization and FRD (Upper-Face Rig Deviation) for facial expression diversity. Our method achieves the lowest LSE-D and highest LSE-C, demonstrating superior lip-sync accuracy across both the HDTF and RAVDESS datasets.

Method             HDTF Dataset                 RAVDESS Dataset
                   LSE-D    LSE-C    FRD        LSE-D    LSE-C    FRD
EmoTalk            13.931   0.4545    0.0047    11.184   0.5200    0.0051
EmoFace            14.531   0.4505    0.0229    10.976   0.5028   -0.0222
Audio2Face         14.428   0.4269    0.0352    11.194   0.5259    0.0275
VASA-Rig (Ours)    13.344   0.5191   -0.0912    10.477   0.5722   -0.0691
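For context, LSE-D and LSE-C come from the widely used SyncNet-based evaluation protocol: distances between video and audio embeddings are compared over a range of temporal offsets. The sketch below assumes the common definitions (per-frame minimum distance for LSE-D; median-minus-minimum confidence for LSE-C) and uses placeholder embeddings rather than a real pretrained SyncNet, so it illustrates the computation only, not the exact evaluation code.

```python
import numpy as np

def lse_metrics(video_emb, audio_emb, max_offset=15):
    """Sketch of SyncNet-style LSE-D / LSE-C on (T, D) embedding arrays.
    LSE-D: mean over frames of the minimum audio-visual embedding distance
    across temporal offsets (lower = better sync). LSE-C: mean sync
    confidence, taken here as median minus minimum of each frame's
    distance curve over offsets (higher = better)."""
    T = video_emb.shape[0]
    mins, confs = [], []
    for t in range(max_offset, T - max_offset):
        # distances from frame t's video embedding to audio embeddings
        # at offsets in [-max_offset, +max_offset]
        window = audio_emb[t - max_offset:t + max_offset + 1]
        d = np.linalg.norm(window - video_emb[t], axis=1)
        mins.append(d.min())                  # best-matching offset
        confs.append(np.median(d) - d.min())  # sync confidence
    return float(np.mean(mins)), float(np.mean(confs))

rng = np.random.default_rng(0)
v = rng.standard_normal((100, 512))
lse_d, lse_c = lse_metrics(v, v)   # perfectly synced embeddings
print(lse_d)                       # 0.0 (exact match at offset 0)
```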

The table below shows the standard deviation (SD) of rig parameters across different facial regions. Compared to baseline methods, which often produce "muted" or overly smoothed upper-face motions, VASA-Rig significantly enhances the expressive richness of the eyes, eyebrows, and nose, leading to more "live" and natural animations.

Method             HDTF (Standard Deviation)       RAVDESS (Standard Deviation)
                   Eyes     Eyebrows   Nose        Eyes     Eyebrows   Nose
EmoTalk            0.0345   0.0865     0.0089      0.0280   0.0649     0.0098
EmoFace            0.0234   0.0201     0.0088      0.0475   0.1193     0.0378
Audio2Face         0.0071   0.0145     0.0017      0.0080   0.0128     0.0030
VASA-Rig (Ours)    0.1174   0.2451     0.1037      0.0986   0.1681     0.0801
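The SD numbers above can be reproduced mechanically from a generated rig sequence: take each rig parameter's standard deviation over time and average it within a facial region. A sketch with hypothetical region index ranges (the real MetaHuman control layout is not listed here):

```python
import numpy as np

# Hypothetical index ranges of the 174 rig parameters per facial region;
# the actual MetaHuman control ordering differs.
REGIONS = {"eyes": slice(0, 20), "eyebrows": slice(20, 40), "nose": slice(40, 50)}

def region_sd(rig_seq):
    """Per-region expressiveness: standard deviation of each rig parameter
    over time (axis 0), averaged within the region. rig_seq is (T, 174)."""
    return {name: float(rig_seq[:, idx].std(axis=0).mean())
            for name, idx in REGIONS.items()}

# 300 frames of synthetic rig motion with per-parameter SD of about 0.1
rig_seq = np.random.default_rng(0).normal(0.0, 0.1, size=(300, 174))
sd = region_sd(rig_seq)
print(sd)   # each region's value is close to 0.1
```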

BibTeX

@article{pan2025vasa,
  title={VASA-Rig: Audio-driven 3D facial animation with 'live' mood dynamics in virtual reality},
  author={Pan, Ye and Liu, Chang and Xu, Sicheng and Tan, Shuai and Yang, Jiaolong},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2025},
  publisher={IEEE}
}