SARAH: Spatially Aware Real-time Agentic Humans

📅 2026-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing virtual human motion generation lacks real-time awareness of the user's position, which hinders natural orientation and interaction. This paper proposes the first end-to-end causal model that jointly leverages user location and conversational audio to generate full-body motion, including speech-driven gestures and spatial orientation, synchronously. Key innovations include a fully causal streaming inference architecture; a disentangled gaze control mechanism that lets eye-contact intensity be adjusted at inference time; and a motion generation framework combining a causal transformer variational autoencoder with trajectory- and audio-conditioned flow matching. Evaluated on the Embody 3D dataset, the system achieves state-of-the-art quality at over 300 FPS, three times faster than non-causal baselines, and has been deployed in real-time VR applications.
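The "fully causal streaming inference" claim can be illustrated with a minimal sketch: each motion chunk is generated from past context only, so per-chunk latency stays constant regardless of sequence length. `stream_motion` and `decode_chunk` are hypothetical names; `decode_chunk` stands in for the paper's transformer-VAE decoder and flow-matching sampler, which are not public API.

```python
from collections import deque

def decode_chunk(context, audio, user_pos):
    # Placeholder for the real model, which maps
    # (past latent tokens, audio features, user position) -> pose chunk.
    return {"audio": audio, "pos": user_pos, "ctx_len": len(context)}

def stream_motion(audio_chunks, user_positions, window=8):
    """Causal streaming loop: consumes inputs chunk by chunk and never
    looks at future audio or future user positions."""
    context = deque(maxlen=window)  # bounded, past-only context
    motion = []
    for audio, pos in zip(audio_chunks, user_positions):
        chunk = decode_chunk(list(context), audio, pos)
        context.append(chunk)       # newly generated chunk becomes context
        motion.append(chunk)
    return motion
```

Because the context window is bounded and strictly backward-looking, this loop can run indefinitely on a headset without the quadratic cost of attending to the full history, which is what makes the reported 300+ FPS plausible for a causal design.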

📝 Abstract
As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
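The gaze mechanism described above decouples learning from control via classifier-free guidance: the model is trained both with and without the gaze-score condition, and at inference the two predictions are blended with a user-chosen scale. A minimal sketch of that blending step, assuming `guided_velocity` and the toy velocity vectors are illustrative (the paper's actual velocity fields come from its flow-matching model):

```python
import numpy as np

def guided_velocity(v_uncond, v_gaze, guidance_scale):
    """Classifier-free guidance for flow matching: interpolate/extrapolate
    between the unconditional velocity and the gaze-conditioned one.
    scale = 0 ignores gaze, scale = 1 matches the data distribution,
    scale > 1 strengthens eye contact."""
    return v_uncond + guidance_scale * (v_gaze - v_uncond)

# Toy 2-D velocity fields standing in for the model's outputs.
v_u = np.array([0.2, 0.0])  # prediction without the gaze condition
v_g = np.array([0.5, 0.1])  # prediction conditioned on the gaze score
v_natural = guided_velocity(v_u, v_g, 1.0)  # natural eye-contact intensity
v_off = guided_velocity(v_u, v_g, 0.0)      # gaze control disabled
```

This is why eye-contact intensity is adjustable at inference time without retraining: the guidance scale is a free runtime knob, not a learned parameter.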
Problem

Research questions and friction points this paper is trying to address.

spatial awareness
embodied agents
conversational motion
real-time interaction
gaze behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatially-aware motion
causal transformer
flow matching
classifier-free guidance
real-time VR