X-Streamer: Unified Human World Modeling with Audiovisual Interaction

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of real-time, open-ended audiovisual interaction driven by a single static portrait. Methodologically, it introduces an end-to-end multimodal human-world modeling framework based on a Thinker-Actor dual-Transformer architecture that jointly models understanding and generation across text, speech, and video modalities. The approach integrates a pretrained large language–speech model with a chunk-wise autoregressive diffusion model, incorporating cross-modal attention, temporally aligned positional embeddings, and a global identity referencing mechanism to ensure fine-grained cross-modal alignment and long-horizon conversational stability. Experiments demonstrate real-time inference on two A100 GPUs, enabling hour-long coherent, high-fidelity digital-human audiovisual interaction. The authors claim this is the first unified architecture to achieve infinite closed-loop, tri-modal (text–speech–video) interaction.

📝 Abstract
We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.
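The abstract's core interface is the Actor cross-attending to the Thinker's hidden states to produce time-aligned outputs. A minimal sketch of that cross-attention step, assuming standard scaled dot-product attention; the function and variable names, dimensions, and single-head setup are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: one chunk of Actor queries
    # attends over the Thinker's hidden-state sequence.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ values

rng = np.random.default_rng(0)
d_model = 16
thinker_hidden = rng.normal(size=(10, d_model))  # Thinker states (hypothetical length 10)
actor_queries = rng.normal(size=(4, d_model))    # one Actor chunk (hypothetical length 4)
out = cross_attention(actor_queries, thinker_hidden, thinker_hidden)
print(out.shape)  # (4, 16): one conditioned output vector per Actor query
```

In the paper's framing, each such output would then drive the interleaved text/audio tokens and video latents for that chunk; the sketch only shows the conditioning mechanism itself.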
Problem

Research questions and friction points this paper is trying to address.

Building digital human agents for infinite multimodal interactions
Enabling real-time video calls from static portraits using streaming inputs
Unifying multimodal understanding and generation with dual-transformer architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-transformer architecture unifies multimodal understanding and generation
Chunk-wise autoregressive diffusion model produces synchronized multimodal responses
Inter- and intra-chunk attentions ensure long-horizon stability and alignment
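One plausible reading of the inter-/intra-chunk attention design above is a block-causal mask: tokens attend freely within their own chunk and causally to all earlier chunks. The sketch below constructs such a mask; this is an illustrative interpretation under stated assumptions, not the paper's actual masking scheme:

```python
import numpy as np

def chunk_attention_mask(num_chunks, chunk_len):
    # True where attention is allowed: full attention inside a chunk,
    # causal attention across chunks (query chunk i sees key chunk j iff j <= i).
    total = num_chunks * chunk_len
    chunk_id = np.arange(total) // chunk_len
    return chunk_id[:, None] >= chunk_id[None, :]

mask = chunk_attention_mask(num_chunks=3, chunk_len=2)
print(mask.astype(int))  # block lower-triangular with full 2x2 diagonal blocks
```

A mask like this would let each generated chunk retain the full context of prior chunks (supporting long-horizon consistency) while keeping generation autoregressive at the chunk level.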