Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses full-duplex interaction in audio-driven virtual humans. Existing methods struggle to capture speaking and listening behaviors jointly: rigid frame-wise alignment makes responses to long-range conversational context stiff, while global attention degrades lip-sync accuracy. The authors propose what they describe as the first full-duplex virtual human generation model, which accepts dual audio streams (one for speaking, one for listening). Motivated by the different timescales of talking and listening behaviors, the model uses a multi-head Gaussian kernel as a progressive temporal inductive bias, enabling long-range conversational context awareness. The authors also construct VoxHear, which they describe as the first Talking-Listening dataset with disentangled speech and background audio tracks. According to the reported experiments, the approach achieves state-of-the-art performance, preserving precise lip synchronization while improving contextual understanding, interaction naturalness, and responsiveness.

📝 Abstract

Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal scale discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking-Listening dataset, VoxHear, featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state of the art for generating highly natural and responsive full-duplex interactive digital humans. The project page is available at https://warmcongee.github.io/beyond-monologue/.
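
The abstract does not spell out the form of the multi-head Gaussian kernel, so the following is only an illustrative sketch of one way such a temporal inductive bias could work, not the authors' implementation. It assumes the kernel acts as an additive log-Gaussian bias on attention logits, with one width per head: narrow heads stay close to frame-wise alignment (as lip sync needs), while wide heads see longer conversational context (as listening reactions need). The function names, the widths in `sigmas`, and the tensor shapes are all hypothetical.

```python
import numpy as np

def gaussian_log_bias(num_frames, sigmas):
    """Per-head additive attention bias, shape (heads, T, T).

    Entry [h, i, j] is -(i - j)^2 / (2 * sigma_h^2), i.e. the log of an
    unnormalized Gaussian over the frame offset. Assumed form, not from the paper.
    """
    idx = np.arange(num_frames)
    dist2 = (idx[:, None] - idx[None, :]) ** 2            # squared frame offset
    sig = np.asarray(sigmas, dtype=float)[:, None, None]  # one width per head
    return -dist2[None, :, :] / (2.0 * sig ** 2)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)               # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, sigmas):
    """Scaled dot-product attention with a per-head Gaussian window.

    q, k, v: arrays of shape (heads, T, d).
    Returns the attended values and the attention weights.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(logits + gaussian_log_bias(q.shape[1], sigmas))
    return weights @ v, weights

# Hypothetical setup: 3 heads with increasingly wide temporal windows.
rng = np.random.default_rng(0)
heads, T, d = 3, 32, 8
sigmas = [1.0, 4.0, 16.0]
q = rng.standard_normal((heads, T, d))
k = rng.standard_normal((heads, T, d))
v = rng.standard_normal((heads, T, d))
out, w = biased_attention(q, k, v, sigmas)
```

In this reading, a very small width approaches strict frame-to-frame alignment and a very large width approaches unrestricted global attention, so the per-head widths interpolate between the two failure modes the abstract describes. How the widths are chosen, and what makes the bias "progressive", is not stated in the source.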

Problem

Research questions and friction points this paper is trying to address.

audio-driven avatar

full-duplex interaction

talking-listening behavior

temporal alignment

conversational dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex interaction

audio-driven avatar

temporal scale discrepancy