HumanOmni-Speaker: Identifying Who Said What and When

πŸ“… 2026-03-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of accurately determining β€œwho said what and when” in multi-person conversations, a task that existing methods often fail at because they exploit visual biases and lack genuine cross-modal alignment. To overcome this, the authors propose the HumanOmni-Speaker model together with the VR-SDR (Visual-Registered Speaker Diarization and Recognition) task paradigm, which enables end-to-end spatiotemporal speaker grounding from natural language queries while rigorously blocking visual shortcuts. The key innovation is the Visual Delta Encoder, which compresses inter-frame motion residuals from 25 fps video into just six tokens per frame, capturing subtle lip movements and speaker trajectories without a token explosion. The approach further integrates high-frame-rate sampling, lip reading without face cropping, and spatial localization. Experiments demonstrate state-of-the-art performance across multiple speaker-centric tasks, with strong multimodal synergy and improved spatiotemporal localization accuracy.
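To make the compression mechanism concrete, here is a minimal sketch of a delta-based encoder, assuming a learned-query cross-attention pooling design; the class name, patch size, hidden width, and attention layout are illustrative assumptions, not the paper's published architecture:

```python
import torch
import torch.nn as nn

class VisualDeltaEncoder(nn.Module):
    """Minimal sketch of the delta-compression idea: encode inter-frame
    motion residuals into a small, fixed number of tokens per frame.
    All layer names and sizes are assumptions, not the paper's design."""

    def __init__(self, patch=16, channels=3, hidden=512, tokens_per_frame=6):
        super().__init__()
        self.patch = patch
        # Learned queries that pool each residual frame into 6 tokens.
        self.queries = nn.Parameter(torch.randn(tokens_per_frame, hidden))
        self.patch_proj = nn.Linear(channels * patch * patch, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, C, H, W), video sampled at 25 fps
        B, T, C, H, W = frames.shape
        p = self.patch
        # Inter-frame motion residuals: differences between consecutive frames.
        deltas = frames[:, 1:] - frames[:, :-1]            # (B, T-1, C, H, W)
        # Split each residual into non-overlapping p x p patches.
        patches = deltas.unfold(3, p, p).unfold(4, p, p)   # (B, T-1, C, H/p, W/p, p, p)
        patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B * (T - 1), -1, C * p * p)
        feats = self.patch_proj(patches)                   # (B*(T-1), N, hidden)
        # Cross-attend learned queries to patch features: 6 tokens per frame,
        # independent of resolution, so the token count stays bounded.
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)             # (B*(T-1), 6, hidden)
        return tokens.reshape(B, T - 1, -1, tokens.size(-1))
```

Under this sketch, a 30-second clip at 25 fps (750 frames) yields only 6 Γ— 749 β‰ˆ 4.5k visual tokens, because the per-frame token count is fixed by the learned queries rather than by frame resolution.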

πŸ“ Abstract
While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer β€œWho said what and when.” Current models suffer from an β€œillusion of competence”: they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
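The β€œcatastrophic token explosion” the abstract guards against is easy to quantify. A back-of-the-envelope comparison, assuming a conventional ViT-style encoder emits roughly 256 patch tokens per frame (our assumption, not a figure from the paper):

```python
# Token budget for a 30-second clip sampled at 25 fps.
# The 256-tokens-per-frame baseline is an assumed figure for a
# conventional ViT patch encoder, not a number from the paper.
fps, seconds = 25, 30
frames = fps * seconds                 # 750 frames

naive_tokens = frames * 256            # 192,000 tokens: impractical context length
delta_tokens = frames * 6              # 4,500 tokens: tractable for an LLM backbone

print(naive_tokens, delta_tokens, round(naive_tokens / delta_tokens, 1))
# -> 192000 4500 42.7
```

At a fixed 6 tokens per frame, dense 25 fps sampling becomes affordable, which is what lets the model retain the high-frequency lip dynamics that sparse sampling destroys.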
Problem

Research questions and friction points this paper is trying to address.

Speaker Diarization
Multimodal Alignment
Conversational Dynamics
Visual Bias
Temporal Identity Binding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Delta Encoder
Speaker Diarization
Multimodal Alignment
Viseme Modeling
Token Compression
πŸ‘₯ Authors
Detao Bai
Tongyi Lab, Alibaba Group
Shimin Yao
Tongyi Lab, Alibaba Group
Weixuan Chen
PhD student at Zhejiang University
Semantic Communication Β· Secure Communication
Xihan Wei
Tongyi Lab, Alibaba Group
Zhiheng Ma
Shenzhen University of Advanced Technology