AI Summary
This work addresses the challenge of accurately determining "who said what and when" in multi-person conversations, a task often compromised by visual biases and insufficient cross-modal alignment in existing methods. To overcome this, the authors propose the HumanOmni-Speaker model coupled with the VR-SDR task paradigm, which enables end-to-end spatiotemporal speaker grounding through natural language queries while rigorously avoiding visual shortcuts. A key innovation is the novel Visual Delta Encoder, which efficiently compresses inter-frame motion residuals from 25 fps video into just six tokens per frame, effectively capturing subtle lip movements and speaker trajectories. The approach further integrates high-frame-rate sampling, uncropped lip reading, and spatial localization. Experiments demonstrate state-of-the-art performance across multiple speaker-centric tasks, significantly advancing multimodal coordination and spatiotemporal localization accuracy.
Abstract
While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
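To make the core idea of the Visual Delta Encoder concrete, below is a minimal PyTorch sketch of compressing inter-frame motion residuals into a fixed budget of 6 tokens per frame from 25 fps input, the only specifics stated above. The module name `VisualDeltaEncoderSketch`, the convolutional stem, the adaptive pooling scheme, and all layer dimensions are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class VisualDeltaEncoderSketch(nn.Module):
    """Toy sketch: compress inter-frame motion residuals into a fixed
    number of tokens per frame (6 here, following the abstract).
    The conv stem, pooling, and shapes are illustrative assumptions."""

    def __init__(self, in_channels=3, token_dim=768, tokens_per_frame=6):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        # Lightweight conv stem applied to each residual frame.
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv2d(64, token_dim, kernel_size=3, stride=2, padding=1),
        )
        # Pool the spatial feature map down to exactly `tokens_per_frame` tokens.
        self.pool = nn.AdaptiveAvgPool2d((1, tokens_per_frame))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W), sampled at 25 fps.
        deltas = frames[:, 1:] - frames[:, :-1]             # inter-frame motion residuals
        b, t, c, h, w = deltas.shape
        x = self.stem(deltas.reshape(b * t, c, h, w))       # (B*T, D, h', w')
        x = self.pool(x)                                     # (B*T, D, 1, 6)
        x = x.flatten(2).transpose(1, 2)                     # (B*T, 6, D)
        return x.reshape(b, t, self.tokens_per_frame, -1)   # (B, T-1, 6, D)


if __name__ == "__main__":
    # Example: a 2-second clip at 25 fps -> 49 residual frames x 6 tokens each.
    clip = torch.randn(1, 50, 3, 224, 224)
    tokens = VisualDeltaEncoderSketch()(clip)
    print(tokens.shape)  # torch.Size([1, 49, 6, 768])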
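```

Under this per-frame budget, a 10-second clip at 25 fps yields on the order of 1,500 visual tokens, which illustrates how the delta-token design keeps high-frame-rate sampling tractable compared with feeding full per-frame patch grids to the language model.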