AI Summary
This work addresses the challenge of accurately determining "who said what and when" in multi-person conversations, a task often compromised by visual biases and insufficient cross-modal alignment in existing methods. To overcome this, the authors propose the HumanOmni-Speaker model coupled with the VR-SDR task paradigm, which enables end-to-end spatiotemporal speaker grounding through natural language queries while rigorously avoiding visual shortcuts. A key innovation is the novel Visual Delta Encoder, which efficiently compresses inter-frame motion residuals from 25 fps video into just six tokens per frame, effectively capturing subtle lip movements and speaker trajectories. The approach further integrates high-frame-rate sampling, uncropped lip reading, and spatial localization. Experiments demonstrate state-of-the-art performance across multiple speaker-centric tasks, significantly advancing multimodal coordination and spatiotemporal localization accuracy.
Abstract
While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
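To make the core idea of the Visual Delta Encoder concrete, below is a minimal PyTorch sketch of compressing inter-frame motion residuals into a fixed budget of 6 tokens per frame from 25 fps input, the only specifics stated above. The module name `VisualDeltaEncoderSketch`, the convolutional stem, the adaptive pooling scheme, and all layer dimensions are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class VisualDeltaEncoderSketch(nn.Module):
    """Toy sketch: compress inter-frame motion residuals into a fixed
    number of tokens per frame (6 here, following the abstract).
    The conv stem, pooling, and shapes are illustrative assumptions."""

    def __init__(self, in_channels=3, token_dim=768, tokens_per_frame=6):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        # Lightweight conv stem applied to each residual frame.
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv2d(64, token_dim, kernel_size=3, stride=2, padding=1),
        )
        # Pool the spatial feature map down to exactly `tokens_per_frame` tokens.
        self.pool = nn.AdaptiveAvgPool2d((1, tokens_per_frame))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W), sampled at 25 fps.
        deltas = frames[:, 1:] - frames[:, :-1]             # inter-frame motion residuals
        b, t, c, h, w = deltas.shape
        x = self.stem(deltas.reshape(b * t, c, h, w))       # (B*T, D, h', w')
        x = self.pool(x)                                     # (B*T, D, 1, 6)
        x = x.flatten(2).transpose(1, 2)                     # (B*T, 6, D)
        return x.reshape(b, t, self.tokens_per_frame, -1)   # (B, T-1, 6, D)


if __name__ == "__main__":
    # Example: a 2-second clip at 25 fps -> 49 residual frames x 6 tokens each.
    clip = torch.randn(1, 50, 3, 224, 224)
    tokens = VisualDeltaEncoderSketch()(clip)
    print(tokens.shape)  # torch.Size([1, 49, 6, 768])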
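```

Under this per-frame budget, a 10-second clip at 25 fps yields on the order of 1,500 visual tokens, which illustrates how the delta-token design keeps high-frame-rate sampling tractable compared with feeding full per-frame patch grids to the language model.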