🤖 AI Summary
Current multimodal large language models (MLLMs) focus predominantly on joint image-text modeling, overlooking the critical role of speech in real-time interaction and relying heavily on cascaded ASR/TTS modules, which results in high latency and poor streaming performance. To address this, we propose the first unified MLLM to enable end-to-end, real-time visual-speech interaction, abandoning conventional modular speech processing. Our method introduces a novel multi-stage progressive training framework for joint representation learning over images, videos, and raw speech waveforms; incorporates cross-modal tokenization and a shared LLM decoder; and achieves, for the first time in a single model, ASR-/TTS-free streaming speech-vision dialogue. On multiple visual-speech benchmarks, our model achieves state-of-the-art performance, reduces end-to-end response latency by 67%, and attains a measured interactive latency under 300 ms.
📝 Abstract
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, placing less emphasis on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance on both vision and speech tasks remains a significant challenge due to fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating end-to-end multimodal response speed. Comparing our method against state-of-the-art counterparts across image, video, and speech benchmarks, we demonstrate that our model combines strong visual and speech capabilities, enabling near real-time vision and speech interaction.
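The multi-stage training described above can be sketched as a progressive freeze/unfreeze schedule. The sketch below is illustrative only: the component names (`vision_encoder`, `speech_encoder`, `llm_decoder`) and the three-stage split are assumptions, not details from the paper.

```python
# A minimal sketch of a multi-stage progressive training schedule, assuming
# the model decomposes into a vision encoder, a speech encoder, and a shared
# LLM decoder (all names here are hypothetical).

COMPONENTS = ["vision_encoder", "speech_encoder", "llm_decoder"]

STAGES = [
    # Stage 1: align visual features while the LLM stays frozen.
    {"name": "vision-align", "trainable": {"vision_encoder"}},
    # Stage 2: align speech features; vision and LLM stay frozen.
    {"name": "speech-align", "trainable": {"speech_encoder"}},
    # Stage 3: joint fine-tuning of everything for fluent interaction.
    {"name": "joint-finetune",
     "trainable": {"vision_encoder", "speech_encoder", "llm_decoder"}},
]

def configure_stage(stage):
    """Return a freeze map: component name -> whether its weights update."""
    return {c: (c in stage["trainable"]) for c in COMPONENTS}

for stage in STAGES:
    freeze_map = configure_stage(stage)
    trainable = [c for c, on in freeze_map.items() if on]
    print(f"{stage['name']}: training {trainable}")
```

In a real implementation each stage would set `requires_grad` on the corresponding parameter groups and run its own optimizer schedule; the point of the sketch is only the progressive widening of the trainable set across stages.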