🤖 AI Summary
This work addresses a limitation of multimodal large language models (MLLMs): jointly comprehending and generating audio-video content, particularly when sound and visuals must stay temporally synchronized. It proposes JavisGPT, presented as the first unified MLLM for Joint Audio-Video (JAV) comprehension and generation. The model adopts an encoder-LLM-decoder architecture with two key components: SyncFusion, a module for spatio-temporal audio-video fusion, and synchrony-aware learnable queries that bridge the LLM to a pretrained JAV-DiT generator. Training proceeds in three progressive stages: multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning, supported by JavisInst-Omni, a dataset of over 200K GPT-4o-curated audio-video-text dialogues. On JAV comprehension and generation benchmarks, JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
📝 Abstract
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent audio-video understanding and generation from multimodal instructions. We design an effective three-stage training pipeline, consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning, to progressively build multimodal comprehension and generation capabilities on top of existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse, multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
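The abstract does not specify how the learnable queries are computed, so the following is only a generic illustration of the idea, not JavisGPT's implementation. It assumes a common pattern in which a fixed set of query vectors attends over a variable-length sequence of LLM hidden states and returns a fixed-size condition that a downstream generator could consume. All names (`cross_attend`, `queries`, `llm_states`, `condition`) and sizes are hypothetical, and the random vectors stand in for trained parameters and real hidden states.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, states):
    """Single-head cross-attention without learned projections (assumption).

    Each query scores every state with a scaled dot product, and its output
    is the attention-weighted average of the states.
    """
    dim = len(states[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * si for qi, si in zip(q, s)) / math.sqrt(dim)
            for s in states
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]
        )
    return outputs


random.seed(0)
DIM, NUM_QUERIES, NUM_STATES = 8, 4, 10  # hypothetical sizes

# Stand-ins: in a real model the queries would be trained parameters and the
# states would come from the LLM; here both are random placeholders.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
llm_states = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_STATES)]

# Fixed-size output: NUM_QUERIES vectors of width DIM, regardless of NUM_STATES.
condition = cross_attend(queries, llm_states)
```

The point of the sketch is the shape contract: the number of output vectors depends only on the number of queries, so the generator sees a condition of constant size however long the LLM's context is. How JavisGPT makes these queries synchrony-aware is not described in the abstract and is not modeled here.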