🤖 AI Summary
Existing open-source multimodal large language models (MLLMs) exhibit significant deficiencies in joint visual-auditory-textual understanding and reasoning: most baselines reach only around 50% accuracy on tri-modal tasks, even when image or audio inputs are replaced with textual alternatives.
Method: We introduce OmniBench, a novel benchmark for tri-modal collaborative reasoning, and define the omni-language model (OLM) as a language model capable of jointly processing visual, auditory, and textual (V-A-T) inputs. We construct OmniBench via expert human annotation across diverse tri-modal tasks and curate OmniInstruct, a large-scale instruction-tuning dataset comprising 96K samples. Our methodology integrates cross-modal alignment modeling, tri-modal instruction tuning, and a human-in-the-loop evaluation framework.
Contribution/Results: Experiments reveal severe limitations of current open-source OLMs in instruction-following and reasoning on tri-modal tasks; OmniInstruct is introduced as training data to address these limitations. This work establishes a novel evaluation paradigm, provides high-quality resources, and outlines a technical pathway for advancing tri-modal foundation models.
📝 Abstract
Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities. Our evaluation reveals that: i) open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts; and ii) most baseline models perform poorly (around 50% accuracy) even with textual alternatives to image/audio inputs. To address these limitations, we develop OmniInstruct, a 96K-sample instruction-tuning dataset for training OLMs. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance. Code and data are available in our repository (https://github.com/multimodal-art-projection/OmniBench).