🤖 AI Summary
This work addresses the limited fine-grained cross-modal coreference alignment of existing Omni-LLMs, which hinders reliable multimodal collaborative reasoning. It formalizes cross-modal coreference, in which a model must localize a referent in a source modality and re-identify it in a target modality, as a central challenge, and introduces CrossOmni, a benchmark comprising nine tasks grounded in human-designed reasoning rationales. To instill coreference-aware thinking patterns, the authors propose two complementary strategies: a training-free in-context learning method and a training-based SFT+GRPO framework. Extensive experiments across 13 prominent Omni-LLMs demonstrate that both approaches substantially improve cross-modal coreference performance and generalize effectively to downstream collaborative reasoning tasks.
📝 Abstract
Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multimodal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, an aspect that has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, in which a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free in-context learning method and a training-based SFT+GRPO framework, both designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.