🤖 AI Summary
This work addresses video-audio joint understanding and cross-modal reasoning without labeled data by proposing the OmniJigsaw framework. The method builds on a temporal-shuffling self-supervised proxy task that reconstructs the chronological order of permuted audio-visual clips, and introduces three modality collaboration strategies (Joint Modality Integration, Sample-level Modality Selection, and a novel Clip-level Modality Masking) to mitigate the "bi-modal shortcut" problem during joint training. A two-stage coarse-to-fine data filtering pipeline is further designed to scale the approach to large volumes of unannotated multimodal data. Extensive experiments across 15 benchmarks show substantial improvements in video, audio, and cross-modal reasoning, validating the scalability and effectiveness of the proposed approach.
📝 Abstract
To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
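To make the proxy task concrete, the sketch below shows one way a jigsaw training target with clip-level modality masking could be constructed: clip indices are shuffled into the chronological order the model must recover, and each clip independently either keeps both streams or has its audio or video masked, which prevents the model from leaning on a single always-available modality pair. All names and parameters here (`build_jigsaw_example`, `mask_ratio`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of temporal-jigsaw target construction with
# clip-level modality masking; names, shapes, and ratios are assumptions,
# not the OmniJigsaw implementation.
import random

def build_jigsaw_example(num_clips=4, mask_ratio=0.5, seed=None):
    """Shuffle clip indices and decide, per clip, which modality survives.

    Returns the shuffled order (the reconstruction target) and a per-clip
    modality plan: 'av' keeps both streams, 'v' masks audio, 'a' masks video.
    Masking one modality at the clip level forces the model to align audio
    and video rather than exploiting a bi-modal shortcut.
    """
    rng = random.Random(seed)
    order = list(range(num_clips))
    rng.shuffle(order)  # permuted order the model must put back in time

    plan = []
    for _ in range(num_clips):
        if rng.random() < mask_ratio:
            plan.append(rng.choice(["v", "a"]))  # drop one modality for this clip
        else:
            plan.append("av")                    # keep both modalities
    return order, plan

if __name__ == "__main__":
    order, plan = build_jigsaw_example(seed=0)
    print("shuffled clip order (target):", order)
    print("per-clip modality plan:      ", plan)
```

Under these assumptions, the shuffled order serves as a freely available supervision signal, while the modality plan implements the fine-grained masking the abstract contrasts with sample-level selection.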