Context Unrolling in Omni Models

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unified modeling of heterogeneous multimodal data with Omni, a unified model natively trained across text, images, videos, 3D geometry, and hidden representations. According to the abstract, this joint training enables Context Unrolling: the model explicitly reasons across multiple modal representations before producing a prediction, aggregating complementary cross-modal information to better approximate a shared multimodal knowledge manifold. By leveraging a unified architecture and representation alignment, Omni is reported to improve cross-modal consistency and generation fidelity. Omni achieves strong performance on multimodal understanding and generation benchmarks and supports in-context generation of text, images, video, and 3D geometry within a single context.

📝 Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.

Problem

Research questions and friction points this paper is trying to address.

multimodal integration

heterogeneous modalities

shared knowledge manifold

context unrolling

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Context Unrolling

Unified Multimodal Model

Multimodal Reasoning