Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fragility of general-purpose multimodal models in physical understanding, which stems from the difficulty of learning essential physical properties from visually ambiguous, data-sparse web-scale corpora. The authors propose OmniFysics, a compact multimodal model that integrates perception and generation across images, audio, video, and text. Key contributions include two physics-aware data engines, FysicsAny and FysicsOmniCap, which use hierarchical prototype retrieval and audio-visual consistency filtering to synthesize high-quality training data; an intent router that activates generation modules only when needed; and a training strategy combining staged alignment, instruction tuning, and explicit physical-law constraints. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.

📝 Abstract
Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction-image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio-visual consistency filtering to generate high-fidelity video-instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
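For readers unfamiliar with the image-generation objective mentioned in the abstract, the generic flow-matching loss can be written in a few lines: the model regresses onto the velocity of a straight-line path from noise to a data latent. This is the standard (rectified-flow style) formulation, not the paper's exact recipe; `model` and the latent shapes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1):
    """One training step: regress the model onto the velocity of the
    straight-line path from a noise sample x0 to a data latent x1."""
    x0 = rng.standard_normal(x1.shape)        # noise endpoint of the path
    t = rng.random((x1.shape[0], 1))          # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # interpolated point on the path
    v_target = x1 - x0                        # constant velocity of the line
    v_pred = model(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))

# Sanity check with a trivial "model" that always predicts zero velocity:
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), np.zeros((4, 8)))
```

At inference, images are produced by integrating the learned velocity field from noise at t = 0 to a latent at t = 1, which an image decoder then renders.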
Problem

Research questions and friction points this paper is trying to address.

physical understanding
omni-modal models
visually ambiguous
sparsely represented
physical attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

omni-modal architecture
physical data engine
physics-grounded instruction
latent-space flow matching
cross-modal physical cues