AI Summary
Existing multimodal benchmarks suffer from incomplete modality coverage, unidirectional (text-centric) interaction, and inadequate modeling of cross-modal dependencies and complementarity. To address these limitations, this work introduces the first fully multimodal benchmark supporting arbitrary bidirectional input-output combinations across image, video, audio, and text modalities. It comprehensively evaluates understanding, generation, and reasoning capabilities across 16 task categories, with 3,268 high-quality, cross-source samples. We propose an any-to-any evaluation paradigm and a Cross-Modal Complementarity Screening (CMCS) strategy to systematically construct speech-interaction and fusion-dependent reasoning data. Leveraging a multi-source aggregation and structured annotation framework, the benchmark enables standardized assessment of multimodal large language models (MLLMs), specialized models, unified generative models, and fully multimodal language models. Evaluation across 30+ state-of-the-art models reveals critical capability gaps, establishes strong baselines, and provides a unified evaluation standard to advance next-generation multimodal architectures.
Abstract
Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, interaction restricted to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy, integrated into a systematic data construction framework, which produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.