AI Summary
Existing multimodal benchmarks suffer from incomplete modality coverage, unidirectional (text-centric) interaction, and inadequate modeling of cross-modal dependencies and complementarity. To address these limitations, this work introduces the first fully multimodal benchmark supporting arbitrary bidirectional input-output combinations across image, video, audio, and text modalities. It comprehensively evaluates understanding, generation, and reasoning capabilities across 16 task categories, with 3,268 high-quality, cross-source samples. We propose an any-to-any evaluation paradigm and a Cross-Modal Complementarity Screening (CMCS) strategy to systematically construct speech-interaction and fusion-dependent reasoning data. Leveraging a multi-source aggregation and structured annotation framework, the benchmark enables standardized assessment of multimodal large language models (MLLMs), specialized models, unified generative models, and fully multimodal language models. Evaluation across 30+ state-of-the-art models reveals critical capability gaps, establishes strong baselines, and provides a unified evaluation standard to advance next-generation multimodal architectures.
Abstract
Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, interaction restricted to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy, integrated into a systematic data construction framework, which produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.