A unified multimodal understanding and generation model for cross-disciplinary scientific research

📅 2026-01-04

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of integrating the heterogeneous, high-dimensional data inherent in interdisciplinary scientific problems, where existing AI models are often domain-specific and struggle to unify understanding and generation across diverse scientific sources. The authors propose FuXi-Uni, a general-purpose framework that natively unifies multimodal scientific data understanding and generation within a single architecture. By aligning scientific tokens with natural language tokens and employing a dedicated science decoder, FuXi-Uni builds a shared latent space intended to preserve both cross-disciplinary generality and domain-specific performance. In Earth system modeling, it outperforms the state-of-the-art physical forecasting system on 10-day global weather forecasting and on tropical cyclone track and intensity prediction, and its downscaled regional fields surpass standard interpolation baselines. It also outperforms leading multimodal large language models on biomedical visual question answering benchmarks.

📝 Abstract

Scientific discovery increasingly relies on integrating heterogeneous, high-dimensional data across disciplines. While AI models have achieved notable success in various scientific domains, they typically remain domain-specific or cannot simultaneously understand and generate multimodal scientific data, particularly high-dimensional data. Yet many pressing global challenges and scientific problems are inherently cross-disciplinary and require coordinated progress across multiple fields. Here, we present FuXi-Uni, a native unified multimodal model for scientific understanding and high-fidelity generation across scientific domains within a single architecture. Specifically, FuXi-Uni aligns cross-disciplinary scientific tokens with natural language tokens and employs a science decoder to reconstruct scientific tokens, thereby supporting both natural language conversation and scientific numerical prediction. Empirically, we validate FuXi-Uni in Earth science and biomedicine. In Earth system modeling, the model supports global weather forecasting, tropical cyclone (TC) forecast editing, and spatial downscaling driven only by language instructions. FuXi-Uni generates 10-day global forecasts at 0.25° resolution that outperform the state-of-the-art (SOTA) physical forecasting system. It outperforms the SOTA physical model on both TC track and intensity prediction, and generates high-resolution regional weather fields that surpass standard interpolation baselines. In biomedicine, FuXi-Uni outperforms leading multimodal large language models on multiple biomedical visual question answering benchmarks. By unifying heterogeneous scientific modalities within a native shared latent space while maintaining strong domain-specific performance, FuXi-Uni provides a step toward more general-purpose, multimodal scientific models.
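
The abstract does not say how gridded data become "scientific tokens" or how the science decoder maps them back to numbers. As a rough illustration only, the sketch below assumes a simple patch-based scheme: a 2D field is cut into fixed-size patches, each flattened into one token, and the inverse step reassembles the grid. The function names `patchify` and `unpatchify`, the patch size, and the toy grid are all hypothetical and are not taken from the paper, which may use a learned tokenizer and decoder instead.

```python
def patchify(field, patch):
    """Split a 2D grid (list of rows) into flat patch tokens, row-major.

    Hypothetical stand-in for a scientific tokenizer; not the paper's method.
    """
    height, width = len(field), len(field[0])
    if height % patch or width % patch:
        raise ValueError("grid dimensions must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch window into a single token vector.
            tokens.append(
                [field[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return tokens


def unpatchify(tokens, height, width, patch):
    """Reassemble the grid from patch tokens (the inverse of patchify)."""
    field = [[0.0] * width for _ in range(height)]
    patches_per_row = width // patch
    for index, token in enumerate(tokens):
        top = (index // patches_per_row) * patch
        left = (index % patches_per_row) * patch
        for k, value in enumerate(token):
            field[top + k // patch][left + k % patch] = value
    return field


# Toy 4 x 8 "weather field"; a real global grid would be far larger.
field = [[float(r * 8 + c) for c in range(8)] for r in range(4)]
tokens = patchify(field, patch=2)
restored = unpatchify(tokens, height=4, width=8, patch=2)
```

Here the round trip is exact because nothing is learned or compressed. In a model of the kind the abstract describes, the tokens would presumably pass through a shared latent space alongside language tokens before decoding, so reconstruction would be approximate.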

Problem

Research questions and friction points this paper is trying to address.

multimodal scientific data

cross-disciplinary integration

high-dimensional data

scientific understanding and generation

heterogeneous data

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified multimodal model

scientific understanding and generation

cross-disciplinary AI