TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Unified multimodal models (UMMs) are hard to compare fairly because they differ in architecture, training paradigm, and implementation details. This work presents TorchUMM, a unified codebase built on PyTorch that supports multimodal understanding, generation, and editing tasks across diverse backbone architectures, model scales, and datasets. Its standardized evaluation protocols and consistent interface enable fair, reproducible comparisons across heterogeneous UMMs, which the authors describe as a first. The codebase also integrates a benchmarking suite that assesses perception, reasoning, compositionality, and instruction following, along with a post-training pipeline. The authors position this infrastructure as a foundation for developing more capable and generalizable unified multimodal systems.

📝 Abstract

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.
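
The idea of a unified interface with a standardized evaluation protocol can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `UnifiedModel`, `Sample`, `EchoModel`, `exact_match`, and `evaluate` are hypothetical and are not taken from the TorchUMM repository, images are represented as plain strings, and exact match stands in for whatever metrics the benchmark actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

# The three task dimensions named in the abstract.
TASKS = ("understand", "generate", "edit")


class UnifiedModel(Protocol):
    """Hypothetical common interface; NOT the actual TorchUMM API."""

    name: str

    def understand(self, image: str, question: str) -> str: ...
    def generate(self, prompt: str) -> str: ...
    def edit(self, image: str, instruction: str) -> str: ...


@dataclass
class Sample:
    task: str               # one of TASKS
    inputs: Dict[str, str]  # keyword arguments for the task method
    reference: str          # expected output used by the metric


class EchoModel:
    """Toy stand-in for a real backbone; it only formats strings."""

    name = "echo"

    def understand(self, image: str, question: str) -> str:
        return f"answer:{question}"

    def generate(self, prompt: str) -> str:
        return f"image:{prompt}"

    def edit(self, image: str, instruction: str) -> str:
        return f"{image}+{instruction}"


def exact_match(prediction: str, reference: str) -> float:
    """Placeholder metric: 1.0 on an exact string match, else 0.0."""
    return 1.0 if prediction == reference else 0.0


def evaluate(
    model: UnifiedModel,
    samples: List[Sample],
    metric: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run every sample through the same code path and average per task."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for sample in samples:
        if sample.task not in TASKS:
            raise ValueError(f"unknown task: {sample.task}")
        prediction = getattr(model, sample.task)(**sample.inputs)
        totals[sample.task] = totals.get(sample.task, 0.0) + metric(
            prediction, sample.reference
        )
        counts[sample.task] = counts.get(sample.task, 0) + 1
    return {task: totals[task] / counts[task] for task in totals}


SAMPLES = [
    Sample("understand", {"image": "cat.png", "question": "what?"}, "answer:what?"),
    Sample("generate", {"prompt": "a red cube"}, "image:a red cube"),
    Sample("edit", {"image": "cat.png", "instruction": "add hat"}, "cat.png+add hat"),
    Sample("edit", {"image": "cat.png", "instruction": "remove hat"}, "wrong"),
]

if __name__ == "__main__":
    print(evaluate(EchoModel(), SAMPLES))
```

In this sketch, any backbone that exposes the same three methods can be passed to `evaluate` unchanged, so every model is scored on identical samples with an identical metric. That is the property the abstract refers to as fair and reproducible comparison; how TorchUMM actually achieves it is documented in its repository.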

Problem

Research questions and friction points this paper is trying to address.

unified multimodal models

evaluation framework

model heterogeneity

multimodal understanding

standardized benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Model

Standardized Evaluation

Multimodal Understanding