UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current unified vision large language models (VLLMs) have achieved notable progress on understanding and generation tasks individually, yet the field lacks a unified multimodal dataset capable of jointly eliciting both capabilities. To address this gap, the authors propose UnifiedVisual, a dataset construction framework, together with the high-quality dataset UnifiedVisual-240K, which systematically integrates bidirectional cross-modal tasks, including visual question answering and text-to-image generation, thereby overcoming the unidirectional modeling limitations of conventional datasets. The approach employs fine-grained image-text alignment, fusion of heterogeneous multi-source data, and diverse input-output designs to support complex cross-modal reasoning. Extensive experiments demonstrate that models trained on UnifiedVisual-240K achieve significant improvements over baselines on joint understanding-and-generation metrics and exhibit substantial gains across multiple benchmark tasks. This work establishes a foundational data resource and methodological paradigm for developing unified VLLMs.

📝 Abstract
Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation, powering applications such as visual question answering and text-guided image synthesis. However, progress in unified VLLMs remains constrained by the lack of datasets that fully exploit the synergistic potential between these two core abilities. Existing datasets typically address understanding and generation in isolation, thereby limiting the performance of unified VLLMs. To bridge this critical gap, we introduce a novel dataset construction framework, UnifiedVisual, and present UnifiedVisual-240K, a high-quality dataset meticulously designed to facilitate mutual enhancement between multimodal understanding and generation. UnifiedVisual-240K seamlessly integrates diverse visual and textual inputs and outputs, enabling comprehensive cross-modal reasoning and precise text-to-image alignment. Our dataset encompasses a wide spectrum of tasks and data sources, ensuring rich diversity and addressing key shortcomings of prior resources. Extensive experiments demonstrate that models trained on UnifiedVisual-240K consistently achieve strong performance across a wide range of tasks. Notably, these models exhibit significant mutual reinforcement between multimodal understanding and generation, further validating the effectiveness of our framework and dataset. We believe UnifiedVisual represents a new growth point for advancing unified VLLMs and unlocking their full potential. Our code and datasets are available at https://github.com/fnlp-vision/UnifiedVisual.
Problem

Research questions and friction points this paper is trying to address.

Lack of unified datasets for multimodal understanding and generation
Existing datasets isolate vision-language understanding from generation tasks
Limited synergistic potential between multimodal abilities in current VLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for unified vision-language datasets
Integrates multimodal understanding and generation
Enables cross-modal reasoning and alignment
👥 Authors
Pengyu Wang, Fudan University
Shaojun Zhou, Fudan University
Chenkun Tan, Fudan University
Xinghao Wang, Fudan University (Natural Language Processing · Large Language Models)
Wei Huang, Fudan University
Zhen Ye, Fudan University
Zhaowei Li, Moonshot AI (Computer Vision · Natural Language Processing)
Botian Jiang, Fudan University
Dong Zhang, Fudan University
Xipeng Qiu, Fudan University