Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unified multimodal models face three key bottlenecks: the high computational cost of autoregressive approaches; the narrow task coverage and low generation quality of two-stage methods; and reliance on manually specified metadata (e.g., task type, resolution) rather than automated parsing. This paper proposes Unison, a lightweight, general-purpose, fully automated framework for multimodal understanding and generation. Built on the two-stage paradigm, it connects pretrained foundation models via lightweight alignment fine-tuning and integrates modules for task identification, metadata extraction, and intent parsing, enabling end-to-end adaptive inference. Trained on only 500K samples and 50 GPU-hours, it significantly outperforms existing low-cost methods across text, image, and video understanding and generation tasks. It achieves high task-identification accuracy, automatic parameter adaptation, and markedly improved generation fidelity, overcoming the long-standing trade-off among task generalization, generation faithfulness, and system-level intelligence.

📝 Abstract
Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.
Problem

Research questions and friction points this paper is trying to address.

Autoregressive unified models demand massive data and compute that are unaffordable for ordinary researchers.
Existing two-stage methods lower training cost but suffer from limited task coverage or poor generation quality.
Both approaches require manual specification of meta-information (task type, resolution, duration), making inference tedious and non-automated.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage scheme preserving pre-trained model capabilities
Automatic parsing of user intentions and task meta-information
Low-cost training with 500k samples and 50 GPU hours
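To make the "automatic parsing of user intentions and task meta-information" idea concrete, here is a deliberately simplified, rule-based sketch of what such a front-end does: map a free-form prompt to a task type and the parameters (resolution, duration) a downstream generator would need. The paper's actual module is learned, not rule-based; the function name, keyword table, and metadata fields below are illustrative assumptions, not Unison's API.

```python
import re

# Toy keyword table standing in for a learned task classifier
# (assumption for illustration, not the paper's method).
TASK_KEYWORDS = {
    "edit": "image_editing",
    "video": "text_to_video",
    "describe": "image_understanding",
}

def parse_request(prompt: str) -> dict:
    """Guess the task type and extract resolution/duration if present."""
    lowered = prompt.lower()
    task = "text_to_image"  # default task when no keyword matches
    for keyword, task_type in TASK_KEYWORDS.items():
        if keyword in lowered:
            task = task_type
            break

    meta = {}
    # Resolution patterns like "1024x1024" or "1280 × 720"
    if m := re.search(r"(\d{3,4})\s*[x×]\s*(\d{3,4})", prompt):
        meta["resolution"] = (int(m.group(1)), int(m.group(2)))
    # Duration patterns like "5 s", "5 sec", "5 seconds"
    if m := re.search(r"(\d+)\s*(?:seconds?|secs?|s)\b", lowered):
        meta["duration_s"] = int(m.group(1))

    return {"task": task, "meta": meta}
```

For example, `parse_request("Make a 5 second video of ocean waves")` would yield the `text_to_video` task with `duration_s=5`, sparing the user any manual parameter configuration, which is the automation gap the paper targets.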