UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capability of multimodal large language models (MLLMs) in perceptual-level image understanding (e.g., aesthetics, quality, structure, and texture) by proposing UniPercept, the first unified perceptual understanding framework. Methodologically, it formally defines perceptual-level understanding tasks; constructs UniPercept-Bench, a large-scale benchmark spanning four core perceptual dimensions; and introduces a joint training paradigm that integrates domain-adaptive pretraining with task-aligned reinforcement learning, incorporating hierarchical modeling, dual VQA/VR interfaces, and a plug-and-play reward architecture. Contributions include: (1) substantial performance gains over state-of-the-art MLLMs on perception-centric tasks; (2) seamless integration as a fine-grained perceptual reward module for text-to-image generation models; and (3) a new benchmark and strong baseline for multimodal perceptual understanding.
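As a rough illustration of the dual VR/VQA interface mentioned in the summary, the sketch below shows how a visual-rating query and a free-form perceptual question might be posed to such a model. This is a minimal sketch under assumed conventions: `PerceptualMLLM`, `score_image`, `ask_about_image`, the prompt wording, and the 1-5 rating scale are all hypothetical stand-ins, not UniPercept's actual API.

```python
# Hypothetical sketch of a dual VR/VQA interface for a perceptual MLLM.
# All names, prompts, and the 1-5 scale are assumptions for illustration.
import re

DIMENSIONS = ("aesthetics", "quality", "structure", "texture")

class PerceptualMLLM:
    """Stub for any chat-style multimodal model (image + text in, text out)."""
    def chat(self, image_path: str, prompt: str) -> str:
        return "3"  # placeholder; a real model call would go here

def score_image(model: PerceptualMLLM, image_path: str, dimension: str) -> int:
    """Visual Rating (VR): request a scalar score on one perceptual dimension."""
    assert dimension in DIMENSIONS
    prompt = (f"Rate the {dimension} of this image on a 1-5 scale. "
              "Answer with a single digit.")
    reply = model.chat(image_path, prompt)
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0  # 0 flags an unparsable reply

def ask_about_image(model: PerceptualMLLM, image_path: str, question: str) -> str:
    """Visual Question Answering (VQA): open-ended perceptual question."""
    return model.chat(image_path, question)

if __name__ == "__main__":
    model = PerceptualMLLM()
    print({d: score_image(model, "example.jpg", d) for d in DIMENSIONS})
```

Keeping both interfaces behind one model is what would let a single checkpoint serve rating benchmarks and open-ended perceptual QA alike.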

📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to understand perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, and Structure & Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline, UniPercept, trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLMs' limited perception of image aesthetics, quality, structure, and texture.
Establishing a benchmark and datasets for evaluating perceptual-level image understanding.
Developing a unified model that handles both visual rating and visual question answering.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for perceptual-level image understanding
Joint training via Domain-Adaptive Pre-Training and Task-Aligned RL
Plug-and-play reward model for text-to-image generation (see the sketch after this list)
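To make the reward-model idea concrete, here is a minimal sketch of how per-dimension perceptual scores could be collapsed into a scalar reward for guiding a text-to-image generator. The dimension weights and the [0, 1] normalization are illustrative assumptions; they are not the paper's actual plug-and-play reward architecture.

```python
# Minimal sketch: aggregating per-dimension perceptual scores into a scalar
# reward for a text-to-image model. The weights and [0, 1] normalization are
# illustrative assumptions; the paper's reward design may differ.
from typing import Dict

# Assumed per-dimension weights; any convex combination would work here.
WEIGHTS: Dict[str, float] = {
    "aesthetics": 0.3,
    "quality": 0.3,
    "structure": 0.2,
    "texture": 0.2,
}

def perceptual_reward(scores: Dict[str, float], scale_max: float = 5.0) -> float:
    """Map 1..scale_max dimension scores to a weighted reward in [0, 1]."""
    return sum(WEIGHTS[d] * (scores[d] / scale_max) for d in WEIGHTS)

if __name__ == "__main__":
    # Example: scores as they might come back from VR-style rating queries.
    print(perceptual_reward(
        {"aesthetics": 4, "quality": 5, "structure": 3, "texture": 4}
    ))  # 0.82
```

In an RL fine-tuning loop, this scalar would score sampled images so the generator is pushed toward outputs that rate well on all four dimensions at once.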
👥 Authors
Shuo Cao (University of Science and Technology of China)
Jiayang Li (Peking University)
Xiaohui Li (Shanghai Jiao Tong University)
Yuandong Pu (Shanghai Jiao Tong University; Shanghai AI Laboratory)
Kaiwen Zhu (Shanghai Jiao Tong University)
Yuanting Gao (Tsinghua University)
Siqi Luo (Shanghai Jiao Tong University)
Yi Xin (California Institute of Technology)
Qi Qin (Shanghai Jiao Tong University)
Yu Zhou (Sun Yat-sen University)
Xiangyu Chen (Tele-AI)
Wenlong Zhang (Shanghai AI Laboratory)
Bin Fu (Shanghai AI Laboratory)
Yu Qiao (Shanghai AI Laboratory)
Yihao Liu (Shanghai AI Laboratory)