One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment

📅 2026-03-20
🤖 AI Summary
This work addresses mismatched reasoning strategies and optimization objectives in unified image quality assessment (IQA) and image aesthetic assessment (IAA). The authors propose TATAR, a framework that shares a vision-language backbone and conditions post-training on the task: IQA is paired with a fast track of concise, distortion-focused rationales, and IAA with a slow track of deliberative aesthetic reasoning. The reward is asymmetric, using Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Training proceeds in two stages: supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO). Across eight benchmarks, TATAR consistently outperforms prior unified baselines in both in-domain and cross-domain settings, remains competitive with task-specific models, and yields more stable training for aesthetic assessment.

📝 Abstract
Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task's nature. TATAR combines three components: fast-slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at https://github.com/yinwen2019/TATAR.
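
The abstract names three ingredients without giving formulas: Gaussian score shaping, Thurstone-style ranking, and GRPO. The sketch below shows the textbook form of each so the terms are concrete. It is not taken from the TATAR code base. Every function name, the `sigma` default, and the unit variances are assumptions made for illustration, and the paper's actual reward definitions may differ.

```python
import math


def gaussian_score_reward(pred, target, sigma=0.5):
    """Gaussian-shaped reward: 1.0 at an exact match, decaying smoothly with error.

    `sigma` (assumed value) controls how quickly the reward falls off.
    """
    return math.exp(-((pred - target) ** 2) / (2.0 * sigma ** 2))


def thurstone_preference(mu_a, mu_b, var_a=1.0, var_b=1.0):
    """Thurstone model: probability that item a is ranked above item b.

    Each item's perceived score is treated as Gaussian, so
    P(a > b) = Phi((mu_a - mu_b) / sqrt(var_a + var_b)).
    The unit variances are placeholders, not values from the paper.
    """
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # IQA-style: reward each sampled completion by closeness to the label.
    iqa_rewards = [gaussian_score_reward(p, target=3.2) for p in (3.0, 3.2, 4.1)]
    print(group_relative_advantages(iqa_rewards))
    # IAA-style: compare two predicted aesthetic scores pairwise.
    print(thurstone_preference(6.1, 5.4))
```

The Gaussian term rewards absolute accuracy, which suits point-wise quality scores. The Thurstone term depends only on score differences, which is one plausible reason to prefer it when absolute aesthetic scores are noisy.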
Problem

Research questions and friction points this paper is trying to address.

Image Quality Assessment
Image Aesthetic Assessment
Task-Agnostic Reasoning
Reasoning Mismatch
Optimization Mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

task-conditioned reasoning
asymmetric rewards
unified image assessment
multimodal LLM
perceptual-aesthetic alignment
Authors

Wen Yin
University of Electronic Science and Technology of China; Jiigan Technology
Cencen Liu
University of Electronic Science and Technology of China; Jiigan Technology
Dingrui Liu
Jiigan Technology
Bing Su
Jiigan Technology
Yuan-Fang Li
Oracle | Monash University
Large language model, Knowledge graphs, natural language processing
Tao He
UESTC
Image Retrieval, Computer Vision