🤖 AI Summary
This work addresses the misalignment between inference strategies and optimization objectives in unified image quality assessment (IQA) and image aesthetic assessment (IAA). The authors propose the TATAR framework, which shares a vision-language backbone while introducing a task-aware dual-track reasoning mechanism: distinct fast and slow pathways tailor the inference process to IQA and IAA, respectively. An asymmetric reward function integrates Gaussian score shaping with Thurstone-style pairwise ranking. Trained in two stages, supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO), TATAR is the first approach to systematically resolve both the inference and optimization mismatches in unified assessment. The method significantly outperforms existing unified models across eight benchmarks, performing strongly in both in-domain and cross-domain settings while improving training stability for aesthetic evaluation.
📝 Abstract
Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the vision-language backbone while conditioning post-training on each task's nature. TATAR combines three components: fast–slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at https://github.com/yinwen2019/TATAR.
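The two reward shapes named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bandwidth `sigma`, the discriminal-dispersion `scale`, and the use of predicted scalar scores as Thurstone discriminal values are all assumptions for the sake of the example.

```python
import math

def gaussian_score_reward(pred: float, target: float, sigma: float = 0.5) -> float:
    """Gaussian score shaping for IQA (assumed form): reward peaks at 1.0
    when the predicted score matches the ground truth and decays smoothly
    with the squared error."""
    return math.exp(-((pred - target) ** 2) / (2.0 * sigma ** 2))

def _normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def thurstone_rank_reward(pred_i: float, pred_j: float,
                          target_i: float, target_j: float,
                          scale: float = 1.0) -> float:
    """Thurstone Case V-style pairwise reward for IAA (assumed form):
    under Thurstone's model, P(i preferred over j) = Phi((s_i - s_j) / (scale * sqrt(2))).
    We reward the probability mass assigned to the ground-truth ordering,
    so correct, confident orderings score near 1 and inverted ones near 0."""
    sign = 1.0 if target_i >= target_j else -1.0
    return _normal_cdf(sign * (pred_i - pred_j) / (scale * math.sqrt(2.0)))
```

The asymmetry is the point: the Gaussian term cares about absolute calibration of a single score (appropriate for objective distortion ratings), while the Thurstone term cares only about relative ordering between completions, which sidesteps point-wise regression on subjective aesthetic scores.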