QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing visual quality assessment methods, which rely on costly human annotation and are susceptible to dataset bias, hindering fine-grained, interpretable quality understanding. The authors propose a training-free retrieval-augmented generation (RAG) framework that, for the first time, integrates four dynamically constructed, complementary knowledge sources into the RAG pipeline: visual metadata, subject localization, global quality summaries, and local quality descriptions. By combining structured query decomposition with relevance-aware retrieval, the framework guides large multimodal models to perform evidence-grounded visual quality reasoning. This approach overcomes the constraints of static corpora and, without any task-specific fine-tuning, substantially outperforms both general-purpose and fine-tuned large models on visual quality understanding tasks while achieving competitive results on quality comparison tasks, demonstrating strong zero-shot evaluation capability.

📝 Abstract
Visual quality assessment (VQA) is increasingly shifting from scalar score prediction toward interpretable quality understanding, a paradigm that demands fine-grained spatiotemporal perception and auxiliary contextual information. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which involve labor-intensive annotation and are prone to dataset-specific biases. To address these challenges, we propose QualiRAG, a training-free Retrieval-Augmented Generation (RAG) framework that systematically leverages the latent perceptual knowledge of large multimodal models (LMMs) for visual quality perception. Unlike conventional RAG that retrieves from static corpora, QualiRAG dynamically generates auxiliary knowledge by decomposing questions into structured requests and constructing four complementary knowledge sources: visual metadata, subject localization, global quality summaries, and local quality descriptions, followed by relevance-aware retrieval for evidence-grounded reasoning. Extensive experiments show that QualiRAG achieves substantial improvements over open-source general-purpose LMMs and VQA-finetuned LMMs on visual quality understanding tasks, and delivers competitive performance on visual quality comparison tasks, demonstrating robust quality assessment capabilities without any task-specific training. The code will be publicly available at https://github.com/clh124/QualiRAG.
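The pipeline the abstract describes (decompose the question, generate the four knowledge sources, retrieve the relevant ones, and assemble an evidence-grounded prompt) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all function names, the toy token-overlap relevance score, and the placeholder knowledge generator are assumptions standing in for the LMM calls the method would actually make.

```python
# Hypothetical sketch of a QualiRAG-style pipeline:
# 1) decompose the question into structured requests,
# 2) generate one knowledge entry per source (stand-in for LMM calls),
# 3) keep only entries relevant to some request,
# 4) assemble an evidence-grounded prompt for the reasoning LMM.

SOURCES = [
    "visual_metadata",
    "subject_localization",
    "global_quality_summary",
    "local_quality_descriptions",
]

def decompose(question: str) -> list[str]:
    # Toy decomposition: one structured request per "and"-joined clause.
    return [part.strip() for part in question.split(" and ") if part.strip()]

def generate_knowledge(image_id: str) -> dict[str, str]:
    # Placeholder for the dynamic knowledge generation step; a real system
    # would query an LMM to describe the image along each source.
    return {s: f"[{s} for {image_id}]" for s in SOURCES}

def relevance(request: str, source: str) -> float:
    # Toy relevance: token overlap between the request and the source name.
    req = set(request.lower().replace("?", "").split())
    src = set(source.split("_"))
    return len(req & src) / len(src)

def retrieve(question: str, knowledge: dict[str, str],
             threshold: float = 0.3) -> list[str]:
    requests = decompose(question)
    return [knowledge[s] for s in SOURCES
            if any(relevance(r, s) >= threshold for r in requests)]

def build_prompt(question: str, image_id: str) -> str:
    evidence = retrieve(question, generate_knowledge(image_id))
    return "Evidence:\n" + "\n".join(evidence) + f"\nQuestion: {question}"
```

For example, `build_prompt("is the subject sharp and is the global quality good?", "img_001")` would include the subject-localization and global-quality entries as evidence while dropping the unrelated visual-metadata entry; in the paper this filtering is the relevance-aware retrieval step, performed over LMM-generated knowledge rather than a keyword heuristic.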
Problem

Research questions and friction points this paper is trying to address.

Visual Quality Assessment
Interpretable Quality Understanding
Fine-grained Spatiotemporal Perception
Dataset Bias
Auxiliary Contextual Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Training-Free
Visual Quality Understanding
Multimodal Reasoning
Dynamic Knowledge Generation