AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation

📅 2026-04-19

🤖 AI Summary
High-quality visual question answering with grounding (VQA-G) data relies on costly, labor-intensive human annotation, which limits scalability, while existing automated approaches suffer from model hallucination and brittle, heuristic-based verification. This work proposes AutoVQA-G, a self-improving agentic framework in which a consistency evaluation module uses chain-of-thought reasoning for fine-grained visual verification and a memory-augmented prompt optimization agent analyzes critiques of failed samples. The agent thereby learns from failure cases and progressively refines its generation prompts. According to the authors' experiments, the resulting closed-loop pipeline produces VQA-G datasets with higher visual grounding accuracy than leading multimodal large language models, offering a promising source of high-fidelity data for training and evaluating vision-language models.

📝 Abstract
Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G
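
The refinement loop described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: every name here (`Sample`, `Verdict`, `PromptOptimizer`, `refinement_loop`, the box format, and the toy generator and evaluator) is a hypothetical stand-in, and the real system would call a VLM for generation and a CoT-based evaluator for verification. The sketch only shows the control flow: generate, verify, collect critiques from failed samples, store them in memory, and refine the prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical grounding format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Sample:
    question: str
    answer: str
    box: Box


@dataclass
class Verdict:
    passed: bool
    critique: str  # stands in for the evaluator's CoT rationale


@dataclass
class PromptOptimizer:
    """Assumed memory-augmented optimizer: remembers past critiques."""
    memory: List[str] = field(default_factory=list)

    def refine(self, prompt: str, critiques: List[str]) -> str:
        # Only critiques not already in memory change the prompt.
        new = [c for c in critiques if c not in self.memory]
        self.memory.extend(new)
        return prompt + "".join(f"\nAvoid: {c}" for c in new)


def refinement_loop(
    image: object,
    prompt: str,
    generate: Callable[[object, str], List[Sample]],
    evaluate: Callable[[object, Sample], Verdict],
    max_rounds: int = 3,
):
    """Return the last round's accepted samples, final prompt, and memory."""
    optimizer = PromptOptimizer()
    accepted: List[Sample] = []
    for _ in range(max_rounds):
        accepted, critiques = [], []
        for sample in generate(image, prompt):
            verdict = evaluate(image, sample)
            if verdict.passed:
                accepted.append(sample)
            else:
                critiques.append(verdict.critique)
        if not critiques:  # every sample passed verification
            break
        prompt = optimizer.refine(prompt, critiques)
    return accepted, prompt, optimizer.memory


# Toy stand-ins so the sketch runs without any model.
def toy_generate(image: object, prompt: str) -> List[Sample]:
    # Pretend the generator only grounds correctly once warned.
    box = (0.0, 0.0, 10.0, 10.0) if "Avoid" in prompt else (5.0, 5.0, 5.0, 5.0)
    return [Sample("What is on the table?", "a cup", box)]


def toy_evaluate(image: object, sample: Sample) -> Verdict:
    x1, y1, x2, y2 = sample.box
    ok = x2 > x1 and y2 > y1
    return Verdict(ok, "" if ok else "box has zero area")


accepted, final_prompt, memory = refinement_loop(
    None, "Describe the image.", toy_generate, toy_evaluate
)
```

In this toy run the first round fails verification, the critique is written to memory and appended to the prompt, and the second round passes. How the actual Prompt Optimization agent rewrites prompts is not specified in the abstract, so the simple append strategy above is purely illustrative.
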
Problem

Research questions and friction points this paper is trying to address.

Visual Question Answering
Grounding Annotation
Model Hallucination
Data Fidelity
Automated Annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Improving Agentic Framework
Visual Question Answering with Grounding
Chain-of-Thought Reasoning
Prompt Optimization
Automated Dataset Annotation