🤖 AI Summary
High-quality visual question answering with grounding (VQA-G) data relies on costly, labor-intensive human annotation, which limits scalability, while existing automated approaches suffer from model hallucination and brittle verification mechanisms. This work proposes AutoVQA-G, a self-improving agentic framework that pairs a Consistency Evaluation module, which uses chain-of-thought reasoning for fine-grained visual verification, with a memory-augmented Prompt Optimization agent. The agent analyzes critiques of failed samples and iteratively refines the generation prompts. According to the authors' experiments, the VQA-G data produced by this closed-loop pipeline achieves higher visual grounding accuracy than leading multimodal large language models, making it a promising source of high-fidelity data for training and evaluating vision-language models.
📝 Abstract
High-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, are crucial for advancing vision-language models (VLMs), but annotating them manually does not scale. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations and (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop in which a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine the generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G
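The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Sample`, `Verdict`, `PromptOptimizer`, `refinement_loop`, the box format, and the stub generator and evaluator) is hypothetical, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


@dataclass
class Sample:
    question: str
    answer: str
    box: Box  # grounding evidence for the answer


@dataclass
class Verdict:
    passed: bool
    critique: str  # free-text reasoning from the evaluator (e.g. a CoT trace)


@dataclass
class PromptOptimizer:
    """Hypothetical optimizer: remembers past critiques and folds new ones into the prompt."""
    memory: List[str] = field(default_factory=list)

    def refine(self, prompt: str, critiques: List[str]) -> str:
        new = [c for c in critiques if c not in self.memory]
        self.memory.extend(new)
        return prompt + "".join(f"\n- Avoid: {c}" for c in new)


def refinement_loop(
    image: Any,
    prompt: str,
    generate: Callable[[Any, str], List[Sample]],
    evaluate: Callable[[Any, Sample], Verdict],
    max_rounds: int = 3,
) -> Tuple[List[Sample], str]:
    """Generate, verify, and refine the prompt until every sample passes or rounds run out."""
    optimizer = PromptOptimizer()
    accepted: List[Sample] = []
    for _ in range(max_rounds):
        accepted, critiques = [], []
        for sample in generate(image, prompt):
            verdict = evaluate(image, sample)
            if verdict.passed:
                accepted.append(sample)
            else:
                critiques.append(verdict.critique)
        if not critiques:
            break  # nothing failed, so the prompt needs no further refinement
        prompt = optimizer.refine(prompt, critiques)
    return accepted, prompt


# Stub generator: emits a degenerate box until the prompt carries a critique.
def fake_generate(image: Any, prompt: str) -> List[Sample]:
    box = (0.0, 0.0, 10.0, 10.0) if "Avoid" in prompt else (0.0, 0.0, 0.0, 0.0)
    return [Sample("What is on the table?", "a cup", box)]


# Stub evaluator: a simple area check standing in for CoT-based visual verification.
def fake_evaluate(image: Any, sample: Sample) -> Verdict:
    x1, y1, x2, y2 = sample.box
    ok = x2 > x1 and y2 > y1
    return Verdict(ok, "" if ok else "empty bounding box")


accepted, final_prompt = refinement_loop(
    None, "Generate QA pairs with boxes.", fake_generate, fake_evaluate
)
```

In this toy run, the first round fails, the critique is appended to the prompt, and the second round produces a sample that passes. A real system would replace the stubs with model calls and would likely use a richer memory than a flat list of critique strings.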