ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

📅 2025-08-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing tool-augmented VQA methods generalize poorly to realistic multimodal scenarios, particularly those requiring multi-step implicit reasoning. To address this, the authors introduce ToolVQA, a dataset of 23K instances spanning 10 multimodal tools across 7 task domains, with an average of 2.78 reasoning steps per instance and an emphasis on real-world tasks that demand tight visual-tool coordination. They also propose ToolEngine, a data generation pipeline that combines depth-first search (DFS) with dynamic in-context example matching to emulate human-like tool-use reasoning. 7B-parameter models fine-tuned on ToolVQA perform strongly on the ToolVQA test set and surpass GPT-3.5-turbo on multiple out-of-distribution benchmarks, indicating that the learned tool-calling ability generalizes to real-world multimodal settings.

📝 Abstract

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. 7B LFMs fine-tuned on ToolVQA not only achieve impressive performance on our test set but also surpass the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.
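
The abstract does not spell out how ToolEngine's DFS operates, so the following is only a minimal sketch of one plausible reading: a depth-first enumeration of tool chains over a graph of compatible tools. The graph `TOOL_GRAPH`, the tool names, and the function `dfs_tool_chains` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical tool graph: an edge A -> B means B can consume A's output.
# The tool names are illustrative and are NOT the ten tools used in ToolVQA.
TOOL_GRAPH: Dict[str, List[str]] = {
    "image_caption": ["web_search", "calculator"],
    "ocr": ["web_search", "calculator"],
    "web_search": ["calculator"],
    "calculator": [],
}


def dfs_tool_chains(graph: Dict[str, List[str]], start: str, max_depth: int) -> List[List[str]]:
    """Enumerate every tool chain that begins at `start` and uses at most `max_depth` tools."""
    chains: List[List[str]] = []

    def visit(path: List[str]) -> None:
        chains.append(list(path))      # record the current chain (copy, since path mutates)
        if len(path) == max_depth:     # depth limit reached: stop extending this branch
            return
        for nxt in graph[path[-1]]:
            if nxt not in path:        # do not reuse a tool within one chain
                path.append(nxt)
                visit(path)            # go deeper before trying sibling tools
                path.pop()             # backtrack

    visit([start])
    return chains


chains = dfs_tool_chains(TOOL_GRAPH, "ocr", max_depth=3)
for chain in chains:
    print(" -> ".join(chain))
```

In a real pipeline each sampled chain would presumably be executed on an image and turned into a question-answer pair; that step depends on details the abstract does not give and is left out here.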

Problem

Research questions and friction points this paper is trying to address.

Enhancing tool-augmented VQA for real-world multimodal settings

Addressing gaps in multi-step reasoning with diverse tools

Improving generalizability of LFMs in tool-use scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates external tools with LFMs

Uses DFS with dynamic in-context example matching to simulate human-like tool-use reasoning

Features diverse multimodal real-world tasks
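
The page does not describe how the in-context example matching works. As an illustration only, the sketch below picks, for a partially built tool chain, the stored example whose tool set overlaps most with it. Jaccard similarity is an assumption chosen here for simplicity, not the paper's stated mechanism, and `jaccard`, `match_example`, and the example pool are hypothetical.

```python
from typing import Dict, List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity between two tool sequences, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def match_example(partial_chain: Sequence[str], pool: List[Dict]) -> Dict:
    """Return the pooled example whose tools best overlap the partial chain."""
    return max(pool, key=lambda ex: jaccard(partial_chain, ex["tools"]))


# Hypothetical example pool; questions and tool names are made up for illustration.
EXAMPLE_POOL = [
    {"tools": ["ocr", "calculator"], "question": "What is the total on this receipt?"},
    {"tools": ["image_caption", "web_search"], "question": "Who designed this building?"},
]

best = match_example(["ocr"], EXAMPLE_POOL)
print(best["question"])
```

Re-running the match after each new tool is appended would make the choice of example depend on the current state of the chain, which is one way to read "dynamic" in the description.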
