DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual reasoning (VR) faces critical bottlenecks: LLMs' lack of tool awareness, scarcity of high-quality training data, error accumulation across tool invocations, and fragile fine-tuning on noisy workflows. To address these, this paper proposes a tool-aware VR paradigm, DWIM, with two components: (1) discrepancy-aware workflow generation, which explicitly models behavioral deviations of tools to improve robustness against execution errors; and (2) Instruct-Masking fine-tuning, which enables precise action cloning and low-noise parameter updates. The framework integrates workflow evaluation, feasibility filtering, and a multi-stage tool-cooperative reasoning mechanism. Evaluated on multiple mainstream VR benchmarks, the approach achieves state-of-the-art performance, with significant improvements in cross-task generalization and stable inference under noisy or erroneous workflows.

📝 Abstract
Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, it is not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency, and the difficulty of fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to clone only effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.
Problem

Research questions and friction points this paper is trying to address.

Enhancing tool awareness in visual reasoning with LLMs
Addressing performance bottlenecks from imperfect VR tools
Improving workflow generation and fine-tuning for VR tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrepancy-aware workflow generation for training
Instruct-Masking fine-tuning for effective actions
Tool-aware visual reasoning with LLMs
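The listing does not include implementation details, but Instruct-Masking fine-tuning, as described, trains the model to clone only effective actions. A common way to realize this is to exclude instruction tokens (and, here, tokens from ineffective or erroneous actions) from the training loss via a binary mask. A minimal sketch of such a masked loss, assuming per-token log-probabilities and a hypothetical 0/1 mask (not the paper's actual code):

```python
def masked_nll(token_logprobs, mask):
    """Masked negative log-likelihood for instruction-masked SFT.

    token_logprobs: log-probability the model assigns to each target token.
    mask: 1 for tokens to clone (effective actions), 0 for instruction
          tokens and noisy/ineffective actions, which are excluded
          from the loss so they contribute no gradient.
    Returns the average NLL over unmasked tokens only.
    """
    losses = [-lp for lp, m in zip(token_logprobs, mask) if m]
    if not losses:
        return 0.0  # nothing to clone in this sequence
    return sum(losses) / len(losses)


# Example: the first token is an instruction token (masked out),
# so only the two action tokens contribute to the loss.
loss = masked_nll([-1.0, -2.0, -3.0], [0, 1, 1])
```

Masking rather than deleting the noisy spans keeps the full workflow in the model's context while ensuring parameter updates come only from the actions worth imitating.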