DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual reasoning (VR) faces critical bottlenecks: LLMs' lack of tool awareness, scarcity of high-quality training data, error accumulation across tool invocations, and fragile fine-tuning on noisy workflows. To address these, this paper proposes a tool-aware VR paradigm, DWIM, with two components: (1) discrepancy-aware workflow generation, which explicitly models behavioral deviations of tools to improve robustness against execution errors; and (2) Instruct-Masking fine-tuning, which enables precise action cloning and low-noise parameter updates. The framework integrates workflow evaluation, feasibility filtering, and a multi-stage tool-cooperative reasoning mechanism. Evaluated on multiple mainstream VR benchmarks, the approach achieves state-of-the-art performance, with significant improvements in cross-task generalization and stable inference under noisy or erroneous workflows.

📝 Abstract
Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, it is not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency, and the difficulty of fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to clone only effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.
Problem

Research questions and friction points this paper is trying to address.

Enhancing tool awareness in visual reasoning with LLMs
Addressing performance bottlenecks from imperfect VR tools
Improving workflow generation and fine-tuning for VR tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrepancy-aware workflow generation for training
Instruct-Masking fine-tuning for effective actions
Tool-aware visual reasoning with LLMs
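The listing does not include implementation details, but Instruct-Masking fine-tuning, as described, trains the model to clone only effective actions. A common way to realize this is to exclude instruction tokens (and, here, tokens from ineffective or erroneous actions) from the training loss via a binary mask. A minimal sketch of such a masked loss, assuming per-token log-probabilities and a hypothetical 0/1 mask (not the paper's actual code):

```python
def masked_nll(token_logprobs, mask):
    """Masked negative log-likelihood for instruction-masked SFT.

    token_logprobs: log-probability the model assigns to each target token.
    mask: 1 for tokens to clone (effective actions), 0 for instruction
          tokens and noisy/ineffective actions, which are excluded
          from the loss so they contribute no gradient.
    Returns the average NLL over unmasked tokens only.
    """
    losses = [-lp for lp, m in zip(token_logprobs, mask) if m]
    if not losses:
        return 0.0  # nothing to clone in this sequence
    return sum(losses) / len(losses)


# Example: the first token is an instruction token (masked out),
# so only the two action tokens contribute to the loss.
loss = masked_nll([-1.0, -2.0, -3.0], [0, 1, 1])
```

Masking rather than deleting the noisy spans keeps the full workflow in the model's context while ensuring parameter updates come only from the actions worth imitating.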