AI Summary
This work addresses the lack of fine-grained provenance tracing in existing multimodal tool-using agents: because claims in a generated answer are not linked to the tool observations that support them, the agent's reasoning cannot be verified. To resolve this, the authors propose TRACER, a framework that generates each response together with sentence-level structured provenance records. For each statement, a record explicitly annotates the corresponding tool invocation round, the evidence unit, and the semantic relation (quotation, compression, or inference). TRACER introduces a multi-dimensional verification mechanism to ensure provenance reliability and, for the first time, enables verifiable generation with traceable provenance in multimodal tool use. The framework encodes verified provenance as traceability constraints and localized rewards within a reinforcement learning paradigm, and the authors also introduce TRACE-Bench, a new evaluation benchmark. Experiments show that TRACER achieves 78.23% answer accuracy and 95.72% summary accuracy on this benchmark, outperforming the strongest closed-source baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls by roughly 30% (from 4949 to 3486).
Abstract
Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
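To make the idea of a sentence-level provenance record more concrete, the following is a minimal Python sketch, not the authors' implementation. The names `Relation`, `ProvenanceRecord`, and `verify` are hypothetical, observations are assumed to be plain strings indexed by tool turn, and the four checks are simple stand-ins for the schema, tool-turn alignment, source authenticity, and relation rationality checks named in the abstract. In particular, relation rationality is only approximated for Quotation; judging Compression and Inference would likely need a model-based check that this sketch omits.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    QUOTATION = "quotation"      # direct reuse of the evidence
    COMPRESSION = "compression"  # faithful condensation of the evidence
    INFERENCE = "inference"      # grounded derivation from the evidence


@dataclass(frozen=True)
class ProvenanceRecord:
    sentence: str       # one sentence of the generated answer
    tool_turn: int      # index of the supporting tool call (assumed 0-based)
    evidence: str       # evidence unit claimed to come from that turn
    relation: Relation  # how the sentence relates to the evidence


def verify(record: ProvenanceRecord, observations: list[str]) -> list[str]:
    """Return the names of failed checks; an empty list means the record passes."""
    failures = []

    # Schema check: required fields are present and well typed.
    if (
        not isinstance(record.relation, Relation)
        or not record.sentence.strip()
        or not record.evidence.strip()
    ):
        failures.append("schema")

    # Tool-turn alignment: the cited turn must exist in the trajectory.
    if not isinstance(record.tool_turn, int) or not (
        0 <= record.tool_turn < len(observations)
    ):
        failures.append("tool_turn_alignment")
        return failures  # later checks need a valid turn

    # Source authenticity: the evidence must appear in the cited observation.
    if record.evidence not in observations[record.tool_turn]:
        failures.append("source_authenticity")

    # Relation rationality (toy heuristic): a quotation must reuse the
    # evidence verbatim inside the sentence.
    if record.relation is Relation.QUOTATION and record.evidence not in record.sentence:
        failures.append("relation_rationality")

    return failures
```

Under these assumptions, a record that cites a nonexistent turn fails alignment, a record whose evidence string is absent from the cited observation fails authenticity, and a paraphrase labeled as Quotation fails the rationality heuristic. A training loop could then assign local credit only to sentences whose records return an empty failure list, although how TRACER actually shapes that reward is not specified in the abstract.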