Sample, Predict, then Proceed: Self-Verification Sampling for Tool Use of LLMs

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based tool-use frameworks struggle with reliable function calling in stateful environments, because current test-time compute strategies depend on iterative environment interaction, which is impractical in real-world deployments. To address this, the paper proposes DyMo (dynamics modelling) and SVS (self-verification sampling), a complementary pair of methods that enable reliable tool planning without runtime environment feedback. DyMo adds lightweight internal state modelling during post-training, letting the model predict the outcomes of candidate actions; SVS uses this internal model to estimate the reliability of function-call candidates at generation time and to proactively refuse unreliable ones. Evaluated on the Berkeley Function Calling Leaderboard V2, the approach improves success rates and pass^k over the number of trials k while substantially reducing hallucination rates, demonstrating stronger output reliability and robustness.

📝 Abstract
Tool use in stateful environments presents unique challenges for large language models (LLMs), where existing test-time compute strategies relying on repeated trials in the environment are impractical. We propose dynamics modelling (DyMo), a method that augments LLMs with a state prediction capability alongside function calling during post-training. This enables LLMs to predict the future states of their actions through an internal environment model. On the Berkeley Function Calling Leaderboard V2, DyMo improves success rates and significantly reduces hallucinations. We further integrate the internal environment model into self-verification sampling (SVS), and show that this substantially improves pass^k over number of trials k, and allows the model to refuse unreliable outputs. Together, DyMo and SVS greatly enhance the effectiveness and reliability of LLMs for tool use. We believe this work charts a path towards scalable planning RL methods for LLM inference without repeatedly querying the oracle environment.
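The pipeline the abstract describes, sample candidate function calls, predict each call's resulting state with the internal environment model, and refuse when no candidate looks reliable, can be sketched as follows. This is a minimal illustration only: every function name and the confidence heuristic are hypothetical stubs, not the paper's implementation.

```python
def sample_candidates(prompt, k):
    """Stub: sample k candidate function calls from the LLM."""
    return [f"call_{i}({prompt!r})" for i in range(k)]

def predict_next_state(state, call):
    """Stub: the internal environment model (DyMo) predicts the post-call state."""
    return state + (call,)

def verify(predicted_state, goal):
    """Stub: confidence that the predicted state satisfies the goal."""
    return 1.0 if goal in str(predicted_state) else 0.0

def self_verification_sampling(prompt, state, goal, k=4, threshold=0.5):
    """Keep the candidate whose predicted outcome the model trusts most;
    return None (a proactive refusal) if no candidate clears the threshold."""
    best_call, best_conf = None, threshold
    for call in sample_candidates(prompt, k):
        conf = verify(predict_next_state(state, call), goal)
        if conf >= best_conf:
            best_call, best_conf = call, conf
    return best_call

print(self_verification_sampling("book_flight", (), "call_0"))
```

The key design point is that verification runs against the model's own state predictions, so no candidate ever has to be executed against the real environment before being accepted or refused.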
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' tool use in stateful environments
Reducing hallucinations in function calling tasks
Improving reliability via self-verification sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments LLMs with state prediction capability
Integrates internal environment model into self-verification
Reduces hallucinations and improves success rates