🤖 AI Summary
This work addresses the challenge of realizing performance gains from test-time scaling while avoiding high-risk, irreversible actions in agentic environments. We propose a risk-aware iterative simulation framework that decouples exploration from commitment, allowing an agent to explore via LLM-simulated interactions before executing actions in the real world. By pairing targeted generation of failure-inducing interaction data with rebalanced simulator training, the approach substantially improves the simulator's fidelity on rare but high-impact failure modes. Evaluated on multi-turn, multi-step agentic benchmarks, the method consistently improves action-level reliability and robustness without incurring environmental risk, showing that risk-aware simulation is essential for realizing these gains consistently.
📝 Abstract
Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.
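The core loop described in the abstract, exploring candidate actions in a simulator and committing only to those that survive simulated rollouts, can be sketched as follows. This is a minimal illustration, not the paper's implementation; all names (`Action`, `select_action`, `toy_simulator`, `n_rollouts`) are hypothetical stand-ins, and the real system would use a trained risk-aware LLM tool simulator in place of the toy predicate.

```python
# Sketch: decoupling exploration (simulation) from commitment (execution).
# All identifiers here are illustrative assumptions, not from ARTIS itself.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Action:
    name: str
    reward: float  # estimated task progress if the action succeeds

def select_action(
    candidates: List[Action],
    simulate: Callable[[Action], bool],  # stand-in for an LLM tool simulator
    n_rollouts: int = 3,
) -> Optional[Action]:
    """Explore candidates in simulation; commit only to an action that
    never fails across the simulated rollouts."""
    safe = [
        a for a in candidates
        if all(simulate(a) for _ in range(n_rollouts))
    ]
    # Commit to the most promising action that passed simulation, if any.
    return max(safe, key=lambda a: a.reward, default=None)

# Toy simulator: flags the irreversible action as a failure mode.
def toy_simulator(action: Action) -> bool:
    return action.name != "delete_all_records"

chosen = select_action(
    [Action("delete_all_records", 0.9), Action("archive_records", 0.7)],
    toy_simulator,
)
print(chosen.name)  # archive_records
```

Note the design point the paper emphasizes: this loop only helps if the simulator reliably rejects failure-inducing actions; a naive simulator that rubber-stamps the high-reward `delete_all_records` would make the extra inference-time computation counterproductive.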