AI Summary
This work addresses two limitations of large language model agents in multi-tool reasoning: myopic decision-making and unstable reasoning trajectories, which let errors accumulate and make it hard to achieve both global effectiveness and computational efficiency. To overcome these challenges, the authors propose MAXS, a meta-adaptive exploration framework that integrates forward-looking planning, path stability assessment, and an adaptive termination mechanism. MAXS selects high-value, stable reasoning paths by estimating the advantage of tool usage and evaluating step-wise consistency variance alongside cross-step trend slopes, and it uses trajectory convergence control to bound computational overhead. Experiments across three base models and five datasets show that MAXS consistently outperforms existing methods in both reasoning performance and inference efficiency, supporting the value of its lookahead strategy and adaptive mechanisms.
Abstract
Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose meta-adaptive exploration with LLM agents (MAXS; code at https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy that extends reasoning paths a few steps ahead to estimate the advantage value of tool usage, and it combines step consistency variance with inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
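To make the selection and stopping ideas concrete, here is a minimal Python sketch. It is not the MAXS implementation: the abstract does not give the formulas, so the function names (`trend_slope`, `score_candidate`, `select_step`, `has_converged`), the weighted-sum scoring rule, the sign of each term, and the variance threshold are all assumptions chosen for illustration. Each candidate step is represented by a list of hypothetical value estimates collected along a short lookahead rollout.

```python
from statistics import fmean, pvariance


def trend_slope(values):
    """Least-squares slope of values against step indices 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = fmean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def score_candidate(lookahead_values, baseline, w_var=1.0, w_slope=1.0):
    """Combine advantage, consistency variance, and trend slope.

    ASSUMPTION: the weighted sum and its signs are illustrative only;
    the abstract does not specify how the three signals are combined.
    """
    advantage = lookahead_values[-1] - baseline   # gain over the current estimate
    variance = pvariance(lookahead_values)        # instability along the rollout
    slope = trend_slope(lookahead_values)         # direction of the value trend
    return advantage - w_var * variance + w_slope * slope


def select_step(candidates, baseline, **weights):
    """Return the name of the candidate step with the highest score."""
    return max(
        candidates,
        key=lambda name: score_candidate(candidates[name], baseline, **weights),
    )


def has_converged(scores, tol=1e-3):
    """ASSUMPTION: treat rollouts as consistent when the candidate scores
    barely differ, so further lookahead can be skipped."""
    return len(scores) > 1 and pvariance(scores) < tol


# Hypothetical lookahead value estimates for two candidate steps.
candidates = {
    "call_tool": [0.5, 0.6, 0.7],   # steady improvement
    "answer_now": [0.9, 0.2, 0.6],  # erratic estimates
}
best = select_step(candidates, baseline=0.5)
```

With these made-up numbers the steadily improving candidate is preferred over the erratic one, which is the qualitative behavior the abstract describes: high advantage alone is not enough if the path is unstable.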