🤖 AI Summary
This work addresses the limited tool diversity and overreliance on high-frequency tools in large language model (LLM) tool use. To tackle this, we propose a diversity-driven exploration method based on offline reinforcement learning. Our approach introduces a rarity-prioritized dual-objective reward function that jointly optimizes answer correctness and tool-selection entropy; employs GPT-4o to score candidate actions, covering both tool invocations and chain-of-thought reasoning; and adopts a step-level offline PPO framework to fine-tune Llama-3.1-8B on synthetic trajectories from MMLU-Pro. Experiments show competitive performance across the 14 MMLU-Pro categories, with significantly higher tool-selection entropy than the baseline and supervised fine-tuning approaches. The results suggest that explicitly encouraging tool diversity during exploration can improve complex reasoning without sacrificing accuracy.
📝 Abstract
We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a reinforcement learning framework that teaches large language models to explore diverse tool usage patterns beyond conventional high-temperature sampling. Building on recent advances in step-wise reinforcement learning, we introduce a dual-objective reward system that simultaneously optimizes for answer quality and tool diversity, and we train a Llama-3.1 8B model with offline PPO on synthetically generated trajectories from the MMLU-Pro dataset. Our approach employs a rarity-first exploitation strategy: a GPT-4o judge scores candidate actions across eight distinct tools plus chain-of-thought reasoning, and the policy favors less frequently used but still viable tools to encourage systematic exploration. Empirical results show that SPaRK achieves competitive performance across 14 MMLU-Pro categories while exhibiting significantly higher entropy in tool selection than both baseline and supervised fine-tuning approaches, suggesting that algorithmic exploration through explicit tool diversity can enhance reasoning capabilities without sacrificing accuracy.
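The rarity-first idea and the entropy measure mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the tool names, the `viability_threshold` value, the fallback rule, and the tie-breaking rule are all assumptions made for illustration, since the abstract specifies neither the eight tools nor how "still viable" is decided.

```python
import math
from collections import Counter


def tool_entropy(tool_calls):
    """Shannon entropy (in bits) of the empirical tool-usage distribution."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rarity_first_select(judge_scores, usage_counts, viability_threshold=0.7):
    """Pick the least-used action whose judge score clears the threshold.

    judge_scores: dict mapping action name -> judge score in [0, 1]
                  (hypothetical scale; the abstract does not give one).
    usage_counts: dict mapping action name -> how often it was chosen so far.
    Falls back to the highest-scoring action when nothing is viable
    (an assumed fallback, not taken from the source).
    """
    viable = [a for a, s in judge_scores.items() if s >= viability_threshold]
    if not viable:
        return max(judge_scores, key=judge_scores.get)
    # Fewest prior uses wins; ties go to the higher judge score.
    return min(viable, key=lambda a: (usage_counts.get(a, 0), -judge_scores[a]))


# Hypothetical judge scores and usage history for one reasoning step.
scores = {"calculator": 0.9, "web_search": 0.8, "python": 0.75, "cot": 0.5}
usage = Counter({"calculator": 40, "web_search": 12, "python": 3})

choice = rarity_first_select(scores, usage)  # -> "python"
```

In this sketch the highest-scoring action (`calculator`) is passed over in favor of a rarer action that still clears the threshold, which is the behavior that would push the tool-usage distribution toward higher entropy over many steps.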