Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs

📅 2025-07-15

🤖 AI Summary
This work addresses the limited tool diversity of large language models (LLMs), which tend to over-rely on a few high-frequency tools. It proposes SPaRK, a step-wise offline reinforcement learning method that drives exploration through explicit tool diversity. The approach uses a dual-objective reward that jointly optimizes answer quality and tool diversity; has a GPT-4o judge score candidate actions across eight tools plus chain-of-thought reasoning, with the policy favoring less-frequently used but still viable tools; and fine-tunes Llama-3.1-8B with offline PPO on synthetic trajectories generated from MMLU-Pro. SPaRK achieves competitive performance across 14 MMLU-Pro categories while showing significantly higher tool-selection entropy than both the baseline and supervised fine-tuning, which suggests that explicit tool-diversity exploration can enhance reasoning without sacrificing accuracy.

📝 Abstract
We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a novel reinforcement learning framework that teaches large language models to explore diverse tool usage patterns beyond conventional high-temperature sampling. Building on recent advances in step-wise reinforcement learning, we introduce a dual-objective reward system that simultaneously optimizes for answer quality and tool diversity, training a Llama-3.1 8B model through offline PPO on synthetically generated trajectories from the MMLU-Pro dataset. Our approach uniquely employs a rarity-first exploitation strategy where a GPT-4o judge scores candidate actions across eight distinct tools plus chain-of-thought reasoning, with the policy favoring less-frequently used but still viable tools to encourage systematic exploration. Empirical results demonstrate that SPaRK achieves competitive performance across 14 MMLU-Pro categories while exhibiting significantly higher entropy in tool selection compared to both baseline and supervised fine-tuning approaches, suggesting that algorithmic exploration through explicit tool diversity can enhance reasoning capabilities without sacrificing accuracy.
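The rarity-first selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the score scale of 0 to 1, the viability threshold of 0.7, and the tie-breaking rule are all assumptions made for illustration. It only shows the idea of choosing the least-used action among those a judge rates as viable.

```python
from collections import Counter

def select_rarity_first(judge_scores, usage_counts, viability_threshold=0.7):
    """Pick the least-used action among those the judge rates as viable.

    judge_scores: dict mapping action name -> judge score (assumed in [0, 1]).
    usage_counts: Counter of how often each action has been chosen so far.
    viability_threshold: assumed cutoff for "still viable"; not from the paper.
    """
    viable = [a for a, s in judge_scores.items() if s >= viability_threshold]
    if not viable:
        # Nothing clears the cutoff: fall back to the best-scored action.
        return max(judge_scores, key=judge_scores.get)
    # Rarest action first; ties are broken by the higher judge score.
    return min(viable, key=lambda a: (usage_counts[a], -judge_scores[a]))

# Hypothetical tool names and scores, for illustration only.
scores = {"tool_1": 0.9, "tool_2": 0.75, "tool_3": 0.2, "chain_of_thought": 0.95}
counts = Counter({"tool_1": 10, "tool_2": 1, "chain_of_thought": 50})
choice = select_rarity_first(scores, counts)
```

In this example `tool_3` has never been used but is excluded because its score falls below the assumed threshold, so the rarely used yet viable `tool_2` is chosen over the higher-scored but frequently used alternatives.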
Problem

Research questions and friction points this paper is trying to address.

Enhance diverse tool usage in LLMs beyond conventional methods
Optimize answer quality and tool diversity simultaneously
Encourage systematic exploration of rare but viable tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-objective reward for quality and diversity
Rarity-first exploitation strategy for tools
Offline PPO training on synthetic trajectories
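
The diversity these contributions target is reported in the abstract as the entropy of tool selection. As a rough sketch of how such a figure could be computed, the following uses the standard Shannon entropy over observed tool choices. The function name, the use of base-2 logarithms, and the example tool names are assumptions; the paper's exact measurement protocol is not given here.

```python
import math
from collections import Counter

def tool_selection_entropy(selections):
    """Shannon entropy (in bits) of the empirical tool-choice distribution.

    selections: iterable of tool names, one per step or per question.
    Returns 0.0 for an empty input or when a single tool is always chosen.
    """
    counts = Counter(selections)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log2(p)
    return entropy

# Hypothetical selections: a policy that always picks one tool versus
# one that spreads its choices evenly over four tools.
narrow = tool_selection_entropy(["search"] * 8)
broad = tool_selection_entropy(["search", "calculator", "python", "wiki"] * 2)
```

A policy that always calls the same tool scores zero, while an even spread over four tools scores two bits, so a higher value indicates more varied tool use.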