🤖 AI Summary
This work addresses the limitations of existing reinforcement learning approaches that rely on coarse-grained binary rewards, which struggle to precisely guide domain-specific agents in external tool usage. The authors propose a three-stage post-training framework that decomposes rewards into fine-grained components across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain compliance. A novel multiplicative correctness decomposition mechanism prioritizes accurate tool selection, while stringent compliance penalties ensure adherence to regulatory requirements in high-stakes scenarios. By integrating supervised fine-tuning, group relative policy optimization, and direct preference optimization, the framework establishes an alignment-oriented reinforcement learning pipeline for tool-augmented agents. Evaluated on a real-world financial advisory system, the approach achieves a 47% improvement in task completion rate, a 63% reduction in tool invocation errors, and a 93% decrease in policy violations, while keeping response latency under two seconds.
📝 Abstract
Tool-integrated reasoning agents, which interleave natural-language deliberation with external API calls, show promise for complex multi-step tasks. However, aligning such agents for high-stakes domain-specific deployment is challenging: existing reinforcement learning approaches use coarse binary (success/failure) rewards that provide insufficient guidance for nuanced tool invocation in production. We present ToolRLA, a three-stage post-training pipeline (Supervised Fine-Tuning, Group Relative Policy Optimization, Direct Preference Optimization) for domain-specific tool-integrated agents. Its core is a fine-grained reward function with multiplicative correctness decomposition that evaluates each tool invocation across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain constraint compliance. The multiplicative composition prioritizes correct tool selection, a prerequisite for meaningful parameter evaluation, while a large negative compliance penalty (λ=10) enforces regulatory adherence. Deployed for three months on a real-world financial advisory copilot (80+ advisors, 1,200+ daily queries, 15+ heterogeneous APIs), ToolRLA raises end-to-end task completion by 47% (62% to 91%), cuts tool invocation errors by 63% (38% to 14%), reduces regulatory violations by 93% (12% to 0.8%), and maintains sub-2-second response latency. Ablation studies show that fine-grained reward decomposition contributes 7 percentage points over coarse additive rewards; generalizability is validated on ToolBench and API-Bank.
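The abstract names the four reward dimensions, the multiplicative decomposition, and the λ=10 compliance penalty, but not the exact formula. A minimal illustrative sketch in Python, assuming binary format-validity and tool-selection indicators, a parameter-accuracy score in [0, 1], and an efficiency score in [0, 1]; every function and variable name below is hypothetical, not from the paper:

```python
def tool_invocation_reward(
    format_valid: bool,      # output parses as a well-formed tool call
    tool_correct: bool,      # the right API was selected
    param_score: float,      # parameter accuracy in [0, 1]
    efficiency: float,       # invocation-efficiency score in [0, 1]
    violated_policy: bool,   # tripped a domain/regulatory constraint
    lam: float = 10.0,       # compliance penalty weight stated in the abstract
) -> float:
    # Multiplicative correctness decomposition: parameter accuracy only
    # counts when the tool selection itself is correct, so a wrong tool
    # with plausible-looking parameters still scores zero.
    correctness = float(tool_correct) * param_score
    reward = float(format_valid) * correctness * efficiency
    # Stringent compliance penalty: a violation dominates any positive
    # reward the other terms could contribute.
    if violated_policy:
        reward -= lam
    return reward
```

Under this sketch, a correct, compliant call with `param_score=0.8` and `efficiency=0.9` earns 0.72, a wrong tool selection zeroes out the correctness term entirely, and any policy violation drives the reward strongly negative regardless of the other dimensions, matching the prioritization the abstract describes.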