CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

📅 2026-02-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenges of applying reinforcement learning (RL) to multi-turn, multi-step agentic tool-use tasks, where sparse verifiable rewards, open-ended action spaces, and the high cost of building real environments hinder effective training. To overcome these limitations, the authors propose the CM2 framework, which introduces a checklist-based reward mechanism that reformulates open-ended behavior evaluation into fine-grained binary judgments grounded in structured metadata. This approach enables stable RL training without requiring verifiable final outcomes. By integrating a large language model–simulated environment with a hybrid strategy combining reinforcement learning and supervised fine-tuning, CM2 achieves gains of 8, 10, and 12 points on τ²-Bench, BFCL-V4, and ToolSandbox, respectively, matching or surpassing state-of-the-art open-source baselines of comparable scale.

๐Ÿ“ Abstract
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B base model and training on an 8k-example RL dataset, CM2 improves over its SFT counterpart by 8 points on τ²-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
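As a rough illustration of the "sparse reward, dense criteria" idea, the sketch below scores each turn against a checklist of binary judgments but collapses them into a single scalar per episode. All names here (`Criterion`, `turn_score`, `episode_reward`) are hypothetical, not the paper's actual API; in CM2 the per-criterion judgments would come from an LLM judge grounded in transcript evidence.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One fine-grained binary check on a turn's intended behavior."""
    description: str   # e.g. "agent asked for the order ID before calling the lookup tool"
    satisfied: bool    # binary judgment produced by an LLM judge with evidence grounding

def turn_score(criteria: list[Criterion]) -> float:
    """Dense evaluation: fraction of checklist criteria met in one turn."""
    if not criteria:
        return 1.0
    return sum(c.satisfied for c in criteria) / len(criteria)

def episode_reward(turns: list[list[Criterion]]) -> float:
    """Sparse assignment: one scalar reward per episode, averaging per-turn checklists."""
    return sum(turn_score(t) for t in turns) / len(turns)

# Example: two turns, three criteria total.
turns = [
    [Criterion("asked for order ID", True),
     Criterion("confirmed before issuing refund", False)],
    [Criterion("called the lookup tool with correct arguments", True)],
]
print(episode_reward(turns))  # 0.75
```

The point of the classification-style decomposition is that each binary judgment is easier for a judge model to make reliably than a single open-ended quality score, while averaging many such judgments still yields an informative training signal.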
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
multi-turn interaction
multi-step tool use
reward sparsity
agent training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Checklist Rewards
Reinforcement Learning
Multi-turn Tool Use
LLM-simulated Environment
Agentic Reasoning