Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the token-level credit assignment challenge in reinforcement learning for large language models (LLMs), where sparse and delayed rewards hinder learning over long sequences, this paper proposes the Prefix-to-Tree framework and the TEMPO algorithm. Leveraging multi-response structures in verifiable-reward settings, it constructs a prefix tree to explicitly model response branching paths. At branch nodes, it introduces a critic-free temporal-difference correction mechanism that enables branch-aware, precise credit assignment without requiring a value network. Key technical innovations include nonparametric value estimation grounded in the prefix tree, group-relative reward normalization, and branch-gated policy optimization. Evaluated on mathematical and medical question-answering tasks, the method substantially outperforms PPO and GRPO across both in-distribution and out-of-distribution benchmarks in terms of answer accuracy, while maintaining comparable training efficiency.

📝 Abstract

Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values V(s) by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
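The P2T step described in the abstract can be illustrated with a small sketch. It assumes that "aggregating descendant outcomes" means taking the mean final reward of every sampled response that shares a given prefix; the function name `prefix_values`, the toy tokens, and the rewards are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def prefix_values(responses, rewards):
    """Map each prefix (a tuple of tokens) to the mean reward of the
    responses that pass through it. Assumption: V(s) is a plain mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        # Every prefix of a response, including the empty root prefix,
        # receives that response's final outcome.
        for t in range(len(tokens) + 1):
            prefix = tuple(tokens[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}

# Toy group of four responses to one prompt (hypothetical tokens and rewards).
responses = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e", "f"],
    ["a", "e", "f"],
]
rewards = [1.0, 0.0, 1.0, 1.0]

values = prefix_values(responses, rewards)
print(values[("a",)])      # shared by all four responses
print(values[("a", "b")])  # shared by the first two responses
print(values[("a", "e")])  # shared by the last two responses
```

The dictionary keyed by prefix tuples stands in for an explicit tree: each key is a node, and a node is a branch point when more than one distinct token follows it in the group.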

Problem

Research questions and friction points this paper is trying to address.

Sparse delayed rewards hinder token-level credit assignment in RL for LLMs

Existing methods struggle with precise credit assignment without complex value models

Current approaches ignore tree structure when distributing rewards across tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts responses into prefix tree structure

Computes nonparametric prefix values from descendants

Uses branch-gated temporal-difference corrections
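The three ideas listed above can be combined in a rough sketch. It assumes the token-level signal is the GRPO-style group-normalized outcome plus the one-step difference V(s_{t+1}) - V(s_t) of tree-estimated prefix values; the exact weighting and normalization used by TEMPO are not stated in this summary, so the additive form, the helper names, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def prefix_values(responses, rewards):
    """Mean final reward of all responses sharing each prefix (assumed V(s))."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        for t in range(len(tokens) + 1):
            prefix = tuple(tokens[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}

def td_corrections(tokens, values):
    """One-step value differences along a response. With mean-based values,
    the difference is zero wherever the group does not branch."""
    return [
        values[tuple(tokens[: t + 1])] - values[tuple(tokens[:t])]
        for t in range(len(tokens))
    ]

def token_advantages(responses, rewards, eps=1e-6):
    """Assumed combination: group-relative outcome + TD correction per token."""
    values = prefix_values(responses, rewards)
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = []
    for tokens, reward in zip(responses, rewards):
        outcome = (reward - mu) / (sigma + eps)  # GRPO-style sequence signal
        advantages.append([outcome + td for td in td_corrections(tokens, values)])
    return advantages

# Toy group (hypothetical): the group branches after "a" and after ("a", "b").
responses = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"]]
rewards = [1.0, 0.0, 1.0]

values = prefix_values(responses, rewards)
for tokens in responses:
    print(tokens, td_corrections(tokens, values))
print(token_advantages(responses, rewards))
```

In this toy group the token "a" is shared by every response and "f" follows a prefix with no sibling, so their corrections vanish and only the group-relative outcome remains there, which matches the abstract's statement that the method reduces to GRPO at non-branch tokens.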
