🤖 AI Summary
To address the token-level credit assignment challenge in reinforcement learning for large language models (LLMs), where sparse and delayed rewards hinder learning over long sequences, this paper proposes the Prefix-to-Tree (P2T) procedure and the TEMPO algorithm. Exploiting the fact that verifiable-reward settings allow multiple responses per prompt, it builds a prefix tree over those responses to explicitly model where they branch. At branch nodes, it applies a critic-free temporal-difference correction that gives branch-aware token-level credit without requiring a value network. Key technical components include nonparametric value estimation grounded in the prefix tree, the group-relative outcome signal inherited from GRPO, and branch-gated policy optimization. Evaluated on mathematical and medical question-answering tasks with Qwen3-1.7B/4B, the method outperforms PPO and GRPO on both in-distribution and out-of-distribution benchmarks, and reaches higher validation accuracy in roughly the same wall-clock time.
📝 Abstract
Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values V(s) by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
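The idea of tree-estimated prefix values with branch-gated TD corrections can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that V(s) is the plain mean of the rewards of all sampled responses sharing prefix s, that the TD term at token t is V(s_{t+1}) - V(s_t), and that it is simply added to a GRPO-style normalized outcome advantage with unit weight. The function names `prefix_values` and `token_advantages` and the `eps` parameter are hypothetical, and responses are assumed to end with a distinct end-of-sequence token so that no response is a strict prefix of another.

```python
from collections import defaultdict
from statistics import mean, pstdev


def prefix_values(responses, rewards):
    """Nonparametric V(s): mean reward of all responses that share prefix s.

    Prefixes are stored as tuples, which plays the role of a prefix tree
    (each key is a node; its value aggregates the descendant outcomes).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        for t in range(len(tokens) + 1):  # include the empty (root) prefix
            prefix = tuple(tokens[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}


def token_advantages(responses, rewards, eps=1e-6):
    """Per-token advantages: group-relative outcome term plus a TD term.

    ASSUMPTION: the two terms are summed with unit weight; the paper may
    scale or normalize them differently.
    """
    values = prefix_values(responses, rewards)
    mu, sigma = mean(rewards), pstdev(rewards)
    all_advantages = []
    for tokens, reward in zip(responses, rewards):
        # Sequence-level, GRPO-style signal shared by every token.
        group_adv = (reward - mu) / (sigma + eps)
        advantages = []
        for t in range(len(tokens)):
            before = values[tuple(tokens[:t])]
            after = values[tuple(tokens[:t + 1])]
            # Zero wherever the prefix has a single continuation, because
            # both prefixes are then shared by the same set of responses.
            td = after - before
            advantages.append(group_adv + td)
        all_advantages.append(advantages)
    return all_advantages
```

In this sketch the gating needs no explicit branch detection: along a non-branching stretch of the tree the two prefix values are identical, so the TD term vanishes and the advantage equals the group-relative one. For example, with four sampled responses that all begin with the same token and only split at the second token, the first token receives the group-relative advantage unchanged, while the second token is additionally credited or penalized according to how the mean outcome differs between the branches.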