SALT: Step-level Advantage Assignment for Long-horizon Agents via Trajectory Graph

๐Ÿ“… 2025-10-22
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Sparse rewards in long-horizon, multi-step tasks hinder stable training of language agents. Method: The authors propose SALT, a fine-grained advantage-assignment framework that builds a directed graph (termed a *trajectory graph*) from trajectories sampled for the same prompt and propagates the final outcome reward backward through it to quantify the relative quality of each action step, enabling unsupervised, step-level advantage estimation with negligible overhead. Contribution/Results: SALT requires no critic network, external annotations, or model modifications, and integrates seamlessly into group-based RL algorithms (e.g., GRPO). Evaluated on the WebShop, ALFWorld, and AppWorld benchmarks, it consistently improves policy performance across LLMs of various scales while enhancing training stability and cross-task generalization.

๐Ÿ“ Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities, enabling language agents to excel at single-turn tasks. However, their application to complex, multi-step, and long-horizon tasks remains challenging. While reinforcement learning (RL) offers a promising avenue for addressing these challenges, mainstream approaches typically rely solely on sparse, outcome-based rewards, a limitation that becomes especially problematic for group-based RL algorithms lacking critic models, such as Group Relative Policy Optimization (GRPO). In such methods, uniformly rewarding or penalizing all actions within a trajectory can lead to training instability and suboptimal policies, because beneficial and detrimental actions are often entangled across multi-step interactions. To address this challenge, we propose SALT, a novel and lightweight framework that provides a finer-grained advantage assignment, derived solely from outcome rewards. We achieve this by constructing a graph from trajectories of the same prompt, which allows us to quantify the quality of each step and assign advantages accordingly. Crucially, SALT is designed as a plug-and-play module that seamlessly integrates with existing group-based RL algorithms, requiring no modifications to the rollout procedure and introducing negligible computational overhead. Extensive experiments on the WebShop, ALFWorld, and AppWorld benchmarks with various model sizes demonstrate that SALT consistently improves performance. We also conduct a thorough analysis to validate the design choices behind SALT and offer actionable insights.
Problem

Research questions and friction points this paper is trying to address.

Assigning step-level advantages in long-horizon tasks using trajectory graphs
Addressing sparse reward limitations in group-based reinforcement learning algorithms
Improving training stability and policy optimization for multi-step interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Assigns step-level advantages using trajectory graphs
Lightweight plug-and-play module for group-based RL
Quantifies step quality solely from outcome rewards
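The paper does not publish its exact propagation rule here, but the idea can be sketched as follows: merge same-prompt rollouts into a graph whose nodes are shared action prefixes, score each node by the mean outcome reward of trajectories passing through it, and credit each step with the value change it induces. The node-value and advantage formulas below are illustrative assumptions, not SALT's published equations.

```python
from collections import defaultdict

def step_advantages(trajectories, rewards):
    """Illustrative graph-based step-level advantage assignment.

    trajectories: list of action sequences (tuples of hashable actions)
    rewards: final outcome reward of each trajectory.
    Assumption: a node (shared action prefix) is valued by the mean
    outcome reward over all trajectories passing through it, and a
    step's advantage is the value gain of the node it reaches over
    its parent. This stands in for SALT's actual propagation rule.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for traj, r in zip(trajectories, rewards):
        for t in range(len(traj) + 1):
            node = traj[:t]          # prefix identifies a graph node
            totals[node] += r
            counts[node] += 1
    value = {n: totals[n] / counts[n] for n in totals}
    return [[value[traj[:t + 1]] - value[traj[:t]]
             for t in range(len(traj))]
            for traj in trajectories]

# Two WebShop-style rollouts that share a first action but diverge:
# only the divergent step receives nonzero credit or blame.
advs = step_advantages(
    [("search", "click_A", "buy"), ("search", "click_B", "buy")],
    [1.0, 0.0],
)
```

Because every rollout shares the root, shared steps get zero advantage while the step where the successful rollout diverged receives positive credit (+0.5) and its failing counterpart negative credit (-0.5), which is the finer-grained signal a plain GRPO group baseline cannot provide.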
๐Ÿ”Ž Similar Papers
No similar papers found.