Learning CLI Agents with Structured Action Credit under Selective Observation

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges for command-line interface (CLI) agents in partially observable environments: task-relevant evidence is hard to locate, and sparse terminal rewards hinder credit assignment across long action trajectories. The authors propose σ-Reveal, a selective context mechanism, together with Action Advantage Assignment (A³), which uses abstract syntax tree (AST) based action sub-chain residuals and tree-level trajectory margins for fine-grained, turn-level credit assignment. The approach is trained with reinforcement learning from terminal interaction feedback and evaluated on the newly constructed ShellOps dataset suite, where it is reported to improve both task performance and sample efficiency on complex, multi-turn code-related tasks.
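To make the idea of selecting context under a token budget concrete, here is a minimal sketch of a greedy selector. It is not the σ-Reveal algorithm, which this page does not specify. The `Snippet` type, the `score` field, the whitespace-based `count_tokens`, and the greedy rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    path: str      # hypothetical: file the snippet came from
    text: str      # snippet content shown to the agent
    score: float   # hypothetical relevance score (higher is more relevant)


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def select_context(snippets: list[Snippet], budget: int) -> tuple[list[Snippet], int]:
    """Greedily keep the highest-scoring snippets that still fit the budget."""
    chosen: list[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.score, reverse=True):
        cost = count_tokens(snippet.text)
        if used + cost <= budget:  # skip anything that would overflow
            chosen.append(snippet)
            used += cost
    return chosen, used


candidates = [
    Snippet("a.py", "def load(path): return path", 0.9),       # 4 tokens
    Snippet("b.py", "import os import sys import re", 0.5),    # 6 tokens
    Snippet("c.py", "x = 1", 0.2),                             # 3 tokens
]
chosen, used = select_context(candidates, budget=8)
```

With a budget of 8 tokens, the selector keeps `a.py`, skips `b.py` because it would overflow, and then still has room for `c.py`. A real system would likely use the model's own tokenizer and a learned or heuristic relevance score.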

📝 Abstract

Command-line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command-line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $\sigma$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.
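As a rough illustration of the first ingredient named above, episode-level relative feedback, the sketch below normalizes terminal rewards within a group of rollouts for the same task and copies each episode's value to all of its turns. The group mean and standard deviation normalization is an assumption, not a formula taken from the paper, and the AST-based sub-chain residuals and tree-level trajectory margins that $\mathrm{A}^3$ adds per turn are not modeled. All function names are hypothetical.

```python
from statistics import mean, pstdev


def episode_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each episode relative to the other rollouts of the same task.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(episode_advs: list[float], turns_per_episode: list[int]) -> list[list[float]]:
    """Give every turn of an episode that episode's advantage (no per-turn terms)."""
    return [[adv] * n for adv, n in zip(episode_advs, turns_per_episode)]


# Four rollouts of one task with sparse terminal rewards (1 = verified success).
rewards = [1.0, 0.0, 0.0, 1.0]
advs = episode_relative_advantages(rewards)
turn_advs = broadcast_to_turns(advs, turns_per_episode=[3, 5, 2, 4])
```

Under this baseline every turn in an episode receives the same credit, which is the coarse signal that turn-level terms are meant to refine.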

Problem

Research questions and friction points this paper is trying to address.

CLI agents

selective observation

credit assignment

reinforcement learning

sparse rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Action Credit

Selective Observation

Action Advantage Assignment