What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current agent evaluation benchmarks struggle to distinguish whether high performance stems from genuine semantic tool-use capability or from memorization of specific interface layouts, so they fail to reliably reflect true generalization. To address this, the work proposes PIPE, an evaluation protocol that subtly perturbs environment interfaces while preserving task semantics and executable behavior, and introduces an Interface Reliance (IR) metric that quantifies an agent's dependence on training-time interface specifics. Through interface rewriting, adversarial evaluation, and alias-balancing experiments across 16 AgentBench and AgentGym environments, the authors find that trajectory-supervised fine-tuning (trajectory-SFT) significantly exacerbates shortcut learning via interface cues: trained agents degrade sharply under interface perturbations while untrained models remain stable, exposing a critical blind spot in current evaluation paradigms.
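The summary describes perturbing interfaces "while preserving task semantics and executable behaviors." As a rough illustration of what such a semantics-preserving rewrite might look like (the schema format, function name, and rename map below are illustrative assumptions, not the paper's implementation):

```python
import copy

def rewrite_interface(tool_spec, rename):
    """Minimally perturb a tool schema: rename the tool and its parameter
    names while leaving the underlying executable behavior untouched.
    A semantically capable agent should be unaffected; an agent that
    memorized surface names should degrade."""
    spec = copy.deepcopy(tool_spec)
    spec["name"] = rename.get(spec["name"], spec["name"])
    spec["parameters"] = {rename.get(k, k): v
                          for k, v in spec["parameters"].items()}
    return spec

original = {"name": "search_web", "parameters": {"query": "str"}}
perturbed = rewrite_interface(original,
                              {"search_web": "lookup", "query": "q"})
print(perturbed["name"])  # prints lookup; behavior behind the name is unchanged
```

The original spec is left intact, so original and perturbed interfaces can be evaluated side by side on the same tasks.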

📝 Abstract
Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool-use and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We propose PIPE, a protocol-level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias-based metric that quantifies preference for training-time interfaces, and show that interface shortcutting exhibits environment-dependent, non-monotonic training dynamics that remain invisible under standard evaluation. Our code is available at https://anonymous.4open.science/r/What-Do-Agents-Learn-from-Trajectory-SFT-Semantics-or-Interfaces--0831/.
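The abstract characterizes Interface Reliance (IR) as a "counterbalanced alias-based metric that quantifies preference for training-time interfaces" but gives no formula here. A minimal sketch of how such a score could be computed, assuming trials that offer the agent a training-time name versus a fresh alias (the trial format, scaling, and function name are all hypothetical, not the authors' definition):

```python
def interface_reliance(trials):
    """Toy alias-preference score. Each trial records whether the agent,
    given two semantically identical options (counterbalanced for
    position), chose the training-time interface name. Chance-level
    preference (0.5) maps to 0; always picking the training-time name
    maps to 1."""
    if not trials:
        return 0.0
    pref = sum(t["chose_training_name"] for t in trials) / len(trials)
    return max(0.0, 2.0 * (pref - 0.5))

trials = [{"chose_training_name": c} for c in (True, True, True, False)]
print(interface_reliance(trials))  # prints 0.5 (preference 0.75 rescaled)
```

Counterbalancing which name appears first cancels positional bias, so any residual preference can be attributed to the interface name itself.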
Problem

Research questions and friction points this paper is trying to address:

trajectory-SFT, interface reliance, semantic tool-use, agent evaluation, environment invariance
Innovation

Methods, ideas, or system contributions that make the work stand out:

trajectory-SFT, interface reliance, semantic tool-use, evaluation protocol, agent generalization
Authors

Weizheng Gu
National Engineering Research Center for Software Engineering, Peking University, Beijing, China

Chengze Li
Nanjing University, Nanjing, China

Zhuohao Yu
Peking University (Natural Language Processing, Software Engineering)

Mengyuan Sun
National Engineering Research Center for Software Engineering, Peking University, Beijing, China

Zhibang Yang
National Engineering Research Center for Software Engineering, Peking University, Beijing, China

Wei Wang
Tongji University (Image Processing)

Hongrui Jia
National Engineering Research Center for Software Engineering, Peking University, Beijing, China

Shikun Zhang
Peking University

Wei Ye
Peking University (Software Engineering, Natural Language Processing)