🤖 AI Summary
Current agent evaluation benchmarks struggle to distinguish whether high performance stems from genuine semantic tool-use capability or mere memorization of specific interface layouts, and therefore cannot reliably reflect true generalization. To address this, we propose the PIPE evaluation protocol, which subtly perturbs environment interfaces while preserving task semantics and executable behavior. We also introduce a novel Interface Reliance (IR) metric to quantify an agent's dependence on training-time interface specifics. Through interface rewriting, adversarial evaluation, and alias-balancing experiments across 16 AgentBench and AgentGym environments, we find that trajectory-supervised fine-tuning (trajectory-SFT) significantly exacerbates shortcut learning on interface cues, leading to sharp performance drops under interface perturbations, whereas untrained models remain stable, exposing critical limitations in current evaluation paradigms.
📝 Abstract
Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool use and memorization of interface-specific interaction patterns. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We propose PIPE, a protocol-level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias-based metric that quantifies preference for training-time interfaces, and show that interface shortcutting exhibits environment-dependent, non-monotonic training dynamics that remain invisible under standard evaluation. Our code is available at https://anonymous.4open.science/r/What-Do-Agents-Learn-from-Trajectory-SFT-Semantics-or-Interfaces--0831/.
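To make the "counterbalanced alias-based" idea behind IR concrete, the sketch below assumes a simplified setup (not the paper's exact formula): each trial offers the agent two semantically equivalent tool aliases, one matching the training-time interface and one rewritten, with the training alias's position counterbalanced across trials so positional bias cancels. IR then scores how strongly the agent's choices favor the training-time alias.

```python
from typing import List

def interface_reliance(choices: List[str], positions: List[str]) -> float:
    """Illustrative IR score (an assumption, not the paper's definition).

    choices:   which alias the agent invoked per trial, "train" or "rewrite"
    positions: slot the training alias occupied per trial, "first" or "second"
               (counterbalanced across trials to control for position bias)
    Returns a value in [-1, 1]: 0 = no preference, 1 = always picks the
    training-time alias, -1 = always avoids it.
    """
    # Average the preference within each position group, then average the
    # group means, so position bias cancels even if groups are uneven.
    prefs = {"first": [], "second": []}
    for choice, pos in zip(choices, positions):
        prefs[pos].append(1.0 if choice == "train" else 0.0)
    group_means = [sum(v) / len(v) for v in prefs.values() if v]
    pref = sum(group_means) / len(group_means)
    return 2.0 * pref - 1.0
```

Under this toy definition, an agent that always calls the training-time alias regardless of where it appears scores 1.0, while one indifferent to the alias scores near 0 — matching the paper's claim that trajectory-SFT models shift toward training-time interfaces while untrained models stay near neutral.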