🤖 AI Summary
This work addresses the behavioral homogenization that model distillation induces in LLM agents, which existing evaluation metrics capture poorly because they conflate actions a task requires with habits that reflect a model's own preferences. To separate the two, the authors propose two complementary metrics: Response Pattern Similarity (RPS), which measures verbal alignment, and Action Graph Similarity (AGS), which measures similarity in tool-use habits modeled as directed graphs. They evaluate 18 models from 8 providers on τ-Bench and τ²-Bench against Claude Sonnet 4.5 (thinking). Within-family model pairs score 5.9 percentage points higher in AGS than cross-family pairs, and Kimi-K2 (thinking) reaches 82.6% node similarity and 94.7% dependency similarity, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further indicates that AGS distinguishes teacher-specific convergence from general improvement.
📝 Abstract
Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on τ-Bench and τ²-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% $S_{\text{node}}$ and 94.7% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.
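To make the idea of comparing tool-use habits as directed graphs concrete, here is a minimal sketch. It is not the paper's definition of AGS, $S_{\text{node}}$, or $S_{\text{dep}}$, which the abstract does not spell out. It assumes that a trajectory is an ordered list of tool names, that nodes are the distinct tools, that edges are consecutive tool-call pairs, and that similarity is Jaccard overlap. The tool names and function names are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def trace_to_graph(trace: List[str]) -> Tuple[Set[str], Set[Edge]]:
    """Turn an ordered list of tool names into (nodes, directed edges).

    Assumption: an edge links each tool call to the one that follows it.
    """
    nodes = set(trace)
    edges = set(zip(trace, trace[1:]))
    return nodes, edges


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def graph_similarity(trace_a: List[str], trace_b: List[str]) -> Tuple[float, float]:
    """Return (node overlap, edge overlap) for two tool-call traces."""
    nodes_a, edges_a = trace_to_graph(trace_a)
    nodes_b, edges_b = trace_to_graph(trace_b)
    return jaccard(nodes_a, nodes_b), jaccard(edges_a, edges_b)


# Hypothetical traces for the same task from two different models.
teacher = ["get_user", "get_order", "check_policy", "refund"]
student = ["get_user", "get_order", "refund"]

node_sim, edge_sim = graph_similarity(teacher, student)
print(f"node overlap: {node_sim:.2f}, edge overlap: {edge_sim:.2f}")
```

This sketch compares whole traces, so it does not separate mandatory from non-mandatory behavior. A metric in the spirit of the abstract would first discard the actions that every successful solution must contain and compare only what remains.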