TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

📅 2026-02-02

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing evaluation methods offer little fine-grained insight into how large language model agents improve at test time through interaction with their environment (Test-Time Improvement, TTI). In particular, they do not analyze task optimization efficiency, adaptation after errors, or the utility of working memory. This work proposes TIDE, a framework that establishes, for the first time, an agent- and environment-agnostic diagnostic system for TTI. TIDE decomposes test-time improvement into three quantifiable and interrelated dimensions: the temporal dynamics of task completion, constraints imposed by recursive looping behaviors, and memory burden. Using trajectory-level analysis that integrates time-series modeling, behavioral pattern recognition, and quantification of memory load, TIDE shows across diverse agents and environments that enhancing internal reasoning alone is insufficient for performance gains; the interaction dynamics between agent and environment must be optimized explicitly.

📝 Abstract

Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing evaluation metrics fail to capture agents' task optimization efficiency, their behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework (1) measures the overall temporal dynamics of task completion and identifies whether performance is primarily constrained (2) by recursive looping behaviors or (3) by burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning; it calls for explicitly optimizing the interaction dynamics between the agent and the environment.
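To make the idea of trajectory-level diagnostics concrete, here is a minimal Python sketch of two simple statistics one could compute from logged agent episodes: a success-over-steps curve (a rough proxy for temporal dynamics) and a repeated-action ratio (a rough proxy for looping). The `Trajectory` schema, the function names, and both statistics are illustrative assumptions; the abstract does not specify TIDE's actual metrics, and this is not the authors' implementation. The memory dimension is not sketched, since probing it would presumably require re-running the agent with altered context.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    """One agent episode (hypothetical schema, not taken from the paper)."""
    actions: List[str]               # action issued at each step, in order
    solved_at: Optional[int] = None  # 1-indexed step at which the task was solved, if ever


def success_curve(trajs: List[Trajectory], max_steps: int) -> List[float]:
    """Fraction of episodes solved within t steps, for t = 1..max_steps."""
    n = len(trajs)
    return [
        sum(1 for tr in trajs if tr.solved_at is not None and tr.solved_at <= t) / n
        for t in range(1, max_steps + 1)
    ]


def loop_ratio(traj: Trajectory) -> float:
    """Share of steps whose action was already issued earlier in the same episode."""
    seen = set()
    repeats = 0
    for action in traj.actions:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(traj.actions) if traj.actions else 0.0


# Toy data: one episode that solves the task, one that keeps repeating "look".
trajs = [
    Trajectory(["look", "open door", "go north"], solved_at=3),
    Trajectory(["look", "look", "look", "open door"], solved_at=None),
]
curve = success_curve(trajs, max_steps=4)
ratios = [loop_ratio(tr) for tr in trajs]
```

A flat success curve combined with a high repeated-action ratio would suggest that extra interaction steps are being spent on loops rather than on progress. Exact string matching on actions is a deliberate simplification; real environments would likely need state-aware comparison.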

Problem

Research questions and friction points this paper is trying to address.

Test-Time Improvement

LLM Agents

Evaluation Metrics

Working Memory

Behavior Adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Improvement

Trajectory-based Evaluation

LLM Agents