🤖 AI Summary
This work addresses the challenge of behavioral inconsistency in large language model (LLM) agents, which often exhibit divergent behaviors under identical inputs, hindering reliable deployment. To enable cross-run diagnosis of such inconsistencies, we propose InconLens—a novel visual analytics system that introduces an “information node” abstraction to semantically align and compare multi-turn execution trajectories across runs at fine granularity. By integrating structured logging, semantic alignment, and interactive exploration techniques, InconLens offers the first interactive diagnostic framework for investigating behavioral discrepancies in LLM agents. Case studies and expert interviews demonstrate that InconLens effectively pinpoints divergence points, uncovers underlying failure patterns, and helps developers improve the reliability and stability of LLM-based agent systems.
📝 Abstract
Large Language Model (LLM)-based agentic systems have shown growing promise in tackling complex, multi-step tasks through autonomous planning, reasoning, and interaction with external environments. However, the stochastic nature of LLM generation introduces intrinsic behavioral inconsistency: the same agent may succeed in one execution but fail in another under identical inputs. Diagnosing such inconsistencies remains a major challenge for developers, as agent execution logs are often lengthy, unstructured, and difficult to compare across runs. Existing debugging and evaluation tools primarily focus on inspecting single executions, offering limited support for understanding how and why agent behaviors diverge across repeated runs. To address this challenge, we introduce InconLens, a visual analytics system designed to support interactive diagnosis of LLM-based agentic systems with a particular focus on cross-run behavioral analysis. InconLens introduces information nodes as an intermediate abstraction that captures canonical informational milestones shared across executions, enabling semantic alignment and inspection of agent reasoning trajectories across multiple runs. We demonstrate the effectiveness of InconLens through a detailed case study and further validate its usability and analytical value via expert interviews. Our results show that InconLens enables developers to more efficiently identify divergence points, uncover latent failure modes, and gain actionable insights into improving the reliability and stability of agentic systems.
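The core idea in the abstract—projecting each run's raw steps onto canonical "information nodes" and then aligning runs at that level to locate divergence—can be sketched as follows. This is a minimal illustrative assumption, not the paper's implementation: the canonicalizer here is a simple keyword lookup (the real system would likely use an LLM- or embedding-based matcher), and all node labels and trajectories are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical canonicalizer: maps raw step descriptions from an agent log
# to shared "information node" labels. Illustrative only.
CANONICAL = {
    "search flights": "N1:flight_options",
    "query flights": "N1:flight_options",   # paraphrase of the same milestone
    "pick cheapest": "N2:selected_flight",
    "book ticket": "N3:booking_confirmed",
}

def to_nodes(run):
    """Project a raw trajectory onto canonical information nodes."""
    return [CANONICAL.get(step, f"uncanonical:{step}") for step in run]

def first_divergence(run_a, run_b):
    """Align two node sequences and return the index in run_a where the
    runs first diverge, or None if they align exactly."""
    a, b = to_nodes(run_a), to_nodes(run_b)
    for op, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if op != "equal":
            return i1
    return None

# Two runs of the same task: one succeeds, one skips the selection step.
run_success = ["search flights", "pick cheapest", "book ticket"]
run_failure = ["query flights", "book ticket"]
print(first_divergence(run_success, run_failure))  # -> 1
```

Note that the two runs use different surface wording ("search flights" vs. "query flights") yet align at the first node; the divergence is detected at the semantic level, which is the point of the information-node abstraction.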