RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

📅 2025-09-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methods—such as single-turn LLM judgment—lack reasoning capability, iterative refinement, and fine-grained fault attribution, hindering root-cause analysis in long-horizon, multi-component LLM agent systems. Method: We propose the first evaluation framework integrating structured reasoning and self-reflection, comprising a central Judge and multiple specialized Evaluators. It performs multi-stage reasoning, dynamic hypothesis generation, and iterative refinement to achieve component-level and timestamp-level attribution—answering “who failed and when.” The framework jointly assesses reasoning quality and component behavior while enabling long-chain logical traceability. Contribution/Results: On the Who&When benchmark, our method achieves >43% error localization accuracy on algorithmically generated data and >20% on manually curated data—substantially reducing reliance on manual inspection. This establishes an interpretable, iterative evaluation paradigm for complex LLM agent systems.

📝 Abstract
We have reached a critical roadblock in the development and enhancement of long-horizon, multi-component LLM agentic systems: it is incredibly tricky to identify where these systems break down and why. Evaluation capabilities that exist today (e.g., single-pass LLM-as-a-judge) are limited in that they often focus on individual metrics or capabilities, or on end-to-end outcomes, and are narrowly grounded in human preferences. We argue that to match the agentic capabilities, evaluation frameworks must also be able to reason, probe, iterate, and understand the complex logic passing through these systems over long horizons. In this paper, we present RAFFLES, an evaluation architecture that incorporates reasoning and iterative refinement. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically investigate faults and a set of specialized Evaluators to assess not only the system's components but also the quality of the reasoning by the Judge itself, thereby building a history of hypotheses. We tested RAFFLES against several baselines on the Who&When dataset, a benchmark designed to diagnose the "who" (agent) and "when" (step) of a system's failure. RAFFLES outperforms these baselines, achieving an agent-step fault pair accuracy of over 43% on the Algorithmically-Generated dataset (a substantial increase from the previously published best of 16.6%) and over 20% on the Hand-Crafted dataset (surpassing the previously published best of 8.8%). These results demonstrate a key step towards automated fault detection for autonomous systems, reducing reliance on labor-intensive manual human review.
Problem

Research questions and friction points this paper is trying to address.

Identifying faults in multi-component LLM agentic systems
Developing reasoning-based evaluation frameworks for LLM systems
Automating fault detection to replace manual human review
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning-based iterative refinement architecture
Multi-component pipeline with central Judge
Specialized Evaluators assess reasoning quality
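The Judge–Evaluator loop described above can be sketched roughly as follows. This is a minimal illustrative mock, not the paper's implementation: the `FaultHypothesis` class, the toy `judge` and `evaluate` functions, and the trace format are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class FaultHypothesis:
    agent: str       # "who" failed
    step: int        # "when" it failed
    rationale: str   # the Judge's reasoning for this hypothesis

def judge(trace, history):
    """Toy Judge: propose the next (agent, step) pair not yet hypothesized."""
    tried = {(h.agent, h.step) for h in history}
    for step, (agent, output) in enumerate(trace):
        if (agent, step) not in tried:
            return FaultHypothesis(agent, step, f"output of {agent} at step {step} looks suspect")
    return None  # hypothesis space exhausted

def evaluate(trace, hyp):
    """Toy Evaluator: accept the hypothesis only if the step's output is marked bad."""
    _, output = trace[hyp.step]
    return output == "BAD"

def fault_attribution_loop(trace, max_iters=10):
    """Iterate: the Judge proposes, Evaluators verify, rejected hypotheses accumulate."""
    history = []
    for _ in range(max_iters):
        hyp = judge(trace, history)
        if hyp is None:
            return None
        if evaluate(trace, hyp):
            return (hyp.agent, hyp.step)  # attributed fault: who and when
        history.append(hyp)
    return None

# Example: a three-agent trace where the coder produced a faulty output at step 1.
trace = [("planner", "OK"), ("coder", "BAD"), ("verifier", "OK")]
```

In the actual framework both the Judge and the Evaluators are LLM-driven and the Evaluators also score the Judge's reasoning quality; here they are stubbed deterministically to show the control flow only.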