GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses benchmark contamination in large language model (LLM) evaluation: as benchmark data is increasingly exposed during training, reported scores can become inflated and real regressions between model releases can go unnoticed. To mitigate this, the authors introduce GRAFITE, a continuous evaluation platform that builds a repository of model issues from user feedback over time. The platform assesses LLMs against these issues through quality assurance (QA) tests scored by an LLM-as-a-judge, and supports side-by-side comparison of multiple models to detect regressions across releases. The platform is publicly available on GitHub.

📝 Abstract
Large language models (LLMs) are largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform through a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.
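The workflow the abstract describes (an issue repository, QA tests per issue, a judge, and a comparison across releases) can be illustrated with a minimal sketch. This is not GRAFITE's actual API: the `QATest` schema, the function names, and the two toy model releases are all hypothetical, and the judge is a simple substring check standing in for a real LLM-as-a-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class QATest:
    """One QA test attached to a tracked issue (hypothetical schema)."""
    issue_id: str
    prompt: str
    expected: str


# Assumed interfaces: a model maps a prompt to an answer, and a judge
# decides whether an answer passes a given test.
Model = Callable[[str], str]
Judge = Callable[[QATest, str], bool]


def substring_judge(test: QATest, answer: str) -> bool:
    """Stand-in for an LLM judge: pass if the expected text appears."""
    return test.expected.lower() in answer.lower()


def run_tests(model: Model, tests: List[QATest], judge: Judge) -> Dict[str, bool]:
    """Return pass/fail per issue; an issue passes only if all its tests pass."""
    results: Dict[str, bool] = {}
    for test in tests:
        passed = judge(test, model(test.prompt))
        results[test.issue_id] = results.get(test.issue_id, True) and passed
    return results


def find_regressions(old: Dict[str, bool], new: Dict[str, bool]) -> List[str]:
    """Issues that passed on the old release but fail on the new one."""
    return sorted(i for i, ok in old.items() if ok and i in new and not new[i])


# Toy issue repository and two hypothetical releases of the same model.
tests = [
    QATest("ISSUE-1", "What is 2 + 2?", "4"),
    QATest("ISSUE-2", "What is the capital of France?", "Paris"),
]


def model_v1(prompt: str) -> str:
    answers = {"What is 2 + 2?": "4", "What is the capital of France?": "Paris"}
    return answers.get(prompt, "")


def model_v2(prompt: str) -> str:
    answers = {"What is 2 + 2?": "4", "What is the capital of France?": "Lyon"}
    return answers.get(prompt, "")


results_v1 = run_tests(model_v1, tests, substring_judge)
results_v2 = run_tests(model_v2, tests, substring_judge)
regressions = find_regressions(results_v1, results_v2)
print(regressions)
```

In this sketch an issue counts as a regression only when it passed on the earlier release and fails on the later one, so issues that were already failing are not reported again.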
Problem

Research questions and friction points this paper is trying to address.

LLM evaluation
benchmark contamination
model regression
issue tracking
performance inflation
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous evaluation
LLM-as-a-judge
regression detection
issue tracking
benchmark contamination
Ja Young Lee
IBM Research - AI
Mírian Silva
IBM Research - AI
Mohamed Nasr
ETH Zurich
Party Politics, Voting Behavior, Political Behavior, Elections
Shonda Witherspoon
IBM Research - AI
Enzo Bozzani
IBM Research - AI
Veronique Demers
IBM Research - AI
Radha Ratnaparkhi
IBM Research - AI
Hui Wu
Research Scientist, IBM Research
Artificial Intelligence, Computer Vision, Machine Learning
Sara Rosenthal
IBM Research
Natural Language Processing