🤖 AI Summary
This work addresses benchmark contamination in large language models (LLMs): as benchmark data leaks into training sets over time, evaluation scores become inflated and genuine model regressions are obscured. To mitigate this, the authors introduce a continuously operating evaluation platform built around a user feedback–driven repository of model issues. The platform combines LLM-as-a-judge scoring, an automated QA testing pipeline, and side-by-side cross-model comparison to enable fine-grained regression detection and ongoing performance monitoring. Released for public access, this infrastructure aims to improve the authenticity, timeliness, and reliability of LLM evaluation in evolving deployment scenarios.
📝 Abstract
Interest in large language models (LLMs) is largely driven by their performance on popular topics and benchmarks at the time of their release. Over time, however, contamination occurs as benchmark data is increasingly exposed during training, which risks inflating measured performance if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform built around a comprehensive system for maintaining and evaluating model issues. Our approach builds a repository of model problems from user feedback over time and provides a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform supports side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.
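The abstract's core loop — maintain a repository of user-reported issues, run QA tests against each model with an LLM judge, and diff pass sets across releases to spot regressions — can be sketched as follows. This is a minimal illustration, not GRAFITE's implementation: the `Issue` schema, the substring-based `judge_passes` stand-in for a real LLM-as-a-judge call, and all function names are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    # A user-reported model problem kept in the issue repository
    # (schema is hypothetical; the real platform's fields may differ).
    issue_id: str
    prompt: str
    expected: str  # behavior the judge checks the response for

def judge_passes(response: str, expected: str) -> bool:
    # Stand-in for an LLM-as-a-judge call: a real judge would score the
    # response with a grading prompt; here we use a simple substring check.
    return expected.lower() in response.lower()

def run_qa(responses: dict[str, str], issues: list[Issue]) -> set[str]:
    # Run the QA suite for one model: return the ids of issues it passes.
    return {i.issue_id for i in issues
            if judge_passes(responses.get(i.issue_id, ""), i.expected)}

def regressions(old_pass: set[str], new_pass: set[str]) -> set[str]:
    # Regression detection across releases: issues the previous
    # release passed but the new release now fails.
    return old_pass - new_pass
```

A toy usage run: with two issues, if the new release stops answering the second one correctly, it shows up in the regression set.

```python
issues = [Issue("i1", "What is 2+2?", "4"),
          Issue("i2", "Capital of France?", "Paris")]
old = run_qa({"i1": "The answer is 4.", "i2": "Paris."}, issues)
new = run_qa({"i1": "The answer is 4.", "i2": "Lyon."}, issues)
regressions(old, new)  # → {"i2"}
```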