🤖 AI Summary
This work addresses benchmark contamination in large language models (LLMs): as benchmark data leaks into training sets over time, evaluation scores become inflated and genuine model regressions are obscured. To mitigate this, the authors introduce a continuously operating evaluation platform built around a user feedback–driven repository of model issues. The platform combines LLM-as-a-judge scoring, an automated QA testing pipeline, and side-by-side cross-model comparison to enable fine-grained regression detection and ongoing performance monitoring. Released for public access, this infrastructure aims to improve the authenticity, timeliness, and reliability of LLM evaluation in evolving deployment scenarios.
📝 Abstract
Interest in large language models (LLMs) is largely driven by their performance on popular topics and benchmarks at the time of their release. Over time, however, contamination occurs as benchmark data is increasingly exposed during training, which risks inflating measured performance if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform built around a comprehensive system for maintaining and evaluating model issues. Our approach builds a repository of model problems from user feedback over time and provides a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform supports side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.
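The abstract's core loop — maintain a repository of user-reported issues, run QA tests against each model with an LLM judge, and diff pass sets across releases to spot regressions — can be sketched as follows. This is a minimal illustration, not GRAFITE's implementation: the `Issue` schema, the substring-based `judge_passes` stand-in for a real LLM-as-a-judge call, and all function names are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    # A user-reported model problem kept in the issue repository
    # (schema is hypothetical; the real platform's fields may differ).
    issue_id: str
    prompt: str
    expected: str  # behavior the judge checks the response for

def judge_passes(response: str, expected: str) -> bool:
    # Stand-in for an LLM-as-a-judge call: a real judge would score the
    # response with a grading prompt; here we use a simple substring check.
    return expected.lower() in response.lower()

def run_qa(responses: dict[str, str], issues: list[Issue]) -> set[str]:
    # Run the QA suite for one model: return the ids of issues it passes.
    return {i.issue_id for i in issues
            if judge_passes(responses.get(i.issue_id, ""), i.expected)}

def regressions(old_pass: set[str], new_pass: set[str]) -> set[str]:
    # Regression detection across releases: issues the previous
    # release passed but the new release now fails.
    return old_pass - new_pass
```

A toy usage run: with two issues, if the new release stops answering the second one correctly, it shows up in the regression set.

```python
issues = [Issue("i1", "What is 2+2?", "4"),
          Issue("i2", "Capital of France?", "Paris")]
old = run_qa({"i1": "The answer is 4.", "i2": "Paris."}, issues)
new = run_qa({"i1": "The answer is 4.", "i2": "Lyon."}, issues)
regressions(old, new)  # → {"i2"}
```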