AI Summary
Current evaluations of large language models (LLMs) are hindered by the scarcity of efficient, scalable, high-quality datasets that support multilingual and multidomain assessment. This work proposes a fully automated synthetic evaluation framework that generates customized, high-quality evaluation data with minimal human input, reducing reliance on real-world data and enabling controllable, scalable, multilingual end-to-end benchmarking. The framework integrates a modified TGRT Self-Instruct method, a synthetic data engine, and an LLM-as-a-judge mechanism, complemented by hybrid metrics that combine statistical and language model-based evaluation. In experiments, LLM-as-a-judge scores on the synthetic datasets differ from those on existing language-specific benchmarks by +5.7% on average, indicating evaluation validity comparable to that of real data while improving efficiency and applicability. Real datasets remain slightly more challenging, especially for smaller models.
Abstract
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging because of privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by their reliance on pre-existing data, poor scalability, a single-domain focus, and a lack of multilingual support. We present STELLAR-E, a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets. The system is structured in two stages: (1) a synthetic data engine, built by modifying the TGRT Self-Instruct framework, that enables controllable, custom synthetic dataset generation; and (2) an evaluation pipeline that combines statistical and LLM-based metrics to assess how well the synthetic dataset serves the evaluation of LLM-based applications. The synthetic datasets show an average difference of +5.7% in LLM-as-a-judge scores relative to existing language-specific benchmarks, demonstrating comparable quality for the comprehensive assessment of both large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offers a faster alternative to manual approaches, and enables high-efficiency automated quality assurance cycles.
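The headline comparison, an average difference in LLM-as-a-judge scores between synthetic and existing benchmarks, can be illustrated with a minimal sketch. It assumes the difference is the mean of per-model score gaps (synthetic minus real) on a shared scale; the abstract does not specify the exact aggregation. The function name `average_score_gap`, the model names, and the scores are hypothetical placeholders, not values from the paper.

```python
from statistics import mean


def average_score_gap(synthetic_scores, real_scores):
    """Mean per-model gap (synthetic minus real), in the scores' own units.

    Assumption: both arguments map a model name to its mean judge score
    on the same scale. Only models present in both mappings are compared.
    """
    shared = sorted(set(synthetic_scores) & set(real_scores))
    if not shared:
        raise ValueError("no models in common between the two score sets")
    return mean(synthetic_scores[m] - real_scores[m] for m in shared)


# Placeholder judge scores on a 0-100 scale; NOT results from the paper.
synthetic = {"model_a": 80.0, "model_b": 60.0}
real = {"model_a": 76.0, "model_b": 52.0}

gap = average_score_gap(synthetic, real)
print(f"average gap: {gap:+.1f} points")
```

A positive gap under this reading means models score higher on the synthetic data, which matches the abstract's remark that real datasets remain slightly more challenging.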