SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

📅 2026-05-12
📈 Citations: 0
Influential: 0
📄 PDF

🤖 AI Summary
Large language models perform well on standard knowledge benchmarks but become markedly less robust on question variants that test the same knowledge in a different form. Existing LLM-assisted generate-then-verify pipelines for robustness augmentation are costly and hard to scale because variant generation is low-yield and variant verification is unreliable. This work proposes SAGE, a framework that uses fine-tuned smaller models to build a low-cost, scalable pipeline for automated robustness augmentation. SAGE comprises VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. On HellaSwag, SAGE produces a large-scale robustness-augmented benchmark whose quality is comparable to the human-annotated HellaSwag-Pro at substantially lower cost, and the fine-tuned models generalize to MMLU without benchmark-specific fine-tuning.
📝 Abstract
Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.
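The generate-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `augment`, `Variant`, the `threshold` cutoff, and the toy generator and scorer are all hypothetical stand-ins, and the sketch covers only filtering candidates at inference time, not the supervised fine-tuning or reinforcement learning stages.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Variant:
    seed_question: str  # original benchmark item
    text: str           # candidate rewrite meant to test the same knowledge


def augment(
    seeds: Iterable[str],
    generate: Callable[[str], List[str]],  # hypothetical stand-in for a VariantGen-like model
    score: Callable[[str, str], float],    # hypothetical stand-in for a VariantQual-like verifier
    threshold: float = 0.5,                # assumed acceptance cutoff, not from the paper
) -> Tuple[List[Variant], float]:
    """Generate candidates per seed and keep those the verifier accepts.

    Returns the accepted variants and the yield (accepted / generated).
    """
    accepted: List[Variant] = []
    generated = 0
    for seed in seeds:
        for candidate in generate(seed):
            generated += 1
            if score(seed, candidate) >= threshold:
                accepted.append(Variant(seed, candidate))
    yield_rate = len(accepted) / generated if generated else 0.0
    return accepted, yield_rate


# Toy stand-ins so the sketch runs without any model.
def toy_generate(seed: str) -> List[str]:
    return [seed.upper(), ""]  # one usable rewrite, one degenerate output


def toy_score(seed: str, candidate: str) -> float:
    return 1.0 if candidate.strip() else 0.0  # reject empty candidates


variants, yield_rate = augment(["What is 2 + 2?"], toy_generate, toy_score)
print(len(variants), yield_rate)
```

The yield returned here is the quantity a low-yield generator drives down. Per the abstract, SAGE reuses the verifier's judgment as the reward signal when optimizing the generator, which this sketch does not attempt to show.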
Problem

Research questions and friction points this paper is trying to address.

robustness augmentation
knowledge evaluation
LLM brittleness
benchmark scaling
variant generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

robustness augmentation
variant generation
reinforcement learning
fine-tuned smaller models
automated benchmarking