Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

📅 2026-05-19

🤖 AI Summary
This study addresses the validity threat that arises when a single large language model (LLM) generates assessment items, simulates student responses, and scores them, so that the validation loop is self-referential. The authors propose Generative-Evaluative Agreement (GEA) as a validity criterion: it measures whether the scores the model assigns recover the proficiency levels it was instructed to embed at generation time. The work implements a two-stage adaptive testing system built on skill decomposition, covering item generation, response simulation, and automated scoring, and reports the first direct measurement of GEA. The overall GEA correlation is r = 0.698, with a systematic positive bias. Agreement is strong for syntactically verifiable skills (r > 0.7) but near zero for design-level skills. Low-skill responses are overestimated, which inflates scores near the routing threshold. The authors argue that fine-grained, skill-decomposed scoring rubrics are the main mechanism for strengthening GEA.
📝 Abstract
When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698, r² ≈ 0.49) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
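To make the quantities in the abstract concrete, here is a minimal Python sketch of how a GEA-style summary could be computed from paired values. It assumes GEA is reported as a Pearson correlation between the intended skill level and the level recovered by the scorer, and that bias is the mean signed difference between them. The function names (`pearson_r`, `gea_summary`) and the toy data are hypothetical illustrations, not the authors' implementation or results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def gea_summary(intended, recovered):
    """Summarize agreement between intended and recovered skill levels.

    Assumption: both sequences are on the same numeric scale, so the
    mean signed difference can be read as over- or underestimation.
    """
    r = pearson_r(intended, recovered)
    mean_bias = sum(s - t for t, s in zip(intended, recovered)) / len(intended)
    return {"r": r, "r_squared": r * r, "mean_bias": mean_bias}


# Toy data (hypothetical): the scorer overestimates the two lowest levels.
intended = [1, 2, 3, 4, 5]
recovered = [2, 3, 3, 4, 5]
summary = gea_summary(intended, recovered)
print(summary)
```

A positive `mean_bias` corresponds to the overestimation described above, and `r_squared` is the share of intended variance that the scorer recovers. Running the same summary separately for each skill would be one way to surface the contrast between syntactically verifiable and design-level skills.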
Problem

Research questions and friction points this paper is trying to address.

Generative-Evaluative Agreement
LLM-enabled adaptive assessment
validity criterion
self-referential validation
skill recovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative-Evaluative Agreement
LLM-enabled adaptive assessment
validity criterion
skill-decomposed rubrics
self-referential validation