🤖 AI Summary
This study addresses a limitation of existing student persona methods: they rely on ad-hoc prompting or hand-crafted profiles, with limited control over alignment with educational theory and over population distributions, which hinders systematic evaluation of educational large language models. The authors formalize the Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) task and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that combines a theory-anchored educational schema, a neuro-symbolic validator, stratified sampling, and semantic deduplication. Using Qwen2.5-72B, they construct HACHIMI-1M, a corpus of one million synthetic student personas spanning Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quota adherence, and substantial diversity. In extrinsic evaluation, personas are instantiated as student agents that answer CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly with real students, while classroom-climate and well-being constructs align only moderately.
📝 Abstract
Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI
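The quota-controlled Propose-Validate-Revise loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the HACHIMI repository: the function names, the toy random proposer (standing in for a Qwen2.5-72B call), the single age-versus-grade rule (standing in for the neuro-symbolic validator), and exact-match deduplication (standing in for semantic deduplication).

```python
import random
from collections import Counter

def allocate_quotas(proportions, total):
    """Largest-remainder allocation of `total` slots across strata."""
    raw = {k: p * total for k, p in proportions.items()}
    quotas = {k: int(v) for k, v in raw.items()}
    leftover = total - sum(quotas.values())
    # Hand the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda k: raw[k] - quotas[k], reverse=True)
    for k in by_remainder[:leftover]:
        quotas[k] += 1
    return quotas

def propose(grade, rng):
    """Toy proposer (hypothetical stand-in for an LLM); may emit invalid ages."""
    return {
        "grade": grade,
        "age": grade + rng.choice([4, 5, 6, 7, 9]),
        "interest": rng.choice(["math", "reading", "sports", "music"]),
    }

def validate(persona):
    """Hypothetical symbolic rule: age must lie in grade + 5 .. grade + 7."""
    return 5 <= persona["age"] - persona["grade"] <= 7

def revise(persona):
    """Repair the violated field instead of discarding the proposal."""
    fixed = dict(persona)
    low, high = persona["grade"] + 5, persona["grade"] + 7
    fixed["age"] = min(max(persona["age"], low), high)
    return fixed

def generate(proportions, total, seed=0):
    """Fill each stratum's quota with validated, de-duplicated personas."""
    rng = random.Random(seed)
    seen, personas = set(), []
    for grade, quota in allocate_quotas(proportions, total).items():
        accepted = 0
        while accepted < quota:
            persona = propose(grade, rng)
            if not validate(persona):
                persona = revise(persona)
            key = (persona["grade"], persona["age"], persona["interest"])
            if key in seen:  # exact-match dedup, a stand-in for semantic dedup
                continue
            seen.add(key)
            personas.append(persona)
            accepted += 1
    return personas

if __name__ == "__main__":
    sample = generate({1: 0.5, 2: 0.25, 3: 0.25}, total=20)
    print(Counter(p["grade"] for p in sample))
```

In this toy setting each grade admits only 12 distinct personas (3 valid ages times 4 interests), so a per-grade quota above 12 would never be filled; a real pipeline would need a far richer persona space and a similarity threshold rather than exact matching.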