Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the practical problem-solving capabilities and contextual adaptability of large language models (LLMs) on English Standardized Tests (ESTs). Method: We introduce ESTBOOK—the first multimodal benchmark covering five major EST categories, 29 question types, and 10,576 items—incorporating text, images, audio, tables, and mathematical notation. We propose an interpretable, stepwise reasoning analysis framework that enables fine-grained decomposition of solution chains and capability attribution. Our methodology integrates multimodal data ingestion, zero-shot and few-shot inference, structured prompt engineering, and quantitative evaluation metrics. Contribution/Results: Experiments reveal LLMs’ capability boundaries and recurrent bottlenecks across question types, empirically validating the framework’s diagnostic efficacy for capability attribution. ESTBOOK and the analysis framework provide an evidence-based foundation and actionable optimization pathways for developing trustworthy AI-powered tutoring systems in educational settings.

📝 Abstract
AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.
Problem

Research questions and friction points this paper is trying to address.

Assess LLMs' ability to solve English Standardized Test questions
Evaluate LLM accuracy and efficiency using the ESTBOOK benchmark
Propose a framework to analyze LLM performance in test preparation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ESTBOOK benchmark for LLM evaluation
Proposes breakdown analysis framework for question steps
Evaluates LLM accuracy and inference efficiency systematically
👥 Authors
Luoxi Tang — Binghamton University
Tharunya Sundar — Binghamton University
Shuai Yang — Binghamton University
Ankita Patra — Binghamton University
Manohar Chippada — Binghamton University
Giqi Zhao — BlossomsAI
Yi Li — BlossomsAI
Riteng Zhang — BlossomsAI
Tunan Zhao — Binghamton University
Ting Yang — Binghamton University
Yuqiao Meng — Binghamton University
Weicheng Ma — Dartmouth College
Zhaohan Xi — Binghamton University
AI for Science · Large Language Models · Healthcare AI · Cybersecurity · AI Security