Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated scoring of subjective answers faces challenges including question-type heterogeneity, open-ended responses, and poor interpretability, which limit the generalization of existing methods. This paper proposes the first unified LLM-augmented scoring framework designed for multi-question-type, cross-domain applications, comprising four synergistic modules: content similarity computation, knowledge-point alignment, answer relevance verification, and human-like feedback generation. We introduce pseudo-question generation and feedback simulation mechanisms, integrating text matching, key information extraction, and LLM-driven reasoning and generation to significantly enhance assessment authenticity and interpretability. Extensive experiments on both general-purpose and domain-specific benchmarks demonstrate consistent superiority over traditional and LLM-based baselines. The framework has been deployed in a large e-commerce enterprise's training and certification examination system.

📝 Abstract
Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.
Problem

Research questions and friction points this paper is trying to address.

Automating subjective question grading across diverse formats
Providing human-like evaluation for open-ended student responses
Developing a unified framework supporting comprehensive exam assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified LLM framework for subjective question grading
Four complementary modules for holistic answer evaluation
Leverages LLM reasoning for knowledge comparison and simulation
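The four-module design above can be sketched as a simple scoring pipeline. This is a minimal illustration, not the paper's implementation: the text-matching module is approximated with token-set Jaccard overlap, the three LLM-backed modules (knowledge-point alignment, pseudo-question relevance, human-like feedback) are injected as stand-in callables, and the `weights` are an assumption, not values from the paper.

```python
# Hypothetical sketch of the four-module grading pipeline described above.
# Module names, signatures, and weights are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, Tuple


def text_match_score(student: str, reference: str) -> float:
    """Module 1: basic content similarity via token-set Jaccard overlap."""
    s, r = set(student.lower().split()), set(reference.lower().split())
    return len(s & r) / len(s | r) if s | r else 0.0


@dataclass
class UnifiedGrader:
    # Modules 2-4 are LLM-driven in the paper; here they are injected
    # callables returning a score in [0, 1] (stand-ins, not real prompts).
    knowledge_align: Callable[[str, str], float]   # student vs. reference key points
    relevance_check: Callable[[str, str], float]   # pseudo-question vs. original question
    feedback_score: Callable[[str], float]         # simulated human-like evaluation
    weights: Tuple[float, float, float, float] = (0.25, 0.35, 0.2, 0.2)  # assumed

    def grade(self, question: str, student: str, reference: str) -> float:
        """Combine the four module scores into one overall grade in [0, 1]."""
        scores = [
            text_match_score(student, reference),
            self.knowledge_align(student, reference),
            self.relevance_check(student, question),
            self.feedback_score(student),
        ]
        return sum(w * s for w, s in zip(self.weights, scores))
```

A usage sketch: wiring trivial stand-ins in place of the LLM modules shows the intended data flow (question, student answer, and reference answer all feed the grader), even though real behavior depends entirely on the LLM prompts.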
Fanwei Zhu (Hangzhou City University)
Jiaxuan He (Alibaba Group)
Xiaoxiao Chen (Zhejiang Hospital)
Zulong Chen (Director, Alibaba Group; Machine Learning, Large Language Model, Search & Recommendation, NLP)
Quan Lu (Mashang Consumer Finance Co)
Chenrui Mei (Hangzhou City University)