Fine-tuning ChatGPT for Automatic Scoring of Written Scientific Explanations in Chinese

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the low accuracy of automated scoring for Chinese scientific explanation texts. We propose an education-oriented domain adaptation method for ChatGPT via fine-tuning. Leveraging a high-quality dataset of Chinese scientific responses, we combine supervised fine-tuning with multidimensional analysis (Kendall correlation and qualitative linguistic modeling) to reveal, for the first time, a nonlinear relationship between LLM scoring accuracy and reasoning complexity: a negative correlation on low-order tasks and a positive correlation on high-order tasks. We further find that conciseness and clarity dominate scoring in foundational tasks, whereas comprehensiveness governs high-order evaluation. Key linguistic features, including sentence length, information density, and causal expression, exert significant moderating effects on scoring bias. Our findings empirically validate the efficacy of domain-specific fine-tuning for AI-driven educational assessment and establish a novel, interpretable, capability-aligned paradigm for automated scoring, along with concrete optimization pathways.
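The supervised fine-tuning step is not detailed on this page, but ChatGPT fine-tuning data is typically prepared in OpenAI's chat-format JSONL. A minimal sketch follows; the rubric wording, task, student response, and score are fabricated placeholders, not data from the paper:

```python
import json

# One supervised training example in OpenAI's chat fine-tuning JSONL format.
# System turn carries the scoring instructions, user turn the task and the
# student's written explanation, assistant turn the target rubric score.
example = {
    "messages": [
        {"role": "system",
         "content": "Score the student's scientific explanation from 0 to 3 "
                    "according to the rubric."},
        {"role": "user",
         "content": "Task: Explain why ice floats on water.\n"
                    "Response: 冰的密度比水小，所以冰会浮在水面上。"},
        {"role": "assistant", "content": "2"},
    ]
}

# Fine-tuning data is uploaded as JSONL: one JSON object per line.
line = json.dumps(example, ensure_ascii=False)
print(line.count('"role"'))  # → 3: one turn each for system, user, assistant
```

Hundreds of such lines, one per human-scored response, make up a training file for a fine-tuning job.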

📝 Abstract
The development of explanations for scientific phenomena is essential in science assessment, but scoring student-written explanations remains challenging and resource-intensive. Large language models (LLMs) have shown promise in addressing this issue, particularly in alphabetic languages like English. However, their applicability to logographic languages is less explored. This study investigates the potential of fine-tuning ChatGPT, a leading LLM, to automatically score scientific explanations written in Chinese. Student responses to seven scientific explanation tasks were collected and automatically scored, with scoring accuracy examined in relation to reasoning complexity using the Kendall correlation. A qualitative analysis explored how linguistic features influenced scoring accuracy. The results show that domain-specific adaptation enables ChatGPT to score Chinese scientific explanations accurately. However, scoring accuracy correlates with reasoning complexity: a negative correlation for lower-level responses and a positive one for higher-level responses. The model overrates complex reasoning in low-level responses with intricate sentence structures and underrates high-level responses using concise causal reasoning. These correlations stem from linguistic features: simplicity and clarity enhance accuracy for lower-level responses, while comprehensiveness improves accuracy for higher-level ones. Simpler, shorter responses tend to score more accurately at lower levels, whereas longer, information-rich responses yield better accuracy at higher levels. These findings demonstrate the effectiveness of LLMs in automatic scoring within a Chinese context and emphasize the importance of linguistic features and reasoning complexity in fine-tuning scoring models for educational assessments.
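The Kendall correlation between human and model scores, as used in the analysis above, can be sketched in a few lines. A minimal pure-Python version of the tau-a statistic; the score vectors below are fabricated for illustration, not taken from the study:

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:        # pair ordered the same way in both rankings
            concordant += 1
        elif s < 0:      # pair ordered oppositely
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Fabricated human rubric scores vs. model scores for four responses.
human = [0, 1, 2, 3]
model = [0, 2, 1, 3]
print(f"tau = {kendall_tau_a(human, model):.3f}")  # → tau = 0.667
```

In practice, `scipy.stats.kendalltau` is usually preferred, since it computes the tie-corrected tau-b variant, which matters when discrete rubric scores produce many ties.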
Problem

Research questions and friction points this paper is trying to address.

ChatGPT
Chinese Scientific Explanation Evaluation
Text Feature Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese Scientific Explanation Evaluation
ChatGPT Adaptation
Text Complexity Analysis
Jie Yang
Faculty of Psychology, Beijing Normal University, Beijing 100875, China; Research Institute of Science Education, Beijing Normal University, Beijing 100875, China
Ehsan Latif
University of Georgia
Multi-robot systems, Machine Learning, AIED
Yuze He
Research Institute of Science Education, Beijing Normal University, Beijing 100875, China
Xiaoming Zhai
Associate Professor, University of Georgia
Science Education, AI, Assessment