Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Chinese automated essay scoring (AES) methods exhibit limited performance, and the potential of large language models (LLMs) remains underexploited. To address this, we propose a two-stage framework, Rank-Then-Score (RTS): first, a ranking model performs relative ordering over candidate scores; second, a scoring model integrates the ranking output with the original essay to generate an absolute score. This introduces the novel “ranking-guided scoring” paradigm—the first systematic approach to enhance LLM-based Chinese AES. By decoupling ranking from scoring, RTS improves both interpretability and robustness. Technically, we employ two-stage fine-tuning, feature-augmented ranking data construction, explicit injection of candidate score sets, and conditional scoring modeling. On the HSK (Chinese) and ASAP (English) benchmarks, RTS consistently outperforms direct prompting baselines; on HSK, it achieves state-of-the-art performance, with a significant improvement in average quadratic weighted kappa (QWK).
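The headline metric, quadratic weighted kappa (QWK), measures agreement between predicted and gold scores while penalizing disagreements by the squared distance between ratings. As a reference point, here is a minimal self-contained implementation of the standard QWK formula (this is the textbook definition, not the paper's evaluation code):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic weighted kappa between two integer rating vectors."""
    n = max_rating - min_rating + 1
    # Observed (confusion) matrix O over the rating scale
    O = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        O[t - min_rating, p - min_rating] += 1
    # Quadratic disagreement weights: W[i, j] = (i - j)^2 / (n - 1)^2
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    # Expected matrix E from the marginal rating histograms
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()

# Perfect agreement yields kappa = 1.0
print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))  # → 1.0
```

QWK is 1.0 for perfect agreement, 0.0 for chance-level agreement, and negative when predictions disagree more than chance would.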

📝 Abstract
In recent years, large language models (LLMs) have achieved remarkable success across a variety of tasks. However, their potential in the domain of Automated Essay Scoring (AES) remains largely underexplored. Moreover, compared to English data, methods for Chinese AES are not well developed. In this paper, we propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Specifically, we fine-tune the ranking model (Ranker) with feature-enriched data, and then feed the output of the ranking model, in the form of a candidate score set, together with the essay content into the scoring model (Scorer) to produce the final score. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK across all LLMs and datasets, and achieves the best performance on Chinese essay scoring using the HSK dataset.
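The Ranker-then-Scorer flow described above can be sketched as a two-stage pipeline. This is a hypothetical illustration of the control flow only: the helper names, the 0–60 score range, and the toy stand-in models are assumptions, not the authors' implementation.

```python
def rank_then_score(essay, ranker, scorer, score_range=range(0, 61)):
    # Stage 1: the Ranker narrows the full score range to a small
    # candidate score set via relative ordering over the essay.
    candidate_scores = ranker(essay, list(score_range))
    # Stage 2: the Scorer conditions on both the essay and the
    # candidate set to produce the final absolute score.
    return scorer(essay, candidate_scores)

# Toy stand-ins so the sketch runs end to end; real Ranker/Scorer
# would be fine-tuned LLM calls.
toy_ranker = lambda essay, scores: scores[28:33]          # e.g. [28..32]
toy_scorer = lambda essay, candidates: candidates[len(candidates) // 2]

print(rank_then_score("An essay...", toy_ranker, toy_scorer))  # → 30
```

The point of the decomposition is that the Scorer never has to pick from the full score scale: it resolves an absolute score only within the narrow band the Ranker has already vetted.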
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs for Automated Essay Scoring (AES)
Improving Chinese AES methods using Rank-Then-Score
Outperforming direct prompting in essay scoring accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes LLMs for essay scoring
Uses Rank-Then-Score framework
Enhances Chinese AES performance