Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of automated essay scoring for long texts, which is constrained by the limited context window (e.g., 512 tokens) of encoder-based language models. To overcome this limitation, the authors propose an ensemble approach that integrates embeddings from multiple pre-trained models—including BERT, RoBERTa, DistilBERT, and DeBERTa—and combines them using gradient-boosting classifiers such as XGBoost and LightGBM. Evaluated on a dataset of 17,307 essays, the proposed method significantly outperforms individual pre-trained language models in terms of Quadratic Weighted Kappa, demonstrating enhanced accuracy in scoring lengthy essays. The results highlight the effectiveness of leveraging diverse model embeddings and ensemble learning strategies for educational assessment tasks.
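The ensemble-of-embeddings approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: per-essay embeddings from the four encoders (BERT, RoBERTa, DistilBERT, DeBERTa) are simulated here with random vectors so the sketch runs without downloading pretrained models; in practice each block of features would come from a model's pooled [CLS] output, and the classifier could equally be XGBoost or LightGBM.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for pooled [CLS] embeddings from four encoders
# (BERT, RoBERTa, DistilBERT, DeBERTa). Real embeddings would be
# extracted with a library such as Hugging Face transformers.
n_essays, dim = 400, 32
embeddings = {name: rng.normal(size=(n_essays, dim))
              for name in ("bert", "roberta", "distilbert", "deberta")}

# Ensemble-of-embeddings: concatenate the per-model representations
# into a single feature vector per essay.
X = np.concatenate([embeddings[m] for m in embeddings], axis=1)
y = rng.integers(1, 7, size=n_essays)  # toy essay scores on a 1-6 scale

# 80%/10%/10% train/validation/test split, as in the paper.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

# A gradient-boosting classifier serves as the ensemble scorer over
# the concatenated embeddings.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(preds.shape)  # one predicted score per test essay
```

Concatenation is the simplest way to fuse the representations; the gradient booster then learns which models' embedding dimensions matter for each score band.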

📝 Abstract
Long contexts pose challenges for encoder-only language models in text processing, specifically for automated scoring of essays. This study trained several commonly used encoder-based language models for automated scoring of long essays. The performance of these trained models was evaluated and compared with ensemble models built upon the base language models, which have a token limit of 512. The experimented models include BERT-based models (BERT, RoBERTa, DistilBERT, and DeBERTa), ensemble models integrating embeddings from multiple encoder models, and ensembles of feature-based supervised machine learning models, including Gradient-Boosted Decision Trees, eXtreme Gradient Boosting, and Light Gradient Boosting Machine. We trained, validated, and tested each model on a dataset of 17,307 essays, with an 80%/10%/10% split, and evaluated model performance using Quadratic Weighted Kappa. This study revealed that an ensemble-of-embeddings model, which combines multiple pre-trained language model representations with a gradient-boosting classifier, significantly outperforms individual language models at scoring long essays.
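Quadratic Weighted Kappa, the evaluation metric used throughout, measures rater agreement while penalizing each disagreement by the square of its distance on the score scale. The paper does not specify its implementation; as an illustrative sketch, scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes it directly (the scores below are toy values):

```python
from sklearn.metrics import cohen_kappa_score

# Human-assigned and model-predicted essay scores on a 1-6 scale
# (toy values for illustration only).
human = [1, 2, 3, 4, 5, 6, 3, 4]
model = [1, 2, 3, 4, 5, 6, 4, 2]

# weights="quadratic" penalizes a disagreement by the squared
# distance between the two scores, so being off by 2 points costs
# four times as much as being off by 1.
qwk = cohen_kappa_score(human, model, weights="quadratic")
print(round(qwk, 3))
```

A QWK of 1 indicates perfect agreement and 0 indicates chance-level agreement, which is why the metric suits ordinal essay scores better than plain accuracy.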
Problem

Research questions and friction points this paper is trying to address.

automated essay scoring
long-context text
encoder-based language models
feature-based machine learning
model ensemble
Innovation

Methods, ideas, or system contributions that make the work stand out.

ensemble-of-embeddings
automated essay scoring
encoder-based language models
gradient boosting
long-context evaluation