🤖 AI Summary
This study investigates the impact of historical text normalization on dating and localization classification of Middle High German charters. Using a digital archival dataset, we compare the performance of support vector machines (SVM), gradient-boosted trees, and BERT-based Transformer models before and after applying linguistics-driven historical normalization. Results show that normalization significantly reduces dating accuracy, confirming that original orthographic variation encodes irreplaceable diachronic cues, while yielding only marginal gains in localization precision. SVM and gradient-boosted trees consistently outperform Transformers, challenging the assumed necessity of Transformer architectures for small-scale historical document tasks. We propose a "selective normalization" strategy that preserves task-critical historical linguistic features, advocating for task-aware preprocessing rather than uniform normalization. This work contributes both empirical evidence against indiscriminate normalization and a methodological framework for principled feature retention in historical text processing.
📝 Abstract
This study examines the impact of historical text normalization on the classification of medieval charters, focusing specifically on document dating and localization. Using a dataset of Middle High German charters from a digital archive, we evaluate several classifiers, both traditional and Transformer-based, with and without normalization. Our results indicate that the applied normalization yields only minimal improvement on localization while reducing dating accuracy, implying that the original texts contain crucial features that normalization obscures. We find that support vector machines and gradient boosting outperform the other models, questioning the effectiveness of Transformers for this use case. These results suggest a selective approach to historical text normalization, emphasizing the importance of preserving textual characteristics that are critical for classification tasks in document analysis.
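To illustrate, the selective approach described above can be sketched as a task-aware token normalizer that skips normalization for features presumed to carry diachronic signal. The variant table and preservation patterns below are hypothetical placeholders for illustration only, not the study's actual linguistics-driven rules:

```python
import re

# Hypothetical, illustrative variant table mapping historical spellings to a
# normalized form. These pairs are placeholders, not the study's actual rules.
NORMALIZATION_TABLE = {
    "vnd": "und",
    "guot": "gut",
}

# Patterns assumed (for this sketch) to mark orthography with dating signal;
# a selective normalizer leaves matching tokens untouched for the dating task.
PRESERVE_PATTERNS = [re.compile(r".*uo.*"), re.compile(r".*æ.*")]

def selective_normalize(token: str, task: str) -> str:
    """Normalize a token unless the task is 'dating' and the token matches a
    pattern presumed to encode diachronic cues."""
    if task == "dating" and any(p.match(token) for p in PRESERVE_PATTERNS):
        return token  # preserve original orthography for dating
    return NORMALIZATION_TABLE.get(token, token)

def normalize_text(text: str, task: str) -> str:
    """Apply selective normalization token by token."""
    return " ".join(selective_normalize(t, task) for t in text.split())
```

Under this sketch, `normalize_text("vnd guot", "localization")` normalizes both tokens, while `normalize_text("vnd guot", "dating")` keeps `guot` in its original orthography.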