Lexicalized Constituency Parsing for Middle Dutch: Low-resource Training and Cross-Domain Generalization

📅 2026-01-11

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses neural constituency parsing for Middle Dutch, a low-resource and highly heterogeneous historical language for which data scarcity and poor cross-domain generalization have limited parser performance. The authors adapt a transformer-based constituency parser to Middle Dutch and train it jointly with higher-resource auxiliary languages; this raises F1 by up to 0.73, with the largest gains coming from languages that are geographically and temporally closer to Middle Dutch. Using newly annotated data from additional domains, they find that fine-tuning and data combination yield comparable improvements, and the neural parser consistently outperforms the PCFG-based parser currently used for Middle Dutch. Experiments with feature-separation techniques for domain adaptation indicate that roughly 200 annotated examples per domain are the minimum needed to improve cross-domain performance effectively.

📝 Abstract

Recent years have seen growing interest in applying neural networks and contextualized word embeddings to the parsing of historical languages. However, most advances have focused on dependency parsing, while constituency parsing for low-resource historical languages like Middle Dutch has received little attention. In this paper, we adapt a transformer-based constituency parser to Middle Dutch, a highly heterogeneous and low-resource language, and investigate methods to improve both its in-domain and cross-domain performance. We show that joint training with higher-resource auxiliary languages increases F1 scores by up to 0.73, with the greatest gains achieved from languages that are geographically and temporally closer to Middle Dutch. We then evaluate strategies for leveraging newly annotated data from additional domains and find that fine-tuning and data combination yield comparable improvements. Across these settings, our neural parser consistently outperforms the currently used PCFG-based parser for Middle Dutch. Finally, we explore feature-separation techniques for domain adaptation and demonstrate that a minimum threshold of approximately 200 examples per domain is needed to effectively enhance cross-domain performance.
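
The F1 scores mentioned above are not defined on this page. As a rough illustration only, the sketch below assumes the conventional labeled-bracketing F1 used for constituency parsing (a PARSEVAL-style score over labeled spans); the paper's actual evaluation setup may differ. The nested-tuple tree format, the function names `spans` and `bracket_f1`, and the placeholder tokens `w1`..`w5` are all hypothetical and chosen for this example.

```python
from collections import Counter


def spans(tree):
    """Return a multiset of (label, start, end) for every constituent.

    Assumed tree format (hypothetical): a constituent is a tuple
    (label, child, child, ...) and a token is a plain string.
    """
    found = Counter()

    def walk(node, start):
        if isinstance(node, str):       # a token covers one position
            return start + 1
        label, *children = node
        end = start
        for child in children:          # children cover consecutive positions
            end = walk(child, end)
        found[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return found


def bracket_f1(gold, pred):
    """Labeled-bracketing F1 in [0, 1] between a gold and a predicted tree."""
    g, p = spans(gold), spans(pred)
    matched = sum((g & p).values())     # multiset intersection of labeled spans
    if matched == 0:
        return 0.0
    precision = matched / sum(p.values())
    recall = matched / sum(g.values())
    return 2 * precision * recall / (precision + recall)


# Placeholder trees: the prediction misses the inner NP over w4 w5.
gold = ("S", ("NP", "w1", "w2"), ("VP", "w3", ("NP", "w4", "w5")))
pred = ("S", ("NP", "w1", "w2"), ("VP", "w3", "w4", "w5"))

print(round(bracket_f1(gold, pred), 4))  # precision 3/3, recall 3/4
```

Standard scorers apply further conventions (for example, ignoring part-of-speech preterminals and punctuation), which this simplified sketch leaves out.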

Problem

Research questions and friction points this paper is trying to address.

constituency parsing

Middle Dutch

low-resource

cross-domain generalization

historical languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

constituency parsing

low-resource

cross-domain generalization