The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project

📅 2025-05-26
🤖 AI Summary
Low-resource languages such as Tagalog have very few dependency treebanks, and Tagalog’s grammatical properties (a focus system, verb centrality, and free word order) make consistent annotation difficult. To address this, we introduce UD-NewsCrawl, the largest publicly available Universal Dependencies (UD) treebank for Tagalog to date (15.6k sentences), built from news crawl data through systematic corpus collection, preprocessing, expert-driven manual annotation, and multi-stage quality control. We propose, for the first time, a linguistically grounded UD annotation guideline tailored to Tagalog’s grammatical properties, and we release all data and annotation tools openly. Dependency parsing experiments using BERT and XLM-R show that training on UD-NewsCrawl improves state-of-the-art models’ labeled attachment score (LAS) by up to 8.2 percentage points. The work advances syntactic parsing research for low-resource languages and offers a reusable methodological framework for treebank development in morphosyntactically complex, under-resourced languages.
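Since the headline result is reported in LAS, a minimal sketch of how unlabeled and labeled attachment scores are commonly computed for one sentence may help. It assumes gold and predicted tokens are already aligned one to one; the function name `attachment_scores` and the three-token example are hypothetical and are not taken from the paper or from UD-NewsCrawl. Standard UD evaluation additionally handles tokenization mismatches, which this sketch ignores.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency relation label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions in [0, 1] for token-aligned arc lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted arcs must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens whose predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), labeled_hits / len(gold)


# Hypothetical three-token sentence (illustrative only).
gold = [(0, "root"), (1, "nsubj"), (1, "obj")]
pred = [(0, "root"), (1, "obj"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
```

Here two of the three predicted heads are correct, but only one token has both the correct head and the correct label, so LAS is lower than UAS. A gain of 8.2 percentage points in LAS means that many more tokens per hundred receive a fully correct head and label.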

📝 Abstract
This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss their implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.
Problem

Research questions and friction points this paper is trying to address.

Building UD-NewsCrawl, the largest Tagalog treebank to date
Evaluating transformer-based dependency parsers on Tagalog
Handling Tagalog's distinctive grammatical properties during syntactic annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest Tagalog treebank to date, with 15.6k manually annotated trees
Baseline evaluations using multiple transformer-based models
Discussion of challenges in Tagalog syntactic analysis and their implications for annotation
Authors
Angelina A. Aquino
Electrical and Electronics Engineering Institute, University of the Philippines Diliman
Lester James V. Miranda
University of Cambridge
Natural Language Processing, Machine Learning
Elsie Marie T. Or
Department of Linguistics, University of the Philippines Diliman