Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle to capture complex grammatical rules when translating extremely low-resource languages, owing to insufficient training data and opaque grammar representations. Method: This paper proposes a syntax-driven two-stage translation framework: (1) constructing ZhuangRules, a modular grammar dataset that decouples grammar-book knowledge into rule retrieval and rule application components; and (2) formalizing grammatical rules as executable code functions, leveraging LLMs' strong structured reasoning over code to jointly enhance rule comprehension and application. Contribution/Results: The approach alleviates the rule-retrieval bottleneck and yields an absolute BLEU improvement of 13.1 points on translation for Zhuang, a severely low-resource language. The results provide empirical evidence that codified syntactic representations improve LLMs' grammatical generalization, supporting controllable, grammar-aware translation in data-scarce scenarios.

📝 Abstract
While large language models (LLMs) have shown promise in translating extremely low-resource languages using resources like dictionaries, the effectiveness of grammar books remains debated. This paper investigates the role of grammar books in translating extremely low-resource languages by decomposing it into two key steps: grammar rule retrieval and application. To facilitate the study, we introduce ZhuangRules, a modularized dataset of grammar rules and their corresponding test sentences. Our analysis reveals that rule retrieval constitutes a primary bottleneck in grammar-based translation. Moreover, although LLMs can apply simple rules for translation when explicitly provided, they encounter difficulties in handling more complex rules. To address these challenges, we propose representing grammar rules as code functions, considering their similarities in structure and the benefit of code in facilitating LLM reasoning. Our experiments show that using code rules significantly boosts both rule retrieval and application, ultimately resulting in a 13.1% BLEU improvement in translation.
Problem

Research questions and friction points this paper is trying to address.

Investigates grammar books' role in translating low-resource languages
Identifies rule retrieval as bottleneck in grammar-based translation
Proposes code-augmented rules to improve retrieval and application
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decompose translation into rule retrieval and application
Represent grammar rules as code functions
Code rules boost retrieval and application performance
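A minimal sketch of what a codified grammar rule might look like. This is illustrative only: the function name, the POS tag set, and the specific rule are assumptions, not the paper's actual implementation. The idea is that a word-order rule from a grammar book (e.g., modifiers follow the noun in Zhuang, unlike English) can be expressed as an executable function rather than prose:

```python
# Illustrative sketch (not from the paper): a grammar rule as a code function.

def rule_noun_adjective(tokens, pos_tags):
    """Reorder adjective-noun pairs into noun-adjective order.

    Toy example of a codified word-order rule: in Zhuang, as in many
    Tai languages, modifiers typically follow the noun. Each (ADJ, NOUN)
    pair in the input is swapped to (NOUN, ADJ).
    """
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pos_tags[i] == "ADJ" and pos_tags[i + 1] == "NOUN":
            out.extend([tokens[i + 1], tokens[i]])  # noun first, then adjective
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(rule_noun_adjective(["red", "flower"], ["ADJ", "NOUN"]))  # ['flower', 'red']
```

Encoding the rule this way makes its structure explicit (trigger condition plus transformation), which is the property the paper argues LLMs can reason over more reliably than prose descriptions.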
Chen Zhang
Wangxuan Institute of Computer Technology, Peking University
Jiuheng Lin
Peking University
Xiao Liu
Wangxuan Institute of Computer Technology, Peking University
Zekai Zhang
Wangxuan Institute of Computer Technology, Peking University
Yansong Feng
Peking University