Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses Sound Law Induction (SLI) in historical linguistics—the automatic inference of phonological regularities from proto-language–descendant word pairs. It reformulates SLI as a programming-by-examples (PBE) problem, modeling each word pair as an ordered string transformation. The authors propose a formal PBE framework, design four synthetic data generation strategies embodying distinct inductive biases, and introduce a fine-grained distribution alignment technique to improve large language model generalization. Their lightweight, open-source model achieves a 6% absolute improvement in pass rate on standard SLI benchmarks while using only one-third the parameters of the second-best model. Methodologically, the work advances computational historical linguistics by enabling interpretable, rule-based phonological generalization; theoretically, it demonstrates the efficacy of the PBE paradigm for structured symbolic reasoning tasks. The approach provides a scalable, explainable framework for modeling language evolution, bridging formal linguistics and neural sequence modeling.

📝 Abstract
Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs) in this paper. While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results, we create a state-of-the-art (SOTA) open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.
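To make the PBE framing concrete, here is a minimal sketch (not the paper's implementation) of a "sound law" as an ordered sequence of string rewrites, checked PBE-style against proto-language–descendant example pairs. The specific rules and the `apply_sound_laws` helper are illustrative assumptions; they echo a Grimm's-law-style stop shift only as a toy example.

```python
# Hypothetical sound laws: ordered string rewrites applied in sequence.
# Rule order matters in general, which is why sound laws behave like
# small programs rather than unordered rule sets.
RULES = [
    ("p", "f"),   # toy shift: *p > f
    ("t", "th"),  # toy shift: *t > th
    ("k", "h"),   # toy shift: *k > h
]

def apply_sound_laws(word: str, rules: list[tuple[str, str]]) -> str:
    """Apply each rewrite rule, in order, across the whole word."""
    for old, new in rules:
        word = word.replace(old, new)
    return word

def passes_examples(rules, examples) -> bool:
    """PBE-style check: do the induced rules reproduce every
    attested descendant from its reconstructed ancestor?"""
    return all(apply_sound_laws(proto, rules) == descendant
               for proto, descendant in examples)

# Toy example pair (cf. Latin pater ~ English father):
examples = [("pater", "father")]
print(passes_examples(RULES, examples))
```

A program that passes all example pairs is a candidate sound law; the paper's contribution is getting an LLM to induce such programs reliably.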
Problem

Research questions and friction points this paper is trying to address.

Computational Phonology
Machine Learning Efficiency
Historical Linguistics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Similarity Assessment
Example Generation Strategies
Speech Pattern Learning
Atharva Naik
PhD Student, Carnegie Mellon University
LLM4Code · LLM Reasoning · Alignment
Darsh Agrawal
Carnegie Mellon University
Hong Sng
Carnegie Mellon University
Clayton Marr
Ohio State University
Kexun Zhang
Carnegie Mellon University
Nathaniel R. Robinson
Johns Hopkins University
Natural Language Processing · Artificial Intelligence · Algorithms · Computational Mathematics
Kalvin Chang
Carnegie Mellon University
Rebecca Byrnes
Carnegie Mellon University
Aravind Mysore
Carnegie Mellon University
Carolyn Rose
Carnegie Mellon University
David R. Mortensen
Carnegie Mellon University