Beyond cognacy

📅 2025-07-02

🤖 AI Summary

Computational phylogenetics in historical linguistics usually relies on expert-annotated cognate sets, which are sparse, costly to produce, and hard to scale across language families. This paper compares that established approach with two fully automated methods that require no manual etymological annotation. The first applies automatic cognate clustering and models unigram/concept features; the second derives its characters from multiple sequence alignments (MSA) produced by a pair-hidden Markov model (pair-HMM). Trees are reconstructed with likelihood-based inference and evaluated against expert classifications from Glottolog and typological data from Grambank; the intrinsic strength of the phylogenetic signal in each character type is also compared. The MSA-based method yields trees that agree more closely with established classifications, predicts typological variation better, and carries a clearer phylogenetic signal. The authors present it as a promising, scalable alternative to cognate-based methods for global-scale studies of language evolution.

📝 Abstract

Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal, suggesting it as a promising, scalable alternative to traditional cognate-based methods. This opens new avenues for global-scale language phylogenies beyond expert annotation bottlenecks.

Problem

Research questions and friction points this paper is trying to address.

Automating phylogenetic signal extraction from lexical data

Comparing MSA-based inference to traditional cognate methods

Scaling language phylogenies beyond expert-annotated cognate sets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated cognate clustering with unigram/concept features

Multiple sequence alignment derived from a pair-hidden Markov model

Phylogenetic signal comparison for language classification
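To make the idea of extracting characters from automatically clustered lexical data more concrete, here is a minimal sketch of one common encoding: turning cognate clusters into a binary presence/absence matrix, the usual input format for likelihood-based tree inference. This is an illustration under stated assumptions, not the paper's implementation. The function name `binary_character_matrix`, the input format, and the toy data (languages "A", "B", "C") are all hypothetical, and the sketch assumes at most one cluster per language and concept.

```python
from collections import defaultdict


def binary_character_matrix(cluster_of):
    """Encode cognate clusters as binary characters.

    cluster_of maps (language, concept) -> cluster id. This input format is
    an assumption made for illustration; it is not taken from the paper.
    Returns the list of characters and a dict language -> character string.
    """
    languages = sorted({lang for lang, _ in cluster_of})
    # One character (matrix column) per (concept, cluster id) pair.
    characters = sorted({(concept, cid) for (_, concept), cid in cluster_of.items()})

    # Record which concepts are attested for each language.
    concepts_by_lang = defaultdict(set)
    for lang, concept in cluster_of:
        concepts_by_lang[lang].add(concept)

    matrix = {}
    for lang in languages:
        row = []
        for concept, cid in characters:
            if concept not in concepts_by_lang[lang]:
                row.append("?")  # no word recorded: treat as missing data
            elif cluster_of[(lang, concept)] == cid:
                row.append("1")  # the language's word belongs to this cluster
            else:
                row.append("0")  # word attested, but in a different cluster
        matrix[lang] = "".join(row)
    return characters, matrix


# Invented toy data: three languages, two concepts, "C" lacks a word for "water".
toy = {
    ("A", "hand"): 0,
    ("B", "hand"): 0,
    ("C", "hand"): 1,
    ("A", "water"): 0,
    ("B", "water"): 1,
}
characters, matrix = binary_character_matrix(toy)
for lang, row in matrix.items():
    print(lang, row)
```

Each row can then be written out in whatever alignment format the chosen inference tool expects. Encoding unattested concepts as "?" rather than "0" keeps gaps in the word lists from being read as evidence that a cognate class was lost.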
