Morpheme Induction for Emergent Language

📅 2025-10-03
🤖 AI Summary
This paper addresses unsupervised morpheme induction in emergent languages. We propose CSAR, an unsupervised algorithm that takes parallel meaning–utterance pairs and applies a greedy decomposition weighted by mutual information. CSAR iteratively counts candidate form–meaning pairs, selects the highest-weighted candidate, and removes it from the corpus, so that each form–meaning mapping unit is separated from the rest. To our knowledge, this is the first application of mutual-information-driven greedy decomposition to morpheme discovery in emergent languages, and it offers both interpretability and computational efficiency. Evaluated on procedurally generated data, CSAR outperforms existing baselines; it also makes reasonable predictions on human language data. Applied to emergent language corpora, it quantifies linguistic phenomena such as synonymy and polysemy, showing that it can uncover latent semantic structure. The approach supports morphological analysis in settings where annotated data are scarce or unavailable.

📝 Abstract
We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
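The Count, Select, Ablate, Repeat loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes that forms are single tokens and meanings are single atoms, and it assumes a weight equal to each pair's contribution to the form–meaning mutual information (joint probability times pointwise mutual information). The function name `induce_morphemes`, the `max_morphemes` parameter, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def induce_morphemes(corpus, max_morphemes=10):
    """Greedy form-meaning induction (simplified sketch, not the paper's code).

    corpus: list of (utterance_tokens, meaning_atoms) pairs.
    Returns a list of (token, atom) pairs in the order they were selected.
    """
    # Work on copies so that ablation does not modify the caller's data.
    records = [(list(form), set(meaning)) for form, meaning in corpus]
    inventory = []
    for _ in range(max_morphemes):  # Repeat
        n = len(records)
        form_counts, meaning_counts, pair_counts = Counter(), Counter(), Counter()
        # Count: in how many records each token, atom, and pair occurs.
        for form, meaning in records:
            tokens = set(form)
            form_counts.update(tokens)
            meaning_counts.update(meaning)
            pair_counts.update((t, m) for t in tokens for m in meaning)
        # Select: the pair with the largest positive weight.
        # Assumed weight: p(t, m) * log(p(t, m) / (p(t) * p(m))).
        best, best_weight = None, 0.0
        for (t, m), c in sorted(pair_counts.items()):  # sorted for determinism
            pmi = math.log(c * n / (form_counts[t] * meaning_counts[m]))
            weight = (c / n) * pmi
            if weight > best_weight:
                best, best_weight = (t, m), weight
        if best is None:
            break  # no remaining pair is positively associated
        inventory.append(best)
        # Ablate: remove the selected pair wherever both halves co-occur.
        t, m = best
        for form, meaning in records:
            if t in form and m in meaning:
                form[:] = [x for x in form if x != t]
                meaning.discard(m)
    return inventory


# Hypothetical toy corpus: one token per colour and one per shape.
toy_corpus = [
    (["ba", "ti"], {"red", "circle"}),
    (["ba", "mo"], {"red", "square"}),
    (["ku", "ti"], {"blue", "circle"}),
    (["ku", "mo"], {"blue", "square"}),
]
morphemes = induce_morphemes(toy_corpus)
```

On this toy corpus the sketch recovers the four intended pairs (`ba`/red, `ku`/blue, `ti`/circle, `mo`/square) and then stops, because no positively associated pair remains after ablation. Handling multi-token forms or composite meanings would need a larger candidate space than this simplification covers.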
Problem

Research questions and friction points this paper is trying to address.

Inducing morphemes from emergent language corpora
Validating algorithm performance on human language data
Analyzing linguistic characteristics like synonymy and polysemy
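As one way to picture what quantifying synonymy and polysemy could look like, the sketch below computes two simple rates from an induced inventory of (form, meaning) pairs. The definitions are assumptions made for illustration, not the paper's metrics: synonymy is taken as the share of meanings expressed by more than one form, and polysemy as the share of forms carrying more than one meaning. The function name `synonymy_and_polysemy` and the example inventory are hypothetical.

```python
from collections import defaultdict


def synonymy_and_polysemy(inventory):
    """Return (synonymy_rate, polysemy_rate) for a list of (form, meaning) pairs.

    Assumed definitions (illustrative only):
      synonymy = fraction of meanings that map to more than one form
      polysemy = fraction of forms that map to more than one meaning
    """
    forms_per_meaning = defaultdict(set)
    meanings_per_form = defaultdict(set)
    for form, meaning in inventory:
        forms_per_meaning[meaning].add(form)
        meanings_per_form[form].add(meaning)
    if not inventory:
        return 0.0, 0.0
    synonymy = sum(len(f) > 1 for f in forms_per_meaning.values()) / len(forms_per_meaning)
    polysemy = sum(len(m) > 1 for m in meanings_per_form.values()) / len(meanings_per_form)
    return synonymy, polysemy


# Hypothetical inventory: "red" has two forms, and "ku" has two meanings.
example_inventory = [("ba", "red"), ("zo", "red"), ("ku", "blue"), ("ku", "circle")]
synonymy, polysemy = synonymy_and_polysemy(example_inventory)
```

In this example one of three meanings has two forms and one of three forms has two meanings, so both rates come out to one third.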
Innovation

Methods, ideas, or system contributions that make the work stand out.

Greedy algorithm weights morphemes via form-meaning mutual information
Iteratively selects and removes highest-weighted morpheme pairs
Validated on procedural datasets and human language data