Morpheme Induction for Emergent Language

📅 2025-10-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses unsupervised morpheme induction in emergent languages. It proposes CSAR, an algorithm that takes a corpus of parallel utterance-meaning pairs and decomposes it greedily, weighting candidate morphemes by the mutual information between forms and meanings. CSAR iterates over counting, candidate selection, and corpus reduction (Count, Select, Ablate, Repeat) to isolate form-meaning units. To the authors' knowledge, this is the first application of mutual information-driven greedy decomposition to morpheme discovery in emergent languages, and it offers both interpretability and computational efficiency. CSAR is validated on procedurally generated data, where it outperforms baselines from related tasks, and on human language data, where it makes reasonable predictions. Applied to a handful of emergent languages, it quantifies linguistic characteristics such as the degree of synonymy and polysemy. The approach supports morphological analysis in settings where annotated data are scarce or unavailable.

📝 Abstract

We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
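The Count, Select, Ablate, Repeat loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`induce_morphemes`, `ngrams`, `find_form`), the restriction of candidate meanings to single atoms, the n-gram length limit, and the joint-probability-weighted pointwise mutual information used as the weight are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, max_len):
    """Return the distinct contiguous n-grams of `tokens` up to length `max_len`."""
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            found.add(tuple(tokens[i:i + n]))
    return found


def find_form(tokens, form):
    """Return the index of the first occurrence of `form` in `tokens`, or -1."""
    n = len(form)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == form:
            return i
    return -1


def induce_morphemes(corpus, max_len=2, max_morphemes=50):
    """Greedy Count/Select/Ablate/Repeat sketch (hypothetical simplification).

    `corpus` is a list of (utterance, meaning) pairs, where an utterance is a
    sequence of string tokens and a meaning is a set of string atoms.
    """
    records = [(list(utt), set(meaning)) for utt, meaning in corpus]
    total = len(records)
    inventory = []
    for _ in range(max_morphemes):
        # Count: per-record occurrences of forms, meaning atoms, and their pairs.
        form_count, atom_count, pair_count = Counter(), Counter(), Counter()
        for utt, meaning in records:
            forms = ngrams(utt, max_len)
            form_count.update(forms)
            atom_count.update(meaning)
            pair_count.update((f, m) for f in forms for m in meaning)

        # Select: highest weight wins; sorting makes tie-breaking deterministic.
        # Assumed weight: p(f, m) * log(p(f, m) / (p(f) * p(m))).
        best, best_weight = None, 0.0
        for (f, m), c in sorted(pair_count.items()):
            weight = (c / total) * math.log(c * total / (form_count[f] * atom_count[m]))
            if weight > best_weight:
                best, best_weight = (f, m), weight
        if best is None:  # no pair with positive weight remains
            break
        inventory.append(best)

        # Ablate: remove the selected form and atom wherever they co-occur.
        # Deleting tokens can make formerly separate tokens adjacent; a fuller
        # version would track segment boundaries.
        form, atom = best
        for utt, meaning in records:
            i = find_form(utt, form)
            if i >= 0 and atom in meaning:
                del utt[i:i + len(form)]
                meaning.discard(atom)
    return inventory  # Repeat happens via the enclosing loop


toy_corpus = [
    (("ka", "po"), {"red", "circle"}),
    (("ka", "mi"), {"red", "square"}),
    (("lu", "po"), {"blue", "circle"}),
    (("lu", "mi"), {"blue", "square"}),
]
print(induce_morphemes(toy_corpus))
```

On this fully compositional toy corpus, the sketch pairs each token with the single attribute it co-occurs with. The greedy design means each selection is final, so an early wrong choice is never revisited; that trade-off is what makes the procedure cheap and easy to inspect.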

Problem

Research questions and friction points this paper is trying to address.

Inducing morphemes from emergent language corpora

Validating algorithm performance on human language data

Analyzing linguistic characteristics like synonymy and polysemy
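One simple way to put numbers on synonymy and polysemy, given an induced inventory of form-meaning pairs, is sketched below. The definitions used here (average number of forms per meaning, average number of meanings per form) and the function name `synonymy_and_polysemy` are assumptions for illustration; the paper may operationalize these measures differently.

```python
from collections import defaultdict


def synonymy_and_polysemy(inventory):
    """Return (mean forms per meaning, mean meanings per form) for an inventory.

    `inventory` is an iterable of (form, meaning) pairs. A value of 1.0 for
    either measure means a strictly one-to-one mapping on that side.
    """
    forms_per_meaning = defaultdict(set)
    meanings_per_form = defaultdict(set)
    for form, meaning in inventory:
        forms_per_meaning[meaning].add(form)
        meanings_per_form[form].add(meaning)
    if not forms_per_meaning:
        return 0.0, 0.0
    synonymy = sum(len(fs) for fs in forms_per_meaning.values()) / len(forms_per_meaning)
    polysemy = sum(len(ms) for ms in meanings_per_form.values()) / len(meanings_per_form)
    return synonymy, polysemy


# Hypothetical inventory: "red" has three forms, and "ka" carries two meanings.
example_inventory = [
    (("ka",), "red"),
    (("ro",), "red"),
    (("lu",), "red"),
    (("ka",), "hot"),
    (("po",), "circle"),
]
synonymy, polysemy = synonymy_and_polysemy(example_inventory)
print(synonymy, polysemy)
```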

Innovation

Methods, ideas, or system contributions that make the work stand out.

Greedy algorithm weights morphemes via form-meaning mutual information

Iteratively selects and removes highest-weighted morpheme pairs

Validated on procedural datasets and human language data

🔎 Similar Papers

Unsupervised Morphological Tree Tokenizer