AI Summary
This study investigates the selection mechanisms underlying morpheme combination in language evolution, addressing why certain forms are retained while other viable alternatives are discarded. Building on the Rational Speech Act (RSA) framework, it presents a computational model that formalizes the trade-off between production efficiency and semantic informativeness in the historical development of lexical structures. Using a time-indexed lexicon built from the COHA and COCA corpora, the authors construct a model that jointly weighs semantic informativeness and production cost. Evaluated on 4,323 English compounds and derivations, the model outperforms baselines that consider only semantics or only cost on MRR and Acc@k, and its advantage over the semantic-only baseline grows as the candidate set expands. These findings suggest that lexicalization reflects a trade-off between communicative efficiency and expressive power.
Abstract
Human languages expand vocabularies by combining existing morphemes rather than inventing arbitrary forms. Communicative efficiency shapes lexical systems at multiple levels (Gibson et al., 2019), yet morphological composition -- combining morphemes through compounding or affixation -- has rarely been modeled as a historically situated speaker choice among competing morpheme sequences, leaving unanswered why a language settles on one morpheme combination over other plausible alternatives. We ask whether a trade-off between listener recoverability and speaker production cost can predict attested compositions over contemporaneously available alternatives. Here we show, within the Rational Speech Act (RSA) framework (Frank & Goodman, 2012; Goodman & Frank, 2016) and using a time-indexed lexicon constructed from the Corpus of Historical American English (COHA) and the Corpus of Contemporary American English (COCA), that across 4,323 naturally occurring English compounds and derivations spanning 1820--2019, attested compositions are systematically ranked above unattested alternatives generated from contemporaneously available morphemes. Models integrating semantic informativeness with production cost outperform semantic-only and cost-only baselines on Mean Reciprocal Rank (MRR) and top-k accuracy (Acc@k). The advantage of the Pragmatic Speaker model ($S_1$) over the semantic-only baseline grows as the candidate set expands, where meaning alone leaves morphological choice underdetermined. These findings suggest that lexicalization reflects a communicative trade-off between expressiveness and efficiency, extending rational accounts of communication from utterance-level choice to the internal structure of words.
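To make the ranking setup concrete, the following is a minimal sketch of how a pragmatic speaker score and the two evaluation metrics could be computed. It assumes the standard RSA speaker form, $S_1(u \mid m) \propto \exp(\alpha\,[\log L_0(m \mid u) - \lambda\, c(u)])$, which may differ from the paper's exact parameterization. The function names, the weight `cost_weight` (the $\lambda$ above), the candidate forms, and all numeric values are hypothetical and chosen only for illustration; in particular, the cost values are a stand-in proxy (form length divided by 10), not the paper's cost measure.

```python
import math


def s1_scores(candidates, literal_listener, cost, alpha=1.0, cost_weight=1.0):
    """Pragmatic speaker distribution over candidate forms for one meaning.

    literal_listener: form -> L_0(meaning | form), a probability in (0, 1]
    cost:             form -> non-negative production cost
    alpha, cost_weight are assumed hyperparameters, not taken from the paper.
    """
    utilities = {
        u: alpha * (math.log(literal_listener[u]) - cost_weight * cost[u])
        for u in candidates
    }
    # Subtract the maximum utility before exponentiating for numerical stability.
    max_utility = max(utilities.values())
    unnormalized = {u: math.exp(v - max_utility) for u, v in utilities.items()}
    total = sum(unnormalized.values())
    return {u: v / total for u, v in unnormalized.items()}


def rank_of(attested, scores):
    """1-based rank of the attested form when candidates are sorted by score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(attested) + 1


def mrr(ranks):
    """Mean Reciprocal Rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def acc_at_k(ranks, k):
    """Fraction of items whose attested form is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy example (invented numbers): three candidate forms for one meaning.
candidates = ["teacher", "teachist", "teachingperson"]
l0 = {"teacher": 0.6, "teachist": 0.5, "teachingperson": 0.7}
cost = {"teacher": 0.7, "teachist": 0.8, "teachingperson": 1.4}

s1 = s1_scores(candidates, l0, cost)
semantic_only_rank = rank_of("teacher", l0)  # ranks by L_0 alone
speaker_rank = rank_of("teacher", s1)        # ranks by informativeness minus cost

print("semantic-only rank:", semantic_only_rank)
print("S_1 rank:", speaker_rank)
print("MRR over both ranks:", mrr([semantic_only_rank, speaker_rank]))
print("Acc@1 over both ranks:", acc_at_k([semantic_only_rank, speaker_rank], 1))
```

In this toy case the longest form is the most recoverable, so a semantic-only ranking places it above the attested form, while the cost term moves the attested form to the top. This mirrors, in miniature, the situation the abstract describes where meaning alone leaves the choice underdetermined.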