🤖 AI Summary
Subword tokenization models conflate lexical and syntactic learning, contradicting psycholinguistic evidence that lexical acquisition precedes syntactic development. Method: We run psycholinguistic lexical decision tasks, perplexity analyses, and cross-granularity comparative experiments (subword vs. character-level) on neural language models. Contribution/Results: Character-level models consistently distinguish words from nonwords with high accuracy, whereas subword models perform significantly worse. Crucially, we provide the first empirical demonstration that lexical and syntactic learning are temporally and mechanistically separable in character models, where lexical representations emerge earlier and independently, but remain tightly coupled in subword models. These findings challenge the dominant subword modeling paradigm and establish character-level modeling as a more cognitively plausible computational framework for language acquisition.
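To make the lexical decision setup concrete, here is a minimal, self-contained Python sketch of how such a task can be scored with a language model. It assumes the decision score is the mean per-character negative log-likelihood and measures discrimination as pairwise accuracy (the fraction of word/nonword pairs where the word scores lower, which equals the ROC AUC of the score). The toy bigram model, the `CharBigramLM` class, and the word lists are illustrative stand-ins, not the authors' models or stimuli.

```python
# Sketch of a lexical decision task scored with a (toy) character LM.
# A trained neural character LM would replace CharBigramLM in practice.
import math
from collections import defaultdict
from itertools import product

BOS, EOS = "^", "$"

class CharBigramLM:
    """Add-one-smoothed character bigram model standing in for a character LM."""
    def __init__(self, corpus):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = {BOS, EOS}
        for w in corpus:
            chars = [BOS] + list(w) + [EOS]
            self.vocab.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.counts[a][b] += 1

    def nll_per_char(self, word):
        """Mean negative log-likelihood per character: the lexical decision score."""
        chars = [BOS] + list(word) + [EOS]
        total = 0.0
        for a, b in zip(chars, chars[1:]):
            num = self.counts[a].get(b, 0) + 1                 # add-one smoothing
            den = sum(self.counts[a].values()) + len(self.vocab)
            total += -math.log(num / den)
        return total / (len(chars) - 1)

# Tiny illustrative stimuli; real experiments use curated word/nonword lists.
words = ["cat", "dog", "tree", "house", "water", "stone", "river", "cloud"]
nonwords = ["ctq", "zxv", "qqat", "vrtk", "xubw", "klpt", "wqzn", "jfgh"]

lm = CharBigramLM(words)

# Lexical decision as pairwise discrimination: a "correct decision" means
# the real word gets a lower per-character NLL than the nonword.
pairs = list(product(words, nonwords))
correct = sum(lm.nll_per_char(w) < lm.nll_per_char(n) for w, n in pairs)
print(f"pairwise accuracy (AUC): {correct / len(pairs):.2f}")
```

Under this scheme, a model that has learned word forms assigns systematically lower per-character NLL to real words, so its pairwise accuracy approaches 1.0, while a model whose lexical knowledge is entangled with its tokenization discriminates less cleanly.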
📝 Abstract
We study word learning in subword and character language models with the psycholinguistic lexical decision task. While subword LMs struggle to distinguish words from non-words with high accuracy, character LMs solve this task easily and consistently. Furthermore, when we compare word learning with syntactic learning, the two processes are separable in character LMs, where word learning precedes syntactic learning, whereas they are simultaneous in subword LMs. This raises questions about the adequacy of subword LMs for modeling language acquisition and positions character LMs as a viable alternative.