🤖 AI Summary
This work investigates how large language models (LLMs) compose subword representations into word-level embeddings, focusing on three core dimensions: structural similarity, semantic decomposability, and form preservation. Using layer-wise probing experiments, we systematically analyze the response patterns of over 20 representative LLMs across five model families to subword structure, semantic content, and character-sequence length at each layer. We identify three distinct subword composition mechanisms—structure-dominant, semantics-sensitive, and form-conservative—exhibiting systematic differences in representational evolution trajectories, degree of semantic disentanglement, and robustness to input form perturbations. To our knowledge, this is the first empirical study to establish a typology of subword composition strategies in LLMs. Our findings provide an interpretable, generalizable theoretical framework and empirical foundation for understanding how LLMs construct lexical representations internally.
📝 Abstract
Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis suggests that the five LLM families we study can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) strong probing performance when testing, layer by layer, sensitivity to semantic decomposability; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character-sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight distinct compositional patterns in how LLMs encode and integrate subword information.
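The abstract does not spell out how the layer-wise structural-similarity probe is computed. As a minimal sketch, assuming mean pooling as the subword composition function and cosine similarity as the comparison metric (the paper's actual choices may differ), one can trace a similarity trajectory across layers like this:

```python
import numpy as np

def layerwise_composition_similarity(subword_states, word_states):
    """For each layer, mean-pool the subword hidden states into a composed
    vector and compare it to the whole-word representation via cosine
    similarity, tracing how structural similarity evolves across layers.

    subword_states: shape (num_layers, num_subwords, hidden_dim)
    word_states:    shape (num_layers, hidden_dim)
    Returns a (num_layers,) array with one similarity score per layer.
    """
    composed = subword_states.mean(axis=1)                       # (L, D)
    dots = (composed * word_states).sum(axis=1)                  # (L,)
    norms = np.linalg.norm(composed, axis=1) * np.linalg.norm(word_states, axis=1)
    return dots / np.clip(norms, 1e-12, None)

# Toy illustration: random activations stand in for real model hidden states.
rng = np.random.default_rng(0)
states = rng.normal(size=(12, 3, 64))   # 12 layers, 3 subwords, hidden dim 64
word = states.mean(axis=1)              # here identical to the composition
print(layerwise_composition_similarity(states, word))  # all values ~1.0
```

In practice the hidden states would come from a real model (e.g., via a Transformers forward pass with hidden states enabled), with the whole-word representation obtained from a single-token encoding of the same word; the resulting per-layer curves are what would distinguish the three evolution patterns the paper reports.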