🤖 AI Summary
This work investigates how large language models (LLMs) compose subword representations into word-level embeddings, focusing on three core dimensions: structural similarity, semantic decomposability, and form preservation. Using layer-wise probing experiments, we systematically analyze the response patterns of over 20 representative LLMs across five model families to subword structure, semantic content, and character-sequence length at each layer. We identify three distinct subword composition mechanisms—structure-dominant, semantics-sensitive, and form-conservative—exhibiting systematic differences in representational evolution trajectories, degree of semantic disentanglement, and robustness to input form perturbations. To our knowledge, this is the first empirical study to establish a typology of subword composition strategies in LLMs. Our findings provide an interpretable, generalizable theoretical framework and empirical foundation for understanding how LLMs construct lexical representations internally.
📝 Abstract
Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis suggests that the five LLM families we study can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) strong probing performance when testing, layer by layer, sensitivity to semantic decomposability; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character-sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight distinct compositional patterns in how LLMs encode and integrate subword information.
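The abstract does not spell out how the layer-wise structural-similarity probe is computed. As a minimal sketch, assuming mean pooling as the subword composition function and cosine similarity as the comparison metric (the paper's actual choices may differ), one can trace a similarity trajectory across layers like this:

```python
import numpy as np

def layerwise_composition_similarity(subword_states, word_states):
    """For each layer, mean-pool the subword hidden states into a composed
    vector and compare it to the whole-word representation via cosine
    similarity, tracing how structural similarity evolves across layers.

    subword_states: shape (num_layers, num_subwords, hidden_dim)
    word_states:    shape (num_layers, hidden_dim)
    Returns a (num_layers,) array with one similarity score per layer.
    """
    composed = subword_states.mean(axis=1)                       # (L, D)
    dots = (composed * word_states).sum(axis=1)                  # (L,)
    norms = np.linalg.norm(composed, axis=1) * np.linalg.norm(word_states, axis=1)
    return dots / np.clip(norms, 1e-12, None)

# Toy illustration: random activations stand in for real model hidden states.
rng = np.random.default_rng(0)
states = rng.normal(size=(12, 3, 64))   # 12 layers, 3 subwords, hidden dim 64
word = states.mean(axis=1)              # here identical to the composition
print(layerwise_composition_similarity(states, word))  # all values ~1.0
```

In practice the hidden states would come from a real model (e.g., via a Transformers forward pass with hidden states enabled), with the whole-word representation obtained from a single-token encoding of the same word; the resulting per-layer curves are what would distinguish the three evolution patterns the paper reports.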