🤖 AI Summary
This work investigates the intrinsic organizational structure of concepts within the input embedding layers of large language models (LLMs), its alignment with human cognition and predefined semantics, and its potential for mitigating ethnic bias. We propose a novel embedding-structure analysis framework integrating fuzzy graph modeling, k-nearest neighbor analysis, and community detection. Applied across multiple mainstream LLMs, it reveals for the first time that input embeddings naturally form hierarchical, topologically ordered, and highly cross-model-aligned semantic communities, exhibiting significant structural correspondence with human conceptual organization. Furthermore, targeted intervention in concept grouping within the embedding space yields substantial reductions in ethnicity-related bias on downstream tasks. This study provides the first empirical evidence that input embeddings inherently encode semantic structure that is both interpretable and open to intervention, establishing a new paradigm for bias mitigation grounded in embedding-space semantics rather than post-hoc calibration or fine-tuning.
📝 Abstract
This paper shifts focus to the often-overlooked input embeddings: the initial token representations fed into transformer blocks. Using fuzzy graph modeling, k-nearest neighbor (k-NN) analysis, and community detection, we analyze embeddings from diverse LLMs and find significant categorical community structure aligned both with predefined concepts and with human-defined categories. We observe that these groupings exhibit within-cluster organization (such as hierarchies and topological ordering), and hypothesize a fundamental structure that precedes contextual processing. To further probe the conceptual nature of these groupings, we examine cross-model alignment of input embeddings across different LLM families, observing a medium to high degree of alignment. Furthermore, we provide evidence that manipulating these groupings can play a functional role in mitigating ethnicity bias in LLM tasks.
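The core analysis pipeline (build a k-NN similarity graph over embedding vectors, then detect communities in it) can be sketched as follows. This is an illustrative toy, not the paper's configuration: the synthetic two-cluster "embedding" data, the choice of k, the cosine-similarity weighting, and the greedy-modularity detector are all assumptions standing in for the actual models and the fuzzy-graph construction.

```python
# Sketch: k-NN graph over toy "input embedding" rows + community detection.
# The data, k, and detector choice are illustrative assumptions.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(0)
# Two synthetic "concept clusters" standing in for rows of an embedding matrix.
emb = np.vstack([
    rng.normal(loc=[1, 0, 0, 0, 0, 0, 0, 0], scale=0.1, size=(10, 8)),
    rng.normal(loc=[0, 1, 0, 0, 0, 0, 0, 0], scale=0.1, size=(10, 8)),
])

# Pairwise cosine similarity between all embedding rows.
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = normed @ normed.T
np.fill_diagonal(sim, -np.inf)  # exclude self-neighbors

k = 5
G = nx.Graph()
G.add_nodes_from(range(len(emb)))
for i in range(len(emb)):
    for j in np.argsort(sim[i])[-k:]:  # k nearest neighbors of row i
        G.add_edge(i, int(j), weight=float(sim[i, j]))

# Communities in the k-NN graph recover the two planted clusters here.
communities = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])
```

With well-separated clusters like these, every node's k nearest neighbors fall inside its own cluster, so the detected communities never mix the two groups; on real embedding matrices the paper's fuzzy-graph weighting would replace the hard k-NN cutoff used here.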