From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) achieve a human-like trade-off between semantic fidelity and compression efficiency in their internal representations. Method: The authors introduce a quantitative framework grounded in rate-distortion theory and the information bottleneck principle, enabling systematic comparison of LLM embeddings against human categorization behavior on canonical cognitive benchmarks. Contribution/Results: They find that while LLMs capture coarse-grained, human-aligned concepts, they exhibit significantly weaker fine-grained semantic discrimination than humans. Their representations overemphasize statistical compression at the expense of semantic nuance and lack contextual adaptivity. These findings reveal a fundamental cognitive divergence between LLMs and humans in concept formation and establish an interpretable, information-theoretic paradigm for evaluating and improving the semantic representational capacity of language models.
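As a point of reference for the framework described above, the standard Information Bottleneck and rate-distortion objectives it draws on can be written as follows. These are the generic textbook forms, not necessarily the exact functional optimized in the paper.

```latex
% Information Bottleneck (generic form): compress X into a representation C
% while keeping C informative about a relevance variable Y; \beta sets the trade-off.
\min_{p(c \mid x)} \; I(X;C) \;-\; \beta\, I(C;Y)

% Rate-distortion view: the minimum rate (bits spent encoding X as C) that
% keeps expected distortion d(x, \hat{x}_c) below a budget D.
R(D) \;=\; \min_{p(c \mid x)\,:\; \mathbb{E}[d(X,\hat{X}_C)] \le D} \; I(X;C)
```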

📝 Abstract
Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
Problem

Research questions and friction points this paper is trying to address.

Compare human and LLM semantic compression strategies (see the sketch after this list)
Assess LLM ability to capture fine-grained human distinctions
Identify biases in LLM vs human conceptual prioritization
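A minimal sketch of how the comparison in the first item above could be set up, assuming static item embeddings and human category labels are available. Every name and value below (the item list, the stand-in vectors, the label coding) is an illustrative assumption, not the paper's released code or data.

```python
# Illustrative sketch (not the paper's code): compare LLM embedding clusters
# with human category labels for a small set of benchmark nouns.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

# Hypothetical inputs: one embedding per item (e.g., pooled token embeddings
# from some LLM) and the human category assigned to each item.
items = ["robin", "blue jay", "penguin", "hammer", "screwdriver", "wrench"]
embeddings = np.random.default_rng(0).normal(size=(len(items), 768))  # stand-in vectors
human_labels = [0, 0, 0, 1, 1, 1]  # 0 = bird, 1 = tool

# Cluster the embeddings with the same number of categories humans use.
n_categories = len(set(human_labels))
llm_labels = KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit_predict(embeddings)

# Adjusted mutual information scores coarse alignment between the two partitions;
# high values mean the LLM's broad categories match the human ones.
print("alignment (AMI):", adjusted_mutual_info_score(human_labels, llm_labels))
```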
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel information-theoretic framework for comparison
Rate-Distortion Theory and Information Bottleneck principle
Quantitative analysis of token embeddings in LLMs (see the sketch below)
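As a hedged illustration of the kind of quantities such an analysis might track, the sketch below computes a toy complexity term (entropy of a cluster assignment) and a toy distortion term (within-cluster variance) over item embeddings. These proxies are stand-ins chosen for clarity, not the paper's exact definitions of complexity and distortion.

```python
# Illustrative sketch: a crude compression/distortion trade-off for a clustering
# of item embeddings. Lower entropy = more aggressive compression; higher
# within-cluster variance = more semantic distortion.
import numpy as np

def complexity_proxy(labels: np.ndarray) -> float:
    """Entropy (in bits) of the cluster assignment, a rough proxy for representational cost."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def distortion_proxy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared distance of each item to its cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        members = embeddings[labels == c]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total / len(embeddings))

def tradeoff_objective(embeddings: np.ndarray, labels: np.ndarray, beta: float = 1.0) -> float:
    # Smaller is "better" under this toy objective: pay for representational
    # complexity plus beta times the distortion incurred by compressing.
    return complexity_proxy(labels) + beta * distortion_proxy(embeddings, labels)

# Toy demo with random stand-in embeddings and an arbitrary 2-cluster assignment.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 768))
lab = np.array([0, 0, 0, 1, 1, 1])
print("trade-off objective:", tradeoff_objective(emb, lab, beta=0.5))
```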