Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the causal relationship between the intrinsic dimension (ID) of text and interpretable linguistic attributes, revealing genre-level differences across academic abstracts, encyclopedic content, and creative stories, and showing how ID characterizes representational complexity in large language models (LLMs). Methodologically, it integrates cross-encoders, sparse autoencoders (SAEs), and fine-grained linguistic feature modeling to systematically decompose the origins of ID; causal effects are validated via controlled variable manipulation and intervention experiments. Results show genre-stratified ID: lowest in scientific texts (≈8), intermediate in encyclopedic content (≈9), and highest in creative narratives (≈10.5). Formal expression significantly reduces ID, whereas affective and narrative language robustly increases it, demonstrating a causal effect. Crucially, ID is orthogonal to entropy-based metrics, establishing it as an independent, linguistically grounded dimension for quantifying textual complexity.
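The genre-level ID values in the summary (≈8 for scientific text up to ≈10.5 for creative narratives) come from nearest-neighbor estimators applied to LLM representations. This card does not specify which estimator the authors use, so the sketch below stands in with the widely used TwoNN estimator (Facco et al., 2017); the embedding matrix is synthetic and purely illustrative.

```python
# Minimal sketch of the TwoNN intrinsic-dimension estimator (assumed stand-in;
# the paper's exact estimator is not named in this card).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X: np.ndarray) -> float:
    """Maximum-likelihood TwoNN estimate of intrinsic dimension."""
    # Columns of dists: distance to self (0), 1st NN, 2nd NN.
    nn = NearestNeighbors(n_neighbors=3).fit(X)
    dists, _ = nn.kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = r2 / r1                           # ratios follow Pareto(d) under the model
    mu = mu[np.isfinite(mu) & (mu > 1.0)]  # drop duplicates and exact ties
    return len(mu) / np.sum(np.log(mu))    # Pareto-shape MLE = intrinsic dimension

# Example: a 10-dimensional subspace embedded in 768 ambient dimensions,
# mimicking low-ID structure inside high-dimensional LLM embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10)) @ rng.standard_normal((10, 768))
print(f"estimated ID: {twonn_id(X):.1f}")  # close to 10
```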

📝 Abstract
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
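Finding 1 ("after controlling for length, the two are uncorrelated") suggests a simple length-controlled check. A minimal sketch, assuming per-text ID estimates, mean token negative log-likelihood as the entropy proxy, and plain linear residualization; none of these specific choices are confirmed by the abstract.

```python
# Hedged sketch of a length-controlled ID-vs-entropy comparison: regress each
# metric on text length, then correlate the residuals.
import numpy as np

def length_controlled_corr(id_vals, nll_vals, lengths) -> float:
    """Pearson correlation of ID and mean NLL after regressing out length."""
    def residualize(y, x):
        X = np.column_stack([np.ones_like(x), x])   # intercept + length
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta
    ids = np.asarray(id_vals, dtype=float)
    nlls = np.asarray(nll_vals, dtype=float)
    lens = np.asarray(lengths, dtype=float)
    return float(np.corrcoef(residualize(ids, lens), residualize(nlls, lens))[0, 1])

# Toy usage: two metrics driven only by length should decorrelate to ~0.
rng = np.random.default_rng(1)
lengths = rng.integers(50, 500, size=1000).astype(float)
ids = 0.01 * lengths + rng.normal(size=1000)
nlls = -0.002 * lengths + rng.normal(size=1000)
print(length_controlled_corr(ids, nlls, lengths))  # near 0
```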
Problem

Research questions and friction points this paper is trying to address.

Investigating interpretable text properties that determine intrinsic dimension in LLMs
Establishing how intrinsic dimension differs across genres like scientific and creative writing
Identifying causal linguistic features that increase or decrease intrinsic dimension values
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using cross-encoder analysis for interpretable grounding
Applying linguistic features to determine intrinsic dimension
Employing sparse autoencoders to identify causal features (see the steering sketch below)
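Steering experiments of the kind referenced above typically add an SAE feature's decoder direction to a model's residual stream and then re-measure ID on the steered representations. A minimal sketch of that additive pattern; the layer, feature index, steering strength, and tensor shapes here are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of additive SAE-feature steering on residual-stream activations.
import torch

def steer_with_sae_feature(hidden: torch.Tensor, W_dec: torch.Tensor,
                           feature_idx: int, alpha: float = 5.0) -> torch.Tensor:
    """Shift hidden states along one SAE feature's decoder direction.

    hidden: (batch, seq, d_model) activations at some layer
    W_dec:  (n_features, d_model) SAE decoder matrix
    alpha:  steering strength; its sign amplifies or suppresses the feature
    """
    direction = W_dec[feature_idx]              # (d_model,)
    direction = direction / direction.norm()    # unit-norm steering direction
    return hidden + alpha * direction           # broadcasts over batch and seq

# Toy shapes only; a real run would hook a transformer layer during a forward
# pass and then re-estimate ID on the steered representations.
hidden = torch.randn(2, 16, 768)
W_dec = torch.randn(4096, 768)
steered = steer_with_sae_feature(hidden, W_dec, feature_idx=123)
```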
Vladislav Pedashenko
Moscow State University
Laida Kushnareva
Huawei
Natural Language Processing, Topological Data Analysis, Interpretability
Yana Khassan Nibal
Lomonosov Research Institute
Eduard Tulchinskii
Lomonosov Research Institute
Kristian Kuznetsov
Applied AI Institute
machine learning, topological data analysis, natural language processing
Vladislav Zharchinskii
Moscow State University
Yury Maximov
Interdata Astana
Irina Piontkovskaya
Huawei Noah's Ark Lab
natural language processing