Beyond the Surface: Probing the Ideological Depth of Large Language Models

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) possess stable, deep internal ideological representations rather than merely superficially manipulable political tendencies. Method: We introduce the notion of *ideological depth* and develop a quantitative framework integrating controllability analysis, sparse autoencoder (SAE)-based probing, and targeted ablation experiments to systematically characterize the abstraction, disentanglement, and functional robustness of political features across models. Contribution/Results: We find that low-controllability models exhibit more abstract and disentangled political representations; ablating their core ideological features induces logically consistent belief shifts, whereas the same intervention in shallow models often elicits refusals. Empirically, one model exhibits 7.3× more distinct political features than a comparably sized baseline, revealing fundamental differences in how models internally organize political concepts. This work establishes a novel paradigm for assessing the intrinsic foundations of value alignment in LLMs.
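The controllability analysis rests on activation steering: adding a fixed direction to the residual stream during generation and checking whether the model's stated viewpoint moves. Below is a minimal sketch of that idea; the model (`gpt2`), the layer index, the contrast prompts, and the steering strength are illustrative assumptions, not the paper's actual setup.

```python
# Minimal activation-steering sketch (illustrative; not the paper's setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the two open-source LLMs studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6  # hypothetical mid-depth layer to steer

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER over the prompt's tokens."""
    cache = {}
    def grab(_, __, out):
        cache["h"] = out[0]  # GPT-2 blocks return a tuple; [0] is hidden states
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"][0].mean(dim=0)

# A crude contrastive "political axis" (prompts are illustrative only).
steer_vec = mean_residual("Raising taxes on the wealthy is right.") - \
            mean_residual("Cutting taxes on the wealthy is right.")

ALPHA = 4.0  # steering strength; a real controllability study would sweep this

def steer(_, __, out):
    return (out[0] + ALPHA * steer_vec,) + out[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("My view on taxation is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()
```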

📝 Abstract
Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of "ideological depth" in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations show that one model can contain 7.3x more political features than another model of similar size. These features enable targeted ablation: removing a core political feature in an ideologically "deep" model leads to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a "shallow" model produces an increase in refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.
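The SAE probing described above decomposes residual-stream activations into an overcomplete dictionary of sparse features. A minimal sketch of a standard SAE with an L1 sparsity penalty follows; the dimensions are toy values and random data stands in for the models' actual activations.

```python
# Minimal SAE sketch (standard architecture; toy dimensions and data).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete linear encoder with ReLU, plus a linear decoder."""
    def __init__(self, d_model: int, d_feats: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feats)
        self.dec = nn.Linear(d_feats, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # nonnegative, sparse feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that keeps few features active.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Toy training loop: random data stands in for residual-stream activations
# collected from the probed model.
d_model, d_feats = 768, 8 * 768
sae = SparseAutoencoder(d_model, d_feats)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, d_model)  # placeholder activations
for _ in range(100):
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Candidate "political" features could then be ranked by a simple
# difference of mean activations on political vs. neutral prompts.
```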
Problem

Research questions and friction points this paper is trying to address.

Investigating ideological depth and stability in LLMs
Measuring steerability between liberal and conservative viewpoints
Probing internal political representations using Sparse Autoencoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Measuring ideological steerability via instruction prompting
Probing internal representations using Sparse Autoencoders
Targeted ablation of core political features (a minimal sketch follows this list)
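Given a trained SAE (as in the sketch above), ablating a core political feature amounts to clamping one feature to zero and writing the edited reconstruction back into the residual stream. A hedged sketch, with a hypothetical feature index and hook wiring:

```python
# Feature-ablation sketch (assumes the SparseAutoencoder defined above;
# POLITICAL_FEAT and the hook wiring are hypothetical).
import torch

@torch.no_grad()
def ablate_feature(sae, hidden, feat_idx):
    """Clamp one SAE feature to zero and return edited hidden states.

    hidden: [batch, seq, d_model] residual-stream activations.
    The SAE's reconstruction error is preserved so that only the
    targeted feature's contribution is removed.
    """
    f = torch.relu(sae.enc(hidden))
    error = hidden - sae.dec(f)  # what the SAE fails to reconstruct
    f[..., feat_idx] = 0.0
    return sae.dec(f) + error

# Wired into generation as a forward hook on the probed layer
# (GPT-2-style block output):
# def hook(_, __, out):
#     return (ablate_feature(sae, out[0], POLITICAL_FEAT),) + out[1:]
```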