Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI

📅 2025-10-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LLM-based personalized chatbots lack behavioral predictability, leading to safety and usability issues such as sycophancy, toxicity, and inconsistency. To address this, we propose a neural-transparency framework designed for non-expert users: it extracts interpretable behavioral trait vectors by contrasting neural activations between system prompts that elicit opposing behaviors, normalizes the resulting scores so traits are comparable to one another, and visualizes predicted behaviors via an interactive sunburst diagram. This work pioneers the translation of mechanistic interpretability into a user-facing transparency interface. A user study reveals systematic misjudgments of trait activations for 11 of 15 analyzable traits, while our tool significantly increases user trust, demonstrating the practical viability of neural transparency in real-world AI deployment.
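
A minimal sketch of the contrastive extraction step, assuming a PyTorch/Transformers stack: a trait vector is taken as the difference between final-token hidden states for two system prompts that elicit opposing behaviors. The model (gpt2), layer index, and prompt wording below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of contrastive trait-vector extraction.
# Assumptions: gpt2 as the base model, layer 8 as the probed layer,
# and hand-written contrastive prompts; none of these come from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 8  # assumed intermediate layer

def final_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the prompt's final token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]  # shape: (hidden_dim,)

# Contrastive system prompts that elicit opposing behaviors for one trait.
positive = "You are an extremely sycophantic assistant who agrees with everything."
negative = "You are a blunt assistant who never flatters the user."

# The trait vector is the difference of the two final-token activations.
sycophancy_vector = final_token_activation(positive) - final_token_activation(negative)
```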

📝 Abstract
Millions of users now design personalized LLM-based chatbots that shape their daily interactions, yet they can only loosely anticipate how their design choices will manifest as behaviors in deployment. This opacity is consequential: seemingly innocuous prompts can trigger excessive sycophancy, toxicity, or inconsistency, degrading utility and raising safety concerns. To address this issue, we introduce an interface that enables neural transparency by exposing language model internals during chatbot design. Our approach extracts behavioral trait vectors (empathy, toxicity, sycophancy, etc.) by computing differences in neural activations between contrastive system prompts that elicit opposing behaviors. We predict chatbot behaviors by projecting the system prompt's final token activations onto these trait vectors, normalizing for cross-trait comparability, and visualizing results via an interactive sunburst diagram. To evaluate this approach, we conducted an online user study using Prolific to compare our neural transparency interface against a baseline chatbot interface without any form of transparency. Our analyses suggest that users systematically miscalibrated AI behavior: participants misjudged trait activations for eleven of fifteen analyzable traits, motivating the need for transparency tools in everyday human-AI interaction. While our interface did not change design iteration patterns, it significantly increased user trust and was enthusiastically received. Qualitative analysis indicated that users had nuanced experiences with the visualization that may enrich future work designing neurally transparent interfaces. This work offers a path for how mechanistic interpretability can be operationalized for non-technical users, establishing a foundation for safer, more aligned human-AI interactions.
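
The prediction step described above can be illustrated with a short sketch: the designed system prompt's final-token activation is projected onto each trait direction, and the per-trait scores are put on a common scale. The z-score normalization against a set of reference prompts is an assumption; the abstract only states that projections are normalized for cross-trait comparability.

```python
# Hedged sketch of behavior prediction from trait-vector projections.
# The z-score normalization over reference prompts is an assumption.
import torch

def trait_scores(prompt_activation: torch.Tensor,
                 trait_vectors: dict[str, torch.Tensor],
                 reference_activations: torch.Tensor) -> dict[str, float]:
    """Return one normalized activation score per behavioral trait.

    prompt_activation:     (hidden_dim,) final-token activation of the new system prompt
    trait_vectors:         trait name -> (hidden_dim,) contrastive direction
    reference_activations: (n_prompts, hidden_dim) activations of reference prompts,
                           used to put every trait on a common scale
    """
    scores = {}
    for trait, direction in trait_vectors.items():
        unit = direction / direction.norm()        # unit-length trait direction
        raw = prompt_activation @ unit             # scalar projection for this prompt
        ref = reference_activations @ unit         # projections of the reference prompts
        scores[trait] = ((raw - ref.mean()) / ref.std()).item()  # z-score per trait
    return scores
```

Under this reading, a score near zero would indicate typical behavior for a trait, while large positive or negative values flag system prompts worth revisiting.
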
Problem

Research questions and friction points this paper is trying to address.

Users cannot anticipate how chatbot design choices manifest as behaviors
Seemingly innocuous prompts can trigger excessive sycophancy, toxicity, or inconsistency
Current systems lack transparency for personalized AI interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts behavioral trait vectors by contrasting neural activations across system prompts
Predicts chatbot behaviors by projecting system-prompt activations onto the trait vectors
Visualizes normalized trait scores via an interactive sunburst diagram (see the sketch below)
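
One plausible way to render the normalized trait scores as an interactive sunburst is with plotly; the trait-to-group assignment and the shift to non-negative wedge sizes below are illustrative assumptions, since the paper does not specify how its diagram is built.

```python
# Hedged sketch of the sunburst visualization using plotly (an assumption;
# the paper does not name its plotting library). Scores stand in for the
# normalized trait projections; groups are an illustrative categorization.
import pandas as pd
import plotly.express as px

scores = {"empathy": 1.2, "sycophancy": -0.4, "toxicity": -1.1, "consistency": 0.6}
groups = {"empathy": "social", "sycophancy": "social",
          "toxicity": "safety", "consistency": "reliability"}

df = pd.DataFrame({
    "trait": list(scores),
    "group": [groups[t] for t in scores],
    # Sunburst wedge sizes must be non-negative, so shift the z-scores upward.
    "size": [s - min(scores.values()) + 0.1 for s in scores.values()],
    "score": list(scores.values()),
})

fig = px.sunburst(df, path=["group", "trait"], values="size",
                  color="score", color_continuous_scale="RdBu")
fig.show()
```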