Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers

📅 2025-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: It is hard to observe which semantic concepts a large language model (LLM) represents internally, so hallucinated or misleading outputs are difficult to catch. Method: The paper proposes a framework that detects and steers concepts by combining nonlinear predictors over multiple layers. A nonlinear feature learning method identifies an important linear direction for the concept in each layer's activations, and features are then aggregated across layers to build concept detectors and steering mechanisms. The approach covers single concepts, multiple concepts at once, and concepts with numerical attributes (e.g., rating scores). Contributions/Results: (1) State-of-the-art detection of hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks; (2) steering toward concepts that, to the authors' knowledge, have not been considered before, including semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, and poetic/Shakespearean English; (3) publicly released code with a simple API.

📝 Abstract

A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know what they know" and may even be actively misleading. In this work, we give a general method for detecting semantic concepts in the internal activations of LLMs. Furthermore, we show that our methodology can be easily adapted to steer LLMs toward desirable outputs. Our innovations are the following: (1) we use a nonlinear feature learning method to identify important linear directions for predicting concepts from each layer; (2) we aggregate features across layers to build powerful concept detectors and steering mechanisms. We showcase the power of our approach by attaining state-of-the-art results for detecting hallucinations, harmfulness, toxicity, and untruthful content on seven benchmarks. We highlight the generality of our approach by steering LLMs towards new concepts that, to the best of our knowledge, have not been previously considered in the literature, including: semantic disambiguation, human languages, programming languages, hallucinated responses, science subjects, poetic/Shakespearean English, and even multiple concepts simultaneously. Moreover, our method can steer concepts with numerical attributes such as product reviews. We provide our code (including a simple API for our methods) at https://github.com/dmbeaglehole/neural_controllers.
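
To make the two steps in the abstract concrete, here is a minimal NumPy sketch of per-layer direction finding followed by cross-layer aggregation. It is not the code from the linked repository. The abstract does not name the nonlinear learner, so the sketch assumes one possible choice: Gaussian-kernel ridge regression, with the concept direction taken as the top eigenvector of the predictor's average gradient outer product. The median bandwidth heuristic, the least-squares aggregator, the synthetic activations, and all function names are likewise illustrative assumptions.

```python
import numpy as np


def concept_direction(X, y, ridge=0.1):
    """Unit direction for one layer: top eigenvector of the average gradient
    outer product (AGOP) of a Gaussian-kernel ridge fit (an assumed learner).

    X: (n, d) activations from one layer; y: (n,) binary concept labels.
    """
    n = X.shape[0]
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    # Median heuristic for the squared bandwidth (assumption), off-diagonal pairs only.
    bandwidth_sq = np.median(sq_dists[~np.eye(n, dtype=bool)])
    K = np.exp(-sq_dists / (2.0 * bandwidth_sq))
    yc = y - y.mean()
    alpha = np.linalg.solve(K + ridge * np.eye(n), yc)
    # For f(x) = sum_i alpha_i k(x, x_i), the gradient at training point x_j is
    #   sum_i alpha_i k(x_j, x_i) (x_i - x_j) / bandwidth^2.
    W = K * alpha[None, :]
    grads = (W @ X - W.sum(axis=1, keepdims=True) * X) / bandwidth_sq
    agop = grads.T @ grads / n
    direction = np.linalg.eigh(agop)[1][:, -1]  # eigh returns ascending eigenvalues
    # Eigenvector sign is arbitrary; orient it so larger projections mean "concept present".
    proj = X @ direction
    if np.dot(proj - proj.mean(), yc) < 0:
        direction = -direction
    return direction


def layer_features(layer_acts, directions):
    """One projection per layer plus a bias column: shape (n, n_layers + 1)."""
    cols = [X @ v for X, v in zip(layer_acts, directions)]
    return np.column_stack(cols + [np.ones(len(cols[0]))])


def fit_aggregated_detector(layer_acts, y):
    """Find a direction per layer, then aggregate the projections by least squares."""
    directions = [concept_direction(X, y) for X in layer_acts]
    feats = layer_features(layer_acts, directions)
    weights, *_ = np.linalg.lstsq(feats, y.astype(float), rcond=None)
    return directions, weights


def detect(layer_acts, directions, weights):
    """Boolean prediction per example: True where the concept is detected."""
    return layer_features(layer_acts, directions) @ weights > 0.5


def make_synthetic(n=80, d=8, n_layers=3, strength=4.0, seed=0):
    """Toy stand-in for LLM activations: Gaussian noise plus a planted direction."""
    rng = np.random.default_rng(seed)
    y = np.arange(n) % 2
    layer_acts, true_dirs = [], []
    for _ in range(n_layers):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        layer_acts.append(rng.normal(size=(n, d)) + strength * y[:, None] * u)
        true_dirs.append(u)
    return layer_acts, y, true_dirs


if __name__ == "__main__":
    acts, labels, _ = make_synthetic()
    dirs, w = fit_aggregated_detector(acts, labels)
    accuracy = (detect(acts, dirs, w) == labels.astype(bool)).mean()
    print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

With a real model, each entry of `layer_acts` would hold that layer's hidden states for the labeled prompts, and accuracy would be measured on held-out data, not on the training set as in this toy run.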

Problem

Research questions and friction points this paper is trying to address.

Detect semantic concepts in LLM activations

Steer LLMs toward desirable outputs

Aggregate features across layers for detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Nonlinear feature learning

Layer aggregation

Concept steering
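
The summary above does not spell out how a concept direction is used for steering. A common approach, assumed here for illustration, is to add a scaled copy of each layer's direction to that layer's hidden state during the forward pass. The sketch below applies this to a toy residual network in NumPy. The network, the function names, and the coefficient `coef` are hypothetical and are not the API of the paper's repository.

```python
import numpy as np


def steer_hidden_state(hidden, direction, coef):
    """Shift every token's hidden state by coef along the unit concept direction.

    hidden: (seq_len, d) hidden states at one layer; direction: (d,) vector.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + coef * unit


def forward_with_steering(x, layer_weights, directions, coef):
    """Toy residual stack (a stand-in for an LLM): h <- h + tanh(h @ W),
    with the steering shift applied after every layer."""
    h = x
    for W, v in zip(layer_weights, directions):
        h = h + np.tanh(h @ W)
        h = steer_hidden_state(h, v, coef)
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4))                       # 5 tokens, hidden size 4
    weights = [0.1 * rng.normal(size=(4, 4)) for _ in range(3)]
    dirs = [rng.normal(size=4) for _ in range(3)]     # one direction per layer
    steered = forward_with_steering(x, weights, dirs, coef=2.0)
    print(steered.shape)
```

Setting `coef` to zero recovers the unmodified forward pass, a negative value pushes away from the concept, and larger magnitudes steer more strongly. In this sketch the right magnitude would have to be tuned per model.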

🔎 Similar Papers

Aligned at the Start: Conceptual Groupings in LLM Embeddings