ActivationReasoning: Logical Reasoning in Latent Activation Spaces

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit opaque and uncontrollable internal reasoning. Method: We explicitly embed interpretable logical reasoning into their latent activation space—moving beyond conventional sparse autoencoders (SAEs), which merely extract fragile, post-hoc concepts. Our approach leverages SAEs to construct a human-readable concept dictionary, dynamically detects activated concepts in real time, maps them to formal logical propositions, and applies symbolic inference rules for multi-step deduction, novel concept synthesis, and behavior steering. Contribution/Results: This is the first work to enable systematic, interpretable, and controllable logical reasoning directly within the latent space. Experiments demonstrate significant improvements across multi-hop reasoning, abstract comprehension, natural language inference, and context-sensitive safety tasks. Reasoning capability scales robustly with task complexity and generalizes across diverse model architectures.

📝 Abstract
Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations: latent concept representations are first identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions: at inference time, AR detects activated concepts and maps them to logical propositions; and (3) Logical reasoning: logical rules are applied over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.
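The three stages described in the abstract can be sketched as a minimal pipeline. This is an illustrative sketch, not the paper's implementation: the concept dictionary, the activation threshold, the Horn-style rule format, and all function names (`detect_propositions`, `forward_chain`) are assumptions made here for clarity.

```python
import numpy as np

# Stage 1 (assumed form): a dictionary mapping SAE feature indices
# to human-readable concept names. Entries here are toy examples.
CONCEPT_DICT = {0: "is_mammal", 1: "is_dog", 2: "has_fur"}

# Horn-style rules over concept propositions: if all body concepts
# hold, the head concept is inferred.
RULES = [({"is_dog"}, "is_mammal"),
         ({"is_mammal"}, "has_fur")]

def detect_propositions(feature_acts, threshold=0.5):
    """Stage 2: map SAE features firing above a threshold to propositions."""
    return {CONCEPT_DICT[i] for i, a in enumerate(feature_acts)
            if i in CONCEPT_DICT and a > threshold}

def forward_chain(props, rules):
    """Stage 3: apply modus ponens repeatedly until a fixed point."""
    derived = set(props)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived

# Toy SAE activations: only the "is_dog" feature fires strongly.
acts = np.array([0.1, 0.9, 0.2])
props = detect_propositions(acts)        # {"is_dog"}
inferred = forward_chain(props, RULES)   # adds "is_mammal", "has_fur"
```

The fixed-point loop is a deliberately simple stand-in for the paper's symbolic inference step; it supports the multi-step deduction the summary mentions (dog → mammal → fur) but nothing beyond propositional Horn rules.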
Problem

Research questions and friction points this paper is trying to address.

Enabling logical reasoning in latent LLM activation spaces
Improving interpretability and control over opaque model behaviors
Addressing fragility of latent features through structured inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embeds logical reasoning in latent activation spaces
Maps latent concepts to logical propositions for inference
Applies logical rules to steer model behavior
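The behavior-steering bullet above can be made concrete with one common mechanism: shifting a hidden state along a concept's SAE decoder direction once an inferred proposition calls for it. The function name, the linear-shift rule, and the toy vectors below are assumptions for illustration; the paper's exact steering procedure may differ.

```python
import numpy as np

def steer_activation(hidden, decoder_dirs, concept_idx, alpha=2.0):
    """Shift a hidden-state vector along one concept's (unit-normalized)
    SAE decoder direction. Positive alpha suppresses the concept here;
    negating alpha would amplify it instead."""
    direction = decoder_dirs[concept_idx]
    direction = direction / np.linalg.norm(direction)
    return hidden - alpha * direction

rng = np.random.default_rng(0)
hidden = rng.normal(size=8)          # toy residual-stream vector
decoder = rng.normal(size=(3, 8))    # toy SAE decoder rows, one per concept

# Suppose logical inference flagged concept 1 (e.g., an unsafe concept):
steered = steer_activation(hidden, decoder, concept_idx=1, alpha=2.0)
```

By construction, the projection of the hidden state onto concept 1's unit direction drops by exactly `alpha`, which is the sense in which the inferred proposition "controls" downstream behavior.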