Steering Conceptual Bias via Transformer Latent-Subspace Activation

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether scientific code generation can be controllably steered toward a specific programming language (e.g., C++) by targeted intervention in the latent subspaces of large language models. To this end, the authors propose G-ACT, a gradient-refined adaptive activation steering framework: it identifies critical subspaces through MLP neuron attribution and clustering analysis, then applies fine-grained latent-space interventions in the Transformer using lightweight, online-trained probes and per-layer activation injection. Crucially, G-ACT selects steering directions dynamically per prompt, improving both the controllability and the interpretability of code generation. On LLaMA-3.2 3B, the method raises average probe classification accuracy by 15% (by 61.5% in the early layers 0-6) over the standard ACT framework; targeted injections remain effective on the LLaMA-3.3 70B variant, indicating scalability and generalization across model sizes.
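As a rough illustration of the clustering step described above (not the paper's released code): per-prompt activation differences can be grouped by a minimal k-means into a few candidate steering directions. All data, dimensions, and the two-mode structure below are synthetic assumptions for the sketch.

```python
import numpy as np

# Hypothetical per-prompt activation differences (target-language prompt minus
# baseline), one row per prompt, one column per hidden dimension. In G-ACT these
# would come from the model's residual stream; here they are synthetic, with
# two well-separated modes standing in for two steering "concepts".
rng = np.random.default_rng(0)
d_model = 16
diffs = np.concatenate([
    rng.normal(loc=+1.0, scale=0.1, size=(20, d_model)),
    rng.normal(loc=-1.0, scale=0.1, size=(20, d_model)),
])

def kmeans(x, k, iters=50):
    """Minimal k-means: cluster activation differences into k directions."""
    centers = x[[0, -1]].copy()  # deterministic init: one point from each end
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = []
        for j in range(k):
            members = x[labels == j]
            new.append(members.mean(0) if len(members) else centers[j])
        centers = np.stack(new)
    return centers, labels

centers, labels = kmeans(diffs, k=2)
# Each cluster centre, normalised, is a candidate steering direction.
steering_dirs = centers / np.linalg.norm(centers, axis=1, keepdims=True)
```

Each normalised centre can then serve as one entry in the small set of steering vectors that a probe later chooses between.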

📝 Abstract
This work examines whether activating latent subspaces in large language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify their baseline bias among four programming languages. A static neuron-attribution method, perturbing the highest-activated MLP weight for a C++ or CPP token, proved brittle and generalized poorly across prompt styles and model scales. To address these limitations, a gradient-refined adaptive activation steering framework (G-ACT) was developed: per-prompt activation differences are clustered into a small set of steering directions, and lightweight per-layer probes are trained and refined online to select the appropriate steering vector. On LLaMA-3.2 3B, this approach reliably biases generation toward C++, increasing average probe classification accuracy by 15%, and by 61.5% in the early layers (0-6), compared with the standard ACT framework. For LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted injections at key layers still improve language selection. Although per-layer probing introduces a modest inference overhead, the method remains practical by steering only a subset of layers, and it yields reproducible model behavior. These results demonstrate a scalable, interpretable, and efficient mechanism for concept-level control in practical agentic systems.
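The select-then-inject step the abstract describes can be sketched in a few lines. Everything here (the probe weights, the two steering vectors, the hidden state, and the strength `alpha`) is a toy stand-in chosen for illustration, not the paper's implementation:

```python
import numpy as np

d_model = 8
# Two candidate steering directions (e.g. one per activation-difference
# cluster) and a trivial linear probe; all values are placeholders.
steering_vectors = np.eye(2, d_model)
probe_w = np.ones(d_model) / d_model
probe_b = 0.0

def select_and_steer(hidden, alpha=4.0):
    """Use the probe's logit to pick a direction, then add it to the hidden state."""
    logit = hidden @ probe_w + probe_b
    idx = int(logit > 0)  # binary choice between the two directions
    return hidden + alpha * steering_vectors[idx]

h = np.full(d_model, 0.5)   # stand-in for one layer's residual activation
steered = select_and_steer(h)
```

In a real model the same operation would run inside the forward pass of the chosen layers only, which is why steering a subset of layers keeps the inference overhead modest.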
Problem

Research questions and friction points this paper is trying to address.

Steering LLM code generation bias towards specific programming languages
Overcoming brittleness in neuron-attribution methods for activation steering
Developing efficient gradient-refined activation steering for concept-level control
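One ingredient named above, a lightweight probe refined online, can be sketched as a single pass of logistic-regression SGD over layer activations. The activations below are synthetic and the hyperparameters are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Synthetic "layer activations" for two classes (e.g. C++-leaning vs. not).
rng = np.random.default_rng(1)
d = 16
X = np.concatenate([rng.normal(+0.5, 1.0, (200, d)),
                    rng.normal(-0.5, 1.0, (200, d))])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Single online pass of logistic-regression SGD: the "lightweight probe".
w, b, lr = np.zeros(d), 0.0, 0.1
for i in rng.permutation(len(X)):
    p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # predicted P(class 1)
    g = p - y[i]                               # log-loss gradient factor
    w -= lr * g * X[i]
    b -= lr * g

acc = (((X @ w + b) > 0) == (y == 1)).mean()   # probe's training accuracy
```

Because the probe is a single linear layer updated example by example, it can keep adapting during inference at low cost, which is the sense in which the probes are "refined online".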
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activate latent subspaces to steer code generation
Use gradient-refined adaptive activation steering framework
Improve language selection via targeted layer injections