AI Summary
This work addresses the lack of efficient, real-time interpretability and behavioral control mechanisms for large language models deployed across multiple GPUs. We propose a scalable, activation-level interpretability and steering system designed for multi-GPU environments, which, for the first time, enables full-layer activation trajectory capture and position-tagged steering vector injection without requiring fine-tuning or additional forward passes. Leveraging distributed activation caching, post-LayerNorm vector injection, and logit lens-based trajectory tracking, our system supports LLaMA-3.1 and Qwen-3 series models, reducing activation memory usage by up to 7× and increasing throughput by 41× under identical hardware constraints. It maintains processing speeds of 20–100 tokens per second on sequences up to 1,500 tokens, achieving an average steering efficacy (measured by steering slope) of 0.702.
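As a rough illustration of the logit-lens trajectory tracking mentioned above (a minimal numpy sketch, not the repository's actual API; the function name, tensor shapes, and the RMSNorm epsilon are assumptions), each layer's hidden state is normalized and projected through the unembedding matrix to read off an intermediate next-token prediction:

```python
import numpy as np

def logit_lens_trajectory(hidden_states, W_U, gamma, eps=1e-6):
    """Project every layer's hidden state through the unembedding
    to recover the model's intermediate next-token predictions.

    hidden_states: (num_layers, d_model) activations at one position
    W_U:           (d_model, vocab_size) unembedding matrix
    gamma:         (d_model,) final RMSNorm scale (LLaMA/Qwen style)
    Returns a list of argmax token ids, one per layer.
    """
    preds = []
    for h in hidden_states:
        # Apply the final RMSNorm before unembedding, as in
        # LLaMA-3.1 / Qwen-3 style architectures.
        h_norm = h / np.sqrt(np.mean(h ** 2) + eps) * gamma
        logits = h_norm @ W_U
        preds.append(int(np.argmax(logits)))
    return preds

# Toy example with random weights.
rng = np.random.default_rng(0)
num_layers, d_model, vocab = 4, 8, 16
trajectory = logit_lens_trajectory(
    rng.standard_normal((num_layers, d_model)),
    rng.standard_normal((d_model, vocab)),
    np.ones(d_model),
)
```

Capturing this trajectory at every layer and position is what drives the activation-memory cost that the distributed caching design is meant to contain.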
Abstract
The most capable large language models typically require multiple GPUs to host, yet existing tooling for understanding and steering these models supports the multi-GPU setting far less well than the single-GPU setting. We present a practical implementation of activation-level interpretability (logit lens) and steering (steering vectors) that scales to multi-GPU language models. Our system's design choices reduce activation memory by up to 7× and increase throughput by up to 41× over a baseline on identical hardware. We demonstrate the method on LLaMA-3.1 (8B, 70B) and Qwen-3 (4B, 14B, 32B), sustaining 20–100 tokens/s while collecting full layer-wise activation trajectories for sequences of 1,500 tokens. Using label-position steering vectors injected post-LayerNorm, we show controllable, monotonic shifts in model outputs with a mean steerability slope of 0.702 across the evaluated datasets, without fine-tuning or additional forward passes. We release detailed benchmarks, ablations, and a reproducible instrumentation recipe to enable practical interpretability and real-time behavioral control for frontier LLMs at https://github.com/Devdesai1901/LogitLense.
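The label-position injection described above can be sketched as follows (a minimal numpy illustration under assumed shapes, not the repository's implementation; the function name and the idea of passing explicit position indices are assumptions). A scaled steering vector is added to the post-LayerNorm activations only at the tagged positions, leaving the rest of the sequence untouched:

```python
import numpy as np

def inject_steering_vector(h_post_ln, v, alpha, positions):
    """Add a scaled steering vector to post-LayerNorm activations
    at tagged (e.g. label) positions only.

    h_post_ln: (seq_len, d_model) activations after a LayerNorm
    v:         (d_model,) steering vector
    alpha:     scalar steering strength
    positions: iterable of sequence indices to steer
    """
    out = h_post_ln.copy()
    for p in positions:
        # Only the tagged positions receive the shift; all other
        # positions pass through unchanged.
        out[p] = out[p] + alpha * v
    return out

# Toy example: steer positions 1 and 3 of a 5-token sequence.
rng = np.random.default_rng(1)
h = rng.standard_normal((5, 4))
v = rng.standard_normal(4)
steered = inject_steering_vector(h, v, 2.0, [1, 3])
```

Because the shift is linear in `alpha`, sweeping `alpha` is what produces the monotonic output changes summarized by the steerability slope.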