RAPTOR: Ridge-Adaptive Logistic Probes

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficiently and stably extracting concept vectors from intermediate layers of frozen large language models for activation steering. The authors propose RAPTOR, a lightweight probe based on L2-regularized logistic regression: the ridge strength is tuned on validation data, and the concept vector is read off from the normalized probe weights. Using the Convex Gaussian Min-max Theorem (CGMT), they analyze ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, which explains how penalty strength affects probe accuracy and concept-vector stability. Experiments across multiple instruction-tuned models and human-written concept datasets show that the method matches or exceeds strong baselines in accuracy, with competitive directional stability and substantially lower training cost.

📝 Abstract

Probing studies what information is encoded in a frozen LLM's layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.
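The probe-then-steer pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic Gaussian stand-ins for layer activations, the ridge grid and step count are arbitrary, and the names `fit_ridge_logistic`, `fit_probe`, and `steer` are hypothetical. It only shows the three steps the abstract names: fit an L2-regularized logistic probe, pick the ridge strength on validation data, and use the normalized weights as a concept vector for additive steering.

```python
import numpy as np


def fit_ridge_logistic(X, y, lam, steps=300):
    """Minimize mean logistic loss + (lam / 2) * ||w||^2 by gradient descent.

    X: (n, d) activations, y: (n,) labels in {0, 1}. The bias is not penalized.
    """
    n, d = X.shape
    Xa = np.hstack([X, np.ones((n, 1))])
    # Step size 1/L, where L upper-bounds the curvature of the objective.
    lr = 1.0 / (0.25 * np.linalg.norm(Xa, 2) ** 2 / n + lam)
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)  # clip logits to avoid overflow
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y) / n + lam * w)
        b -= lr * np.mean(p - y)
    return w, b


def accuracy(X, y, w, b):
    return float(np.mean((X @ w + b > 0) == (y == 1)))


def fit_probe(X_tr, y_tr, X_val, y_val, lams=(1e-3, 1e-2, 1e-1, 1.0)):
    """Tune the ridge strength on validation accuracy (assumed grid) and
    return the unit-norm concept vector, the chosen lam, and its accuracy."""
    best = None
    for lam in lams:
        w, b = fit_ridge_logistic(X_tr, y_tr, lam)
        acc = accuracy(X_val, y_val, w, b)
        if best is None or acc > best[0]:
            best = (acc, lam, w)
    acc, lam, w = best
    return w / np.linalg.norm(w), lam, acc


def steer(h, v, alpha):
    """Additive activation steering: shift a hidden state along v."""
    return h + alpha * v


# Synthetic stand-in for layer activations (an assumption, not real LLM data):
# the two classes differ by a shift along a hidden unit direction u.
rng = np.random.default_rng(0)
d = 32
u = rng.normal(size=d)
u /= np.linalg.norm(u)


def sample(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + 1.5 * (2 * y - 1)[:, None] * u
    return X, y.astype(float)


X_tr, y_tr = sample(200)
X_val, y_val = sample(100)

v, best_lam, val_acc = fit_probe(X_tr, y_tr, X_val, y_val)
h = rng.normal(size=d)
steered = steer(h, v, alpha=4.0)

print("chosen ridge strength:", best_lam)
print("validation accuracy:", val_acc)
print("cosine with planted direction:", float(v @ u))
```

In a real pipeline, `X_tr` and `X_val` would be activations collected from a chosen layer of the frozen model, and `steer` would be applied inside the forward pass, for example through a hook on that layer.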

Problem

Research questions and friction points this paper is trying to address.

probing

concept vector

activation steering

directional stability

training cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

ridge regularization

probing

concept vectors