AI Summary
This work addresses the latency, memory, and operational overhead incurred in production environments when deploying large language models (LLMs) alongside auxiliary classifiers for tasks such as safety detection. The authors propose reusing internal hidden states from a single LLM forward pass to perform text generation and classification simultaneously via lightweight probes. They introduce a representation selection framework that leverages hidden states across all tokens and layers, overcoming the limitations of conventional approaches that commit to a fixed token or layer. A two-stage aggregation mechanism further enables efficient context-aware classification. Evaluated on safety and sentiment analysis benchmarks, the method outperforms logits-only reuse strategies (e.g., MULI), matches the performance of larger dedicated classifiers, and achieves near-native inference latency with substantially reduced GPU memory consumption.
Abstract
Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.
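The two-stage aggregator described above can be sketched in plain Python: stage one pools the token hidden states within each layer into a per-layer summary, and stage two combines the layer summaries with learned layer weights into a single vector for the probe. The mean pooling, softmax layer gate, and tensor shapes below are illustrative assumptions for the simplest (direct-pooling) instantiation, not the paper's exact implementation:

```python
import math

def mean_pool_tokens(layer_states):
    """Stage 1: pool a layer's token states (list of T vectors) into one summary vector."""
    num_tokens = len(layer_states)
    dim = len(layer_states[0])
    return [sum(vec[i] for vec in layer_states) / num_tokens for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_two_stage(hidden, layer_scores):
    """Stage 2: combine per-layer summaries into one representation.

    hidden: L x T x d nested list (layers x tokens x hidden dim),
            as exposed by a single forward pass of the serving LLM.
    layer_scores: L learnable scalars gating the layer summaries
                  (a hypothetical stand-in for the paper's scoring gate).
    """
    summaries = [mean_pool_tokens(layer) for layer in hidden]  # L x d
    weights = softmax(layer_scores)
    dim = len(summaries[0])
    return [
        sum(weights[l] * summaries[l][i] for l in range(len(summaries)))
        for i in range(dim)
    ]

# Toy example: 2 layers, 2 tokens, hidden dim 1.
# Layer 0 tokens average to 2.0; layer 1 tokens average to 6.0.
hidden = [[[1.0], [3.0]], [[5.0], [7.0]]]
rep = aggregate_two_stage(hidden, layer_scores=[0.0, 0.0])  # equal layer weights
# rep[0] == 0.5 * 2.0 + 0.5 * 6.0 == 4.0
```

The resulting vector `rep` would then feed a small classification head; in the paper's heavier instantiations, the pooling and gating steps are replaced by scoring attention or a downcast multi-head self-attention probe.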