Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling

📅 2026-01-14
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work proposes STEP, a novel framework that enhances the efficiency and accuracy of test-time multi-trajectory reasoning in large language models. While such reasoning can improve performance, it suffers from high computational overhead and latency, and existing pruning methods struggle to reliably assess trajectory quality early in generation. STEP addresses this by leveraging hidden states during inference as early signals for step-level quality estimation. It introduces a lightweight scorer and a GPU memory-aware dynamic pruning mechanism that promptly eliminates low-potential trajectories during generation. Evaluated across multiple complex reasoning benchmarks, STEP reduces end-to-end latency by 45%–70% on average while improving reasoning accuracy.
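The step-level quality estimation described above can be sketched minimally: a small scorer maps each trace's step hidden state to a score in (0, 1). This is a hypothetical toy (a single linear layer with random weights; the paper's scorer is a trained lightweight model whose architecture is not given here), intended only to show the shape of the interface.

```python
import numpy as np

rng = np.random.default_rng(0)

class StepScorer:
    """Toy linear scorer over step hidden states.

    Hypothetical sketch: the paper trains a lightweight scorer on hidden
    states; the weights here are random and purely illustrative.
    """

    def __init__(self, hidden_dim: int):
        # Random projection standing in for learned parameters.
        self.w = rng.normal(scale=hidden_dim ** -0.5, size=hidden_dim)
        self.b = 0.0

    def score(self, step_hidden: np.ndarray) -> np.ndarray:
        # step_hidden: (num_traces, hidden_dim), e.g. the last-token hidden
        # state of each trace's most recent reasoning step.
        logits = step_hidden @ self.w + self.b
        # Sigmoid squashes each trace's logit into a (0, 1) quality score.
        return 1.0 / (1.0 + np.exp(-logits))
```

Scores produced per step would then feed the pruning decision, so low-potential traces can be dropped early in generation.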

📝 Abstract
Large Language Models (LLMs) can enhance their reasoning capabilities through test-time scaling by generating multiple traces. However, combining lengthy reasoning traces with multiple samples introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning when GPU memory becomes saturated by the KV cache, reducing end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%–70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP
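The GPU memory-aware pruning strategy in the abstract can be sketched as a simple trigger: while the KV cache fits within the memory budget, all traces keep generating; once it saturates, the lowest-scoring traces are dropped. This is a minimal sketch under assumed details; the function name, byte-based accounting, and the 0.5 keep ratio are illustrative, not the paper's actual policy.

```python
def prune_traces(scores, kv_cache_bytes, gpu_budget_bytes, keep_ratio=0.5):
    """Return indices of traces to keep generating.

    Hypothetical helper: prune only once the KV cache saturates the GPU
    memory budget, then retain the top-scoring fraction of traces.
    """
    if kv_cache_bytes < gpu_budget_bytes:
        # Memory not yet saturated: no pruning, keep every trace.
        return list(range(len(scores)))
    # Saturated: keep the top-k traces by step-level score.
    k = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```

Freeing the KV cache of pruned traces is what recovers memory and latency: surviving traces continue decoding without waiting on low-potential ones.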
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
reasoning traces
inference latency
trace pruning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

hidden states
step-level pruning
test-time scaling
KV cache
reasoning efficiency
🔎 Similar Papers
2024-07-31 · International Conference on Electronics, Circuits, and Systems · Citations: 0