🤖 AI Summary
This work proposes STEP, a novel framework that enhances the efficiency and accuracy of test-time multi-trajectory reasoning in large language models. While such reasoning can improve performance, it suffers from high computational overhead and latency, and existing pruning methods struggle to reliably assess trajectory quality early in generation. STEP addresses this by leveraging hidden states during inference as early signals for step-level quality estimation. It introduces a lightweight scorer and a GPU memory-aware dynamic pruning mechanism that promptly eliminates low-potential trajectories during generation. Evaluated across multiple complex reasoning benchmarks, STEP reduces end-to-end latency by 45%–70% on average while simultaneously improving reasoning accuracy.
📝 Abstract
Large Language Models (LLMs) can enhance their reasoning capabilities through test-time scaling by generating multiple reasoning traces. However, combining lengthy reasoning traces with multiple samples introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning when GPU memory becomes saturated by the KV cache, reducing end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%–70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP
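The two components described above — a lightweight scorer over hidden states and memory-triggered pruning — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear scorer weights, hidden-state shapes, KV-cache sizes, and function names (`step_score`, `memory_aware_prune`) are all hypothetical stand-ins.

```python
import numpy as np

HIDDEN_DIM = 16

rng = np.random.default_rng(0)
# Stand-in for the trained lightweight step scorer (a linear probe here).
scorer_w = rng.normal(size=HIDDEN_DIM)

def step_score(hidden_state: np.ndarray) -> float:
    """Estimate step quality from a hidden state via a logistic linear probe."""
    return float(1.0 / (1.0 + np.exp(-hidden_state @ scorer_w)))

def memory_aware_prune(traces, kv_bytes_per_trace: int, budget_bytes: int):
    """Drop lowest-scoring traces until the simulated KV cache fits the budget."""
    traces = sorted(traces, key=lambda t: step_score(t["hidden"]), reverse=True)
    while len(traces) > 1 and len(traces) * kv_bytes_per_trace > budget_bytes:
        traces.pop()  # evict the lowest-potential trace, freeing its KV cache
    return traces

# Toy decode step: 8 parallel traces, pruning fires only when memory saturates.
traces = [{"id": i, "hidden": rng.normal(size=HIDDEN_DIM)} for i in range(8)]
traces = memory_aware_prune(traces, kv_bytes_per_trace=1 << 20, budget_bytes=5 << 20)
print(len(traces))  # 5 traces survive under the 5 MiB budget
```

The key design point the abstract highlights is that pruning is triggered by memory pressure rather than on a fixed schedule, so no trace is discarded while GPU memory can still hold all KV caches.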