AI Summary
To address the high latency and excessive token consumption induced by chain-of-thought and similar reasoning methods in LLM inference, this paper proposes DUCHESS, a novel system that introduces the first lightweight linear probe operating on LLM layer activations to enable fine-grained branch correctness prediction. Based on these predictions, DUCHESS dynamically terminates, replicates, or continues inference branches. It further incorporates a task-difficulty-aware request scheduler to jointly optimize server resource allocation under multi-request workloads. Implemented within the vLLM framework, DUCHESS reduces token consumption by 42%-63% over self-consistency across three inference benchmarks, while decreasing average, median, and tail latency by 52%-85%. It significantly improves the token-accuracy Pareto frontier without compromising accuracy.
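As a rough sketch (not the paper's actual implementation), the probe can be pictured as a single linear layer over a hidden-state vector from one decoder layer, with the predicted correctness probability driving the terminate/duplicate/continue decision. The hidden size, thresholds, and the names `BranchCorrectnessProbe` and `orchestrate` below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BranchCorrectnessProbe(nn.Module):
    """Linear probe mapping a hidden-state vector from an LLM layer
    to a probability that the reasoning branch will end up correct."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, layer_activation: torch.Tensor) -> torch.Tensor:
        # layer_activation: (batch, hidden_size), e.g. the last-token
        # hidden state taken from an intermediate decoder layer.
        return torch.sigmoid(self.linear(layer_activation)).squeeze(-1)


def orchestrate(p_correct: float,
                kill_below: float = 0.2,
                duplicate_above: float = 0.9) -> str:
    """Toy orchestration rule: terminate unpromising branches, duplicate
    highly promising ones, otherwise keep decoding. Thresholds are
    illustrative, not the paper's tuned values."""
    if p_correct < kill_below:
        return "terminate"
    if p_correct > duplicate_above:
        return "duplicate"
    return "continue"


# Example: score two in-flight branches from their layer activations.
probe = BranchCorrectnessProbe(hidden_size=4096)
activations = torch.randn(2, 4096)  # stand-in for real hidden states
for p in probe(activations).tolist():
    print(orchestrate(p))
```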
Abstract
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms such as chain-of-thought and multi-branch reasoning to improve accuracy on complex tasks. These methods, however, substantially increase token usage and per-request latency. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions. DUCHESS employs a lightweight linear probing model over LLM layer activations to estimate branch correctness, and its orchestration policy decides whether to terminate, duplicate, or continue a branch. When handling multiple requests, DUCHESS further reduces latency by prioritizing easier reasoning tasks when complexity can be estimated from the prompt. Experiments on three reasoning benchmarks show that DUCHESS consistently improves the token-accuracy Pareto frontier, reducing token usage by 42-63% at matched accuracy compared to self-consistency. In serving with vLLM, DUCHESS reduces mean, median, and tail latencies by 57-81%, 58-85%, and 52-84% with First-Come-First-Served scheduling, and achieves additional gains under difficulty-aware scheduling at higher request rates.
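To make the difficulty-aware scheduling idea concrete, the sketch below orders queued requests by an estimated difficulty score so that easier reasoning tasks are served first, approximating shortest-job-first. The `estimate_difficulty` heuristic, the `QueuedRequest` structure, and the example prompts are hypothetical stand-ins, not the scheduler or predictor described in the paper.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any


@dataclass(order=True)
class QueuedRequest:
    # Requests with lower estimated difficulty are popped first,
    # approximating shortest-job-first over reasoning workloads.
    est_difficulty: float
    request: Any = field(compare=False)


def estimate_difficulty(prompt: str) -> float:
    # Placeholder heuristic (prompt length); a real system would use a
    # learned predictor over the prompt, which is not reproduced here.
    return float(len(prompt.split()))


queue: list[QueuedRequest] = []
for prompt in ["What is 17 * 24?",
               "Prove that the sum of the first n odd numbers is n^2."]:
    heapq.heappush(queue, QueuedRequest(estimate_difficulty(prompt), prompt))

while queue:
    print(heapq.heappop(queue).request)  # easier (shorter) prompts run first
```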