Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the bottleneck in deploying large language models (LLMs) for real-time web applications—where high-quality complex reasoning conflicts with low-latency, high-throughput requirements—this paper proposes Orion, a novel inference framework. Orion decouples query reasoning into two sequential stages: (1) structured keypoint generation and (2) dependency-aware parallel content expansion. It introduces a cross-query pipelined scheduler that preserves logical consistency while improving concurrency. Key points are generated via retrieval-augmented few-shot prompting, and expansion is guided by a dependency graph to enable GPU-aware parallelism and memory-load-balanced scheduling. Experiments demonstrate that Orion achieves up to 4.33× higher throughput, 3.42× lower latency, and 18.75% improvement in reasoning quality over state-of-the-art baselines.
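The dependency-aware expansion stage described above can be sketched as a scheduler that expands a key point only once every point it depends on has finished, running independent points concurrently. The sketch below is a minimal illustration of that idea, not Orion's actual implementation; the function names (`expand_parallel`, `expand_fn`) and the dict-based dependency graph are assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def expand_parallel(points, deps, expand_fn, max_workers=4):
    """Expand key points concurrently, submitting a point only after all
    of its dependencies have completed (hypothetical scheduler sketch)."""
    done, results = set(), {}
    pending = {}  # future -> point id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        def ready():
            # Points not yet started whose dependencies are all done.
            return [p for p in points
                    if p not in done and p not in pending.values()
                    and all(d in done for d in deps.get(p, []))]

        for p in ready():
            pending[pool.submit(expand_fn, p, results)] = p
        while pending:
            finished, _ = wait(pending, return_when=FIRST_COMPLETED)
            for f in finished:
                p = pending.pop(f)
                results[p] = f.result()
                done.add(p)
            # Newly unblocked points can now be expanded in parallel.
            for p in ready():
                pending[pool.submit(expand_fn, p, results)] = p
    return results
```

For example, with `deps = {"c": ["a", "b"]}` the points `a` and `b` expand concurrently, while `c` waits for both and can read their expansions from `results`, which is how the dependency graph preserves logical consistency.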

📝 Abstract
The integration of Large Language Models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, presents a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the stringent low-latency and high-throughput requirements of interactive services. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, creates a critical bottleneck for Web services. Existing approaches typically optimize LLM reasoning for either efficiency or quality but struggle to achieve both, and thus fail to meet the dual requirements of modern Web platforms. To overcome these limitations, we propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Concretely, Orion decomposes a single query reasoning process into two synergistic phases: (1) *key point generation*, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) *content parallel expansion*, which concurrently elaborates on these points based on a dependency graph to ensure logical consistency. Furthermore, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation imposes pressure on GPU computing while expansion stresses GPU memory) across multiple queries, enabling cross-query parallelism and dramatically improving reasoning performance (i.e., efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency over the baselines but also improves reasoning quality by up to 18.75% through explicitly modeling inter-point dependencies.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning efficiency for real-time web applications
Overcoming sequential generation bottlenecks in complex reasoning tasks
Balancing computational efficiency with high-quality logical reasoning output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dependency-aware query decomposition for logical reasoning
Parallel content expansion using dependency graph consistency
Pipeline scheduling for cross-query computational optimization
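The cross-query pipeline scheduling idea above exploits that generation is compute-bound and expansion is memory-bound, so one query's expansion can overlap the next query's generation. A minimal two-stage sketch of this overlap, assuming a producer/consumer hand-off (the names `pipeline`, `generate`, and `expand` are hypothetical stand-ins for the framework's phases):

```python
import queue
import threading

def pipeline(queries, generate, expand):
    """Two-stage pipeline sketch: while one query's key points are being
    expanded (memory-bound), the next query's key points are already
    being generated (compute-bound), giving cross-query parallelism."""
    handoff = queue.Queue(maxsize=1)  # one query in flight between stages
    results = {}

    def stage_generate():
        for q in queries:
            handoff.put((q, generate(q)))  # blocks if expansion lags
        handoff.put(None)  # sentinel: no more queries

    def stage_expand():
        while (item := handoff.get()) is not None:
            q, points = item
            results[q] = [expand(p) for p in points]

    producer = threading.Thread(target=stage_generate)
    consumer = threading.Thread(target=stage_expand)
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    return results
```

The bounded queue (`maxsize=1`) is the key design choice: it keeps exactly one query staged between the phases, so the two resource profiles overlap without either stage running unboundedly ahead of the other.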