Adaptive Skeleton Graph Decoding

📅 2024-02-19
🏛️ arXiv.org
📈 Citations: 4
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from high computational and memory overhead during inference; existing parallel decoding methods, e.g., Skeleton-of-Thought (SoT), neglect semantic dependencies among subproblems, degrading output quality. To address this, we propose a dependency-graph-driven, difficulty-aware parallel decoding framework. First, we explicitly construct a subproblem dependency graph to enable cross-branch information propagation. Second, we introduce a dynamic model scaling mechanism guided by fine-grained subproblem difficulty estimation, enabling adaptive scheduling across multiple model sizes. Our approach jointly optimizes inference efficiency and generation quality while preserving semantic coherence. Experiments demonstrate that, compared to standard autoregressive decoding and SoT, our method achieves a 1.69× speedup in inference latency and up to a 51% improvement in response quality. To the best of our knowledge, this is the first work to unify dependency-aware decoding, difficulty-driven scheduling, and model-scale adaptivity in a single framework.

๐Ÿ“ Abstract
Large language models (LLMs) have seen significant adoption for natural language tasks, owing their success to massive numbers of model parameters (e.g., 70B+); however, LLM inference incurs significant computation and memory costs. Recent approaches propose parallel decoding strategies, such as Skeleton-of-Thought (SoT), to improve performance by breaking prompts down into sub-problems that can be decoded in parallel; however, they often suffer from reduced response quality. Our key insight is that we can request additional information, specifically dependencies and difficulty, when generating the sub-problems to improve both response quality and performance. In this paper, we propose Skeleton Graph Decoding (SGD), which uses dependencies exposed between sub-problems to support information forwarding between dependent sub-problems for improved quality while exposing parallelization opportunities for decoding independent sub-problems. Additionally, we leverage difficulty estimates for each sub-problem to select an appropriately-sized model, improving performance without significantly reducing quality. Compared to standard autoregressive generation and SoT, SGD achieves a 1.69x speedup while improving quality by up to 51%.
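The scheduling idea behind SGD can be sketched in a few lines. This is an illustrative Python sketch, not the authors' implementation: sub-problems form a dependency graph, and every sub-problem whose dependencies are already resolved can be decoded in the same parallel "wave", with outputs forwarded to dependents before the next wave starts. The function name `decode_waves` and the dictionary-based graph encoding are assumptions for this example.

```python
# Hypothetical sketch of dependency-aware wave scheduling (Kahn-style
# topological layering); names are illustrative, not the paper's API.
from collections import defaultdict

def decode_waves(deps):
    """Group sub-problems into waves: each wave holds the nodes whose
    dependencies were all completed in earlier waves, so the whole wave
    can be decoded in parallel."""
    indegree = {node: len(parents) for node, parents in deps.items()}
    children = defaultdict(list)
    for node, parents in deps.items():
        for p in parents:
            children[p].append(node)
    ready = [n for n, d in indegree.items() if d == 0]
    waves = []
    while ready:
        waves.append(sorted(ready))       # one parallel decoding wave
        nxt = []
        for n in ready:
            for c in children[n]:          # "forward" n's output to c
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        ready = nxt
    return waves

# Example: B and C both depend on A; D depends on B and C.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(decode_waves(deps))  # [['A'], ['B', 'C'], ['D']]
```

Here B and C are independent given A, so they decode concurrently, which is where the speedup over purely sequential generation comes from.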
Problem

Research questions and friction points this paper is trying to address.

Reduce computational and memory overhead in LLM inference
Maintain answer quality in parallel decoding methods
Optimize dependency-aware parallel decoding for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aware parallel decoding via dependency graphs
Pipelined planning and decoding stages for efficiency
KV cache reuse optimization to minimize overhead
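The difficulty-driven model selection described above can be sketched as a simple routing rule. This is a minimal illustration under assumed names and thresholds (the tier list and `pick_model` are not from the paper): each sub-problem's estimated difficulty selects the smallest model tier judged sufficient for it.

```python
# Illustrative difficulty-to-model routing (assumed thresholds and model
# names, not the paper's actual configuration).
def pick_model(difficulty, tiers):
    """tiers: (threshold, model_name) pairs sorted by ascending threshold.
    Return the first model whose threshold covers the difficulty;
    fall back to the largest model otherwise."""
    for threshold, name in tiers:
        if difficulty <= threshold:
            return name
    return tiers[-1][1]

TIERS = [(0.3, "small-7b"), (0.7, "medium-13b"), (1.0, "large-70b")]
print(pick_model(0.2, TIERS))  # small-7b
print(pick_model(0.5, TIERS))  # medium-13b
print(pick_model(0.9, TIERS))  # large-70b
```

Routing easy sub-problems to small models is what lets the framework cut latency without a uniform drop in quality.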