Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention

📅 2025-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fixed tree structures in multi-head decoding for large language models (LLMs) constrain both generation diversity and inference efficiency. Method: This paper proposes Dynamic Tree Attention, a dynamic multi-head parallel decoding mechanism integrated into the MEDUSA framework. It abandons predefined tree topologies and instead constructs and prunes candidate trees adaptively during decoding, guided by a low-complexity dynamic candidate generation strategy that balances scalability and computational efficiency. Contribution/Results: To our knowledge, this is the first work to introduce dynamic tree structures into the multi-head decoding paradigm, effectively alleviating topological constraints on path diversity. Experiments demonstrate a 1.8–2.3× decoding speedup while preserving generation quality, as measured by BLEU score and KL divergence, validating the effectiveness and generalizability of dynamic tree modeling for speculative multi-head decoding.
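The "low-complexity dynamic candidate generation" described above amounts to selecting, at each step, the most probable joint continuations across the decoding heads. One plausible way to do this (a sketch, not necessarily the paper's exact algorithm) is a lazy k-best search over the Cartesian product of the per-head top tokens, scoring each candidate by the product of its per-head probabilities:

```python
import heapq

def topk_candidates(head_probs, k):
    """Lazily enumerate the k most probable candidate continuations.

    head_probs: one list per decoding head of (token_id, prob) pairs,
    each sorted by descending probability. Returns up to k
    (token_sequence, joint_prob) pairs, where the joint probability is
    the product of the per-head probabilities. This k-best
    Cartesian-product search is an illustrative strategy; the paper's
    exact procedure may differ.
    """
    n = len(head_probs)

    def joint_prob(idx):
        p = 1.0
        for h, i in enumerate(idx):
            p *= head_probs[h][i][1]
        return p

    start = (0,) * n  # every head at its most probable token
    heap = [(-joint_prob(start), start)]
    seen = {start}
    out = []
    while heap and len(out) < k:
        neg_p, idx = heapq.heappop(heap)
        seq = tuple(head_probs[h][i][0] for h, i in enumerate(idx))
        out.append((seq, -neg_p))
        # Expand neighbors by advancing one head's index at a time.
        for h in range(n):
            if idx[h] + 1 < len(head_probs[h]):
                nxt = idx[:h] + (idx[h] + 1,) + idx[h + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-joint_prob(nxt), nxt))
    return out
```

Because candidates are popped in order of joint probability, the induced tree grows only where the heads are jointly confident, instead of following a fixed predefined topology.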

📝 Abstract
Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed structure. In this paper, we replace the fixed tree attention with dynamic tree attention in multiple heads decoding, specifically in the context of MEDUSA. We propose a simple, low-complexity strategy to generate candidates and construct the dynamic tree structure. Preliminary experiments show that the proposed method improves the decoding efficiency of multiple heads decoding for LLMs while maintaining generation quality. This result demonstrates the potential for improving candidate generation in multiple heads decoding.
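The abstract's "verifies multiple candidate sequences in parallel via tree attention" works by merging the candidates into a prefix tree and masking attention so each node sees only its own ancestors. A minimal, generic sketch of that construction (the `build_tree_mask` helper and its interface are illustrative, not the paper's code):

```python
def build_tree_mask(candidates):
    """Merge candidate token sequences into a prefix tree and build the
    attention mask that lets all candidates be verified in one forward
    pass.

    candidates: list of token tuples. Returns (tokens, mask), where
    tokens[i] is the token at tree node i and mask[i][j] == 1 iff node j
    is node i itself or one of its ancestors, so each candidate path
    attends only to its own prefix (plus the shared context, handled
    separately in practice).
    """
    tokens, parents = [], []
    index = {}  # prefix tuple -> node id, so shared prefixes are merged
    for seq in candidates:
        prefix = ()
        parent = -1  # -1 marks children of the current context
        for tok in seq:
            prefix = prefix + (tok,)
            if prefix not in index:
                index[prefix] = len(tokens)
                tokens.append(tok)
                parents.append(parent)
            parent = index[prefix]
    n = len(tokens)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, enabling ancestor attention
            mask[i][j] = 1
            j = parents[j]
    return tokens, mask
```

With a fixed tree, `candidates` follows a predefined topology every step; the paper's dynamic variant rebuilds this tree each step from the current candidates.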
Problem

Research questions and friction points this paper is trying to address.

Fixed tree attention limits candidate diversity in multiple heads decoding
Improving the decoding efficiency of multiple heads decoding
Maintaining generation quality under faster decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic tree attention replacing the fixed tree structure
Simple, low-complexity candidate generation strategy
Improved decoding efficiency without degrading output quality