🤖 AI Summary
Fixed tree structures in multi-head decoding for large language models (LLMs) constrain both generation diversity and inference efficiency. Method: This paper proposes Dynamic Tree Attention, a dynamic multi-head parallel decoding mechanism integrated into the MEDUSA framework. It abandons predefined tree topologies and instead constructs and prunes candidate trees adaptively during decoding, guided by a low-complexity dynamic candidate generation strategy that balances scalability and computational efficiency. Contribution/Results: To our knowledge, this is the first work to introduce dynamic tree structures into the multi-head decoding paradigm, effectively alleviating topological constraints on path diversity. Experiments demonstrate a 1.8–2.3× decoding speedup while preserving generation quality, as measured by BLEU score and KL divergence, validating the effectiveness and generalizability of dynamic tree modeling for speculative multi-head decoding.
📝 Abstract
Multi-head decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed structure. In this paper, we replace the fixed tree attention with dynamic tree attention in multi-head decoding, specifically in the context of MEDUSA. We propose a simple, low-complexity strategy to generate candidates and construct the dynamic tree structure. Preliminary experiments show that the proposed method improves the decoding efficiency of multi-head decoding for LLMs while maintaining generation quality. This result demonstrates the potential for further improving multi-head decoding through better candidate generation.
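The abstract does not spell out the candidate-generation algorithm, so the following is only a minimal sketch of the general idea: each MEDUSA head proposes a token distribution for one future position, candidate paths are scored by the product of per-head probabilities, and only the top-scoring paths are kept, so the retained paths share prefixes and form a tree that can be verified in one pass with tree attention. All names (`build_dynamic_candidates`, `head_probs`, `max_candidates`) are hypothetical, not the paper's API.

```python
import heapq
from typing import Dict, List, Tuple

def build_dynamic_candidates(
    head_probs: List[Dict[int, float]],  # per head: token id -> probability
    max_candidates: int = 8,
) -> List[Tuple[int, ...]]:
    """Hypothetical sketch of dynamic candidate-tree construction.

    Expand candidate prefixes head by head, score each path by the
    product of its per-head token probabilities, and prune to the
    top-scoring paths. Because surviving paths share prefixes, they
    implicitly form a tree (trie) whose shape adapts to the current
    distributions instead of being fixed in advance.
    """
    beams: List[Tuple[float, Tuple[int, ...]]] = [(1.0, ())]
    for probs in head_probs:
        expanded = [
            (score * p, prefix + (tok,))
            for score, prefix in beams
            for tok, p in probs.items()
        ]
        # Dynamic pruning: keep only the highest-scoring paths so the
        # candidate tree stays small regardless of vocabulary size.
        beams = heapq.nlargest(max_candidates, expanded)
    return [path for _, path in beams]
```

With two heads, e.g. `[{1: 0.6, 2: 0.4}, {3: 0.7, 4: 0.3}]` and `max_candidates=2`, the sketch keeps the paths `(1, 3)` and `(2, 3)`, discarding the low-probability continuations that a fixed tree topology would have allocated slots to.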