🤖 AI Summary
Fixed tree structures in multi-head decoding for large language models (LLMs) constrain both generation diversity and inference efficiency. Method: This paper proposes Dynamic Tree Attention, a dynamic multi-head parallel decoding mechanism integrated into the MEDUSA framework. It abandons predefined tree topologies and instead constructs and prunes candidate trees adaptively during decoding, guided by a low-complexity dynamic candidate generation strategy that balances scalability and computational efficiency. Contribution/Results: To our knowledge, this is the first work to introduce dynamic tree structures into the multi-head decoding paradigm, effectively alleviating topological constraints on path diversity. Experiments demonstrate a 1.8–2.3× decoding speedup while preserving generation quality, as measured by BLEU score and KL divergence, validating the effectiveness and generalizability of dynamic tree modeling for speculative multi-head decoding.
📝 Abstract
Multi-head decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed structure. In this paper, we replace the fixed tree attention with dynamic tree attention in multi-head decoding, specifically in the context of MEDUSA. We propose a simple, low-complexity strategy to generate candidates and construct the dynamic tree structure. Preliminary experiments show that the proposed method improves the decoding efficiency of multi-head decoding for LLMs while maintaining generation quality. This result demonstrates the potential for further improving multi-head decoding through better candidate generation.
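The abstract does not spell out the candidate-generation algorithm, so the following is only a minimal sketch of the general idea: each MEDUSA head proposes a token distribution for one future position, candidate paths are scored by the product of per-head probabilities, and only the top-scoring paths are kept, so the retained paths share prefixes and form a tree that can be verified in one pass with tree attention. All names (`build_dynamic_candidates`, `head_probs`, `max_candidates`) are hypothetical, not the paper's API.

```python
import heapq
from typing import Dict, List, Tuple

def build_dynamic_candidates(
    head_probs: List[Dict[int, float]],  # per head: token id -> probability
    max_candidates: int = 8,
) -> List[Tuple[int, ...]]:
    """Hypothetical sketch of dynamic candidate-tree construction.

    Expand candidate prefixes head by head, score each path by the
    product of its per-head token probabilities, and prune to the
    top-scoring paths. Because surviving paths share prefixes, they
    implicitly form a tree (trie) whose shape adapts to the current
    distributions instead of being fixed in advance.
    """
    beams: List[Tuple[float, Tuple[int, ...]]] = [(1.0, ())]
    for probs in head_probs:
        expanded = [
            (score * p, prefix + (tok,))
            for score, prefix in beams
            for tok, p in probs.items()
        ]
        # Dynamic pruning: keep only the highest-scoring paths so the
        # candidate tree stays small regardless of vocabulary size.
        beams = heapq.nlargest(max_candidates, expanded)
    return [path for _, path in beams]
```

With two heads, e.g. `[{1: 0.6, 2: 0.4}, {3: 0.7, 4: 0.3}]` and `max_candidates=2`, the sketch keeps the paths `(1, 3)` and `(2, 3)`, discarding the low-probability continuations that a fixed tree topology would have allocated slots to.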