🤖 AI Summary
Decoder-only large language models suffer from limited representational capacity due to their exclusive reliance on unidirectional causal attention. To address this, we propose the first instruction-tuning method specifically designed for pure decoder architectures: during the prompt encoding phase, we introduce parallel causal and bidirectional attention pathways with separate parameter sets; their outputs are dynamically fused via learnable weights to guide autoregressive generation. Our approach is architecture-agnostic and does not depend on any specific parameter-efficient fine-tuning (PEFT) technique. Experiments demonstrate consistent zero-shot performance gains over baselines across commonsense reasoning, arithmetic, and language understanding tasks. Ablation studies confirm the effectiveness and necessity of the bidirectional attention, the dual-path design, and the learnable fusion mechanism.
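The core difference between the two pathways is the attention mask applied while encoding the prompt. A minimal sketch of the two mask patterns (illustrative only; the paper's implementation details may differ):

```python
import numpy as np

def prompt_attention_masks(prompt_len):
    """Build the two attention masks used over the prompt.

    Causal pathway: token i may attend only to positions <= i
    (standard lower-triangular decoder mask).
    Bidirectional pathway: every prompt token attends to the full prompt.
    """
    causal = np.tril(np.ones((prompt_len, prompt_len), dtype=bool))
    bidirectional = np.ones((prompt_len, prompt_len), dtype=bool)
    return causal, bidirectional

causal, bidir = prompt_attention_masks(4)
# In the causal mask, position 0 cannot see position 3;
# in the bidirectional mask, it can.
```

Only the prompt is encoded bidirectionally; generation of new tokens remains strictly autoregressive.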
📝 Abstract
We introduce Bitune, a method that improves instruction-tuning of pretrained decoder-only large language models, leading to consistent gains on downstream tasks. Bitune applies both causal and bidirectional attention to the prompt to obtain a better representation of the query or instruction. We realize this by introducing two separate sets of parameters, which we train with parameter-efficient fine-tuning (PEFT) techniques. The causal and bidirectional features are then combined into a weighted average with trainable coefficients, which is subsequently used to generate new tokens. We demonstrate significant improvements in zero-shot performance on commonsense reasoning, arithmetic, and language understanding tasks, while extensive ablation studies validate the role of each component and show that the method is agnostic to the choice of PEFT technique.
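The fusion step described above can be sketched as a per-feature weighted average of the two pathways' outputs. Here is a minimal NumPy illustration; the gating via a sigmoid-squashed trainable scalar is an assumption for the sketch, and the variable names (`theta`, `h_causal`, `h_bidir`) are illustrative rather than taken from the paper's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_features(h_causal, h_bidir, theta):
    """Combine causal and bidirectional prompt features.

    theta: a trainable mixing parameter; the sigmoid keeps the
    coefficient alpha in (0, 1) so the result stays a convex
    combination of the two feature sets.
    """
    alpha = sigmoid(theta)
    return alpha * h_bidir + (1.0 - alpha) * h_causal

# With theta = 0, alpha = 0.5: an equal blend of both pathways.
h = fuse_features(np.zeros(4), np.ones(4), 0.0)
```

The fused representation of the prompt is then consumed by the usual autoregressive decoding loop.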