Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the computational and memory overhead of uncertainty quantification (UQ) for deep neural networks in safety-critical applications, this paper proposes Hydra Ensembles, a computationally efficient ensemble method tailored to Transformers. The core innovation is constructing structurally diverse subnetworks via attention head pruning and fusing them cheaply through a multi-head attention mechanism with grouped fully-connected layers. Crucially, Hydra Ensembles requires no training from scratch; it operates as a post-hoc procedure on pre-trained models. Despite this efficiency, it achieves strong UQ calibration. Experiments demonstrate that Hydra Ensembles consistently outperforms Deep Ensembles across image and text classification benchmarks. Notably, it achieves state-of-the-art (SOTA) accuracy on zero-shot ImageNet-1k classification while maintaining inference latency comparable to that of a single model.
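The diversity mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: it assumes each ensemble member keeps a different subset of attention heads, chosen from a per-head importance score with small random perturbations to induce diversity (the paper's exact pruning criterion is not given here).

```python
import numpy as np

def prune_heads(head_importance, keep_ratio=0.5, n_members=3, seed=0):
    """Pick a different subset of attention heads for each ensemble member.

    Heads are ranked by an importance score; per-member noise on the
    ranking yields structurally diverse subnetworks. This is a hedged
    simplification of head-pruning-based ensembling, not the paper's
    exact procedure.
    """
    rng = np.random.default_rng(seed)
    n_heads = len(head_importance)
    n_keep = max(1, int(n_heads * keep_ratio))
    members = []
    for _ in range(n_members):
        # Perturb the scores so each member ranks heads slightly differently.
        noisy = head_importance + rng.normal(0.0, 0.1, n_heads)
        keep = np.sort(np.argsort(noisy)[::-1][:n_keep])
        members.append(keep)
    return members

importance = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4])
subsets = prune_heads(importance, keep_ratio=0.5, n_members=3)
# Each member keeps 4 of the 8 heads; the kept subsets can differ per member.
```

Because pruning only selects existing heads, the members reuse pre-trained weights, which is why no from-scratch training is needed.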

📝 Abstract
Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state-of-the-art methods, even without requiring additional training.
Problem

Research questions and friction points this paper is trying to address.

Improving uncertainty quantification in efficient transformer ensembles
Reducing computational costs while maintaining ensemble diversity
Enhancing zero-shot performance without retraining from scratch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes attention heads to create diverse ensemble members
Merges ensemble members via grouped fully-connected attention layers
Achieves near-single-network inference speed with robust uncertainty
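The second bullet's grouped fully-connected fusion can be illustrated with a batched matrix multiply: each member (group) owns its own weight slice, so all members run in one operation rather than K separate forward passes. This is a hedged sketch under that assumption; the function name `grouped_linear` and the averaging fusion step are illustrative, not the paper's API.

```python
import numpy as np

def grouped_linear(x, weights):
    """Grouped fully-connected layer as a batched matmul.

    x: (batch, K, d_in), one feature slice per ensemble member.
    weights: (K, d_in, d_out), one weight matrix per member.
    Returns (batch, K, d_out): all K members computed in a single pass,
    which is why inference stays close to single-network latency.
    """
    return np.einsum("bki,kio->bko", x, weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))   # batch of 2, K=3 members, d_in=4
w = rng.normal(size=(3, 4, 5))   # an independent (4, 5) weight matrix per member
out = grouped_linear(x, w)       # (2, 3, 5): per-member outputs
ensemble = out.mean(axis=1)      # simple fusion: average member outputs
```

A block-diagonal dense layer would compute the same thing; the grouped form just avoids materializing the zero off-diagonal blocks.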