Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

career value

167K/year

🤖 AI Summary

The impact of pooling operations on representational capacity and task performance in Transformer models has long been overlooked. Method: We establish, for the first time, theoretical expressivity bounds for pooling methods and propose a unified analytical framework that characterizes how distinct pooling strategies—e.g., [CLS], mean, and attention-weighted pooling—affect input discriminability, contextual modeling capability, and optimization dynamics. Our analysis spans three modalities—NLP, computer vision, and time series—and encompasses multiple attention variants across diverse downstream tasks. Contribution/Results: Empirical evaluation reveals that pooling choice significantly influences accuracy, gradient sensitivity, and convergence stability. Crucially, we identify task-agnostic, high-performing pooling patterns that generalize consistently across modalities and tasks. This work provides both theoretical foundations and practical guidelines for task-aware pooling design in Transformer architectures.

Technology Category

Application Category

📝 Abstract

Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for downstream tasks. While much of the literature has focused on attention mechanisms, the role of pooling remains underexplored despite its critical impact on model behavior. In this paper, we introduce a theoretical framework that rigorously characterizes the expressivity of Transformer-based models equipped with widely used pooling methods by deriving closed-form bounds on their representational capacity and the ability to distinguish similar inputs. Our analysis extends to different variations of attention formulations, demonstrating that these bounds hold across diverse architectural variants. We empirically evaluate pooling strategies across tasks requiring both global and local contextual understanding, spanning three major modalities: computer vision, natural language processing, and time-series analysis. Results reveal consistent trends in how pooling choices affect accuracy, sensitivity, and optimization behavior. Our findings unify theoretical and empirical perspectives, providing practical guidance for selecting or designing pooling mechanisms suited to specific tasks. This work positions pooling as a key architectural component in Transformer models and lays the foundation for more principled model design beyond attention alone.

Problem

Research questions and friction points this paper is trying to address.

Analyzing pooling's impact on Transformer model expressivity and capacity

Evaluating pooling strategies across vision, NLP, and time-series tasks

Providing theoretical and empirical guidance for pooling mechanism selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Theoretical framework analyzes pooling expressivity in Transformers

Closed-form bounds quantify representational capacity distinctions

Empirical evaluation spans vision, language, and time-series tasks

🔎 Similar Papers

No similar papers found.