MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
The single [CLS] token in BERT has limited representational capacity, especially in low-resource text classification tasks. Method: This paper proposes a lightweight cross-layer and cross-token aggregation mechanism that combines max-pooling of the [CLS] token across Transformer layers with full-sequence multi-head attention (MHA), fused via a simple strategy. The method requires no additional pretraining and operates solely via fine-tuning to efficiently integrate deep contextual information. Contribution/Results: Evaluated on the GLUE benchmark, the approach consistently outperforms BERT-base, achieving substantial accuracy gains on low-resource tasks such as RTE and MRPC, demonstrating the effectiveness and generalizability of multi-granularity representation aggregation for semantic modeling.

📝 Abstract
The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we propose MaxPoolBERT, a lightweight extension to BERT that refines the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach enhances BERT's classification accuracy (especially on low-resource tasks) without requiring pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently achieves better performance than the standard BERT-base model.
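To make the first two modifications concrete, here is a minimal NumPy sketch (not the authors' code): variant (i) max-pools the [CLS] vector over the last `k` layers, and variant (ii) lets [CLS] attend over the final layer via scaled dot-product attention. The `(num_layers, seq_len, hidden)` tensor layout and the single-head projection matrices `Wq`, `Wk`, `Wv` are illustrative assumptions; the paper uses multi-head attention.

```python
import numpy as np

def maxpool_cls_across_layers(hidden_states, k=4):
    """Variant (i): element-wise max over the [CLS] token of the last k layers.

    hidden_states: array of shape (num_layers, seq_len, hidden);
    position 0 is assumed to be [CLS].
    """
    cls_per_layer = hidden_states[-k:, 0, :]      # (k, hidden)
    return cls_per_layer.max(axis=0)              # (hidden,)

def cls_attention_over_final_layer(final_hidden, Wq, Wk, Wv):
    """Variant (ii), single-head sketch: [CLS] as query attends over
    the full final-layer sequence.

    final_hidden: (seq_len, hidden); Wq/Wk/Wv: (hidden, d) projections
    (hypothetical names, learned in the real model).
    """
    q = final_hidden[0] @ Wq                      # (d,)  query from [CLS]
    K = final_hidden @ Wk                         # (seq_len, d)
    V = final_hidden @ Wv                         # (seq_len, d)
    scores = K @ q / np.sqrt(q.shape[-1])         # (seq_len,)
    w = np.exp(scores - scores.max())             # stable softmax
    w /= w.sum()
    return w @ V                                  # (d,) refined [CLS]
```

Variant (iii) would replace the query/keys with max-pooled sequence features before the same attention step; only fine-tuning is needed, since both operations sit on top of frozen-architecture BERT outputs.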
Problem

Research questions and friction points this paper is trying to address.

Improving BERT classification by aggregating layer and token information
Enhancing [CLS] token representation with multi-head attention
Boosting classification accuracy without increasing model size
Innovation

Methods, ideas, or system contributions that make the work stand out.

Max-pooling [CLS] token across multiple layers
Adding MHA layer for [CLS] token attention
Combining full-sequence max-pooling with MHA