🤖 AI Summary
The single [CLS] token in BERT has limited representational capacity, especially in low-resource text classification tasks. Method: This paper proposes a lightweight cross-layer and cross-token semantic aggregation mechanism that combines inter-layer max-pooling over [CLS] tokens across Transformer layers with full-sequence multi-head attention (MHA) enhancement, using a carefully designed fusion strategy. The method requires no additional pretraining and operates solely via fine-tuning to efficiently integrate deep contextual information. Contribution/Results: Evaluated on the GLUE benchmark, the approach consistently outperforms BERT-base, achieving substantial accuracy gains on low-resource tasks such as RTE and MRPC, demonstrating the effectiveness and generalizability of multi-granularity representation aggregation for semantic modeling.
📝 Abstract
The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we propose MaxPoolBERT, a lightweight extension to BERT that refines the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach enhances BERT's classification accuracy (especially on low-resource tasks) without requiring pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently outperforms the standard BERT-base model.
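To make the aggregation concrete, the sketch below illustrates modifications (i) and (ii) in PyTorch: max-pooling the [CLS] position over the last few encoder layers, then letting the pooled vector attend over the full final layer via an extra MHA module. The class and parameter names are hypothetical (the paper does not specify them), and the input is assumed to be the tuple of per-layer hidden states that a BERT encoder returns when configured to output all hidden states.

```python
import torch
import torch.nn as nn

class MaxPoolCLSHead(nn.Module):
    """Hedged sketch of cross-layer [CLS] aggregation (names are illustrative).

    Expects `hidden_states`: a sequence of [batch, seq_len, hidden] tensors,
    one per encoder layer (e.g., from a BERT model returning all hidden states).
    """

    def __init__(self, hidden_size: int = 768, num_heads: int = 8,
                 layers_to_pool: int = 4):
        super().__init__()
        self.layers_to_pool = layers_to_pool
        # Additional MHA layer so the pooled [CLS] query can attend
        # over every token of the final encoder layer.
        self.mha = nn.MultiheadAttention(hidden_size, num_heads,
                                         batch_first=True)

    def forward(self, hidden_states):
        # (i) element-wise max over the [CLS] token (position 0)
        #     of the last `layers_to_pool` layers.
        cls_stack = torch.stack(
            [h[:, 0] for h in hidden_states[-self.layers_to_pool:]], dim=1
        )                                         # [batch, k, hidden]
        pooled_cls = cls_stack.max(dim=1).values  # [batch, hidden]

        # (ii) pooled [CLS] attends over the entire final layer.
        final_layer = hidden_states[-1]           # [batch, seq, hidden]
        query = pooled_cls.unsqueeze(1)           # [batch, 1, hidden]
        attended, _ = self.mha(query, final_layer, final_layer)
        return attended.squeeze(1)                # [batch, hidden]

if __name__ == "__main__":
    # Dummy hidden states standing in for a small 4-layer encoder.
    batch, seq_len, hidden = 2, 5, 16
    states = tuple(torch.randn(batch, seq_len, hidden) for _ in range(4))
    head = MaxPoolCLSHead(hidden_size=hidden, num_heads=4, layers_to_pool=3)
    out = head(states)
    print(out.shape)  # torch.Size([2, 16])
```

The pooled vector returned here would then feed a standard classification head during fine-tuning; no extra pretraining is involved, matching the lightweight spirit of the method.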