🤖 AI Summary
Existing approaches lack systematic methodologies for selecting optimal transformer layers and fusing embeddings from multiple large language models (LLMs) for text classification. Method: This paper proposes a layer-aware, fine-tuning-free embedding fusion framework. It first quantifies each hidden layer's contribution to downstream tasks via a layer importance score grounded in gradient magnitude and attention entropy. Next, it introduces a data-adaptive layer selection mechanism and employs a gated weighted fusion scheme to integrate embeddings from heterogeneous models, including BERT, RoBERTa, and DeBERTa. Contribution/Results: The method requires no parameter fine-tuning and significantly enhances generalization. On SST-2, MR, R8, and R52, it achieves average accuracy gains of 1.2–2.7% over single-layer and single-model baselines. Empirical analysis confirms the necessity of layer selection and the cross-model complementarity of embeddings, and demonstrates favorable computational efficiency.
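The summary mentions a layer importance score built from gradient magnitude and attention entropy but does not give its exact form. The sketch below is a hypothetical reconstruction: per-layer gradient norms and mean attention entropy are each min-max normalised, and a layer scores higher when gradients are large and attention is focused (low entropy). The blending weight `alpha` and the normalisation are assumptions, not the paper's definition.

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy of the attention distributions of one layer.
    attn: (heads, seq, seq) array of row-stochastic attention weights."""
    eps = 1e-12
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, seq)
    return ent.mean()

def layer_importance(grad_norms, attn_per_layer, alpha=0.5):
    """Hypothetical layer score: large gradient magnitude and low
    attention entropy suggest a more task-relevant layer. Both signals
    are min-max normalised and blended with weight alpha."""
    g = np.asarray(grad_norms, dtype=float)
    e = np.array([attention_entropy(a) for a in attn_per_layer])
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    return alpha * norm(g) + (1 - alpha) * (1 - norm(e))
```

As a sanity check, a layer with uniform attention has maximal entropy, so, with equal gradient norms, it scores lower than a layer whose attention is sharply peaked.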
📝 Abstract
Embedding fusion has emerged as an effective approach for enhancing performance across various NLP tasks. However, systematic guidelines for selecting optimal layers and developing effective strategies for fusing embeddings from LLMs remain underexplored. In this study, we propose a layer-aware embedding selection method and investigate how to quantitatively evaluate different layers to identify the most important ones for downstream NLP tasks, showing that the critical layers vary depending on the dataset. We also explore how combining embeddings from multiple LLMs, without requiring model fine-tuning, can improve performance. Experiments on four English text classification datasets (SST-2, MR, R8, and R52) demonstrate that different layers in LLMs exhibit varying degrees of representational strength for classification, and that combining embeddings from different models can enhance performance if the models exhibit complementary characteristics. Additionally, we discuss resource overhead (memory and inference time) to provide a balanced perspective on the real-world feasibility of embedding fusion. Future work will explore multilingual and domain-specific datasets, as well as techniques for automating layer selection, to improve both performance and scalability.
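The abstract describes combining embeddings from multiple models via gated weighted fusion, but the fusion rule itself is not spelled out here. A minimal sketch, assuming all models produce embeddings of the same dimension: a softmax over per-model gate logits yields weights, and the fused embedding is the weighted sum. The function name and the uniform-gate default are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gated_fusion(embeddings, gate_logits=None):
    """Hypothetical gated weighted fusion of per-model sentence
    embeddings. embeddings: list of (dim,) arrays, one per model
    (e.g. BERT, RoBERTa, DeBERTa); gate_logits: optional (n_models,)
    scores. Gates are softmax-normalised, so the output is a convex
    combination of the inputs."""
    E = np.stack(embeddings)  # (n_models, dim)
    if gate_logits is None:
        gate_logits = np.zeros(len(embeddings))  # uniform gates
    w = np.exp(gate_logits - np.max(gate_logits))
    w /= w.sum()
    return (w[:, None] * E).sum(axis=0)
```

With uniform gates this reduces to a simple average; pushing one gate logit up shifts the fused vector toward that model's embedding, which is how complementary models could be traded off without fine-tuning.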