🤖 AI Summary
To address the uneven performance and poor cross-disciplinary generalization of large language models (LLMs) on multi-domain tasks, this paper proposes the Diverse Fingerprint Ensemble (DFPE). DFPE uses response fingerprints to quantify and preserve model diversity, combining subject-level K-means clustering, quantile-thresholded dynamic filtering, and accuracy-aware adaptive weighted fusion into a fine-grained multi-model ensemble. On the MMLU benchmark, DFPE outperforms the best single model by 3% in overall accuracy and by 5% in subject-level average accuracy, indicating better cross-domain generalization and robustness. Its core contributions are: (1) response-fingerprint-based diversity modeling; (2) subject-granular dynamic filtering; and (3) an accuracy-aware weighted ensembling scheme.
📝 Abstract
Large Language Models (LLMs) have shown remarkable capabilities across various natural language processing tasks but often struggle to excel uniformly in diverse or complex domains. We propose a novel ensemble method, Diverse Fingerprint Ensemble (DFPE), which leverages the complementary strengths of multiple LLMs to achieve more robust performance. Our approach involves: (1) clustering models based on their response "fingerprint" patterns, (2) applying a quantile-based filtering mechanism to remove underperforming models at a per-subject level, and (3) assigning adaptive weights to the remaining models based on their subject-wise validation accuracy. In experiments on the Massive Multitask Language Understanding (MMLU) benchmark, DFPE outperforms the best single model by 3% in overall accuracy and by 5% in discipline-level accuracy. This method increases the robustness and generalization of LLMs and underscores how model selection, diversity preservation, and performance-driven weighting can effectively address challenging, multi-faceted language understanding tasks.
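The three steps in the abstract can be illustrated with a minimal Python sketch for a single subject. This is not the authors' implementation, and several details are assumptions made only for illustration: a fingerprint is taken to be a model's string of answers on validation questions, clustering is a simple greedy grouping by answer disagreement (a stand-in for the clustering method used in the paper), one best model is kept per cluster, and voting weights are set equal to validation accuracy. All model names, accuracies, and thresholds are hypothetical toy values.

```python
from collections import defaultdict


def disagreement(fp_a, fp_b):
    """Fraction of validation questions on which two models answer differently."""
    return sum(a != b for a, b in zip(fp_a, fp_b)) / len(fp_a)


def cluster_models(fingerprints, max_dist):
    """Greedy grouping (assumed stand-in): join the first cluster whose seed is close enough."""
    clusters = []  # each cluster is a list of model names; its first entry is the seed
    for name, fp in fingerprints.items():
        for cluster in clusters:
            if disagreement(fp, fingerprints[cluster[0]]) <= max_dist:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def quantile(values, q):
    """Linearly interpolated q-quantile of a non-empty list, with 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def select_models(fingerprints, val_acc, max_dist=0.25, q=0.5):
    """Steps 1-2: keep the best model of each cluster, then drop those below the quantile."""
    clusters = cluster_models(fingerprints, max_dist)
    reps = [max(c, key=lambda m: val_acc[m]) for c in clusters]
    threshold = quantile([val_acc[m] for m in reps], q)
    return [m for m in reps if val_acc[m] >= threshold]


def weighted_vote(predictions, val_acc):
    """Step 3: each kept model votes with weight equal to its validation accuracy (assumed)."""
    scores = defaultdict(float)
    for model, answer in predictions.items():
        scores[answer] += val_acc[model]
    return max(scores, key=scores.get)


# Hypothetical toy data for one subject: answers to four validation questions.
fingerprints = {"m1": "ABCD", "m2": "ABCA", "m3": "DCBA", "m4": "BBBB"}
val_acc = {"m1": 0.75, "m2": 0.5, "m3": 0.5, "m4": 0.25}

kept = select_models(fingerprints, val_acc)
answer = weighted_vote({"m1": "A", "m3": "B"}, val_acc)  # votes from the kept models
```

In this toy run, `m2` falls into the same cluster as `m1` and is dropped as redundant, `m4` is removed by the median threshold, and the final vote is taken between `m1` and `m3`. In a per-subject setup, the selection and weighting would be repeated for each subject with that subject's validation accuracies.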