🤖 AI Summary
This study systematically investigates the joint effects of learning rate, batch size, and training epochs on accuracy, F1-score, and loss when fine-tuning DistilBERT (distilbert-base-uncased-finetuned-sst-2-english) for the SST-2 sentiment classification task. Methodologically, we employ polynomial regression to quantify both main and interaction effects among hyperparameters, complemented by statistical significance testing (p < 0.05). Results reveal that: (i) a high learning rate significantly reduces loss (p = 0.027) but degrades accuracy; (ii) batch size and epochs exhibit a strong interaction effect that significantly improves F1-score (p = 0.001); and (iii) batch size significantly affects accuracy (p = 0.028) and F1-score (p = 0.005), yet shows no significant effect on loss (p = 0.170). These findings motivate an adaptive fine-tuning paradigm that explicitly models nonlinear hyperparameter interactions, providing empirical evidence and methodological guidance for efficient fine-tuning of BERT-family models.
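The regression analysis described above can be sketched as follows. This is a minimal illustration, not the study's exact specification: the hyperparameter grid and F1 values below are placeholder data, and the design matrix includes only the main effects plus the batch-size-by-epochs interaction term that the study highlights. P-values come from ordinary-least-squares t-tests computed by hand with NumPy and SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical grid of hyperparameter configurations (illustrative values only).
# Columns: learning rate, batch size, epochs.
X_raw = np.array([
    [1e-5, 16, 2], [1e-5, 32, 4], [3e-5, 16, 4], [3e-5, 32, 2],
    [5e-5, 16, 2], [5e-5, 32, 4], [1e-5, 16, 4], [5e-5, 32, 2],
])
# Placeholder F1 scores for each configuration.
f1 = np.array([0.89, 0.92, 0.91, 0.88, 0.87, 0.90, 0.91, 0.86])

# Standardize predictors, then build a design matrix with an intercept,
# the three main effects, and the batch_size x epochs interaction term.
Z = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
interaction = (Z[:, 1] * Z[:, 2]).reshape(-1, 1)
X = np.hstack([np.ones((len(Z), 1)), Z, interaction])

# Ordinary least squares fit and classical t-test p-values per coefficient.
beta, _, _, _ = np.linalg.lstsq(X, f1, rcond=None)
resid = f1 - X @ beta
dof = len(f1) - X.shape[1]          # residual degrees of freedom
sigma2 = resid @ resid / dof        # residual variance estimate
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
p_vals = 2 * stats.t.sf(np.abs(beta / se), dof)

names = ["intercept", "lr", "batch", "epochs", "batch:epochs"]
for name, b, p in zip(names, beta, p_vals):
    print(f"{name:>12}: coef={b:+.4f}, p={p:.3f}")
```

A full quadratic polynomial regression would add squared terms and the remaining pairwise interactions to the design matrix in the same way.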
📝 Abstract
This study evaluates fine-tuning strategies for text classification using the DistilBERT model, specifically the distilbert-base-uncased-finetuned-sst-2-english variant. Through structured experiments, we examine the influence of hyperparameters such as learning rate, batch size, and epochs on accuracy, F1-score, and loss. Polynomial regression analyses capture foundational and incremental impacts of these hyperparameters, focusing on fine-tuning adjustments relative to a baseline model. Results reveal variability in metrics across hyperparameter configurations, exposing trade-offs among performance metrics. For example, a higher learning rate reduces loss in the relative analysis (p = 0.027) but impedes accuracy gains. Meanwhile, batch size significantly impacts accuracy and F1-score in the absolute regression (p = 0.028 and p = 0.005, respectively) but has limited influence on loss optimization (p = 0.170). The interaction between epochs and batch size significantly improves F1-score (p = 0.001), underscoring the importance of hyperparameter interplay. These findings highlight the need for fine-tuning strategies that address non-linear hyperparameter interactions to balance performance across metrics. Such variability and metric trade-offs are relevant for tasks beyond text classification, including other NLP tasks and computer vision. This analysis informs fine-tuning strategies for large language models and promotes adaptive designs for broader model applicability.