🤖 AI Summary
To address the high computational cost and complex hyperparameter tuning of Transformer models, this paper proposes an efficient training framework integrating boosting mechanisms. The method introduces (1) a least-squares boosting objective—replacing standard cross-entropy—to concentrate gradient updates on hard-to-classify samples; (2) a subgrid token selection strategy that dynamically identifies information-dense local token subsets; and (3) importance-weighted sampling to suppress redundant computation. These components are jointly embedded into the Transformer training pipeline. Empirical evaluation across multiple fine-grained text classification benchmarks demonstrates that the approach accelerates convergence and improves generalization: it reduces training time by 32%–47% while raising accuracy by 1.8–3.4 percentage points on average. It also significantly lowers architecture search overhead.
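The three mechanisms above can be illustrated with a minimal sketch. Note this is an illustrative reconstruction under stated assumptions, not the paper's actual implementation: the function names, the softmax-over-losses weighting, and the top-k score-based token selection are all hypothetical stand-ins for the components the summary names.

```python
import numpy as np

def least_squares_boosting_loss(logits, onehot, sample_weights):
    """Sketch of a least-squares boosting objective: squared error between
    logits and one-hot targets, weighted so hard samples dominate.
    (Assumed form; the paper's exact objective may differ.)"""
    residuals = logits - onehot                    # (batch, num_classes)
    per_sample = (residuals ** 2).sum(axis=1)      # squared error per sample
    return float((sample_weights * per_sample).sum() / sample_weights.sum())

def subgrid_token_selection(token_scores, keep_ratio=0.5):
    """Sketch of subgrid token selection: keep the top fraction of tokens by
    an information-density score (e.g. attention mass). Returns the kept
    token indices in positional order."""
    k = max(1, int(len(token_scores) * keep_ratio))
    kept = np.argsort(token_scores)[-k:]           # indices of k largest scores
    return np.sort(kept)                           # restore positional order

def importance_weights(losses, temperature=1.0):
    """Sketch of importance-weighted sampling: boosting-style weights where
    higher-loss samples get more probability mass in the next round
    (softmax over temperature-scaled losses)."""
    z = np.asarray(losses, dtype=float) / temperature
    z -= z.max()                                   # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Usage: reweight a batch by loss, then subsample it for the next step.
rng = np.random.default_rng(0)
losses = np.array([0.2, 1.5, 0.4, 2.1])
w = importance_weights(losses)
resampled = rng.choice(len(losses), size=2, replace=False, p=w)
```

The softmax weighting mirrors classic boosting schemes (e.g. AdaBoost's exponential reweighting); the summary does not specify which weighting the paper uses.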
📝 Abstract
Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least-squares boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architecture search overhead.