🤖 AI Summary
Training multimodal large language models (MLLMs) incurs prohibitive computational overhead due to massive token counts, yet existing efficiency methods primarily target inference and offer limited acceleration for training. To address this, we propose ReGATE, the first adaptive token pruning framework for MLLM training. It leverages a teacher-student paradigm: a frozen LLM serves as the teacher and generates per-token reference losses, while an exponential moving average (EMA) mechanism dynamically estimates token difficulty and importance, allowing low-informativeness tokens to be skipped during the forward pass. This yields fine-grained, difficulty-aware pruning. On VideoLLaMA2, ReGATE attains the peak accuracy of full training using only 35% of the tokens, accelerating training by 2×. With extended training, it surpasses the baseline across multimodal benchmarks while reducing total token consumption by over 41%, enhancing both training efficiency and model performance.
📝 Abstract
The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores. This adaptive, difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
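To make the scoring idea concrete, here is a minimal sketch of reference-guided, EMA-smoothed token selection. All names, the exact difficulty definition (student loss minus reference loss), the decay value, and the keep ratio are illustrative assumptions; the paper's actual formulation may differ.

```python
def update_difficulty(ema, student_loss, ref_loss, decay=0.9):
    """EMA update of per-token difficulty scores (illustrative, not the paper's API).

    Difficulty is assumed here to be the student's per-token loss minus the
    frozen reference LLM's loss on the same token: tokens the student finds
    hard relative to the teacher score high.
    """
    return [decay * e + (1 - decay) * (s - r)
            for e, s, r in zip(ema, student_loss, ref_loss)]

def select_tokens(difficulty, keep_ratio=0.35):
    """Return a boolean mask keeping the top `keep_ratio` fraction of tokens.

    Tokens with False in the mask would be bypassed in the forward pass.
    """
    k = max(1, int(round(keep_ratio * len(difficulty))))
    order = sorted(range(len(difficulty)),
                   key=lambda i: difficulty[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(difficulty))]
```

In use, the EMA would be updated once per training step from the two loss vectors, and the resulting mask applied before the student's forward pass; the smoothing keeps the kept-token set from oscillating between steps.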