ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

πŸ“… 2025-07-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Training multimodal large language models (MLLMs) incurs prohibitive computational overhead due to massive token counts, yet existing efficiency methods primarily target inference and offer limited acceleration for training. To address this, we propose ReGATE, an adaptive token pruning framework for MLLM training built on a teacher-student paradigm: a frozen LLM serves as the teacher and produces per-token reference losses, while an exponential moving average (EMA) mechanism dynamically estimates each token's difficulty and importance, allowing low-informativeness tokens to be skipped in the forward pass. This yields fine-grained, difficulty-aware pruning. Applied to VideoLLaMA2, ReGATE matches the peak accuracy of full training while using only 35% of the tokens, accelerating training by up to 2×. With extended training, it surpasses the baseline across multimodal benchmarks while reducing total token consumption by over 41%, improving both training efficiency and model performance.

πŸ“ Abstract
The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2× faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
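The abstract describes the core scoring step: combine the teacher's per-token reference losses with an EMA of the student's difficulty scores, then keep only the highest-scoring tokens in the forward pass. The sketch below illustrates that idea under stated assumptions; the function name, signature, and the student-minus-teacher loss gap as the difficulty signal are our own illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def select_tokens(student_losses, teacher_losses, ema_scores,
                  keep_ratio=0.35, ema_momentum=0.9):
    """Illustrative sketch of reference-guided token selection.

    student_losses / teacher_losses: per-token losses (hypothetical inputs).
    A large student-minus-teacher gap marks a token the student still finds
    difficult, i.e. informative for training; an EMA smooths the score over
    training steps. Returns the sorted indices of tokens to keep and the
    updated EMA scores.
    """
    difficulty = student_losses - teacher_losses  # reference-guided gap
    ema_scores = ema_momentum * ema_scores + (1 - ema_momentum) * difficulty
    # keep_ratio=0.35 mirrors the paper's "35% of tokens" setting
    k = max(1, int(len(ema_scores) * keep_ratio))
    keep_idx = np.argsort(ema_scores)[-k:]  # highest-difficulty tokens
    return np.sort(keep_idx), ema_scores
```

Only the selected indices would then be fed through the student's forward pass, which is where the computational savings come from.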
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost in MLLM training
Improving training efficiency with adaptive token pruning
Maintaining accuracy while processing fewer tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive token pruning for MLLM training acceleration
Teacher-student framework with reference-guided token elision
Selective token processing reduces computational overhead
πŸ”Ž Similar Papers
No similar papers found.