🤖 AI Summary
This work addresses the high computational cost of Vision Transformers, which stems from self-attention, and the inability of existing token compression methods to adapt to input-dependent redundancy. The authors propose DORA, which they describe as the first reinforcement-learning-driven online framework for dynamic token merging. Merging is formulated as a Markov decision process in which a lightweight agent chooses the merging strategy for each layer at inference time. The method uses an asymmetric Actor-Critic architecture and a dense reward function with a non-linear distillation-based penalty, pairing offline training with low-overhead online inference. On ImageNet-1K, under aligned accuracy constraints, it achieves up to a 76% relative improvement in computational savings over state-of-the-art methods. On out-of-distribution benchmarks such as ImageNet-A and ImageNet-C, it attains a relative efficiency advantage of over 430%.
📝 Abstract
Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention in the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under a strict accuracy-drop constraint (at most 0.05%), DORA achieves up to a 12.66% token merging rate and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.
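To make the cost argument concrete, the sketch below counts multiply-accumulates (MACs) for a standard Transformer block and shows how a per-block merging schedule changes the total. It is an illustration under stated assumptions, not DORA's implementation: the function names, the choice to merge tokens before each block runs, and the quadratic form of the penalty in `toy_reward` are all hypothetical, since the abstract does not specify them. Only the cost model itself (linear terms from the projections and MLP, a quadratic term from attention) is standard.

```python
from typing import Sequence


def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Multiply-accumulates for one standard Transformer block."""
    projections = 4 * num_tokens * dim * dim       # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim  # QK^T scores, then scores times V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim   # two linear layers of the MLP
    return projections + attention + mlp


def rollout_macs(num_tokens: int, dim: int, merges_per_block: Sequence[int]) -> int:
    """Total MACs over all blocks for a given merging schedule.

    Assumption: merges_per_block[i] tokens are merged away before block i runs,
    and at least one token always remains.
    """
    total = 0
    for merged in merges_per_block:
        num_tokens = max(1, num_tokens - merged)
        total += block_macs(num_tokens, dim)
    return total


def relative_savings(num_tokens: int, dim: int, merges_per_block: Sequence[int]) -> float:
    """Fraction of MACs saved relative to running every block with no merging."""
    baseline = rollout_macs(num_tokens, dim, [0] * len(merges_per_block))
    merged = rollout_macs(num_tokens, dim, merges_per_block)
    return 1.0 - merged / baseline


def toy_reward(saved_fraction: float, distill_error: float, weight: float = 10.0) -> float:
    """Hypothetical reward: compute saved minus a non-linear (here quadratic) penalty
    on how far the merged features drift from the unmerged model's features."""
    return saved_fraction - weight * distill_error ** 2


if __name__ == "__main__":
    # ViT-Base-like shape: 197 tokens, width 768, 12 blocks; schedule is made up.
    schedule = [0, 0, 8, 8, 8, 8, 8, 8, 4, 4, 0, 0]
    print(f"relative MAC savings: {relative_savings(197, 768, schedule):.3f}")
```

In the MDP framing, each entry of the schedule would be one action chosen by the agent from the current features, rather than a fixed list as it is here. Because the reward is available after every block, a policy can trade a small distillation error for compute on easy inputs and merge less on hard ones.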