🤖 AI Summary
This work addresses the high computational cost of Transformer-based video large language models in first-person dynamic spatial reasoning, where dense visual tokens impede efficiency and pruning based on static snapshots struggles to preserve critical motion and geometric cues. To overcome this, the authors propose Event Cascade Pruning (ECP), a training-free, event-guided token pruning framework that uses high-frequency motion cues from event cameras as a motion prior. ECP employs a three-stage pipeline (Event-Triggered Causal Sampling, Event-guided Motion Saliency Filtering, and Event-Attention Ranking Fusion) to reduce tokens while keeping the informative ones. The study also introduces ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning. Experiments show that with 80% of visual tokens pruned, ECP reaches 37.62% accuracy, surpassing the full-token baseline (36.31%), while accelerating inference by 1.89× and reducing GFLOPs by 52%. On ESR-Real, ECP improves accuracy by 2.68 percentage points over full-token baselines.
📝 Abstract
First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots and therefore fail to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that uses the high-frequency motion cues from event cameras as a continuous motion prior for token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving a 1.89× inference speedup and a 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.
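To make the last two stages more concrete, the following is a minimal sketch of how event-guided filtering and rank fusion could be combined for the tokens of a single frame. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the scoring rules, so the function names (`rank_positions`, `prune_tokens`), the rank-sum fusion rule, the `min_activity` threshold, and the fallback for frames with no events are all hypothetical. Keyframe sampling (the first stage) is not modeled here.

```python
def rank_positions(scores):
    """Return the rank of each item (0 = highest score); ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    ranks = [0] * len(scores)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def prune_tokens(attention, event_activity, keep_ratio=0.2, min_activity=0.0):
    """Pick the token indices to keep for one frame.

    attention      -- per-token attention score (assumed to come from the model)
    event_activity -- per-token event density (assumed to be pooled per patch)
    keep_ratio     -- fraction of tokens to keep; 0.2 mirrors the 80% pruning setting
    """
    n = len(attention)
    if n != len(event_activity):
        raise ValueError("attention and event_activity must have the same length")
    budget = max(1, round(keep_ratio * n))

    # Saliency filtering (assumed rule): drop tokens with no event activity.
    active = [i for i in range(n) if event_activity[i] > min_activity]
    if not active:
        # Hypothetical fallback: with no events at all, consider every token.
        active = list(range(n))

    # Rank fusion (assumed rule): sum the attention rank and the event rank,
    # so a token must score well on both signals to survive.
    att_rank = rank_positions([attention[i] for i in active])
    evt_rank = rank_positions([event_activity[i] for i in active])
    fused = sorted(range(len(active)), key=lambda j: (att_rank[j] + evt_rank[j], j))

    return sorted(active[j] for j in fused[:budget])


attention = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8, 0.05]
events = [0.0, 0.8, 0.6, 0.0, 0.9, 0.1, 0.0, 0.7, 0.2, 0.0]
kept = prune_tokens(attention, events)
print(kept)  # [4, 7]
```

In this toy input, token 0 has the highest attention score but no event activity, so the filter removes it before ranking. Tokens 4 and 7 are kept because they rank well on both attention and event activity, which is the kind of calibration between spatial attention and motion saliency that the abstract describes.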