🤖 AI Summary
Long-form video understanding is hindered by the high computational cost of dense frame decoding, quadratic growth in token count as the number of frames increases, and weak motion perception under sparse sampling. This work proposes a hierarchical video-language framework that decouples semantic and motion representations: sparse I-frames are processed by a Vision Transformer to capture scene semantics, while the dynamics of the much denser inter-frame intervals are encoded into aligned motion tokens by a lightweight compressed-domain tri-stream adapter that fuses motion-vector maps, residual maps, and I-frame context. These motion tokens are injected into a large language model through a differentiable placeholder mechanism. Combining contrastive alignment pretraining with LoRA fine-tuning, the method achieves a 2.3-point improvement (61.2% → 63.5%) over a 32-frame dense baseline on Video-MME while using 3.6× fewer context tokens, demonstrating the efficacy of the tri-stream architecture and hierarchical fusion.
📝 Abstract
Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2% to 63.5%) while using 3.6× fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.
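The placeholder injection step can be illustrated with a small sketch. The abstract does not describe how HY-Himmel implements it, so everything below is an assumption made for illustration: the reserved id `MOTION_PLACEHOLDER_ID`, the function name `inject_motion_tokens`, and the use of plain Python lists in place of tensors. The idea shown is only that positions in the prompt marked by a placeholder token have their embeddings replaced, in order, by the adapter's motion tokens.

```python
from typing import List, Sequence

Vector = List[float]

# Hypothetical reserved token id marking a motion slot (not taken from the paper).
MOTION_PLACEHOLDER_ID = -1


def inject_motion_tokens(
    token_ids: Sequence[int],
    text_embeddings: Sequence[Vector],
    motion_tokens: Sequence[Vector],
    placeholder_id: int = MOTION_PLACEHOLDER_ID,
) -> List[Vector]:
    """Return a copy of the embeddings with each placeholder slot replaced
    by the next motion token, preserving sequence order."""
    if len(token_ids) != len(text_embeddings):
        raise ValueError("token_ids and text_embeddings must have equal length")

    n_slots = sum(1 for tid in token_ids if tid == placeholder_id)
    if n_slots != len(motion_tokens):
        raise ValueError(
            f"expected {n_slots} motion tokens, got {len(motion_tokens)}"
        )

    motion_iter = iter(motion_tokens)
    fused: List[Vector] = []
    for tid, emb in zip(token_ids, text_embeddings):
        if tid == placeholder_id:
            # Motion slot: substitute the adapter output for the dummy embedding.
            fused.append(list(next(motion_iter)))
        else:
            # Ordinary token: keep its embedding unchanged.
            fused.append(list(emb))
    return fused


# Usage: two text tokens interleaved with two motion slots.
ids = [5, MOTION_PLACEHOLDER_ID, 7, MOTION_PLACEHOLDER_ID]
emb = [[0.0, 0.0], [9.0, 9.0], [1.0, 1.0], [9.0, 9.0]]
motion = [[0.5, 0.5], [0.25, 0.75]]
fused = inject_motion_tokens(ids, emb, motion)
```

In a tensor framework the same substitution would typically be written as a masked select or scatter over the embedding tensor, which is what would let gradients from the language-model loss reach the adapter; this list-based version only shows the indexing logic.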