HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Long-form video understanding is hindered by the high computational cost of decoding dense frames, quadratic token growth with frame count, and weak motion perception under sparse sampling. This work proposes a hierarchical video-language framework that decouples semantic and motion representations: sparse anchor I-frames are processed by the host Vision Transformer to capture scene semantics, while the dynamics of the much denser inter-frame intervals are encoded into aligned motion tokens by a lightweight compressed-domain tri-stream adapter that fuses motion-vector maps, residual maps, and I-frame context. These motion tokens are injected into a large language model through a differentiable placeholder mechanism. Combining Stage-1 contrastive alignment with LoRA fine-tuning, the method improves on a dense 32-frame baseline by 2.3 points on Video-MME (61.2% → 63.5%) while using 3.6× fewer context tokens; ablations support the contribution of the tri-stream design and hierarchical fusion.
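To make the token-efficiency claim concrete, the following is a rough arithmetic sketch of how a hierarchical budget compares with a dense one. Every number in it (256 tokens per frame, 8 anchor frames, 7 intervals, 32 motion tokens per interval) is a hypothetical value chosen only so the arithmetic lands near the reported 3.6× ratio; the summary does not state the actual configuration, and `context_tokens` is an illustrative helper, not a function from the report.

```python
def context_tokens(n_frames, tokens_per_frame, n_intervals=0, motion_tokens_per_interval=0):
    """Visual context size: full-cost frame tokens plus cheap motion tokens."""
    return n_frames * tokens_per_frame + n_intervals * motion_tokens_per_interval


# All values below are hypothetical and NOT taken from the report.
TOKENS_PER_FRAME = 256            # assumed ViT tokens per decoded frame
MOTION_TOKENS_PER_INTERVAL = 32   # assumed adapter output per inter-frame interval

# Dense baseline: 32 frames, each paying the full per-frame token cost.
dense = context_tokens(32, TOKENS_PER_FRAME)

# Hierarchical: 8 anchor I-frames plus motion tokens for the 7 gaps between them.
hierarchical = context_tokens(8, TOKENS_PER_FRAME, 7, MOTION_TOKENS_PER_INTERVAL)

ratio = dense / hierarchical
print(f"dense={dense} hierarchical={hierarchical} ratio={ratio:.2f}x")
```

Under these assumed values the saving comes almost entirely from routing fewer frames through the ViT; the motion tokens add only a small overhead on top of the anchor frames.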

📝 Abstract

Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2% to 63.5%) while using 3.6× fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.
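The placeholder injection described in the abstract can be illustrated with a minimal sketch. It assumes the common pattern in multimodal LLMs where reserved placeholder positions in the prompt have their embeddings overwritten by externally computed vectors; the abstract does not spell out the mechanism, so the sentinel id `MOTION_PLACEHOLDER_ID` and the function `inject_motion_tokens` are hypothetical names. Plain Python lists stand in for tensors here; in a real training setup the same substitution would be done with tensor operations so that gradients reach the motion adapter.

```python
MOTION_PLACEHOLDER_ID = -1  # hypothetical sentinel id marking a motion-token slot


def inject_motion_tokens(token_ids, text_embeddings, motion_tokens,
                         placeholder_id=MOTION_PLACEHOLDER_ID):
    """Return a new embedding sequence with placeholder slots filled, in order,
    by the supplied motion tokens. Non-placeholder embeddings are kept as-is."""
    n_slots = sum(1 for tid in token_ids if tid == placeholder_id)
    if n_slots != len(motion_tokens):
        raise ValueError(
            f"{n_slots} placeholder slots but {len(motion_tokens)} motion tokens")

    motion_iter = iter(motion_tokens)
    return [
        next(motion_iter) if tid == placeholder_id else emb
        for tid, emb in zip(token_ids, text_embeddings)
    ]


# Usage: a 4-token prompt with two motion slots between two text tokens.
token_ids = [5, MOTION_PLACEHOLDER_ID, MOTION_PLACEHOLDER_ID, 9]
text_embeddings = [[0.1, 0.2], [0.0, 0.0], [0.0, 0.0], [0.3, 0.4]]
motion_tokens = [[1.0, 1.0], [2.0, 2.0]]

mixed = inject_motion_tokens(token_ids, text_embeddings, motion_tokens)
print(mixed)
```

Keeping the slot count check explicit catches mismatches between the prompt template and the number of encoded intervals early, rather than silently misaligning tokens.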

Problem

Research questions and friction points this paper is trying to address.

long-video understanding

multimodal language models

motion perception

token efficiency

sparse sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical video-language framework

compressed-domain motion encoding

tri-stream adapter