HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the failure of long-range semantic modeling in vision-language models (VLMs) on long-video understanding, which the authors trace to improper RoPE frequency allocation, this paper proposes HoPE, a Hybrid of Position Embedding. HoPE introduces two key innovations: (1) the first theoretical analysis of frequency allocation in multimodal RoPE across spatiotemporal dimensions, leading to an interpretable hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts; and (2) a dynamic temporal scaling mechanism that adaptively adjusts temporal resolution to support contexts of arbitrary length. HoPE requires no architectural modifications and is fully compatible with existing VLM training paradigms. Evaluated on four long-video understanding and retrieval benchmarks, HoPE consistently outperforms state-of-the-art methods, demonstrating strong generalization and robustness under extended context lengths.
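The hybrid frequency allocation idea can be sketched in a few lines. This is an illustrative reading of the summary, not the paper's exact scheme: standard RoPE channel pairs are partitioned across the spatial (x, y) and temporal axes, with the lowest-frequency pairs left unrotated so that attention scores on those channels remain purely semantic at any temporal distance. The function names and the split ratios below are assumptions.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies, one per pair of channels."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def hybrid_allocation(head_dim: int, zero_frac: float = 0.25) -> dict:
    """Illustrative hybrid split (ratios are assumptions, not the paper's):
    highest-frequency pairs -> spatial (x, y), mid-range -> temporal,
    and the lowest zero_frac of pairs get zero frequency, i.e. no
    positional rotation, so those channels stay position-agnostic."""
    freqs = rope_frequencies(head_dim)
    n = len(freqs)
    n_zero = int(n * zero_frac)      # lowest-frequency pairs, zeroed out
    n_rest = n - n_zero
    n_spatial = n_rest // 2          # split the rest between (x, y) and t
    return {
        "x": freqs[: n_spatial // 2],          # highest frequencies
        "y": freqs[n_spatial // 2 : n_spatial],
        "t": freqs[n_spatial:n_rest],          # mid-range frequencies
        "zero": np.zeros(n_zero),              # no rotation at all
    }
```

Keeping a band of zero-frequency channels is what lets similarity estimates stay stable however far apart two frames are, since those channels contribute a distance-independent dot product.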

📝 Abstract
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
Problem

Research questions and friction points this paper is trying to address.

Improving length generalization in Vision-Language Models for long videos
Addressing unreliable semantic similarity capture in multimodal RoPEs
Enhancing spatial-temporal dependency modeling in extended video contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid frequency allocation for semantic modeling
Dynamic temporal scaling for robust learning
Improved long-context video understanding performance
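The dynamic temporal scaling bullet can be illustrated with a small sketch. This is a guess at the mechanism based on the summary, not the paper's implementation: during training, temporal indices are multiplied by a randomly sampled scale so the model sees several temporal resolutions; at inference, a scale is chosen that maps a long video's frame span into a range the model has seen. The scale set and the selection rule are assumptions.

```python
import random

def dynamic_temporal_scale(frame_ids, scales=(0.5, 1.0, 2.0),
                           train=True, target_span=None, rng=random):
    """Illustrative dynamic temporal scaling (all names are assumptions).

    Training: sample a scale so the model learns to attend at several
    temporal resolutions. Inference: pick the scale whose scaled span
    lands closest to a target span seen during training.
    """
    if train:
        scale = rng.choice(scales)
    else:
        span = max(frame_ids) if frame_ids else 0
        target = target_span if target_span is not None else span
        scale = min(scales, key=lambda s: abs(s * span - target))
    # Scaled indices feed the temporal rotations in place of raw frame ids.
    return [t * scale for t in frame_ids], scale
```

Because the scaling touches only the position indices, it needs no architectural change, which matches the summary's claim of compatibility with existing VLM training pipelines.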
Haoran Li
Carnegie Mellon University
Yingjie Qin
Xiaohongshu Inc.
Baoyuan Ou
Xiaohongshu Inc.
Lai Xu
Xiaohongshu Inc.
Ruiwen Xu
Hong Kong University; Xiaohongshu Inc.
Multi-modal
Recommendations