LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

📅 2026-05-25
🤖 AI Summary
This work addresses limitations of existing vision-language models in long-form video understanding, fine-grained spatiotemporal localization, and unified multimodal perception. The model uses a native OneVision encoder with window-based attention for efficient local computation at native resolution. It adopts codec-stream tokenization, which allocates spatiotemporal tokens according to the bit cost of the compressed video stream, and a shared 3D RoPE that places images, sampled frames, and compressed canvases in one spatiotemporal coordinate system. Trained with about 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning, the 8B model reaches 74.9 mAP on the JumpScore benchmark, 44.8 points above Qwen3-VL-8B (30.1). On standard benchmarks it outperforms Qwen3-VL-8B by 4.3 average points on video tasks, 5.3 on spatial tasks, and 15.6 average J&F on tracking tasks.
📝 Abstract
We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. We build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented in existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
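To make the idea of bit-cost-driven temporal grouping concrete, here is a minimal Python sketch. It is not the paper's algorithm: the function name `group_frames_by_bit_cost`, the `budget` parameter, and the greedy "close the group when the budget would be exceeded" rule are all assumptions chosen for illustration. It only shows how per-frame bit costs could produce short groups around high-motion frames and long groups over static stretches, in contrast to fixed-length groups of pictures.

```python
from typing import List


def group_frames_by_bit_cost(bit_costs: List[int], budget: int) -> List[List[int]]:
    """Split frame indices into contiguous groups by cumulative bit cost.

    Hypothetical helper (not from the paper): a group is closed as soon as
    adding the next frame would push its summed bit cost over `budget`.
    A single frame that costs more than `budget` forms a group by itself.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    groups: List[List[int]] = []
    current: List[int] = []
    current_cost = 0
    for idx, cost in enumerate(bit_costs):
        # Close the current group when the next frame would exceed the budget.
        if current and current_cost + cost > budget:
            groups.append(current)
            current, current_cost = [], 0
        current.append(idx)
        current_cost += cost
    if current:
        groups.append(current)
    return groups


# Toy per-frame bit costs (made-up numbers): frames 3 and 4 are expensive,
# as they would be during a burst of motion.
costs = [10, 12, 11, 90, 85, 9, 10, 8]
groups = group_frames_by_bit_cost(costs, budget=100)
print(groups)  # [[0, 1, 2], [3], [4, 5], [6, 7]]
```

In this toy run the three cheap opening frames share one group, while the expensive frame 3 gets a group to itself, so a fixed per-group token budget ends up concentrated on the high-cost content.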
Problem

Research questions and friction points this paper is trying to address.

video understanding
temporal grounding
spatial grounding
token compression
multimodal perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

codec-stream tokenization
Windowed Attention
3D RoPE
adaptive temporal grouping
unified spatiotemporal perception
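As a rough illustration of what a shared spatiotemporal coordinate system can look like, the Python sketch below assigns a (t, h, w) position triple to every visual token, whether it comes from a still image, a sampled frame, or a codec canvas. The helper `grid_positions`, the grid sizes, and the choice of time indices are assumptions for illustration only; the source does not specify how LLaVA-OV-2 assigns coordinates or how the rotary embedding consumes them.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w)


def grid_positions(t: int, height: int, width: int) -> List[Position]:
    """Return (t, h, w) coordinates for one height x width token grid at time index t."""
    return [(t, h, w) for h in range(height) for w in range(width)]


# A still image: a single grid at time index 0 (assumed convention).
image_pos = grid_positions(t=0, height=2, width=2)

# Two sampled frames: the same grid repeated at consecutive time indices.
frame_pos = [p for t in (0, 1) for p in grid_positions(t, height=2, width=2)]

# A codec canvas: a compact grid placed at the (hypothetical) time index
# of the temporal group it summarizes.
canvas_pos = grid_positions(t=5, height=1, width=3)

print(image_pos)
print(canvas_pos)
```

Because all three input types produce triples on the same axes, a single 3D rotary embedding could in principle be applied to them without a per-modality position scheme.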
Xiang An
DeepGlint
Computer Vision
Yin Xie
Glint Lab, AIM for Health Lab, MVP Lab
Feilong Tang
Glint Lab, AIM for Health Lab, MVP Lab
Yunyao Yan
Glint Lab, AIM for Health Lab, MVP Lab
Huajie Tan
Peking University
Embodied AI, Foundation Models
Didi Zhu
Imperial College London
Multi-Modal LLMs, Out of Distribution Generalization
Changrui Chen
Glint Lab, AIM for Health Lab, MVP Lab
Xiuwei Zhao
Glint Lab, AIM for Health Lab, MVP Lab
Bin Qin
Institute of Software, Chinese Academy of Sciences
Machine Learning, Causal Inference
Kaicheng Yang
DeepGlint
Multimodal, CV, NLP
Yifei Shen
Glint Lab, AIM for Health Lab, MVP Lab
Yuanhan Zhang
PhD Candidate, MMLab@NTU
Computer Vision, Machine Learning
Kaichen Zhang
Nanyang Technological University
VLMs, Computer Vision, Multi-modality
Wenkang Zhang
Shanghai Jiao Tong University
3D Vision, Embodied AI, World Model, Learning-based Compression
Zheng Cheng
Glint Lab, AIM for Health Lab, MVP Lab
Nansen Zhang
Glint Lab, AIM for Health Lab, MVP Lab
Chunsheng Wu
Glint Lab, AIM for Health Lab, MVP Lab
Chunjiang Ge
Glint Lab, AIM for Health Lab, MVP Lab
Zimin Ran
Glint Lab, AIM for Health Lab, MVP Lab
Dehua Song
Glint Lab, AIM for Health Lab, MVP Lab
Chunyuan Li
xAI
Deep Learning, Vision, Language, Multimodal
Shikun Feng
Baidu
NLP
Ming Hu
Monash University | Shanghai AI Laboratory
Zhangquan Chen
Glint Lab, AIM for Health Lab, MVP Lab
Junbo Niu
Peking University
Foundation Model