Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

πŸ“… 2026-01-15
πŸ“ˆ Citations: 6
✨ Influential: 2
πŸ€– AI Summary
Existing open-source video-language models are held back by a scarcity of high-quality training data and by weak pixel-level grounding, which limits their performance on complex vision-language tasks. This work introduces Molmo2, the first open-source vision-language model to achieve strong video and multi-image understanding with point-driven, pixel-level localization, without relying on any data generated by closed-source models. Molmo2 combines efficient data packing, message-tree encoding, bidirectional attention over visual tokens, and a novel token-weighting strategy, and is trained on seven newly curated video datasets and two multi-image annotation datasets. The 8B-parameter variant, Molmo2-8B, outperforms existing open-source models on video counting and captioning, significantly surpasses Qwen3-VL on video grounding, and even exceeds Gemini 3 Pro on select metrics.
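The summary names several training-side mechanisms without spelling them out. As a rough illustration of one of them, bidirectional attention over visual tokens, the sketch below builds an attention mask that stays causal for text while letting vision tokens attend to each other in both directions. The function name `build_attention_mask` and the single-image layout are assumptions for illustration, not Molmo2's actual implementation.

```python
import torch

def build_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
    """Boolean (seq, seq) mask, True where attention is allowed.

    Text positions follow standard causal attention; positions flagged
    in `is_vision` may additionally attend to every other vision
    position, making attention bidirectional within the visual block.
    """
    seq = is_vision.shape[0]
    # Standard causal mask: position i may attend to positions j <= i.
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    # Entry [i, j] is True when both i and j are vision tokens,
    # so vision tokens can attend forward as well as backward.
    vision_pair = is_vision.unsqueeze(0) & is_vision.unsqueeze(1)
    return causal | vision_pair

# Example: 4 vision tokens followed by 4 text tokens.
mask = build_attention_mask(
    torch.tensor([True, True, True, True, False, False, False, False])
)
```

A production version would presumably limit the bidirectional block to tokens from the same image or frame chunk, so that packed samples never attend across sample boundaries.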

πŸ“ Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on state-of-the-art video (and image) language models. Crucially, many downstream applications require more than high-level video understanding; they require grounding, either by pointing or by tracking in pixels, a capability even proprietary models lack. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding across single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object-tracking dataset with complex queries, and an innovative new video-pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data built on an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open-weight-and-data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
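The abstract mentions a "novel token-weight strategy" but gives no formula. One common shape such a strategy can take, shown here purely as a hedged sketch, is per-token weighting of the next-token cross-entropy loss; `weighted_lm_loss` and the suggested weighting scheme are illustrative assumptions, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits: torch.Tensor,
                     labels: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy with per-token loss weights.

    logits:  (batch, seq, vocab) model outputs
    labels:  (batch, seq) target ids; -100 marks ignored positions
    weights: (batch, seq) per-token weights, e.g. up-weighting pointing
             coordinates and down-weighting boilerplate text (assumed)
    """
    # Shift so that position t predicts token t + 1.
    logits = logits[:, :-1].contiguous()
    labels = labels[:, 1:].contiguous()
    weights = weights[:, 1:].contiguous()

    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
        reduction="none",
    ).view(labels.shape)

    # Zero out ignored positions, then normalize by total weight so
    # differently weighted batches remain comparable in scale.
    w = weights * (labels != -100).float()
    return (per_token * w).sum() / w.sum().clamp_min(1e-8)
```

Under this framing, uniform weights of 1.0 recover the ordinary language-model loss, which makes the strategy easy to A/B against a baseline.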
Problem

Research questions and friction points this paper is trying to address.

video-language models
open-weight models
grounding
training data transparency
pixel-level localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

video grounding
open-weight VLMs
point-driven localization
vision-language pretraining
token-weight strategy
πŸ”Ž Similar Papers
2024-06-09 · Annual Meeting of the Association for Computational Linguistics · Citations: 13
2024-07-12 · European Conference on Computer Vision · Citations: 3
πŸ‘₯ Authors
Christopher Clark · Allen Institute for AI · Out-of-Domain Generalization, Multi-Modal Machine Learning, NLP
Jieyu Zhang · University of Washington · Data-Centric AI, Agentic AI, Multimodal Models, Machine Learning, Computer Vision
Zixian Ma · University of Washington · Multi-modal models and agents, human-agent interaction and collaboration
Jae Sung Park · Allen Institute for AI, University of Washington
Mohammadreza Salehi · University of Washington · Multimodal models, Video understanding, Computer vision, Natural language processing
Rohun Tripathi · Allen Institute for AI
Sangho Lee · Research Scientist at the Allen Institute for AI · Computer Vision, Deep Learning
Zhongzheng Ren · Allen Institute for AI, University of Washington
Chris Dongjoo Kim · Ai2 · Machine Learning, Data Quality, Multimodal data, Real-time Post-Training
Yinuo Yang · University of Washington
Vincent Shao · University of Washington
Yue Yang · Research Scientist at Ai2 · Artificial Intelligence, Natural Language Processing, Computer Vision
Weikai Huang · University of Washington
Ziqi Gao · HKUST · AI for Protein, Graph Machine Learning
Taira Anderson · Allen Institute for AI
Jianrui Zhang · Allen Institute for AI
Jitesh Jain · Georgia Tech · Image Segmentation, Multimodal Reasoning, Computer Vision
George Stoica · Allen Institute for AI
Winson Han · Allen Institute for AI
Ali Farhadi · Professor, Computer Science and Engineering, University of Washington · Computer Vision, Machine learning, Artificial Intelligence
Ranjay Krishna · University of Washington, Allen Institute for AI · Computer Vision, Natural Language Processing, Machine Learning, Human Computer Interaction