🤖 AI Summary
This work addresses the limitation of existing video understanding benchmarks, where many questions can be answered using textual cues alone, thereby failing to adequately challenge models’ visual reasoning capabilities. To overcome this, the authors propose a data curation criterion centered on visual grounding and construct VidGround, a high-quality post-training dataset. By integrating this dataset with reinforcement learning–based post-training, they achieve up to a 6.2 percentage point improvement in video understanding performance—despite using only 69.1% of the original data—and substantially outperform several sophisticated post-training strategies. The study underscores data quality as a critical bottleneck in advancing video understanding and demonstrates that carefully curated data can yield greater gains than increasingly complex algorithmic approaches.
📝 Abstract
It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: in commonly reported long video understanding benchmarks, 40-60% of questions can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: post-training on only the genuinely visually grounded questions, free of linguistic biases. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation paired with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
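The curation criterion described in the abstract can be sketched as a text-only filtering pass: drop any question that a blind (no-video) baseline answers correctly, since such questions do not require visual grounding. The sketch below is a minimal illustration under assumed names and data format, not the paper's actual pipeline.

```python
# Hedged sketch of visually-grounded data curation (assumptions: the
# function names, data format, and filtering rule are illustrative;
# the paper's actual criterion may differ).

def is_visually_grounded(question, choices, answer, text_only_model):
    """Keep a question only if a text-only model cannot answer it.

    text_only_model: callable(question, choices) -> predicted choice,
    given no video input. If it answers correctly from text alone, the
    question is treated as linguistically biased and is filtered out.
    """
    return text_only_model(question, choices) != answer

def curate(dataset, text_only_model):
    """Filter a video-QA dataset down to visually grounded questions."""
    return [
        ex for ex in dataset
        if is_visually_grounded(ex["question"], ex["choices"],
                                ex["answer"], text_only_model)
    ]

# Toy demonstration: a "model" that always picks the longest choice,
# a common linguistic bias exploited by blind baselines.
longest_choice = lambda q, choices: max(choices, key=len)

dataset = [
    {"question": "What color is the car?",   # requires watching the video
     "choices": ["red", "blue"], "answer": "red"},
    {"question": "What happens after the explosion?",
     "choices": ["nothing", "people run away screaming"],
     "answer": "people run away screaming"},  # guessable from text alone
]

curated = curate(dataset, longest_choice)
print(len(curated))  # the text-guessable question is filtered out
```

In practice the text-only model would be a strong LLM queried without frames (possibly over multiple samples, keeping only questions it fails consistently); the toy heuristic above merely stands in for that check.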