🤖 AI Summary
Video Large Multimodal Models (Video-LMMs) struggle with complex spatiotemporal reasoning, particularly temporal localization, spatiotemporal grounding, long-video modeling, and multimodal evidence fusion; post-training is the phase that turns these perception systems into reasoners. Method: This survey organizes Video-LMM post-training into three pillars: (1) supervised fine-tuning (SFT) with chain-of-thought to instill spatiotemporal logical reasoning; (2) reinforcement learning (RL) from verifiable objectives to improve reasoning consistency; and (3) test-time scaling (TTS) through increased inference computation, paired with efficient long-video processing. Contribution/Results: The survey presents a structured taxonomy of these techniques, curates essential benchmarks, datasets, and evaluation metrics, and synthesizes design principles and evaluation protocols while surfacing open challenges in reward design, scalability, and cost-performance optimization, charting a path for Video-LMMs from perceptual recognition toward deep, systematic reasoning.
📝 Abstract
Video understanding represents one of the most challenging frontiers in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
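To make the "RL from verifiable objectives" pillar concrete, here is a minimal, illustrative sketch (not taken from the survey) of a verifiable reward for a video question-answering task with temporal grounding: the reward combines a deterministic exact-match check on the answer with the temporal IoU between a predicted and a ground-truth time segment. The function names and the weighting scheme are assumptions for illustration only; real systems vary in how they mix and normalize such terms.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time segments (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def verifiable_reward(pred_answer, gt_answer, pred_segment, gt_segment,
                      iou_weight=0.5):
    """Hypothetical composite reward for video QA with temporal grounding.

    Both components are deterministic checks against ground truth, so the
    reward can be recomputed and audited without a learned judge -- the
    property that makes such RL objectives 'verifiable'.
    """
    answer_r = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    iou_r = temporal_iou(pred_segment, gt_segment)
    return (1 - iou_weight) * answer_r + iou_weight * iou_r

# Example: correct answer, partially overlapping segment.
r = verifiable_reward("a cat", "A cat", (2.0, 8.0), (4.0, 10.0))
```

Because every term is computable from ground truth, rewards like this sidestep the reward-hacking risks of learned judges, at the cost of only capturing objectives that can be programmatically checked, one of the trade-offs in reward design the survey highlights.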