Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Large Multimodal Models (Video-LMMs) struggle with complex spatiotemporal reasoning, particularly temporal localization, spatiotemporal grounding, long-video modeling, and multimodal evidence fusion, and the post-training phase that addresses these gaps remains fragmented across the literature. Method: This survey organizes Video-LMM post-training into three pillars: (1) supervised fine-tuning (SFT) with chain-of-thought to strengthen spatiotemporal reasoning; (2) reinforcement learning (RL) from verifiable objectives to improve inference consistency; and (3) test-time scaling through added inference computation, paired with efficient long-video processing. Contribution/Results: It establishes a structured taxonomy and curates benchmarks, datasets, and evaluation protocols, surfacing trade-offs among reward design, scalability, and efficiency, and charting a path for Video-LMMs from perceptual recognition toward deep, systematic reasoning.
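The "verifiable objectives" mentioned above are typically programmatic checks on a rollout rather than learned reward models. Below is a minimal sketch of one such reward, assuming a composite of exact-match answer correctness and temporal IoU over event spans in seconds; the function name, span convention, and `alpha` weighting are illustrative, not taken from the paper.

```python
import re

def verifiable_reward(response: str, gold_answer: str,
                      pred_span=None, gold_span=None,
                      alpha: float = 0.5) -> float:
    """Hypothetical composite reward: answer exact-match plus
    temporal IoU of a predicted event span (start, end) in seconds."""
    normalize = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    # Answer correctness: 1.0 on normalized exact match, else 0.0.
    answer_r = 1.0 if normalize(response) == normalize(gold_answer) else 0.0

    # Temporal IoU for localization-style tasks (skipped if no spans given).
    iou_r = 0.0
    if pred_span and gold_span:
        (ps, pe), (gs, ge) = pred_span, gold_span
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = max(pe, ge) - min(ps, gs)  # overlapping spans: hull == union
        iou_r = inter / union if union > 0 else 0.0

    return alpha * answer_r + (1.0 - alpha) * iou_r
```

In the group-relative RL schemes common in this literature, rewards like this are computed per sampled rollout and normalized within the group to form advantages, so no learned critic is needed.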

📝 Abstract
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
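Of the three pillars, test-time scaling (TTS) is the only training-free one: it spends extra inference compute per question rather than extra gradient steps. A minimal self-consistency sketch, assuming a stochastic `generate` callable and an "Answer:" suffix convention (both hypothetical, not an API from the paper):

```python
from collections import Counter

def extract_answer(chain: str) -> str:
    # Hypothetical convention: the final answer follows "Answer:".
    return chain.rsplit("Answer:", 1)[-1].strip()

def self_consistency(generate, question: str, n: int = 8) -> str:
    """Sample n chain-of-thought rollouts and majority-vote the answers."""
    answers = [extract_answer(generate(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

The same voting loop plausibly extends to video by resampling frame subsets or prompts per rollout, which is one way TTS interacts with the long-video efficiency concerns the abstract raises.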
Problem

Research questions and friction points this paper is trying to address.

Examining post-training methods for video reasoning models
Addressing temporal localization and spatiotemporal grounding challenges
Optimizing long video efficiency and multimodal evidence integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training enhances Video-LMM reasoning capabilities
Supervised fine-tuning uses chain-of-thought for adaptation (see the sketch after this list)
Reinforcement learning optimizes models from verifiable objectives
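As a concrete illustration of the chain-of-thought SFT item above, video CoT corpora generally pair each question with an intermediate reasoning trace grounded in timestamps, not just a final answer. One hypothetical record (field names invented for illustration; the survey does not prescribe a schema):

```python
# A single hypothetical CoT-SFT training record for video QA.
cot_sft_example = {
    "video": "clip_0421.mp4",
    "question": "Why does the goalkeeper dive left at 00:42?",
    "reasoning": (
        "Frames around 00:38-00:41 show the striker's plant foot and hips "
        "opening toward the left post; the keeper reads this cue and "
        "commits left just before the kick at 00:42."
    ),
    "answer": "The striker's body orientation signals a shot toward the left post.",
}
```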
👥 Authors
Yunlong Tang · University of Rochester
Jing Bi · University of Rochester
Pinxin Liu · University of Rochester (Computer Vision, Natural Language Processing, Data Mining)
Zhenyu Pan · Northwestern University, Computer Science (Foundation Models, Information Retrieval, 3D World Generation)
Zhangyun Tan · University of Rochester
Qianxiang Shen · University of Rochester
Jiani Liu · University of Rochester
Hang Hua · University of Rochester (Computer Vision, Natural Language Processing, Machine Learning)
Junjia Guo · University of Rochester
Yunzhong Xiao · CMU
Chao Huang · University of Rochester
Zhiyuan Wang · UCSB
Susan Liang · University of Rochester (Computer Vision)
Xinyi Liu · Wuhan University (3D Reconstruction, Point Cloud and Image Integration, Computational Origami)
Yizhi Song · Research Scientist, ByteDance/TikTok (Image Generation, Generative AI, MLLM, Diffusion)
Yuhe Nie · NYU
Jia-Xing Zhong · University of Oxford (previously Peking University)
Bozheng Li · Brown University
Daiqing Qi · University of Virginia (Multimodal Learning, Machine Learning, Natural Language Processing, Computer Vision)
Ziyun Zeng · National University of Singapore (Video Understanding, Multi-modal LLM, Representation Learning)
Ali Vosoughi · University of Rochester PhD; Microsoft Research & Bosch AI; ML Research Scientist (Multimodal AI, Audio AI, Large Language Models, Generative AI, Computer Vision)
Luchuan Song · University of Rochester (Computer Vision, Computer Graphics, Animation)
Zeliang Zhang · PhD Candidate, University of Rochester; BEng, HUST (Trustworthy and Efficient AI)
Daiki Shimada · Sony Group Corp. (Computer Vision, Machine Learning)
Han Liu · Northwestern University