FrameMind: Frame-Interleaved Chain-of-Thought for Video Reasoning via Reinforcement Learning

📅 2025-09-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video understanding models rely on fixed-frame sampling strategies, limiting their ability to adaptively acquire spatiotemporal evidence in response to question semantics and thereby hindering performance on complex video question answering (VideoQA) tasks. To address this, we propose FrameMind, a framework enabling dynamic visual information acquisition without frame-level annotations. FrameMind integrates multi-round textual reasoning with active perception via reinforcement learning. It introduces Frame-Interleaved Chain-of-Thought (FiCOT) to model fine-grained, interleaved reasoning over frames and questions, and trains Dynamic Resolution Frame Sampling (DRFS) with DRFS-GRPO, a group-relative policy optimization algorithm, to enable goal-directed spatiotemporal perception. Evaluated on the MLVU and VideoMME benchmarks, FrameMind significantly outperforms state-of-the-art methods, demonstrating both the effectiveness and generalizability of adaptive perception for video understanding.

๐Ÿ“ Abstract
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.
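The abstract notes that DRFS-GRPO learns from outcome-based rewards without frame-level annotations: each question is answered by a group of sampled rollouts, and each rollout's reward is normalized against its group rather than against per-frame labels. A minimal sketch of that group-relative advantage step, assuming the standard GRPO-style normalization (function and variable names here are illustrative, not the paper's code):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one sampling group (GRPO-style):
    advantage_i = (r_i - group mean) / (group std + eps).
    Only a scalar outcome reward per rollout is needed, so no
    frame-level supervision enters the computation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one question, scored 1.0 if the final
# answer is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy is pushed toward sampling behaviors that led to correct answers within each group.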
Problem

Research questions and friction points this paper is trying to address.

Dynamic frame selection for adaptive video reasoning tasks
Overcoming limitations of fixed sampling in video understanding
Enhancing visual evidence gathering through reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic frame sampling via reinforcement learning
Frame-interleaved reasoning with active visual perception
Group-relative policy optimization without frame annotations
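The frame-interleaved reasoning described above can be pictured as a multi-turn tool-use loop: the model reasons in text, requests targeted frames when it identifies a knowledge gap, and eventually commits to an answer. The sketch below is an illustrative assumption about that control flow — `ToyVideo`, `toy_policy`, and the action tuples are stand-ins, not the paper's actual interface:

```python
from dataclasses import dataclass

@dataclass
class ToyVideo:
    """Stand-in for a decoded video: sampling a clip returns
    (timestamp, resolution) pairs instead of real frames."""
    duration: float

    def sample(self, t0, t1, res, n=4):
        step = (t1 - t0) / n
        return [(round(t0 + i * step, 2), res) for i in range(n)]

def ficot_answer(video, question, policy, max_turns=4):
    """Sketch of an interleaved reasoning loop: on each turn the policy
    either requests a clip at some resolution (active perception) or
    emits a final answer based on the evidence gathered so far."""
    context = [("question", question)]
    for _ in range(max_turns):
        kind, *args = policy(context)
        if kind == "answer":
            return args[0]
        t0, t1, res = args  # requested time span and resolution
        context.append(("frames", video.sample(t0, t1, res)))
    return "no answer within turn budget"

# Toy policy: one low-resolution sweep of the video, then answer.
def toy_policy(context):
    if not any(kind == "frames" for kind, _ in context):
        return ("sample", 0.0, 8.0, "low")
    return ("answer", "the event occurs early in the clip")

print(ficot_answer(ToyVideo(8.0), "When does the event occur?", toy_policy))
```

The key design point this sketch captures is that frame acquisition happens inside the reasoning loop, conditioned on what the model has already seen, rather than once up front with a fixed sampling budget.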
Haonan Ge · Southeast University · Vision Language Model
Yiwei Wang · University of California, Merced
Kai-Wei Chang · University of California, Los Angeles
Hang Wu · University of California, Merced
Yujun Cai · NTU → Meta → Lecturer (Assistant Professor) @UQ · Multi-Modal Perception, Vision-Language Models