STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability of advantage estimation in multimodal reinforcement learning for video question answering, which stems from weak, low-variance reward signals. To mitigate this, the authors propose a structured reinforcement learning framework that constructs multiple spatiotemporal variants of each input video and integrates a question-aware keyframe sampling mechanism to maintain temporal coverage while emphasizing semantically relevant regions. Furthermore, they introduce joint normalization across textual outputs and visual variants to broaden the scope of population-based comparisons, thereby amplifying the reward signal. By extending exploration from linguistic diversity to structured visual perturbations, the approach stabilizes policy updates. Extensive experiments on six benchmarks, including VideoMME and TempCompass, demonstrate consistent and substantial improvements over existing reinforcement learning baselines across multiple large multimodal models, confirming the method's effectiveness and generalization capability.
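The joint normalization described above can be sketched as follows. This is a simplified illustration, not the authors' code: rewards for all (visual variant, textual response) pairs are pooled into one group before computing GRPO-style standardized advantages, so a variant whose responses all receive the same reward still contributes a nonzero learning signal.

```python
import numpy as np

def joint_normalized_advantages(rewards):
    """GRPO-style advantages, normalized jointly over all
    (visual variant, textual response) pairs.

    rewards: shape (num_variants, num_responses), one scalar reward
             per sampled response per spatiotemporal video variant.
    """
    r = np.asarray(rewards, dtype=float)
    mu = r.mean()        # pooled mean over the whole group
    sigma = r.std()      # pooled std over the whole group
    if sigma < 1e-8:     # degenerate group: no learning signal at all
        return np.zeros_like(r)
    return (r - mu) / sigma

# Per-variant normalization would zero out variant 0 (all responses
# equally correct); pooling across variants keeps the signal alive.
rewards = [[1.0, 1.0, 1.0],   # variant 0: all responses correct
           [0.0, 0.0, 1.0]]   # variant 1: mostly wrong
adv = joint_normalized_advantages(rewards)
```

The design choice the paper motivates is visible here: the pooled standard deviation is nonzero even when each variant's own reward group is constant, which is exactly the low-variance failure mode of per-group normalization.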
📝 Abstract
We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
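The importance-aware sampling mechanism can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the paper does not publish this routine, and the per-frame relevance scores (e.g. question-frame embedding similarity) and the `coverage_ratio` split are hypothetical. The idea is to reserve part of the frame budget for uniformly spaced frames (temporal coverage) and spend the rest on the frames most relevant to the question.

```python
import numpy as np

def importance_aware_sample(frame_scores, k, coverage_ratio=0.5):
    """Select k frame indices: a uniform-stride subset for temporal
    coverage, topped up with the highest-scoring remaining frames.

    frame_scores: hypothetical per-frame relevance to the question.
    coverage_ratio: fraction of the budget reserved for uniform coverage.
    """
    scores = np.asarray(frame_scores, dtype=float)
    n = len(scores)
    k = min(k, n)
    n_cov = max(1, int(round(k * coverage_ratio)))
    # Uniformly spaced frames preserve temporal coverage of the video.
    chosen = set(np.linspace(0, n - 1, n_cov).round().astype(int).tolist())
    # Fill the remaining budget with the most question-relevant frames.
    for idx in np.argsort(scores)[::-1]:
        if len(chosen) >= k:
            break
        chosen.add(int(idx))
    return sorted(chosen)
```

With a 10-frame video, a budget of 4, and relevance peaking around frames 3-4, the selection mixes the endpoints (coverage) with the high-scoring middle frames (relevance), rather than overfitting to a single spatiotemporal configuration.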
Problem

Research questions and friction points this paper is trying to address.

video question answering
reinforcement learning
reward variance
spatiotemporal exploration
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured spatiotemporal exploration
importance-aware sampling
multimodal reinforcement learning
video question answering
joint normalization