Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional text-based Next Event Prediction (NEP) struggles to capture dynamic physical-world information. To address this limitation, we propose Video Next Event Prediction (VNEP), a paradigm in which the predicted outcome is a video rather than text. We introduce VANS-Data-100K, a large-scale dataset built specifically for VNEP. To meet the twin challenges of semantic fidelity and temporal coherence, we design Joint-GRPO, a reinforcement-learning framework that jointly optimizes a vision-language model and a video diffusion model under a shared reward, enabling end-to-end training toward semantically accurate, temporally consistent videos. Our approach overcomes the expressive bottleneck of purely textual prediction, supporting both procedural learning and creative exploration through intuitive multimodal generation. Evaluated on procedural and predictive benchmarks, the method achieves state-of-the-art performance in both event-prediction accuracy and the semantic-visual alignment of the generated videos.

📝 Abstract
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
Problem

Research questions and friction points this paper is trying to address.

Extending video as an answer modality for next-event prediction (NEP) tasks
Generating dynamic video responses instead of text for procedural questions
Aligning vision-language models with video diffusion for consistent video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages reinforcement learning to align a vision-language model (VLM) with a video diffusion model (VDM)
Uses Joint-GRPO with a shared reward to coordinate the VLM and VDM as a unit
Generates videos faithful to the predicted captions and the input visual context
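The shared-reward idea behind Joint-GRPO can be sketched in miniature: each sampled rollout (a predicted caption plus the video generated from it) receives one blended reward, and GRPO-style advantages are computed relative to the group of rollouts rather than against a learned value function. The function names, the linear reward blend, and the weight `w` below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a shared-reward, group-relative (GRPO-style)
# advantage computation. All names and weights are hypothetical.
from statistics import mean, pstdev


def joint_reward(caption_score: float, video_score: float, w: float = 0.5) -> float:
    """Blend a caption-accuracy score (VLM side) with a
    video-faithfulness score (VDM side) into one shared reward."""
    return w * caption_score + (1.0 - w) * video_score


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: each rollout is scored relative to the
    group mean, normalized by the group's standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled (caption, video) rollouts for one input clip.
samples = [(0.9, 0.8), (0.6, 0.7), (0.4, 0.9), (0.2, 0.3)]
rewards = [joint_reward(c, v) for c, v in samples]
advs = group_relative_advantages(rewards)
```

Because both models are scored by the same reward, a caption that is accurate but hard to visualize (or a video that drifts from its caption) drags the whole rollout's advantage down, which is what pushes the VLM and VDM to behave as a unit.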