EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study presents the first systematic investigation of temporal reasoning consistency in video large language models (Video-LLMs) across synchronized multi-perspective (egocentric and exocentric) videos. To this end, we introduce EgoExo-Con, the first benchmark for cross-perspective temporal consistency evaluation, featuring human-refined natural language queries and precisely synchronized multi-view video pairs. We propose View-GRPO, a reinforcement learning framework that strengthens single-view temporal reasoning while explicitly enforcing cross-perspective semantic alignment and temporal consistency. Experiments demonstrate significant improvements over supervised fine-tuning (SFT) and standard GRPO baselines on Temporal Verification and Temporal Grounding, with enhanced cross-perspective consistency and no degradation in single-view performance. Our core contributions are threefold: (1) establishing the first evaluation paradigm for cross-perspective temporal consistency; (2) constructing the dedicated EgoExo-Con benchmark; and (3) designing a novel RL training framework that balances view-specific capability and cross-perspective consistency.
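The summary does not spell out the reward used by View-GRPO; the sketch below shows one plausible formulation that combines per-view temporal accuracy with an ego-exo agreement term. The function names, the IoU-based rewards, and the `consistency_weight` parameter are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a View-GRPO-style reward: per-view temporal grounding
# accuracy plus a cross-view consistency term. Names and weights are illustrative.

def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) segment in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def view_grpo_reward(pred_ego, pred_exo, gt, consistency_weight=0.5):
    """Combine correctness in each view with an ego-exo agreement bonus."""
    r_ego = temporal_iou(pred_ego, gt)                 # correctness in the ego view
    r_exo = temporal_iou(pred_exo, gt)                 # correctness in the exo view
    r_consistency = temporal_iou(pred_ego, pred_exo)   # do the two views agree?
    return 0.5 * (r_ego + r_exo) + consistency_weight * r_consistency

# Example: ego prediction 2.0-6.0s, exo prediction 2.5-6.5s, ground truth 2.0-6.5s
print(view_grpo_reward((2.0, 6.0), (2.5, 6.5), (2.0, 6.5)))
```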

📝 Abstract
Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; (2) when naively fine-tuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
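To make the consistency criterion concrete, the sketch below shows one way it could be operationalized for the two tasks, under the assumption that a prediction counts only if both synchronized views are handled correctly. The IoU threshold and function names are illustrative, not the benchmark's exact protocol.

```python
# Minimal sketch of cross-view consistency checks for the two EgoExo-Con tasks.
# Thresholds and names are assumptions made for illustration.

def verification_consistent(ans_ego: bool, ans_exo: bool, label: bool) -> bool:
    """Temporal Verification: both views must return the correct yes/no answer."""
    return ans_ego == label and ans_exo == label

def grounding_consistent(seg_ego, seg_exo, gt, iou_thresh=0.5) -> bool:
    """Temporal Grounding: both views must localize the event above an IoU threshold."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    return iou(seg_ego, gt) >= iou_thresh and iou(seg_exo, gt) >= iou_thresh

# A model that localizes the event in only one view fails the consistency check:
print(grounding_consistent((2.0, 6.0), (10.0, 12.0), (2.0, 6.5)))  # False
```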
Problem

Research questions and friction points this paper is trying to address.

Can Video-LLMs maintain consistent temporal understanding when the same event is captured from egocentric and exocentric viewpoints?
Existing benchmarks lack synchronized egocentric-exocentric video pairs for evaluating cross-view temporal consistency, motivating EgoExo-Con.
Naive fine-tuning on both viewpoints improves consistency but often underperforms single-view training, motivating the View-GRPO framework.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces EgoExo-Con, a benchmark of synchronized egocentric-exocentric video pairs with human-refined queries
Proposes View-GRPO, a reinforcement learning framework built on group-relative policy optimization (see the sketch after this list)
Strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints
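Since View-GRPO builds on GRPO-style training, the sketch below shows only the generic group-relative advantage computation that GRPO relies on, with placeholder reward values; the view-specific and consistency rewards described above would be layered on top. This is not the paper's training code.

```python
# Generic group-relative advantage used in GRPO-style training: rewards for a
# group of sampled responses are normalized within the group, so no learned
# value function is required. Reward values below are placeholders.

from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for, e.g., four sampled answers to the same ego/exo query pair:
print(group_relative_advantages([1.28, 0.95, 0.40, 1.10]))
```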
🔎 Similar Papers
No similar papers found.