🤖 AI Summary
To address insufficient spatiotemporal reasoning and interpretability in first-person video understanding, this paper introduces EgoVLM, a vision-language model designed specifically for egocentric video contexts in embodied AI. Methodologically: (1) it fine-tunes with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) method adapted to align model outputs with human-like reasoning steps; (2) it designs a keyframe-aware reward mechanism; (3) following DeepSeek R1-Zero, it forgoes supervised chain-of-thought fine-tuning entirely, relying on purely RL-driven reasoning alignment; and (4) it integrates keyframe selection, spatiotemporal modeling, and joint visual-linguistic representation learning. On the EgoSchema benchmark, EgoVLM-3B achieves substantial gains, outperforming Qwen2.5-VL 3B and 7B by 14.33 and 13.87 accuracy points, respectively, while the explicit reasoning traces it generates enhance interpretability and domain adaptability.
📝 Abstract
Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first-person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero's approach, we tune directly with RL, without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.
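The core of GRPO is that it replaces a learned value-function baseline with a group-relative one: for each prompt, several responses are sampled, scored by the reward, and each response's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch of that normalization step follows; the function name, the `eps` stabilizer, and the example rewards are illustrative assumptions, not code from EgoVLM.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against the group
    mean and population std, as GRPO does in place of a critic.
    `eps` (an assumed stabilizer) guards against zero variance."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: four sampled answers to one egocentric
# question, scored by some correctness/keyframe reward.
rewards = [1.0, 0.0, 0.5, 0.5]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])  # group-centered, zero-mean advantages
```

These advantages then weight the policy-gradient update for each response's tokens, so responses that beat their own group are reinforced without training a separate value model.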