EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty that current multimodal large language models have with precise interaction reasoning and fine-grained pixel-level grounding in egocentric vision. The authors propose EARL, a two-stage parsing framework that first generates a global interaction description as a semantic prior and then produces a textual answer and a pixel-wise mask conditioned on the user query. Its key components are an Analysis-guided Feature Synthesizer (AFS), which integrates the prior into query-oriented reasoning, and a multi-faceted reward function used to train the response stage with GRPO, so that heterogeneous outputs (textual answers, bounding boxes, and grounding masks) are optimized together. On Ego-IRGBench, EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, and out-of-distribution grounding results on EgoHOS indicate strong transferability to unseen egocentric scenarios.
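The summary reports grounding quality as cIoU without defining it. In the referring-segmentation literature, cIoU usually means cumulative IoU: total intersection pixels over total union pixels, summed across the evaluation set. The sketch below assumes that definition. The function name `ciou` and the list-of-rows mask format are illustrative choices, not taken from the paper.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (illustrative format, not from the paper).
Mask = Sequence[Sequence[int]]


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative IoU: summed intersection over summed union across all samples.

    Assumes the common referring-segmentation definition of cIoU; the paper's
    exact evaluation protocol is not given in this summary.
    """
    inter = 0
    union = 0
    for pred, tgt in zip(preds, targets):
        for pred_row, tgt_row in zip(pred, tgt):
            for p, t in zip(pred_row, tgt_row):
                inter += 1 if (p and t) else 0
                union += 1 if (p or t) else 0
    # Define the score as 0.0 when every mask is empty, to avoid division by zero.
    return inter / union if union else 0.0


# Two toy samples: the first overlaps on 1 of 2 pixels, the second matches exactly.
toy_preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
toy_targets = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
toy_score = ciou(toy_preds, toy_targets)  # (1 + 4) / (2 + 4)
```

Because intersections and unions are pooled before dividing, large objects weigh more than small ones, which is why cIoU can differ noticeably from a per-image mean IoU.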

📝 Abstract

Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
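The abstract says the response stage is trained with GRPO on a multi-faceted reward covering textual answers, bounding boxes, and grounding masks, but it gives no reward terms or weights. The sketch below is therefore only an illustration under stated assumptions: the weighted-sum form, the equal default weights, and the names `box_iou`, `combined_reward`, and `group_relative_advantages` are hypothetical. The one part that follows the published GRPO formulation is the group-relative advantage, which normalizes each sampled response's reward by the mean and standard deviation of its group (population standard deviation is used here for simplicity).

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Plain IoU between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def combined_reward(
    text_score: float,
    box_score: float,
    mask_score: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> float:
    """Hypothetical weighted sum over the three output types named in the abstract."""
    w_text, w_box, w_mask = weights
    return w_text * text_score + w_box * box_score + w_mask * mask_score


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three sampled responses to one query (all scores invented).
gt_box: Box = (0.0, 0.0, 2.0, 2.0)
samples = [
    (1.0, (0.0, 0.0, 2.0, 2.0), 0.9),  # correct answer, exact box, good mask
    (1.0, (1.0, 1.0, 3.0, 3.0), 0.4),  # correct answer, shifted box, weak mask
    (0.0, (5.0, 5.0, 6.0, 6.0), 0.0),  # wrong answer, disjoint box, empty mask
]
rewards = [combined_reward(t, box_iou(b, gt_box), m) for t, b, m in samples]
advantages = group_relative_advantages(rewards)
```

Normalizing within the group means a response is rewarded for being better than its siblings sampled from the same query, so no separate value network is needed to estimate a baseline.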

Problem

Research questions and friction points this paper is trying to address.

egocentric vision

interaction reasoning

pixel grounding

multimodal large language models

embodied intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric vision

reinforcement learning

pixel grounding