OmniEgo-R$^2$: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

📅 2026-05-23

🤖 AI Summary

This work addresses unstable embodied reasoning over cross-domain first-person videos, which stems from ambiguous temporal boundaries, inconsistent semantic granularity, and closely matched answer options. It formulates the EgoCross task as a cross-domain embodied video reasoning problem and introduces a lightweight test-time routed inference framework. Without modifying the Qwen3-VL-4B-SFT backbone, the method improves robustness under sparse frame sampling and close distractors through temporal evidence normalization, domain-agnostic capability routing, structured perception–dynamics–decision reasoning, boundary-aware option verification, and defensive answer calibration. The approach placed second in both the Source-Limited and Open-Source tracks of the CVPR 2026 EgoCross Challenge, with accuracies of 66.35% and 66.77%, respectively.

📝 Abstract

The 1st Cross-Domain EgoCross Challenge at EgoVis, CVPR 2026 evaluates whether multimodal large language models can reason over egocentric videos across four domains: surgery, industry, extreme sports, and animal perspective. We achieved second place in both the Source-Limited and Open-Source tracks. In this report, we formulate EgoCross as a robust cross-domain embodied video reasoning problem rather than a simple multiple-choice visual question answering task. We identify three key challenges: (C1) temporal boundary ambiguity, where critical state transitions are sparsely sampled and often occur between frames; (C2) cross-domain semantic granularity mismatch, where the same capability requires different domain-specific visual grammar; and (C3) decision instability under close options, where long multimodal reasoning can select unsupported distractors or produce malformed outputs. To address these challenges, we propose OmniEgo-R$^2$ (Omnidomain Egocentric Routed Reasoning), a unified routed reasoning pipeline consisting of temporal-evidence normalization, domain-agnostic capability routing, structured perception–dynamics–decision reasoning, boundary-aware option verification, and defensive answer calibration. OmniEgo-R$^2$ uses the Qwen3-VL-4B-SFT checkpoints on each EgoCross domain as the visual-language backbone and wraps them with lightweight test-time reasoning and parsing programs. Our final submissions obtain 66.35% overall accuracy in the Source-Limited track and 66.77% in the Open-Source track, ranking second on both leaderboards.
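To make the test-time wrapper idea more concrete, the following is a minimal sketch of how capability routing, a staged perception–dynamics–decision prompt, and defensive answer parsing might fit together. It is not the authors' implementation: the capability labels, keyword cues, prompt wording, fallback letter, and the `generate` callable (a stand-in for a call to the backbone model) are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Sequence

# Hypothetical capability labels and keyword cues; the report does not list them.
CAPABILITY_KEYWORDS: Dict[str, Sequence[str]] = {
    "counting": ("how many", "number of"),
    "prediction": ("next", "will", "after this"),
    "localization": ("where", "when", "which frame"),
}


def route_capability(question: str) -> str:
    """Pick a capability from the question wording alone, ignoring the domain."""
    q = question.lower()
    for capability, cues in CAPABILITY_KEYWORDS.items():
        if any(cue in q for cue in cues):
            return capability
    return "identification"  # assumed default route


def parse_answer(raw: str, n_options: int, fallback: str = "A") -> str:
    """Defensively map free-form model output to one valid option letter."""
    valid = "ABCDEFGH"[:n_options]
    # Prefer an explicit "Answer: X" / "answer is X" pattern (last occurrence).
    # Only the word "answer" is case-insensitive; the letter must be uppercase.
    explicit = re.findall(rf"(?i:answer\s*(?:is|:)?)\s*\(?([{valid}])\b", raw)
    if explicit:
        return explicit[-1]
    # Otherwise accept the last standalone uppercase option letter.
    standalone = re.findall(rf"\b([{valid}])\b", raw)
    if standalone:
        return standalone[-1]
    # Malformed output: return a fixed fallback (an assumed policy).
    return fallback


def answer_question(
    question: str,
    options: Sequence[str],
    generate: Callable[[str], str],
) -> str:
    """Route the question, build a staged prompt, and parse the reply."""
    capability = route_capability(question)
    letters = "ABCDEFGH"[: len(options)]
    listed = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = (
        f"[capability: {capability}]\n"
        "Step 1 (perception): describe the visible objects and hands.\n"
        "Step 2 (dynamics): describe what changes across the frames.\n"
        "Step 3 (decision): check each option against the evidence.\n"
        f"Question: {question}\n{listed}\n"
        "End with 'Answer: <letter>'."
    )
    return parse_answer(generate(prompt), len(options))
```

In this sketch the backbone is never modified; only the prompt construction and output parsing around it change, which is one way to read "lightweight test-time reasoning and parsing programs". A real system would pass video frames to the model alongside the prompt, which the stand-in `generate` callable omits.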

Problem

Research questions and friction points this paper is trying to address.

cross-domain

egocentric video

temporal boundary ambiguity

semantic granularity mismatch

decision instability

Innovation

Methods, ideas, or system contributions that make the work stand out.

routed reasoning

cross-domain egocentric video

temporal boundary awareness