MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

📅 2026-05-18

🤖 AI Summary

This work addresses the closed-ended question-answering task on the CASTLE 2024 dataset: 185 questions that require reasoning over four days of egocentric activity, 15 synchronized perspectives, and multimodal auxiliary signals. The authors propose an agent-based multimodal reasoning framework that builds a unified memory bank covering videos, transcripts, gaze (eye-tracking) data, heart rate measurements, photos, and thermal imagery. A central agent selects relevant evidence from this memory, or explicitly requests a missing modality, before answering. By combining multimodal evidence selection with agent-driven decision-making, the approach goes beyond text-only QA pipelines and brings long videos together with heterogeneous data sources. The system uses DeepSeek for video summarization and GPT-5.4 as its decision agent, and it placed second in the EgoVis 2026 CASTLE Challenge.

📝 Abstract

This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heart rate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources (videos and transcripts) and four auxiliary sources (gaze, heart rate, photos, and thermal imagery). Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our code is available at https://github.com/Hyu-Zhang/MARS.
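The inference-time loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Decision`, `answer_question`, `decide`, and `max_steps` are hypothetical, the `decide` callback stands in for the GPT-5.4 decision agent, and starting from the two primary sources before requesting auxiliary ones is an assumption made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Source names follow the abstract; the string keys themselves are assumptions.
PRIMARY = ("video", "transcript")
AUXILIARY = ("gaze", "heart_rate", "photo", "thermal")


@dataclass
class Decision:
    action: str          # one of "continue", "request", "answer", "fallback"
    modality: str = ""   # used when action == "request"
    answer: str = ""     # used when action == "answer"


def answer_question(
    question: str,
    options: List[str],
    memory: Dict[str, List[str]],
    decide: Callable[[str, List[str], Dict[str, List[str]]], Decision],
    max_steps: int = 8,
    rng: Optional[random.Random] = None,
) -> str:
    """Run a hypothetical decision loop over a multimodal evidence memory."""
    rng = rng or random.Random(0)
    # Assumption: the agent starts with primary evidence only.
    evidence = {m: memory[m] for m in PRIMARY if m in memory}
    for _ in range(max_steps):
        d = decide(question, options, evidence)
        if d.action == "answer" and d.answer in options:
            return d.answer
        if d.action == "request" and d.modality in memory:
            # Add the requested source to the evidence the agent can see.
            evidence[d.modality] = memory[d.modality]
        elif d.action == "fallback":
            break
        # "continue" (or an unavailable request) simply triggers another step.
    # Evidence stayed insufficient: fall back to a random option.
    return rng.choice(options)


# Usage with a scripted stand-in for the decision agent.
def scripted_agent(question, options, evidence):
    if "heart_rate" not in evidence:
        return Decision("request", modality="heart_rate")
    return Decision("answer", answer=options[1])


toy_memory = {
    "video": ["summary of day 1, kitchen camera"],
    "transcript": ["transcript excerpt"],
    "heart_rate": ["heart rate trace for one participant"],
}
result = answer_question("Who was most active?", ["A", "B"], toy_memory, scripted_agent)
```

Bounding the loop with `max_steps` guarantees that every question receives some answer even when the agent never commits, which mirrors the random-option fallback mentioned in the abstract.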

Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning

egocentric vision

evidence selection

CASTLE challenge

activity understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal reasoning

agentic evidence selection

egocentric vision