VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

📅 2026-03-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses two challenges in video understanding: the visual information lost when video is compressed into text, and the high computational cost of processing long-video context. To overcome these limitations, the authors propose VideoAtlas, a task-agnostic, hierarchical grid-based video representation that supports recursive zooming and structured exploration in a lossless, navigable manner, without requiring captions or preprocessing. Building on this representation, they introduce Video-RLM, which formulates navigation as a Markov decision process and uses a parallel Master-Worker architecture. It achieves logarithmic compute scaling with video duration on benchmarks spanning 1 to 10 hours, supports environment budgeting through a bound on exploration depth, and allocates compute adaptively according to question granularity. The approach attains multimodal cache hit rates of 30-60% and remains the most duration-robust method when scaling from 1-hour to 10-hour benchmarks, while keeping recursive visual reasoning lossless.
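The logarithmic claim can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a hypothetical grid of 64 cells per level (for example 8 x 8) and a hypothetical target of one second per leaf cell, and counts how many zoom steps are needed before a cell is that fine.

```python
def zoom_depth(duration_s: float, cells_per_grid: int = 64,
               leaf_span_s: float = 1.0) -> int:
    """Count zoom levels until one cell spans at most `leaf_span_s` seconds.

    `cells_per_grid` and `leaf_span_s` are illustrative assumptions,
    not values reported by the authors.
    """
    depth = 0
    span = duration_s
    while span > leaf_span_s:
        span /= cells_per_grid  # each zoom splits the span into equal cells
        depth += 1
    return depth


for hours in (1, 10):
    print(f"{hours} h -> depth {zoom_depth(hours * 3600)}")
```

Under these assumptions a 1-hour video needs 2 zoom levels and a 10-hour video needs 3, so a tenfold increase in duration adds a single level.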

πŸ“ Abstract
Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid that is simultaneously lossless, navigable, scalable, and caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures that access depth grows only logarithmically with video length. For long context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Formulating VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture in which a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
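As a rough illustration of the Master-Worker navigation and depth budgeting described in the abstract, the following is a minimal sketch and not the authors' implementation. It models only the temporal hierarchy; `Region`, `expand`, `worker`, `master`, and the `score` callback are hypothetical names. In the real system the relevance judgement would come from a vision-language model looking at frame grids; here `score` is any function the caller supplies.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: float  # seconds
    end: float    # seconds
    depth: int    # zoom level; 0 is the whole video


def expand(region, cells=4):
    """Split a region into `cells` equal sub-regions one level deeper."""
    step = (region.end - region.start) / cells
    return [Region(region.start + i * step, region.start + (i + 1) * step,
                   region.depth + 1) for i in range(cells)]


def worker(region, score, max_depth, cells=4):
    """Greedily zoom into the best-scoring child until the depth budget is hit."""
    while region.depth < max_depth:
        region = max(expand(region, cells), key=score)
    return region


def master(duration_s, score, max_depth, cells=4):
    """Assign each top-level cell to a worker and keep the best leaf found.

    Assumes max_depth >= 1, since workers start at depth 1.
    """
    root = Region(0.0, duration_s, 0)
    with ThreadPoolExecutor() as pool:
        leaves = list(pool.map(lambda r: worker(r, score, max_depth, cells),
                               expand(root, cells)))
    return max(leaves, key=score)


# Toy stand-in for a visual relevance judgement: prefer regions whose
# midpoint is close to an event at t = 5000 s in a 10-hour video.
best = master(36000.0, lambda r: -abs((r.start + r.end) / 2 - 5000.0),
              max_depth=3)
print(best)
```

Here `max_depth` plays the role of the environment budget: raising it lets workers reach finer regions at the cost of more expansions, while lowering it caps compute.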
Problem

Research questions and friction points this paper is trying to address.

video representation
long-context video understanding
lossy approximation
visual fidelity
scalable video processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoAtlas
hierarchical video representation
lossless navigation
Recursive Language Models
logarithmic compute
Authors
Mohamed Eltahir
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Ali Habibullah
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Yazan Alshoibi
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Lama Ayash
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia; Department of Computer Science, King Khalid University (KKU), Abha, Saudi Arabia
Tanveer Hussain
Lecturer at Department of Computer Science, Edge Hill University
Computer Vision, Video Summarisation, Saliency Detection, Fire/Smoke Detection
Naeemullah Khan
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia