AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to long-form video understanding follow a one-shot encoding paradigm and struggle to achieve broad temporal coverage, fine-grained visual detail, and computational efficiency at the same time: they either discard critical information through aggressive compression or incur substantial memory and latency overheads. This work proposes AdaFocus, a progressive evidence acquisition framework that generates a compact video preview via query-aware adaptive relevance-diversity sampling (AdaRD) and, when the model is not confident, triggers a zero-cache look-back that fetches high-resolution frames directly from disk. Requiring no pre-cached frames, the method achieves a better efficiency-accuracy trade-off than strong baselines across seven benchmarks, including gains of +2.59 accuracy on VideoMME and +8.39 mIoU on Charades-STA over single-pass inference, and a ~33× reduction in visual token consumption compared with conventional dense encoding.
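To make the relevance-diversity idea concrete, here is a minimal sketch of query-aware frame selection. It is not the paper's AdaRD algorithm, whose details are not given here; it swaps in a greedy maximal-marginal-relevance (MMR) rule as a stand-in. The function names, the `trade_off` parameter, and the assumption that frames and the query are already embedded as plain vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(frame_embs, query_emb, k, trade_off=0.5):
    """Greedy MMR-style selection (an assumed stand-in, not AdaRD itself).

    Each step picks the frame that maximises
        trade_off * relevance_to_query - (1 - trade_off) * redundancy,
    where redundancy is the highest similarity to any frame already chosen.
    Returns the chosen frame indices in temporal order.
    """
    relevance = [cosine(f, query_emb) for f in frame_embs]
    selected = []
    candidates = set(range(len(frame_embs)))

    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(frame_embs[i], frame_embs[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance[i] - (1 - trade_off) * redundancy

        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)

    return sorted(selected)
```

With `trade_off=1.0` the rule reduces to plain top-k relevance; lower values penalise near-duplicate frames, which is the behaviour a compact preview needs.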

📝 Abstract

Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.
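The uncertainty-triggered look-back can be sketched as follows. This is an illustrative control loop under stated assumptions, not the authors' implementation: confidence is measured here by the entropy of the answer distribution, and `predict`, `load_frames`, `span`, and `max_entropy` are hypothetical names. The only property taken from the abstract is that high-resolution frames are read from disk on demand, and only when the first pass is not confident.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def answer_with_lookback(preview, predict, load_frames, span, max_entropy=0.5):
    """Answer from the preview; look back to disk only when uncertain.

    preview     -- compact list of preview frames
    predict     -- hypothetical model call: frames -> answer probabilities
    load_frames -- hypothetical loader: span -> iterable of high-resolution
                   frames decoded from disk at call time (nothing pre-cached)
    span        -- assumed (start, end) time range to re-inspect
    max_entropy -- assumed confidence threshold

    Returns (probabilities, looked_back).
    """
    probs = predict(preview)
    if entropy(probs) <= max_entropy:
        # Confident first pass: no disk I/O at all.
        return probs, False

    # Uncertain: fetch detail now, and only for the requested span.
    detail = list(load_frames(span))
    return predict(list(preview) + detail), True
```

Because the loader is invoked only inside the uncertain branch, a confident query never touches the high-resolution frames, which is what keeps the memory footprint independent of video length in this sketch.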

Problem

Research questions and friction points this paper is trying to address.

long video understanding

temporal coverage

visual details

computational efficiency

evidence loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

AdaFocus

adaptive sampling

zero-cache retrieval