🤖 AI Summary
Under the one-shot encoding paradigm, existing approaches to long-form video understanding struggle to achieve broad temporal coverage, fine-grained visual detail, and computational efficiency at the same time: they either sacrifice critical information through aggressive compression or incur substantial memory and latency overheads. This work proposes AdaFocus, a progressive evidence acquisition framework that builds a compact video preview via query-aware adaptive relevance-diversity sampling (AdaRD) and, when the model is uncertain, triggers a zero-cache on-demand retrieval mechanism that fetches high-resolution frames directly from disk. Requiring no in-memory frame pre-caching, the method offers a better efficiency-accuracy trade-off than strong baselines across seven benchmarks, for example +2.59 accuracy on VideoMME and +8.39 mIoU on Charades-STA over single-pass inference, while reducing visual token consumption by roughly 33×.
📝 Abstract
Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard the fine-grained evidence needed for downstream reasoning. As a result, current models struggle to balance temporal coverage, visual detail, and computational efficiency.
We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading.
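The two components described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `lam`, `min_peak`, and `max_entropy` parameters, and the use of precomputed frame and query embeddings are all assumptions made for illustration. A maximal-marginal-relevance style greedy selection stands in for AdaRD, a pure-diversity (farthest-point) selection stands in for the global clustering fallback, and answer entropy stands in for the uncertainty signal. The zero-cache disk retrieval step is not shown.

```python
import numpy as np


def _normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def select_preview(frame_emb, query_emb, k, lam=0.7, min_peak=0.3):
    """Pick up to k frame indices, trading query relevance against redundancy.

    Hypothetical stand-in for a relevance-diversity sampler. If no frame is
    clearly relevant to the query (peak similarity below min_peak), the query
    term is dropped and selection becomes diversity-only.
    """
    frames = _normalize(frame_emb)          # (n, d)
    query = _normalize(query_emb)           # (d,)
    relevance = frames @ query              # cosine similarity to the query
    sim = frames @ frames.T                 # pairwise frame similarity

    grounded = relevance.max() >= min_peak  # assumed grounding test
    if not grounded:
        lam = 0.0                           # ignore the query entirely

    # Seed with the most relevant frame, or the first frame when ungrounded.
    chosen = [int(np.argmax(relevance)) if grounded else 0]
    while len(chosen) < min(k, len(frames)):
        # Redundancy: similarity to the closest already-selected frame.
        redundancy = sim[:, chosen].max(axis=1)
        score = lam * relevance - (1.0 - lam) * redundancy
        score[chosen] = -np.inf             # never pick a frame twice
        chosen.append(int(np.argmax(score)))
    return sorted(chosen)                   # restore temporal order


def needs_lookback(answer_probs, max_entropy=0.5):
    """Return True when the answer distribution is too flat to trust.

    Assumed uncertainty signal: Shannon entropy (in nats) of the model's
    probabilities over candidate answers, compared with a fixed threshold.
    """
    p = np.asarray(answer_probs, dtype=float)
    p = p / p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return bool(entropy > max_entropy)
```

In this sketch, a caller would first run `select_preview` to choose the preview frames, answer from that preview, and fetch additional high-resolution frames only if `needs_lookback` returns `True` for the resulting answer distribution.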
Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33× and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.