CATS: Curvature Aware Temporal Selection for efficient long video understanding

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficiently selecting keyframes from long videos under limited computational resources to support accurate understanding by multimodal large language models (MLLMs). The authors propose a curvature-aware temporal frame selection method that, for the first time, leverages temporal curvature to model the rhythm of semantic changes in video content. By analyzing the temporal geometry of query-frame relevance, the approach adapts sampling density to retain both abrupt events and gradual transitions while discarding redundant frames. On the LongVideoBench and VideoMME benchmarks, it retains approximately 93–95% of the performance of the multi-stage MIRA method at only 3–4% of MIRA's preprocessing cost, and under an LLM-as-a-judge evaluation it produces more coherent and informative video descriptions.

📝 Abstract

Understanding long videos with multimodal large language models (MLLMs) requires selecting a small subset of informative frames under strict computational budgets, where exhaustive processing is infeasible and optimal selection is combinatorial. We propose CATS, a curvature-aware frame selection method that explicitly models the temporal geometry of query-frame relevance to identify salient events and their surrounding context. By leveraging temporal curvature to adapt selection density, CATS captures both abrupt transitions and gradually evolving content while suppressing redundant frames. Under a fixed backbone and frame budget, CATS consistently outperforms prior lightweight approaches such as AKS on LongVideoBench and VideoMME. While multi-stage methods such as MIRA achieve higher absolute accuracy, they incur substantial computational overhead; in contrast, CATS retains approximately 93–95% of MIRA's performance while requiring only 3–4% of its preprocessing cost, yielding a favorable efficiency-accuracy trade-off. Beyond answer accuracy, we evaluate description generation using an LLM-as-a-judge protocol, and the results show that CATS produces more coherent and informative outputs, indicating improved grounding in visual evidence. These results position CATS as a computationally efficient and principled approach to long-video understanding.
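
The abstract does not spell out how temporal curvature is computed or how it drives selection density, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that a per-frame query-frame relevance score is already available, approximates curvature by the absolute second-order finite difference of that score, and spreads the frame budget by taking equally spaced quantiles of a combined relevance-plus-curvature weight. The names `temporal_curvature` and `select_frames` and the parameters `alpha` and `floor` are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def temporal_curvature(relevance):
    """Absolute second-order finite difference of a 1-D relevance signal.

    Assumption: this discrete proxy stands in for the paper's curvature,
    whose exact definition is not given in the abstract.
    """
    n = len(relevance)
    curv = [0.0] * n  # endpoints have no two-sided neighbourhood, keep 0
    for t in range(1, n - 1):
        curv[t] = abs(relevance[t - 1] - 2.0 * relevance[t] + relevance[t + 1])
    return curv


def select_frames(relevance, budget, alpha=1.0, floor=1e-3):
    """Return at most `budget` sorted frame indices.

    Frames with high relevance or high curvature receive a larger weight,
    so equally spaced quantiles of the cumulative weight land more densely
    around events and more sparsely in flat, redundant stretches.
    `alpha` (curvature weight) and `floor` (minimum weight per frame) are
    illustrative knobs, not parameters taken from the paper.
    """
    n = len(relevance)
    if budget <= 0 or n == 0:
        return []
    if budget >= n:
        return list(range(n))

    curv = temporal_curvature(relevance)
    max_curv = max(curv) or 1.0  # avoid division by zero on flat signals
    weights = [
        floor + max(r, 0.0) + alpha * c / max_curv
        for r, c in zip(relevance, curv)
    ]

    cdf = list(accumulate(weights))
    total = cdf[-1]
    chosen = set()
    for k in range(budget):
        target = (k + 0.5) / budget * total  # midpoint of the k-th quantile bin
        chosen.add(min(bisect_left(cdf, target), n - 1))
    return sorted(chosen)


# Toy signal: 100 frames, mostly irrelevant, with a short event at frames 40-44.
toy_relevance = [0.1] * 40 + [0.9] * 5 + [0.1] * 55
picked = select_frames(toy_relevance, budget=8)
print(picked)
```

On the toy signal, several of the eight picks fall on or next to the short event while the remainder cover the flat stretches sparsely, which is the qualitative behaviour the abstract describes. Because duplicate quantile hits are merged, the function can return fewer frames than the budget; a fuller implementation would need a policy for redistributing those leftover slots.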

Problem

Research questions and friction points this paper is trying to address.

long video understanding

frame selection

computational efficiency

multimodal large language models

temporal relevance

Innovation

Methods, ideas, or system contributions that make the work stand out.

curvature-aware selection

temporal geometry

efficient video understanding