🤖 AI Summary
This work addresses the challenge of selecting keyframes from long videos under limited computational resources so that multimodal large language models (MLLMs) can understand them accurately. The authors propose CATS, a curvature-aware temporal frame selection method that uses temporal curvature to model the rhythm of semantic change in video content. By analyzing the temporal geometry of query-frame relevance, the approach adapts sampling density to retain both abrupt events and gradual transitions while discarding redundant frames. On the LongVideoBench and VideoMME benchmarks, CATS retains approximately 93–95% of the performance of the multi-stage MIRA method at only 3–4% of its preprocessing cost, and it outperforms prior lightweight approaches such as AKS under a fixed backbone and frame budget. In an LLM-as-a-judge evaluation, it also produces more coherent and informative video descriptions.
📝 Abstract
Understanding long videos with multimodal large language models (MLLMs) requires selecting a small subset of informative frames under strict computational budgets, where exhaustive processing is infeasible and optimal selection is combinatorial. We propose CATS, a curvature-aware frame selection method that explicitly models the temporal geometry of query-frame relevance to identify salient events and their surrounding context. By leveraging temporal curvature to adapt selection density, CATS captures both abrupt transitions and gradually evolving content while suppressing redundant frames. Under a fixed backbone and frame budget, CATS consistently outperforms prior lightweight approaches such as AKS on LongVideoBench and VideoMME. While multi-stage methods such as MIRA achieve higher absolute accuracy, they incur substantial computational overhead; in contrast, CATS retains approximately 93–95% of MIRA's performance while requiring only 3–4% of its preprocessing cost, yielding a favorable efficiency-accuracy trade-off. Beyond answer accuracy, we evaluate description generation using an LLM-as-a-judge protocol; the results show that CATS produces more coherent and informative outputs, indicating improved grounding in visual evidence. These results position CATS as a computationally efficient and principled approach to long-video understanding.
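The abstract does not spell out the selection rule, so the following is only an illustrative sketch of the general idea of curvature-adapted sampling density, not the CATS algorithm itself. It assumes that per-frame query-frame relevance scores are already available (for example from an image-text encoder), approximates temporal curvature by the absolute discrete second difference of those scores, and allocates a fixed frame budget by inverse-CDF sampling over a combined weight. The function names and the `alpha` and `floor` parameters are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def temporal_curvature(scores):
    """Absolute discrete second difference of the relevance curve.

    Endpoints have no two-sided neighbourhood, so they are assigned 0.
    """
    n = len(scores)
    curv = [0.0] * n
    for t in range(1, n - 1):
        curv[t] = abs(scores[t - 1] - 2.0 * scores[t] + scores[t + 1])
    return curv


def _normalize(values):
    """Min-max scale to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_frames(scores, budget, alpha=1.0, floor=0.1):
    """Pick `budget` frame indices, denser where relevance or curvature is high.

    alpha: assumed trade-off between relevance and curvature.
    floor: assumed uniform weight so flat regions still receive some frames.
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))

    rel = _normalize(scores)
    curv = _normalize(temporal_curvature(scores))
    weights = [r + alpha * c + floor for r, c in zip(rel, curv)]

    # Inverse-CDF sampling at evenly spaced quantiles of the total weight.
    cdf = list(accumulate(weights))
    total = cdf[-1]
    chosen = set()
    for k in range(budget):
        target = (k + 0.5) / budget * total
        chosen.add(min(bisect_left(cdf, target), n - 1))

    # Several quantiles can land on one heavy frame; spend the leftover
    # budget on the highest-weight frames that are not yet selected.
    if len(chosen) < budget:
        for idx in sorted(range(n), key=lambda i: weights[i], reverse=True):
            if idx not in chosen:
                chosen.add(idx)
                if len(chosen) == budget:
                    break
    return sorted(chosen)


if __name__ == "__main__":
    # Synthetic relevance curve: a short, sharp event inside a flat video.
    demo_scores = [0.1] * 20 + [0.9] * 3 + [0.1] * 20
    print(select_frames(demo_scores, budget=8))
```

In this sketch the curvature term concentrates frames around the onset and offset of an event, the relevance term keeps frames inside the event, and the uniform floor preserves sparse coverage of the surrounding context. How CATS actually combines these signals is specified in the paper, not here.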