🤖 AI Summary
Multimodal large language models (MLLMs) suffer from semantic misinterpretation in long-video understanding due to visual token overload. Method: This paper proposes a test-time visual prompt optimization framework built around a novel chain-based shot selection mechanism: shot filtering is formulated as task-aware binary pseudo-temporal localization, and positive–negative sample collaborative reasoning is integrated to construct dynamic, semantically aligned, lightweight prompts. The method comprises three modules: binary video summarization, video collaborative reasoning, and task-shot semantic alignment. Contribution/Results: Evaluated on three baseline MLLMs and five long-video understanding benchmarks, the approach reduces the visual token count by over 60% on average while significantly improving fine-grained temporal reasoning. Crucially, it requires no model fine-tuning, demonstrating strong generalizability across architectures and datasets as well as deployment efficiency.
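To make the shot-filtering idea concrete, below is a minimal Python sketch of task-aware binary shot selection, assuming a CLIP-style shared embedding space for the task query and shot keyframes. The encoder, the threshold `tau`, and all helper names are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of binary shot filtering, NOT the authors' code.
# Assumes shot and task embeddings already live in a shared space
# (e.g. from a CLIP-like encoder); the threshold tau is hypothetical.
import numpy as np

def binary_shot_code(shot_embs: np.ndarray, task_emb: np.ndarray,
                     tau: float = 0.25) -> np.ndarray:
    """Return a 0/1 code per shot: 1 = task-relevant, 0 = irrelevant."""
    # Cosine similarity between each shot embedding and the task embedding.
    shots = shot_embs / np.linalg.norm(shot_embs, axis=1, keepdims=True)
    task = task_emb / np.linalg.norm(task_emb)
    sims = shots @ task
    return (sims >= tau).astype(np.int8)

# Demo with random stand-in embeddings; with real encoders, tau would be
# tuned, whereas tau=0.0 on random vectors keeps roughly half the shots.
rng = np.random.default_rng(0)
shot_embs = rng.normal(size=(120, 512))   # 120 shots, 512-dim features
task_emb = rng.normal(size=512)           # embedding of the question/task
code = binary_shot_code(shot_embs, task_emb, tau=0.0)
print(f"kept {int(code.sum())} of {len(code)} shots")
```

The point of the binary code is that it is a per-shot decision rather than a fixed sampling rate, so the retained token budget adapts to where the task-relevant content actually sits in the video.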
📝 Abstract
Multi-modal Large Language Models (MLLMs) struggle with long videos due to the need for excessive visual tokens. These tokens massively exceed the context length of MLLMs, so the context becomes filled with redundant, task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adaptive to the semantic task of video understanding by optimising shot–task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context and improving long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.
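As a rough illustration of the co-reasoning step, the sketch below pairs the positive shots selected by the binary coding with a sparsely sub-sampled set of negative shots before re-embedding them, in original temporal order, into the visual prompt. The `neg_stride` sub-sampling scheme and the function names are hypothetical stand-ins, not the paper's actual pairing strategy.

```python
# A hedged sketch of prompt construction from the binary code: keep every
# positive shot, down-sample negatives as contrastive context. Assumed
# helper, not the paper's API.
from typing import List, Sequence

def build_cos_prompt(shots: Sequence, code: Sequence[int],
                     neg_stride: int = 8) -> List:
    """Interleave positive shots with sparsely sampled negative shots."""
    prompt_shots = []
    neg_seen = 0
    for shot, bit in zip(shots, code):
        if bit == 1:
            prompt_shots.append(shot)        # keep every positive shot
        else:
            if neg_seen % neg_stride == 0:   # keep 1-in-neg_stride negatives
                prompt_shots.append(shot)
            neg_seen += 1
    return prompt_shots

shots = [f"shot_{i}" for i in range(12)]
code = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
print(build_cos_prompt(shots, code, neg_stride=4))
# -> ['shot_0', 'shot_2', 'shot_3', 'shot_6', 'shot_8', 'shot_11']
```

Keeping a thin slice of negative shots, rather than discarding them entirely, preserves the positive/negative contrast that the co-reasoning module exploits, while still cutting most of the redundant visual tokens.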