Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

📅 2026-05-22

🤖 AI Summary
Existing approaches to keyframe retrieval in long videos are often constrained by fixed architectures or single scoring mechanisms, limiting their ability to adapt to diverse queries. This work proposes ToolMerge, a novel method that leverages large language models (LLMs) to dynamically decompose user queries into multiple visual tool invocations and subsequently integrates the resulting keyframe rankings using Boolean logic. By enabling compositional and flexible query interpretation, ToolMerge overcomes the representational limitations of conventional frameworks, substantially enhancing both retrieval flexibility and accuracy. Evaluated on the newly introduced Molmo-2 Moments benchmark, ToolMerge outperforms existing methods by 5% on caption-based retrieval tasks and demonstrates competitive performance in question answering and query-based retrieval scenarios.
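To make the decompose-then-merge idea concrete, here is a minimal sketch of combining per-tool frame scores with Boolean operators. It is not the authors' implementation: the summary does not say how the operators act on rankings, so this sketch assumes each tool returns a per-frame score in [0, 1] and uses fuzzy-logic semantics (AND as min, OR as max, NOT as 1 - score). The tool names, the nested-tuple plan format, and the helpers `merge_scores` and `top_k_frames` are all hypothetical.

```python
from typing import Dict, List, Union

# A plan is either a tool-call name (leaf) or a tuple: (operator, operand, ...).
# This plan format is an illustrative assumption, not the paper's schema.
Plan = Union[str, tuple]


def merge_scores(plan: Plan, tool_scores: Dict[str, List[float]]) -> List[float]:
    """Recursively evaluate a Boolean plan over per-frame scores in [0, 1]."""
    if isinstance(plan, str):
        return tool_scores[plan]
    op, *operands = plan
    evaluated = [merge_scores(p, tool_scores) for p in operands]
    if op == "AND":  # assumed semantics: a frame must score well on every operand
        return [min(vals) for vals in zip(*evaluated)]
    if op == "OR":  # assumed semantics: a frame may score well on any operand
        return [max(vals) for vals in zip(*evaluated)]
    if op == "NOT":  # assumed semantics: invert a single operand
        (only,) = evaluated
        return [1.0 - s for s in only]
    raise ValueError(f"unknown operator: {op}")


def top_k_frames(scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring frames (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


# Hypothetical per-frame scores from three visual tools over a 4-frame video.
tool_scores = {
    "detect:dog": [0.9, 0.2, 0.8, 0.1],
    "detect:frisbee": [0.1, 0.3, 0.7, 0.9],
    "ocr:scoreboard": [0.0, 0.0, 0.5, 0.0],
}

# A plan an LLM planner might emit: dog AND (frisbee OR scoreboard).
plan = ("AND", "detect:dog", ("OR", "detect:frisbee", "ocr:scoreboard"))

merged = merge_scores(plan, tool_scores)
keyframes = top_k_frames(merged, k=2)
print(merged, keyframes)
```

Under these assumptions, only frame 2 scores highly on both the dog detector and one of the other two tools, so it ranks first.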
📝 Abstract
Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a large language model (LLM)-based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using Boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, where it outperforms other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge .
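Because each benchmark question is anchored to a time interval, retrieval can be scored without going through a QA model. The sketch below shows one plausible way to do that: count a question as a hit if any retrieved keyframe timestamp falls inside its ground-truth interval. The abstract does not state which metric M2M uses, so the hit-rate definition and the names `hits_interval` and `hit_rate` are assumptions for illustration only.

```python
from typing import List, Tuple


def hits_interval(frame_times: List[float], interval: Tuple[float, float]) -> bool:
    """True if at least one retrieved timestamp lies within [start, end]."""
    start, end = interval
    return any(start <= t <= end for t in frame_times)


def hit_rate(predictions: List[List[float]],
             intervals: List[Tuple[float, float]]) -> float:
    """Fraction of questions with at least one keyframe inside the interval."""
    if not predictions:
        return 0.0
    hits = sum(hits_interval(p, iv) for p, iv in zip(predictions, intervals))
    return hits / len(predictions)


# Hypothetical retrieved keyframe timestamps (seconds), one list per question.
predictions = [[12.0, 95.5], [3.0, 40.0], [200.0]]
# Hypothetical ground-truth intervals (start, end) in seconds.
intervals = [(90.0, 100.0), (10.0, 20.0), (180.0, 210.0)]

rate = hit_rate(predictions, intervals)
print(f"hit rate: {rate:.3f}")
```

Here the first and third questions are hits and the second is a miss, so the hit rate is 2/3.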
Problem

Research questions and friction points this paper is trying to address.

keyframe retrieval
long-video QA
query decomposition
visual evidence
video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ToolMerge
query decomposition
keyframe retrieval
LLM-based planning
visual tool fusion