HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long video question answering methods struggle to balance efficiency and accuracy: similarity-based approaches are fast but neglect temporal dynamics and cross-modal relationships, while agent-based methods incur prohibitive computational costs. This work proposes HiMu, a training-free hierarchical multimodal frame selection framework that parses a query into a logical tree via a single large language model call. Lightweight expert models—including CLIP, open-vocabulary detectors, OCR, ASR, and CLAP—process the leaf nodes, and their outputs are fused bottom-up using fuzzy logic and temporal smoothing to produce a continuous, temporally coherent scoring curve that guides frame selection. HiMu is the first method to jointly handle compositional query parsing, cross-modal alignment, and temporal modeling without any training. It achieves state-of-the-art performance on Video-MME, LongVideoBench, and HERBench-Lite: with only 16 selected frames it surpasses all competing selectors, and when paired with GPT-4o it outperforms agent-based systems that use 32–512 frames at roughly one-tenth the FLOPs.
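The bottom-up fusion described above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the tree schema, the choice of min/max as the fuzzy AND/OR operators, and the moving-average smoother are all assumptions made here for clarity.

```python
def fuse(node, scores):
    """Recursively compose per-frame leaf scores bottom-up over a logic tree.

    node: a leaf name (str) or a tuple ("AND" | "OR", [children]).
    scores: dict mapping leaf name -> list of per-frame scores in [0, 1].
    Returns a per-frame satisfaction curve (list of floats).
    """
    if isinstance(node, str):                  # leaf: an expert's score curve
        return scores[node]
    op, children = node
    curves = [fuse(c, scores) for c in children]
    if op == "AND":                            # fuzzy AND: frame-wise min
        return [min(vals) for vals in zip(*curves)]
    if op == "OR":                             # fuzzy OR: frame-wise max
        return [max(vals) for vals in zip(*curves)]
    raise ValueError(f"unknown operator {op!r}")

def smooth(curve, k=1):
    """Temporal smoothing: centered moving average with radius k frames."""
    n = len(curve)
    return [sum(curve[max(0, i - k):min(n, i + k + 1)])
            / (min(n, i + k + 1) - max(0, i - k)) for i in range(n)]

def select_frames(curve, num_frames):
    """Indices of the top-scoring frames, returned in temporal order."""
    top = sorted(range(len(curve)), key=lambda i: curve[i], reverse=True)
    return sorted(top[:num_frames])

# Example: "a person AND (a dog OR a cat)" over 4 frames.
scores = {"person": [0.9, 0.1, 0.8, 0.7],
          "dog":    [0.2, 0.9, 0.6, 0.1],
          "cat":    [0.1, 0.0, 0.9, 0.3]}
tree = ("AND", ["person", ("OR", ["dog", "cat"])])
curve = fuse(tree, scores)        # -> [0.2, 0.1, 0.8, 0.3]
picked = select_frames(curve, 2)  # -> [2, 3]
```

Selecting on the smoothed curve rather than raw per-frame similarity is what keeps the chosen frames temporally coherent instead of scattered spikes.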

📝 Abstract
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32–512 frames while requiring roughly 10x fewer FLOPs.
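The "temporal sequencing and adjacency" operators mentioned in the abstract might look like the following minimal sketch. The running-max gating for "A then B" and the windowed-max gating for adjacency are plausible assumptions, not the paper's exact formulation.

```python
def then(curve_a, curve_b):
    """Fuzzy 'A then B': gate B's per-frame score by the strongest
    evidence for A at any strictly earlier frame (a running max),
    combined with a frame-wise fuzzy AND (min)."""
    best_a, out = 0.0, []
    for a_t, b_t in zip(curve_a, curve_b):
        out.append(min(best_a, b_t))   # B now, given A already happened
        best_a = max(best_a, a_t)      # update AFTER, so only earlier frames count
    return out

def near(curve_a, curve_b, window=1):
    """Fuzzy adjacency: B gated by A's best score within +/- window frames."""
    n = len(curve_b)
    return [min(max(curve_a[max(0, t - window):min(n, t + window + 1)]),
                curve_b[t]) for t in range(n)]

# A peaks at frame 0, B peaks later, so the sequencing constraint
# suppresses B's early score but passes its later ones.
seq = then([0.9, 0.1, 0.2], [0.5, 0.8, 0.7])   # -> [0.0, 0.8, 0.7]
```

Because both operators return another per-frame curve in [0, 1], they compose freely with fuzzy AND/OR at any level of the logic tree.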
Problem

Research questions and friction points this paper is trying to address.

long video question answering
frame selection
multimodal reasoning
temporal context
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical multimodal frame selection
training-free framework
query decomposition
fuzzy-logic composition
temporal reasoning
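To make the "query decomposition" contribution concrete, here is a hypothetical example of the kind of logic tree a single LLM call might emit, plus a toy leaf-to-expert router. The tree schema, leaf wording, expert labels, and keyword heuristic are all illustrative assumptions, not the paper's specification.

```python
# Hypothetical decomposition of: "What does the speaker say after
# the red car appears?" into a sequencing node over two atomic leaves.
logic_tree = (
    "THEN",                          # temporal sequencing node
    ["red car appears",              # visual predicate
     "speaker mentions something"],  # audio predicate
)

def route(leaf):
    """Toy stand-in for routing each leaf predicate to a lightweight expert."""
    if any(w in leaf for w in ("says", "mentions", "speaker", "music")):
        return "asr"                 # spoken content -> ASR transcript matching
    if any(w in leaf for w in ("text", "sign", "caption")):
        return "ocr"                 # on-screen text -> OCR
    return "open_vocab_detector"     # default: visual grounding

def leaves(node):
    """Collect leaf predicates left-to-right for dispatch to experts."""
    if isinstance(node, str):
        return [node]
    _op, children = node
    return [leaf for child in children for leaf in leaves(child)]
```

Once each leaf is scored by its expert, the curves flow back up through the tree's fuzzy operators, which is what lets one LLM call replace the iterative agent loop.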