🤖 AI Summary
Existing long-video question-answering methods struggle to balance efficiency and accuracy: similarity-based approaches are fast but neglect temporal dynamics and cross-modal relationships, while agent-based methods incur prohibitive computational costs. This work proposes HiMu, a training-free hierarchical multimodal frame-selection framework that parses a query into a logical tree via a single large language model call. Lightweight expert models (CLIP, open-vocabulary detectors, OCR, ASR, and CLAP) process the leaf nodes, and their outputs are fused bottom-up using fuzzy logic and temporal smoothing to produce a continuous, temporally coherent scoring curve that guides frame selection. HiMu is the first method to jointly handle compositional query parsing, cross-modal alignment, and temporal modeling without any training. It achieves state-of-the-art performance on Video-MME, LongVideoBench, and HERBench-Lite, surpassing all baselines with only 16 selected frames and outperforming agent-based systems that use 32–512 frames at roughly one-tenth the FLOPs when paired with GPT-4o.
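The decomposition step above can be pictured as follows. This is a hypothetical sketch, not the paper's actual prompt or schema: the tree format, operator names (`AND`, `SEQ`, `LEAF`), and the example query are all illustrative assumptions, with a hand-written tree standing in for the single LLM call.

```python
# Illustrative sketch of HiMu-style query decomposition. The node format and
# operator names are assumptions for illustration; the paper's schema may differ.

def parse_query(query: str) -> dict:
    """Toy stand-in for the single LLM call that builds the logic tree.
    Returns a fixed hand-written tree instead of actually calling an LLM."""
    return {
        "op": "AND",
        "children": [
            {"op": "SEQ",  # temporal sequencing: left child happens before right
             "children": [
                 {"op": "LEAF", "predicate": "a dog enters the room", "expert": "clip"},
                 {"op": "LEAF", "predicate": "someone says 'sit'", "expert": "asr"},
             ]},
            {"op": "LEAF", "predicate": "the text 'EXIT' on a sign", "expert": "ocr"},
        ],
    }

def leaves(node: dict) -> list:
    """Collect the atomic predicates, each routed to a lightweight expert."""
    if node["op"] == "LEAF":
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

tree = parse_query("Does a dog enter before someone says 'sit', near an EXIT sign?")
print([leaf["expert"] for leaf in leaves(tree)])  # → ['clip', 'asr', 'ocr']
```

Each leaf's expert then produces a per-frame score, and the inner nodes (`AND`, `SEQ`) determine how those score curves are fused on the way back up the tree.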
📝 Abstract
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency–accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32–512 frames while requiring roughly 10x fewer FLOPs.
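The bottom-up fuzzy composition described above can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not specify which t-norms, smoothing windows, or sequencing operators it uses, so the choices here (min/max fuzzy operators, moving-average smoothing, a running-max sequencing rule) are plausible stand-ins, and the per-frame scores are toy data.

```python
# Minimal sketch of bottom-up fuzzy scoring over per-frame expert outputs.
# Operator choices and window sizes are illustrative assumptions, not HiMu's exact ones.

def smooth(scores, window=3):
    """Moving-average temporal smoothing of a per-frame score curve."""
    half = window // 2
    return [
        sum(scores[max(0, i - half):i + half + 1])
        / len(scores[max(0, i - half):i + half + 1])
        for i in range(len(scores))
    ]

def fuzzy_and(a, b):
    """Gödel t-norm: both predicates must hold at the frame."""
    return [min(x, y) for x, y in zip(a, b)]

def fuzzy_or(a, b):
    """Max t-conorm: either predicate may hold at the frame."""
    return [max(x, y) for x, y in zip(a, b)]

def seq(a, b):
    """'a then b': frame t satisfies b, and a was satisfied at some t' <= t.
    Implemented as min(running_max(a), b)."""
    running, best = [], 0.0
    for x in a:
        best = max(best, x)
        running.append(best)
    return [min(m, y) for m, y in zip(running, b)]

def top_k_frames(curve, k):
    """Select the k frames where the composed satisfaction curve peaks."""
    return sorted(range(len(curve)), key=lambda i: curve[i], reverse=True)[:k]

# Toy normalized per-frame scores from two hypothetical leaf experts.
dog_enters = smooth([0.1, 0.9, 0.8, 0.2, 0.1, 0.1])   # visual expert
says_sit   = smooth([0.0, 0.1, 0.2, 0.9, 0.8, 0.1])   # ASR expert

curve = seq(dog_enters, says_sit)  # "a dog enters, then 'sit' is said"
print(top_k_frames(curve, 2))      # → [3, 4]
```

The running-max construction makes the sequencing operator monotone in time: once the first predicate has fired anywhere earlier in the video, every later frame where the second predicate fires scores highly, which is one simple way to obtain the temporally coherent curve the abstract describes.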