FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Fine-grained information retrieval and frame-level localization in hour-long videos remain challenging, as answers reside in only a sparse set of key frames, a difficulty exacerbated by the severe context-length limitations of current vision-language models (VLMs). Method: We propose a cross-modal long-video question-answering search framework that overcomes VLM context constraints. Our approach pairs a lightweight VLM with a medium-scale LLM that orchestrates it, combining video segmentation, automatic captioning, and a confidence-driven exploration algorithm that jointly uses short clips, captions, and temporal localization. We further conduct a systematic analysis of VLM answer-confidence calibration. Results: Evaluated on our newly constructed long-video benchmark FALCON-Bench (mean duration >1 hour), our method significantly outperforms state-of-the-art methods. It achieves comparable or superior performance on multiple public benchmarks while remaining deployable on standard single-machine hardware under typical computational constraints.
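The search procedure described above can be sketched as a loop: segment the video into short clips, caption them, let the LLM rank promising clips for the question, and stop once the VLM answers with confidence above a threshold. This is a minimal illustration, not the paper's implementation; `caption_clip`, `rank_clips`, and `answer_with_confidence` are hypothetical stand-ins for the VLM/LLM calls.

```python
def segment(duration_s, clip_len_s):
    """Split a long video into fixed-length (start, end) clip intervals."""
    return [(t, min(t + clip_len_s, duration_s))
            for t in range(0, duration_s, clip_len_s)]

def search_answer(clips, caption_clip, rank_clips, answer_with_confidence,
                  question, threshold=0.8, max_rounds=3):
    """Confidence-driven exploration: probe the top-ranked clips each round
    until the answer confidence clears the threshold (illustrative only)."""
    captions = {c: caption_clip(c) for c in clips}
    candidates = list(clips)
    best = (None, 0.0, None)  # (answer, confidence, clip)
    for _ in range(max_rounds):
        ranked = rank_clips(question, candidates, captions)
        probed = ranked[:2]                 # inspect a few clips per round
        for clip in probed:
            answer, conf = answer_with_confidence(question, clip)
            if conf > best[1]:
                best = (answer, conf, clip)
            if conf >= threshold:           # confident: localize here
                return best
        candidates = [c for c in candidates if c not in probed]
        if not candidates:
            break
    return best

# Toy usage with mocked VLM/LLM calls: the "answer" lives in the clip
# starting at 1800 s of a one-hour video.
clips = segment(3600, 600)
caption = lambda c: f"clip {c[0]}-{c[1]}"
rank = lambda q, cands, caps: sorted(cands, key=lambda c: abs(c[0] - 1800))
answer = lambda q, c: ("a red car", 0.9) if c[0] == 1800 else ("unsure", 0.3)
result = search_answer(clips, caption, rank, answer, "what crosses at 30 min?")
```

The key design point the paper emphasizes is that only a handful of clips ever pass through the VLM, so the context window is never exceeded regardless of total video length.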

📝 Abstract
Information retrieval in hour-long videos presents a significant challenge, even for state-of-the-art Vision-Language Models (VLMs), particularly when the desired information is localized within a small subset of frames. Long videos challenge VLMs both through context window limitations and through the difficulty of pinpointing the frames that contain the answer. Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search for relevant information along the video and locate the frames containing the answer. FALCONEye's novelty relies on 1) the proposed meta-architecture, which is better suited to hour-long videos than the short-video approaches in the state of the art; 2) a new efficient exploration algorithm that locates information using short clips, captions, and answer confidence; and 3) our calibration analysis of answer confidence in state-of-the-art VLMs. Our agent builds on a small-size VLM and a medium-size LLM, making it accessible to run on standard computational resources. We also release FALCON-Bench, a benchmark for evaluating long-video (average > 1 hour) Video Answer Search, highlighting the need for open-ended question evaluation. Our experiments show FALCONEye's superior performance over the state of the art on FALCON-Bench, and similar or better performance on related benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Retrieving localized information from hour-long videos efficiently
Overcoming context window limits in Vision-Language Models for long videos
Accurately pinpointing answer-containing frames using multimodal LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines VLM and LLM for video information retrieval
Uses efficient exploration algorithm with short clips
Calibrates VLM confidence for accurate answer localization
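Calibration here means checking whether a VLM's stated confidence matches how often its answers are actually correct. A standard way to quantify this, which the paper's calibration analysis plausibly relates to, is the expected calibration error (ECE); the sketch below is a generic ECE computation, not code from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin gap
    between mean confidence and accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# A model that says 0.9 and is right 9 times out of 10 is well calibrated.
ece = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
```

A well-calibrated confidence is what makes the exploration algorithm's stopping criterion trustworthy: the agent only halts early when high confidence genuinely predicts a correct answer.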