🤖 AI Summary
Existing text-to-video retrieval methods perform well on explicit queries but struggle with implicit queries that require semantic reasoning. To address this, we propose a digital twin video representation paradigm that structurally encodes videos as scene graphs amenable to logical inference, integrated with large language models for two-stage retrieval: (1) sub-query alignment and (2) context-driven dynamic reasoning, enabling fine-grained implicit semantic matching. Additionally, object-level grounding masks are introduced to enhance interpretability. Evaluated on our newly constructed benchmark ReasonT2VBench-135, our method achieves 81.2% R@1, surpassing the strongest baseline by more than 50 percentage points, and attains state-of-the-art performance across multiple mainstream benchmarks. This work is the first to systematically tackle video retrieval with implicit text queries, advancing cross-modal reasoning toward explainable, decomposable, and semantically grounded frameworks.
📝 Abstract
The goal of text-to-video retrieval is to search large databases for videos relevant to a text query. Existing methods have progressed to handling explicit queries, where the visual content of interest is described directly; however, they fail on implicit queries, where identifying relevant videos requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over long-horizon video content without visual token compression. Specifically, our two-stage framework first performs compositional alignment between decomposed sub-queries and digital twin representations to identify candidates, then applies large language model-based reasoning with just-in-time refinement that invokes additional specialist models to fill information gaps. We construct a benchmark of 447 manually created implicit queries over 135 videos (ReasonT2VBench-135) and a more challenging version with 1,000 videos (ReasonT2VBench-1000). Our method achieves 81.2% R@1 on ReasonT2VBench-135, outperforming the strongest baseline by more than 50 percentage points, and maintains 81.7% R@1 on the extended configuration while establishing state-of-the-art results on three conventional benchmarks (MSR-VTT, MSVD, and VATEX).
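The two-stage pipeline the abstract describes can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the `DigitalTwin` triple schema, the `;`-delimited `decompose` stand-in for LLM query decomposition, the substring-based `align_score`, and the stubbed second stage are all hypothetical simplifications of the compositional alignment and just-in-time refinement steps.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalTwin:
    """Toy digital-twin representation: one scene graph per video,
    stored as (subject, relation, object) triples. Illustrative only;
    the paper builds these with specialist vision models."""
    video_id: str
    triples: set = field(default_factory=set)

def decompose(query: str) -> list[str]:
    # Stand-in for LLM-based decomposition of an implicit query
    # into sub-queries (here: a hypothetical ';' delimiter).
    return [q.strip() for q in query.split(";") if q.strip()]

def align_score(sub_queries: list[str], twin: DigitalTwin) -> float:
    # Stage 1: compositional alignment, scored here as the fraction
    # of sub-queries matched by at least one scene-graph triple.
    flat = {" ".join(t) for t in twin.triples}
    hits = sum(any(sq in s for s in flat) for sq in sub_queries)
    return hits / max(len(sub_queries), 1)

def retrieve(query: str, twins: list[DigitalTwin], top_k: int = 2) -> list[str]:
    sub_queries = decompose(query)
    # Stage 1: rank all videos by alignment, keep top-k candidates.
    ranked = sorted(twins, key=lambda t: align_score(sub_queries, t),
                    reverse=True)
    candidates = ranked[:top_k]
    # Stage 2 (stubbed): an LLM would reason over each candidate's
    # scene graph and invoke additional specialist models just in
    # time when the graph lacks the attributes the query needs.
    return [t.video_id for t in candidates]
```

A short usage example under the same toy assumptions:

```python
twins = [DigitalTwin("v1", {("person", "holds", "umbrella")}),
         DigitalTwin("v2", {("dog", "chases", "ball")})]
print(retrieve("person holds umbrella", twins))  # "v1" ranked first
```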