CAST: Modeling Visual State Transitions for Consistent Video Retrieval

📅 2026-03-09

🤖 AI Summary

This work addresses a limitation of existing video retrieval methods: they are context-agnostic at inference time, prioritizing local semantic alignment over state and identity consistency, and so struggle to compose short clips into coherent long-form narratives. The paper formalizes the Consistent Video Retrieval (CVR) task, introduces a diagnostic benchmark spanning YouCook2, COIN, and CrossTask, and proposes CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter for frozen vision-language embedding spaces. CAST models visual state transitions explicitly by predicting a state-conditioned residual update ($\Delta$) from visual history. It improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across foundation backbones. It also provides a useful reranking signal for candidates from black-box video generators (e.g., Veo), promoting more temporally coherent continuations.

📝 Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
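The state-conditioned residual update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the adapter architecture, the history pooling, or the training objective, so the two-layer MLP, the mean pooling over history, the zero initialization, and every name below (`ResidualStateAdapter`, `rerank`, `l2_normalize`) are assumptions made for illustration. The embeddings are random stand-ins for the outputs of a frozen vision-language encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class ResidualStateAdapter:
    """Hypothetical adapter: q' = normalize(q + Delta(q, history))."""

    def __init__(self, dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(2 * dim, hidden))
        # Assumption: a zero-initialized output layer makes Delta = 0 before
        # training, so the untrained adapter reproduces the frozen query.
        self.w2 = np.zeros((hidden, dim))

    def delta(self, query, history):
        # Assumption: visual history is summarized by mean pooling.
        context = history.mean(axis=0)
        hidden = np.maximum(0.0, np.concatenate([query, context]) @ self.w1)
        return hidden @ self.w2

    def __call__(self, query, history):
        return l2_normalize(query + self.delta(query, history))


def rerank(query, history, candidates, adapter):
    """Order candidates by cosine similarity to the history-adapted query."""
    adapted = adapter(query, history)
    scores = l2_normalize(candidates) @ adapted
    return np.argsort(-scores), scores


# Random stand-ins for frozen-encoder embeddings (illustrative only).
rng = np.random.default_rng(1)
dim = 16
query = l2_normalize(rng.normal(size=dim))          # embedding of the next-step query
history = l2_normalize(rng.normal(size=(3, dim)))   # embeddings of preceding clips
candidates = rng.normal(size=(5, dim))              # embeddings of candidate clips
candidates[2] = query                               # plant an exact match at index 2

adapter = ResidualStateAdapter(dim)
order, scores = rerank(query, history, candidates, adapter)
```

With the zero-initialized output layer, the untrained adapter leaves the query unchanged, so the ranking coincides with a zero-shot cosine ranking and the planted candidate comes first. Training would move `w2` away from zero so that the visual history shifts the query toward candidates whose state is consistent with what came before; the same `rerank` call could in principle score candidates produced by a black-box generator, provided they are embedded in the same frozen space.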

Problem

Research questions and friction points this paper is trying to address.

Consistent Video Retrieval

state consistency

identity consistency

context-agnostic retrieval

video narrative coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistent Video Retrieval

State Transition Modeling

Context-Aware Adapter