Queries Are Not Alone: Clustering Text Embeddings for Video Search

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the semantic gap between text queries and video content, which limits retrieval performance, this paper proposes a video retrieval framework based on query semantic expansion. The method has three parts. First, it clusters text query embeddings so that a query is represented together with related queries, covering multiple interpretations of the same wording. Second, a Sweeper module identifies and mitigates noise within these clusters. Third, a video-text cluster attention mechanism (VTC-Att) weights the contents of each cluster according to the video being matched. The authors report that the model outperforms existing state-of-the-art methods on five public datasets.

📝 Abstract

The rapid proliferation of video content across various platforms has highlighted the urgent need for advanced video retrieval systems. Traditional methods, which primarily depend on directly matching textual queries with video metadata, often fail to bridge the semantic gap between text descriptions and the multifaceted nature of video content. This paper introduces a novel framework, the Video-Text Cluster (VTC), which enhances video retrieval by clustering text queries to capture a broader semantic scope. We propose a unique clustering mechanism that groups related queries, enabling our system to consider multiple interpretations and nuances of each query. This clustering is further refined by our innovative Sweeper module, which identifies and mitigates noise within these clusters. Additionally, we introduce the Video-Text Cluster-Attention (VTC-Att) mechanism, which dynamically adjusts focus within the clusters based on the video content, ensuring that the retrieval process emphasizes the most relevant textual features. Further experiments have demonstrated that our proposed model surpasses existing state-of-the-art models on five public datasets.
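The abstract does not give the equations behind VTC-Att, so the following is only a minimal sketch of the general idea of video-conditioned attention over a query cluster, not the authors' implementation. The function names, the temperature value, and the use of softmax over cosine similarities are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cluster_attention(video_emb, cluster_embs, temperature=0.1):
    """Pool a cluster of text embeddings with weights conditioned on the video.

    video_emb:    (d,) embedding of one video (hypothetical encoder output)
    cluster_embs: (k, d) embeddings of a query and its related queries
    Returns the pooled (d,) text feature and the (k,) attention weights.
    """
    v = l2_normalize(video_emb)
    c = l2_normalize(cluster_embs)
    scores = c @ v / temperature        # cosine similarity per cluster member
    scores = scores - scores.max()      # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    pooled = l2_normalize(weights @ c)  # weighted average of cluster members
    return pooled, weights

# Toy data: random embeddings, with one cluster member made close to the video.
rng = np.random.default_rng(0)
video = rng.normal(size=16)
cluster = rng.normal(size=(4, 16))
cluster[2] = video + 0.05 * rng.normal(size=16)
pooled, weights = cluster_attention(video, cluster)
```

In this toy setup the cluster member that most resembles the video receives the largest weight, which is the behaviour the abstract describes as emphasizing the most relevant textual features.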

Problem

Research questions and friction points this paper is trying to address.

Clustering text queries to enhance video retrieval

Addressing semantic gap between text and video content

Improving relevance through dynamic cluster attention mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Clustering text queries for broader semantic scope

Sweeper module mitigates noise within query clusters

VTC-Att mechanism dynamically adjusts focus within query clusters based on the video content
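As a rough illustration of the first two points above, the sketch below groups a query with its nearest neighbours and then drops neighbours that are too dissimilar from it. This is a simple stand-in under stated assumptions: the summary does not describe how the paper forms clusters or how the Sweeper module works, so `build_cluster`, `sweep`, the nearest-neighbour grouping, and the similarity threshold are hypothetical choices, not the authors' method.

```python
import numpy as np

def build_cluster(query_idx, all_query_embs, k=3):
    """Return indices of the query plus its k nearest queries by cosine similarity."""
    unit = all_query_embs / np.linalg.norm(all_query_embs, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf            # exclude the query itself from the neighbours
    neighbours = np.argsort(-sims)[:k]   # most similar first
    return np.concatenate(([query_idx], neighbours))

def sweep(cluster_idx, all_query_embs, min_sim=0.5):
    """Keep the anchor query and only those neighbours similar enough to it."""
    unit = all_query_embs / np.linalg.norm(all_query_embs, axis=1, keepdims=True)
    anchor = unit[cluster_idx[0]]
    sims = unit[cluster_idx] @ anchor
    keep = sims >= min_sim
    keep[0] = True                       # never drop the anchor query
    return cluster_idx[keep]

# Toy data: queries 0-2 point in a similar direction, queries 3-4 do not.
embs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 1.0, 0.0],
    [0.1, 0.0, 1.0],
])
cluster = build_cluster(0, embs, k=3)         # query 0 plus its 3 nearest neighbours
cleaned = sweep(cluster, embs, min_sim=0.5)   # weakly related neighbour removed
```

Here the initial cluster for query 0 picks up one weakly related neighbour because `k` is fixed, and the threshold step removes it. A learned noise filter, as the Sweeper module presumably is, would replace the fixed threshold.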
