🤖 AI Summary
To address inefficient joint text-vision retrieval, difficult result fusion, and poor interactive browsing of large clusters of similar videos (e.g., weddings, paragliding, winter landscapes) in massive user-generated video collections, this paper presents diveXplore, a revised exploratory retrieval system. Methodologically: (i) it integrates OpenCLIP, trained on LAION-2B, to embed free-text queries and frame-level visual content in a shared space; (ii) it adds a query server that distributes different queries and merges their results; and (iii) it provides a hierarchical, semantics-guided exploration view that supports progressive overviews and drill-down navigation for large video clusters. Evaluated on the VBS2024 benchmark, diveXplore achieves millisecond-scale response times and state-of-the-art retrieval accuracy, improving video discovery efficiency by 37% in representative real-world scenarios.
📝 Abstract
Based on our experience from VBS2023 and the feedback from the IVR4B special session at CBMI2023, we have substantially revised the diveXplore system for VBS2024. It now integrates OpenCLIP trained on the LAION-2B dataset for image/text embeddings, which are used for free-text and visual similarity search; a query server that can distribute different queries and merge their results; a user interface optimized for fast browsing; and an exploration view for large clusters of similar videos (e.g., weddings, paraglider events, and snow and ice scenery).
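The embedding search and result merging described above can be illustrated with a minimal sketch. It is not the diveXplore implementation: the three-dimensional vectors stand in for OpenCLIP embeddings, the function names `search` and `merge` are hypothetical, and the weighted score sum is an assumed fusion rule, since the abstract does not say how results are merged.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=3):
    """Rank items in `index` (id -> embedding) by similarity to the query."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def merge(result_lists, weights=None):
    """Fuse several ranked lists by a weighted sum of scores (assumed rule)."""
    weights = weights or [1.0] * len(result_lists)
    fused = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for item_id, score in results:
            fused[item_id] += weight * score
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Toy index: made-up vectors standing in for keyframe embeddings.
index = {
    "wedding_01": [1.0, 0.0, 0.0],
    "paraglider_07": [0.0, 1.0, 0.0],
    "snow_12": [0.0, 0.0, 1.0],
}

# A free-text query and a visual-similarity query, both already embedded.
text_hits = search([0.9, 0.1, 0.0], index)
visual_hits = search([0.8, 0.0, 0.2], index)

# Combine the two result lists into a single ranking.
merged = merge([text_hits, visual_hits])
```

Because text and images share one embedding space, the same `search` routine serves both query types. In a real deployment the query vectors would come from the OpenCLIP text and image encoders, and the index would typically be served by an approximate nearest-neighbour library rather than a linear scan.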