Towards Visual Query Localization in the 3D World

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing research on visual query localization (VQL) is largely confined to 2D video, and 3D spatio-temporal response prediction has received little attention. This work introduces the first VQL task for 3D environments, together with 3DVQL, a multimodal benchmark of 2,002 sequences, around 170,000 frames, and 6.4K manually annotated 3D response track segments from 38 object categories, with point clouds, RGB images, and depth images provided for each sequence. For this setting, the authors propose LaF, a lift-and-attention fusion algorithm. On the proposed benchmark, LaF significantly outperforms the implemented baselines, giving future 3D VQL research both a benchmark and a strong reference method.

📝 Abstract

Given a visual query, visual query localization (VQL) aims to predict the spatio-temporal response of the query object's most recent occurrence in a sequence. Currently, most research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to address visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using point clouds and RGB images. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift-and-attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released at https://github.com/wuhengliangliang/3DVQL.
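
To make the "most recent occurrence" in the task definition concrete, the following is a minimal sketch rather than anything taken from the paper or its code release. It assumes that each frame carries either one 3D box for the query object or nothing, and it returns the last contiguous run of frames in which the object is present. The names `ResponseTrack` and `most_recent_response`, and the seven-number box layout, are hypothetical and may not match the 3DVQL annotation format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box layout: (x, y, z, length, width, height, yaw).
Box3D = Tuple[float, float, float, float, float, float, float]


@dataclass
class ResponseTrack:
    start_frame: int     # first frame of the response (inclusive)
    end_frame: int       # last frame of the response (inclusive)
    boxes: List[Box3D]   # one box per frame in [start_frame, end_frame]


def most_recent_response(
    per_frame_boxes: List[Optional[Box3D]],
) -> Optional[ResponseTrack]:
    """Return the last contiguous run of frames that contain the query object.

    per_frame_boxes[i] is the object's box in frame i, or None if it is absent.
    """
    # Walk back from the end of the sequence to the last frame with a box.
    end = len(per_frame_boxes) - 1
    while end >= 0 and per_frame_boxes[end] is None:
        end -= 1
    if end < 0:
        return None  # the object never appears

    # Extend backwards while the object stays visible.
    start = end
    while start > 0 and per_frame_boxes[start - 1] is not None:
        start -= 1

    boxes = [b for b in per_frame_boxes[start:end + 1] if b is not None]
    return ResponseTrack(start_frame=start, end_frame=end, boxes=boxes)
```

In this sketch an earlier appearance of the object is ignored once a later one exists, which is the sense in which the response is the "most recent" occurrence.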

Problem

Research questions and friction points this paper is trying to address.

visual query localization

3D world

multimodal

benchmark

spatio-temporal response

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D visual query localization

multimodal fusion

point cloud