SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA)

📅 2025-08-12
🤖 AI Summary
This paper addresses the challenging problem of fine-grained 3D model retrieval from panoramic indoor images using natural language queries in realistic, cluttered scenes. To this end, the authors introduce ROOMELSA, a benchmark unifying scene-level language grounding and 3D object retrieval, comprising over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 natural language queries. Evaluation of participating methods shows that while coarse-grained retrieval has largely matured, fine-grained recognition remains highly challenging: only one top-performing model consistently ranked the correct match first, and a lightweight CLIP-based model performed well but struggled with subtle variations in materials, part-level structure, and spatial context. These findings underscore the importance of tightly coupled vision-language modeling augmented with explicit spatial awareness.

📝 Abstract
Recent 3D retrieval systems are typically designed for simple, controlled scenarios, such as identifying an object from a cropped image or a brief description. However, real-world scenarios are more complex, often requiring the recognition of an object in a cluttered scene from a vague, free-form description. To this end, we present ROOMELSA, a new benchmark designed to evaluate a system's ability to interpret a natural language query, attend to a specific region within a panoramic room image, and accurately retrieve the corresponding 3D model from a large database. ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Empirically, while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Notably, a lightweight CLIP-based model also performed well, although it struggled with subtle variations in materials, part structures, and contextual cues, resulting in occasional errors. These findings highlight the importance of tightly integrating visual and language understanding. By bridging the gap between scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new benchmark for advancing robust, real-world 3D recognition systems.
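The CLIP-based retrieval described in the abstract amounts to nearest-neighbor search in a shared vision-language embedding space: the text query and the candidate 3D models are embedded (e.g. by CLIP's text and image towers over rendered views), and candidates are ranked by cosine similarity. A minimal sketch of that ranking step, using small placeholder vectors in place of real CLIP features; the embedding values and item labels below are illustrative assumptions, not data from the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_models(query_emb, model_embs):
    """Return database indices sorted by similarity to the query, best first."""
    scores = [cosine(query_emb, e) for e in model_embs]
    return sorted(range(len(model_embs)), key=lambda i: -scores[i])

# Toy 4-d embeddings standing in for CLIP features (illustrative only).
query = [0.9, 0.1, 0.0, 0.1]     # e.g. the query "a brown leather sofa"
database = [
    [0.1, 0.9, 0.1, 0.0],        # e.g. a wooden chair
    [0.88, 0.12, 0.05, 0.1],     # e.g. a leather sofa (closest to the query)
    [0.0, 0.1, 0.9, 0.2],        # e.g. a glass table
]
ranking = rank_models(query, database)
print(ranking)  # best-matching model index comes first: [1, 0, 2]
```

A real system would replace the toy vectors with L2-normalized CLIP embeddings and a vector index for the 44,000-query scale; the ranking logic itself is unchanged.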
Problem

Research questions and friction points this paper is trying to address.

Evaluate 3D retrieval in cluttered scenes with vague descriptions
Bridge gap between scene-level grounding and fine-grained 3D retrieval
Advance robust real-world 3D recognition systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for natural language interpretation in 3D retrieval
Lightweight CLIP-based model for object retrieval
Integration of visual and language understanding
👥 Authors
Trong-Thuan Nguyen
University of Science, VNU-HCM
Deep Learning · Computer Vision · Video Understanding
Viet-Tham Huynh
Researcher at Software Engineering Laboratory, University of Science, VNU-HCM
Software Engineering · Virtual Reality · Computer Vision
Quang-Thuc Nguyen
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
Hoang-Phuc Nguyen
University of Science, VNU-HCM, Vietnam National University, Vietnam
Artificial Intelligence · Machine Learning · Computer Vision
Long Le Bao
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Thai Hoang Minh
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Minh Nguyen Anh
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Thang Nguyen Tien
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Phat Nguyen Thuan
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Huy Nguyen Phong
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Bao Huynh Thai
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Vinh-Tiep Nguyen
University of Information Technology, VNU-HCMC
Deep Learning · Computer Vision · Information Retrieval
Duc-Vu Nguyen
University of Information Technology
Natural Language Processing
Phu-Hoa Pham
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
Minh-Huy Le-Hoang
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
Nguyen-Khang Le
Japan Advanced Institute of Science and Technology
Deep Learning
Minh-Chinh Nguyen
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Minh-Quan Ho
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Ngoc-Long Tran
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Hien-Long Le-Hoang
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Man-Khoi Tran
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Anh-Duong Tran
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Kim Nguyen
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
Quan Nguyen Hung
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Dat Phan Thanh
University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam