ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D scene captioning methods generate only single-granularity descriptions, failing to capture fine-grained attributes—such as material, texture, and shape—at both object and part levels. To address this, we introduce the novel task of *Expressive 3D Captioning*: generating multi-granular textual descriptions for indoor scenes, simultaneously capturing high-level semantic object descriptions and low-level attribute-rich part descriptions. We construct ExCap3D, the first large-scale, hierarchically annotated dataset (34K objects, 190K captions), and propose a data synthesis paradigm integrating 3D point cloud detection with multi-view vision-language models. Our dual-level captioning architecture combines conditional generation and contrastive alignment, augmented by CLIP latent-space constraints to ensure cross-granularity semantic consistency. On ScanNet++, our method achieves state-of-the-art performance, improving CIDEr scores by 17% (object-level) and 124% (part-level). Code, models, and the ExCap3D dataset are fully open-sourced.

📝 Abstract
Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods do this by describing objects at a single level of detail, which often does not capture fine-grained details such as varying textures, materials, and shapes of the parts of objects. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan and, for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with CIDEr score improvements of 17% and 124% for object- and part-level details, respectively. Our code, dataset and models will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Generating multi-level text descriptions for 3D objects
Capturing fine-grained details like textures and shapes
Ensuring semantic consistency in generated captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multi-level 3D object descriptions
Uses semantic consistency for caption quality
Leverages a visual-language model for dataset creation