DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) rely heavily on sparse textual annotations for training, which inadequately capture fine-grained visual semantics in images and 3D scenes, particularly across cross-cultural, multilingual, and domain-specific contexts. Textual input suffers from limited expressivity, low annotation efficiency, and poor capacity to describe visual detail. To address this, the authors propose the first speech-driven dense annotation paradigm: an end-to-end audio-visual annotation platform that integrates automatic speech recognition (ASR), attention-based region localization, and multimodal temporal alignment. The platform supports fine-grained spoken annotation of both images and 3D scenes in 20 languages. The resulting dataset comprises 3,531 images, 898 3D scenes, and 7,460 3D objects with synchronized, audio-aligned dense annotations. Models trained on this data show substantial improvements: +5% in multilingual understanding, +47% in cultural alignment, and +54% in 3D spatial reasoning.
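The summary describes records that pair narrated audio, an ASR transcript, and region annotations for a given asset. A minimal sketch of what one such annotation record could look like is below; all names and fields here are assumptions for illustration, not the paper's actual data format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RegionSpan:
    """A spoken phrase linked to a marked image region or 3D scene part."""
    phrase: str      # transcript span produced by ASR
    t_start: float   # speech time (seconds) when the phrase begins
    t_end: float     # speech time when the phrase ends
    region_id: str   # identifier of the marked 2D region or 3D segment

@dataclass
class DenseAnnotation:
    """One audio-aligned dense caption for an image or 3D scene."""
    asset_id: str    # image / scene / object identifier
    language: str    # e.g. "en"; the paper covers 20 languages
    audio_uri: str   # recorded narration file
    transcript: str  # full ASR transcript of the narration
    spans: List[RegionSpan] = field(default_factory=list)

ann = DenseAnnotation(
    asset_id="img_0001",
    language="en",
    audio_uri="audio/img_0001.wav",
    transcript="a red lantern hangs above the doorway",
    spans=[RegionSpan("red lantern", 0.2, 1.0, "r1")],
)
print(ann.language, len(ann.spans))
```

A schema like this keeps each spoken phrase tied to both its time in the audio and the region it describes, which is what makes the captions "dense" rather than a single free-text field per image.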

📝 Abstract
With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing, which capture only a fraction of an image's visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curate a human-annotated multi-modal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual capabilities, 47% in cultural alignment, and 54% in 3D spatial capabilities. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.
Problem

Research questions and friction points this paper is trying to address.

Current image annotation methods rely on sparse typing-based captions
Traditional annotation pipelines limit expressiveness for dense visual content
There is a scarcity of high-quality dense annotations for multimodal training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses audio narration for dense annotation collection
Links spoken phrases to visual regions automatically
Applies speech-to-text and attention marking technology
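The core innovation, linking spoken phrases to visual regions, amounts to aligning two timestamped streams: word-level ASR output and the annotator's region-marking events. A minimal sketch of such an alignment by greatest temporal overlap is below; the data structures and function names are assumptions for illustration, not the platform's actual implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Word:
    """One ASR-transcribed word with its word-level timestamps."""
    text: str
    start: float  # seconds into the narration
    end: float

@dataclass
class RegionMark:
    """Interval during which the annotator was marking a region."""
    region_id: str
    start: float
    end: float

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of temporal overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def align_words_to_regions(
    words: List[Word], marks: List[RegionMark]
) -> List[Tuple[str, Optional[str]]]:
    """Assign each spoken word to the region mark with the greatest
    temporal overlap; None if the word overlaps no mark."""
    aligned = []
    for w in words:
        best_id, best_ov = None, 0.0
        for m in marks:
            ov = overlap(w.start, w.end, m.start, m.end)
            if ov > best_ov:
                best_id, best_ov = m.region_id, ov
        aligned.append((w.text, best_id))
    return aligned

words = [Word("red", 0.0, 0.4), Word("lantern", 0.4, 1.0), Word("above", 1.2, 1.6)]
marks = [RegionMark("r1", 0.0, 1.1), RegionMark("r2", 1.15, 2.0)]
print(align_words_to_regions(words, marks))
# → [('red', 'r1'), ('lantern', 'r1'), ('above', 'r2')]
```

Greatest-overlap assignment is only one plausible strategy; a real platform could also use attention signals or smoothing across adjacent words, as the paper's attention-based localization suggests.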
Authors
Xiaoyu Lin, University of Pennsylvania
Aniket Ghorpade, University of Pennsylvania
Hansheng Zhu, University of Pennsylvania
Justin Qiu, University of Pennsylvania
Dea Rrozhani, University of Pennsylvania
Monica Lama, University of Pennsylvania
Mick Yang, University of Pennsylvania
Zixuan Bian, University of Pennsylvania
Ruohan Ren, University of Pennsylvania
Alan B. Hong, University of Pennsylvania
Jiatao Gu, UPenn CIS / Apple MLR (machine learning, generative models, natural language processing, computer vision, deep learning)
Chris Callison-Burch, Professor, University of Pennsylvania (natural language processing, crowdsourcing, artificial intelligence)