🤖 AI Summary
Addressing the challenge of generating fine-grained, contextually coherent multi-sentence descriptions for arbitrary regions in images and videos, this paper proposes the Describe Anything Model (DAM) for detailed localized captioning (DLC). Methodologically: (1) it introduces a focal prompt and a localized vision backbone to strengthen region–text alignment while preserving global context; (2) it designs DLC-SDP, a semi-supervised data pipeline that bootstraps from existing segmentation datasets and expands to unlabeled web images, addressing the scarcity of high-quality DLC data; and (3) it establishes DLC-Bench, a benchmark that evaluates DLC without relying on reference captions. Evaluated across seven keyword-level, phrase-level, and multi-sentence localized captioning benchmarks for images and videos, DAM consistently outperforms state-of-the-art methods in both local detail fidelity and global semantic consistency.
📝 Abstract
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local detail and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with broader context. To tackle the scarcity of high-quality DLC data, we propose a semi-supervised learning (SSL)-based data pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We also introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets a new state of the art on seven benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
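To make the focal-prompt idea concrete, below is a minimal sketch of how a masked region might be presented to a model as two aligned views: the full image with its mask for global context, and a high-resolution crop around the region for local detail. The function name `focal_prompt`, the `expand` factor, and the encoder input `size` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from PIL import Image

def focal_prompt(image: Image.Image, mask: np.ndarray,
                 expand: float = 3.0, size: int = 384):
    """Hypothetical focal prompt: a global view (full image + mask)
    plus a focal view (high-res crop around the masked region).

    `expand` and `size` are illustrative hyperparameters, not the
    values used in the paper.
    """
    # Bounding box of the region of interest from the binary mask.
    ys, xs = np.nonzero(mask)
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()

    # Enlarge the box around its center so the crop keeps some
    # surrounding context, clipped to the image bounds.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = max(x1 - x0, 1) * expand / 2
    half_h = max(y1 - y0, 1) * expand / 2
    box = (int(max(cx - half_w, 0)), int(max(cy - half_h, 0)),
           int(min(cx + half_w, image.width)),
           int(min(cy + half_h, image.height)))

    mask_img = Image.fromarray(mask.astype(np.uint8) * 255)

    # Global view: full image and mask, downsampled to encoder size.
    global_view = (image.resize((size, size)),
                   mask_img.resize((size, size)))

    # Focal view: the crop around the region, kept at high resolution
    # relative to its footprint in the downsampled global view.
    focal_view = (image.crop(box).resize((size, size)),
                  mask_img.crop(box).resize((size, size)))

    return global_view, focal_view
```

In a DAM-style setup, both views (with their masks, e.g. as an extra input channel) would be encoded jointly so the generated caption stays grounded in the target region while remaining consistent with the overall scene.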