🤖 AI Summary
Addressing the challenge of generating fine-grained, contextually coherent multi-sentence descriptions for arbitrary regions in images and videos, this paper proposes the Describe Anything Model (DAM) for detailed localized captioning (DLC). Methodologically: (1) it introduces a focal prompt and a localized vision backbone to strengthen region–text alignment while preserving global context; (2) it designs DLC-SDP, a semi-supervised data pipeline that bootstraps from existing segmentation datasets and expands to unlabeled web images, addressing the scarcity of high-quality DLC data; and (3) it establishes DLC-Bench, a benchmark that evaluates DLC without relying on reference captions. Evaluated across seven keyword-level, phrase-level, and multi-sentence localized captioning benchmarks for images and videos, DAM consistently outperforms state-of-the-art methods in both local detail fidelity and global semantic consistency.
📝 Abstract
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local detail and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with broader context. To tackle the scarcity of high-quality DLC data, we propose a semi-supervised learning (SSL)-based data pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We also introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets a new state of the art on seven benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
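To make the focal-prompt idea concrete, below is a minimal sketch of how a masked region might be presented to a model as two aligned views: the full image with its mask for global context, and a high-resolution crop around the region for local detail. The function name `focal_prompt`, the `expand` factor, and the encoder input `size` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from PIL import Image

def focal_prompt(image: Image.Image, mask: np.ndarray,
                 expand: float = 3.0, size: int = 384):
    """Hypothetical focal prompt: a global view (full image + mask)
    plus a focal view (high-res crop around the masked region).

    `expand` and `size` are illustrative hyperparameters, not the
    values used in the paper.
    """
    # Bounding box of the region of interest from the binary mask.
    ys, xs = np.nonzero(mask)
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()

    # Enlarge the box around its center so the crop keeps some
    # surrounding context, clipped to the image bounds.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = max(x1 - x0, 1) * expand / 2
    half_h = max(y1 - y0, 1) * expand / 2
    box = (int(max(cx - half_w, 0)), int(max(cy - half_h, 0)),
           int(min(cx + half_w, image.width)),
           int(min(cy + half_h, image.height)))

    mask_img = Image.fromarray(mask.astype(np.uint8) * 255)

    # Global view: full image and mask, downsampled to encoder size.
    global_view = (image.resize((size, size)),
                   mask_img.resize((size, size)))

    # Focal view: the crop around the region, kept at high resolution
    # relative to its footprint in the downsampled global view.
    focal_view = (image.crop(box).resize((size, size)),
                  mask_img.crop(box).resize((size, size)))

    return global_view, focal_view
```

In a DAM-style setup, both views (with their masks, e.g. as an extra input channel) would be encoded jointly so the generated caption stays grounded in the target region while remaining consistent with the overall scene.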