OmniCaptioner: One Captioner to Rule Them All

📅 2025-04-09

📈 Citations: 0

✨ Influential: 0

career value

169K/year

🤖 AI Summary

This work addresses the lack of a unified framework for fine-grained vision-to-language captioning across heterogeneous visual domains—including natural images, visual text (e.g., UIs and posters), and structured charts (e.g., tables and flowcharts). We propose the first general-purpose vision description generation model spanning all such domains. Methodologically, we design an end-to-end architecture integrating multi-stage visual encoding with large language models (LLMs), incorporating vision–language semantic alignment, long-context modeling, and domain-adaptive prompting. Key contributions include: (1) the first unified captioning capability across natural images, UI/poster layouts, and structured diagrams; (2) substantial enhancement of LLMs’ multimodal reasoning—demonstrated with DeepSeek-R1; and (3) empirical improvements: 12.6% higher text-to-image generation quality, 40% faster convergence during supervised fine-tuning, and 35% reduction in annotation data requirements.

Technology Category

Application Category

📝 Abstract

We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.

Problem

Research questions and friction points this paper is trying to address.

Unified framework for captioning diverse visual domains

Bridging gap between visual and textual modalities

Enhancing multimodal reasoning and image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified captioning for diverse visual domains

Converts pixels to rich textual representations

Enhances LLM reasoning with detailed captions

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs