🤖 AI Summary
This work addresses the lack of a unified framework for fine-grained vision-to-language captioning across heterogeneous visual domains—including natural images, visual text (e.g., UIs and posters), and structured charts (e.g., tables and flowcharts). We propose the first general-purpose vision description generation model spanning all such domains. Methodologically, we design an end-to-end architecture integrating multi-stage visual encoding with large language models (LLMs), incorporating vision–language semantic alignment, long-context modeling, and domain-adaptive prompting. Key contributions include: (1) the first unified captioning capability across natural images, UI/poster layouts, and structured diagrams; (2) substantial enhancement of LLMs’ multimodal reasoning—demonstrated with DeepSeek-R1; and (3) empirical improvements: 12.6% higher text-to-image generation quality, 40% faster convergence during supervised fine-tuning, and 35% reduction in annotation data requirements.
📝 Abstract
We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.