AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current controllable multimodal captioning suffers from insufficient fine-grained control and the absence of standardized evaluation protocols, hindering content–style alignment and instruction adherence. To address these challenges, we propose ACM, a lightweight, plug-and-play framework that enhances controllability of foundation models without fine-tuning, via feature reweighting and instruction–multimodal feature fusion. We introduce ACD, the first large-scale, instruction-rich multimodal dataset covering images, videos, and audio, with 28 diverse instruction categories. Furthermore, we design AnyCapEval, a decoupled benchmark that independently quantifies content accuracy and style fidelity. Experiments demonstrate that ACM-8B boosts GPT-4o’s performance on AnyCapEval by +45% in content score and +12% in style score, while also achieving significant gains over baselines on established benchmarks including MIA-Bench and VidCapBench.

📝 Abstract
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o’s content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
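The abstract outlines ACM's flow at a high level: the frozen base model's caption is kept, then rewritten under the user instruction with the help of modality features. Below is a minimal Python sketch of that plug-and-play loop; every name here (CaptionRequest, base_model.caption, acm.encode, acm.rewrite) is a hypothetical illustration, not the AnyCap Project's actual API.

```python
# Minimal sketch of the plug-and-play captioning flow described in the
# abstract. All class and method names are hypothetical illustrations,
# not the AnyCap Project's actual interfaces.

from dataclasses import dataclass

@dataclass
class CaptionRequest:
    media: bytes      # raw image / video / audio payload
    modality: str     # "image" | "video" | "audio"
    instruction: str  # user's controllability instruction

def refine_caption(base_model, acm, request: CaptionRequest) -> str:
    """Run the frozen base model once, then let the lightweight ACM
    module rewrite its caption under the user instruction."""
    # 1. The frozen foundation model produces an initial caption;
    #    its weights are never updated.
    draft = base_model.caption(request.media)

    # 2. ACM fuses the instruction with modality features and the draft,
    #    reweighting features so the output follows the instruction.
    #    (The fusion/reweighting internals are abstracted away here.)
    features = acm.encode(request.media, request.modality)
    return acm.rewrite(draft, request.instruction, features)
```

Because the base model is treated as a black box that only supplies a draft caption, the same ACM module can sit on top of different foundation models, which matches the paper's claim of improvements across a diverse set of base models.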
Problem

Research questions and friction points this paper is trying to address.

Insufficient fine-grained control in omni-modal captioning models
Data scarcity for training controllable multimodal captioners
Lack of reliable evaluation metrics that separate content accuracy from style fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight plug-and-play framework (ACM) that adds controllability without retraining base models
Large-scale instruction-rich dataset (ACD) spanning images, videos, and audio
Decoupled benchmark (AnyCapEval) scoring content accuracy and style fidelity separately (see the sketch below)
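To make the decoupling concrete, here is a toy Python sketch of evaluating content accuracy and style fidelity as independent axes. The judge interface, prompts, and 0–10 scale are assumptions for illustration only; they are not AnyCapEval's actual protocol.

```python
# Toy illustration of decoupled caption evaluation: content accuracy and
# style fidelity are scored independently and never collapsed into one
# number. Prompts and scales are assumptions, not AnyCapEval's protocol.

def evaluate_caption(judge, caption: str, reference: str, instruction: str) -> dict:
    """Score one caption on two separate axes using a judge model."""
    content = judge.score(
        "Rate 0-10 how factually consistent this caption is with the "
        f"reference.\nReference: {reference}\nCaption: {caption}"
    )
    style = judge.score(
        "Rate 0-10 how well the caption follows the style instruction.\n"
        f"Instruction: {instruction}\nCaption: {caption}"
    )
    # Report both axes side by side; averaging them would re-entangle
    # content and style, which is exactly what decoupling avoids.
    return {"content": content, "style": style}
```

Keeping the two scores separate is what lets the paper report distinct gains (e.g., +45% content vs. +12% style for ACM-8B on GPT-4o) instead of a single aggregate that would hide where the improvement comes from.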