🤖 AI Summary
This work addresses a key bottleneck in multimodal understanding: its heavy reliance on large-scale annotated data and model training. We propose MILS, a Multimodal Iterative LLM Solver that requires no training or fine-tuning. MILS equips purely text-based LLMs with zero-shot visual and auditory comprehension via multi-step prompting, candidate generation, feedback-based rescoring, and gradient-free embedding inversion. Its core contribution is the first demonstration of cross-modal reasoning without any parameter updates, enabling emergent zero-shot captioning, media-generation optimization, cross-modal arithmetic, and embedding inversion for images, videos, and audio. MILS achieves state-of-the-art performance across multiple zero-shot multimodal understanding benchmarks and demonstrates practical utility in prompt optimization and style-transfer tasks.
📝 Abstract
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbuing multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually converging on a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video, and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
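The generate-score-feedback loop described in the abstract can be sketched in a few lines. Below, `propose` and `score` are toy stand-ins (my own hypothetical stubs, not the paper's code): in MILS proper, `propose` would prompt an LLM with the best-scored candidates so far, and `score` would be a multimodal scorer such as an image-text similarity model.

```python
def propose(feedback, k=4):
    # Stand-in for prompting an LLM, conditioned on top candidates so far.
    base = feedback[0] if feedback else ""
    return [base + c for c in "abcd"[:k]]

def score(candidate, target="abc"):
    # Stand-in for a multimodal scorer (e.g., caption-image similarity);
    # here it just rewards matching a fixed target prefix.
    return sum(1 for x, y in zip(candidate, target) if x == y)

def mils_loop(steps=3, keep=2):
    """Gradient-free iterative optimization: generate, score, feed back."""
    feedback = []
    for _ in range(steps):
        candidates = propose(feedback)
        ranked = sorted(candidates, key=score, reverse=True)
        feedback = ranked[:keep]  # top-scored candidates go back to the LLM
    return feedback[0]

print(mils_loop())  # the toy loop converges to the target "abc"
```

Because only candidate texts and scalar scores cross the loop boundary, no gradients or parameter updates are needed, which is what lets the same loop drive captioning, prompt rewriting, and embedding inversion.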