🤖 AI Summary
This work addresses a key bottleneck in multimodal understanding: its heavy reliance on large-scale annotated data and model training. We propose MILS, a Multimodal Iterative LLM Solver that requires no training or fine-tuning. MILS equips purely text-based LLMs with zero-shot visual and auditory comprehension via multi-step prompting, candidate generation, feedback-based rescoring, and gradient-free embedding inversion. Its core contribution is the first demonstration of cross-modal reasoning without any parameter updates, enabling emergent zero-shot captioning, media-generation optimization, cross-modal arithmetic, and embedding inversion for images, videos, and audio. MILS achieves state-of-the-art performance across multiple zero-shot multimodal understanding benchmarks and demonstrates practical utility in prompt optimization and style-transfer tasks.
📝 Abstract
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbuing multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually converging on a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video, and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
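The generate-score-feedback loop described in the abstract can be sketched in a few lines. Below, `propose` and `score` are toy stand-ins (my own hypothetical stubs, not the paper's code): in MILS proper, `propose` would prompt an LLM with the best-scored candidates so far, and `score` would be a multimodal scorer such as an image-text similarity model.

```python
def propose(feedback, k=4):
    # Stand-in for prompting an LLM, conditioned on top candidates so far.
    base = feedback[0] if feedback else ""
    return [base + c for c in "abcd"[:k]]

def score(candidate, target="abc"):
    # Stand-in for a multimodal scorer (e.g., caption-image similarity);
    # here it just rewards matching a fixed target prefix.
    return sum(1 for x, y in zip(candidate, target) if x == y)

def mils_loop(steps=3, keep=2):
    """Gradient-free iterative optimization: generate, score, feed back."""
    feedback = []
    for _ in range(steps):
        candidates = propose(feedback)
        ranked = sorted(candidates, key=score, reverse=True)
        feedback = ranked[:keep]  # top-scored candidates go back to the LLM
    return feedback[0]

print(mils_loop())  # the toy loop converges to the target "abc"
```

Because only candidate texts and scalar scores cross the loop boundary, no gradients or parameter updates are needed, which is what lets the same loop drive captioning, prompt rewriting, and embedding inversion.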