🤖 AI Summary
To address poor generalization, reliance on domain-specific fine-tuning, and privacy risks (e.g., PII leakage) in non-traditional modalities (audio, biosignals, tabular data) and long-tail prediction tasks, this paper proposes MARVIS, a training-free universal cross-modal reasoning framework. MARVIS maps latent representations of arbitrary modalities into interpretable visualizations, activating the spatial and fine-grained reasoning capabilities of compact vision-language models (VLMs) for zero-shot cross-modal transfer. Using only a single 3B-parameter VLM, MARVIS beats Gemini by 16% on average across vision, audio, biological, and tabular benchmarks and approaches specialized models, while requiring no domain-specific training and exposing no raw sensitive data. To the authors' knowledge, MARVIS is the first framework to enable truly universal multimodal inference with small VLMs under strict privacy-preserving constraints.
📝 Abstract
Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models achieve excellent performance, but lack flexibility. Foundation models offer versatility, but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to predict any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to successfully interpret and utilize them. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B parameter model, achieving results that beat Gemini by 16% on average and approach specialized methods, without exposing personally identifiable information (PII) or requiring any domain-specific training. We open-source our code and datasets at https://github.com/penfever/marvis.
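To make the embed-then-visualize idea concrete, here is a minimal, hypothetical sketch of the pipeline the abstract describes: embed the data, project the embeddings to 2D, render the layout as an image-like plot, and hand that visualization to a VLM. All names (`project_2d`, `render_ascii`, `vlm.ask`) are illustrative assumptions, not the authors' API; PCA and an ASCII grid stand in for whatever reduction and rendering MARVIS actually uses.

```python
import numpy as np

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Reduce high-dimensional embeddings to 2D via PCA (a stand-in
    for the reduction step; MARVIS may use something else, e.g. t-SNE)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Top-2 principal directions from the SVD of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def render_ascii(points: np.ndarray, query_idx: int, size: int = 12) -> str:
    """Rasterize 2D points onto a small character grid -- a toy stand-in
    for the scatter-plot image a VLM would actually receive."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    scaled = (points - lo) / np.where(hi - lo == 0, 1, hi - lo)
    cells = np.minimum((scaled * (size - 1)).astype(int), size - 1)
    grid = [["." for _ in range(size)] for _ in range(size)]
    for i, (x, y) in enumerate(cells):
        if i != query_idx:
            grid[y][x] = "o"
    qx, qy = cells[query_idx]
    grid[qy][qx] = "?"          # the point we want the VLM to classify
    return "\n".join("".join(row) for row in grid)

rng = np.random.default_rng(0)
emb = rng.normal(size=(40, 64))   # e.g. audio or tabular embeddings
pts = project_2d(emb)
image_like = render_ascii(pts, query_idx=0)
# A real system would now prompt a VLM with the rendered plot, e.g.:
# answer = vlm.ask(image_like, "Which cluster does the '?' point belong to?")
```

Because the VLM only ever sees the rendered layout of the embedding space, never the raw records, this is also where the PII-avoidance property in the abstract comes from.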