🤖 AI Summary
To address poor generalization, reliance on domain-specific fine-tuning, and privacy risks (e.g., PII leakage) in non-traditional modalities (audio, biosignals, tabular data) and long-tail prediction tasks, this paper proposes MARVIS, a training-free universal cross-modal reasoning framework. MARVIS maps latent representations of arbitrary modalities into interpretable visualizations, activating the spatial and fine-grained reasoning capabilities of compact vision-language models (VLMs) for zero-shot cross-modal transfer. Using only a single 3B-parameter VLM, MARVIS beats Gemini by 16% on average across vision, audio, biological, and tabular benchmarks and approaches specialized models, while requiring no domain-specific training and exposing no raw sensitive data. To the authors' knowledge, MARVIS is the first framework to enable truly universal multimodal inference with small VLMs under strict privacy-preserving constraints.
📝 Abstract
Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models achieve excellent performance, but lack flexibility. Foundation models offer versatility, but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to predict any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to successfully interpret and utilize them. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B parameter model, achieving results that beat Gemini by 16% on average and approach specialized methods, without exposing personally identifiable information (PII) or requiring any domain-specific training. We open-source our code and datasets at https://github.com/penfever/marvis.
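To make the embed-then-visualize idea concrete, here is a minimal, hypothetical sketch of the pipeline the abstract describes: embed the data, project the embeddings to 2D, render the layout as an image-like plot, and hand that visualization to a VLM. All names (`project_2d`, `render_ascii`, `vlm.ask`) are illustrative assumptions, not the authors' API; PCA and an ASCII grid stand in for whatever reduction and rendering MARVIS actually uses.

```python
import numpy as np

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Reduce high-dimensional embeddings to 2D via PCA (a stand-in
    for the reduction step; MARVIS may use something else, e.g. t-SNE)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Top-2 principal directions from the SVD of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def render_ascii(points: np.ndarray, query_idx: int, size: int = 12) -> str:
    """Rasterize 2D points onto a small character grid -- a toy stand-in
    for the scatter-plot image a VLM would actually receive."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    scaled = (points - lo) / np.where(hi - lo == 0, 1, hi - lo)
    cells = np.minimum((scaled * (size - 1)).astype(int), size - 1)
    grid = [["." for _ in range(size)] for _ in range(size)]
    for i, (x, y) in enumerate(cells):
        if i != query_idx:
            grid[y][x] = "o"
    qx, qy = cells[query_idx]
    grid[qy][qx] = "?"          # the point we want the VLM to classify
    return "\n".join("".join(row) for row in grid)

rng = np.random.default_rng(0)
emb = rng.normal(size=(40, 64))   # e.g. audio or tabular embeddings
pts = project_2d(emb)
image_like = render_ascii(pts, query_idx=0)
# A real system would now prompt a VLM with the rendered plot, e.g.:
# answer = vlm.ask(image_like, "Which cluster does the '?' point belong to?")
```

Because the VLM only ever sees the rendered layout of the embedding space, never the raw records, this is also where the PII-avoidance property in the abstract comes from.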