Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

📅 2025-09-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work investigates whether large language models (LLMs) can reason over non-text modalities without any additional training. To this end, we propose In-Context Representation Learning (ICRL), a proof-of-concept framework that requires no fine-tuning: representations from a non-text foundation model (FM), such as a molecular encoder, replace the text inputs of a few-shot prompt given to a text-only LLM. We evaluate ICRL on a suite of tasks in the molecular domain and study three questions: how to map FM representations into the LLM without training, which factors influence ICRL performance, and what mechanisms underlie its effectiveness. To our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, pointing toward a lightweight route to adaptable multi-modal reasoning.

📝 Abstract
The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.

Problem

Research questions and friction points this paper is trying to address.

Integrating non-text modality representations into LLMs without training
Enabling multi-modal inference using foundational models without fine-tuning
Exploring training-free adaptation of LLMs to new domains and modalities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates non-text modality representations without training
Uses in-context learning with foundation model representations
Enables multi-modal inference through representation replacement
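
The idea of replacing text inputs with FM representations in a few-shot prompt can be illustrated with a small sketch. This is not the paper's implementation: the fixed random projection used as the training-free mapping, the function names `project_to_llm_space` and `build_icrl_sequence`, and all dimensions are assumptions chosen for illustration. The sketch only shows how projected representations and label-token embeddings could be interleaved into one input sequence.

```python
import numpy as np


def project_to_llm_space(fm_reps, llm_dim, seed=0):
    """Map FM representations to the LLM embedding width without training.

    ASSUMPTION: a fixed random Gaussian projection stands in for whatever
    training-free mapping the paper actually studies.
    """
    rng = np.random.default_rng(seed)
    fm_dim = fm_reps.shape[1]
    proj = rng.standard_normal((fm_dim, llm_dim)) / np.sqrt(fm_dim)
    return fm_reps @ proj


def build_icrl_sequence(example_reps, label_embs, query_rep):
    """Interleave [rep_1, label_1, ..., rep_k, label_k, query_rep].

    example_reps: (k, llm_dim) projected FM representations of the examples.
    label_embs:   list of k arrays, each (n_label_tokens, llm_dim).
    query_rep:    (llm_dim,) projected FM representation of the query.
    """
    rows = []
    for rep, lab in zip(example_reps, label_embs):
        rows.append(rep[None, :])  # one "token" slot holds the FM representation
        rows.append(lab)           # followed by the label's token embeddings
    rows.append(query_rep[None, :])
    return np.concatenate(rows, axis=0)


# Toy data (hypothetical): 3 in-context examples plus 1 query, FM width 16.
rng = np.random.default_rng(1)
fm_reps = rng.standard_normal((4, 16))
projected = project_to_llm_space(fm_reps, llm_dim=32)
# Stand-ins for the LLM's own embeddings of each 2-token label.
label_embs = [rng.standard_normal((2, 32)) for _ in range(3)]

seq = build_icrl_sequence(projected[:3], label_embs, projected[3])
print(seq.shape)  # (10, 32): 3 * (1 rep + 2 label tokens) + 1 query rep
```

In a real setup, a sequence like `seq` would be fed to the LLM as input embeddings instead of token ids, and the label embeddings would come from the LLM's own embedding table rather than random draws.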