TextME: Bridging Unseen Modalities Through Text Descriptions

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal extension is often hindered by the high annotation cost of large-scale paired data, particularly in specialized domains such as medical imaging and molecular analysis. This work proposes TextME, a framework that, for the first time, maps diverse modalities—including images, audio, 3D, X-rays, and molecular data—into the embedding space of large language models using only textual descriptions, without any modality-paired supervision. By leveraging the geometric structure of pretrained contrastive encoders, TextME enables zero-shot cross-modal transfer purely through text-driven alignment. This approach establishes a novel paradigm for modality extension, achieving effective zero-shot retrieval across heterogeneous, unaligned modalities—such as audio-to-image or 3D-to-X-ray—while preserving the representational capacity of the pretrained encoders.

📝 Abstract
Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text-image, text-audio, text-3D, text-molecule), which are costly and often infeasible in domains requiring expert annotation, such as medical imaging and molecular analysis. We introduce TextME, to the best of our knowledge the first text-only modality expansion framework, which projects diverse modalities into the LLM embedding space as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that consistent modality gaps exist across the image, video, audio, 3D, X-ray, and molecular domains, demonstrating that text-only training preserves substantial performance of the pretrained encoders. We further show that our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only training as a practical alternative to paired supervision for modality expansion.
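The core idea in the abstract — training a projection on text embeddings alone, then reusing it for another modality that shares the contrastive space up to a roughly constant modality gap — can be illustrated with a toy synthetic sketch. Everything below (the encoders, the random "LLM space", the gap model) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_llm, n = 32, 64, 200  # toy encoder dim, toy "LLM" dim, sample count

# Simulated contrastive text embeddings (unit-normalized).
text_emb = rng.normal(size=(n, d_enc))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Simulated paired image embeddings: same space, shifted by a fixed
# "modality gap" vector. These pairs are NOT used for training.
gap = 0.05 * rng.normal(size=d_enc)
image_emb = text_emb + gap
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

# Stand-in for the LLM embedding space: a fixed linear map of text features.
llm_proj = rng.normal(size=(d_enc, d_llm))
llm_text = text_emb @ llm_proj

# "Text-only training": fit a projection from the encoder's text space into
# the LLM space using text alone (least squares here, for simplicity).
W, *_ = np.linalg.lstsq(text_emb, llm_text, rcond=None)

# Zero-shot transfer: apply the text-trained projection to image embeddings.
img_in_llm = image_emb @ W

# Retrieval: for each projected image, find the nearest LLM-space text anchor.
sims = img_in_llm @ llm_text.T
pred = sims.argmax(axis=1)
acc = (pred == np.arange(n)).mean()
print(f"zero-shot image->text retrieval accuracy: {acc:.2f}")
```

Because the modality gap is (approximately) a constant offset, a projection fit only on text still places image embeddings near their corresponding text anchors, so retrieval succeeds without any paired supervision in training.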
Problem

Research questions and friction points this paper is trying to address.

multimodal representation
modality expansion
paired datasets
zero-shot transfer
text-only training
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-only modality expansion
zero-shot cross-modal transfer
LLM embedding space
contrastive encoder geometry
emergent cross-modal retrieval
Soyeon Hong
Department of Artificial Intelligence, Ajou University, Suwon, South Korea
Jinchan Kim
Department of Artificial Intelligence, Ajou University, Suwon, South Korea
Jaegook You
Department of Artificial Intelligence, Ajou University, Suwon, South Korea
Seungtaek Choi
Division of Language & AI, Hankuk University of Foreign Studies, Seoul, Korea
Suha Kwak
POSTECH
Computer Vision, Machine Learning
Hyunsouk Cho
Assistant Professor, Ajou University, Korea
Artificial Intelligence