🤖 AI Summary
It remains unclear whether large language models (LLMs) trained exclusively on text implicitly encode modality-specific visual and auditory representations, since they never receive genuine multimodal input.
Method: To probe these latent capabilities, we propose a lightweight prompting strategy that leverages sensory verbs (e.g., “see”, “hear”) to explicitly activate modality-selective internal structures during text generation, without any architectural modification or fine-tuning. We systematically assess alignment between LLM hidden states and representations from specialized vision/audio encoders using centered kernel alignment (CKA) and canonical correlation analysis (CCA).
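As a concrete illustration of the alignment measurement, here is a minimal Python sketch of linear CKA and a CCA-based score, assuming paired hidden states have already been collected as (n_samples, dim) matrices; the function names and the use of scikit-learn's CCA are our own choices for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def linear_cka(X, Y):
    """Linear centered kernel alignment (Kornblith et al., 2019).

    X: (n, d1) LLM hidden states; Y: (n, d2) vision/audio encoder
    features, where row i of each matrix describes the same stimulus.
    """
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") *
                    np.linalg.norm(Y.T @ Y, "fro"))

def mean_cca_corr(X, Y, n_components=10):
    """Mean canonical correlation over the top n_components pairs."""
    cca = CCA(n_components=n_components, max_iter=1000)
    Xc, Yc = cca.fit_transform(X, Y)
    return float(np.mean([np.corrcoef(Xc[:, i], Yc[:, i])[0, 1]
                          for i in range(n_components)]))
```

Both scores are bounded, so they can be compared across layers, modalities, and prompting conditions.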
Results: Our approach reliably elicits implicit representations that are semantically consistent with the targeted perceptual modality, and it significantly improves cross-modal representational consistency between LLMs and multimodal encoders through prompt engineering alone. This work establishes a novel paradigm for uncovering implicit multimodal competence in text-only models.
📝 Abstract
Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to “see” or “hear”, it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.
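To make the prompting setup concrete, the following sketch (assuming a Hugging Face transformers workflow) extracts mean-pooled hidden states for the same caption under a neutral prompt and under sensory prompts; the model choice and the exact prompt templates are illustrative assumptions, not the paper's wording.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any text-only LLM works in principle; gpt2 is a small placeholder.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def pooled_hidden(text, layer=-1):
    """Mean-pool one layer's token states into a single vector."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

caption = "a dog barking at a mailman"
plain = pooled_hidden(caption)
# Hypothetical sensory prompt templates built around "see" / "hear":
visual = pooled_hidden(f"Imagine you can see this scene: {caption}")
audio = pooled_hidden(f"Imagine you can hear this scene: {caption}")
```

Stacking such vectors over a caption set yields the (n_samples, dim) matrices that linear_cka above consumes, with the comparison side coming from a specialist encoder (e.g., a CLIP image tower or an audio encoder) run on matching stimuli.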