🤖 AI Summary
It remains unclear whether large language models (LLMs) trained exclusively on text implicitly encode modality-specific visual and auditory representations, since they never receive genuine multimodal input.
Method: To probe these latent capabilities, we propose a lightweight prompting strategy that leverages sensory verbs (e.g., “see”, “hear”) to explicitly activate modality-selective internal structures during text generation, without any architectural modification or fine-tuning. We systematically assess alignment between LLM hidden states and representations from specialized vision/audio encoders using centered kernel alignment (CKA) and canonical correlation analysis (CCA).
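As a concrete illustration of the alignment measurement, here is a minimal Python sketch of linear CKA and a CCA-based score, assuming paired hidden states have already been collected as (n_samples, dim) matrices; the function names and the use of scikit-learn's CCA are our own choices for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def linear_cka(X, Y):
    """Linear centered kernel alignment (Kornblith et al., 2019).

    X: (n, d1) LLM hidden states; Y: (n, d2) vision/audio encoder
    features, where row i of each matrix describes the same stimulus.
    """
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") *
                    np.linalg.norm(Y.T @ Y, "fro"))

def mean_cca_corr(X, Y, n_components=10):
    """Mean canonical correlation over the top n_components pairs."""
    cca = CCA(n_components=n_components, max_iter=1000)
    Xc, Yc = cca.fit_transform(X, Y)
    return float(np.mean([np.corrcoef(Xc[:, i], Yc[:, i])[0, 1]
                          for i in range(n_components)]))
```

Both scores are bounded, so they can be compared across layers, modalities, and prompting conditions.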
Results: Our approach reliably elicits implicit representations that are semantically consistent with the targeted perceptual modality, and it significantly improves cross-modal representational consistency between LLMs and multimodal encoders through prompt engineering alone. This work establishes a novel paradigm for uncovering implicit multimodal competence in text-only models.
📝 Abstract
Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to “see” or “hear”, it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.
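To make the prompting setup concrete, the following sketch (assuming a Hugging Face transformers workflow) extracts mean-pooled hidden states for the same caption under a neutral prompt and under sensory prompts; the model choice and the exact prompt templates are illustrative assumptions, not the paper's wording.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any text-only LLM works in principle; gpt2 is a small placeholder.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def pooled_hidden(text, layer=-1):
    """Mean-pool one layer's token states into a single vector."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

caption = "a dog barking at a mailman"
plain = pooled_hidden(caption)
# Hypothetical sensory prompt templates built around "see" / "hear":
visual = pooled_hidden(f"Imagine you can see this scene: {caption}")
audio = pooled_hidden(f"Imagine you can hear this scene: {caption}")
```

Stacking such vectors over a caption set yields the (n_samples, dim) matrices that linear_cka above consumes, with the comparison side coming from a specialist encoder (e.g., a CLIP image tower or an audio encoder) run on matching stimuli.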