🤖 AI Summary
To address the scarcity of high-quality multimodal annotations in zero-resource dialogue generation, this paper proposes an implicit multimodal knowledge distillation paradigm that transfers knowledge from modalities such as images and audio to large language models (LLMs) without requiring target-modality annotations. Methodologically, the authors design an implicit distillation framework grounded in contrastive learning and gradient masking, incorporate multimodal prompts to bridge cross-modal semantic gaps, and train lightweight adapters while keeping the LLM parameters frozen for efficient fine-tuning. According to the authors, the approach is the first to eliminate explicit modality alignment and annotation dependence. Experiments on image–text and speech–text dialogue tasks show substantial improvements over strong baselines (+12.7 BLEU, +9.3 METEOR), reaching near-fully-supervised performance at only one-fifth of the training cost. The work thereby establishes a new paradigm for zero-resource cross-modal dialogue generation.
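The summary does not specify the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, and not the paper's actual formulation. The function names (`cosine`, `info_nce`), the `temperature` value, and the toy embeddings are all assumptions made for illustration. The idea shown is that each text embedding should score higher against its paired image or audio embedding than against the other embeddings in the batch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, modal_embs, temperature=0.1):
    """InfoNCE-style loss (hypothetical stand-in for the paper's objective).

    text_embs[i] and modal_embs[i] are treated as a positive pair;
    every other modal_embs[j] in the batch acts as a negative.
    """
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(text_embs[i], modal_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the paired index i as the target.
        total += log_denom - logits[i]
    return total / n


# Toy embeddings, invented for this example.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # pairs point the same way
mismatched = [[0.1, 0.9], [0.9, 0.1]]   # pairs are swapped

loss_aligned = info_nce(text, aligned)
loss_mismatched = info_nce(text, mismatched)
print(loss_aligned < loss_mismatched)
```

In a setup like the one the summary describes, a loss of this kind would presumably be backpropagated only into the adapter weights, with the LLM parameters left frozen; how the gradient masking is applied is not stated in the summary and is not modeled here.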