Latent Denoising Improves Visual Alignment in Large Multimodal Models

📅 2026-04-23

🤖 AI Summary
This work addresses the weak internal visual representations and brittle behavior under distribution shift of current large multimodal models (LMMs), which the authors attribute to training with an autoregressive language modeling objective that supervises visual tokens only indirectly. The authors apply latent-space denoising as a form of visual supervision for LMMs. Their approach corrupts projected visual tokens with a saliency-aware mixture of masking and Gaussian noise, then trains a decoder to recover clean teacher patch features from the hidden states at a selected intermediate layer of the language model. To prevent representation collapse, the framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. Corruption and the auxiliary heads are disabled at inference, so the method adds no inference-time overhead. It consistently improves over strong baselines across standard multimodal benchmarks, with clear gains on compositional robustness benchmarks such as NaturalBench and reduced degradation under ImageNet-C-style common corruptions.
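
The corruption step described above can be illustrated with a minimal sketch. The summary does not say how saliency decides between masking and noising, so the rule used here (mask the most salient tokens, add Gaussian noise to the rest) is an assumption, as are the function name `corrupt_tokens` and the parameters `mask_ratio`, `noise_std`, and `mask_value`. It is not the authors' implementation.

```python
import random


def corrupt_tokens(tokens, saliency, mask_ratio=0.3, noise_std=0.5,
                   mask_value=0.0, seed=0):
    """Corrupt projected visual tokens with a mix of masking and Gaussian noise.

    tokens:   list of token vectors (each a list of floats)
    saliency: one score per token; higher means more salient

    ASSUMPTION (not from the paper): the top `mask_ratio` fraction of tokens
    by saliency is masked, and every other token gets additive Gaussian noise.
    """
    rng = random.Random(seed)
    n_tokens = len(tokens)
    n_mask = int(round(mask_ratio * n_tokens))

    # Token indices ordered from most to least salient.
    by_saliency = sorted(range(n_tokens), key=lambda i: saliency[i], reverse=True)
    masked = set(by_saliency[:n_mask])

    corrupted = []
    for i, token in enumerate(tokens):
        if i in masked:
            # Replace the whole token with a constant mask vector.
            corrupted.append([mask_value] * len(token))
        else:
            # Perturb each coordinate with zero-mean Gaussian noise.
            corrupted.append([x + rng.gauss(0.0, noise_std) for x in token])
    return corrupted, masked
```

In a training loop, the corrupted tokens would replace the clean projected tokens as input to the language model, and the returned `masked` set would identify which positions were masked rather than noised. At inference the function would simply not be called, which matches the statement that corruption is disabled and adds no overhead.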

๐Ÿ“ Abstract
Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.
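
The three training signals named in the abstract (feature reconstruction, intra-image similarity preservation, and intra-image contrastive patch distillation) can be sketched as simple loss functions. The abstract does not give their exact forms, so everything below is an assumption chosen for illustration: mean squared error for reconstruction, a squared difference between cosine-similarity matrices for structure preservation, and an InfoNCE-style loss in which each decoded patch should match the teacher patch at the same position against the other patches of the same image. All function names and the `temperature` parameter are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and the rows of b."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def reconstruction_loss(student, teacher):
    """ASSUMED form: mean squared error between decoded and teacher patches."""
    squared = [(s - t) ** 2
               for s_vec, t_vec in zip(student, teacher)
               for s, t in zip(s_vec, t_vec)]
    return sum(squared) / len(squared)


def similarity_structure_loss(student, teacher):
    """ASSUMED form: match the student's patch-to-patch similarity matrix
    to the teacher's patch-to-patch similarity matrix within one image."""
    s_sim = cosine_matrix(student, student)
    t_sim = cosine_matrix(teacher, teacher)
    n = len(s_sim)
    return sum((s_sim[i][j] - t_sim[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


def contrastive_patch_loss(student, teacher, temperature=0.1):
    """ASSUMED form: InfoNCE over the patches of a single image. The positive
    for student patch i is teacher patch i; all other teacher patches of the
    same image act as negatives."""
    logits = cosine_matrix(student, teacher)
    total = 0.0
    for i, row in enumerate(logits):
        scaled = [x / temperature for x in row]
        top = max(scaled)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in scaled))
        total += log_denominator - scaled[i]
    return total / len(logits)
```

A training objective in this spirit would add a weighted sum of these terms to the usual autoregressive language modeling loss; the weights are not stated in the abstract and would need to be taken from the paper or the released code.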
Problem

Research questions and friction points this paper is trying to address.

Large Multimodal Models
visual representation
distribution shift
visual supervision
multimodal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent denoising
visual alignment
multimodal models
contrastive distillation
robustness