🤖 AI Summary
Multimodal large language models (MLLMs) employing CLIP-ViT-based visual encoders excel at global feature modeling but struggle to capture local spatial relationships among image patches, limiting fine-grained visual understanding. To address this, we propose LLaVA-SP, a lightweight spatial enhancement mechanism that introduces only six learnable spatial visual tokens. Local structural priors are extracted via convolutional kernels, and a novel projector explicitly encodes two spatial inductive biases: “center-to-global” and “abstract-to-concrete.” Fine-grained visual features are fused into the visual representation via cross-attention, yielding two efficient variants: Cropping and Pooling. Extensive evaluation across multiple multimodal benchmarks demonstrates consistent and significant improvements over LLaVA-1.5, with notable gains in visual reasoning and captioning tasks, while incurring negligible overhead in inference latency. Our code and models are publicly released.
📝 Abstract
The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which adds only six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at https://github.com/CnFaker/LLaVA-SP.
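The token-construction idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation, and several details are assumptions made only for illustration: a 24x24 patch grid (576 patch tokens), region sizes of 4, 8, 12, 16, 20 and 24, a random region-sized kernel standing in for each learned convolutional kernel, a projection-free single-head cross-attention, and the choice to place the six spatial tokens before the patch tokens. All function names are hypothetical.

```python
import numpy as np

def center_crop(grid, size):
    """Return the central size x size window of an (H, W, D) patch grid."""
    h, w, _ = grid.shape
    top, left = (h - size) // 2, (w - size) // 2
    return grid[top:top + size, left:left + size]

def adaptive_avg_pool(grid, size):
    """Average-pool an (H, W, D) grid down to (size, size, D)."""
    h, w, d = grid.shape
    out = np.empty((size, size, d))
    for i in range(size):
        r0, r1 = (i * h) // size, -((-(i + 1) * h) // size)  # floor, ceil
        for j in range(size):
            c0, c1 = (j * w) // size, -((-(j + 1) * w) // size)
            out[i, j] = grid[r0:r1, c0:c1].mean(axis=(0, 1))
    return out

def spatial_tokens(grid, sizes, mode, rng):
    """Reduce each multi-scale region to one token with a region-sized kernel."""
    tokens = []
    for s in sizes:
        if mode == "cropping":   # "from central region to global"
            region = center_crop(grid, s)
        else:                    # "pooling": "from abstract to specific"
            region = adaptive_avg_pool(grid, s)
        # Assumption: a random (s, s) kernel stands in for a learned conv kernel.
        kernel = rng.standard_normal((s, s)) / s
        tokens.append(np.tensordot(kernel, region, axes=([0, 1], [0, 1])))
    return np.stack(tokens)      # (len(sizes), D)

def cross_attention(queries, keys_values):
    """Single-head attention without learned projections (a simplification)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return queries + weights @ keys_values                # residual update

def build_visual_sequence(patch_grid, mode, rng, sizes=(4, 8, 12, 16, 20, 24)):
    """Join six spatial tokens with the original patch tokens."""
    patches = patch_grid.reshape(-1, patch_grid.shape[-1])
    sp = spatial_tokens(patch_grid, sizes, mode, rng)
    sp = cross_attention(sp, patches)  # spatial tokens attend to patch features
    return np.concatenate([sp, patches], axis=0)

rng = np.random.default_rng(0)
patch_grid = rng.standard_normal((24, 24, 32))  # assumed 24x24 grid, toy dim 32
for mode in ("cropping", "pooling"):
    print(mode, build_visual_sequence(patch_grid, mode, rng).shape)
```

Under these assumptions both variants produce 582 visual tokens (576 patch tokens plus 6 spatial tokens), which is consistent with the abstract's claim of nearly identical inference latency: the sequence passed to the language model grows by about 1%.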