🤖 AI Summary
Vision-language models (VLMs) often suffer from poorly coordinated cross-modal attention, leading to semantic-visual misalignment, attention drift, and limited interpretability. To address this, we propose Consistent Cross-layer Regional Alignment (CCRA), a cross-layer region-wise attention alignment framework with two core components: Layer-Patch-wise Cross Attention (LPWCA), which models fine-grained cross-attention between multi-layer semantic representations and image patches, and Progressive Attention Integration (PAI), which enforces cross-layer consistency through hierarchical attention fusion. The method adds only 3.55 million parameters yet achieves state-of-the-art performance on ten major vision-language benchmarks, improving regional focus precision and semantic-visual alignment while enhancing interpretability. These results demonstrate that lightweight refinements to cross-modal attention design can yield substantial gains in both accuracy and transparency.
📝 Abstract
Vision-Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch-wise and layer-wise embeddings, and Progressive Attention Integration (PAI), which systematically coordinates the LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from the semantic to the regional level while preventing attention drift and preserving the benefits of each individual attention mechanism. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.
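The two components described above can be sketched in code. The following is a minimal NumPy illustration, not the paper's implementation: it assumes a single text query vector and visual features of shape (layers, patches, dim), and uses plain scaled-dot-product softmax attention. The function names, shapes, and the way the layer-then-patch refinement is combined are illustrative assumptions; the key idea shown is LPWCA's *joint* weighting over the layer and patch axes, followed by PAI's sequential layer-wise and patch-wise attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lpwca(text_q, vis_feats):
    """Layer-Patch-wise Cross Attention (sketch).

    text_q:    (d,)        text-side query vector (assumed pooled)
    vis_feats: (L, P, d)   features from L vision layers, P patches each
    Returns a fused visual embedding (d,) and the joint (L, P) weights.
    """
    L, P, d = vis_feats.shape
    flat = vis_feats.reshape(L * P, d)             # treat every (layer, patch) pair as one token
    scores = flat @ text_q / np.sqrt(d)            # (L*P,) cross-attention logits
    w = softmax(scores).reshape(L, P)              # joint layer-patch weights, sum to 1
    fused = (w[..., None] * vis_feats).sum((0, 1)) # (d,) jointly weighted embedding
    return fused, w

def pai(text_q, vis_feats):
    """Progressive Attention Integration (sketch): LPWCA, then
    layer-wise, then patch-wise attention in sequence."""
    _, w_joint = lpwca(text_q, vis_feats)
    # Layer-wise attention, informed by the joint LPWCA weights (assumption).
    layer_w = softmax(w_joint.sum(axis=1))                    # (L,)
    per_layer = (layer_w[:, None, None] * vis_feats).sum(0)   # (P, d) layer-fused patches
    # Patch-wise attention over the layer-fused patch embeddings.
    d = per_layer.shape[-1]
    patch_w = softmax(per_layer @ text_q / np.sqrt(d))        # (P,)
    return (patch_w[:, None] * per_layer).sum(0)              # (d,) final aligned embedding
```

For example, with 4 vision layers, 9 patches, and 8-dimensional features, `pai(q, vis)` returns a single 8-dimensional text-aligned visual embedding, and the intermediate `(4, 9)` weight map from `lpwca` can be inspected to visualize which regions and layers the model attends to.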