FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the insufficient deep integration of visual and linguistic representations in multimodal LLMs. We propose FUSION, a family of multimodal large language models featuring full-modality dynamic alignment. Methodologically, we introduce three novel components: (1) text-guided unified vision encoding, (2) context-aware recursive alignment decoding, and (3) a dual-supervised semantic mapping loss. Together these go beyond conventional late-fusion paradigms to enable pixel-level, question-level, and end-to-end cross-modal unified modeling. Despite its compact 3B parameter count, FUSION-3B outperforms Cambrian-1 (8B) and Florence-VL (8B) on most benchmarks using only 630 vision tokens; even with only 300 vision tokens it still surpasses Cambrian-1 (8B), and in ablations under the same configuration without dynamic resolution it exceeds LLaVA-NeXT on over half of the evaluated benchmarks. To further support fine-grained vision-language alignment, we construct a language-driven synthesized QA dataset.
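To make the first component concrete, here is a minimal PyTorch-style sketch of what text-guided vision encoding could look like: the question embedding is injected into the vision encoder so patch features are computed conditioned on the text from the start. Every module name, dimension, and the injection point are illustrative assumptions, not FUSION's actual implementation.

```python
# Hypothetical sketch of text-guided unified vision encoding (not FUSION's code):
# patch features cross-attend to the question embedding before self-attention,
# so the visual representation is text-conditioned rather than fused late.
import torch
import torch.nn as nn

class TextGuidedVisionEncoder(nn.Module):
    """Vision encoder whose patch features attend to the question text."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, n_layers: int = 2):
        super().__init__()
        # 14x14 patches, as in a standard ViT; all sizes are illustrative only.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=14, stride=14)
        self.text_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, image: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); text_emb: (B, T, d_model) embedded question tokens
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, d)
        # Inject the question into the patch stream before self-attention.
        guided, _ = self.text_cross_attn(patches, text_emb, text_emb)
        return self.blocks(patches + guided)

enc = TextGuidedVisionEncoder()
tokens = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 16, 768))
print(tokens.shape)  # torch.Size([1, 256, 768])
```

Here a 224x224 input yields 256 visual tokens; the paper's 630- and 300-token budgets would follow from its own resolution and compression choices.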

📝 Abstract
We introduce FUSION, a family of multimodal large language models (MLLMs) with a full vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, which incorporates textual information into vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding, which recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop a Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales (3B and 8B) and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks, and continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION
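The abstract does not spell out the form of the Dual-Supervised Semantic Mapping Loss. One plausible reading, sketched below under my own assumptions (linear projections in both directions and an MSE penalty; the paper's actual loss may differ), is to supervise the vision-to-text mapping and the text-to-vision mapping jointly:

```python
# Guessed formulation of a dual-supervised semantic mapping loss, for
# illustration only: project vision features into the text embedding space
# and text features into the vision space, and penalize both directions.
import torch
import torch.nn.functional as F
from torch import nn

class DualSupervisedMappingLoss(nn.Module):
    """Supervise the vision->text and text->vision projections jointly."""

    def __init__(self, d_vis: int = 1024, d_txt: int = 768):
        super().__init__()
        self.v2t = nn.Linear(d_vis, d_txt)  # map vision features into text space
        self.t2v = nn.Linear(d_txt, d_vis)  # map text features into vision space

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, d_vis) pooled vision features; txt: (B, d_txt) pooled text features
        loss_v2t = F.mse_loss(self.v2t(vis), txt)  # vision mapped into text space
        loss_t2v = F.mse_loss(self.t2v(txt), vis)  # text mapped into vision space
        return loss_v2t + loss_t2v                 # "dual" supervision
```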
Problem

Research questions and friction points this paper is trying to address.

Achieving deep vision-language integration in MLLMs rather than late-stage fusion
Enabling fine-grained, question-level semantic alignment during decoding
Mitigating modality discrepancies between visual and textual feature spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-Guided Unified Vision Encoding for pixel-level integration
Context-Aware Recursive Alignment Decoding for fine-grained, question-level semantics (sketched below)
Dual-Supervised Semantic Mapping Loss for modality alignment
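As a rough sketch of the recursive alignment idea named above: re-aggregate the vision tokens from the evolving text hidden state at every decoder layer, instead of fusing once before decoding. The interfaces, the per-layer re-aggregation point, and the omitted causal mask are my assumptions, not the paper's code.

```python
# Hypothetical sketch of context-aware recursive alignment decoding: visual
# context is recomputed from the current text state at each layer, rather
# than fixed after a one-shot projection. Names are invented for illustration.
import torch
from torch import nn

class RecursiveAlignmentDecoder(nn.Module):
    """Re-aggregate vision features from the evolving text state per layer."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.align = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_h: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        # text_h: (B, T, d) text hidden states; vision_tokens: (B, N, d)
        h = text_h
        for layer in self.layers:
            # Query the vision tokens with the *current* hidden state, so
            # alignment is recomputed recursively as the text context evolves
            # (a causal mask is omitted to keep the sketch short).
            ctx, _ = self.align(h, vision_tokens, vision_tokens)
            h = layer(h + ctx, vision_tokens)  # decoder layer cross-attends too
        return h
```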
👥 Authors
Zheng Liu (Peking University, Shanghai AI Laboratory)
Mengjie Liu (AstraZeneca)
Jingzhou Chen (Peking University, Shanghai AI Laboratory)
Jingwei Xu (Nanjing University)
Bin Cui (Peking University)
Conghui He (Shanghai AI Laboratory)
Wentao Zhang (Institute of Physics, Chinese Academy of Sciences)