Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

📅 2026-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses document retrieval in real-world settings, where document formats and modalities are highly diverse. Text-based methods depend on parsing that discards layout information and is error-prone, while parsing-free visual models struggle to capture fine-grained textual semantics in text-rich documents. To bridge this gap, the authors propose Unveil, a visual-textual embedding framework that integrates textual and visual features into a single document representation. Knowledge distillation then transfers the semantic understanding of this visual-textual model into a purely visual model, enabling retrieval without explicit document parsing. According to the reported experiments, the visual-textual embedding surpasses existing approaches, and distillation narrows the performance gap between visual-textual and visual-only methods while improving retrieval accuracy and efficiency.
📝 Abstract
Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.
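
The abstract does not state the distillation objective, so the following is only a minimal sketch of one common way to distill an embedding model, not the authors' method. It assumes a teacher (visual-textual) and a student (visual-only) that each produce one embedding per document, and combines a cosine alignment term with a KL term over query-document score distributions. The function name `distillation_loss`, the `temperature` value, and the equal weighting of the two terms are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_docs, teacher_docs, queries, temperature=0.05):
    """Hypothetical distillation objective (an assumption, not from the paper).

    student_docs: (n_docs, dim) embeddings from the visual-only student
    teacher_docs: (n_docs, dim) embeddings from the visual-textual teacher
    queries:      (n_queries, dim) query embeddings shared by both models
    """
    s = l2_normalize(student_docs)
    t = l2_normalize(teacher_docs)
    q = l2_normalize(queries)

    # Term 1: pull each student embedding toward its teacher embedding
    # (mean of 1 - cosine similarity over documents).
    align = np.mean(1.0 - np.sum(s * t, axis=-1))

    # Term 2: match the ranking behaviour. For each query, compare the
    # teacher's and student's softmax distributions over documents
    # with KL(teacher || student).
    p_teacher = softmax(q @ t.T / temperature)
    p_student = softmax(q @ s.T / temperature)
    eps = 1e-12
    kl = np.mean(
        np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    )

    # Equal weighting is an arbitrary choice for this sketch.
    return align + kl

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8))
    student = rng.normal(size=(4, 8))
    queries = rng.normal(size=(3, 8))
    print("loss with untrained student:", distillation_loss(student, teacher, queries))
    print("loss with perfect student:  ", distillation_loss(teacher, teacher, queries))
```

In this sketch the loss is approximately zero when the student reproduces the teacher's embeddings and grows as the student's embeddings or rankings diverge. A real training loop would minimize it with respect to the student's parameters in an autodiff framework.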
Problem

Research questions and friction points this paper is trying to address.

document retrieval
multi-modal
visual-textual integration
parsing-free
text-rich documents
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-textual integration
knowledge distillation
parsing-free retrieval
multi-modal document retrieval
document representation