Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

📅 2026-03-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses embedding inconsistency and semantic ambiguity in cross-modal retrieval between document images and text, which stem from modality heterogeneity, unstructured layouts, and static training paradigms. The authors propose Evo-Retriever, a framework that enables fine-grained alignment through multi-view, multi-scale image-text matching. It integrates bidirectional contrastive learning with hard negative mining to establish complementary learning pathways, and it uses a large language model as a meta-controller that dynamically adjusts the training curriculum, adaptively rebalancing supervisory signals. Together, Viewpoint-Pathway collaboration and LLM-guided curriculum evolution let the model keep improving as training progresses, achieving state-of-the-art nDCG@5 scores of 65.2% on ViDoRe V2 and 77.1% on MMEB (VisDoc).
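The reported metric, nDCG@5, can be illustrated with a minimal sketch. It assumes the standard definition with linear gain and a log2 rank discount; the function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the paper or from either benchmark's evaluation code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k for one query.

    ranked_relevances lists the relevance label of every candidate document,
    in the order the retriever ranked them. Assumption: the list covers all
    relevant documents for the query, so sorting it yields the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One relevant page, retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0, 0, 0], k=5)
```

A benchmark score such as the 65.2% above would then be the mean of this per-query value over all queries, expressed as a percentage.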

📝 Abstract

Vision-language models (VLMs) excel at data mappings, but the heterogeneity and lack of structure of real-world documents disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%, respectively.
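The late-interaction scoring that the abstract refers to can be sketched as follows. This assumes the common ColBERT-style MaxSim operator over multi-vector representations; the abstract does not say which operator Evo-Retriever uses, and the names `maxsim_score` and `rank_documents` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction relevance score (assumed MaxSim form).

    Each query token vector is matched to its most similar document patch
    vector, and the per-token maxima are summed.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def rank_documents(query_vecs, docs):
    """Return document ids sorted by descending MaxSim score."""
    scores = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: two query token embeddings, two pages with two patch embeddings each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens exactly
    "page_b": [[1.0, 0.0], [0.5, 0.5]],  # matches only the first token exactly
}
ranking = rank_documents(query, pages)
```

Because every query token is scored against every patch, one page image can match different parts of a query with different regions, which is the fine-grained alignment that multi-vector retrievers rely on.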

Problem

Research questions and friction points this paper is trying to address.

multimodal document retrieval

cross-modal embedding inconsistency

training curriculum adaptation

document heterogeneity

retrieval confusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided curriculum

Viewpoint-Pathway collaboration

multimodal retrieval