Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

📅 2026-03-17
🤖 AI Summary
This work addresses embedding inconsistency and semantic ambiguity in cross-modal retrieval between document images and text, which stem from modality heterogeneity, unstructured layouts, and static training paradigms. The authors propose Evo-Retriever, a framework that performs fine-grained alignment through multi-view, multi-scale image-text matching. It combines bidirectional contrastive learning with hard negative mining to establish complementary learning pathways, and it uses a large language model as a meta-controller that dynamically adjusts the training curriculum and adaptively rebalances supervisory signals. Together, the Viewpoint-Pathway collaboration and the LLM-guided curriculum evolution let training adapt to the model's changing state, and Evo-Retriever achieves state-of-the-art nDCG@5 scores of 65.2% on ViDoRe V2 and 77.1% on MMEB (VisDoc).
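The results above are reported as nDCG@5. As a reminder of what that metric measures, here is a minimal sketch of nDCG@k in Python. It assumes linear gain (the relevance label itself) and a log2 rank discount. The summary does not state which gain function or relevance grading the benchmarks use, so the function names and these choices are illustrative rather than the authors' evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels (linear gain, assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document among the candidates
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels of retrieved pages, listed in the order the retriever ranked them.
perfect = ndcg_at_k([1, 0, 0, 0, 0])
second_place = ndcg_at_k([0, 1, 0, 0, 0])
```

A ranking that places the only relevant page first scores 1.0. Moving that page to rank 2 lowers the score by the discount factor 1/log2(3).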

📝 Abstract
Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
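To make the late-interaction and bidirectional contrastive ideas in the abstract concrete, the sketch below scores a query against a document page with a MaxSim sum over multi-vector embeddings, then applies a symmetric in-batch InfoNCE loss in both directions (query to document and document to query). This is a generic illustration, not Evo-Retriever's actual objective. The function names, the temperature value, and the use of in-batch negatives in place of the paper's generated "hard queries" are all assumptions.

```python
import math


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: each query vector keeps its best dot product
    over all document vectors, and the per-vector maxima are summed."""
    return sum(
        max(sum(q_i * d_i for q_i, d_i in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )


def _diagonal_cross_entropy(score_rows, temperature):
    """Mean cross-entropy where the positive for row i sits in column i."""
    total = 0.0
    for i, row in enumerate(score_rows):
        logits = [s / temperature for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]
    return total / len(score_rows)


def bidirectional_contrastive_loss(queries, docs, temperature=0.05):
    """Symmetric InfoNCE over a batch of (query, document) pairs.
    queries[i] and docs[i] are lists of vectors; pair i is the positive.
    The temperature is a hypothetical default, not a value from the paper."""
    scores = [[maxsim_score(q, d) for d in docs] for q in queries]
    transposed = [list(col) for col in zip(*scores)]
    query_to_doc = _diagonal_cross_entropy(scores, temperature)
    doc_to_query = _diagonal_cross_entropy(transposed, temperature)
    return 0.5 * (query_to_doc + doc_to_query)


# Toy batch: two queries (one vector each) and two pages (multi-vector).
queries = [[[1.0, 0.0]], [[0.0, 1.0]]]
docs = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0]]]
aligned_loss = bidirectional_contrastive_loss(queries, docs)
swapped_loss = bidirectional_contrastive_loss(queries, docs[::-1])
```

In this toy batch the correctly paired pages give a much lower loss than the swapped pairing, which is the signal a retriever is trained on. A curriculum controller such as the one the abstract describes would presumably reweight terms like `query_to_doc` and `doc_to_query` over time, but the abstract does not specify how.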
Problem

Research questions and friction points this paper is trying to address.

multimodal document retrieval
cross-modal embedding inconsistency
training curriculum adaptation
document heterogeneity
retrieval confusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided curriculum
Viewpoint-Pathway collaboration
multimodal retrieval
bidirectional contrastive learning
multi-view alignment
Authors

Weiqing Li (Alibaba Cloud Computing)
Jinyue Guo (Alibaba Cloud Computing)
Yaqi Wang (Alibaba Cloud Computing)
Haiyang Xiao (Alibaba Cloud Computing)
Yuewei Zhang (Alibaba Cloud)
Guohua Liu (Alibaba Cloud Computing)
Hao Henry Wang (Alibaba Cloud Computing)