Unified Multimodal and Multilingual Retrieval via Multi-Task Learning with NLU Integration

📅 2026-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing vision-language models, which exhibit suboptimal performance on text retrieval tasks and incur substantial storage and inference overhead in multilingual settings due to multi-encoder architectures. To overcome these challenges, we propose a unified multitask learning framework that jointly optimizes multilingual image-to-text retrieval, text-to-image retrieval, and natural language understanding (NLU) within a single model for the first time. By sharing a common text encoder and aligning cross-modal embedding spaces with NLU-enhanced semantic representations, our approach significantly improves retrieval accuracy across modalities and languages while reducing system complexity and inference costs. This enables efficient, unified representation learning without compromising performance.

📝 Abstract
Multimodal retrieval systems typically employ Vision Language Models (VLMs) that independently encode images and text into vectors in a shared embedding space. Despite incorporating text encoders, VLMs consistently underperform specialized text models on text-only retrieval tasks. Moreover, introducing additional text encoders increases storage and inference overhead and exacerbates retrieval inefficiencies, especially in multilingual settings. To address these limitations, we propose a multi-task learning framework that unifies feature representations across images, long and short texts, and intent-rich queries. To our knowledge, this is the first work to jointly optimize multilingual image retrieval, text retrieval, and natural language understanding (NLU) tasks within a single framework. Our approach integrates image and text retrieval through a shared text encoder, enhanced with NLU features to improve intent understanding and retrieval accuracy.
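The joint objective sketched in the abstract (image-to-text retrieval, text-to-image retrieval, and NLU sharing one text encoder) can be illustrated as a weighted sum of a symmetric contrastive loss and an intent-classification loss. This is a minimal sketch under assumed conventions; the function names, loss weights, and temperature are illustrative and not taken from the paper.

```python
# Hypothetical sketch of the multi-task objective: a shared text encoder
# feeds three losses -- image->text retrieval, text->image retrieval, and
# NLU intent classification. Weights and temperature are assumptions.
import numpy as np

def info_nce(q, k, temp=0.07):
    """InfoNCE over a batch: q[i] should match k[i] (the diagonal)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temp                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # diagonal = positive pairs

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy for the NLU intent head."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(labels)), labels])

def multitask_loss(img_emb, txt_emb, intent_logits, intent_labels,
                   w_i2t=1.0, w_t2i=1.0, w_nlu=0.5):
    """Weighted sum of the three task losses sharing one text encoder."""
    return (w_i2t * info_nce(img_emb, txt_emb)    # image -> text retrieval
            + w_t2i * info_nce(txt_emb, img_emb)  # text -> image retrieval
            + w_nlu * cross_entropy(intent_logits, intent_labels))

# Toy batch: 8 image/text pairs with 32-dim embeddings, 5 intent classes.
rng = np.random.default_rng(0)
B, D, C = 8, 32, 5
loss = multitask_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)),
                      rng.normal(size=(B, C)), rng.integers(0, C, size=B))
print(float(loss))
```

In practice the text embeddings for both retrieval terms and the intent logits would all come from the single shared text encoder, which is what lets one model serve all three tasks.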
Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval
multilingual retrieval
text-only retrieval
retrieval efficiency
Vision Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-task learning
multimodal retrieval
multilingual retrieval
NLU integration
shared text encoder
👥 Authors
Xinyuan Zhang (Xiaomi Corporation, China)
Lina Zhang (Xiaomi Corporation, China)
Lisung Chen (Xiaomi Corporation, China)
Guangyao Liu (Huawei) · Photonics Integrated Circuits, Transceivers, Long-haul Systems
Shuai Nie (Xiaomi Corporation, China)
Jiaming Xu (Xiaomi Corp.; formerly CASIA) · Speech and Language Processing, Speech Separation, Dialogue Systems
Runyu Shi (Xiaomi Corporation, China)
Ying Huang (Xiaomi Corporation, China)
Guoquan Zhang (Xiaomi Corporation, China)