🤖 AI Summary
This work addresses the limitations of existing vision-language models, which exhibit suboptimal performance on text retrieval tasks and incur substantial storage and inference overhead in multilingual settings due to multi-encoder architectures. To overcome these challenges, we propose a unified multitask learning framework that jointly optimizes multilingual image-to-text retrieval, text-to-image retrieval, and natural language understanding (NLU) within a single model for the first time. By sharing a common text encoder and aligning cross-modal embedding spaces with NLU-enhanced semantic representations, our approach significantly improves retrieval accuracy across modalities and languages while reducing system complexity and inference costs. This enables efficient, unified representation learning without compromising performance.
📝 Abstract
Multimodal retrieval systems typically employ Vision Language Models (VLMs) that encode images and text independently into vectors within a shared embedding space. Despite incorporating text encoders, VLMs consistently underperform specialized text models on text-only retrieval tasks. Moreover, introducing additional text encoders increases storage and inference overhead and exacerbates retrieval inefficiencies, especially in multilingual settings. To address these limitations, we propose a multi-task learning framework that unifies the feature representation across images, long and short texts, and intent-rich queries. To our knowledge, this is the first work to jointly optimize multilingual image retrieval, text retrieval, and natural language understanding (NLU) tasks within a single framework. Our approach integrates image and text retrieval through a shared text encoder, enhanced with NLU features to improve intent understanding and retrieval accuracy.
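The dual-encoder retrieval setup described above can be sketched minimally: images and text are embedded into the same space by separate encoders, and retrieval ranks candidates by cosine similarity. This is an illustrative toy example, not the paper's implementation; the random vectors stand in for real encoder outputs.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical pre-computed embeddings: in a real VLM these would come from
# the image encoder and the shared text encoder, projected to one space.
rng = np.random.default_rng(0)
image_embs = normalize(rng.normal(size=(5, 64)))  # 5 candidate images
query_emb = normalize(rng.normal(size=(64,)))     # 1 text query

# Retrieval: cosine similarity between query and each candidate, ranked descending.
scores = image_embs @ query_emb
ranking = np.argsort(-scores)
best_match = int(ranking[0])  # index of the top-ranked image
```

The same scoring applies symmetrically for image-to-text retrieval by swapping query and candidate roles, which is what makes a single shared embedding space attractive.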