🤖 AI Summary
This work addresses the limitations of existing vision-language models, which exhibit suboptimal performance on text retrieval tasks and incur substantial storage and inference overhead in multilingual settings due to multi-encoder architectures. To overcome these challenges, we propose a unified multitask learning framework that jointly optimizes multilingual image-to-text retrieval, text-to-image retrieval, and natural language understanding (NLU) within a single model for the first time. By sharing a common text encoder and aligning cross-modal embedding spaces with NLU-enhanced semantic representations, our approach significantly improves retrieval accuracy across modalities and languages while reducing system complexity and inference costs. This enables efficient, unified representation learning without compromising performance.
📝 Abstract
Multimodal retrieval systems typically employ Vision Language Models (VLMs) that encode images and text independently into vectors within a shared embedding space. Despite incorporating text encoders, VLMs consistently underperform specialized text models on text-only retrieval tasks. Moreover, introducing additional text encoders increases storage and inference overhead and exacerbates retrieval inefficiencies, especially in multilingual settings. To address these limitations, we propose a multi-task learning framework that unifies the feature representation across images, long and short texts, and intent-rich queries. To our knowledge, this is the first work to jointly optimize multilingual image retrieval, text retrieval, and natural language understanding (NLU) tasks within a single framework. Our approach integrates image and text retrieval through a shared text encoder, enhanced with NLU features to improve intent understanding and retrieval accuracy.
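The dual-encoder retrieval setup described above can be sketched minimally: images and text are embedded into the same space by separate encoders, and retrieval ranks candidates by cosine similarity. This is an illustrative toy example, not the paper's implementation; the random vectors stand in for real encoder outputs.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical pre-computed embeddings: in a real VLM these would come from
# the image encoder and the shared text encoder, projected to one space.
rng = np.random.default_rng(0)
image_embs = normalize(rng.normal(size=(5, 64)))  # 5 candidate images
query_emb = normalize(rng.normal(size=(64,)))     # 1 text query

# Retrieval: cosine similarity between query and each candidate, ranked descending.
scores = image_embs @ query_emb
ranking = np.argsort(-scores)
best_match = int(ranking[0])  # index of the top-ranked image
```

The same scoring applies symmetrically for image-to-text retrieval by swapping query and candidate roles, which is what makes a single shared embedding space attractive.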