Image-Text Relation Prediction for Multilingual Tweets

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses cross-lingual prediction of semantic relationships between images and text in multilingual tweets. To overcome the lack of low-resource language benchmarks in prior research, we introduce the first multilingual vision–language relation classification task and construct Latvian-English TweetVLM—the first high-quality, human-annotated, linguistically aligned bilingual Twitter benchmark. Leveraging this benchmark, we systematically evaluate the cross-lingual generalization capabilities of multilingual vision–language models (e.g., X-VLM, FLAVA), incorporating cross-lingual transfer and balanced sampling strategies. Experimental results reveal that state-of-the-art models achieve strong performance on English but suffer an average accuracy drop of 12.3% on Latvian, exposing a critical bottleneck in low-resource language understanding. Our contributions include: (1) a novel multilingual multimodal task; (2) the first curated bilingual benchmark for cross-lingual vision–language evaluation; and (3) empirical insights into the limitations of current models in low-resource settings—advancing research in cross-lingual multimodal understanding.

📝 Abstract
Various social networks have allowed media uploads for over a decade now. Still, it has not always been clear what relation the uploaded media have to the posted text, or whether there is any relation at all. In this work, we explore how multilingual vision-language models tackle the task of image-text relation prediction in different languages, and we construct a dedicated balanced benchmark data set from Twitter posts in Latvian along with their manual translations into English. We compare our results to previous work and show that more recently released vision-language model checkpoints are becoming increasingly capable at this task, but that there is still much room for further improvement.
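To make the task concrete: image-text relation prediction can be framed as classifying a (tweet text, image) pair. As a minimal illustration only (this is not the authors' method, and the embeddings and threshold below are hypothetical placeholders), a naive baseline could score the similarity of pre-computed image and text embeddings and threshold it:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_relation(img_emb, txt_emb, threshold=0.3):
    """Toy baseline: call the pair 'related' if the embeddings are similar enough.

    The actual task uses trained vision-language models; this just shows the
    input/output shape of the prediction problem.
    """
    return "related" if cosine(img_emb, txt_emb) >= threshold else "unrelated"

# Hypothetical pre-computed embeddings for one tweet's image and text
label = predict_relation([1.0, 0.0], [0.9, 0.1])
```

Real systems replace the fixed threshold with a classification head trained on labeled pairs, but the pair-in, relation-label-out structure is the same.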
Problem

Research questions and friction points this paper is trying to address.

Predicting image-text relations in multilingual tweets
Evaluating vision-language models across different languages
Creating a balanced benchmark dataset for Latvian and English tweets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual vision-language models for relation prediction
Balanced benchmark dataset from Latvian tweets
Comparison with recent vision-language model checkpoints
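The "balanced" aspect of the benchmark can be sketched as class-balanced downsampling: keep the same number of examples per relation label. This is a generic sketch, not the authors' construction pipeline; the example data and label names are invented for illustration:

```python
import random

def balanced_sample(examples, label_key="label", per_class=None, seed=0):
    """Downsample a labeled dataset so every class is equally represented.

    Uses the smallest class size as the per-class count unless one is given.
    """
    rng = random.Random(seed)
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)
    n = per_class or min(len(items) for items in by_label.values())
    sampled = []
    for items in by_label.values():
        sampled.extend(rng.sample(items, n))
    rng.shuffle(sampled)
    return sampled

# Hypothetical tweets with an imbalanced label distribution (30 vs. 10)
data = (
    [{"text": f"t{i}", "label": "image_adds_meaning"} for i in range(30)]
    + [{"text": f"u{i}", "label": "no_relation"} for i in range(10)]
)
balanced = balanced_sample(data)  # 10 examples of each class
```

Balancing matters for evaluation because accuracy on a skewed label distribution can be inflated by a majority-class baseline.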
Matīss Rikters
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology
Edison Marrese-Taylor
National Institute of Advanced Industrial Science and Technology (AIST)
Natural Language Processing - Machine Learning