UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses aligning and ranking images against idiomatic noun phrases (e.g., English and Brazilian Portuguese idiomatic NPs) in a multilingual multimodal setting, without relying on manual annotations. The proposed method uses generative large language models (LLMs) to interpret idiomatic meaning and produce natural-language semantic descriptions, which are then encoded into cross-modal embeddings via multilingual CLIP, combining LLM-based paraphrasing with multilingual vision-language modeling for idiomaticity representation. A contrastive learning framework with data augmentation is applied to learn transferable, idiomaticity-aware embeddings. Experiments on image ranking show that the method outperforms baselines that use the raw noun phrases alone. Notably, zero-shot CLIP embeddings without fine-tuning surpass their fine-tuned counterparts, supporting both the effectiveness and the generalizability of the LLM-derived idiomatic semantics.

📝 Abstract
SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
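The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, assuming the LLM-generated idiomatic meaning and the candidate images have already been encoded into a shared embedding space by a multilingual CLIP model; the toy vectors stand in for real embeddings, and the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def rank_images(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Rank candidate images by cosine similarity to a text embedding.

    text_emb:   (d,) embedding of the LLM-generated idiomatic meaning.
    image_embs: (n, d) embeddings of the n candidate images.
    Returns image indices, most similar first.
    """
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb          # cosine similarities, shape (n,)
    return np.argsort(-sims)              # indices in descending similarity

# Toy example: image 2 points almost the same way as the text embedding.
text = np.array([1.0, 0.0, 0.0])
images = np.array([[0.0, 1.0, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.9, 0.1, 0.0]])
print(rank_images(text, images))  # [2 1 0]
```

In practice the text side would come from encoding the LLM-generated meaning (rather than the raw nominal compound) with a multilingual CLIP text encoder, which is the substitution the paper evaluates.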
Problem

Research questions and friction points this paper is trying to address.

Ranking images based on idiomatic nominal compounds in English and Portuguese.
Enhancing idiomatic compound representations using generative LLMs and CLIP models.
Improving image ranking through multimodal embeddings and contrastive learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative LLMs enhance idiomatic compound meanings.
Multilingual CLIP models encode idiomatic representations.
Contrastive learning fine-tunes embeddings for image ranking.
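The contrastive fine-tuning mentioned above can be illustrated with an InfoNCE-style objective over matched text-image pairs, where each pair's partners are positives and the rest of the batch serves as negatives. This is a sketch under assumptions, not the paper's exact loss (the page does not specify it); all names and the temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(text_embs: np.ndarray, image_embs: np.ndarray,
                  temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss for a batch of matched (text, image) pairs.

    Row i of text_embs is the positive for row i of image_embs;
    all other rows in the batch act as in-batch negatives.
    """
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature            # (n, n) similarity matrix

    def xent(m: np.ndarray) -> float:
        # Cross-entropy with the diagonal (matched pairs) as targets.
        m = m - m.max(axis=1, keepdims=True)    # numerical stability
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text->image and image->text directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Correctly matched pairs should yield a lower loss than misassigned ones.
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
aligned = info_nce_loss(embs, embs)       # each text matches its own image
shuffled = info_nce_loss(embs, embs[::-1])  # positives deliberately swapped
print(aligned < shuffled)  # True
```

Minimizing this loss pulls each compound's (LLM-enriched) text embedding toward its matching image and pushes it away from the other images in the batch; data augmentation supplies additional positive pairs.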
Thanet Markchom
Department of Computer Science, University of Reading
Recommender Systems · Machine Learning · Computer Vision · Natural Language Processing

Tong Wu
Formerly at School of Computing, Newcastle University, Newcastle upon Tyne, UK

Liting Huang
School of Computing, Newcastle University, Newcastle upon Tyne, UK

Huizhi Liang
Newcastle University
Data Mining · Machine Learning · Personalization · Recommender Systems