Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how the language composition of training data jointly shapes Korean–English cross-lingual information retrieval (CLIR) and monolingual IR performance. Addressing the performance trade-off between these two tasks that is prevalent in existing models, we construct a high-quality Korean–English parallel corpus and systematically evaluate diverse multilingual data mixtures, showing that language composition critically governs the balance between CLIR and monolingual retrieval. Building on this finding, we propose a lightweight model-merging strategy that jointly optimizes both tasks without increasing inference overhead. Experiments on KorQuAD and MSMARCO benchmarks demonstrate that the optimal data configuration combined with our merging method simultaneously improves Korean–English CLIR accuracy (+2.1% MRR@10) and monolingual retrieval effectiveness (+1.3% NDCG@10), validating the efficacy of co-designing data composition and model combination.

📝 Abstract
With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effects of linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy to optimize performance across these tasks.
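The "various language combinations" described above can be sketched as sampling query–passage training pairs from a parallel corpus according to a language-composition mixture. This is a minimal illustrative sketch; the corpus structure, language codes, and mixture ratios are assumptions, not the paper's actual configuration.

```python
import random

def build_training_mix(corpus, composition, seed=0):
    """Sample (query, passage) training pairs from a Korean-English parallel
    corpus according to a language-composition dict mapping
    (query_lang, passage_lang) directions to mixture weights,
    e.g. {("ko", "en"): 0.5, ("ko", "ko"): 0.5}.

    Each corpus entry holds the same query and passage in both languages:
    {"query": {"ko": ..., "en": ...}, "passage": {"ko": ..., "en": ...}}.
    """
    rng = random.Random(seed)
    directions = list(composition.keys())
    weights = list(composition.values())
    mix = []
    for entry in corpus:
        # Pick one language direction per example, weighted by the mixture.
        q_lang, p_lang = rng.choices(directions, weights=weights, k=1)[0]
        mix.append((entry["query"][q_lang], entry["passage"][p_lang]))
    return mix
```

Setting `composition` to, say, `{("ko", "en"): 1.0}` yields a purely cross-lingual (Korean query, English passage) training set, while `{("ko", "ko"): 1.0}` yields a purely monolingual one; intermediate mixtures interpolate between the two regimes.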
Problem

Research questions and friction points this paper is trying to address.

Impact of training data composition on cross-lingual retrieval performance
Trade-off between optimizing CLIR and preserving mono-lingual IR performance
Model Merging as a solution to balance CLIR and mono-lingual IR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Construct linguistically parallel Korean-English datasets
Analyze training data language composition impact
Apply Model Merging to optimize CLIR performance
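The Model Merging step above presumably combines a CLIR-tuned retriever with a monolingual one in weight space. A minimal sketch, assuming simple linear interpolation over matching parameter tensors (represented here as plain lists of floats); the paper's exact merging recipe is not specified in this summary and may differ.

```python
def merge_models(state_a, state_b, alpha=0.5):
    """Linearly interpolate two checkpoints' parameters:
    merged = alpha * A + (1 - alpha) * B.

    A simple weight-space merge in the spirit of model soups /
    task arithmetic; both models must share one architecture.
    """
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {
        name: [alpha * wa + (1 - alpha) * wb
               for wa, wb in zip(state_a[name], state_b[name])]
        for name in state_a
    }
```

Because merging happens once, offline, the resulting model is a single retriever of the original size, which is why this strategy adds no inference overhead.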
Youngjoon Jang, KAIST (Computer Vision, Machine Learning)
Junyoung Son, Korea University
Taemin Lee, Korea University
Seongtae Hong, Korea University (Natural Language Processing)
Heuiseok Lim, Korea University