🤖 AI Summary
This study investigates how the language composition of training data jointly affects Korean–English cross-lingual information retrieval (CLIR) and monolingual IR performance. Addressing the performance trade-off between these two tasks in existing models, we construct a Korean–English parallel corpus and systematically evaluate retrieval models trained on different language combinations, showing that language composition governs the balance between CLIR and monolingual retrieval. Building on this finding, we apply Model Merging, a lightweight fusion strategy that targets both tasks without increasing inference overhead. Experiments on the KorQuAD and MSMARCO benchmarks show that the best data configuration combined with merging improves both Korean–English CLIR accuracy (+2.1% MRR@10) and monolingual retrieval effectiveness (+1.3% NDCG@10), supporting the joint design of data composition and model merging.
📝 Abstract
With the increasing use of multilingual text, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To investigate this data-centric aspect systematically, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance and exposes a trade-off between the two tasks: with specific language pairs, CLIR performance improves while Mono-Lingual IR performance declines. We show that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effect of the linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy for optimizing performance across these tasks.
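The abstract does not say which merging method is used. As a minimal sketch of the general idea, the example below assumes the simplest variant: linear interpolation between the weights of two checkpoints that share an architecture, one tuned for CLIR and one for Mono-Lingual IR. The function name `merge_state_dicts`, the mixing coefficient `alpha`, and the toy parameter names are all hypothetical and do not come from the paper.

```python
from typing import Dict, List

# A toy "state dict": parameter name -> flat list of weights.
# Real checkpoints would hold tensors; plain lists keep the sketch dependency-free.
StateDict = Dict[str, List[float]]


def merge_state_dicts(clir: StateDict, mono: StateDict, alpha: float = 0.5) -> StateDict:
    """Linearly interpolate two checkpoints with identical parameter layouts.

    alpha = 1.0 returns the CLIR weights, alpha = 0.0 the monolingual weights.
    This is an assumed merging rule, not necessarily the one used in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if clir.keys() != mono.keys():
        raise ValueError("checkpoints must contain the same parameter names")

    merged: StateDict = {}
    for name, clir_param in clir.items():
        mono_param = mono[name]
        if len(clir_param) != len(mono_param):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted average of the two parameter vectors.
        merged[name] = [
            alpha * c + (1.0 - alpha) * m for c, m in zip(clir_param, mono_param)
        ]
    return merged


if __name__ == "__main__":
    clir_ckpt = {"encoder.w": [1.0, 3.0], "encoder.b": [0.0]}
    mono_ckpt = {"encoder.w": [3.0, 1.0], "encoder.b": [2.0]}
    print(merge_state_dicts(clir_ckpt, mono_ckpt, alpha=0.5))
```

Because the merged model has the same parameter count as either input, serving it costs the same as serving a single checkpoint. In practice `alpha` would be chosen on held-out CLIR and Mono-Lingual IR validation sets.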