CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English

📅 2025-10-22
🤖 AI Summary
Problem: Existing cross-lingual news analysis datasets rely on expert curation, which limits their scalability and language coverage; fine-grained benchmarks are especially scarce outside English-dominant contexts. Method: We propose the first interpretable, scalable crowdsourcing framework for annotating cross-lingual news semantic similarity, centered on Ukrainian and covering Polish, Russian, and English. Annotators justify each similarity label using a 4W (Who/What/Where/When) schema, which makes the labels fine-grained and explainable. Contribution/Results: Using this framework, we introduce and publicly release CrossNews-UA, the first benchmark for cross-lingual news semantic similarity in Eastern European languages. An evaluation spanning bag-of-words models, multilingual Transformers, and large language models shows that cross-lingual news matching remains difficult for all three model families. The benchmark extends semantic similarity evaluation to an underrepresented group of languages and gives an empirical basis for work on multilingual fake news detection.

📝 Abstract
In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising approach to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, limiting their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset, CrossNews-UA, of news pairs with Ukrainian as the central language, paired with linguistically and contextually relevant languages: Polish, Russian, and English. Each news pair is annotated for semantic similarity with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words approaches through Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into model performance.
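
To make the 4W annotation idea concrete, the following is a minimal sketch of how one annotated news pair could be represented. The record layout, the field names, and the `agreement_ratio` helper are illustrative assumptions; they are not the released CrossNews-UA schema or the scoring used in the paper.

```python
from dataclasses import dataclass

# Hypothetical record layout: field names are illustrative assumptions,
# not the actual CrossNews-UA data format.
FOUR_W = ("who", "what", "where", "when")


@dataclass
class NewsPairAnnotation:
    source_lang: str  # e.g. "uk" for the central language
    target_lang: str  # e.g. "pl", "ru", or "en"
    who: bool         # True if the annotator judged the actors to match
    what: bool        # True if the described event matches
    where: bool       # True if the location matches
    when: bool        # True if the time frame matches

    def matched_criteria(self) -> list[str]:
        """Return the names of the 4W criteria marked as matching."""
        return [name for name in FOUR_W if getattr(self, name)]

    def agreement_ratio(self) -> float:
        """Fraction of 4W criteria that match (an assumed, simplistic score)."""
        return len(self.matched_criteria()) / len(FOUR_W)


pair = NewsPairAnnotation("uk", "pl", who=True, what=True, where=False, when=True)
print(pair.matched_criteria(), pair.agreement_ratio())
```

Keeping the four criteria as separate fields preserves the justification behind each label, which is the explainability property the abstract emphasizes; collapsing them into a single number is optional and shown here only for illustration.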

Problem

Research questions and friction points this paper is trying to address.

Addressing scalable cross-lingual news similarity assessment
Detecting fake news across multiple languages effectively
Overcoming limitations of manually curated news datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable crowdsourcing pipeline for cross-lingual news assessment
Multi-language dataset with semantic similarity annotations
Evaluation of diverse models, from bag-of-words to LLMs
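
As a point of reference for the simplest end of that model range, here is a minimal bag-of-words cosine similarity sketch using only the Python standard library. It is not the paper's baseline: the tokenizer and function names are assumptions, and the sketch assumes both texts are already in the same language (for example after translation), since raw word counts share little vocabulary across languages.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word tokens (Unicode-aware)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


first = bag_of_words("Officials met in Kyiv on Monday")
second = bag_of_words("On Monday officials met in Kyiv")
print(cosine_similarity(first, second))
```

Because word order is discarded, the two example sentences receive the same vector, which illustrates both the appeal and the limits of this kind of baseline.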