🤖 AI Summary
Existing multilingual topic modeling and clustering approaches suffer from poor scalability, opaque similarity metrics, and insufficient semantic granularity. To address these limitations, we propose the Multilingual Matryoshka Embedding framework, which encodes hierarchical semantic structures—from event-level to topic-level—into a single vector via nested, multi-granular representations. Our method integrates multilingual pretraining with a dimensionality-adaptive pruning mechanism, and introduces a hierarchy-aware similarity metric alongside a lightweight hierarchical clustering algorithm. Evaluated on SemEval 2022 Task 8, it achieves a Pearson correlation coefficient of 0.816—establishing a new state-of-the-art. The framework significantly advances cross-lingual story discovery and thematic abstraction, and, for the first time, enables multilingual hierarchical clustering that is unified (single embedding), multi-granular, interpretable, and strongly generalizable.
📝 Abstract
Contextual large language model embeddings are increasingly used for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity, depending on which subset of the embedding dimensions is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson $\rho$ = 0.816). Once trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.
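The key mechanism behind Matryoshka embeddings is that meaningful similarity can be computed from nested prefixes of a single vector: a short prefix captures coarse (topic-level) relatedness, while the full vector captures fine-grained (event-level) relatedness. A minimal sketch of this idea follows; the dimension sizes, random vectors, and `prefix_similarity` helper are illustrative assumptions, not details from the paper.

```python
import numpy as np

def prefix_similarity(a: np.ndarray, b: np.ndarray, dims: int) -> float:
    """Cosine similarity computed on the first `dims` dimensions only.

    Hypothetical helper: truncating a Matryoshka embedding and
    re-normalizing yields a coarser-grained similarity signal.
    """
    a, b = a[:dims], b[:dims]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Stand-in 256-d embeddings for two articles (random, for illustration).
article_a = rng.normal(size=256)
article_b = rng.normal(size=256)

# Coarse-to-fine comparison: smaller prefixes ~ topic level,
# the full vector ~ event level. The split points are assumptions.
for dims in (64, 128, 256):
    sim = prefix_similarity(article_a, article_b, dims)
    print(f"dims={dims}: similarity={sim:.3f}")
```

Because every granularity lives inside the same vector, a clustering algorithm can cheaply re-compare points at different levels of the hierarchy without storing or recomputing separate embeddings.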