Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives

📅 2025-04-20

🤖 AI Summary

This study addresses the challenges of topic modeling on Belgian Dutch personal narratives, a morphologically rich, low-resource setting in which cultural sensitivity and semantic coherence are hard to achieve. Method: BERTopic (Sentence-BERT embeddings, UMAP dimensionality reduction, and HDBSCAN clustering) is systematically evaluated against LDA and K-means on authentic Belgian Dutch narrative data. A hybrid evaluation framework combines automated metrics (e.g., coherence, diversity) with expert human assessment of cultural appropriateness and semantic plausibility. Contribution/Results: BERTopic yields culturally adaptive, semantically coherent topics; LDA achieves higher automated scores, but human judges find many of its term co-occurrences spurious; K-means produces less coherent topics than prior research suggested, underscoring the difficulty of modeling open-ended narratives. The findings provide a reproducible methodology and empirical validation for topic modeling in morphologically complex, low-resource languages.
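To make the "diversity" metric mentioned in the summary concrete, here is a minimal sketch of one common definition: the fraction of unique words among the top-k words of all topics. The summary does not say which definition the paper uses, so this formula, the function name `topic_diversity`, and the Dutch example words are all illustrative assumptions.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists, each ordered from most to least
    representative. 1.0 means no word is shared between topics; values
    near 0 mean the topics largely repeat the same words.
    NOTE: assumed definition, not necessarily the one used in the paper.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, invented purely for illustration.
example_topics = [
    ["werk", "collega", "vergadering"],
    ["familie", "kinderen", "eten"],
    ["werk", "eten", "sport"],
]

# 9 top words in total, 7 of them unique ("werk" and "eten" repeat).
diversity = topic_diversity(example_topics, top_k=3)
print(f"topic diversity: {diversity:.3f}")
```

A score like this only measures redundancy between topics. It says nothing about whether a topic makes sense to a reader, which is why the summary pairs automated metrics with human assessment.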

📝 Abstract

This study explores BERTopic's potential for modeling open-ended Belgian Dutch daily narratives, contrasting its performance with Latent Dirichlet Allocation (LDA) and KMeans. Although LDA scores well on certain automated metrics, human evaluations reveal semantically irrelevant co-occurrences, highlighting the limitations of purely statistical methods. In contrast, BERTopic's reliance on contextual embeddings yields culturally resonant themes, underscoring the importance of hybrid evaluation frameworks that account for morphologically rich languages. KMeans performed less coherently than prior research suggested, pointing to the unique challenges posed by personal narratives. Our findings emphasize the need for robust generalization in NLP models, especially in underrepresented linguistic contexts.
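Automated coherence metrics are typically computed from word co-occurrence counts, which helps explain why a model can score well while human judges still find its topics semantically unrelated. The sketch below implements document-level NPMI (normalized pointwise mutual information) as one example of such a metric. The abstract does not name the coherence measure actually used, so the choice of NPMI, the function name `npmi_coherence`, and the toy corpus are assumptions made for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic.

    Probabilities are estimated from document-level co-occurrence:
    p(w) is the share of documents containing w. Scores range from
    -1 (never co-occur) to 1 (always co-occur).
    NOTE: assumed metric, not necessarily the one used in the paper.
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: NPMI lower bound
        elif p12 == 1:
            scores.append(1.0)   # in every document together; avoids 0/0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical tokenized narratives, invented purely for illustration.
docs = [
    ["vandaag", "werk", "collega"],
    ["vandaag", "werk", "vergadering"],
    ["vandaag", "familie", "eten"],
    ["familie", "eten", "kinderen"],
]

related = npmi_coherence(["werk", "collega"], docs)    # co-occur in 1 of 4 docs
unrelated = npmi_coherence(["werk", "familie"], docs)  # never co-occur
print(related, unrelated)
```

Because the score depends only on which words appear together, it rewards any frequent pairing regardless of meaning. That is the gap the hybrid evaluation in the abstract is meant to close.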

Problem

Research questions and friction points this paper is trying to address.

Evaluating BERTopic's performance on open-ended Belgian Dutch narratives

Comparing the topic quality of BERTopic with that of LDA and KMeans

Assessing NLP model generalization in morphologically rich, underrepresented languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

BERTopic uses contextual embeddings to surface culturally resonant themes

Hybrid evaluation combines automated metrics with human judgment to account for morphologically rich languages

Contrasts BERTopic with LDA and KMeans
