🤖 AI Summary
This study addresses the challenges of topic modeling on morphologically rich, low-resource Belgian Dutch narrative texts, particularly personal narratives, where cultural sensitivity and semantic coherence pose significant difficulties. Method: We systematically evaluate BERTopic (leveraging Sentence-BERT embeddings, UMAP dimensionality reduction, and HDBSCAN clustering) against LDA and K-means on authentic Belgian Dutch narrative data. A hybrid evaluation framework is introduced, integrating automated metrics (e.g., coherence and diversity) with expert human assessment of cultural appropriateness and semantic plausibility. Contribution/Results: BERTopic yields culturally adaptive, semantically coherent topics; LDA achieves higher automated scores, but human judges find many of its topics to rest on spurious term co-occurrences; K-means performs substantially worse, underscoring the difficulty of modeling open-ended narratives. Our findings provide a reproducible methodology and empirical validation for topic modeling in morphologically complex, low-resource languages.
📝 Abstract
This study explores BERTopic's potential for modeling open-ended Belgian Dutch daily narratives, contrasting its performance with Latent Dirichlet Allocation (LDA) and K-means. Although LDA scores well on certain automated metrics, human evaluation reveals semantically irrelevant co-occurrences, highlighting the limitations of purely statistics-based methods. In contrast, BERTopic's reliance on contextual embeddings yields culturally resonant themes, underscoring the importance of hybrid evaluation frameworks that account for morphologically rich languages. K-means performs less coherently than prior research suggested, pointing to the unique challenges posed by personal narratives. Our findings emphasize the need for robust generalization in NLP models, especially in underrepresented linguistic contexts.
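Of the automated metrics in the hybrid evaluation framework, topic diversity is simple enough to sketch directly. The snippet below is a minimal illustration, not the study's implementation: the exact formulation (here, the fraction of unique words among each topic's top-k terms) is an assumption, and the example topics are invented.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top-k terms of all topics.

    A value of 1.0 means no topic shares a top term with another;
    values near 0 indicate highly redundant topics. NOTE: this is an
    illustrative formulation, not the study's exact metric.
    """
    top_terms = [term for topic in topics for term in topic[:top_k]]
    return len(set(top_terms)) / len(top_terms)

# Toy example: two topics whose top-4 term lists share one word.
topics = [
    ["school", "kinderen", "les", "leraar"],
    ["werk", "collega", "vergadering", "school"],
]
print(topic_diversity(topics, top_k=4))  # 7 unique / 8 total = 0.875
```

A coherence score (e.g., NPMI-based) would complement this by checking whether each topic's top terms actually co-occur in the corpus, which is precisely where human judgment and automated scores diverged for LDA.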