🤖 AI Summary
This study addresses the challenge of linguistic variation in Luxembourgish, a low-resource language that lacks standardized texts and predefined variant inventories, by proposing an unsupervised method that requires neither manual annotation nor prior knowledge of variants. The approach trains subword embeddings on raw user-generated comments and clusters word forms using a combination of cosine similarity and n-gram similarity, treating linguistic variation as a structural feature rather than noise. The method is transparent and reproducible, successfully uncovering systematic lexical and orthographic variation patterns aligned with dialectological and sociolinguistic principles. It yields well-defined clusters that facilitate both quantitative and qualitative analysis, offering a robust tool for investigating variation in low-resource languages.
📝 Abstract
This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Applied to a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure requires no manual annotation and produces transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
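To make the clustering step concrete, here is a minimal, self-contained sketch of grouping word forms by a combined cosine and n-gram similarity. Note the assumptions: the paper trains subword embeddings on the corpus, whereas this sketch substitutes character-trigram count vectors as a stand-in for those embeddings; the weighting `alpha`, the threshold, the single-link grouping strategy, and the toy Luxembourgish-like word list are all illustrative choices, not the authors' actual parameters or data.

```python
# Sketch of combined-similarity clustering of word forms.
# Assumption: character-trigram count vectors stand in for the trained
# subword embeddings of the paper; alpha/threshold values are illustrative.
from collections import Counter
from itertools import combinations
import math

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, fastText-style."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def ngram_dice(a, b, n=3):
    """Dice coefficient over the two words' n-gram sets."""
    sa, sb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if (sa or sb) else 0.0

def combined_sim(a, b, alpha=0.5):
    """Weighted mix of (proxy) embedding cosine and n-gram similarity."""
    va, vb = Counter(char_ngrams(a)), Counter(char_ngrams(b))
    return alpha * cosine(va, vb) + (1 - alpha) * ngram_dice(a, b)

def cluster(words, threshold=0.4):
    """Greedy single-link grouping: union pairs above the threshold."""
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w
    for a, b in combinations(words, 2):
        if combined_sim(a, b) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())

# Toy spelling variants (illustrative only, not the paper's data)
forms = ["geschriwwen", "geschriwen", "gschriwwen", "Lëtzebuerg", "Letzebuerg"]
print(cluster(forms))
```

With these toy inputs, the spelling variants of each word fall into one cluster while the two unrelated words stay apart; in the paper's setting the cosine term comes from corpus-trained subword embeddings, so it can also link variants whose surface n-grams overlap little.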