🤖 AI Summary
This work addresses the scarcity of large-scale, long-span Italian-language online dialogue corpora, which has hindered the development of native large language models and sociolinguistic research. To bridge this gap, we present a massive Italian forum dialogue corpus spanning 1996 to 2024, comprising over 30 billion words—the largest such resource for Italian to date. By leveraging web crawling and advanced text cleaning techniques, we systematically processed historical forum data into a structured, standardized, and longitudinal high-quality corpus. This dataset offers the first comprehensive aggregation of nearly three decades of online discourse in Italian and will be publicly released to significantly advance pretraining of native NLP models as well as research on language evolution and social dynamics in digital contexts.
📝 Abstract
We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.