Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of large-scale, long-span corpora of Italian online dialogue, a gap that has hindered both the development of native Italian large language models and sociolinguistic research. To bridge this gap, we present a massive Italian forum dialogue corpus spanning 1996 to 2024 and comprising over 30 billion words, the largest such resource for Italian to date. Using web crawling and text cleaning techniques, we systematically processed historical forum data into a structured, standardized, longitudinal corpus. The dataset offers the first comprehensive aggregation of nearly three decades of online discourse in Italian and will be publicly released to support the pretraining of native Italian NLP models as well as research on language evolution and social dynamics in digital contexts.

📝 Abstract
We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.
Problem

Research questions and friction points this paper is trying to address.

Italian corpus
language modeling
sociolinguistics
computer-mediated communication
conversational analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Italian language corpus
large language model pre-training
computer-mediated communication
sociolinguistic analysis
conversational data
Matteo Rinaldi
Dipartimento di Informatica, University of Turin, Italy
Rossella Varvara
Dipartimento di Informatica, University of Turin, Italy
Viviana Patti
Associate Professor of Computer Science, Università di Torino, Dipartimento di Informatica
artificial intelligence, natural language processing, irony detection, sentiment analysis, social semantic web