π€ AI Summary
Contemporary NLP methods often treat text as temporally homogeneous, neglecting semantic evolution over time and thereby introducing temporal semantic bias; moreover, existing tools for temporal text analysis are fragmented and lack reproducibility. To address these limitations, we propose the first unified framework that systematically integrates multi-granular temporal text analysis, enabling both topic evolution modeling and lexical semantic shift detection. Implemented in Python, the framework incorporates dynamic topic modeling, temporal word embedding alignment, sliding-window LDA, and interactive visualization modules. It ensures end-to-end reproducibility and achieves significant improvements in topic trend identification accuracy on cross-year news and social media datasets. Our core contribution is the first standardized, open-source, and extensible toolkit for temporal NLPβbridging the gap between theoretical models of language evolution and empirical, large-scale diachronic analysis.
π Abstract
Text data is inherently temporal. The meaning of words and phrases changes over time, and the context in which they are used is constantly evolving. This is not just true for social media data, where the language used is rapidly influenced by current events, memes and trends, but also for journalistic, economic or political text data. Most NLP techniques however consider the corpus at hand to be homogenous in regard to time. This is a simplification that can lead to biased results, as the meaning of words and phrases can change over time. For instance, running a classic Latent Dirichlet Allocation on a corpus that spans several years is not enough to capture changes in the topics over time, but only portraits an"average"topic distribution over the whole time span. Researchers have developed a number of tools for analyzing text data over time. However, these tools are often scattered across different packages and libraries, making it difficult for researchers to use them in a consistent and reproducible way. The ttta package is supposed to serve as a collection of tools for analyzing text data over time.