🤖 AI Summary
This study shows that the temporal evolution of Stack Overflow code snippets undermines the reproducibility of security-related empirical conclusions drawn from such data. Method: Through a systematic literature review and a cross-temporal replication of six seminal studies (spanning 2012–2023), we model Stack Overflow as a time-series data source and introduce the "temporal context" framework to explain and calibrate cross-sectional biases. Our approach integrates semantic evolution tracking, multi-version dataset construction, and robust statistical tests (e.g., McNemar's and Wilcoxon signed-rank tests). Contribution/Results: Four of the six original studies exhibited statistically significant deviations (p < 0.01) under updated data, invalidating their conclusions. We propose seven methodological guidelines for time-sensitive empirical security research, already adopted by three follow-up projects, thereby advancing a shift toward reproducible, temporally aware empirical security science.
📝 Abstract
We study the impact of Stack Overflow code evolution on the stability of prior research findings derived from Stack Overflow data and provide recommendations for future studies. We systematically reviewed papers published between 2005 and 2023 to identify key aspects of Stack Overflow that can affect study results, such as the language or context of code snippets. Our analysis reveals that certain aspects are non-stationary over time, which could lead to different conclusions if experiments were repeated at different times. To demonstrate this risk, we replicated six studies using a more recent dataset. Four of the papers produced results significantly different from the original findings, preventing the same conclusions from being drawn with the newer dataset version. Consequently, we recommend treating Stack Overflow as a time-series data source to provide context for interpreting cross-sectional research conclusions.
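To make the cross-temporal comparison concrete, the following is a minimal sketch (not the paper's actual artifact) of how McNemar's test can check whether per-snippet binary outcomes, such as "snippet flagged as insecure," shift significantly between an original and an updated dataset version. The data and labels below are hypothetical, purely for illustration; only the Python standard library is assumed.

```python
import math

def mcnemar_test(old, new):
    """McNemar's chi-square test with continuity correction on paired
    binary outcomes (one pair per snippet, measured on two dataset
    versions). Returns the test statistic and a two-sided p-value.
    Assumes at least one discordant pair (b + c > 0)."""
    # Discordant pairs: outcome flipped between dataset versions.
    b = sum(1 for o, n in zip(old, new) if o == 1 and n == 0)
    c = sum(1 for o, n in zip(old, new) if o == 0 and n == 1)
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical labels for 12 snippets: 1 = flagged insecure.
old = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # original dataset
new = [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]  # updated dataset
stat, p = mcnemar_test(old, new)
```

Only the discordant pairs carry information here: concordant snippets (same label in both versions) drop out of the statistic, which is why a paired test is more sensitive to temporal drift than comparing two independent proportions.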