Predictive Analytics for Collaborators' Answers, Code Quality, and Dropout on Stack Overflow

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior studies on Stack Overflow user behavior prediction suffer from narrow baseline model selection and insufficient algorithmic coverage. Method: This work systematically evaluates 21 machine learning and deep learning algorithms across three core tasks—answer count prediction (regression), code quality violation detection (classification), and user churn prediction (classification)—using a unified modeling framework. It integrates multi-source features (structured behavioral data + CodeBERT-finetuned textual/code representations), applies standardization and logarithmic transformation for preprocessing, and employs Bayesian optimization and genetic algorithms for hyperparameter tuning. Results: Bagging achieves the highest regression performance (R² = 0.821) for answer count prediction; XGBoost attains the best classification performance for churn prediction (F1 = 0.825); and fine-tuned CodeBERT yields the top F1 score (0.809) for code defect identification. This is the first large-scale, cross-algorithm benchmark in this domain, establishing a reusable multimodal modeling paradigm and providing a methodological benchmark for developer behavior modeling.
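The summary's best-performing regression setup — standardized behavioral features fed to a Bagging ensemble — can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the paper's code: the feature set, sample sizes, and `n_estimators` value here are placeholder assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the paper's structured behavioral features
# (reputation, badges, activity counts, ...); the real features differ.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize, then fit a Bagging ensemble (the paper's top regressor family).
model = make_pipeline(
    StandardScaler(),
    BaggingRegressor(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on held-out data
print(round(r2, 3))
```

Wrapping the scaler and estimator in one `Pipeline` keeps the standardization statistics fitted on the training split only, avoiding leakage into the test-set R² estimate.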

📝 Abstract
Previous studies that used data from Stack Overflow to develop predictive models often employed limited benchmarks of 3-5 models or adopted arbitrary selection methods. Despite being insightful, their limited scope suggests the need to benchmark more models to avoid overlooking untested algorithms. Our study evaluates 21 algorithms across three tasks: predicting the number of questions a user is likely to answer, their code quality violations, and their dropout status. We employed normalisation, standardisation, as well as logarithmic and power transformations paired with Bayesian hyperparameter optimisation and genetic algorithms. CodeBERT, a pre-trained language model for both natural and programming languages, was fine-tuned to classify user dropout given their posts (questions and answers) and code snippets. We found Bagging ensemble models combined with standardisation achieved the highest R² value (0.821) in predicting user answers. The Stochastic Gradient Descent regressor, followed by Bagging and Epsilon Support Vector Machine models, consistently demonstrated superior performance to other benchmarked algorithms in predicting user code quality across multiple quality dimensions and languages. Extreme Gradient Boosting paired with log-transformation exhibited the highest F1-score (0.825) in predicting user dropout. CodeBERT was able to classify user dropout with a final F1-score of 0.809, validating the performance of Extreme Gradient Boosting, which was based solely on numerical data. Overall, our benchmarking of 21 algorithms provides multiple insights. Researchers can leverage findings regarding the most suitable models for specific target variables, and practitioners can utilise the identified optimal hyperparameters to reduce the initial search space during their own hyperparameter tuning processes.
Problem

Research questions and friction points this paper is trying to address.

Evaluating 21 algorithms for predicting Stack Overflow user contributions
Assessing code quality violations and user dropout using diverse models
Optimizing model performance via transformations and hyperparameter tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated 21 algorithms for predictive tasks
Used CodeBERT for dropout classification
Optimized hyperparameters with Bayesian methods
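The genetic-algorithm side of the tuning strategy can be sketched as a tiny generational loop over hyperparameters. Everything here is illustrative: the model (a random forest), the two-gene search space (`n_estimators`, `max_depth`), and the population sizes are placeholder assumptions, not the paper's configuration.

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

random.seed(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def fitness(genes):
    """Mean 3-fold CV accuracy for a (n_estimators, max_depth) pair."""
    n_est, depth = genes
    clf = RandomForestClassifier(n_estimators=n_est, max_depth=depth,
                                 random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

def mutate(genes):
    """Randomly nudge each gene, clipped to valid values."""
    n_est, depth = genes
    return (max(10, n_est + random.choice([-20, 0, 20])),
            max(2, depth + random.choice([-1, 0, 1])))

# Generational loop: keep the fitter half, refill with mutated parents.
population = [(random.randrange(10, 200, 10), random.randint(2, 10))
              for _ in range(6)]
for _ in range(3):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:3]
    population = parents + [mutate(random.choice(parents)) for _ in range(3)]

best = max(population, key=fitness)
best_score = fitness(best)
print(best, round(best_score, 3))
```

A Bayesian optimizer would replace the mutation step with a surrogate model that proposes the next candidate; the fitness function and search space stay the same, which is why the two tuners are interchangeable in a benchmark like this.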