🤖 AI Summary
In financial research, the absence of ground-truth behavioral labels in real-market data and the poor interpretability of black-box models hinder trader behavior modeling. To address this, we propose an "interpretable laboratory" framework built on an agent-based market model: it generates high-fidelity synthetic trading data with labeled investor behavior types, enabling a systematic comparison of supervised classification and unsupervised clustering for discriminating behaviors. The methodology combines behavioral sequence feature engineering, supervised learning (SVM, Random Forest), unsupervised clustering (k-means, DBSCAN), and SHAP-based interpretability analysis. Experiments show that supervised classifiers exceed 95% accuracy, whereas unsupervised clustering incurs error rates above 40%, revealing its fundamental limitations for behavioral classification. Key discriminative dimensions, including order-flow persistence and response latency, are identified. This work pioneers interpretable agent-based simulation as a benchmark tool for financial behavioral research, establishing a new paradigm for behavioral finance modeling.
📝 Abstract
The rapid development of sophisticated machine learning methods, together with the increased availability of financial data, has the potential to transform financial research, but it also poses challenges of validation and interpretation. A good case study is the task of classifying financial investors based on their behavioral patterns. Not only do we have access to both classification and clustering tools for high-dimensional data, but data identifying individual investors is also finally available. The problem, however, is that we have no access to ground truth when working with real-world data. This, combined with the often limited interpretability of modern machine learning methods, makes it difficult to fully exploit the available research potential. To deal with this challenge, we propose using a realistic agent-based model to generate synthetic data. This provides access to ground truth, large replicable datasets, and limitless research scenarios. Using this approach, we show that even when classifying trading agents in a supervised manner is relatively easy, the more realistic task of unsupervised clustering may give incorrect or even misleading results. We complement these results by investigating in detail how supervised techniques were able to successfully distinguish between different trading behaviors.
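The gap between supervised and unsupervised results can be illustrated with a toy sketch. This is not the paper's actual pipeline: the agent types, feature names (borrowed from the summary's "order-flow persistence"), and distributions below are invented for illustration. A decision stump trained on ground-truth labels locks onto the informative behavioral feature, while plain k-means, seeing no labels, splits the data along an irrelevant high-variance axis.

```python
# Toy illustration (not the paper's actual pipeline): supervised learning
# with ground-truth labels vs. unsupervised k-means on synthetic agents.
# Feature names and distributions are invented for this sketch.
import random

random.seed(0)
N = 200
# Feature 0: "order-flow persistence" (separates the two agent types).
# Feature 1: raw trading volume (label-independent but high-variance).
data = ([(random.gauss(0.8, 0.1), random.gauss(0.0, 5.0)) for _ in range(N)]
        + [(random.gauss(0.2, 0.1), random.gauss(0.0, 5.0)) for _ in range(N)])
labels = [1] * N + [0] * N

# Supervised: a decision stump scans every feature/threshold pair and,
# thanks to the labels, locks onto the discriminative persistence axis.
best_acc, best_feat = 0.0, 0
for f in range(2):
    for t in sorted({x[f] for x in data}):
        for pol in (True, False):
            acc = sum(((x[f] > t) == pol) == bool(lbl)
                      for x, lbl in zip(data, labels)) / len(data)
            if acc > best_acc:
                best_acc, best_feat = acc, f

# Unsupervised: Lloyd's k-means (k=2) never sees the labels, so it cuts
# variance by splitting along the dominant (volume) axis instead.
def d2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

c0 = data[0]
c1 = max(data, key=lambda x: d2(x, c0))  # farthest-point initialization
for _ in range(25):
    g0 = [x for x in data if d2(x, c0) <= d2(x, c1)]
    g1 = [x for x in data if d2(x, c0) > d2(x, c1)]
    c0, c1 = mean(g0), mean(g1)
raw = sum((d2(x, c0) > d2(x, c1)) == bool(lbl)
          for x, lbl in zip(data, labels)) / len(data)
km_acc = max(raw, 1.0 - raw)  # clusters are unlabeled: score best permutation

print(f"supervised stump: feature {best_feat}, accuracy {best_acc:.2f}")
print(f"k-means (best label permutation): accuracy {km_acc:.2f}")
```

The point is the information gap rather than the particular learners: the stump succeeds because the labels tell it which dimension matters, while the clustering result looks internally coherent yet says nothing about behavior, which is exactly the sense in which unsupervised results can be misleading without ground truth.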