🤖 AI Summary
This study addresses medium- to long-horizon stock portfolio allocation, where predictive structure is weak, market regimes are non-stationary, and signals degrade once transaction costs, capacity limits, and tail-risk constraints are applied. The authors propose EvoNash-MARL, a closed-loop multi-agent reinforcement learning framework that combines Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, evolutionary replacement, and execution-aware checkpoint selection within a unified walk-forward loop. The framework also uses a layered policy architecture (a direction head and a risk head), nonlinear signal enhancement, and feature-quality reweighting. Under a 120-window walk-forward protocol, the resolved v21 configuration achieves a mean excess Sharpe of 0.7600 and a robust score of −0.0203, ranking first among internal controls. On aligned daily out-of-sample returns from 2014 to 2024, the strategy delivers a 19.6% annualized return (versus 11.7% for SPY), and 20.5% (versus 13.5% for SPY) when the evaluation is extended through February 2026. Global strong significance under White's Reality Check (WRC) and SPA-lite testing is not established, so the authors present the results as evidence for a more stable training and selection paradigm rather than as proof of universally superior market timing.
📝 Abstract
Medium-to-long-horizon stock allocation presents significant challenges due to weak predictive structures, non-stationary market regimes, and the degradation of signals following the application of transaction costs, capacity limits, and tail-risk constraints. Conventional approaches commonly rely on a single predictor or a loosely coupled prediction-to-allocation pipeline, limiting robustness under distribution shift. This work addresses a targeted design question: whether coupling reinforcement learning (RL), multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, evolutionary replacement, and execution-aware checkpoint selection within a unified walk-forward loop improves allocator robustness at medium to long horizons. The proposed framework, EvoNash-MARL, integrates these components within an execution-aware allocation loop and further introduces a layered policy architecture comprising a direction head and a risk head, nonlinear signal enhancement, feature-quality reweighting, and constraint-aware checkpoint selection. Under a 120-window walk-forward protocol, the resolved v21 configuration achieves mean excess Sharpe 0.7600 and robust score -0.0203, ranking first among internal controls; on aligned daily out-of-sample returns from 2014-01-02 to 2024-01-05, it delivers 19.6% annualized return versus 11.7% for SPY, and in an extended walk-forward evaluation through 2026-02-10 it delivers 20.5% versus 13.5%. The framework maintains positive performance under realistic stress constraints and exhibits structured cross-market generalization; however, global strong significance under White's Reality Check (WRC) and SPA-lite testing is not established. Therefore, the results are presented as evidence supporting a more stable medium- to long-horizon training and selection paradigm, rather than as proof of universally superior market-timing performance.
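The headline figures above (annualized return and excess Sharpe on aligned daily returns) could be computed along the following lines. This is only a minimal sketch under stated assumptions, not the authors' evaluation code: the 252-day annualization factor, the zero risk-free rate, and the reading of "excess Sharpe" as strategy Sharpe minus benchmark Sharpe are all assumptions, and the function names are hypothetical.

```python
import math

# Assumed annualization factor; the source does not state one.
TRADING_DAYS = 252


def annualized_return(daily_returns):
    """Geometric annualized return from a list of simple daily returns."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    years = len(daily_returns) / TRADING_DAYS
    return growth ** (1.0 / years) - 1.0


def sharpe_ratio(daily_returns):
    """Annualized Sharpe ratio, assuming a zero risk-free rate.

    Requires at least two observations with non-zero variance.
    """
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(TRADING_DAYS)


def excess_sharpe(strategy_returns, benchmark_returns):
    """Assumed definition: strategy Sharpe minus benchmark Sharpe.

    Both series are expected to be aligned on the same trading days.
    """
    return sharpe_ratio(strategy_returns) - sharpe_ratio(benchmark_returns)
```

In a walk-forward setting, a figure such as the mean excess Sharpe would then be the average of `excess_sharpe` over the out-of-sample windows, again assuming that is how the metric is aggregated.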