EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses signal decay and fragile predictive structure in medium- to long-horizon equity allocation, which arise from market non-stationarity, transaction costs, capacity limits, and tail-risk constraints. The authors propose EvoNash-MARL, a closed-loop multi-agent reinforcement learning framework that integrates Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, an evolutionary replacement mechanism, and execution-aware checkpoint selection within a unified walk-forward loop. The framework also uses a layered policy architecture with a direction head and a risk head, together with nonlinear signal enhancement and feature-quality reweighting. Under a 120-window walk-forward protocol, the v21 configuration achieves a mean excess Sharpe of 0.7600 and a robust score of −0.0203, ranking first among internal controls. On aligned daily out-of-sample returns from 2014-01-02 to 2024-01-05, it delivers a 19.6% annualized return (versus 11.7% for SPY), and 20.5% (versus 13.5% for SPY) in an extended evaluation through 2026-02-10. Global strong significance under White's Reality Check (WRC) and SPA-lite testing is not established, so the results are framed as support for a more stable training and selection paradigm rather than as proof of universally superior market timing.
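The headline metrics in the summary (annualized return and excess Sharpe) can be illustrated with a minimal sketch. The source does not give its exact formulas, so everything here is an assumption: 252 trading days per year, a zero risk-free rate, and "excess Sharpe" taken as the strategy's Sharpe ratio minus the benchmark's. The function names are hypothetical and are not the authors' code.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization convention; not stated in the source


def annualized_return(daily_returns, periods_per_year=TRADING_DAYS):
    """Geometric annualized return from simple daily returns."""
    if not daily_returns:
        raise ValueError("need at least one return")
    growth = math.prod(1.0 + r for r in daily_returns)
    years = len(daily_returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0


def sharpe_ratio(daily_returns, periods_per_year=TRADING_DAYS):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    vol = statistics.stdev(daily_returns)  # sample standard deviation
    if vol == 0.0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return statistics.fmean(daily_returns) / vol * math.sqrt(periods_per_year)


def excess_sharpe(strategy_returns, benchmark_returns):
    """Assumed definition: strategy Sharpe minus benchmark Sharpe."""
    return sharpe_ratio(strategy_returns) - sharpe_ratio(benchmark_returns)
```

An alternative convention computes the Sharpe ratio of the daily return difference between strategy and benchmark; the two definitions generally give different numbers, so the paper's own definition should be checked before comparing values.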

📝 Abstract

Medium-to-long-horizon stock allocation presents significant challenges due to weak predictive structures, non-stationary market regimes, and the degradation of signals following the application of transaction costs, capacity limits, and tail-risk constraints. Conventional approaches commonly rely on a single predictor or a loosely coupled prediction-to-allocation pipeline, limiting robustness under distribution shift. This work addresses a targeted design question: whether coupling reinforcement learning (RL), multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, evolutionary replacement, and execution-aware checkpoint selection within a unified walk-forward loop improves allocator robustness at medium to long horizons. The proposed framework, EvoNash-MARL, integrates these components within an execution-aware allocation loop and further introduces a layered policy architecture comprising a direction head and a risk head, nonlinear signal enhancement, feature-quality reweighting, and constraint-aware checkpoint selection. Under a 120-window walk-forward protocol, the resolved v21 configuration achieves mean excess Sharpe 0.7600 and robust score -0.0203, ranking first among internal controls; on aligned daily out-of-sample returns from 2014-01-02 to 2024-01-05, it delivers 19.6% annualized return versus 11.7% for SPY, and in an extended walk-forward evaluation through 2026-02-10 it delivers 20.5% versus 13.5%. The framework maintains positive performance under realistic stress constraints and exhibits structured cross-market generalization; however, global strong significance under White's Reality Check (WRC) and SPA-lite testing is not established. Therefore, the results are presented as evidence supporting a more stable medium- to long-horizon training and selection paradigm, rather than as proof of universally superior market-timing performance.
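The walk-forward protocol mentioned in the abstract can be sketched as a generator of rolling train/test index windows. This is a generic illustration, not the paper's implementation: the function name, the window sizes, and the default of stepping forward by one test window are all assumptions, and the abstract does not say how its 120 windows are sized.

```python
def walk_forward_windows(n_obs, train_size, test_size, step=None):
    """Yield (train, test) index ranges that roll forward through time.

    Each test range starts right after its train range, so no test
    observation is ever used for fitting in the same window.
    """
    if step is None:
        step = test_size  # assumed default: non-overlapping test ranges
    if min(train_size, test_size, step) <= 0:
        raise ValueError("train_size, test_size and step must be positive")
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the rolling pattern.
    for train, test in walk_forward_windows(n_obs=10, train_size=4, test_size=2):
        print(list(train), "->", list(test))
```

In a loop like the one the abstract describes, each window would train the policy population on the train range, select a checkpoint, and record out-of-sample returns on the test range; concatenating the test-range returns gives the aligned out-of-sample series.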

Problem

Research questions and friction points this paper is trying to address.

medium-to-long-horizon equity allocation

non-stationary market regimes

signal degradation

distribution shift

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Reinforcement Learning

Policy-Space Response Oracle

Walk-Forward Optimization