AI Summary
Existing evaluation metrics, such as MAPE and directional accuracy, struggle to diagnose behavioral deficiencies of agents in multi-stage stock prediction. This work proposes the first multidimensional behavioral evaluation framework powered by large language models (GPT-5.4, Claude-4.6 Opus, and Gemini-3.1 Pro), wherein LLM “judges” assign fine-grained scores to agent trajectories across six behavioral dimensions. Deficient scores are converted into a credit-assigned penalty in the reward function of a Soft Actor-Critic reinforcement learning algorithm, establishing a closed-loop optimization mechanism. Validated by perturbation tests, the approach improves performance on the held-out 2017–2025 test set: MAPE decreases by 11.5% in relative terms (from 0.61% to 0.54%), directional accuracy rises from 71% to 74%, and the Sharpe ratio improves by 18%, with gains most pronounced during high-volatility periods.
Abstract
Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Perturbation-based validation on 420 episodes yields targeted score drops of $-1.6$ to $-2.4$ on intended dimensions versus an average of $-0.32$ on the remaining five, with cross-model agreement up to Krippendorff's $\alpha = 0.85$. The composite behavioral score, used here only for cross-episode reporting, correlates at $\rho = 0.72$ with realized 20-day Sharpe ratio from offline backtesting. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. Three short fine-tuning cycles, all confined to the validation period, produce on the held-out 2017–2025 test period a one-day MAPE reduction from 0.61% to 0.54% (an 11.5% relative reduction; $p<0.001$, Cohen's $d=0.31$), a directional accuracy increase from 71% to 74%, and an 18% Sharpe ratio improvement (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in high-volatility episodes where the original system was most behaviorally deficient. Results are from offline backtesting and do not address effects specific to live deployment.
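The closed-loop step, in which deficient per-dimension scores become a credit-assigned penalty on the SAC reward, can be illustrated with a minimal Python sketch. The abstract does not specify the score scale, the deficiency threshold, the penalty weight, or the credit-assignment rule, so everything here is an assumption made for illustration: a 0 to 10 scale, a hypothetical `threshold` of 6.0, a hypothetical weight `lam`, simple averaging across judges, and an even split of each dimension's penalty over the steps tagged with that dimension. The function names `ensemble_score` and `shaped_rewards` are likewise invented for this example.

```python
# Illustrative sketch only: scale, threshold, weight, and the even-split
# credit rule are assumptions, not the paper's stated method.

DIMENSIONS = (
    "regime_detection", "routing", "adaptation",
    "risk_calibration", "strategy_coherence", "error_recovery",
)


def ensemble_score(judge_scores):
    """Average each dimension's score across the LLM judges."""
    n = len(judge_scores)
    return {d: sum(j[d] for j in judge_scores) / n for d in DIMENSIONS}


def shaped_rewards(base_rewards, step_dims, scores, threshold=6.0, lam=0.1):
    """Subtract a penalty for every dimension scoring below `threshold`.

    base_rewards: per-step rewards for one episode (e.g. five trading days).
    step_dims:    per-step sets naming the dimensions exercised at that step.
    scores:       per-dimension ensemble scores for the episode.
    """
    shaped = list(base_rewards)
    for dim in DIMENSIONS:
        deficit = max(0.0, threshold - scores[dim])
        if deficit == 0.0:
            continue  # dimension is not deficient, no penalty
        # Credit assignment: charge only the steps tagged with this dimension.
        steps = [t for t, dims in enumerate(step_dims) if dim in dims]
        if not steps:
            # No tagged step: fall back to spreading over the whole episode.
            steps = list(range(len(shaped)))
        for t in steps:
            shaped[t] -= lam * deficit / len(steps)
    return shaped
```

As a usage example, an episode whose only deficient dimension is risk calibration (score 4 against the assumed threshold of 6) and whose risk decisions occurred on the second and fourth days would have only those two daily rewards reduced; the other three days are left unchanged. The shaped rewards would then replace the raw rewards in the replay buffer during fine-tuning.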