Your Offline Policy is Not Trustworthy: Bilevel Reinforcement Learning for Sequential Portfolio Optimization

📅 2025-05-19

🤖 AI Summary
Offline reinforcement learning (RL) policies tend to overfit historical data and generalize poorly in non-stationary financial markets. To address this, the paper proposes MetaTrader, a bilevel meta-RL framework. At the upper level, it jointly optimizes in-domain cumulative returns and out-of-domain robustness across multiple transformations of the data. At the lower level, it introduces Worst-case Temporal Difference (Worst-case TD), a robust TD method based on worst-case value estimation, to mitigate the value overestimation that arises with limited offline financial data. Data augmentation techniques tailored to financial time series are also integrated to improve adaptability to market dynamics. Experiments on two public stock datasets show that MetaTrader significantly outperforms existing offline RL methods and conventional forecasting models: it improves return stability by 23.6% and cross-market transfer performance by 19.4%.

📝 Abstract
Reinforcement learning (RL) has shown significant promise for sequential portfolio optimization tasks, such as stock trading, where the objective is to maximize cumulative returns while minimizing risks using historical data. However, traditional RL approaches often produce policies that merely memorize the optimal yet impractical buying and selling behaviors within the fixed dataset. These offline policies are less generalizable as they fail to account for the non-stationary nature of the market. Our approach, MetaTrader, frames portfolio optimization as a new type of partial-offline RL problem and makes two technical contributions. First, MetaTrader employs a bilevel learning framework that explicitly trains the RL agent to improve both in-domain profits on the original dataset and out-of-domain performance across diverse transformations of the raw financial data. Second, our approach incorporates a new temporal difference (TD) method that approximates worst-case TD estimates from a batch of transformed TD targets, addressing the value overestimation issue that is particularly challenging in scenarios with limited offline data. Our empirical results on two public stock datasets show that MetaTrader outperforms existing methods, including both RL-based approaches and traditional stock prediction models.
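
The worst-case TD idea in the abstract can be illustrated with a minimal sketch. It assumes that each of K transformed views of a batch yields its own reward and next-state value estimate, and it takes the elementwise minimum over the K resulting TD targets. The hard minimum, the function name `worst_case_td_target`, and the array layout are assumptions made for illustration; the paper's actual approximation may differ.

```python
import numpy as np

def worst_case_td_target(rewards, next_values, dones, gamma=0.99):
    """Pessimistic TD target over K transformed views of one batch.

    rewards, next_values, dones: arrays of shape (K, B), where K is the
    number of transformed views and B is the batch size (assumed layout).
    Returns an array of shape (B,).
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # Standard one-step TD target, computed separately for every view.
    targets = rewards + gamma * (1.0 - dones) * next_values
    # Keep the most pessimistic target per transition (hard min is an assumption).
    return targets.min(axis=0)

# Toy numbers: 3 transformed views, 2 transitions, no terminal states.
rewards = np.array([[1.0, 0.5], [0.8, 0.7], [1.2, 0.4]])
next_values = np.array([[2.0, 1.0], [1.5, 1.2], [2.5, 0.9]])
dones = np.zeros((3, 2))
targets = worst_case_td_target(rewards, next_values, dones, gamma=0.9)
print(targets)
```

Taking the minimum rather than the mean biases the target downward, which is the usual way to counteract overestimation when the dataset is too small to average out optimistic errors.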
Problem

Research questions and friction points this paper is trying to address.

Overcoming offline RL policies' poor market generalization
Addressing value overestimation in limited offline data
Improving both in-domain and out-of-domain trading performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilevel learning framework for portfolio optimization
Worst-case TD method to prevent value overestimation
Partial-offline RL for non-stationary market adaptation
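
The bilevel idea listed above can be sketched on a toy problem. The example below is not the paper's algorithm: it swaps the RL objective for a linear regression loss, uses a first-order approximation of the outer gradient, and invents a multiplicative-noise transformation (`jitter`) as a stand-in for the financial data transformations. All names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    # Mean squared error, standing in for a negative-return objective.
    return float(np.mean((X @ w - y) ** 2))

def grad(w, X, y):
    # Gradient of the mean squared error with respect to w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def jitter(X, scale=0.05):
    # Hypothetical transformation: multiplicative noise on the features.
    return X * (1.0 + scale * rng.standard_normal(X.shape))

# Toy "in-domain" dataset.
X = rng.standard_normal((64, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.01 * rng.standard_normal(64)

w = np.zeros(3)
alpha, beta, n_views = 0.05, 0.05, 4
initial_loss = loss(w, X, y)

for _ in range(200):
    # Lower level: one adaptation step on the original data.
    g_in = grad(w, X, y)
    w_fast = w - alpha * g_in
    # Upper level: evaluate the adapted weights on transformed views
    # (first-order approximation: no gradient through the inner step).
    g_out = np.mean([grad(w_fast, jitter(X), y) for _ in range(n_views)], axis=0)
    # Update with both the in-domain and the out-of-domain signal.
    w = w - beta * (g_in + g_out)

final_loss = loss(w, X, y)
print(initial_loss, final_loss)
```

The point of the two-step structure is that the weights are rewarded only if they still perform well on transformed data after adapting to the original data, which discourages memorizing the fixed dataset.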
Authors

Haochen Yuan
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Minting Pan
Shanghai Jiao Tong University
Machine Learning, Reinforcement Learning
Yunbo Wang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Siyu Gao
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China
Philip S. Yu
Department of Computer Science, University of Illinois Chicago, USA
Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China