Hierarchical Semantic RL: Tackling the Problem of Dynamic Action Space for RL-based Recommendations

📅 2025-10-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Reinforcement learning-based recommender systems suffer from unstable policy training under dynamic action spaces. To address this, we propose Hierarchical Semantic Reinforcement Learning (HSRL), a novel framework with three key contributions: (1) a fixed semantic action space constructed via reversible semantic ID encoding, which decouples action representation from policy decision-making; (2) a coarse-to-fine hierarchical policy network coupled with multi-level critics to mitigate representation-decision mismatch; and (3) hierarchical residual state modeling, multi-level value estimation, and a fixed lookup table to enable efficient policy learning and deployment over large-scale dynamic candidate sets. Extensive experiments on public benchmarks and real-world short-video advertising data demonstrate that HSRL significantly outperforms state-of-the-art methods. Online A/B testing shows an 18.421% lift in conversion rate with only a 1.251% increase in cost.

๐Ÿ“ Abstract
Recommender Systems (RS) are fundamental to modern online services. While most existing approaches optimize for short-term engagement, recent work has begun to explore reinforcement learning (RL) to model long-term user value. However, these efforts face significant challenges due to the vast, dynamic action spaces inherent in recommendation, which hinder stable policy learning. To resolve this bottleneck, we introduce Hierarchical Semantic RL (HSRL), which reframes RL-based recommendation over a fixed Semantic Action Space (SAS). HSRL encodes items as Semantic IDs (SIDs) for policy learning, and maps SIDs back to their original items via a fixed, invertible lookup during execution. To align decision-making with SID generation, the Hierarchical Policy Network (HPN) operates in a coarse-to-fine manner, employing hierarchical residual state modeling to refine each level's context from the previous level's residual, thereby stabilizing training and reducing representation-decision mismatch. In parallel, a Multi-level Critic (MLC) provides token-level value estimates, enabling fine-grained credit assignment. Across public benchmarks and a large-scale production dataset from a leading Chinese short-video advertising platform, HSRL consistently surpasses state-of-the-art baselines. In a seven-day online A/B test, it delivers an 18.421% CVR lift with only a 1.251% increase in cost, supporting HSRL as a scalable paradigm for RL-based recommendation. Our code is released at https://github.com/MinmaoWang/HSRL.
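The fixed Semantic Action Space described above can be sketched as a frozen, invertible SID lookup table. This is a minimal illustration only: the class name, the toy three-level SIDs, and the item IDs are hypothetical, not the paper's released code.

```python
# Sketch of a fixed Semantic Action Space (SAS): each item is bound to a
# fixed-length Semantic ID (a tuple of level-wise codes), and a frozen
# lookup maps SIDs back to items at execution time. All names are illustrative.

class SemanticActionSpace:
    def __init__(self, codebook_sizes):
        # e.g. (8, 8, 8): three hierarchy levels, 8 codes per level
        self.codebook_sizes = codebook_sizes
        self.sid_to_item = {}   # fixed, invertible lookup used at execution
        self.item_to_sid = {}

    def register(self, item_id, sid):
        """Bind an item to a semantic ID (a tuple of level-wise codes)."""
        assert len(sid) == len(self.codebook_sizes)
        assert all(0 <= c < n for c, n in zip(sid, self.codebook_sizes))
        assert sid not in self.sid_to_item, "SIDs must be unique for invertibility"
        self.sid_to_item[sid] = item_id
        self.item_to_sid[item_id] = sid

    def encode(self, item_id):
        # Policy learning operates on SIDs, not raw item IDs.
        return self.item_to_sid[item_id]

    def decode(self, sid):
        # Execution-time lookup: the policy emits a SID, we return the item.
        return self.sid_to_item.get(sid)

sas = SemanticActionSpace((8, 8, 8))
sas.register("video_42", (3, 1, 5))
sas.register("video_99", (3, 1, 6))  # shares a coarse prefix with video_42
assert sas.decode(sas.encode("video_42")) == "video_42"
```

Because the table is fixed, the policy's action space stays constant even as the candidate item pool changes; new items are registered in the lookup without altering what the policy outputs.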
Problem

Research questions and friction points this paper is trying to address.

Addresses dynamic action space challenges in RL recommendations
Stabilizes policy learning through hierarchical semantic action encoding
Enables scalable long-term value modeling for industrial recommender systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses fixed Semantic Action Space for RL recommendations
Employs Hierarchical Policy Network for coarse-to-fine decisions
Implements Multi-level Critic for token-level value estimates
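The three ideas above can be combined in a minimal coarse-to-fine sketch, assuming a three-level SID, random toy parameters, greedy decoding, and linear policy/critic heads; none of these choices reflect the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
LEVELS, CODES, DIM = 3, 8, 16  # toy sizes for illustration

# Randomly initialized per-level parameters (stand-ins for learned weights)
code_emb = rng.normal(size=(LEVELS, CODES, DIM))   # token embeddings per level
policy_w = rng.normal(size=(LEVELS, DIM, CODES))   # level-wise policy heads
critic_w = rng.normal(size=(LEVELS, DIM))          # multi-level critic heads

def select_sid(state):
    """Coarse-to-fine SID generation with residual state refinement."""
    sid, values = [], []
    h = state
    for lvl in range(LEVELS):
        logits = h @ policy_w[lvl]
        token = int(np.argmax(logits))               # greedy for illustration
        values.append(float(h @ critic_w[lvl]))      # token-level value estimate
        h = h - code_emb[lvl, token]                 # residual carries the context
                                                     # not yet explained by coarser levels
        sid.append(token)
    return tuple(sid), values

sid, values = select_sid(rng.normal(size=DIM))
assert len(sid) == LEVELS and len(values) == LEVELS
```

The residual update is the key step: each level decides on what the coarser levels left unexplained, and the per-level critic values enable credit assignment at the token level rather than only for the final item.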
Minmao Wang
Fudan University, Shanghai, China
Xingchen Liu
Kuaishou Technology, Beijing, China
Shijie Yi
Kuaishou Technology, Beijing, China
Likang Wu
Tianjin University, Tianjin, China
Hongke Zhao
Tianjin University, Tianjin, China
Fei Pan
University of Michigan
Computer Vision, Machine Learning
Qingpeng Cai
Kuaishou Technology
Reinforcement Learning, LLM, Recommender System, Computational Advertising
Peng Jiang
Kuaishou Technology, Beijing, China