MultiScale Contextual Bandits for Long Term Objectives

📅 2025-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the temporal misalignment between short-term feedback (e.g., clicks) and long-term objectives (e.g., user retention) in AI systems. To bridge this gap, we propose a multi-scale policy learning framework that hierarchically decomposes long-horizon goals into tractable short-horizon sub-goals, enabling optimization of immediate actions toward sustained outcomes. We introduce the first multi-scale contextual bandit model, which supports cross-horizon objective propagation and joint counterfactual optimization of multi-level policies in offline settings. Evaluated on recommendation ranking and text generation tasks, our approach improves 7-day user retention by 12.3% without degrading short-term metrics (e.g., CTR), demonstrating superior long-term optimization capability and generalizability across domains.

Technology Category

Application Category

📝 Abstract
The feedback that AI systems (e.g., recommender systems, chatbots) collect from user interactions is a crucial source of training data. While short-term feedback (e.g., clicks, engagement) is widely used for training, there is ample evidence that optimizing short-term feedback does not necessarily achieve the desired long-term objectives. Unfortunately, directly optimizing for long-term objectives is challenging, and we identify the disconnect in the timescales of short-term interventions (e.g., rankings) and the long-term feedback (e.g., user retention) as one of the key obstacles. To overcome this disconnect, we introduce the framework of MultiScale Policy Learning to contextually reconcile that AI systems need to act and optimize feedback at multiple interdependent timescales. For any two levels, our formulation selects the shorter-term objective at the next lower scale to optimize the longer-term objective at the next higher scale. As a result, the policies at all levels effectively optimize for the long-term. We instantiate the framework with MultiScale Off-Policy Bandit Learning (MSBL) and demonstrate its effectiveness on three tasks relating to recommender systems and text generation.
Problem

Research questions and friction points this paper is trying to address.

Bridging short-term feedback and long-term objectives in AI systems
Addressing timescale disconnect in user interaction optimization
Developing multi-scale policy learning for interdependent feedback levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

MultiScale Policy Learning for timescale reconciliation
Hierarchical objective optimization across timescales
Off-Policy Bandit Learning for long-term goals
🔎 Similar Papers
No similar papers found.