GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fragility of unimodal policies in offline reinforcement learning when they confront multimodal action distributions: such policies often generate "intermediate" actions that are poorly supported by the dataset. To overcome this limitation, the authors propose the GEM framework, which trains a Gaussian Mixture Model (GMM) policy via a critic-guided, advantage-weighted Expectation-Maximization algorithm and incorporates behavior-policy normalization to quantify action support. During inference, candidate actions are rescored and reranked, enabling multimodal yet controllable action selection. The approach combines behavior normalization with candidate reranking and exposes the number of candidates as a tunable knob that trades computational cost against performance at inference time, improving decision quality without retraining. Extensive experiments on the D4RL benchmark demonstrate the method's effectiveness and flexibility.
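The critic-guided EM update can be pictured concretely. Below is a minimal sketch, assuming diagonal-covariance GMM components and exponentiated-advantage sample weights; the function name, shapes, and temperature `beta` are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def advantage_weighted_em_step(actions, advantages, mus, sigmas, pis, beta=1.0):
    """One EM-style update of a diagonal-covariance GMM policy (hypothetical sketch).

    Samples are weighted by exp(advantage / beta), so probability mass shifts
    toward high-advantage actions while the mixture keeps distinct components.
    actions:    (N, D) dataset actions
    advantages: (N,)   critic advantages A(s, a)
    mus:        (K, D) component means
    sigmas:     (K, D) component std-devs (diagonal covariance)
    pis:        (K,)   mixture weights
    """
    # Stabilized advantage weights (subtract the max before exponentiating)
    w = np.exp((advantages - advantages.max()) / beta)                  # (N,)

    # E-step: responsibilities r[n, k] proportional to pi_k * N(a_n | mu_k, sigma_k)
    diff = actions[:, None, :] - mus[None, :, :]                        # (N, K, D)
    log_comp = (-0.5 * ((diff / sigmas[None]) ** 2).sum(-1)
                - np.log(sigmas[None]).sum(-1)
                + np.log(pis)[None])                                    # (N, K) up to a constant
    log_comp -= log_comp.max(axis=1, keepdims=True)
    r = np.exp(log_comp)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: advantage-weighted sufficient statistics
    wr = w[:, None] * r                                                 # (N, K)
    Nk = wr.sum(axis=0) + 1e-8                                          # (K,)
    new_pis = Nk / Nk.sum()
    new_mus = (wr[:, :, None] * actions[:, None, :]).sum(0) / Nk[:, None]
    var = (wr[:, :, None] * (actions[:, None, :] - new_mus[None]) ** 2).sum(0) / Nk[:, None]
    new_sigmas = np.sqrt(var + 1e-6)
    return new_mus, new_sigmas, new_pis
```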

📝 Abstract
Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield "in-between" actions that are weakly supported by data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework that makes action selection both multimodal and explicitly controllable. GEM trains a Gaussian Mixture Model (GMM) actor via critic-guided, advantage-weighted EM-style updates that preserve distinct components while shifting probability mass toward high-value regions, and learns a tractable GMM behavior model to quantify support. During inference, GEM performs candidate-based selection: it generates a parallel candidate set and reranks actions using a conservative ensemble lower-confidence bound together with behavior-normalized support, where the behavior log-likelihood is standardized within each state's candidate set to yield stable, comparable control across states and candidate budgets. Empirically, GEM is competitive across D4RL benchmarks, and offers a simple inference-time budget knob (candidate count) that trades compute for decision quality without retraining.
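The inference-time rescoring rule described in the abstract can be sketched as follows; the scoring function, the lower-confidence-bound form `mean - kappa * std`, and the trade-off weight `lam` are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def rerank_candidates(q_ensemble, behavior_logp, lam=1.0, kappa=1.0):
    """Score one state's candidate set (hypothetical sketch of GEM's reranking).

    q_ensemble:    (E, M) Q-values from an ensemble of E critics for M candidates
    behavior_logp: (M,)   log-likelihood of each candidate under the GMM behavior model
    Returns the index of the highest-scoring candidate.
    """
    # Conservative value estimate: ensemble lower-confidence bound
    lcb = q_ensemble.mean(axis=0) - kappa * q_ensemble.std(axis=0)

    # Behavior normalization: standardize the behavior log-likelihoods
    # *within* this state's candidate set, so the support term stays
    # comparable across states and candidate budgets
    z = (behavior_logp - behavior_logp.mean()) / (behavior_logp.std() + 1e-8)

    scores = lcb + lam * z
    return int(np.argmax(scores))
```

Standardizing the support term within each candidate set keeps its scale stable as the candidate count grows, which is what lets the candidate budget act as a safe inference-time knob without retraining.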
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
multimodal action selection
action distribution
policy extraction
behavior support
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal policy
Gaussian Mixture Model
offline reinforcement learning
behavior normalization
candidate-based action selection
Haoyu Wang
University of Pennsylvania, Shanghai Jiao Tong University
Natural Language Processing, Computer Vision, Knowledge Graph
Jingcheng Wang
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China
Shunyu Wu
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China
Xinwei Xiao
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China