Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional reinforcement learning (RL) relies solely on scalar rewards, limiting its ability to incorporate rich semantic information—such as natural language instructions, commonsense knowledge, or domain-specific constraints. To address this, we propose Prompted Policy Search (ProPS), the first RL framework to directly integrate large language models (LLMs) into the core of policy optimization. ProPS jointly leverages natural language prompts—encoding task objectives, domain knowledge, and behavioral constraints—with numerical rewards to enable semantic-numerical co-reasoning. Crucially, it requires no LLM fine-tuning, instead harnessing in-context learning for policy updates and gradient-free numerical optimization. This design markedly improves exploration efficiency and sample efficiency. Evaluated on 15 Gymnasium benchmark tasks, ProPS outperforms mainstream algorithms—including PPO and SAC—on 8 tasks; performance gains are especially pronounced when domain knowledge is injected. ProPS advances interpretable, human-aligned, and general-purpose RL.

📝 Abstract
Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augments existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop, directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and strategy hints, can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across fifteen Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely-adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on eight out of fifteen tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.
Problem

Research questions and friction points this paper is trying to address.

Standard RL learns from scalar rewards alone and cannot exploit semantic knowledge expressed in language.
Humans learn by combining numerical feedback with language and prior knowledge; conventional RL does not.
ProPS places an LLM in the policy-update loop so that both reward signals and language inform optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-centered policy optimization with linguistic and numerical reasoning
In-context numerical optimization using semantic signals like goals and hints
Unified framework for transparent and sample-efficient reinforcement learning
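The innovations above can be illustrated with a minimal prompted-policy-search loop. This is a hedged sketch, not the paper's implementation: `evaluate` is a toy stand-in for Gymnasium rollouts, and `llm_propose` is a hypothetical placeholder that mimics an LLM's in-context numerical reasoning with a deterministic perturbation rule instead of a real model call. In the actual method, the prompt (task goal, domain hints, and the numerical history) would be sent to an LLM, and its proposed parameters parsed from the reply.

```python
import random

def evaluate(params):
    """Toy reward: negative squared distance from a hidden optimum.
    Stands in for averaging episode returns from environment rollouts."""
    optimum = [0.5, -0.3]
    return -sum((p - o) ** 2 for p, o in zip(params, optimum))

def build_prompt(history, hint):
    """Compose the LLM prompt: a natural-language hint plus the
    numerical history of (policy parameters, reward) pairs."""
    lines = [f"Goal: {hint}", "History of (policy parameters, reward):"]
    for params, reward in history:
        lines.append(f"  params={params} -> reward={reward:.4f}")
    lines.append("Propose improved parameters as a comma-separated list.")
    return "\n".join(lines)

def llm_propose(prompt, history):
    """Hypothetical stand-in for the LLM call. A real system would send
    `prompt` to a model and parse its reply; here we imitate in-context
    numerical optimization by perturbing the best parameters seen so far."""
    best_params, _ = max(history, key=lambda h: h[1])
    rng = random.Random(len(history))  # deterministic, for the sketch only
    return [p + rng.uniform(-0.1, 0.1) for p in best_params]

def prompted_policy_search(hint, iterations=30):
    """Gradient-free policy search driven by prompt-based proposals."""
    params = [0.0, 0.0]
    history = [(params, evaluate(params))]
    for _ in range(iterations):
        prompt = build_prompt(history, hint)
        candidate = llm_propose(prompt, history)
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda h: h[1])

best, reward = prompted_policy_search("reach the target state quickly")
print(best, reward)
```

Note the design choice highlighted by the sketch: no gradients and no LLM fine-tuning are involved; the optimizer's entire state is the prompt itself, which is what makes the search loop transparent and lets semantic hints steer exploration.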
Yifan Zhou
Interactive Robotics Lab, Arizona State University
Sachin Grover
Interactive Robotics Lab, Arizona State University
Mohamed El Mistiri
Interactive Robotics Lab, Arizona State University
Kamalesh Kalirathnam
Interactive Robotics Lab, Arizona State University
Pratyush Kerhalkar
Interactive Robotics Lab, Arizona State University
Swaroop Mishra
Research Scientist, Google DeepMind
Large Language Models · Natural Language Processing
Neelesh Kumar
Senior Scientist - AI R&D, Procter and Gamble
Machine Learning
Sanket Gaurav
Research & Development, Procter & Gamble
Oya Aran
Director @ Procter & Gamble R&D Data Science & AI
Social Computing · Machine Learning · Computer Vision
Heni Ben Amor
Associate Professor, Arizona State University
Human-Robot Interaction · Robotics · Motor Skill Learning · Artificial Intelligence