Lever: Inference-Time Policy Reuse under Support Constraints

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning policies are typically trained for fixed objectives and are hard to reuse when task requirements change. This work proposes an end-to-end framework that, without any additional environment interaction, retrieves, evaluates, and composes policies from a pre-trained policy library to address novel composite tasks. The method targets the support-limited setting, where no value propagation is possible: it evaluates candidate policies using behavioral embeddings and synthesizes new policies via offline Q-value composition, with composition strategies that control how many candidates are explored. Experiments in deterministic GridWorld environments show that the composed policies match, and in some cases exceed, the performance of policies trained from scratch while offering substantial speedups. Performance degrades on tasks whose long-horizon dependencies require value propagation, and the results show that the effectiveness of reuse depends critically on the coverage of available transitions.

📝 Abstract

Reinforcement learning (RL) policies are typically trained for fixed objectives, making reuse difficult when task requirements change. We study inference-time policy reuse: given a library of pre-trained policies and a new composite objective, can a high-quality policy be constructed entirely offline, without additional environment interaction? We introduce Lever (Leveraging Efficient Vector Embeddings for Reusable policies), an end-to-end framework that retrieves relevant policies, evaluates them using behavioral embeddings, and composes new policies via offline Q-value composition. We focus on the support-limited regime, where no value propagation is possible, and show that the effectiveness of reuse depends critically on the coverage of available transitions. To balance performance and computational cost, Lever provides composition strategies that control the exploration of candidate policies. Experiments in deterministic GridWorld environments show that inference-time composition can match, and in some cases exceed, training-from-scratch performance while providing substantial speedups. At the same time, performance degrades when long-horizon dependencies require value propagation, highlighting a fundamental limitation of offline reuse.
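
To make the idea of offline Q-value composition under a support constraint more concrete, here is a minimal sketch. It is not the paper's algorithm: the abstract does not specify the composition rule, so the weighted sum of tabular Q-values, the boolean support mask, and the names compose_q and greedy_policy are all assumptions made for illustration.

```python
import numpy as np

def compose_q(q_tables, weights, support):
    """Combine library Q-tables for a composite objective (illustrative only).

    q_tables: list of arrays of shape (n_states, n_actions), one per library policy.
    weights:  one weight per table; a weighted sum is an ASSUMED composition rule.
    support:  boolean array of shape (n_states, n_actions); True where the
              offline data contains a transition for that (state, action) pair.
    """
    combined = sum(w * q for w, q in zip(weights, q_tables))
    # Without value propagation, unsupported pairs cannot be evaluated,
    # so they are excluded from action selection.
    return np.where(support, combined, -np.inf)

def greedy_policy(q):
    """Pick the best supported action per state.

    Note: if a state has no supported action, argmax falls back to action 0.
    """
    return np.argmax(q, axis=1)

# Toy example: 3 states, 2 actions, 2 library policies (made-up numbers).
q_a = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
q_b = np.array([[0.0, 3.0], [1.0, 0.0], [0.0, 2.0]])
support = np.array([[True, False], [True, True], [True, True]])

composed = compose_q([q_a, q_b], [0.5, 0.5], support)
policy = greedy_policy(composed)
```

In state 0 the composed value prefers action 1, but that pair is outside the support, so the greedy policy falls back to action 0. This is one way to see why coverage of available transitions determines how well reuse can work.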

Problem

Research questions and friction points this paper is trying to address.

policy reuse

inference-time composition

support constraints

offline reinforcement learning

composite objectives

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time policy reuse

behavioral embeddings

offline Q-value composition