🤖 AI Summary
This paper addresses the problem of automatically learning unknown, complex constraints from expert demonstrations without assuming prior knowledge of the constraint parameterization or an environmental model. The proposed method is an iterative framework grounded in positive-unlabeled (PU) learning: states from expert trajectories serve as positive (feasible) examples, while trajectories sampled by the current policy form an unlabeled set that mixes feasible and infeasible states. A constraint model (a feasible-versus-infeasible classifier) and the policy are updated in alternation, and a memory buffer stores samples from earlier iterations so they can be reused and the constraint model does not forget them. This allows the method to represent continuous, possibly nonlinear constraints rather than relying on linear or known parametric forms. Evaluated in two MuJoCo environments, the approach recovers continuous nonlinear constraints and outperforms a baseline method in constraint accuracy and policy safety.
📝 Abstract
Planning for a wide range of real-world tasks requires knowing and specifying all constraints. In some cases, however, these constraints are unknown or difficult to specify accurately. One possible solution is to infer the unknown constraints from expert demonstrations. Most prior work is limited to learning simple linear constraints, or requires strong knowledge of the true constraint parameterization or of the environmental model. To mitigate these problems, this paper presents a positive-unlabeled (PU) learning approach for inferring a continuous, arbitrary, and possibly nonlinear constraint from demonstrations. From a PU learning perspective, we treat all data in the demonstrations as positive (feasible) data, and learn a (sub)optimal policy to generate high-reward but potentially infeasible trajectories, which serve as unlabeled data containing both feasible and infeasible states. Under an assumption on the data distribution, a feasible-infeasible classifier (i.e., the constraint model) is learned from the two datasets through a postprocessing PU learning technique. The overall method is iterative, alternating between a policy update step, which generates and selects higher-reward policies, and a constraint model update step. Additionally, a memory buffer records and reuses samples from previous iterations to prevent forgetting. The effectiveness of the proposed method is validated in two MuJoCo environments, where it successfully infers continuous nonlinear constraints and outperforms a baseline method in terms of constraint accuracy and policy safety.
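To make the PU step concrete, here is a minimal sketch of one well-known postprocessing PU technique: train a classifier to separate demonstration states from policy-generated states, then rescale its output by the average score on the positives. The abstract does not say which postprocessing technique or classifier the paper uses, so everything below is an assumption for illustration only: the rescaling rule, the linear logistic classifier (the paper targets nonlinear constraints), the 1-D toy data with true constraint `x < 0`, and all function names.

```python
import numpy as np


def sigmoid(z):
    # Clip to keep exp() numerically stable.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def add_bias(X):
    # Append a constant column so the model has an intercept.
    return np.hstack([X, np.ones((len(X), 1))])


def fit_logistic(X, s, lr=0.5, steps=2000):
    """Gradient-descent logistic regression predicting s (1 = demonstration state)."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (sigmoid(Xb @ w) - s) / len(s)
        w -= lr * grad
    return w


def pu_feasibility(expert_states, policy_states, query_states):
    """Estimate P(feasible | state) from positive and unlabeled states.

    Assumed rule (hypothetical stand-in for the paper's technique):
      1. fit g(x) ~ P(state came from demonstrations | x),
      2. estimate c as the mean of g over the demonstration states,
      3. return g(x) / c, clipped to [0, 1].
    """
    X = np.vstack([expert_states, policy_states])
    s = np.concatenate([np.ones(len(expert_states)), np.zeros(len(policy_states))])
    w = fit_logistic(X, s)
    c = sigmoid(add_bias(expert_states) @ w).mean()
    return np.clip(sigmoid(add_bias(query_states) @ w) / c, 0.0, 1.0)


# Toy data: demonstrations stay in the feasible region x < 0, while the
# reward-seeking policy visits both feasible and infeasible states.
rng = np.random.default_rng(0)
expert_states = rng.uniform(-2.0, 0.0, size=(500, 1))   # positive (feasible)
policy_states = rng.uniform(-2.0, 2.0, size=(1000, 1))  # unlabeled (mixed)
query_states = np.array([[-1.5], [1.5]])

feas = pu_feasibility(expert_states, policy_states, query_states)
print(feas)  # estimated feasibility for x = -1.5 and x = 1.5
```

In the iterative scheme the abstract describes, a step like this would be repeated after each policy update, with `policy_states` drawn from the memory buffer of samples collected so far; that wiring is omitted here.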