🤖 AI Summary
This work addresses the problem of learning safe and optimal behavioral policies from human feedback, such as pairwise comparisons, rankings, or demonstrations, in safety-critical domains including robotic navigation and Formula 1 racing control. The proposed method introduces a preference-modeling framework based on Weighted Signal Temporal Logic (WSTL), which unifies safety constraints and task objectives in expressive temporal logic specifications. To make optimization scalable, the approach combines structural pruning with a logarithmic transformation that converts the multilinear WSTL constraints into mixed-integer linear programs (MILPs), improving computational efficiency and scalability. Experiments on simulated robotic navigation tasks and real-world F1 telemetry data demonstrate that the method captures fine-grained human preferences, enforces safety requirements, and models complex, dynamic task objectives.
📝 Abstract
Autonomous systems increasingly rely on human feedback, expressed as pairwise comparisons, rankings, or demonstrations, to align their behavior. While existing methods can adapt behaviors, they often fail to guarantee safety in safety-critical domains. We propose a safety-guaranteed, optimal, and efficient approach to learning from preferences, rankings, or demonstrations using Weighted Signal Temporal Logic (WSTL). Implemented naively, WSTL learning problems lead to multilinear constraints in the weights to be learned. By introducing structural pruning and log-transform procedures, we reduce the problem size and recast the problem as a Mixed-Integer Linear Program while preserving safety guarantees. Experiments on robotic navigation and real-world Formula 1 data demonstrate that the method effectively captures nuanced preferences and models complex task objectives.
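The core of the log-transform step can be sketched as follows; this is an illustrative reconstruction, not the paper's implementation, and the function names are hypothetical. For strictly positive weights, a multilinear constraint of the form w1 · w2 · … · wn ≥ c becomes linear in the substituted variables u_i = log(w_i), which is what makes an MILP reformulation possible:

```python
import math

# Illustrative sketch (not the paper's code): a multilinear constraint
#     w1 * w2 * ... * wn >= c,   with all w_i > 0 and c > 0,
# is equivalent, after substituting u_i = log(w_i), to the linear constraint
#     u1 + u2 + ... + un >= log(c).

def multilinear_satisfied(weights, c):
    """Check the original multilinear constraint prod(w_i) >= c."""
    prod = 1.0
    for w in weights:
        prod *= w
    return prod >= c

def log_linear_satisfied(weights, c):
    """Check the equivalent linear constraint sum(log w_i) >= log c."""
    return sum(math.log(w) for w in weights) >= math.log(c)

# The two formulations agree on any positive weight vector:
weights = [0.8, 1.5, 2.0]
print(multilinear_satisfied(weights, 2.0))   # product is 2.4
print(log_linear_satisfied(weights, 2.0))    # sum of logs vs log(2)
```

Because the substituted constraint is linear in the u_i variables, it can be handed directly to a mixed-integer linear solver alongside the integer variables that encode the temporal-logic structure.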