Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of linear scalarization in offline multi-objective reinforcement learning: it provably fails to recover non-convex regions of the Pareto front. The authors propose STOMP (Smooth Tchebysheff Optimization of Multi-Objective Preferences), an offline RL algorithm that applies smooth Tchebysheff scalarization to the multi-objective RL problem itself rather than to the rewards directly. STOMP extends Direct Preference Optimization (DPO) to the multi-objective setting by standardizing each reward based on its observed distribution, which avoids the theoretical limits of linear scalarization. Evaluated by aligning three autoregressive protein language models on three laboratory datasets of protein fitness, STOMP attains the highest hypervolume in eight of nine settings compared with state-of-the-art baselines.
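For readers unfamiliar with the hypervolume metric mentioned above, here is a minimal sketch of how it can be computed for two objectives that are both maximized. The function name `hypervolume_2d`, the reference point, and the example points are illustrative assumptions; the paper's own evaluation code and reference points are not given on this page.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (both objectives maximized).

    Illustrative helper only; not taken from the paper.
    """
    # Keep only points that strictly improve on the reference point in both objectives.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            # This point adds a new horizontal strip above everything seen so far.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
hv_front = hypervolume_2d(front, ref=(0.0, 0.0))
# Adding a dominated point leaves the hypervolume unchanged.
hv_with_dominated = hypervolume_2d(front + [(1.0, 1.0)], ref=(0.0, 0.0))
```

A larger hypervolume means the set of solutions covers more of the objective space above the reference point, which is why it is commonly used to compare approximations of a Pareto front.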

📝 Abstract

Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
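The failure of linear scalarization on non-convex fronts can be illustrated with a toy example. The sketch below is not the STOMP algorithm; it only contrasts a weighted sum with the standard smooth Tchebysheff form, mu * log(sum_i exp(w_i * (z_i - r_i) / mu)), adapted here to rewards that are maximized. The three candidate reward vectors, the ideal point `ideal`, the smoothing parameter `mu`, and all function names are assumptions chosen for illustration.

```python
import math


def linear_scalarization(rewards, weights):
    # Weighted sum of rewards; larger is better.
    return sum(w * r for w, r in zip(weights, rewards))


def smooth_tchebysheff(rewards, weights, ideal, mu=0.1):
    # Smooth (log-sum-exp) relaxation of max_i w_i * (ideal_i - reward_i); smaller is better.
    scaled = [w * (z - r) / mu for w, r, z in zip(weights, rewards, ideal)]
    m = max(scaled)  # subtract the max for numerical stability
    return mu * (m + math.log(sum(math.exp(s - m) for s in scaled)))


# Toy reward vectors (assumed). C is Pareto-optimal but lies in a
# non-convex region: it is below the line segment joining A and B.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)  # assumed ideal point: best value of each reward


def best_linear(weights):
    return max(candidates, key=lambda k: linear_scalarization(candidates[k], weights))


def best_stch(weights, mu=0.1):
    return min(candidates, key=lambda k: smooth_tchebysheff(candidates[k], weights, ideal, mu))


# Sweep the weight simplex: the weighted sum only ever selects A or B.
grid = [(i / 20, 1 - i / 20) for i in range(21)]
linear_winners = {best_linear(w) for w in grid}

# With equal weights, the smooth Tchebysheff objective selects the balanced point C.
stch_winner = best_stch((0.5, 0.5))
```

In this toy setting no choice of weights makes the weighted sum prefer C, whereas the smooth Tchebysheff objective reaches it with balanced weights, which is the behavior the abstract attributes to the scalarization family it builds on.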

Problem

Research questions and friction points this paper is trying to address.

multi-objective reinforcement learning

Pareto optimality

offline reinforcement learning

reward scalarization

preference alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-objective reinforcement learning

offline reinforcement learning

smooth Tchebysheff scalarization