Aligning Transformers with Continuous Feedback via Energy Rank Alignment

📅 2024-05-21
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
Existing autoregressive models for molecular and protein sequence generation are poorly aligned with target properties and lack efficient guidance mechanisms. Method: The paper proposes Energy Rank Alignment (ERA), which treats an explicit reward as an energy function and steers autoregressive Transformers toward the corresponding Gibbs–Boltzmann distribution via a gradient-based objective, bypassing reinforcement learning entirely. Contribution/Results: Theoretically, ERA is closely related to both PPO and DPO, but its minimizer converges to the ideal Gibbs–Boltzmann distribution; practically, it is scalable and robust when preference data are scarce. Empirically, on chemical-space search and protein language modeling tasks, ERA improves the attribute fidelity and structural diversity of generated molecules and sequences, and it outperforms DPO when the number of preference observations per pairing is small, demonstrating superior sample efficiency and generalization.

📝 Abstract
Searching through chemical space is an exceptionally challenging problem because the number of possible molecules grows combinatorially with the number of atoms. Large, autoregressive models trained on databases of chemical compounds have yielded powerful generators, but we still lack robust strategies for generating molecules with desired properties. This molecular search problem closely resembles the "alignment" problem for large language models, though for many chemical tasks we have a specific and easily evaluable reward function. Here, we introduce an algorithm called energy rank alignment (ERA) that leverages an explicit reward function to produce a gradient-based objective that we use to optimize autoregressive policies. We show theoretically that this algorithm is closely related to proximal policy optimization (PPO) and direct preference optimization (DPO), but has a minimizer that converges to an ideal Gibbs-Boltzmann distribution with the reward playing the role of an energy function. Furthermore, this algorithm is highly scalable, does not require reinforcement learning, and performs well relative to DPO when the number of preference observations per pairing is small. We deploy this approach to align molecular transformers and protein language models to generate molecules and protein sequences, respectively, with externally specified properties and find that it does so robustly, searching through diverse parts of chemical space.
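The abstract describes a gradient-based, DPO-like objective whose minimizer is a Gibbs-Boltzmann distribution with the reward as an energy. A minimal, hypothetical sketch of such an energy-ranked pairwise loss is below; the function name `era_pair_loss`, the exact loss form, and the hyperparameters `beta` and `gamma` are illustrative assumptions, not the paper's precise formulation.

```python
import math

def era_pair_loss(logp_y, logp_yp, energy_y, energy_yp, beta=1.0, gamma=1.0):
    """Illustrative energy-ranked pairwise objective (hypothetical form).

    The target preference for sample y over y' follows a Boltzmann weight
    on the energy gap, so lower-energy (higher-reward) samples should be
    assigned higher probability by the policy.
    """
    # Target preference from the energy gap: sigma(-beta * (U(y) - U(y')))
    target = 1.0 / (1.0 + math.exp(beta * (energy_y - energy_yp)))
    # Model preference from the policy's log-probability gap
    model = 1.0 / (1.0 + math.exp(-gamma * (logp_y - logp_yp)))
    # Cross-entropy between the two preference distributions; minimized
    # when the policy's ranking matches the Boltzmann ranking
    eps = 1e-12
    return -(target * math.log(model + eps)
             + (1.0 - target) * math.log(1.0 - model + eps))
```

Under this sketch, a policy that assigns higher log-probability to the lower-energy sample incurs a smaller loss than one that ranks the pair the other way, which is the qualitative behavior the abstract attributes to ERA.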
Problem

Research questions and friction points this paper is trying to address.

Searching chemical space for molecules with desired properties
Aligning autoregressive models using explicit reward functions
Optimizing molecular and protein generation via energy rank alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy rank alignment optimizes autoregressive policies
Leverages explicit reward for gradient-based objective
Converges to ideal Gibbs-Boltzmann distribution
Shriram Chennakesavalu
Department of Chemistry, Stanford University, Stanford, CA, USA 94305
Frank Hu
Department of Chemistry, Stanford University, Stanford, CA, USA 94305
Sebastian Ibarraran
Department of Chemistry, Stanford University, Stanford, CA, USA 94305
Grant M. Rotskoff
Department of Chemistry, Stanford University
Nonequilibrium Statistical Mechanics, Self-Assembly, Biophysics, Machine Learning