🤖 AI Summary
Single reward signals struggle to balance the multiple, often conflicting objectives that arise when fine-tuning foundation models. To address this, the paper proposes MR-ITF, a Multi-Reward Iterative Tuning Framework grounded in reinforcement learning. MR-ITF jointly models heterogeneous structured reward signals (e.g., text fluency, bioactivity, molecular properties), dynamically coordinates gradient updates across objectives in each iteration, and comes with a theoretical convergence analysis and a characterization of its training dynamics. Unlike existing RLHF approaches, MR-ITF eliminates the need for manual reward weighting or scalarization and naturally supports diverse, non-commensurable rewards. Empirically, it achieves state-of-the-art performance on three distinct generative tasks (text generation, protein sequence design, and small-molecule generation), with superior Pareto-front coverage in multi-objective evaluation and competitive or better single-objective performance. These results support MR-ITF’s dual advantages in generation quality and objective balancing.
📝 Abstract
Fine-tuning foundation models has emerged as a powerful approach for generating objects with specific desired properties. Reinforcement learning (RL) provides an effective framework for this purpose, enabling models to generate outputs that maximize a given reward function. However, in many applications, such as text generation and drug discovery, optimizing a single reward signal can be suboptimal because multiple evaluation criteria are often necessary. This paper proposes a novel reinforcement learning-based method for fine-tuning foundation models using multiple reward signals. By employing an iterative fine-tuning strategy across these rewards, our approach generalizes state-of-the-art RL-based methods. We further provide a theoretical analysis that offers insights into the performance of multi-reward RL fine-tuning. Experimental results across diverse domains, including text, biological sequence, and small molecule generation, demonstrate the effectiveness of the proposed algorithm compared to state-of-the-art baselines.
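The abstract does not spell out the update rule, so the following is only a toy illustration of what "iterative fine-tuning across rewards" could look like, not the paper's algorithm. It assumes a softmax policy over four candidate outputs and two hand-picked reward vectors; the function name `iterative_multi_reward_tuning`, the rewards, the step size, and the round count are all hypothetical. Each round cycles through the rewards and takes one exact policy-gradient step on each.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def iterative_multi_reward_tuning(rewards, rounds=300, lr=1.0):
    """Toy sketch (assumed, not from the paper): cycle through the rewards,
    taking one exact policy-gradient ascent step per reward in each round.

    rewards: list of 1-D arrays, one reward value per candidate output.
    Returns the final output probabilities of the softmax policy.
    """
    rewards = [np.asarray(r, dtype=float) for r in rewards]
    logits = np.zeros(rewards[0].shape[0])  # start from a uniform policy
    for _ in range(rounds):
        for r in rewards:
            p = softmax(logits)
            # Exact gradient of E_p[r] w.r.t. the logits: p * (r - E_p[r]).
            logits += lr * p * (r - p @ r)
    return softmax(logits)


if __name__ == "__main__":
    # Hypothetical rewards: outputs 0 and 1 each satisfy one criterion only,
    # output 2 scores well on both, output 3 scores on neither.
    r_a = np.array([1.0, 0.0, 0.8, 0.0])
    r_b = np.array([0.0, 1.0, 0.8, 0.0])
    p = iterative_multi_reward_tuning([r_a, r_b])
    print("final policy:", np.round(p, 3))
    print("expected rewards:", p @ r_a, p @ r_b)
```

In this toy setting, alternating between the two rewards moves probability mass toward the output that scores well on both criteria, whereas optimizing either reward alone would favor an output that scores zero on the other.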