On the Role of Difficult Prompts in Self-Play Preference Optimization

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how prompt difficulty affects Self-Play Preference Optimization (SPPO). Using the mean reward over *N* sampled responses as a proxy for prompt difficulty, we find that difficult prompts optimize substantially worse than easy ones, and that adding them to the training mix brings no benefit and slightly degrades performance compared to training on easy prompts alone. We also observe that this difficulty gap narrows as model capacity grows. Building on these findings, we propose a model-capacity-aware prompt filtering strategy: score each prompt by its mean sampled reward and discard an appropriate portion of the hardest prompts before DPO training. Experiments show that training exclusively on easy prompts outperforms full-prompt training, and that moderate filtering of hard prompts consistently improves final preference alignment. This is the first systematic study of the interaction between prompt difficulty and model capacity in SPPO, establishing prompt selection as a critical but previously overlooked optimization dimension.
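
The difficulty proxy itself is simple to state. Below is a minimal sketch, assuming hypothetical `generate` (one on-policy sample from the policy model) and `score` (reward-model score for a prompt-response pair) callables; the paper specifies the idea, not an API:

```python
from statistics import mean

def mean_sampled_reward(prompt, generate, score, n=8):
    """Difficulty proxy from the paper: the average reward-model score
    over n on-policy samples. A low value marks the prompt as difficult
    for the current model."""
    responses = [generate(prompt) for _ in range(n)]  # on-policy samples
    return mean(score(prompt, r) for r in responses)  # RM scores, averaged
```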

📝 Abstract
Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically uses a language model to generate on-policy responses to prompts and a reward model (RM) to select chosen and rejected responses, on which the model is further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite prompts being a core component of this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We first use the mean reward of $N$ sampled responses to a prompt as a proxy for its difficulty. We find that difficult prompts exhibit substantially worse self-play optimization performance than easy prompts. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. We also observe that the performance gap between difficult and easy prompts closes as model capacity increases, suggesting that difficulty interacts with model capacity. Building on these findings, we explore strategies to mitigate the negative effect of difficult prompts on final performance. We demonstrate that selectively removing an appropriate portion of challenging prompts enhances overall self-play performance, while also reporting failed attempts and lessons learned.
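
To make the filtering step concrete, here is a minimal sketch of difficulty-based prompt filtering followed by chosen/rejected pair construction for DPO, assuming each record already carries its sampled responses and their reward scores; the field names and the `drop_frac` value are illustrative, since the abstract only says that an "appropriate portion" of hard prompts should be removed:

```python
def filter_hard_and_build_pairs(records, drop_frac=0.2):
    """records: list of dicts with keys "prompt", "responses", "rewards".
    Drops the drop_frac hardest prompts (lowest mean reward), then pairs
    the best- and worst-scored response of each surviving prompt."""
    by_difficulty = sorted(
        records, key=lambda r: sum(r["rewards"]) / len(r["rewards"])
    )  # ascending mean reward, so the hardest prompts come first
    kept = by_difficulty[int(len(by_difficulty) * drop_frac):]
    pairs = []
    for rec in kept:
        ranked = sorted(zip(rec["rewards"], rec["responses"]),
                        key=lambda x: x[0])
        pairs.append({
            "prompt": rec["prompt"],
            "chosen": ranked[-1][1],   # highest-reward response
            "rejected": ranked[0][1],  # lowest-reward response
        })
    return pairs  # ready for a standard DPO trainer
```

Because the ranking uses rewards from the current model's own samples, the filter is model-specific: the same prompt pool yields a different "hard" subset as model capacity changes, which matches the capacity interaction the abstract reports.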
Problem

Research questions and friction points this paper is trying to address.

Investigating how prompt difficulty affects self-play preference optimization performance
Analyzing the negative impact of difficult prompts on language model training outcomes
Exploring strategies to mitigate performance degradation from challenging prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using the mean reward over sampled responses as a proxy for prompt difficulty
Selectively removing difficult prompts enhances performance
Model capacity interacts with prompt difficulty effects