Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

📅 2025-09-27
🤖 AI Summary
Noisy human feedback severely limits the effectiveness of large language model (LLM) preference alignment, yet existing automated data cleaning methods lack systematic evaluation. This paper introduces PrefCleanBench—the first benchmark dedicated to evaluating data cleaning for LLM preference alignment—enabling unified assessment of 13 cleaning methods across diverse datasets, model architectures, and optimization algorithms. We propose a standardized cleaning evaluation framework that identifies key determinants of cleaning efficacy, including noise type and preference strength distribution, and empirically establish the decisive impact of data quality on both alignment performance and generalization. Experiments demonstrate that principled data cleaning significantly improves reward modeling accuracy and policy alignment fidelity. To foster reproducible and trustworthy alignment research, we open-source all code and evaluation protocols.
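One family of cleaning strategies the summary alludes to filters preference pairs by their preference strength. As a minimal illustrative sketch (not the paper's actual method, and all names here are hypothetical), pairs whose reward-score margin between the chosen and rejected response falls below a threshold can be treated as noisy and dropped before alignment training:

```python
def clean_preference_data(pairs, margin_threshold=0.5):
    """Keep only pairs where the chosen response outscores the
    rejected one by at least margin_threshold (a proxy for
    preference strength); weak or inverted pairs are dropped."""
    cleaned = []
    for pair in pairs:
        margin = pair["chosen_score"] - pair["rejected_score"]
        if margin >= margin_threshold:
            cleaned.append(pair)
    return cleaned

# Toy usage: one strong pair, one weak pair, one inverted pair.
pairs = [
    {"chosen_score": 2.1, "rejected_score": 0.4},  # strong preference
    {"chosen_score": 1.0, "rejected_score": 0.9},  # weak, likely noise
    {"chosen_score": 0.2, "rejected_score": 1.5},  # inverted, noise
]
print(len(clean_preference_data(pairs)))  # keeps only the strong pair
```

The threshold trades off data quantity against label reliability; benchmarks like PrefCleanBench exist precisely because such choices interact with noise type and the preference strength distribution of the dataset.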

📝 Abstract
Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating effectiveness of preference data cleaning methods
Benchmarking data cleaning for reliable LLM alignment
Assessing cleaning strategies across diverse datasets and models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking 13 preference data cleaning methods
Standardized protocol assessing alignment and generalizability
Modular implementations released for further research
👥 Authors
Min-Hsuan Yeh (University of Wisconsin-Madison)
Yixuan Li (Department of Computer Science, University of Wisconsin-Madison)