🤖 AI Summary
This study addresses preference alignment in the speech modality, an underexplored area where annotation protocols remain ill-defined compared to the well-studied text modality. Through a controlled cross-modal experiment, the authors systematically compare human preference judgments over semantically identical content presented as either text or audio, using the intraclass correlation coefficient (ICC) to quantify annotation reliability. The work shows that preference criteria differ substantially between modalities, with near-chance cross-modal consistency, yet audio annotations achieve reliability comparable to text (ICC ≈ 0.80, reached with approximately nine annotators), the first ICC-based reliability characterization of preference annotation in either modality. Audio preferences also exhibit stronger user orientation and reduced length bias. Finally, the study shows that synthetic scores effectively predict human agreement, suggesting they could substitute for manual annotation in preference modeling.
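As a concrete illustration of the reliability measure the summary refers to, the sketch below computes ICC(2,k) from a long-format annotation table using Python's `pingouin` package. The file name and column names are assumptions for illustration, not details from the paper.

```python
# Minimal sketch: inter-rater reliability via ICC(2,k), assuming annotations
# are stored in long format (one row per rater-item pair). The CSV name and
# column names ("item", "rater", "rating") are hypothetical.
import pandas as pd
import pingouin as pg

df = pd.read_csv("audio_preferences.csv")  # columns: item, rater, rating

icc = pg.intraclass_corr(
    data=df, targets="item", raters="rater", ratings="rating"
)
# ICC(2,k): two-way random effects, absolute agreement, average of k raters.
icc2k = icc.loc[icc["Type"] == "ICC2k", "ICC"].item()
print(f"ICC(2,k) = {icc2k:.2f}")  # the paper reports ~0.80 for audio
```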
📝 Abstract
Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, with inter-rater agreement reaching good levels (ICC(2,k) $\approx$ .80) at $\sim$9 raters -- the first ICC-based reliability characterization in the preference annotation literature for either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Synthetic ratings further align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and as full replacements for human annotations.
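The claim that good reliability (ICC(2,k) $\approx$ .80) is reached at roughly nine raters is consistent with the standard Spearman-Brown relation between single-rater and average-measure reliability. The sketch below is illustrative only: the single-rater ICC value is hypothetical, chosen to reproduce the reported numbers, and is not taken from the paper.

```python
# Spearman-Brown prophecy formula: relates single-rater ICC(2,1) to the
# average-measure ICC(2,k) of k raters, and inverts it to estimate how
# many raters are needed to hit a target reliability.
import math

def icc_k(icc1: float, k: int) -> float:
    """Average-measure reliability of k raters given single-rater ICC."""
    return k * icc1 / (1 + (k - 1) * icc1)

def raters_needed(icc1: float, target: float) -> int:
    """Smallest k whose average-measure ICC reaches the target."""
    return math.ceil(target * (1 - icc1) / (icc1 * (1 - target)))

icc1 = 0.31  # hypothetical single-rater ICC(2,1), for illustration only
print(raters_needed(icc1, 0.80))   # -> 9, matching the ~9 raters reported
print(round(icc_k(icc1, 9), 2))    # -> 0.8
```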