DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual bottlenecks of reinforcement learning’s reliance on costly human annotations and conventional dual learning’s restriction to strictly invertible tasks (e.g., machine translation), this paper proposes DuPO—a dual preference optimization framework that operates without labeled feedback. Methodologically, DuPO leverages generalized duality to decompose task inputs into known and unknown components; it then constructs asymmetric dual tasks to reconstruct the unknown part and uses reconstruction quality as a self-supervised reward for large language model self-verification. This design relaxes the strict invertibility requirement inherent in prior dual learning approaches, enabling application to diverse non-invertible tasks—including translation and mathematical reasoning. Empirically, DuPO achieves a +2.13 COMET score improvement across 756 translation directions, a +6.4-point average accuracy gain on three mathematical reasoning benchmarks, and a +9.3-point performance boost when deployed as an inference-time reasoning reranker.

📝 Abstract
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
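The reconstruction-based reward described above can be illustrated with a toy sketch. Here the "primal task" solves a linear equation a·x + b = c for x; the constant c plays the role of the hidden unknown component, and the dual task reconstructs it from the known part (a, b) plus a candidate output. All function names and the candidate-sampling stub are hypothetical stand-ins for an LLM, not the paper's implementation:

```python
def primal_candidates(a, b, c):
    """Hypothetical stand-in for sampling LLM solutions to a*x + b = c."""
    true_x = (c - b) / a
    return [true_x, true_x + 1, true_x - 2]  # one correct, two flawed candidates

def dual_reconstruct(a, b, x):
    """Dual task: recover the hidden constant c from known (a, b) and the output x."""
    return a * x + b

def reconstruction_reward(c_true, c_hat):
    """Self-supervised reward: how closely the dual task recovers the unknown part."""
    return -abs(c_true - c_hat)

def rerank(a, b, c):
    """Score each candidate by dual reconstruction quality; keep the best one."""
    scored = [(reconstruction_reward(c, dual_reconstruct(a, b, x)), x)
              for x in primal_candidates(a, b, c)]
    return max(scored)[1]

print(rerank(3, 5, 17))  # → 4.0 (only x = 4 reconstructs c = 17 exactly)
```

No gold label for x is ever consulted: the reward comes entirely from reconstructing the withheld input component, which is what lets the reward double as an inference-time reranking score.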
Problem

Research questions and friction points this paper is trying to address.

Reduces reliance on costly labeled data for LLM verification
Extends dual learning beyond strictly invertible task pairs
Enables self-supervised optimization via reconstruction-based rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual learning-based preference optimization framework
Generates annotation-free feedback via generalized duality
Self-supervised reward from dual task reconstruction
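The last bullet, combined with the "preference optimization" framing, suggests reconstruction rewards are consumed as preference signals. A minimal sketch of that step, assuming a DPO-style setup where the best- and worst-reconstructing candidates form a (chosen, rejected) pair; `make_pair` and the data shape are illustrative, not the paper's API:

```python
def make_pair(candidates_with_rewards):
    """Pair the highest- and lowest-reward candidates as (chosen, rejected)."""
    ranked = sorted(candidates_with_rewards, key=lambda cr: cr[1], reverse=True)
    (chosen, _), (rejected, _) = ranked[0], ranked[-1]
    return {"chosen": chosen, "rejected": rejected}

# Rewards here come from dual-task reconstruction, not human labels.
pair = make_pair([("x = 4", 0.0), ("x = 5", -3.0), ("x = 2", -6.0)])
print(pair)  # → {'chosen': 'x = 4', 'rejected': 'x = 2'}
```

Because a single LLM can instantiate both the primal and dual tasks, such pairs can be generated at scale without any annotated feedback.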