Teaching Models to Balance Resisting and Accepting Persuasion

📅 2024-10-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to adversarial persuasion, yet they must balance two competing objectives: robustness against harmful misinformation and receptivity to beneficial corrections that improve their answers. This paper introduces the concept of persuasion balance and proposes the Persuasion-Balanced Training (PBT) framework. PBT uses multi-agent recursive dialogue trees to generate training data and optimizes models via preference optimization to accept persuasion only when appropriate; dialogues between smaller 7-8B models can be used to train much larger 70B models. Experiments show that PBT improves resistance to misinformation and resilience to inappropriate challenges, and yields the best overall performance on data mixing positive and negative persuasion. PBT models are also better, more stable teammates in multi-agent debates, reducing the dependence of team performance on the order in which agents present their answers.

📝 Abstract
Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models. Moreover, PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates across two domains (trivia and commonsense QA). We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
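The abstract's core data recipe, creating preference pairs from persuasion dialogues, can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's actual code: the tree representation, the `preference_pair` and `build_pairs` helpers, and the canned response strings are all assumptions. The idea it demonstrates is that for each persuasion attempt, the "chosen" continuation is to flip when the proposed answer is beneficial (matches the gold answer) and to hold when it is misleading.

```python
def preference_pair(initial, proposed, gold):
    """Return (chosen, rejected) responses to one persuasion attempt."""
    hold = f"I maintain my answer: {initial}"
    flip = f"You are right, the answer is {proposed}"
    if proposed == gold and initial != gold:
        return flip, hold  # positive persuasion: accepting is preferred
    return hold, flip      # negative persuasion: resisting is preferred

def build_pairs(tree, gold):
    """Recursively walk a dialogue tree of (answer, children) nodes,
    emitting one preference pair per parent -> child persuasion attempt."""
    answer, children = tree
    pairs = []
    for child in children:
        pairs.append(preference_pair(answer, child[0], gold))
        pairs.extend(build_pairs(child, gold))
    return pairs

# Toy tree: the agent answers "Paris"; one persuader pushes "Lyon"
# (misleading), then a second pushes "Paris" again (corrective).
tree = ("Paris", [("Lyon", [("Paris", [])])])
pairs = build_pairs(tree, gold="Paris")
```

Pairs in this (chosen, rejected) form are the standard input format for DPO-style preference optimization, which the abstract names as the training method.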
Problem

Research questions and friction points this paper is trying to address.

Teaching models to balance persuasion resistance and acceptance
Optimizing models for both positive and negative persuasion
Improving model performance in multi-agent debates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent recursive dialogue trees
Preference optimization training
Balancing positive and negative persuasion
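The order dependence that the abstract reports for debate teams can be sketched as a simple evaluation harness. Everything here is a hypothetical stand-in: `debate` crudely mimics the anchoring failure the abstract describes (the team adopts the first speaker's answer), and `order_gap` just scores both speaking orders over (strong answer, weak answer, gold answer) triples.

```python
def debate(first_answer, second_answer):
    # Toy stand-in for a model-backed debate loop: the team anchors on
    # whoever speaks first, mimicking the instability described above.
    return first_answer

def order_gap(questions):
    """Return team accuracy with the strong model speaking first vs. second,
    over (strong_answer, weak_answer, gold_answer) triples."""
    acc_strong_first = sum(debate(s, w) == g for s, w, g in questions) / len(questions)
    acc_weak_first = sum(debate(w, s) == g for s, w, g in questions) / len(questions)
    return acc_strong_first, acc_weak_first

# Strong model is right on all three items; weak model only on the last.
qs = [("A", "B", "A"), ("C", "D", "C"), ("E", "E", "E")]
sf, wf = order_gap(qs)
```

Under this toy anchoring behavior the team gets the strong model's accuracy in one order and the weak model's in the other; the paper's claim is that PBT shrinks exactly this gap, with the stronger model pulling the weaker one up regardless of order.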