Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inconsistency of large language models (LLMs) under minor prompt perturbations, this paper proposes Flip-Flop Consistency (F²C), an unsupervised consistency-optimization method. F²C takes the majority vote across multiple prompt variants as a hard pseudo-label and trains against it with a Consensus Cross-Entropy (CCE) loss, and it adds a representation alignment loss that pulls low-confidence and non-majority predictions toward the consensus representation in latent space, thereby improving robustness to prompt variations. Fully unsupervised and annotation-free, F²C raises observed agreement by 11.62%, improves mean F₁ by 8.94%, and reduces performance variance across prompt formats by 3.29% on average over 11 benchmarks, while generalizing to out-of-domain data and unseen prompt templates. Its core contribution is jointly exploiting consensus-driven supervisory signals and latent-space alignment, enabling efficient, scalable, fully unsupervised consistency training.
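The CCE idea in the summary above can be sketched in a few lines: each prompt variant votes with its argmax prediction, the majority label becomes a hard pseudo-label, and every variant is penalized by cross-entropy against it. This is a minimal illustrative sketch, not the paper's implementation; the function name and the list-of-distributions input format are assumptions.

```python
import math
from collections import Counter

def consensus_cross_entropy(variant_probs):
    """Hard pseudo-label via majority vote, then mean cross-entropy.

    variant_probs: one probability distribution over labels per prompt
    variant, e.g. [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]].
    (Hypothetical interface -- the paper operates on LLM output logits.)
    """
    # Each prompt variant "votes" for its highest-probability label.
    votes = [max(range(len(p)), key=p.__getitem__) for p in variant_probs]
    # The majority label across variants becomes the hard pseudo-label.
    pseudo_label = Counter(votes).most_common(1)[0][0]
    # Average cross-entropy of every variant against the consensus label.
    return sum(-math.log(p[pseudo_label]) for p in variant_probs) / len(variant_probs)
```

When all variants already agree with full confidence the loss is zero, so the gradient only pushes on variants that deviate from the consensus.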

📝 Abstract
Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM robustness to prompt phrasing variations
Reducing performance variance across different prompt formats
Enhancing model consistency without supervised training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised training method improves prompt perturbation robustness
Consensus Cross-Entropy creates pseudo-labels via majority voting
Representation alignment loss pulls predictors toward consensus
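The second innovation above, the representation alignment loss, can be sketched as follows: high-confidence variants that voted with the majority define a consensus representation, and the remaining variants are pulled toward it. This is a hedged sketch under stated assumptions; the function name, the confidence threshold, and the use of a squared-L2 pull are illustrative choices, not details confirmed by the abstract.

```python
def alignment_loss(reps, confidences, votes, majority_label, conf_threshold=0.9):
    """Pull low-confidence / non-majority variants toward the consensus.

    reps:        one representation vector (list of floats) per prompt variant.
    confidences: max predicted probability of each variant.
    votes:       each variant's predicted label.
    conf_threshold is an assumed hyperparameter, not from the paper.
    """
    # Anchors: high-confidence variants that agree with the majority label.
    anchors = [r for r, c, v in zip(reps, confidences, votes)
               if c >= conf_threshold and v == majority_label]
    if not anchors:
        return 0.0
    dim = len(reps[0])
    # Consensus representation = mean of the anchor representations.
    consensus = [sum(a[d] for a in anchors) / len(anchors) for d in range(dim)]
    # Remaining variants get pulled toward the consensus (squared L2 distance).
    pulled = [r for r, c, v in zip(reps, confidences, votes)
              if c < conf_threshold or v != majority_label]
    if not pulled:
        return 0.0
    return sum(sum((r[d] - consensus[d]) ** 2 for d in range(dim))
               for r in pulled) / len(pulled)
```

Only the non-anchor variants contribute to the loss, so gradients move stragglers toward the consensus rather than dragging the consensus toward them.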
👤 Authors
Parsa Hejabi — University of Southern California
Elnaz Rahmati — University of Southern California
Alireza S. Ziabari — University of Southern California
Morteza Dehghani — University of Southern California

Fields: Natural Language Processing, Machine Learning