🤖 AI Summary
Large language models (LLMs) exhibit pronounced sensitivity to non-semantic variations in prompt formatting, commonly called prompt brittleness, which leads to substantial performance fluctuations. To address this, we propose Mixture of Formats (MOF), the first method to adapt style disentanglement, a technique originally developed in computer vision, to prompt robustness research. MOF constructs few-shot prompts via style-diverse sampling, explicitly decoupling semantic content from formatting at prompt-construction time. We introduce a cross-format generalization evaluation framework and validate MOF across multiple benchmarks. Results show that MOF significantly enhances LLM robustness to formatting perturbations, reducing performance variance by 47% on average and improving accuracy by 2.1%. Crucially, MOF requires no model fine-tuning or additional parameters, offering a lightweight, scalable paradigm for robust prompt engineering.
📝 Abstract
Large language models (LLMs) have gained popularity in recent years for their utility in a wide range of applications. However, they are sensitive to non-semantic changes in prompt format: small formatting changes can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on techniques for identifying the optimal prompt for a specific task. Some studies have also examined prompt brittleness and proposed methods to quantify the resulting performance variation; however, no simple solution has yet been found. We propose Mixture of Formats (MOF), a simple and efficient technique that addresses prompt brittleness in LLMs by diversifying the styles of the few-shot examples in the prompt. MOF is inspired by computer vision techniques that use style-diverse datasets to prevent models from associating a specific style with the target variable. Empirical results show that MOF reduces style-induced prompt brittleness across various LLMs while also improving overall performance across prompt variations and datasets.
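The core idea above, rendering each few-shot example in a different format so the model cannot tie one surface style to the task, can be sketched in a few lines. This is an illustrative assumption of how such a prompt builder might look; the format templates and the `build_mof_prompt` helper below are hypothetical, not the paper's actual implementation:

```python
import random

# Hypothetical format templates; the paper's actual style set is not
# specified here, so these are illustrative assumptions.
FORMATS = [
    lambda q, a: f"Q: {q}\nA: {a}",
    lambda q, a: f"Question: {q}\nAnswer: {a}",
    lambda q, a: f"Input: {q} => Output: {a}",
    lambda q, a: f"{q}\n### Response: {a}",
]

def build_mof_prompt(examples, query, rng=random):
    """Render each few-shot (question, answer) pair in a distinct,
    randomly sampled style so no single format dominates the prompt."""
    styles = rng.sample(FORMATS, k=min(len(examples), len(FORMATS)))
    shots = [style(q, a) for style, (q, a) in zip(styles, examples)]
    # The final query reuses one sampled style with the answer left blank.
    shots.append(styles[0](query, "").rstrip())
    return "\n\n".join(shots)
```

Because diversification happens entirely at prompt-construction time, no model weights are touched, which is what keeps the approach lightweight.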