🤖 AI Summary
This work addresses a limitation of existing single-pass generation methods: because they fail to harness large language models' self-correction capabilities, they often produce counterfactual samples that do not reliably flip model predictions. To overcome this, we propose iFlip, an iterative refinement approach that integrates three sources of feedback: model confidence, feature attribution, and natural language explanations. Guided by highly attributed keywords and an early-stopping strategy, iFlip progressively refines counterfactual instances until the target label flips. Experiments show that iFlip improves label-flipping success rates by 57.8% on average over five state-of-the-art baselines, and user studies confirm significant gains in sample completeness, feasibility, and user satisfaction. The generated counterfactuals also improve model robustness and overall performance when used for data augmentation.
📝 Abstract
Counterfactual examples are minimal edits to an input that alter a model's prediction. They are widely employed in explainable AI to probe model behavior and in natural language processing (NLP) to augment training data. However, generating valid counterfactuals with large language models (LLMs) remains challenging: existing single-pass methods often fail to induce reliable label changes and neglect LLMs' self-correction capabilities. To explore this untapped potential, we propose iFlip, an iterative refinement approach that leverages three types of feedback: model confidence, feature attribution, and natural language explanations. Our results show that iFlip achieves, on average, 57.8% higher validity than five state-of-the-art baselines, as measured by the label-flipping rate. A user study further corroborates that iFlip outperforms the baselines in completeness, feasibility, and overall satisfaction. In addition, ablation studies demonstrate that three components are paramount for iFlip to generate valid counterfactuals: using an appropriate number of iterations, pointing to highly attributed words, and early stopping. Finally, counterfactuals generated by iFlip enable effective counterfactual data augmentation, substantially improving model performance and robustness.
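The abstract describes an iterate-until-flip loop driven by classifier feedback and highly attributed keywords. A minimal sketch of that loop is below; it is not the authors' implementation, and `classify`, `attribute`, and `revise` are hypothetical placeholders for the model calls the paper describes (the toy stand-ins exist only to make the sketch runnable).

```python
def iterative_counterfactual(text, target_label, classify, attribute, revise,
                             max_iters=5):
    """Refine `text` until `classify` predicts `target_label` or the
    iteration budget runs out. Returns (candidate, flipped?)."""
    candidate = text
    for _ in range(max_iters):
        label, confidence = classify(candidate)
        if label == target_label:        # early stopping: label has flipped
            return candidate, True
        keywords = attribute(candidate)  # highly attributed words to edit
        candidate = revise(candidate, target_label, keywords, confidence)
    return candidate, False

# Toy stand-ins so the sketch executes: a "classifier" that keys on the
# word "great", an "attributor" that flags the first word, and a
# "reviser" that swaps one attributed keyword for "great".
def toy_classify(t):
    return ("positive", 0.9) if "great" in t else ("negative", 0.8)

def toy_attribute(t):
    return [t.split()[0]]

def toy_revise(t, target, keywords, conf):
    return t.replace(keywords[0], "great", 1)

result, flipped = iterative_counterfactual(
    "terrible movie overall", "positive",
    toy_classify, toy_attribute, toy_revise)
# flipped is True; result is "great movie overall"
```

The early-stop check mirrors the ablation finding: once the label flips, further edits would only move the counterfactual away from the minimal change.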