🤖 AI Summary
Existing diffusion-based visual counterfactual generation methods suffer from high computational cost, slow sampling, and imprecise localization of the modified regions. This work proposes MaskDiME, a training-free diffusion framework that, for the first time, introduces an adaptive local sampling mechanism focused on decision-relevant regions, enabling semantically consistent, spatially precise, and high-fidelity local counterfactual generation. By combining a training-free diffusion backbone with a dynamic masking strategy, MaskDiME achieves state-of-the-art or comparable performance across five cross-domain visual benchmarks while accelerating inference by over 30× relative to the baseline method, substantially improving both efficiency and practical applicability.
📝 Abstract
Visual counterfactual explanations aim to reveal the minimal semantic modifications that alter a model's prediction, providing causal and interpretable insight into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, and effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling: generation is adaptively focused on decision-relevant regions while high image fidelity is preserved elsewhere. The framework is training-free, runs over 30× faster than the baseline method at inference, and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.
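To make the localized-sampling idea concrete, the sketch below shows one way a masked, classifier-guided reverse diffusion loop could confine counterfactual edits to a decision-relevant region: inside the mask, sampling follows guidance toward the target class, while outside the mask the sample is replaced by a re-noised copy of the original image. The function names, guidance scheme, and the assumption that the mask is given in advance are all illustrative simplifications, not MaskDiME's exact procedure (which derives the mask adaptively).

```python
# Illustrative sketch only: masked, classifier-guided DDPM sampling that confines
# counterfactual edits to a region of interest. The model/classifier APIs and the
# guidance scheme are assumptions for illustration, not MaskDiME's exact method.
import torch


@torch.no_grad()
def masked_counterfactual_sample(eps_model, classifier, x_orig, mask, target,
                                 betas, guidance_scale=3.0):
    """
    eps_model(x_t, t)  -> predicted noise, same shape as x_t   (assumed API)
    classifier(x_t, t) -> class logits                          (assumed API)
    x_orig: (B, C, H, W) original image in [-1, 1]
    mask:   (B, 1, H, W), 1 = editable (decision-relevant) region
    target: (B,) long tensor of counterfactual class labels
    betas:  (T,) noise schedule
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    x = torch.randn_like(x_orig)  # start from pure noise
    for t in reversed(range(T)):
        a_t, ab_t = alphas[t], alpha_bar[t]
        tt = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)

        # Classifier guidance toward the target class (needs gradients).
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            log_p = torch.log_softmax(classifier(x_in, tt), dim=-1)
            sel = log_p.gather(1, target[:, None]).sum()
            grad = torch.autograd.grad(sel, x_in)[0]

        # Standard DDPM posterior mean, shifted by the guidance gradient
        # only inside the mask so edits stay localized.
        eps = eps_model(x, tt)
        mean = (x - (1 - a_t) / torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(a_t)
        mean = mean + guidance_scale * (1 - a_t) * grad * mask

        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise

        # Localized sampling: outside the mask, substitute a re-noised copy of
        # the original image so only decision-relevant pixels are ever modified.
        if t > 0:
            ab_prev = alpha_bar[t - 1]
            x_known = (torch.sqrt(ab_prev) * x_orig
                       + torch.sqrt(1 - ab_prev) * torch.randn_like(x_orig))
        else:
            x_known = x_orig
        x = mask * x + (1 - mask) * x_known

    return x  # counterfactual: edited inside the mask, original elsewhere
```

In this sketch the fidelity guarantee outside the mask comes from the blending step rather than from any loss term, which is why no retraining of the diffusion model or classifier is needed.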