🤖 AI Summary
This work addresses the vulnerability of existing synthetic-data-based learning methods to synthesis-induced biases and artifacts, which often lead models to rely on spurious correlations from non-target regions. To mitigate this, the authors propose using provenance information inherent in the synthetic data generation process as a supervisory signal. By decomposing input gradients and explicitly suppressing those over non-target regions, the method steers the model toward discriminative features of the target itself. The technique requires no additional annotations, and the authors report consistent performance gains across multiple tasks, including weakly supervised object localization, spatio-temporal action localization, and image classification.
📝 Abstract
Learning with synthetic data has attracted attention as an effective way to increase the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly, by diversifying training samples, and do not explicitly teach the model which regions of the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during training data synthesis as an auxiliary supervisory signal. This information indicates whether each region of the input space originates from the target object, and it is used to promote representations focused on target regions. Specifically, input gradients are decomposed according to the target and non-target regions known from synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This reduces the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.
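
To make the idea of input gradient guidance concrete, here is a minimal sketch in pure Python. It is not the paper's implementation: the abstract does not give the loss, so the logistic model, the squared-norm penalty, the weight `lam`, and the function name `guided_loss` are all assumptions chosen for illustration. The only element taken from the abstract is the mechanism itself: a provenance mask from synthesis splits the input gradient into target and non-target parts, and the non-target part is penalized.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def guided_loss(w, b, x, y, target_mask, lam=1.0):
    """Cross-entropy plus a penalty on input gradients over non-target regions.

    Hypothetical sketch. w, x and target_mask are equal-length lists;
    target_mask[i] is 1 if input element i originated from the target
    object during synthesis, and 0 otherwise. y is a binary label (0 or 1).
    Returns (total_loss, target_gradient, non_target_gradient).
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    ce = -(y * math.log(p) + (1 - y) * math.log(1 - p))

    # For a logistic model the input gradient of the cross-entropy is
    # analytic: dL/dx_i = (p - y) * w_i. A deep model would need autodiff.
    grad = [(p - y) * wi for wi in w]

    # Decompose the input gradient using the provenance mask.
    g_target = [g * m for g, m in zip(grad, target_mask)]
    g_non_target = [g * (1 - m) for g, m in zip(grad, target_mask)]

    # Assumed penalty: squared L2 norm of the non-target gradient.
    penalty = sum(g * g for g in g_non_target)
    return ce + lam * penalty, g_target, g_non_target


# Toy example: the first two input elements come from the target object.
w = [0.5, -1.0, 2.0, 0.25]
x = [1.0, 2.0, 0.5, -1.0]
mask = [1, 1, 0, 0]
total, g_t, g_n = guided_loss(w, 0.1, x, 1, mask)
print(total, g_t, g_n)
```

In this toy model the penalty shrinks as the weights on non-target inputs shrink, which is the behavior the abstract describes as suppressing reliance on non-target regions. With a neural network, the same penalty would be computed from automatically differentiated input gradients and then differentiated again with respect to the parameters.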