Attack logics, not outputs: Towards efficient robustification of deep neural networks by falsifying concept-based properties

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep neural networks in computer vision are vulnerable to adversarial attacks; existing verification methods focus solely on output-label flips (e.g., *stop_sign* → ¬*stop_sign*) and fail to detect violations of human-interpretable conceptual logic (e.g., breach of the entailment "red ∧ octagonal → *stop_sign*"). Method: We propose a concept-logic-based adversarial falsification framework that generalizes adversarial objectives from label misclassification to violation of logical implications among semantic concepts. Leveraging explainable AI to extract high-level concepts, we encode domain knowledge as formal logical constraints and perform logic-consistency verification via constrained optimization. Contribution/Results: Theoretical analysis shows our method operates in a significantly reduced search space and achieves higher verification efficiency. It effectively exposes latent logical flaws in models, thereby simultaneously enhancing adversarial robustness, interpretability, and safety, without requiring model retraining or architectural modification.
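The "formal logical constraints" above can be made concrete with soft (fuzzy) semantics: each concept receives a confidence in [0, 1] from an XAI probe, and the implication red ∧ octagonal → stop_sign is violated to the degree that the antecedent holds while the consequent fails. A minimal sketch under that assumption (the scoring function and the product t-norm choice are illustrative, not taken from the paper):

```python
# Minimal sketch: encode the concept-based property
#   red AND octagonal -> stop_sign
# as a scalar violation score in [0, 1].
# Concept confidences would come from XAI probes on the trained
# network; here they are plain floats. The product t-norm models
# the soft conjunction.

def violation(red: float, octagonal: float, stop_sign: float) -> float:
    """Degree to which red ∧ octagonal → stop_sign is violated.

    All inputs are concept confidences in [0, 1]. The implication
    is satisfied when a strong antecedent comes with a strong
    consequent; violation = antecedent * (1 - consequent).
    """
    antecedent = red * octagonal          # product t-norm for ∧
    return antecedent * (1.0 - stop_sign)

# A logically consistent prediction: strong antecedent, strong consequent.
assert violation(0.9, 0.9, 0.95) < 0.1
# An illogical prediction: looks red and octagonal, yet the
# classifier denies stop_sign -- large violation.
assert violation(0.9, 0.9, 0.05) > 0.7
```

An attack on the property then maximizes this score over admissible input perturbations, rather than merely flipping the output label.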

📝 Abstract
Deep neural networks (NNs) for computer vision are vulnerable to adversarial attacks, i.e., minuscule malicious changes to inputs may induce unintuitive outputs. One key approach to verify and mitigate such robustness issues is to falsify expected output behavior. This allows, e.g., to locally prove security, or to (re)train NNs on obtained adversarial input examples. Due to the black-box nature of NNs, current attacks only falsify a class of the final output, such as flipping from $\texttt{stop\_sign}$ to $\neg\texttt{stop\_sign}$. In this short position paper we generalize this to search for generally illogical behavior, as considered in NN verification: falsify constraints (concept-based properties) involving further human-interpretable concepts, like $\texttt{red}\wedge\texttt{octagonal}\rightarrow\texttt{stop\_sign}$. For this, an easy implementation of concept-based properties on already trained NNs is proposed using techniques from explainable artificial intelligence. Further, we sketch a theoretical proof that attacks on concept-based properties are expected to have a reduced search space compared to simple class falsification, whilst arguably being more aligned with intuitive robustness targets. As an outlook on this work in progress, we hypothesize that this approach has the potential to efficiently and simultaneously improve logical compliance and robustness.
Problem

Research questions and friction points this paper is trying to address.

Deep neural networks are vulnerable to adversarial attacks on inputs
Current methods only falsify final output classes, not logical behavior
Proposes concept-based property falsification to improve robustness efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Falsifying concept-based properties for robustness
Using explainable AI to implement logical constraints
Reducing search space by targeting illogical behavior
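The bullets above describe falsification as a search for inputs whose concept scores break the encoded logic. A toy end-to-end sketch, assuming hypothetical concept scorers on a 2-D "input" and a simple random search in an L-infinity ball (the paper's actual attack uses constrained optimization; this only illustrates the objective):

```python
import random

# Toy falsification loop: search an eps-ball around x0 for a
# perturbation maximizing the violation of
#   red AND octagonal -> stop_sign.
# The concept scorers are stand-ins for the paper's XAI probes.

def clamp01(v):
    return max(0.0, min(1.0, v))

def concepts(x):
    """Hypothetical scorers mapping a 2-D input to concept confidences."""
    red = clamp01(x[0])
    octagonal = clamp01(x[1])
    stop_sign = clamp01(x[0] * x[1])  # the classifier entangles both cues
    return red, octagonal, stop_sign

def violation(x):
    red, octagonal, stop_sign = concepts(x)
    return red * octagonal * (1.0 - stop_sign)  # product t-norm semantics

def falsify(x0, eps=0.3, iters=2000, seed=0):
    """Random search in the L-inf eps-ball for maximal property violation."""
    rng = random.Random(seed)
    best_x, best_v = list(x0), violation(x0)
    for _ in range(iters):
        cand = [xi + rng.uniform(-eps, eps) for xi in x0]
        v = violation(cand)
        if v > best_v:
            best_x, best_v = cand, v
    return best_x, best_v

x_adv, v = falsify([0.8, 0.8])
```

The property-violation objective constrains the search to perturbations that keep the antecedent concepts active while suppressing the consequent, which is one intuition behind the reduced search space compared to unconstrained label flipping.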