Alignment and Adversarial Robustness: Are More Human-Like Models More Secure?

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether representational alignment in vision models, i.e., similarity to human neural or behavioral responses, enhances adversarial robustness. We construct a multidimensional evaluation framework spanning 118 models and 106 benchmarks, incorporating fMRI/MEG fitting (neural alignment), psychophysical behavioral tests (behavioral alignment), and engineering task performance, with robustness quantified via AutoAttack. We find only a weak average correlation between overall alignment and robustness; however, specific alignment metrics, such as shape bias, are strong predictors, achieving over 85% accuracy in robustness ranking across diverse architectures. Critically, the effect of alignment on robustness varies substantially across alignment dimensions, challenging the oversimplified "more human-like, more secure" hypothesis. To our knowledge, this is the first systematic disentanglement of alignment types and their distinct functional roles, laying groundwork for interpretable and robust vision modeling.
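As a rough illustration of the kind of analysis the summary describes, the sketch below computes a rank correlation between a per-model alignment score and robust accuracy, plus the pairwise ranking accuracy of a single predictor. All model names, numbers, and field names (`shape_bias`, `robust_acc`) are made up for illustration; they are not the paper's data or code.

```python
# Illustration of a correlation / ranking analysis over per-model scores.
# Hypothetical data: "shape_bias" and "robust_acc" are illustrative fields.
from itertools import combinations
from scipy.stats import spearmanr

models = [
    {"name": "resnet50",     "shape_bias": 0.21, "robust_acc": 0.02},
    {"name": "vit_b_16",     "shape_bias": 0.45, "robust_acc": 0.10},
    {"name": "adv_resnet50", "shape_bias": 0.62, "robust_acc": 0.35},
    {"name": "clip_vit",     "shape_bias": 0.57, "robust_acc": 0.08},
]

alignment = [m["shape_bias"] for m in models]
robustness = [m["robust_acc"] for m in models]

# Rank correlation between one alignment metric and robust accuracy.
rho, p = spearmanr(alignment, robustness)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# Pairwise ranking accuracy: how often the alignment metric orders a
# pair of models the same way their robust accuracies do.
pairs = list(combinations(range(len(models)), 2))
correct = sum(
    (alignment[i] - alignment[j]) * (robustness[i] - robustness[j]) > 0
    for i, j in pairs
)
print(f"ranking accuracy = {correct / len(pairs):.0%}")
```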

📝 Abstract
Representational alignment refers to the extent to which a model's internal representations mirror biological vision, offering insights into both neural similarity and functional correspondence. Recently, some more aligned models have demonstrated higher resiliency to adversarial examples, raising the question of whether more human-aligned models are inherently more secure. In this work, we conduct a large-scale empirical analysis to systematically investigate the relationship between representational alignment and adversarial robustness. We evaluate 118 models spanning diverse architectures and training paradigms, measuring their neural and behavioral alignment and engineering task performance across 106 benchmarks as well as their adversarial robustness via AutoAttack. Our findings reveal that while average alignment and robustness exhibit a weak overall correlation, specific alignment benchmarks serve as strong predictors of adversarial robustness, particularly those that measure selectivity towards texture or shape. These results suggest that different forms of alignment play distinct roles in model robustness, motivating further investigation into how alignment-driven approaches can be leveraged to build more secure and perceptually-grounded vision models.
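For concreteness, here is a minimal sketch of how robust accuracy under AutoAttack is typically measured with the public autoattack package (github.com/fra31/auto-attack). The model, the placeholder tensors, and the 8/255 L-infinity budget are assumptions for illustration, not details taken from the paper.

```python
# Minimal robust-accuracy sketch with AutoAttack.
# Install: pip install git+https://github.com/fra31/auto-attack
import torch
from torchvision.models import resnet50
from autoattack import AutoAttack

# Placeholder model; the paper evaluates 118 pretrained models, but any
# torch classifier mapping [0, 1] images to logits works here.
model = resnet50(weights=None).eval()

# Placeholder data: images in [0, 1] and integer class labels.
x_test = torch.rand(16, 3, 224, 224)
y_test = torch.randint(0, 1000, (16,))

# Standard AutoAttack suite (APGD-CE, APGD-T, FAB-T, Square) at an
# L-infinity budget of 8/255, a common ImageNet setting assumed here.
adversary = AutoAttack(model, norm="Linf", eps=8 / 255, version="standard")
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=16)

# Robust accuracy = clean-label accuracy on the adversarial examples.
with torch.no_grad():
    robust_acc = (model(x_adv).argmax(dim=1) == y_test).float().mean().item()
print(f"robust accuracy: {robust_acc:.1%}")
```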
Problem

Research questions and friction points the paper addresses.

Does representational alignment with human vision make models more adversarially robust?
How resilient are human-aligned models to adversarial examples?
Which specific alignment benchmarks, if any, predict adversarial robustness?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale evaluation: 118 models, 106 alignment and performance benchmarks, robustness via AutoAttack
Disentangles alignment types: overall alignment correlates only weakly with robustness
Texture and shape selectivity benchmarks strongly predict robustness (see the shape-bias sketch after this list)
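Since shape/texture selectivity is the headline predictor, below is a hedged sketch of the standard shape-bias metric in the style of Geirhos et al.'s cue-conflict benchmark; the paper does not specify its exact implementation, and the labels here are invented examples.

```python
# Shape bias on cue-conflict stimuli: of the trials where the model picks
# either the shape class or the texture class, what fraction go to shape?
def shape_bias(predictions, shape_labels, texture_labels):
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Example: 3 of 4 cue-conflict decisions follow shape -> shape bias 0.75.
print(shape_bias(["cat", "cat", "dog", "elephant"],
                 ["cat", "cat", "dog", "bird"],
                 ["elephant", "dog", "cat", "elephant"]))
```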