BIRD: Behavior Induction via Representation-structure Distillation

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three obstacles to preserving human-aligned behaviors (robustness, fairness, and honesty): such behaviors are easily forgotten during model transfer, fine-tuning is costly, and existing approaches rely on task-specific data. We propose a representation-structure distillation framework that enables data-free transfer of alignment behaviors by matching the geometric structure of latent spaces between teacher and student models; to our knowledge, this is the first formalization of representation-structure matching as the core mechanism for behavioral alignment transfer. Through analysis of transfer outcomes, we identify three interpretable properties of the teacher's representations (task relevance, behavioral relevance, and complementary knowledge) that together explain up to 85% of the variance in transfer performance. Empirically, the method improves out-of-distribution robustness in image classification by up to 16% in accuracy, transfers alignment from compact teachers (as small as 1/25 the parameters of the student) to large-scale students, and generalizes across architectures and data distributions in a study of 400+ teacher–student pairs.

📝 Abstract
Human-aligned deep learning models exhibit behaviors consistent with human values, such as robustness, fairness, and honesty. Transferring these behavioral properties to models trained on different tasks or data distributions remains challenging: aligned behavior is easily forgotten during fine-tuning, and collecting task-specific data that preserves this behavior can be prohibitively costly. We introduce BIRD (Behavior Induction via Representation-structure Distillation), a flexible framework for transferring aligned behavior by matching the internal representation structure of a student model to that of a teacher. Applied to out-of-distribution robustness in image classification, BIRD outperforms fine-tuning, transfer learning, and continual learning methods, improving robust accuracy by up to 16% over the next strongest baseline. It remains effective even when the teacher is trained on a much simpler dataset and is $25\times$ smaller than the student. In a large-scale study of over 400 teacher–student pairs, we show that three interpretable and computable properties of the teacher's representations (i.e., task relevance, behavioral relevance, and complementary knowledge) explain up to 85% of the variance in transfer success. These insights offer practical guidance for teacher selection and design. BIRD turns small, well-aligned models into scalable alignment seeds, removing a key bottleneck in deploying safe AI systems in the wild.
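The abstract's core idea, matching the internal representation structure of a student to that of a teacher, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration of generic representation-structure matching (comparing batchwise cosine-similarity matrices), not the authors' exact BIRD objective; function names and the choice of similarity measure are hypothetical. Because only the batch-by-batch similarity structure is compared, the teacher's feature dimension need not match the student's, which is what allows a much smaller teacher to guide a larger student.

```python
import numpy as np

def similarity_matrix(feats):
    """Cosine-similarity (Gram) matrix over a batch of feature vectors.

    feats: array of shape (batch, dim). The output is (batch, batch) and
    depends only on the batch size, not on the feature dimension.
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

def structure_distillation_loss(student_feats, teacher_feats):
    """Mean squared difference between the two batches' similarity structures.

    Hypothetical loss for illustration: the teacher and student may have
    different feature widths, so a compact teacher can still constrain the
    geometric structure of a large student's latent space.
    """
    s = similarity_matrix(student_feats)
    t = similarity_matrix(teacher_feats)
    return float(np.mean((s - t) ** 2))

# Toy usage: a wide student batch distilled against a narrow teacher batch.
rng = np.random.default_rng(0)
student = rng.normal(size=(8, 128))  # 8 examples, 128-dim student features
teacher = rng.normal(size=(8, 32))   # same 8 examples, 32-dim teacher features
loss = structure_distillation_loss(student, teacher)
print(loss)
```

In practice such a loss would be added to the student's task loss and minimized by gradient descent over mini-batches; the sketch only shows the structural comparison itself.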
Problem

Research questions and friction points this paper is trying to address.

Transferring human-aligned behavior to different models
Improving robustness in image classification tasks
Identifying key teacher representation properties for success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transfers behavior via representation-structure distillation
Matches student model structure to teacher model
Improves robust accuracy by up to 16%