DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the tendency of safety-tuned large language models to avoid acknowledging demographic group differences, which leads to incorrect responses, unwarranted refusals, or mechanically applied “equal treatment” defaults. Fine-tuning for accuracy on this problem introduces a side effect the authors term harm drift: model-generated explanations become more harmful as decision accuracy improves. The work introduces DART (Distill-Audit-Repair Training), a framework that distills label-conditioned reasoning from a teacher model, audits outputs against the baseline to detect harm drift, and repairs problematic cases via severity-weighted fine-tuning. Across eight benchmarks, DART raises Llama-3-8B-Instruct accuracy on difference-awareness classification from 39.0% to 68.8%, with the largest gain on equal-treatment prompts (11.3% to 72.6%), while reducing harm drift cases by 72.6%. On 280 open-ended real-world queries, difference-appropriate response rates rise from 39.8% to 77.5% and refusal rates drop from 34.3% to 3.0%.

📝 Abstract

Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatment" defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the baseline identified. To mitigate this, we introduce DART (Distill-Audit-Repair Training), which distills label-conditioned reasoning from a teacher, audits outputs for harm drift cases relative to baseline, and repairs problematic cases via severity-weighted fine-tuning. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with largest gains on equal-treatment prompts (11.3% to 72.6%), while reducing harm drift cases by 72.6%. It also transfers to 280 open-ended real-world queries across medical, legal, policy, and educational domains, improving difference-appropriate responses from 39.8% to 77.5% while reducing refusals from 34.3% to 3.0%. Our results demonstrate that accuracy and safety need not conflict when explicit detection and repair mechanisms are in place.
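The audit and repair steps described in the abstract can be illustrated with a small sketch. The abstract does not specify how harm is scored or how severity enters the loss, so everything here is an assumption made for illustration: the per-example harm scores in [0, 1], the `margin` threshold, the definition of severity as the harm gap over baseline, and the normalization of severities into loss weights are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AuditedExample:
    """One prompt with hypothetical harm scores in [0, 1] for two explanations."""
    prompt: str
    baseline_harm: float  # harm score of the baseline model's explanation (assumed)
    tuned_harm: float     # harm score of the fine-tuned model's explanation (assumed)


def drift_severity(ex: AuditedExample, margin: float = 0.1) -> float:
    """Audit step (assumed form): flag harm drift when the tuned output is
    more harmful than the baseline by more than `margin`; severity is the gap."""
    gap = ex.tuned_harm - ex.baseline_harm
    return gap if gap > margin else 0.0


def repair_weights(examples: list[AuditedExample], margin: float = 0.1) -> list[float]:
    """Repair step (assumed form): turn severities into per-example loss
    weights that sum to 1, so only drift cases contribute to the repair loss."""
    severities = [drift_severity(ex, margin) for ex in examples]
    total = sum(severities)
    if total == 0:
        return [0.0] * len(examples)
    return [s / total for s in severities]


def severity_weighted_loss(losses: list[float], weights: list[float]) -> float:
    """Weighted sum of per-example losses, emphasizing severe drift cases."""
    return sum(loss * w for loss, w in zip(losses, weights))


examples = [
    AuditedExample("q1", baseline_harm=0.25, tuned_harm=0.25),  # no drift
    AuditedExample("q2", baseline_harm=0.25, tuned_harm=0.5),   # mild drift
    AuditedExample("q3", baseline_harm=0.0, tuned_harm=0.75),   # severe drift
]
weights = repair_weights(examples)
loss = severity_weighted_loss([2.0, 4.0, 8.0], weights)
```

In this toy setup the example without drift receives zero weight and the most severe drift case dominates the repair loss. A real pipeline would replace the hand-set scores with outputs from a harm classifier or judge model and apply the weights to token-level training losses.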

Problem

Research questions and friction points this paper is trying to address.

harm drift

difference-awareness

identity-blindness

safety tuning

demographic differences

Innovation

Methods, ideas, or system contributions that make the work stand out.

harm drift

difference-awareness

Distill-Audit-Repair Training