Beyond Creed: A Non-Identity Safety Condition as a Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the impact of supervised statement phrasing—particularly identity framing—on model safety in low-data LoRA fine-tuning. The authors construct four supervision conditions built from identical safety rules but differing in linguistic formulation, and evaluate them on three prominent instruction-tuned models: Llama, Gemma, and Qwen. Refusal of harmful requests is assessed on HarmBench using dual LLM-judge adjudication (DeepSeek v3.2 and Sonnet 4.6) with disagreements resolved manually, alongside general-capability evaluations on MMLU and ARC-Challenge. Results show that non-identity safety supervision yields the highest refusal rates (74.4%, 76.9%, and 74.1%, respectively) without relying on identity-related language or compromising general capabilities, challenging a strong version of the assumption that identity framing is necessary for effective safety alignment.

📝 Abstract
How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate on HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest condition on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of $D > B > C \geq A > \text{baseline}$. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.
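The reconciled dual-judge pipeline can be sketched as follows. This is a minimal illustration, assuming a simple agree-or-escalate rule; the class and function names, and the manual-review fallback, are hypothetical and not the authors' actual implementation.

```python
# Hypothetical sketch of a reconciled dual-judge refusal pipeline:
# verdicts where both judges agree are accepted automatically,
# disagreements are routed to manual resolution (as the abstract
# describes). Names here are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Verdict:
    behavior_id: str
    judge_a: bool  # True = judged a refusal (e.g. DeepSeek v3.2 judge)
    judge_b: bool  # True = judged a refusal (e.g. Sonnet 4.6 judge)

def reconcile(verdicts):
    """Accept agreements; collect disagreements for manual review."""
    agreed, needs_review = {}, []
    for v in verdicts:
        if v.judge_a == v.judge_b:
            agreed[v.behavior_id] = v.judge_a
        else:
            needs_review.append(v.behavior_id)
    return agreed, needs_review

def refusal_rate(agreed, manual_labels):
    """Refusal rate over all behaviors after manual resolution."""
    labels = {**agreed, **manual_labels}
    return sum(labels.values()) / len(labels)
```

On the full 320-behavior HarmBench set, `refusal_rate` would be computed per condition and model family once every disagreement case has a manual label.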
Problem

Research questions and friction points this paper is trying to address.

safety fine-tuning
identity framing
LoRA
low-data
instruction-tuned models
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-identity safety condition
LoRA fine-tuning
identity framing
safety supervision
low-data alignment