Limits of Convergence-Rate Control for Open-Weight Safety

📅 2026-02-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Open-weight foundation models can be fine-tuned for malicious purposes after release, yet existing training-time defenses lack theoretical guarantees. This work frames such defenses as a problem of controlling optimization convergence rates and links convergence behavior to the spectral structure of model weights. Building on this link, the authors propose SpecDef, an algorithm based on spectral reparameterization that provably and empirically slows both first- and second-order optimization in non-adversarial settings. The analysis also reveals a fundamental limit in adversarial settings that applies to a broad class of convergence-rate control methods, SpecDef included: an attacker with sufficient knowledge can restore fast convergence at the cost of a linear increase in model size.

📝 Abstract

Open-weight foundation models can be fine-tuned for harmful purposes after release, yet no existing training-resistance method provides theoretical guarantees. Treating these interventions as convergence-rate control problems allows us to connect optimization speed to the spectral structure of model weights. We leverage this insight to develop a novel understanding of convergence-rate control through spectral reparameterization and derive an algorithm, SpecDef, that both provably and empirically slows first- and second-order optimization in non-adversarial settings. In adversarial settings, we establish a fundamental limit on a broad class of convergence-rate control methods, including our own: an attacker with sufficient knowledge can restore fast convergence at the cost of a linear increase in model size. To overcome this limitation, future work will need to investigate methods that are not equivalent to controlling convergence rate.
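
The connection between optimization speed and spectral structure can be illustrated with a toy example. The sketch below is not SpecDef and is not taken from the paper; it only assumes a diagonal quadratic objective and plain gradient descent, and the function name `gd_steps_to_tol` is hypothetical. It shows the textbook effect that a wider spread of singular values makes first-order optimization take many more steps.

```python
import numpy as np

def gd_steps_to_tol(singular_values, tol=1e-6, max_steps=100_000):
    """Count gradient-descent steps needed to drive ||x|| below `tol`
    on the toy objective f(x) = 0.5 * ||diag(s) x||^2 (an assumption
    made for illustration, not the paper's setting)."""
    s = np.asarray(singular_values, dtype=float)
    hessian_eigs = s ** 2              # Hessian of f is diag(s)^2
    lr = 1.0 / hessian_eigs.max()      # standard stable step size 1/L
    x = np.ones_like(s)                # fixed starting point
    for step in range(max_steps):
        if np.linalg.norm(x) < tol:
            return step
        x = x - lr * hessian_eigs * x  # gradient of f is diag(s)^2 x
    return max_steps

# Same dimension, different spectra.
well_conditioned = gd_steps_to_tol([1.0, 1.0, 1.0])
ill_conditioned = gd_steps_to_tol([3.0, 1.0, 0.3])

print("steps with flat spectrum:  ", well_conditioned)
print("steps with spread spectrum:", ill_conditioned)
```

In this toy setting the step count is governed by the ratio of the largest to the smallest Hessian eigenvalue, which is one simple sense in which reshaping a spectrum can control convergence rate.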

Problem

Research questions and friction points this paper is trying to address.

open-weight safety

convergence-rate control

adversarial fine-tuning

training resistance

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

convergence-rate control

spectral reparameterization

open-weight safety