Position: Capability Control Should Be a Separate Goal From Alignment

📅 2026-02-05
🤖 AI Summary
This work addresses the growing risks of misuse and loss of control that arise from the broad applicability of foundation models, risks that existing alignment methods struggle to mitigate because they lack hard behavioral constraints. It establishes capability control as a core objective distinct from alignment and introduces a defense-in-depth framework spanning the data, learning, and system layers to enforce multi-granular behavioral constraints throughout the model lifecycle. By integrating techniques such as data distribution shaping, representation-level intervention, and runtime safeguards at the input, output, and action levels, the paper systematically constructs pathways for capability control. It also identifies critical open challenges, including the dual-use nature of knowledge and compositional generalization, offering a new paradigm for developing safe and controllable AI systems.

📝 Abstract
Foundation models are trained on broad data distributions, yielding generalist capabilities that enable many downstream applications but also expand the space of potential misuse and failures. This position paper argues that capability control -- imposing restrictions on permissible model behavior -- should be treated as a distinct goal from alignment. While alignment is often context and preference-driven, capability control aims to impose hard operational limits on permissible behaviors, including under adversarial elicitation. We organize capability control mechanisms across the model lifecycle into three layers: (i) data-based control of the training distribution, (ii) learning-based control via weight- or representation-level interventions, and (iii) system-based control via post-deployment guardrails over inputs, outputs, and actions. Because each layer has characteristic failure modes when used in isolation, we advocate for a defense-in-depth approach that composes complementary controls across the full stack. We further outline key open challenges in achieving such control, including the dual-use nature of knowledge and compositional generalization.
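The abstract's third layer, system-based control, composes independent guardrails over inputs and outputs so that no single check is a single point of failure. As a minimal illustrative sketch (not code from the paper; all class and function names here are hypothetical), such a defense-in-depth stack might look like:

```python
# Hypothetical sketch of system-level defense-in-depth: independent input and
# output checks are composed so that a bypass of one layer can still be caught
# by another. Names and policies are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import Callable

Check = Callable[[str], bool]  # returns True if the text is permissible


@dataclass
class GuardrailStack:
    """Runs every input check before generation and every output check after;
    any single failing check blocks the interaction."""
    input_checks: list[Check] = field(default_factory=list)
    output_checks: list[Check] = field(default_factory=list)

    def generate(self, prompt: str, model: Callable[[str], str]) -> str:
        if not all(check(prompt) for check in self.input_checks):
            return "[request refused by input guardrail]"
        response = model(prompt)
        if not all(check(response) for check in self.output_checks):
            return "[response withheld by output guardrail]"
        return response


# Toy policy: block a hypothetical restricted term at both layers, so an
# adversarial prompt that slips past the input filter is still screened
# again on the model's output.
BLOCKLIST = {"restricted"}


def no_blocked_terms(text: str) -> bool:
    return not any(term in text.lower() for term in BLOCKLIST)


stack = GuardrailStack(input_checks=[no_blocked_terms],
                       output_checks=[no_blocked_terms])

echo_model = lambda p: f"echo: {p}"  # stand-in for a foundation model
print(stack.generate("hello", echo_model))             # passes both layers
print(stack.generate("RESTRICTED topic", echo_model))  # blocked at input
```

The design point mirrored here is the paper's argument for composition: each layer has characteristic failure modes in isolation, so the stack only refuses or withholds, never re-enables, when layers disagree.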
Problem

Research questions and friction points this paper is trying to address.

capability control
foundation models
misuse prevention
adversarial elicitation
operational limits
Innovation

Methods, ideas, or system contributions that make the work stand out.

capability control
defense-in-depth
foundation models
alignment
compositional generalization