🤖 AI Summary
This work proposes a novel approach to enhancing the robustness and out-of-distribution generalization of Transformers without compromising in-distribution performance. Training is formulated as a constrained optimization problem, introducing, for the first time, a layer-wise monotonic descent constraint that requires the intermediate representations at successive layers to progressively reduce the expected loss. The constraint is enforced through a primal-dual training mechanism, yielding an Unrolled Transformer architecture that explicitly mimics the behavior of iterative optimization algorithms. Experimental results demonstrate that the proposed method significantly improves robustness and out-of-distribution generalization on both video denoising and text classification tasks, while maintaining competitive performance on in-distribution data.
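Schematically, the constrained problem described above can be written as follows (the notation here is ours, not necessarily the paper's): let $\ell_k(\theta)$ denote the loss evaluated at the layer-$k$ intermediate representation of an $L$-layer network with parameters $\theta$. The layer-wise descent constraint then reads

$$
\min_{\theta} \; \mathbb{E}\big[\ell_L(\theta)\big]
\quad \text{s.t.} \quad
\mathbb{E}\big[\ell_k(\theta)\big] \le \mathbb{E}\big[\ell_{k-1}(\theta)\big],
\qquad k = 1, \dots, L,
$$

and a primal-dual scheme optimizes the associated Lagrangian
$\mathcal{L}(\theta, \lambda) = \mathbb{E}[\ell_L] + \sum_k \lambda_k \big(\mathbb{E}[\ell_k] - \mathbb{E}[\ell_{k-1}]\big)$
by descent on $\theta$ and projected ascent on the multipliers $\lambda_k \ge 0$.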
📝 Abstract
We introduce a constrained optimization framework for training transformers that behave like descent algorithms. Specifically, we enforce layer-wise descent constraints on the objective function and replace standard empirical risk minimization (ERM) with a primal-dual training scheme. This approach yields models whose intermediate representations decrease the loss monotonically in expectation across layers. We apply our method to both unrolled transformer architectures and conventional pretrained transformers on video denoising and text classification tasks. Across these settings, we observe that constrained transformers achieve stronger robustness to perturbations and better out-of-distribution generalization, while preserving in-distribution performance.
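The primal-dual scheme can be illustrated with a minimal sketch. This is not the paper's implementation: the toy "unrolled" model (each layer applies a learned gradient-like step), the finite-difference gradients, and all hyperparameters are our own assumptions, chosen only to make the alternating primal descent / dual ascent structure concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_losses(w, x, y):
    """Per-layer squared losses of a toy unrolled model where layer k
    applies a learned gradient-like step h <- h - w_k * (h - y)."""
    h = x
    losses = []
    for wk in w:
        h = h - wk * (h - y)                     # hypothetical unrolled layer
        losses.append(np.mean((h - y) ** 2))
    return np.array(losses)

def lagrangian(w, lam, x, y, eps=1e-3):
    """L(w, lam) = final loss + sum_k lam_k * (loss_k - loss_{k-1} + eps)."""
    losses = layer_losses(w, x, y)
    slack = losses[1:] - losses[:-1] + eps       # descent-constraint violations
    return losses[-1] + np.dot(lam, slack), slack

def primal_dual_train(num_layers=4, steps=300, lr_w=0.05, lr_lam=0.1):
    x = rng.normal(size=64)
    y = np.zeros(64)                             # toy denoising target
    w = rng.uniform(0.0, 0.2, size=num_layers)   # per-layer step sizes
    lam = np.zeros(num_layers - 1)               # one multiplier per constraint
    for _ in range(steps):
        # primal step: finite-difference gradient descent on the Lagrangian
        base, _ = lagrangian(w, lam, x, y)
        grad = np.zeros_like(w)
        for k in range(num_layers):
            wp = w.copy()
            wp[k] += 1e-5
            grad[k] = (lagrangian(wp, lam, x, y)[0] - base) / 1e-5
        w -= lr_w * grad
        # dual step: projected gradient ascent on the multipliers
        _, slack = lagrangian(w, lam, x, y)
        lam = np.maximum(0.0, lam + lr_lam * slack)
    return w, lam, layer_losses(w, x, y)

w, lam, losses = primal_dual_train()
```

After training, `losses` is non-increasing across layers: each intermediate representation reduces the loss relative to the previous one, which is the behavior the layer-wise constraint is meant to enforce.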