Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization

📅 2026-02-28

🤖 AI Summary
This work addresses the uneven distribution of capacity across layers in large language models and the lack of a theoretical framework for turning layer sensitivity estimates into optimization decisions under hardware constraints. Building on the Minimum Description Length (MDL) principle, the study uses local curvature information to construct a curvature-weighted layer gain metric. It formulates convex optimization programs with closed-form solutions and generalization guarantees, enabling adaptive allocation of expert slots or LoRA rank as well as layer-wise pruning. The proposed algorithm combines Hessian-inverse approximation, curvature-adjusted gains, and bisection over a dual variable, yielding a solver with $O(K \log 1/\varepsilon)$ time complexity and a transfer regret bound of $O(\delta^2)$, which supports model compression and cross-domain adaptation.
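To make the gain metric concrete, here is a minimal NumPy sketch of the curvature-adjusted layer gain that the abstract defines as $\zeta_k^2 = g_k^\top \widetilde{H}_{kk}^{-1} g_k$. Several details are assumptions rather than the paper's method: $\widetilde{H}_{kk}$ is modeled as a damped Hessian block `H + damping * I`, the quality scores are normalized to sum to one, and the function names are hypothetical. The toy example also checks numerically that the gain equals twice the reduction of the local quadratic model at its minimizer.

```python
import numpy as np

def curvature_adjusted_gain(g, H, damping=0.0):
    """zeta^2 = g^T H_tilde^{-1} g for a single layer.

    Assumption: H_tilde = H + damping * I (a damped Hessian block).
    The source does not say how the approximation is built.
    """
    g = np.asarray(g, dtype=float)
    H = np.asarray(H, dtype=float)
    H_tilde = H + damping * np.eye(H.shape[0])
    # Solve a linear system instead of forming the inverse explicitly.
    return float(g @ np.linalg.solve(H_tilde, g))

def second_order_reduction(g, H, step):
    """Decrease of the quadratic model g^T s + 0.5 s^T H s at step s."""
    g, H, step = (np.asarray(a, dtype=float) for a in (g, H, step))
    return float(-(g @ step + 0.5 * step @ H @ step))

def quality_scores(gains):
    """Assumed normalization: divide each gain by the sum of all gains."""
    gains = np.asarray(gains, dtype=float)
    return gains / gains.sum()

# Toy layer with two parameters and a positive definite Hessian block.
g = np.array([1.0, 2.0])
H = np.diag([2.0, 4.0])

zeta_sq = curvature_adjusted_gain(g, H)
newton_step = -np.linalg.solve(H, g)  # minimizer of the quadratic model
best_reduction = second_order_reduction(g, H, newton_step)

print(zeta_sq, best_reduction)  # gain and the best second-order reduction
```

In this toy case the gradient norm alone would ignore that the second coordinate sits in a direction of higher curvature, which is the effect the curvature weighting is meant to capture.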

📝 Abstract
Layer-wise capacity in large language models is highly non-uniform: some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions under hardware constraints. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle. Our central quantity is the curvature-adjusted layer gain $\zeta_k^2 = g_k^\top \widetilde{H}_{kk}^{-1} g_k$, which we show equals twice the maximal second-order reduction in empirical risk achievable by updating layer $k$ alone, and which strictly dominates gradient-norm-based scores by incorporating local curvature. Normalizing these gains into layer quality scores $q_k$, we formulate two convex MDL programs: a capacity allocation program that distributes expert slots or LoRA rank preferentially to high-curvature layers under diminishing returns, and a pruning program that concentrates sparsity on low-gain layers while protecting high-gain layers from degradation. Both programs admit unique closed-form solutions parameterized by a single dual variable, computable in $O(K \log 1/\varepsilon)$ via bisection. We prove an $O(\delta^2)$ transfer regret bound showing that source-domain allocations remain near-optimal on target tasks when curvature scores drift by $\delta$, with explicit constants tied to the condition number of the target program. Together, these results elevate layer-wise capacity optimization from an empirical heuristic to a theoretically grounded, computationally efficient framework with provable optimality and generalization guarantees.
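The abstract does not spell out the allocation program, so the following sketch uses a stand-in objective to illustrate how a closed-form solution indexed by one dual variable can be found by bisection. The assumed program is: maximize $\sum_k q_k \log(1 + r_k)$ subject to $\sum_k r_k = B$ and $r_k \ge 0$, whose stationarity condition gives $r_k(\lambda) = \max(0, q_k/\lambda - 1)$. The objective, the continuous relaxation of ranks, and the name `allocate_capacity` are all assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def allocate_capacity(q, budget, eps=1e-10):
    """Bisection on the dual variable of an assumed water-filling program.

    Assumed objective (not taken from the source):
        maximize sum_k q_k * log(1 + r_k)
        subject to sum_k r_k = budget, r_k >= 0.
    Requires budget > 0 and at least one positive score in q.
    """
    q = np.asarray(q, dtype=float)

    def alloc(lam):
        # Closed-form primal solution for a fixed dual variable lam.
        return np.maximum(0.0, q / lam - 1.0)

    # The total allocation is non-increasing in lam.
    hi = q.max()                   # here every layer receives 0
    lo = q.max() / (1.0 + budget)  # here the top layer alone receives `budget`
    while hi - lo > eps * hi:      # O(log 1/eps) iterations, O(K) work each
        mid = 0.5 * (lo + hi)
        if alloc(mid).sum() > budget:
            lo = mid               # over budget: raise the price
        else:
            hi = mid               # within budget: lower the price
    return alloc(0.5 * (lo + hi))

# Three layers with normalized quality scores and a budget of 3 units.
q = np.array([0.5, 0.3, 0.2])
r = allocate_capacity(q, budget=3.0)
print(r)  # higher-scoring layers receive more capacity
```

Under this stand-in objective the logarithm supplies the diminishing returns, and layers whose score falls below the dual price $\lambda$ receive no capacity at all. Rounding the continuous solution to integer expert slots or LoRA ranks is left out of the sketch.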
Problem

Research questions and friction points this paper is trying to address.

layer-wise capacity
curvature-aware optimization
minimum description length
large language models
capacity allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimum Description Length
Curvature-Weighted Allocation
Layer-Adaptive Optimization
Convex Pruning
Transfer Regret Bound