Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the imbalanced capacity allocation across the layers of large language models and the absence of a theoretical framework that translates layer-sensitivity estimates into optimization decisions under hardware constraints. Leveraging the Minimum Description Length principle, the study introduces, for the first time, local curvature information to construct a curvature-weighted layer gain metric. It establishes a convex optimization model with a closed-form solution and generalization guarantees, enabling adaptive allocation and pruning of expert slots or LoRA ranks. The proposed algorithm integrates Hessian inverse approximation, curvature-adjusted gains, and bisection over a single dual variable, yielding an efficient solver with $O(K \log 1/\varepsilon)$ time complexity and an $O(\delta^2)$ transfer regret bound, thereby significantly improving model compression and cross-domain adaptation efficiency.
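For concreteness, below is a minimal sketch of the solver loop described above, not the authors' implementation: it assumes a diagonal approximation to the layer-wise curvature blocks $\widetilde{H}_{kk}$ and a log-utility diminishing-returns objective, since the paper's exact concave utility is not reproduced on this page; only the overall structure (curvature-adjusted gains, then bisection over a single dual variable) is taken from the summary. The names `curvature_gains`, `allocate_capacity`, and `budget` are illustrative.

```python
import numpy as np

def curvature_gains(grads, hess_diag, eps=1e-8):
    """Curvature-adjusted layer gains zeta_k^2 = g_k^T H_kk^{-1} g_k.

    Uses a diagonal curvature approximation (an assumption; the paper's
    block H~_kk could be any PSD curvature proxy, e.g. K-FAC or GGN)."""
    return np.array([float(g @ (g / (h + eps)))     # g^T diag(h)^{-1} g
                     for g, h in zip(grads, hess_diag)])

def allocate_capacity(q, budget, tol=1e-9):
    """Water-filling allocation via bisection on the dual variable lam.

    Assumes the concave program
        max_c  sum_k q_k * log(1 + c_k)   s.t.  sum_k c_k = budget, c_k >= 0,
    whose KKT conditions give the closed form c_k(lam) = max(0, q_k/lam - 1).
    Each bisection step costs O(K), so the loop runs in O(K log 1/eps)."""
    q = np.asarray(q, dtype=float)
    spent = lambda lam: np.maximum(0.0, q / lam - 1.0).sum()
    lo, hi = tol, float(q.max())        # spent() is decreasing in lam
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if spent(lam) > budget:
            lo = lam                    # over budget: raise the water level
        else:
            hi = lam
    return np.maximum(0.0, q / hi - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=64) for _ in range(4)]           # per-layer gradients
    hess = [np.abs(rng.normal(size=64)) + 0.1 for _ in range(4)]
    zeta2 = curvature_gains(grads, hess)
    q = zeta2 / zeta2.sum()                                   # quality scores q_k
    print(allocate_capacity(q, budget=32.0))                  # e.g. total LoRA rank
```

Under these assumptions the bisection converges to the dual variable at which the allocations exhaust the budget; in practice the fractional $c_k$ would be rounded to integer expert slots or LoRA ranks, a step the summary does not specify.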

📝 Abstract
Layer-wise capacity in large language models is highly non-uniform: some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions under hardware constraints. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle. Our central quantity is the curvature-adjusted layer gain $\zeta_k^2 = g_k^\top \widetilde{H}_{kk}^{-1} g_k$, which we show equals twice the maximal second-order reduction in empirical risk achievable by updating layer $k$ alone, and which strictly dominates gradient-norm-based scores by incorporating local curvature. Normalizing these gains into layer quality scores $q_k$, we formulate two convex MDL programs: a capacity allocation program that distributes expert slots or LoRA rank preferentially to high-curvature layers under diminishing returns, and a pruning program that concentrates sparsity on low-gain layers while protecting high-gain layers from degradation. Both programs admit unique closed-form solutions parameterized by a single dual variable, computable in $O(K \log 1/\varepsilon)$ via bisection. We prove an $O(\delta^2)$ transfer regret bound showing that source-domain allocations remain near-optimal on target tasks when curvature scores drift by $\delta$, with explicit constants tied to the condition number of the target program. Together, these results elevate layer-wise capacity optimization from an empirical heuristic to a theoretically grounded, computationally efficient framework with provable optimality and generalization guarantees.
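The identity claimed for $\zeta_k^2$ follows from the standard single-block Newton-step argument; a worked sketch in the abstract's notation, assuming $\widetilde{H}_{kk} \succ 0$:

```latex
% Restrict the update to layer k and expand the empirical risk R to
% second order around the current parameters \theta:
\begin{aligned}
R(\theta + \delta_k) &\approx R(\theta) + g_k^{\top}\delta_k
   + \tfrac{1}{2}\,\delta_k^{\top}\widetilde{H}_{kk}\,\delta_k,\\
\delta_k^{\star} &= -\,\widetilde{H}_{kk}^{-1} g_k
   \qquad\text{(first-order condition, } \widetilde{H}_{kk}\succ 0\text{)},\\
R(\theta) - R(\theta+\delta_k^{\star})
   &\approx g_k^{\top}\widetilde{H}_{kk}^{-1}g_k
   - \tfrac{1}{2}\,g_k^{\top}\widetilde{H}_{kk}^{-1}g_k
   = \tfrac{1}{2}\,\zeta_k^{2}.
\end{aligned}
% Hence zeta_k^2 is twice the best second-order reduction achievable by
% updating layer k alone; it reduces to ||g_k||^2 only when the curvature
% is the identity, which is why it dominates gradient-norm scores.
```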
Problem

Research questions and friction points this paper is trying to address.

layer-wise capacity
curvature-aware optimization
minimum description length
large language models
capacity allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimum Description Length
Curvature-Weighted Allocation
Layer-Adaptive Optimization
Convex Pruning
Transfer Regret Bound
Theophilus Amaefuna
Bellini College of Artificial Intelligence, Cybersecurity and Computing, University of South Florida, Tampa
Hitesh Vaidya
Bellini College of Artificial Intelligence, Cybersecurity and Computing, University of South Florida, Tampa
Anshuman Chhabra
Assistant Professor of Computer Science and Engineering, University of South Florida
AI Safety, Robust AI, Trustworthy AI
Ankur Mali
Assistant Professor, University of South Florida
Formal Language, Memory Networks, Predictive Coding, Natural Language Processing, Lifelong Machine Learning