Block removal for large language models through constrained binary optimization

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the exponential growth of the search space in large language model compression by formulating Transformer module removal as a constrained binary optimization problem and, for the first time, mapping it to a physical Ising model, whose energy function serves as an efficient proxy for downstream performance. This approach overcomes the conventional limitation of removing only contiguous modules, enabling the discovery of high-quality non-contiguous sparse architectures. By integrating lightweight gradient estimation with (approximate) Ising solvers, the method achieves highly efficient structural search. It consistently outperforms existing approaches across multiple benchmarks, yielding up to a 6-point improvement in MMLU scores after brief retraining, and has been successfully applied to the structurally complex NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model.
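The mapping mentioned above can be illustrated with the standard change of variables between binary (QUBO) and spin (Ising) formulations. The sketch below uses random placeholder coefficients, not the paper's fitted values: a block-removal objective over binary variables z_i ∈ {0, 1} (1 = remove block i) is rewritten exactly as an Ising energy over spins s_i ∈ {−1, +1} via z_i = (1 − s_i)/2.

```python
import numpy as np

# Toy coefficients (placeholders, not the paper's fitted values).
rng = np.random.default_rng(0)
n = 6                              # toy number of transformer blocks
Q = rng.normal(size=(n, n))
Q = (Q + Q.T) / 2                  # symmetric pairwise couplings
c = rng.normal(size=n)             # per-block linear terms

def qubo_energy(z):
    """E(z) = z^T Q z + c^T z with z_i in {0,1} (1 = remove block i)."""
    return z @ Q @ z + c @ z

# Substituting z_i = (1 - s_i)/2 and expanding gives an Ising energy
# s^T J s + h^T s + const with the coefficients below (exact algebra,
# valid for any symmetric Q).
J = Q / 4
h = -Q.sum(axis=1) / 2 - c / 2
const = Q.sum() / 4 + c.sum() / 2

def ising_energy(s):
    """Same objective expressed over spins s_i in {-1,+1}."""
    return s @ J @ s + h @ s + const

# Sanity check: both forms agree on a random removal pattern.
s = rng.choice([-1, 1], size=n)
assert np.isclose(ising_energy(s), qubo_energy((1 - s) / 2))
```

Once in this spin form, the problem can be handed to any exact or approximate Ising solver, which is what makes the formulation attractive as a search proxy.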

📝 Abstract
Compressing resource-intensive large language models by removing whole transformer blocks is a seemingly simple idea, but identifying which blocks to remove constitutes an exponentially difficult combinatorial problem. In this paper, we formulate block removal as a constrained binary optimization problem that can be mapped to a physical system (Ising model), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations and yields many high-quality, non-trivial solutions beyond consecutive regions. We demonstrate that our approach outperforms state-of-the-art block-removal methods across several benchmarks, with performance gains persisting after short retraining, and reaching improvements of up to 6 points on the MMLU benchmark. Our method requires only forward and backward passes for a few active parameters, together with an (at least approximate) Ising solver, and can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure.
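A minimal sketch of the constrained search described in the abstract: choose exactly k of n transformer blocks to remove so that a proxy energy (standing in for the paper's Ising energy, which the authors fit from lightweight forward/backward passes) is minimized. The coefficients here are random placeholders; for small n every configuration can be enumerated and ranked exactly, whereas the paper relies on (approximate) Ising solvers to scale this search, and the minimizer need not be a contiguous run of blocks.

```python
import itertools
import numpy as np

# Placeholder proxy-energy coefficients (not the paper's values).
rng = np.random.default_rng(1)
n, k = 10, 3                       # toy: 10 blocks, remove exactly 3
h = rng.normal(size=n)             # per-block removal cost
J = np.triu(rng.normal(size=(n, n)) * 0.1, 1)  # pairwise interactions

def energy(removed):
    """Proxy energy of a removal set: linear plus pairwise terms."""
    z = np.zeros(n)
    z[list(removed)] = 1.0
    return h @ z + z @ J @ z

# Exhaustively rank all C(n, k) removal configurations; the cardinality
# constraint is enforced by construction.
best = min(itertools.combinations(range(n), k), key=energy)
print("blocks to remove:", best)
```

The key point the example mirrors is that ranking happens entirely through the cheap energy function, so non-contiguous candidates beyond consecutive regions are scored on equal footing with contiguous ones.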
Problem

Research questions and friction points this paper is trying to address.

block removal
large language models
combinatorial optimization
model compression
constrained binary optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

constrained binary optimization
block removal
Ising model
large language model compression
non-consecutive pruning