DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning

📅 2026-01-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the substantial parameter redundancy in the feed-forward networks (FFNs) of large language models, which existing pruning methods handle poorly because their static strategies and fixed calibration data cannot adapt to the evolving set of knowledge neurons active during autoregressive generation. To overcome this limitation, the authors propose DART, a lightweight, training-free dynamic pruning framework that operates at inference time. DART monitors shifts in attention distributions to detect contextual drift, enabling real-time assessment of neuron importance and adaptive updating of sparse masks for semantic-aware FFN pruning. Evaluated on LLaMA-3.1-8B, DART achieves up to a 14.5% accuracy gain at 70% FFN sparsity, and its ROUGE-L scores on summarization tasks reach up to three times those of static pruning, approaching dense-model performance. The approach incurs less than 10 MB of additional memory overhead and only a 0.1% increase in FLOPs.

πŸ“ Abstract
Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs during autoregressive generation as the context changes. To address this, we introduce DART (Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly context-based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms prior dynamic baselines, achieving accuracy gains of up to 14.5% on LLaMA-3.1-8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x better ROUGE-L scores than static-masked pruning on summarization tasks, with performance comparable to the original dense models. We demonstrate that the proposed framework effectively adapts to diverse semantic contexts and preserves model capabilities across both general and domain-specific tasks, while requiring less than 10 MB of additional memory for LLaMA-3.1-8B (16 GB) and incurring only 0.1% FLOPs overhead. The code is available at https://github.com/seeder-research/DART.
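The abstract's core loop can be sketched in a few lines: track the current attention distribution, compare it against a cached reference, and refresh the FFN neuron mask only when the drift exceeds a threshold. This is a minimal illustrative sketch, not the authors' implementation; the drift metric (a symmetric KL divergence), the magnitude-based importance score, and the threshold value are all my assumptions, since the paper's exact choices are not given in this summary.

```python
import numpy as np

def attention_drift(p, q, eps=1e-12):
    # Symmetric KL divergence between two attention distributions.
    # Stand-in drift metric; the paper's exact measure is not specified here.
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def update_mask(neuron_importance, sparsity):
    # Keep the top-(1 - sparsity) fraction of FFN neurons by importance.
    k = max(1, int(round(len(neuron_importance) * (1.0 - sparsity))))
    keep = np.argsort(neuron_importance)[-k:]
    mask = np.zeros(len(neuron_importance), dtype=bool)
    mask[keep] = True
    return mask

def dart_step(attn_now, attn_ref, importance, mask,
              sparsity=0.7, threshold=0.1):
    # One decoding step: refresh the sparse mask only when attention has
    # drifted past the threshold; otherwise reuse the cached mask.
    if attention_drift(attn_now, attn_ref) > threshold:
        return update_mask(importance, sparsity), attn_now
    return mask, attn_ref
```

On a stable context the cached mask is reused, so the per-token cost is just the drift check; only a detected context shift pays for re-ranking neurons, which is consistent with the reported sub-10 MB memory and ~0.1% FLOPs overhead.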
Problem

Research questions and friction points this paper is trying to address.

pruning
large language models
dynamic inference
knowledge neurons
parameter redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic pruning
knowledge neurons
attention-guided tracing
inference-time adaptation
LLM efficiency
Abhishek Tyagi
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
Yunuo Cen
National University of Singapore
Optimization, SAT Solving, Quantum-Inspired Computing, Hardware/Software Co-Design
Shrey Dhorajiya
Department of Computer Science, Birla Institute of Technology and Science, Pilani, India
B. Veeravalli
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
Xuanyao Fong
National University of Singapore
Hardware-Software Co-Design, Emerging Technologies, Compact Modeling, Electronics Simulations