AI Summary
This work addresses the significant parameter redundancy in the feed-forward networks (FFNs) of large language models, which existing pruning methods fail to handle effectively due to their reliance on static strategies and fixed calibration data that cannot adapt to the dynamic evolution of knowledge neurons during autoregressive generation. To overcome this limitation, we propose DART, a lightweight, training-free dynamic pruning framework that operates at inference time. DART monitors shifts in attention distributions to detect contextual drift, enabling real-time assessment of neuron importance and adaptive updating of sparse masks for semantic-aware FFN pruning. Evaluated on LLaMA-3.1-8B, DART achieves up to a 14.5% accuracy gain at 70% FFN sparsity, with ROUGE-L scores on summarization tasks tripling those of static pruning, while remaining comparable to dense models. The approach incurs less than 10 MB of additional memory overhead and only a 0.1% increase in FLOPs.
Abstract
Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs as the context changes during autoregressive generation. To address this, we introduce DART (Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly, context-based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms prior dynamic baselines, achieving accuracy gains of up to 14.5% on LLaMA-3.1-8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x higher ROUGE-L scores than static-masked pruning on summarization tasks, with performance comparable to the original dense models. We demonstrate that the proposed framework effectively adapts to diverse semantic contexts and preserves model capabilities across both general and domain-specific tasks, while requiring less than 10 MB of additional memory for LLaMA-3.1-8B (16 GB) and only 0.1% FLOPs overhead. The code is available at https://github.com/seeder-research/DART.
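The core loop described above, detecting drift in the attention distribution and refreshing the FFN neuron mask only when the context has shifted, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the drift metric (symmetric KL divergence), the drift threshold, and the neuron importance scores are all hypothetical placeholders for whatever DART actually uses.

```python
import numpy as np

def _kl(p, q, eps=1e-9):
    """KL divergence between two normalized attention distributions."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def attention_drift(prev_attn, curr_attn):
    """Symmetric KL as a simple drift score (illustrative choice)."""
    return 0.5 * (_kl(prev_attn, curr_attn) + _kl(curr_attn, prev_attn))

def update_mask(neuron_importance, sparsity):
    """Keep the top (1 - sparsity) fraction of FFN neurons."""
    k = max(1, int(round((1.0 - sparsity) * neuron_importance.size)))
    mask = np.zeros_like(neuron_importance, dtype=bool)
    mask[np.argsort(neuron_importance)[-k:]] = True
    return mask

def dart_step(prev_attn, curr_attn, importance, mask,
              sparsity=0.7, threshold=0.1):
    """Re-prune only when the attention distribution drifts enough;
    otherwise reuse the existing mask (keeps runtime overhead low)."""
    if mask is None or attention_drift(prev_attn, curr_attn) > threshold:
        mask = update_mask(importance, sparsity)
    return mask
```

At 70% sparsity this keeps 30% of neurons per FFN layer; because the mask is rebuilt only on detected context shifts rather than every token, the per-token cost is dominated by the cheap drift check, which is consistent with the small memory and FLOPs overhead reported above.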