🤖 AI Summary
To address redundancy in the attention blocks of large language models (LLMs), this paper proposes a dynamic skipping mechanism that identifies and bypasses the least task-relevant attention blocks at inference time, while introducing lightweight trainable scaling parameters on the attention and MLP outputs of the remaining blocks to mitigate distributional shift and preserve output quality. The method uses gradient-based block importance estimation, enabling end-to-end joint optimization of both the skipping decisions and the scaling coefficients. It requires no modification of model weights, is independent of pretraining data, and is fully compatible with existing compression techniques for plug-and-play deployment. Evaluated on Llama-3.1-8B and Qwen2.5-7B, skipping four attention blocks incurs less than a 2% performance drop, substantially outperforming pruning and early-exit baselines. This work presents the first approach to achieve joint, data-agnostic, weight-free attention block skipping with adaptive output scaling.
📝 Abstract
Modern causal language models stack many attention blocks to improve performance, but not all blocks are necessary for every task. We propose Hopscotch, a simple yet effective method that identifies and skips the attention blocks that contribute least to a task while adapting the remaining blocks to preserve output quality. Hopscotch jointly optimizes which blocks to skip and how to scale the outputs of the remaining layers. By introducing lightweight, trainable scaling parameters to attention and MLP blocks, it mitigates distribution shifts in hidden states caused by removing attention blocks. Hopscotch does not modify model weights or require access to pretraining or instruction-tuning data, and is compatible with existing model compression techniques. When applied to $\texttt{Llama-3.1-8B}$ and $\texttt{Qwen2.5-7B}$, Hopscotch achieves less than a 2% drop in performance even after skipping four attention blocks.
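The core idea in the abstract can be illustrated with a toy sketch: each decoder block has a residual attention branch that may be skipped entirely, and per-block scaling parameters on the attention and MLP outputs that compensate for the resulting shift in hidden states. Everything here is hypothetical scaffolding, not the paper's implementation: attention and MLP are replaced by simple linear stand-ins, and the names `ToyBlock`, `alpha`, and `beta` are illustrative only.

```python
import numpy as np

class ToyBlock:
    """Toy decoder block with a skippable attention branch and
    lightweight output scales (a sketch of the described mechanism)."""

    def __init__(self, d, rng, skip_attn=False):
        self.skip_attn = skip_attn  # skipping decision for this block
        self.alpha = 1.0            # scale on attention output (trainable in the paper)
        self.beta = 1.0             # scale on MLP output (trainable in the paper)
        # Linear stand-ins for the attention and MLP sub-layers.
        self.Wa = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wm = rng.standard_normal((d, d)) / np.sqrt(d)

    def forward(self, h):
        if not self.skip_attn:
            h = h + self.alpha * (h @ self.Wa)    # residual attention branch
        h = h + self.beta * np.tanh(h @ self.Wm)  # residual MLP branch
        return h

def run(blocks, h):
    """Run hidden states through a stack of blocks."""
    for blk in blocks:
        h = blk.forward(h)
    return h

rng = np.random.default_rng(0)
blocks = [ToyBlock(4, rng) for _ in range(3)]
h = np.ones((2, 4))
full = run(blocks, h)          # all blocks active
blocks[1].skip_attn = True     # skip one attention branch
skipped = run(blocks, h)       # same stack with the branch removed
```

In the full method, the skipping decisions and the scales (`alpha`, `beta` here) would be optimized jointly, with block importance estimated from gradients, so that `skipped` stays close to `full` without touching the frozen weights (`Wa`, `Wm`).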