Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles to deploying block attention in long-context scenarios: the lack of an effective method for segmenting text into semantically self-contained blocks, and block fine-tuning strategies that are inefficient and risk degrading performance. The authors introduce SemanticSeg, a multi-category semantic segmentation dataset of over 30k instances across 16 categories, and use it to train a lightweight automatic segmenter. They also propose block distillation, a framework in which a frozen full-attention teacher model guides a block-attention student, with three components: block sink tokens, block dropout, and token-level loss weighting. Experiments show that the learned segmenter outperforms heuristic and statistical baselines, and that block distillation approaches full-attention performance across multiple models and benchmarks, offering a practical and scalable path to deploying block attention.
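To make the distillation idea concrete, here is a minimal sketch of a teacher-to-student loss with per-token weights. It is an illustration under assumptions, not the paper's formulation: the use of KL divergence, the function names, and the way `token_weights` is supplied are all hypothetical, since the summary does not say how the weights are computed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def weighted_distillation_loss(teacher, student, token_weights):
    """Weighted average of per-token KL terms.

    teacher, student: one logit vector per token position.
    token_weights: hypothetical per-token weights (assumed given here);
    larger values put more emphasis on that token.
    """
    kls = [token_kl(t, s) for t, s in zip(teacher, student)]
    return sum(w * k for w, k in zip(token_weights, kls)) / sum(token_weights)


# Toy example: two token positions, vocabulary of size three.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
student = [[1.5, 0.7, -0.5], [0.1, 0.2, 0.3]]
loss = weighted_distillation_loss(teacher, student, token_weights=[2.0, 1.0])
```

In this toy example the second token contributes nothing because the student already matches the teacher there, so the loss comes entirely from the first, more heavily weighted token.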

📝 Abstract

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning and uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.
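As a rough illustration of "blocks that cannot attend to one another", the sketch below builds a boolean attention mask in which each token attends causally only to tokens in its own block. The function name and the list-of-block-lengths input are assumptions made for this example; the abstract does not specify how the mask is constructed, nor whether some block (such as a final query) is allowed to see the others.

```python
def block_causal_mask(block_lengths):
    """Return mask[i][j] == True when token i may attend to token j.

    block_lengths: number of tokens in each block, in order (hypothetical input).
    A token sees only earlier-or-equal positions inside its own block.
    """
    # Block id for every token position, e.g. [3, 2] -> [0, 0, 0, 1, 1].
    block_ids = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    total = len(block_ids)
    return [
        [block_ids[i] == block_ids[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


mask = block_causal_mask([3, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no entry crosses a block boundary, the keys and values computed for one block do not depend on the others, which is what makes it possible to cache and reuse them across inputs.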

Problem

Research questions and friction points this paper is trying to address.

block attention

semantic segmentation

KV cache reuse

long-context

Retrieval-Augmented Generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

block attention

semantic segmentation

block distillation