Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational and memory overhead of many-shot in-context learning (ICL) at inference time, which hinders practical multi-task deployment. We propose a training-free framework that co-designs dynamic block-sparse attention with retrieval of cached demonstration groups, substantially reducing per-inference computation and memory footprint. Compared to standard ICL, the method achieves per-example inference latency approaching that of fine-tuned models while retaining, on average, over 95% of the accuracy of the strongest ICL and fine-tuning baselines. The key contribution is the integration of training-free dynamic sparse attention with retrieval-augmented caching, enabling accurate, low-latency ICL inference that can be shared across tasks and deployed at scale.

📝 Abstract
Many-shot in-context learning has recently shown promise as an alternative to finetuning, with the major advantage that the same model can be served for multiple tasks. However, this shifts the computational burden from training-time to inference-time, making deployment of many-shot ICL challenging to justify in practice. This cost is further increased if a custom demonstration set is retrieved for each inference example. We present Dynamic Block-Sparse Attention, a training-free framework for retrieval-based many-shot in-context learning. By combining carefully designed block-sparse attention and retrieval of cached groups of demonstrations, we achieve comparable per-example latency to finetuning while maintaining on average >95% of the best method's accuracy across strong ICL and finetuning baselines. We hope that this will further enable the deployment of many-shot ICL at scale.
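The abstract's core mechanism is attention restricted to blocks: each query attends only to selected blocks of the cached demonstration context rather than the full sequence. The sketch below illustrates one way such a block-sparse attention mask could be built; the block layout, `allowed` connectivity map, and function name are illustrative assumptions, not the paper's actual kernel design.

```python
import numpy as np

def block_sparse_mask(num_blocks, block_size, allowed):
    """Build a boolean attention mask in which query block i may attend
    only to the key blocks listed in allowed[i].

    Illustrative sketch only: the paper's actual block layout and
    sparsity pattern are not reproduced here.
    """
    n = num_blocks * block_size
    mask = np.zeros((n, n), dtype=bool)
    for qb, key_blocks in allowed.items():
        qs = slice(qb * block_size, (qb + 1) * block_size)
        for kb in key_blocks:
            ks = slice(kb * block_size, (kb + 1) * block_size)
            mask[qs, ks] = True  # permit attention within this block pair
    return mask

# Example: 3 cached demonstration blocks (0-2) attend only to themselves,
# so their KV states can be precomputed independently; the query block (3)
# attends to all demonstration blocks plus itself.
allowed = {0: [0], 1: [1], 2: [2], 3: [0, 1, 2, 3]}
mask = block_sparse_mask(num_blocks=4, block_size=2, allowed=allowed)
print(mask.sum())  # number of permitted attention entries, vs. 64 for dense
```

Keeping demonstration blocks independent of one another is what would let their key/value states be cached once and reused across queries, which is where the latency savings over dense many-shot attention would come from.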
Problem

Research questions and friction points this paper is trying to address.

Reducing the inference-time computational burden of many-shot in-context learning.
Efficiently retrieving a custom demonstration set for each inference example.
Matching the low latency and high accuracy of finetuning without per-task training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Block-Sparse Attention for efficient inference
Retrieval-based many-shot in-context learning framework
Training-free approach with cached demonstration groups
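The retrieval side of the framework pairs each incoming example with a cached group of demonstrations. A minimal sketch of how such a lookup could work is below; the cosine-similarity retrieval, the `kv_cache` layout keyed by group id, and all names are illustrative assumptions rather than the paper's actual interface.

```python
import numpy as np

def retrieve_cached_group(query_emb, group_embs, kv_cache):
    """Pick the cached demonstration group whose centroid embedding is
    most similar (by cosine similarity) to the query embedding, and
    return that group's precomputed KV cache.

    Illustrative sketch: the paper's retrieval criterion and cache
    format are not specified here.
    """
    sims = group_embs @ query_emb / (
        np.linalg.norm(group_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    best = int(np.argmax(sims))
    return best, kv_cache[best]

# Hypothetical cache: one precomputed KV entry per demonstration group.
group_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
kv_cache = {0: "kv-for-group-0", 1: "kv-for-group-1"}
gid, kv = retrieve_cached_group(np.array([0.9, 0.1]), group_embs, kv_cache)
print(gid, kv)  # → 0 kv-for-group-0
```

Caching at the granularity of demonstration groups, rather than retrieving fresh per-example demonstrations, is what would keep the approach training-free while amortizing the cost of encoding demonstrations across many inference requests.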