🤖 AI Summary
On edge devices, cross-encoder rerankers impose prohibitive latency and memory overhead on semantic top-K retrieval tasks such as retrieval-augmented generation (RAG), agent memory, and personalized recommendation, making them a critical end-to-end performance bottleneck. This paper introduces GRATING, a training-free inference system built on the empirical observation that relative rankings stabilize early in intermediate transformer layers, which enables sequence-level dynamic pruning before full inference completes. GRATING further combines a dual-layer sliding window with chunked execution to achieve single-pass forwarding, a global view of all candidates, and I/O-computation overlap. Evaluated on rerankers from 0.6B to 8B parameters, GRATING reduces microbenchmark latency by up to 89.0% and peak memory usage by up to 94.9%. Across three real-world applications, it cuts latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, with zero accuracy degradation.
📝 Abstract
Semantic top-K selection with cross-encoder rerankers underpins on-device AI services such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, opening opportunities to prune before full inference completes.
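The core observation can be illustrated with a short sketch: score candidates from intermediate hidden states layer by layer and stop once the top-K *order* stops changing. This is a toy illustration under stated assumptions, not GRATING's code; the layer stack, `score_head`, and the two-layer stability threshold are all stand-ins.

```python
# Toy sketch of early ranking stability (not GRATING's implementation).
# Random linear layers stand in for a real cross-encoder reranker; the
# scoring head and stability criterion are illustrative assumptions.
import torch

torch.manual_seed(0)
num_candidates, hidden, num_layers, k = 32, 64, 12, 5

# One hidden state per (query, candidate) pair, refined layer by layer.
states = torch.randn(num_candidates, hidden)
toy_layers = [torch.nn.Linear(hidden, hidden) for _ in range(num_layers)]
score_head = torch.nn.Linear(hidden, 1)

prev_topk, stable_for = None, 0
with torch.no_grad():
    for depth, layer in enumerate(toy_layers, start=1):
        states = torch.tanh(layer(states))        # one transformer-like step
        scores = score_head(states).squeeze(-1)   # intermediate-layer scores
        topk = torch.topk(scores, k).indices.tolist()
        # Top-K selection only needs relative order, not exact scores, so we
        # may exit once the top-K order has held for a few consecutive layers.
        stable_for = stable_for + 1 if topk == prev_topk else 0
        prev_topk = topk
        if stable_for >= 2:
            print(f"top-{k} order stable by layer {depth}; prune remaining layers")
            break
    else:
        print("toy run: order kept changing; real rerankers stabilize earlier")
```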
Building on this insight, we propose monolithic forwarding and develop a training-free inference system, GRATING. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via a dual-layer sliding window and chunked execution. We evaluate GRATING against state-of-the-art baselines on rerankers from 0.6B to 8B parameters on Apple M2 and NVIDIA RTX 5070 hardware. GRATING consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, GRATING lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.
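A minimal sketch of the system-level idea follows: advance all surviving candidates one layer at a time in fixed-size chunks (bounding resident memory), keep a global score view, and prune the low-ranked cluster between layers. The fixed `keep_fraction` schedule here is a hypothetical placeholder; the paper's pruning criterion preserves the exact top-K, and its dual-layer sliding window additionally overlaps weight I/O with chunk compute, which this sketch only notes in comments.

```python
# Sketch of monolithic forwarding with progressive cluster pruning
# (assumptions, not GRATING's implementation): chunked per-layer execution
# over a global candidate pool. In the real system, loading the next
# layer's weights would overlap with computing the current chunks.
import torch

torch.manual_seed(0)
num_candidates, hidden, num_layers, k, chunk = 64, 64, 8, 5, 16
keep_fraction = 0.5                      # hypothetical pruning schedule

states = torch.randn(num_candidates, hidden)
layers = [torch.nn.Linear(hidden, hidden) for _ in range(num_layers)]
head = torch.nn.Linear(hidden, 1)
alive = torch.arange(num_candidates)     # global view of candidate ids

with torch.no_grad():
    for layer in layers:
        # Chunked execution: only `chunk` sequences are resident at once.
        outs = [torch.tanh(layer(states[i:i + chunk]))
                for i in range(0, len(states), chunk)]
        states = torch.cat(outs)
        scores = head(states).squeeze(-1)
        # Progressive cluster pruning on the global ranking; never prune
        # below K survivors, so the final top-K set is still produced.
        keep = max(k, int(len(alive) * keep_fraction))
        idx = torch.topk(scores, keep).indices
        states, alive = states[idx], alive[idx]

    final = alive[torch.topk(head(states).squeeze(-1), k).indices]
    print("top-K candidate ids:", final.tolist())
```

Because later layers run over ever-fewer sequences, total compute and peak activation memory both shrink, which is the mechanism behind the latency and memory reductions reported above.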