🤖 AI Summary
On edge devices, cross-encoder rerankers impose prohibitive latency and memory overhead on semantic top-K retrieval tasks such as retrieval-augmented generation (RAG), agent memory, and personalized recommendation, making them a critical end-to-end performance bottleneck. This paper introduces GRATING, a training-free inference system built on the empirical observation that relative rankings stabilize early in intermediate transformer layers, which enables sequence-level dynamic pruning before full inference completes. GRATING further combines a dual-layer sliding window with chunked execution to achieve single-pass forwarding, a global view of all candidates, and I/O-computation overlap. Evaluated on rerankers from 0.6B to 8B parameters, GRATING reduces microbenchmark latency by up to 89.0% and peak memory usage by up to 94.9%. Across three real-world applications, it cuts latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, with zero accuracy degradation.
📝 Abstract
Semantic top-K selection with cross-encoder rerankers underpins on-device AI services such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, opening opportunities to prune before full inference completes.
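The core observation can be illustrated with a short sketch: score candidates from intermediate hidden states layer by layer and stop once the top-K *order* stops changing. This is a toy illustration under stated assumptions, not GRATING's code; the layer stack, `score_head`, and the two-layer stability threshold are all stand-ins.

```python
# Toy sketch of early ranking stability (not GRATING's implementation).
# Random linear layers stand in for a real cross-encoder reranker; the
# scoring head and stability criterion are illustrative assumptions.
import torch

torch.manual_seed(0)
num_candidates, hidden, num_layers, k = 32, 64, 12, 5

# One hidden state per (query, candidate) pair, refined layer by layer.
states = torch.randn(num_candidates, hidden)
toy_layers = [torch.nn.Linear(hidden, hidden) for _ in range(num_layers)]
score_head = torch.nn.Linear(hidden, 1)

prev_topk, stable_for = None, 0
with torch.no_grad():
    for depth, layer in enumerate(toy_layers, start=1):
        states = torch.tanh(layer(states))        # one transformer-like step
        scores = score_head(states).squeeze(-1)   # intermediate-layer scores
        topk = torch.topk(scores, k).indices.tolist()
        # Top-K selection only needs relative order, not exact scores, so we
        # may exit once the top-K order has held for a few consecutive layers.
        stable_for = stable_for + 1 if topk == prev_topk else 0
        prev_topk = topk
        if stable_for >= 2:
            print(f"top-{k} order stable by layer {depth}; prune remaining layers")
            break
    else:
        print("toy run: order kept changing; real rerankers stabilize earlier")
```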
Building on this insight, we propose monolithic forwarding and develop a training-free inference system, GRATING. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via a dual-layer sliding window and chunked execution. We evaluate GRATING against state-of-the-art baselines on rerankers from 0.6B to 8B parameters on Apple M2 and NVIDIA RTX 5070 hardware. GRATING consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, GRATING lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.
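A minimal sketch of the system-level idea follows: advance all surviving candidates one layer at a time in fixed-size chunks (bounding resident memory), keep a global score view, and prune the low-ranked cluster between layers. The fixed `keep_fraction` schedule here is a hypothetical placeholder; the paper's pruning criterion preserves the exact top-K, and its dual-layer sliding window additionally overlaps weight I/O with chunk compute, which this sketch only notes in comments.

```python
# Sketch of monolithic forwarding with progressive cluster pruning
# (assumptions, not GRATING's implementation): chunked per-layer execution
# over a global candidate pool. In the real system, loading the next
# layer's weights would overlap with computing the current chunks.
import torch

torch.manual_seed(0)
num_candidates, hidden, num_layers, k, chunk = 64, 64, 8, 5, 16
keep_fraction = 0.5                      # hypothetical pruning schedule

states = torch.randn(num_candidates, hidden)
layers = [torch.nn.Linear(hidden, hidden) for _ in range(num_layers)]
head = torch.nn.Linear(hidden, 1)
alive = torch.arange(num_candidates)     # global view of candidate ids

with torch.no_grad():
    for layer in layers:
        # Chunked execution: only `chunk` sequences are resident at once.
        outs = [torch.tanh(layer(states[i:i + chunk]))
                for i in range(0, len(states), chunk)]
        states = torch.cat(outs)
        scores = head(states).squeeze(-1)
        # Progressive cluster pruning on the global ranking; never prune
        # below K survivors, so the final top-K set is still produced.
        keep = max(k, int(len(alive) * keep_fraction))
        idx = torch.topk(scores, keep).indices
        states, alive = states[idx], alive[idx]

    final = alive[torch.topk(head(states).squeeze(-1), k).indices]
    print("top-K candidate ids:", final.tolist())
```

Because later layers run over ever-fewer sequences, total compute and peak activation memory both shrink, which is the mechanism behind the latency and memory reductions reported above.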