Gleaner: A Semantically-Rich and Efficient Online Sampler for Microservice Diagnostics

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational overhead of existing graph-based tail sampling methods for distributed tracing in microservice systems, which prevents their online deployment. The authors propose Gleaner, an efficient online tail-sampling framework that avoids explicit graph modeling and instead represents each trace as a "bag-of-edges" augmented with log semantics. By adding an alarm-driven dynamic quota allocation mechanism and a diversity-preserving strategy, the approach shifts the sampling paradigm from data reduction to signal enhancement. At a 1% sampling rate, root cause analysis (RCA) accuracy exceeds that of the next-best sampler by 42%-107% and is higher than RCA on the full, unsampled dataset, while processing takes only 0.74 ms per trace. Gleaner also improves trace pattern coverage by up to 128.7% and Shannon entropy by up to 32.9% over baselines.
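The two diversity metrics named in the summary can be illustrated with a minimal sketch. It assumes the standard definitions: Shannon entropy over the frequency distribution of trace patterns in the sample, and coverage as the fraction of distinct patterns in the full dataset that also appear in the sample. The paper's exact definitions are not given here, and the function names `shannon_entropy` and `pattern_coverage` are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(patterns: Iterable[Hashable]) -> float:
    """Entropy (in bits) of the pattern frequency distribution.

    Assumption: each trace has already been mapped to a hashable
    pattern label; how Gleaner derives patterns is not shown here.
    """
    counts = Counter(patterns)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pattern_coverage(sampled: Iterable[Hashable], full: Iterable[Hashable]) -> float:
    """Fraction of distinct patterns in the full data that the sample retains."""
    full_set = set(full)
    if not full_set:
        return 0.0
    return len(set(sampled) & full_set) / len(full_set)


if __name__ == "__main__":
    full = ["a", "a", "a", "b", "c", "d"]
    sample = ["a", "b"]
    print(shannon_entropy(sample))         # entropy of the sampled patterns
    print(pattern_coverage(sample, full))  # share of patterns kept
```

Under these definitions, a sampler that keeps rare patterns instead of many copies of the dominant one raises both numbers, which is the behavior the reported gains describe.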

📝 Abstract

Distributed tracing in microservices is critical for diagnostics but generates overwhelming data volumes, necessitating intelligent sampling. To maximize fidelity, state-of-the-art (SOTA) tail-based samplers analyze complete (or even log-enriched) traces by modeling them as graphs. However, this reliance on computationally expensive graph analysis creates a performance bottleneck that prohibits their use in online settings. To this end, we propose Gleaner, an online tail-sampling framework that breaks this trade-off. It is founded on the key insight that explicit graph structures are unnecessary for high-fidelity trace grouping. Instead, Gleaner represents each trace as a "bag-of-edges" augmented with log semantics, replacing slow graph algorithms with highly efficient set-based operations. It also employs an alarm-driven quota and a diversity-preserving strategy to prioritize anomalous and rare traces for downstream Root Cause Analysis (RCA). Experimentally, Gleaner processes traces at 0.74ms each, improving Trace Pattern Coverage by up to 128.7% and Shannon Entropy by up to 32.9% over baselines. At just a 1% sampling rate, Gleaner improves RCA accuracy by 42%-107% over the next-best sampler. Moreover, RCA on Gleaner's sampled data is more accurate than with the entire, unsampled dataset. This result reframes intelligent sampling from a data reduction technique to a powerful signal enhancement paradigm for automated operations.
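The idea of replacing graph algorithms with set operations can be sketched as follows. This is an illustration under stated assumptions, not Gleaner's implementation: the span record layout (`id`, `parent`, `service`, `log`), the edge tuple format, and the use of Jaccard similarity are all hypothetical choices made for the example, and a plain set is used where the paper may keep edge multiplicities.

```python
from typing import Dict, FrozenSet, List, Tuple

Span = Dict[str, str]        # hypothetical record: id, parent (optional), service, log (optional)
Edge = Tuple[str, str, str]  # (caller service, callee service, log token)


def bag_of_edges(spans: List[Span]) -> FrozenSet[Edge]:
    """Flatten a trace into a set of caller->callee edges tagged with a log token.

    No graph object is built: each span contributes one edge, found by
    looking up its parent span in a dictionary.
    """
    by_id = {s["id"]: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.get("parent", ""))
        caller = parent["service"] if parent else "<root>"
        edges.add((caller, s["service"], s.get("log", "")))
    return frozenset(edges)


def jaccard(a: FrozenSet[Edge], b: FrozenSet[Edge]) -> float:
    """Set-based similarity between two traces (assumed metric for this sketch)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    healthy = [
        {"id": "1", "service": "gateway"},
        {"id": "2", "parent": "1", "service": "cart"},
        {"id": "3", "parent": "2", "service": "db"},
    ]
    failing = [
        {"id": "1", "service": "gateway"},
        {"id": "2", "parent": "1", "service": "cart"},
        {"id": "3", "parent": "2", "service": "db", "log": "timeout"},
    ]
    print(jaccard(bag_of_edges(healthy), bag_of_edges(failing)))
```

The two example traces share the same call structure, yet the log token on the `cart` to `db` edge makes their edge sets differ, so a set comparison alone separates the failing trace from the healthy one.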

Problem

Research questions and friction points this paper is trying to address.

microservice diagnostics

distributed tracing

online sampling

tail-based sampling

trace analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

online tail sampling

bag-of-edges

log-augmented trace representation