UniSage: A Unified and Post-Analysis-Aware Sampling for Microservices

📅 2025-09-30
🤖 AI Summary
In modern distributed systems, massive volumes of trace and log data incur prohibitive storage overhead and impede fault diagnosis. Existing sample-before-analysis methods often discard failure-relevant signals, which reduces diagnostic transparency. This paper proposes UniSage, the first unified trace and log sampling framework for microservices. It adopts a *post-analysis-aware* paradigm: lightweight multimodal anomaly detection and root cause analysis (RCA) first run on the full data stream and produce service-level diagnostic insights that guide sampling decisions. UniSage combines two complementary pillars: *analysis-guided sampling*, which prioritizes data implicated by RCA, and *edge-case preservation*, which retains rare but potentially diagnostic behaviors. Experiments show that at a 2.5% sampling rate, UniSage captures 56.5% of critical traces and 96.25% of relevant logs, improves downstream RCA accuracy (AC@1) by 42.45%, and processes 10 minutes of telemetry data in under 5 seconds, outperforming state-of-the-art baselines.

📝 Abstract
Traces and logs are essential for observability and fault diagnosis in modern distributed systems. However, their ever-growing volume introduces substantial storage overhead and complicates troubleshooting. Existing approaches typically adopt a sample-before-analysis paradigm: even when guided by data heuristics, they inevitably discard failure-related information and hinder transparency in diagnosing system behavior. To address this, we introduce UniSage, the first unified framework to sample both traces and logs using a post-analysis-aware paradigm. Instead of discarding data upfront, UniSage first performs lightweight and multi-modal anomaly detection and root cause analysis (RCA) on the complete data stream. This process yields fine-grained, service-level diagnostic insights that guide a dual-pillar sampling strategy for handling both normal and anomalous scenarios: an analysis-guided sampler prioritizes data implicated by RCA, while an edge-case-based sampler ensures rare but critical behaviors are captured. Together, these pillars ensure comprehensive coverage of critical signals without excessive redundancy. Extensive experiments demonstrate that UniSage significantly outperforms state-of-the-art baselines. At a 2.5% sampling rate, it captures 56.5% of critical traces and 96.25% of relevant logs, while improving the accuracy (AC@1) of downstream root cause analysis by 42.45%. Furthermore, its efficient pipeline processes 10 minutes of telemetry data in under 5 seconds, demonstrating its practicality for production environments.
Problem

Research questions and friction points this paper is trying to address.

Reducing storage overhead from growing microservices traces and logs
Preventing loss of failure-related information in sampling approaches
Improving accuracy of root cause analysis in distributed systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

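UniSage samples both traces and logs under a post-analysis-aware paradigm
It runs lightweight anomaly detection and RCA on the full data stream before sampling
It uses a dual-pillar strategy (analysis-guided and edge-case-based) to cover normal and anomalous scenarios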
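
The dual-pillar idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Trace` fields, the per-service `suspicion` scores (standing in for RCA output), the `edge_share` budget split, and the use of call-path frequency as a rarity measure are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    services: tuple  # services the trace passed through
    path: str        # hypothetical structural signature, e.g. the call path


def sample_traces(traces, suspicion, rate, edge_share=0.5):
    """Pick about rate * len(traces) traces using two pillars.

    suspicion: hypothetical per-service scores in [0, 1] from a prior RCA step.
    edge_share: assumed fraction of the budget reserved for rare paths.
    """
    budget = max(1, round(rate * len(traces)))
    edge_budget = int(budget * edge_share)
    guided_budget = budget - edge_budget

    # Pillar 1 (analysis-guided): rank by the most suspicious service touched.
    def rca_score(t):
        return max((suspicion.get(s, 0.0) for s in t.services), default=0.0)

    guided = sorted(traces, key=rca_score, reverse=True)[:guided_budget]
    chosen = {t.trace_id for t in guided}

    # Pillar 2 (edge-case): among the remaining traces, prefer the rarest paths.
    path_counts = Counter(t.path for t in traces)
    rest = [t for t in traces if t.trace_id not in chosen]
    rare = sorted(rest, key=lambda t: path_counts[t.path])[:edge_budget]

    return guided + rare
```

Under these assumptions, the analysis runs before any data is dropped: the scores are computed over the full set of traces, and the sampling budget is only then divided between RCA-implicated traces and rare ones.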