🤖 AI Summary
In modern distributed systems, the sheer volume of trace and log data incurs heavy storage overhead and complicates fault diagnosis. Existing sample-before-analysis methods often discard failure-relevant signals, which compromises diagnostic transparency. This paper proposes UniSage, the first unified trace and log sampling framework tailored for microservices. It adopts a *post-analysis-aware* paradigm: lightweight multimodal anomaly detection and root cause analysis (RCA) first run on the full data stream, producing service-level diagnostic insights that guide sampling decisions. UniSage combines two complementary pillars: *analysis-guided sampling*, which prioritizes data implicated by RCA, and *edge-case preservation*, which retains rare but potentially diagnostic behaviors. Experiments show that UniSage substantially outperforms state-of-the-art approaches: at a 2.5% sampling rate, it captures 56.5% of critical traces and 96.25% of relevant logs and improves downstream RCA accuracy (AC@1) by 42.45%. Its pipeline also processes 10 minutes of telemetry data in under 5 seconds.
📝 Abstract
Traces and logs are essential for observability and fault diagnosis in modern distributed systems. However, their ever-growing volume introduces substantial storage overhead and complicates troubleshooting. Existing approaches typically adopt a sample-before-analysis paradigm: even when guided by data heuristics, they inevitably discard failure-related information and hinder transparency in diagnosing system behavior. To address this, we introduce UniSage, the first unified framework to sample both traces and logs using a post-analysis-aware paradigm. Instead of discarding data upfront, UniSage first performs lightweight, multi-modal anomaly detection and root cause analysis (RCA) on the complete data stream. This process yields fine-grained, service-level diagnostic insights that guide a dual-pillar sampling strategy for handling both normal and anomalous scenarios: an analysis-guided sampler prioritizes data implicated by RCA, while an edge-case-based sampler ensures rare but critical behaviors are captured. Together, these pillars ensure comprehensive coverage of critical signals without excessive redundancy. Extensive experiments demonstrate that UniSage significantly outperforms state-of-the-art baselines. At a 2.5% sampling rate, it captures 56.5% of critical traces and 96.25% of relevant logs, while improving the accuracy (AC@1) of downstream root cause analysis by 42.45%. Furthermore, its efficient pipeline processes 10 minutes of telemetry data in under 5 seconds, demonstrating its practicality for production environments.
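The abstract does not say how the two samplers share the sampling budget, how RCA output is scored, or how rarity is measured. As a rough illustration of the dual-pillar idea only, and not UniSage's actual algorithm, the Python sketch below assumes that RCA yields a per-service suspicion score, that each trace carries a structural signature, and that a fixed share of the budget is reserved for rare signatures. The names `Trace`, `sample_traces`, `suspect_scores`, `pattern`, and `edge_share` are all hypothetical.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Trace:
    trace_id: str
    services: Tuple[str, ...]  # services the trace passed through
    pattern: str               # hypothetical structural signature, e.g. the call path


def sample_traces(
    traces: List[Trace],
    suspect_scores: Dict[str, float],  # assumed RCA output: service -> suspicion score
    budget: int,
    edge_share: float = 0.3,           # assumed share of the budget kept for rare patterns
    seed: int = 0,
) -> List[Trace]:
    """Keep `budget` traces: most by RCA suspicion, the rest by pattern rarity."""
    budget = min(budget, len(traces))
    edge_budget = int(budget * edge_share)
    guided_budget = budget - edge_budget

    # Pillar 1 (analysis-guided): rank each trace by the most suspicious
    # service it touches and keep the top-ranked ones.
    def suspicion(trace: Trace) -> float:
        return max((suspect_scores.get(s, 0.0) for s in trace.services), default=0.0)

    guided = sorted(traces, key=suspicion, reverse=True)[:guided_budget]

    # Pillar 2 (edge-case preservation): among traces not yet chosen,
    # prefer those whose pattern occurs least often in the whole window.
    chosen = {t.trace_id for t in guided}
    remaining = [t for t in traces if t.trace_id not in chosen]
    pattern_counts = Counter(t.pattern for t in traces)
    random.Random(seed).shuffle(remaining)                   # random order among ties
    remaining.sort(key=lambda t: pattern_counts[t.pattern])  # stable sort keeps that order
    edge = remaining[:edge_budget]

    return guided + edge
```

With `budget=10` and the default `edge_share`, seven slots go to the traces touching the most suspicious services and three go to the rarest remaining patterns. A fixed split is the simplest possible policy; a real system would more plausibly adapt it to whether an anomaly was detected in the current window, which is the kind of decision the abstract attributes to the analysis stage.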