LogSieve: Task-Aware CI Log Reduction for Sustainable LLM-Based Analysis

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of verbose continuous integration (CI) logs, which impose substantial computational overhead and environmental cost on both human review and large language model (LLM) analysis. To this end, we propose LogSieve, a lightweight, root cause analysis (RCA)-oriented log reduction method that integrates task awareness with semantic fidelity. LogSieve employs an embedding-based classifier to automatically identify low-information log lines, matching human filtering decisions with 97% accuracy. Evaluated on GitHub Actions logs from 20 open-source Android projects, LogSieve reduces log volume by 42% and token count by 40% on average while preserving high semantic consistency (cosine similarity: 0.93; GPTScore: 0.93; exact-match accuracy: 80%), thereby significantly lowering LLM inference energy consumption.
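The filtering step described above can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the paper's implementation: LogSieve uses learned sentence embeddings and a trained classifier, whereas this sketch substitutes a dependency-free bag-of-words vector and a similarity threshold against hand-picked noise prototypes.

```python
import math
import re
from collections import Counter

def embed(line):
    # Toy bag-of-words "embedding": a sparse token-count vector.
    # (Stand-in for the learned sentence embeddings used in the paper.)
    return Counter(re.findall(r"\w+", line.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def reduce_log(lines, noise_prototypes, threshold=0.6):
    # Drop any line that closely resembles a known low-information
    # prototype (download chatter, progress ticks, ...); keep the rest
    # for downstream root cause analysis.
    protos = [embed(p) for p in noise_prototypes]
    return [
        line for line in lines
        if max((cosine(embed(line), p) for p in protos), default=0.0) < threshold
    ]

# Hypothetical CI log excerpt: download noise plus a build failure.
log = [
    "Downloading gradle-8.0-bin.zip ...",
    "Downloading kotlin-compiler.zip ...",
    "FAILURE: Build failed with an exception.",
    "> Task :app:compileDebugKotlin FAILED",
]
reduced = reduce_log(log, ["Downloading gradle-7.6-bin.zip ..."])
```

The key design point mirrored here is task awareness: lines are not ranked by structure or frequency alone, but by semantic closeness to content known to be irrelevant for failure diagnosis, so failure-bearing lines survive the reduction.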

📝 Abstract
Logs are essential for understanding Continuous Integration (CI) behavior, particularly for diagnosing build failures and performance regressions. Yet their growing volume and verbosity make both manual inspection and automated analysis increasingly costly, time-consuming, and environmentally burdensome. While prior work has explored log compression, anomaly detection, and LLM-based log analysis, most efforts target structured system logs rather than the unstructured, noisy, and verbose logs typical of CI workflows. We present LogSieve, a lightweight, root cause analysis (RCA)-aware and semantics-preserving log reduction technique that filters low-information lines while retaining content relevant to downstream reasoning. Evaluated on CI logs from 20 open-source Android projects using GitHub Actions, LogSieve achieves an average 42% reduction in lines and 40% reduction in tokens with minimal semantic loss. This pre-inference reduction lowers computational cost and can proportionally reduce energy use (and associated emissions) by decreasing the volume of data processed during LLM inference. Compared with structure-first baselines (LogZip and random-line removal), LogSieve preserves much higher semantic and categorical fidelity (Cosine = 0.93, GPTScore = 0.93, 80% exact-match accuracy). Embedding-based classifiers automate relevance detection with near-human accuracy (97%), enabling scalable and sustainable integration of semantics-aware filtering into CI workflows. LogSieve thus bridges log management and LLM reasoning, offering a practical path toward greener and more interpretable CI automation.
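The fidelity figures in the abstract (Cosine = 0.93, GPTScore = 0.93) compare representations of the original and reduced logs. A minimal sketch of such a document-level cosine check, using a plain bag-of-words vector as an assumed stand-in for the paper's embedding models:

```python
import math
import re
from collections import Counter

def doc_vector(lines):
    # Token-count vector over the entire log (embedding stand-in).
    return Counter(t for line in lines for t in re.findall(r"\w+", line.lower()))

def fidelity(original, reduced):
    # Cosine similarity between the full log and its reduced version;
    # values near 1.0 suggest little semantic content was discarded.
    a, b = doc_vector(original), doc_vector(reduced)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

An identical log scores 1.0, and the score decays as more content is dropped, which is how a semantics-preserving reducer can be tuned against a compression-only baseline.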
Problem

Research questions and friction points this paper is trying to address.

CI logs
log reduction
LLM-based analysis
sustainability
unstructured logs
Innovation

Methods, ideas, or system contributions that make the work stand out.

log reduction
LLM-based analysis
semantic preservation
CI logs
sustainable AI