AI Summary
This work addresses the challenge of accurately attributing wireless performance fluctuations to specific code changes in rapidly evolving O-RAN software, where random factors such as channel conditions and interference obscure causal relationships. To this end, we propose RANalyzer, the first automated evaluation framework that integrates large language model (LLM)-based semantic analysis with statistical residual modeling. RANalyzer leverages LLMs to classify code modifications at both protocol-layer and functional-component granularity and employs residual analysis to distinguish environmental noise from genuine performance regressions. The approach enables interpretable attribution of software change impacts and introduces a large-scale O-RAN evaluation dataset encompassing 69 software versions and over 8,600 over-the-air tests. Experimental results demonstrate successful identification of multiple performance degradations caused by specific code changes, and the framework is designed for seamless integration into CI/CD/CT pipelines to support efficient, continuous RAN regression detection.
Abstract
Software-driven O-RAN architectures enable rapid innovation through frequent, independent updates to virtualized components. However, attributing performance variations to specific software changes is challenging due to the stochastic nature of wireless systems, where channel conditions, interference, and hardware variability confound analysis. Traditional threshold-based monitoring and manual troubleshooting do not scale with modern software evolution.
This paper presents RANalyzer, an automated test analysis framework that quantifies the performance impact of software updates beyond what can be explained by wireless channel conditions. RANalyzer combines LLM-assisted semantic extraction with residual analysis. The former categorizes code changes by affected protocol layers and functional components, while the latter provides insight into how load, channel conditions, and code changes each affect test performance. We contribute an extensive dataset collected over more than two years of continuous over-the-air testing on an experimental O-RAN testbed, comprising over 8,600 automated tests across 69 releases of the OAI stack. By modeling expected performance and interpreting deviations as software-induced effects, we identify degraded instances attributable to code changes and correlate them with specific change categories. The framework can be integrated into CI/CD/CT pipelines for automated, continuous evaluation of software updates at scale.
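The residual idea described above can be illustrated with a minimal sketch. It is not RANalyzer's actual model: the abstract does not specify the regression form, features, or decision rule, so the linear model, the SNR and load features, the z-score threshold, and all function names below are assumptions chosen for illustration, and the data is synthetic. The sketch fits expected throughput on a baseline release, then flags later releases whose mean residual falls significantly below the baseline.

```python
import numpy as np

def design_matrix(snr_db, load):
    # Assumed features: channel quality (SNR), offered load, and an intercept.
    return np.column_stack([snr_db, load, np.ones(len(snr_db))])

def fit_expected_throughput(snr_db, load, throughput):
    # Ordinary least squares: throughput ~ a*snr + b*load + c.
    coef, *_ = np.linalg.lstsq(design_matrix(snr_db, load), throughput, rcond=None)
    return coef

def residuals(coef, snr_db, load, throughput):
    # Residual = measured performance minus what channel and load explain.
    return throughput - design_matrix(snr_db, load) @ coef

def flag_regressions(versions, resid, baseline, z_threshold=4.0):
    # Flag releases whose mean residual is far below the baseline's,
    # measured in standard errors of the mean (hypothetical decision rule).
    base = resid[versions == baseline]
    mu, sigma = base.mean(), base.std(ddof=1)
    flagged = []
    for v in np.unique(versions):
        if v == baseline:
            continue
        r = resid[versions == v]
        z = (r.mean() - mu) / (sigma / np.sqrt(len(r)))
        if z < -z_threshold:
            flagged.append(str(v))
    return flagged

# Synthetic data: three releases, 200 tests each; "v3" loses 10 Mbps
# regardless of channel conditions (a simulated software regression).
rng = np.random.default_rng(0)
n = 200
versions = np.repeat(np.array(["v1", "v2", "v3"]), n)
snr_db = rng.uniform(5.0, 30.0, size=3 * n)
load = rng.uniform(0.1, 1.0, size=3 * n)
throughput = 2.0 * snr_db + 20.0 * load + 5.0 + rng.normal(0.0, 2.0, size=3 * n)
throughput[versions == "v3"] -= 10.0

is_base = versions == "v1"
coef = fit_expected_throughput(snr_db[is_base], load[is_base], throughput[is_base])
resid = residuals(coef, snr_db, load, throughput)
flagged = flag_regressions(versions, resid, baseline="v1")
print(flagged)
```

Fitting on the baseline release only keeps a later regression from leaking into the expected-performance model. In a CI/CD/CT setting, the flagged releases could then be cross-referenced with the LLM-derived change categories.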