🤖 AI Summary
Network configuration errors frequently cause severe service outages, yet existing automated repair tools scale poorly because they depend on complex semantic modeling of the network. This paper brings the syntax-driven paradigm from automatic program repair to network configuration repair and proposes Astragalus, a multi-round "localize-fix-validate" method that requires no explicit model of network semantics. A code-grafting strategy lets it explore the repair space efficiently. With 15 types of errors injected, Astragalus repairs every incident on synthesized networks of multiple sizes and 97.5% of incidents on a real network, in 7.36 seconds on average. For four recent incidents or undesired changes in a production network, it produced valid repair options in under six minutes. These results indicate better generality and scalability than semantic-driven tools.
📝 Abstract
Network configurations are prone to errors, which can lead to catastrophic service outages. A tool that can achieve automatic configuration repair (ACR) is highly desired by operators. Existing tools for ACR follow a semantic-driven approach: they model network semantics as a set of SMT constraints, and solve them for a location or fix of the error. Due to the complex semantics of networks, constructing and solving these constraints can be prohibitively expensive, making these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction, i.e., a syntax-driven approach, which tries to repair program bugs by "grafting" some existing code in the same repository, without modeling program semantics. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It uses multiple iterations of a "localize-fix-validate" pipeline to search for repairs, and proves quite effective on configurations of our production network. Specifically, we show that Astragalus can repair every incident in multiple sizes of a synthesized network, and 97.5% of the incidents on a real network, both with 15 types of errors injected, within an average time of 7.36 seconds. It has also provided valid repair options in under 6 minutes for 4 recent network incidents or undesired changes, in a real production network with O(1,000) to O(10,000) devices.
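To make the "localize-fix-validate" idea concrete, here is a minimal toy sketch in Python. It is not Astragalus's actual algorithm: the config representation (a dict of device name to lines), the function names `localize`, `donor_lines`, `graft`, and `repair`, and the `validate` check are all hypothetical and assumed only for illustration. Each round picks a suspect line, replaces it with a line grafted from another device, and asks a validator whether the intent now holds.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Config = Dict[str, List[str]]  # hypothetical model: device name -> config lines


def localize(config: Config, suspects: List[str]) -> Iterator[Tuple[str, int]]:
    """Yield (device, line index) candidates, suspected devices first."""
    ordered = suspects + [d for d in config if d not in suspects]
    for device in ordered:
        for index in range(len(config[device])):
            yield device, index


def donor_lines(config: Config, device: str) -> List[str]:
    """Collect lines from other devices that `device` does not already have."""
    donors: List[str] = []
    for other, lines in config.items():
        if other == device:
            continue
        for line in lines:
            if line not in donors and line not in config[device]:
                donors.append(line)
    return donors


def graft(config: Config, device: str, index: int, line: str) -> Config:
    """Return a copy of `config` with one line replaced by a donor line."""
    patched = {d: list(lines) for d, lines in config.items()}
    patched[device][index] = line
    return patched


def repair(config: Config, validate: Callable[[Config], bool],
           suspects: List[str], max_rounds: int = 2) -> Optional[Config]:
    """Search for a repair; each round applies one more graft per candidate."""
    frontier = [config]
    for _ in range(max_rounds):
        next_frontier: List[Config] = []
        for candidate in frontier:
            for device, index in localize(candidate, suspects):      # localize
                for line in donor_lines(candidate, device):
                    patched = graft(candidate, device, index, line)  # fix
                    if validate(patched):                            # validate
                        return patched
                    next_frontier.append(patched)
        frontier = next_frontier
    return None


# Toy network: r2 wrongly denies a prefix that r1 and r3 permit.
config = {
    "r1": ["route 10.0.0.0/24 via r2", "permit 10.0.0.0/24"],
    "r2": ["route 10.0.0.0/24 via r3", "deny 10.0.0.0/24"],
    "r3": ["route 10.0.0.0/24 via host", "permit 10.0.0.0/24"],
}


def validate(cfg: Config) -> bool:
    """Stand-in for a real verifier: every device must permit the prefix."""
    return all("permit 10.0.0.0/24" in lines and "deny 10.0.0.0/24" not in lines
               for lines in cfg.values())


fixed = repair(config, validate, suspects=["r2"])
```

In this toy run the search replaces the `deny` line on `r2` with the `permit` line borrowed from a neighbor. The point of the sketch is that the search only manipulates text and calls a black-box validator; it never builds a model of routing semantics. A real system would need a far better localizer and an actual network verifier in place of `validate`.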