🤖 AI Summary
This work proposes an online LLM-based approach to dependency mocking in microservice testing, addressing the limitations of static methods (record-replay, pattern-mining, and specification-driven stubbing) that fail to capture runtime dynamics and cross-request state. By reading dependency source code, caller context, and production traces during test execution, the LLM answers each dependency request on demand, with no predefined specifications or pre-generated stubs, while maintaining cross-request state throughout a test scenario. Instantiated in MIRAGE, the approach achieves 99% fidelity in status codes and response structures across 110 test scenarios, and caller integration tests produce the same end-to-end pass/fail outcomes as with real dependencies (8/8 scenarios). An ablation shows that dependency source code alone suffices for 100% fidelity, whereas constraining LLM output through typed intermediate representations degrades fidelity on complex stateful services (55%). Results are stable across three LLM families at a cost of $0.16–$0.82 per dependency.
📝 Abstract
Existing approaches to microservice dependency simulation (record-replay, pattern-mining, and specification-driven stubs) generate static artifacts before test execution. We propose online LLM simulation, a runtime approach where the LLM directly answers each dependency request as it arrives, maintaining cross-request state throughout a test scenario. No mock specification is pre-generated; the model reads the dependency's source code, caller code, and production traces, then simulates dependency behavior on demand. We instantiate this approach in MIRAGE and evaluate it on 110 test scenarios spanning 14 caller-dependency pairs across three microservice systems (Google's Online Boutique, Weaveworks' Sock Shop, and a custom system). In white-box mode (dependency source available), MIRAGE achieves 99% status-code fidelity (109/110) and 99% response-shape fidelity, compared to 62% / 16% for record-replay. End-to-end, caller integration tests produce the same pass/fail outcomes with MIRAGE as with real dependencies (8/8 scenarios). A signal ablation shows dependency source code is often sufficient for high-fidelity runtime simulation (100% alone); without it, the model still infers correct error codes (94%) but loses response-structure accuracy (75%). Constraining LLM output through typed intermediate representations reduces fidelity on complex stateful services (55%) while performing adequately on simple APIs (86%), suggesting that the runtime approach's implicit state tracking matters for behavioral complexity. Results are stable across three LLM families (within 3%) at $0.16–$0.82 per dependency.
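The runtime loop the abstract describes (read the dependency's source and caller context, answer each incoming request on demand, and carry cross-request state forward) might be sketched as follows. This is an illustrative assumption, not MIRAGE's actual implementation: the class name, prompt shape, and the stubbed `call_llm` function are all hypothetical stand-ins for a real model API client.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call. A real implementation
    # would send `prompt` to a model and return its completion; the stub
    # returns a fixed JSON response so the sketch is self-contained.
    return json.dumps({"status": 200, "body": {"ok": True}})

class OnlineDependencyMock:
    """Answers each dependency request at test runtime, maintaining
    cross-request state as accumulated request/response history."""

    def __init__(self, dependency_source: str, caller_context: str):
        self.dependency_source = dependency_source
        self.caller_context = caller_context
        self.history: list[dict] = []  # prior exchanges in this scenario

    def handle(self, request: dict) -> dict:
        # Build the prompt from the signals named in the abstract:
        # dependency source, caller context, and the scenario's history.
        prompt = (
            "Simulate this dependency's behavior.\n"
            f"Dependency source:\n{self.dependency_source}\n"
            f"Caller context:\n{self.caller_context}\n"
            f"Previous exchanges:\n{json.dumps(self.history)}\n"
            f"Answer (as JSON) this request:\n{json.dumps(request)}"
        )
        response = json.loads(call_llm(prompt))
        # Recording the exchange lets later requests in the same scenario
        # observe earlier state changes (e.g. an item added to a cart).
        self.history.append({"request": request, "response": response})
        return response

mock = OnlineDependencyMock("def get_cart(user_id): ...", "checkout service")
resp = mock.handle({"method": "GET", "path": "/cart/42"})
```

Because the history grows with every call, state written by one request (say, a POST adding a cart item) is visible in the prompt for the next, which is the implicit state tracking the abstract contrasts with pre-generated static stubs.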