🤖 AI Summary
This study addresses two problems facing multi-agent large language model tutoring systems under high concurrency: latencies that compound across concurrent agent calls, and a lack of empirical guidance for service-tier selection. The authors instrument ITAS, a four-agent system built on Gemini 2.5 Flash and Google Vertex AI, and run more than 3,000 requests across three service tiers and eleven concurrency levels of up to 50 simultaneous users to quantify the parallel-phase maximum effect, in which a phase of concurrent calls takes as long as its slowest call. Results show that Priority PayGo holds response times under four seconds across the full load range, including at 50 concurrent users, while Standard PayGo degrades substantially under classroom-scale loads. Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users. Both pay-per-token tiers cost less per student per semester than a typical STEM textbook, even under a worst-case usage ceiling; Provisioned Throughput is expensive under continuous provisioning but becomes cost-competitive at high utilization. On this basis, the authors propose tier-selection guidance that accounts for deployment scale.
📝 Abstract
Multi-agent LLM tutoring systems improve response quality through agent specialization, but each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face. We instrument ITAS, a four-agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment. Priority PayGo maintains flat sub-4-second response times across the full load range; Standard PayGo degrades substantially under classroom-scale concurrency; and Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users. Cost analysis places both pay-per-token tiers well below the price of a STEM textbook per student per semester under a worst-case usage ceiling. Provisioned Throughput, expensive under continuous provisioning, becomes cost-competitive for institutions that can predict and concentrate their traffic toward high utilization. These results provide concrete tier-selection guidance across deployment scales from a single seminar to a university-wide rollout.
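The parallel-phase maximum effect mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' instrumentation: it assumes each API call's latency is an independent log-normal draw (a common but here purely illustrative choice), and the names `simulate_query_latency` and `mean_latency`, the 1-second median, and the spread parameter are all hypothetical. It shows only that when a phase waits for every concurrent call to return, the expected phase latency rises with the number of parallel calls even though each individual call is unchanged.

```python
import math
import random
import statistics


def simulate_query_latency(n_parallel_agents, rng, median_s=1.0, sigma=0.5):
    """Latency of one parallel phase: the slowest of n concurrent calls.

    Per-call latencies are drawn from a log-normal distribution.
    median_s and sigma are illustrative assumptions, not measured values.
    """
    calls = [
        rng.lognormvariate(math.log(median_s), sigma)
        for _ in range(n_parallel_agents)
    ]
    # The phase completes only when every call has returned.
    return max(calls)


def mean_latency(n_parallel_agents, trials=20000, seed=0):
    """Average phase latency over many simulated queries."""
    rng = random.Random(seed)
    return statistics.fmean(
        simulate_query_latency(n_parallel_agents, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    # Compare a single call against phases with more concurrent calls.
    for n in (1, 2, 4):
        print(f"{n} parallel call(s): mean phase latency {mean_latency(n):.2f} s")
```

Under these assumptions, a single-agent system pays the average call latency, while a multi-agent system pays something closer to the tail of the per-call distribution. That is why a tier whose tail latency grows under load would be expected to hurt a multi-agent pipeline more than a single-agent one.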