🤖 AI Summary
To address low testing and debugging efficiency and immature toolchains when deploying rapidly evolving large language models (LLMs) on emerging platforms (e.g., browsers and mobile devices), this paper proposes TapML, a top-down, test-driven framework. Methodologically, TapML introduces (1) an operator-wise test-carving technique that automatically generates high-coverage, realistic test inputs; (2) a progressive cross-platform migration strategy that significantly narrows the debugging scope for compound errors; and (3) support for emerging backends such as Metal and WebGPU, with deep integration into MLC-LLM. Over two years, TapML has enabled the efficient deployment of 105 emerging models, spanning 27 distinct architectures, across 5 emerging platforms, reducing average deployment time by 42%. It has since become the default development paradigm for MLC-LLM.
📝 Abstract
While existing machine learning (ML) frameworks focus on established platforms, such as running CUDA on server-grade GPUs, there is growing demand to enable emerging AI applications in a broader set of scenarios, such as running Large Language Models (LLMs) within browsers and on mobile phones. However, deploying emerging models on new platforms (such as Metal and WebGPU) presents significant software engineering challenges due to rapid model evolution and the limited tooling and practices for these platforms. Previous practice for ML model deployment often follows a bottom-up fashion: engineers first implement the individual required operators and then put them together. However, this traditional development approach fails to meet the productivity requirements of deploying emerging ML applications, with testing and debugging as the bottleneck. To this end, we introduce TapML, a top-down approach designed to streamline model deployment on diverse platforms. While the traditional bottom-up approach requires crafting manual tests, TapML automatically creates high-quality, realistic test data through operator-wise test carving. Furthermore, TapML uses a migration-based strategy to gradually offload the model implementation from the mature source platform to the target platform, minimizing the debugging scope of compound errors. TapML has been used as the default development method in the MLC-LLM project to deploy emerging ML models. Within 2 years, TapML has accelerated the deployment of 105 emerging models of 27 model architectures across 5 emerging platforms. We show that TapML effectively boosts developer productivity while ensuring the quality of deployed models. Furthermore, we summarize comprehensive case studies from our real-world development, offering best practices for developing emerging ML systems.
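The idea of operator-wise test carving can be illustrated with a minimal, self-contained sketch. All names here are hypothetical for illustration and are not the TapML API: the model is a toy pipeline of two operators, and carving records each operator's real input and output during one reference run on the trusted source platform, yielding realistic unit-test cases to replay against the target-platform implementations.

```python
# Hypothetical sketch of operator-wise test carving (illustrative names,
# not the TapML API). Reference (source-platform) operators:
def relu(x):
    return [max(v, 0.0) for v in x]

def scale(x):
    return [v * 2.0 for v in x]

def carve(ops, model_input):
    """Run the operator pipeline once, carving an (input, output) case per op."""
    cases, x = [], model_input
    for op in ops:
        y = op(x)
        cases.append((op.__name__, x, y))  # realistic carved data, not synthetic
        x = y
    return cases

# Target-platform implementations under test (here intentionally equivalent,
# standing in for, e.g., a Metal or WebGPU kernel).
target_impls = {"relu": relu, "scale": scale}

cases = carve([relu, scale], [-1.0, 0.5, 3.0])
for name, inp, expected in cases:
    assert target_impls[name](inp) == expected  # operator-level check
```

Because each carved case pins down a single operator with data drawn from a real model execution, a failing replay points directly at the misbehaving target-platform kernel.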
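The migration-based strategy can be sketched in the same toy setting. Again, all names are hypothetical and not the TapML API: the model runs as a hybrid pipeline in which each operator is dispatched to the target backend only once it has been migrated, so a regression after a migration step is localized to the newest operator instead of being compounded across the whole model.

```python
# Hypothetical sketch of migration-based offloading (illustrative names,
# not the TapML API). Source-platform reference operators:
def source_relu(x):
    return [max(v, 0.0) for v in x]

def source_scale(x):
    return [v * 2.0 for v in x]

# New target-backend implementation under validation:
def target_relu(x):
    return [v if v > 0.0 else 0.0 for v in x]

SOURCE_OPS = [("relu", source_relu), ("scale", source_scale)]
TARGET_OPS = {"relu": target_relu}

def run_hybrid(migrated, x):
    """Run the pipeline, offloading already-migrated operators to the target."""
    for name, src_fn in SOURCE_OPS:
        fn = TARGET_OPS[name] if name in migrated else src_fn
        x = fn(x)
    return x

# Validate each migration step against the all-source reference output.
reference = run_hybrid(set(), [-1.0, 0.5, 3.0])
assert run_hybrid({"relu"}, [-1.0, 0.5, 3.0]) == reference
```

Migrating one operator at a time keeps the end-to-end model runnable throughout, so developers always have a working baseline to diff against rather than debugging a fully rewritten model at once.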