AI Summary
This work proposes LeJOT-AutoML, the first AutoML framework integrating a large language model (LLM) agent with the Model Context Protocol to address the challenge of job runtime prediction for cloud cost optimization in Databricks. Existing approaches rely on static, handcrafted features and fail to capture dynamic runtime effects such as data skew and partition pruning, while also suffering from lengthy iteration cycles. LeJOT-AutoML overcomes these limitations by leveraging retrieval-augmented generation (RAG) in coordination with a log parser, a metadata query engine, and a read-only SQL sandbox to automatically construct and validate runtime-aware features. The framework enables end-to-end automated feature engineering, generating over 200 features on enterprise-scale workloads within 20-30 minutes, reducing the feature-engineering cycle from weeks to minutes. When integrated into production pipelines, it achieves a 19.01% reduction in cloud costs while maintaining high prediction accuracy.
Abstract
Databricks job orchestration systems (e.g., LeJOT) reduce cloud costs by selecting low-priced compute configurations while meeting latency and dependency constraints. Accurate execution-time prediction under heterogeneous instance types and non-stationary runtime conditions is therefore critical. Existing pipelines rely on static, manually engineered features that under-capture runtime effects (e.g., partition pruning, data skew, and shuffle amplification), and predictive signals are scattered across logs, metadata, and job scripts, lengthening update cycles and increasing engineering overhead. We present LeJOT-AutoML, an agent-driven AutoML framework that embeds large language model agents throughout the ML lifecycle. LeJOT-AutoML combines retrieval-augmented generation over a domain knowledge base with a Model Context Protocol toolchain (log parsers, metadata queries, and a read-only SQL sandbox) to analyze job artifacts, synthesize and validate feature-extraction code via safety gates, and train and select predictors. This design materializes runtime-derived features that are difficult to obtain through static analysis alone. On enterprise Databricks workloads, LeJOT-AutoML generates over 200 features and reduces the feature-engineering and evaluation loop from weeks to 20-30 minutes, while maintaining competitive prediction accuracy. Integrated into the LeJOT pipeline, it enables automated continuous model updates and achieves 19.01% cost savings in our deployment setting through improved orchestration.
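The abstract's safety gate for agent-synthesized feature code can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the names `FeatureCandidate`, `passes_safety_gate`, and `validate_candidates` are hypothetical, and a production gate would parse SQL properly rather than keyword-match, but the shape is the same: agent proposals are filtered to read-only queries before any sandbox execution.

```python
import re
from dataclasses import dataclass


@dataclass
class FeatureCandidate:
    """An agent-proposed feature: a name plus the SQL that computes it (illustrative)."""
    name: str
    sql: str


# Statements that would mutate state; a read-only sandbox must reject them.
_WRITE_OPS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|MERGE|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)


def passes_safety_gate(candidate: FeatureCandidate) -> bool:
    """Admit a candidate only if its SQL contains no write operations."""
    return _WRITE_OPS.search(candidate.sql) is None


def validate_candidates(candidates: list[FeatureCandidate]) -> list[FeatureCandidate]:
    """Filter agent proposals down to those safe to run in the read-only sandbox."""
    return [c for c in candidates if passes_safety_gate(c)]


if __name__ == "__main__":
    proposals = [
        FeatureCandidate(
            "shuffle_read_gb",
            "SELECT job_id, SUM(shuffle_read_bytes)/1e9 FROM task_metrics GROUP BY job_id",
        ),
        FeatureCandidate("unsafe", "DROP TABLE task_metrics"),
    ]
    accepted = validate_candidates(proposals)
    print([c.name for c in accepted])  # only the read-only query survives
```

In the full pipeline, surviving candidates would then be executed in the sandbox and scored before being admitted to the feature set; the gate shown here is only the first, static check.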