🤖 AI Summary
This work addresses the incomplete formal foundations of empirical process theory in contemporary statistical learning, particularly the absence of key theorems and tools from the Lean 4 mathematical library. We present the first systematic formalization of this theory in Lean 4, establishing rigorous machine-checked proofs of the Gaussian Lipschitz concentration inequality and Dudley's entropy integral theorem, while uncovering and correcting implicit assumptions in standard textbooks. By integrating human-designed proof strategies with AI-driven tactic synthesis, we develop an end-to-end verifiable framework that enables sharp convergence rate analyses for least squares and sparse regression. The resulting open-source toolbox provides a reusable foundation for the formal verification of machine learning theory.
📄 Abstract
We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements content missing from the latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, yielding a human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at https://github.com/YuanheZ/lean-stat-learning-theory.
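To give a flavor of the kind of statement the toolbox machine-checks, here is a hedged Lean 4 sketch of Gaussian Lipschitz concentration. All names here (`gaussian_lipschitz_concentration`, the `hμ` hypothesis placeholder) are illustrative assumptions, not the repository's actual declarations, and the proof is elided with `sorry`:

```lean
import Mathlib

open MeasureTheory

/-- Illustrative sketch (not the repository's actual statement):
    a Lipschitz function of a standard Gaussian vector concentrates
    around its mean with a sub-Gaussian tail in the Lipschitz constant. -/
theorem gaussian_lipschitz_concentration
    {n : ℕ} (f : (Fin n → ℝ) → ℝ) (L : ℝ≥0) (hL : 0 < L)
    (hf : LipschitzWith L f)
    (μ : Measure (Fin n → ℝ))
    (hμ : True)  -- placeholder for "μ is the standard Gaussian measure"
    (t : ℝ) (ht : 0 < t) :
    μ {x | t ≤ f x - ∫ y, f y ∂μ}
      ≤ ENNReal.ofReal (Real.exp (-t ^ 2 / (2 * (L : ℝ) ^ 2))) := by
  sorry
```

The actual formalization in the linked repository states and proves this against Mathlib's Gaussian measure API; this sketch only conveys the shape of the result.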