🤖 AI Summary
To address challenges in the large model era, including inefficient model discovery, opaque provenance, version fragmentation, and inconsistent evaluation, this paper proposes the “Model Lake” paradigm, which adapts data lake principles to model governance. The paper formalizes key model lake tasks, including model attribution (provenance), versioning, search, and benchmarking, and considers how data management techniques such as metadata modeling, lineage tracking, and multidimensional indexing can support them. The goal is to reduce reliance on manually written documentation, which is often incomplete or unreliable, and thereby improve model traceability, reproducibility, and auditability. The work lays out a framework and task taxonomy for managing large models and discusses the open research challenges that remain.
📝 Abstract
Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how they differ from one another. Currently, practitioners rely on manually written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, differentiating, and understanding models become increasingly important. Inspired by research on data lakes, we introduce the concept of model lakes. We formalize key model lake tasks, including model attribution, versioning, search, and benchmarking, and discuss fundamental research challenges in the management of large models. We also explore what data management techniques can be brought to bear on the study of large model management.
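To make the tasks named above more concrete, the following is a minimal Python sketch of how a model lake might store per-model metadata and answer search and lineage queries. It is an illustration under stated assumptions, not the paper's formalization: the `ModelRecord` and `ModelLake` classes, their fields, the model names, the benchmark name `toy-bench`, and all scores are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelRecord:
    """Hypothetical metadata entry for one model in the lake."""
    name: str
    task: str
    version: str
    parent: Optional[str] = None  # name of the model this one was derived from
    scores: dict[str, float] = field(default_factory=dict)  # benchmark -> score


class ModelLake:
    """Toy in-memory catalog supporting search and lineage queries."""

    def __init__(self) -> None:
        self._records: dict[str, ModelRecord] = {}

    def register(self, record: ModelRecord) -> None:
        self._records[record.name] = record

    def search(self, task: str, benchmark: str, min_score: float) -> list[str]:
        """Names of models for `task` scoring at least `min_score`, best first."""
        hits = [
            r for r in self._records.values()
            if r.task == task
            and r.scores.get(benchmark, float("-inf")) >= min_score
        ]
        hits.sort(key=lambda r: r.scores[benchmark], reverse=True)
        return [r.name for r in hits]

    def lineage(self, name: str) -> list[str]:
        """Follow parent links from `name` back to the earliest registered ancestor."""
        chain: list[str] = []
        seen: set[str] = set()
        current: Optional[str] = name
        # Stop at an unregistered parent or if a cycle is detected.
        while current is not None and current in self._records and current not in seen:
            chain.append(current)
            seen.add(current)
            current = self._records[current].parent
        return chain


# Usage with made-up models and scores.
lake = ModelLake()
lake.register(ModelRecord("base-lm", "text-generation", "1.0",
                          scores={"toy-bench": 0.61}))
lake.register(ModelRecord("base-lm-ft", "text-generation", "1.1",
                          parent="base-lm", scores={"toy-bench": 0.72}))
lake.register(ModelRecord("tiny-clf", "classification", "0.3",
                          scores={"toy-bench": 0.90}))

print(lake.search("text-generation", "toy-bench", 0.6))
print(lake.lineage("base-lm-ft"))
```

In this sketch, provenance is recorded by hand through the `parent` field. The setting the abstract describes is harder, since documentation may be missing or unreliable, so attribution and versioning would have to be inferred rather than declared.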