🤖 AI Summary
This work addresses the challenge of assessing training data quality for code large language models (code-LLMs). To overcome the semantic blindness of conventional statistical methods, we propose a customizable dataset profiling method based on programming-language concepts. Our approach introduces a hybrid paradigm that combines LLM-assisted rule learning with deterministic online analysis: an LLM distills syntactic and semantic rules from code, which are then instantiated via abstract syntax tree (AST) parsing and generalized across programming languages, including low-resource and previously unseen ones, to enable automated extraction and expansion of syntactic structures and semantic concepts. Experiments show a mean accuracy of 90.33% for syntactic extraction rules, with semantic classification accuracies averaging 80% across languages and 77% across semantic concepts. The method improves the depth of code data understanding, the precision of quality assessment, and the efficiency of data governance.
📝 Abstract
Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of code datasets for large language models (code-LLMs), where data quality directly influences tasks such as code generation and summarization. Characterizing code datasets in terms of programming language concepts enables better insights and targeted data curation. Our proposed methodology decomposes code data profiling into two phases: (1) an offline phase in which LLMs are leveraged to derive and learn rules for extracting syntactic and semantic concepts across various programming languages, including previously unseen or low-resource languages, and (2) an online deterministic phase applying these derived rules for efficient real-time analysis. This hybrid approach is customizable, extensible to new syntactic and semantic constructs, and scalable to multiple languages. Experimentally, our LLM-aided method achieves a mean accuracy of 90.33% for syntactic extraction rules, with semantic classification accuracies averaging 80% across languages and 77% across semantic concepts.
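The two-phase design described above can be illustrated with a minimal sketch of the online deterministic phase. Here the syntactic-concept rule table is hand-written for Python and merely stands in for the LLM-derived rules the offline phase would produce; the rule format, concept names, and `profile_source` function are illustrative assumptions, not the paper's actual interface. Each rule maps a language concept to the AST node types that realize it, and a single AST walk tallies concept occurrences:

```python
import ast
from collections import Counter

# Hypothetical output of the offline phase: each syntactic concept is
# mapped to the AST node types that realize it in Python. In the paper's
# pipeline, an LLM would distill such rules per language.
SYNTACTIC_RULES = {
    "loop": (ast.For, ast.While),
    "conditional": (ast.If, ast.IfExp),
    "function_def": (ast.FunctionDef, ast.AsyncFunctionDef),
    "exception_handling": (ast.Try,),
    "comprehension": (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp),
}

def profile_source(source: str) -> Counter:
    """Online deterministic phase: parse once, then tally how often each
    syntactic concept occurs in the code sample."""
    tree = ast.parse(source)
    counts = Counter()
    for node in ast.walk(tree):
        for concept, node_types in SYNTACTIC_RULES.items():
            if isinstance(node, node_types):
                counts[concept] += 1
    return counts

sample = """
def mean(xs):
    if not xs:
        return 0.0
    total = 0
    for x in xs:
        total += x
    return total / len(xs)
"""
print(dict(profile_source(sample)))
```

Because the online phase is pure AST matching with no LLM in the loop, it is deterministic and cheap enough to profile large corpora; aggregating these per-sample counts across a dataset yields the concept-level profile the abstract describes.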