🤖 AI Summary
Program understanding faces challenges in modeling fine-grained, multilingual semantic representations. Method: We introduce COFO, the first large-scale, multilingual, structured competitive programming benchmark, covering 809 algorithmic problem categories and comprising 369K C/C++/Java/Python source code submissions. Each instance includes full problem statements, standardized I/O examples, and multidimensional semantic labels (e.g., algorithm type, data structure, difficulty). Our methodology systematically integrates dynamic web crawling (Selenium + BeautifulSoup), multilingual code normalization, I/O specification parsing, and hierarchical problem-label ontology modeling. Contribution/Results: Experiments demonstrate substantial improvements over baselines in program classification, intent recognition, and cross-lingual code representation learning. COFO provides a high-quality, open-source benchmark to advance foundational research and rigorous evaluation in code intelligence.
📝 Abstract
In recent years, a lot of technological advances in computer science have aided software programmers to create innovative and real-time user-friendly software. With the creation of the software and the urging interest of people to learn to write software, there is a large collection of source codes that can be found on the web, also known as Big Code, which can be used as a source of data for driving the machine learning applications tending to solve certain software engineering problems. In this paper, we present COFO, a dataset consisting of 809 classes/problems with a total of 369K source codes written in C, C++, Java, and Python programming languages, along with other metadata such as code tags, problem specification, and input-output specifications. COFO has been scraped from the openly available Codeforces website using a selenium-beautifulsoup-python based scraper. We envision that this dataset can be useful for solving machine learning-based problems like program classification/recognition, tagging, predicting program properties, and code comprehension.