COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Machine learning for program understanding needs large, labeled collections of source code. This paper introduces COFO, a dataset scraped from the openly available Codeforces website that covers 809 problems (classes) and 369K source code submissions written in C, C++, Java, and Python. Alongside the code, the dataset provides metadata such as code tags, problem specifications, and input-output specifications. The data was collected with a Python scraper built on Selenium and BeautifulSoup. The authors envision COFO supporting machine learning tasks such as program classification and recognition, tagging, predicting program properties, and code comprehension.

📝 Abstract
In recent years, many technological advances in computer science have helped software programmers create innovative, real-time, user-friendly software. As more software is created and more people become interested in learning to write it, a large collection of source code, also known as Big Code, has become available on the web. This code can serve as data for machine learning applications that aim to solve certain software engineering problems. In this paper, we present COFO, a dataset consisting of 809 classes/problems with a total of 369K source codes written in the C, C++, Java, and Python programming languages, along with other metadata such as code tags, problem specifications, and input-output specifications. COFO has been scraped from the openly available Codeforces website using a Python scraper based on Selenium and BeautifulSoup. We envision that this dataset can be useful for solving machine learning-based problems like program classification/recognition, tagging, predicting program properties, and code comprehension.
Problem

Research questions and friction points this paper is trying to address.

Creating a dataset for program classification and recognition
Providing source code as data for machine learning applications
Solving software engineering problems with Big Code
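
To make the task framing concrete, the sketch below shows one way a COFO-style record could be turned into training targets: the problem a submission solves acts as a single class label (classification/recognition), while the code tags act as a multi-label target (tagging). The `CofoRecord` class, its field names, and the toy records are hypothetical illustrations, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field


@dataclass
class CofoRecord:
    """One submission. Field names are hypothetical, not COFO's real schema."""
    problem_id: str                 # the class a submission belongs to
    language: str                   # "C", "C++", "Java", or "Python"
    source_code: str
    tags: list = field(default_factory=list)  # code tags of the problem


def classification_targets(records):
    """Single-label task: map each problem id to an integer class index."""
    problem_ids = sorted({r.problem_id for r in records})
    class_index = {pid: i for i, pid in enumerate(problem_ids)}
    return [class_index[r.problem_id] for r in records]


def tagging_targets(records):
    """Multi-label task: build one multi-hot vector per record over all tags."""
    vocab = sorted({t for r in records for t in r.tags})
    vectors = [[int(t in r.tags) for t in vocab] for r in records]
    return vocab, vectors


# Toy records, invented for illustration only.
records = [
    CofoRecord("P1", "Python", "print(1)", ["math"]),
    CofoRecord("P1", "C++", "int main(){}", ["math"]),
    CofoRecord("P2", "Java", "class A{}", ["greedy", "math"]),
]

labels = classification_targets(records)
vocab, tag_vectors = tagging_targets(records)
```

Submissions in different languages that solve the same problem share a class label here, which is what makes a dataset of this shape usable for language-independent program recognition.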
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scraped Codeforces data using a Python scraper based on Selenium and BeautifulSoup
Created the COFO dataset with 369K source codes in multiple languages
Includes metadata such as code tags and problem specifications
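
As a rough illustration of the extraction step, the sketch below pulls tags and sample input/output blocks out of a problem page. It is not the authors' scraper: to stay dependency-free it uses Python's standard `html.parser` as a stand-in for BeautifulSoup, it skips the Selenium page-loading step entirely, and the HTML snippet and the `tag-box` class name are assumptions made for this example.

```python
from html.parser import HTMLParser


class ProblemPageParser(HTMLParser):
    """Collects tag labels and <pre> sample blocks from a problem page."""

    def __init__(self):
        super().__init__()
        self.tags = []        # text of <span class="tag-box"> elements
        self.samples = []     # text of <pre> blocks (sample input/output)
        self._capture = None  # "tag", "pre", or None when not capturing
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "tag-box" in classes:
            self._capture, self._buffer = "tag", []
        elif tag == "pre":
            self._capture, self._buffer = "pre", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if self._capture == "tag" and tag == "span":
            self.tags.append(text)
            self._capture = None
        elif self._capture == "pre" and tag == "pre":
            self.samples.append(text)
            self._capture = None


# Hypothetical page fragment; real pages would be fetched by the scraper.
SAMPLE_HTML = """
<div class="problem-statement">
  <div class="input"><pre>3
1 2 3</pre></div>
  <div class="output"><pre>6</pre></div>
</div>
<span class="tag-box">math</span>
<span class="tag-box">implementation</span>
"""

parser = ProblemPageParser()
parser.feed(SAMPLE_HTML)
parser.close()
page = {"tags": parser.tags, "samples": parser.samples}
```

In a pipeline like the one described, a browser-automation step would first render each page, and a parser of this kind would then turn the rendered HTML into the metadata fields stored with each problem.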