COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging

📅 2025-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Machine learning for software engineering tasks needs large, labeled collections of source code. Method: The paper introduces COFO, a dataset of 809 classes (one per Codeforces problem) comprising about 369K source code submissions written in C, C++, Java, and Python. Each entry is accompanied by metadata such as code tags, the problem specification, and input-output specifications. The data was scraped from the openly available Codeforces website with a Python scraper built on Selenium and BeautifulSoup. Contribution/Results: COFO is intended as a resource for program classification and recognition, tagging, prediction of program properties, and code comprehension; the abstract presents these as envisioned uses rather than reported experimental results.

📝 Abstract

In recent years, many technological advances in computer science have helped software programmers create innovative, real-time, user-friendly software. As software is created and interest in learning to write software grows, a large collection of source code has become available on the web. This collection, also known as Big Code, can serve as a data source for machine learning applications that aim to solve certain software engineering problems. In this paper, we present COFO, a dataset consisting of 809 classes/problems with a total of 369K source code files written in the C, C++, Java, and Python programming languages, along with other metadata such as code tags, problem specifications, and input-output specifications. COFO was scraped from the openly available Codeforces website using a Python scraper based on Selenium and BeautifulSoup. We envision that this dataset can be useful for machine learning problems such as program classification/recognition, tagging, predicting program properties, and code comprehension.
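To make the dataset description concrete, the sketch below shows one way a COFO-style record could be represented and turned into labels for the classification and tagging tasks. The class `CofoSample`, its field names, the helper functions, and the toy values (`P001`, `P002`) are all hypothetical illustrations; the abstract does not specify the dataset's actual file layout or schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CofoSample:
    """One hypothetical COFO-style record (field names are assumptions)."""
    problem_id: str                  # the problem doubles as the class label
    language: str                    # "C", "C++", "Java", or "Python"
    source_code: str
    tags: List[str] = field(default_factory=list)
    problem_statement: str = ""
    io_examples: List[Tuple[str, str]] = field(default_factory=list)


def class_index(samples: List[CofoSample]) -> Dict[str, int]:
    """Map each distinct problem_id to an integer class label."""
    problem_ids = sorted({s.problem_id for s in samples})
    return {pid: i for i, pid in enumerate(problem_ids)}


def tag_vector(sample: CofoSample, vocabulary: List[str]) -> List[int]:
    """Multi-hot encoding of a sample's tags, for the tagging task."""
    present = set(sample.tags)
    return [1 if tag in present else 0 for tag in vocabulary]


# Toy records, not real COFO entries.
samples = [
    CofoSample("P001", "Python", "print(int(input()) % 2)", ["brute force", "math"]),
    CofoSample("P001", "C++", "int main() { return 0; }", ["brute force", "math"]),
    CofoSample("P002", "Java", "class Main {}", ["strings"]),
]

labels = class_index(samples)
vocabulary = sorted({tag for s in samples for tag in s.tags})
print(labels)
print(vocabulary, tag_vector(samples[0], vocabulary))
```

In this framing, program classification predicts `problem_id` from `source_code` (a single-label task over the 809 classes), while tagging predicts the multi-hot tag vector (a multi-label task).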

Problem

Research questions and friction points this paper is trying to address.

Creating a dataset for program classification and recognition

Providing source code for machine learning applications

Solving software engineering problems with Big Code

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scraped Codeforces data using Selenium-BeautifulSoup-Python

Created COFO dataset with 369K multi-language source codes

Includes metadata like tags and problem specifications
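The metadata-extraction step of such a scraper can be sketched as follows. This is not the authors' scraper: it swaps BeautifulSoup for Python's standard-library `html.parser` so the example runs without extra dependencies, it skips the Selenium page-fetching step entirely, and the HTML snippet and the `tag-box` class name are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagBoxParser(HTMLParser):
    """Collect the text of every <span> carrying a given CSS class."""

    def __init__(self, css_class="tag-box"):
        super().__init__()
        self.css_class = css_class
        self._inside = False   # True while we are within a matching <span>
        self.tags = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.css_class in classes:
            self._inside = True
            self.tags.append("")

    def handle_data(self, data):
        if self._inside:
            self.tags[-1] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._inside:
            self._inside = False
            # Collapse surrounding and repeated whitespace.
            self.tags[-1] = " ".join(self.tags[-1].split())


def extract_tags(html, css_class="tag-box"):
    """Return the cleaned tag strings found in an HTML fragment."""
    parser = TagBoxParser(css_class)
    parser.feed(html)
    parser.close()
    return [t for t in parser.tags if t]


# Hypothetical page fragment; real Codeforces markup may differ.
SAMPLE_HTML = """
<div class="problem">
  <span class="tag-box"> brute force </span>
  <span class="tag-box">math</span>
  <span class="other">ignored</span>
</div>
"""

print(extract_tags(SAMPLE_HTML))  # → ['brute force', 'math']
```

With BeautifulSoup, the same extraction would typically be a single `find_all` call on the parsed page; the hand-written parser here only keeps the sketch dependency-free.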
