CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

📅 2023-05-31
🏛️ arXiv.org
📈 Citations: 29
Influential: 0

🤖 AI Summary
Problem: Developing and deploying large language models for code (Code LLMs) requires expertise in both machine learning and software engineering, and fragmented toolchains raise the barrier further. Method: The paper introduces CodeTF, an open-source Transformer library for Code LLMs built around a modular, extensible, unified interface. It integrates language-specific parsing (e.g., abstract syntax trees), code attribute extraction, standardized data loading, and efficient model serving. The library supports a collection of pretrained models and popular code benchmarks (e.g., HumanEval, MBPP), covering training, evaluation, and serving. Contribution/Results: CodeTF aims to lower the overhead of working across machine learning and software engineering and to speed up experimentation and adoption. The implementation is publicly available.
📝 Abstract
Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling code intelligence tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier to model adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and extensible frameworks, we design CodeTF with a unified interface to enable rapid access and development across different types of models, datasets, and tasks. Our library supports a collection of pretrained Code LLMs and popular code benchmarks, including a standardized interface to train and serve Code LLMs efficiently, and data features such as language-specific parsers and utility functions for extracting code attributes. In this paper, we describe the design principles, the architecture, and the key modules and components, and compare CodeTF with related library tools. Finally, we hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.
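To make "utility functions for extracting code attributes" concrete, here is a minimal sketch of the general idea using only Python's standard `ast` module. It is not CodeTF's API: the function name `extract_code_attributes` and the returned dictionary layout are assumptions chosen for illustration, and the sketch handles Python source only.

```python
import ast


def extract_code_attributes(source: str) -> dict:
    """Collect function names, class names, and imported modules from Python source.

    Hypothetical helper for illustration; not part of CodeTF.
    """
    tree = ast.parse(source)
    functions, classes, imports = [], [], []
    # ast.walk visits every node in the syntax tree, including nested ones.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
        elif isinstance(node, ast.ClassDef):
            classes.append(node.name)
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x".
            imports.append(node.module or "")
    return {
        "functions": sorted(functions),
        "classes": sorted(classes),
        "imports": sorted(imports),
    }


SAMPLE = (
    "import os\n"
    "from math import sqrt\n"
    "class A:\n"
    "    def f(self):\n"
    "        return 1\n"
    "def g():\n"
    "    pass\n"
)
attributes = extract_code_attributes(SAMPLE)
```

A language-specific parser plays the same role for other languages: once source text is turned into a syntax tree, attributes such as function names or imports can be read off by walking the tree.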
Problem

Research questions and friction points this paper is trying to address.

Develops a library to simplify Code LLM adoption
Provides unified interface for models, datasets, tasks
Bridges gap between AI and software engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source Transformer library for Code LLMs
Unified interface for models, datasets, and tasks
Supports pretrained models, benchmarks, and code features
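One common way to offer a unified interface across models and tasks is a registry keyed by model name and task. The sketch below illustrates that pattern under stated assumptions; `register`, `load_model`, and `EchoModel` are hypothetical names invented here and do not reflect CodeTF's actual interface.

```python
from typing import Callable, Dict, Tuple

# Maps (model name, task) to a zero-argument factory that builds the model.
_REGISTRY: Dict[Tuple[str, str], Callable[[], object]] = {}


def register(name: str, task: str):
    """Decorator that records a model factory under a (name, task) key."""
    def decorator(factory: Callable[[], object]):
        _REGISTRY[(name, task)] = factory
        return factory
    return decorator


def load_model(name: str, task: str):
    """Single entry point: look up the factory and build the model."""
    try:
        factory = _REGISTRY[(name, task)]
    except KeyError:
        raise ValueError(f"no model registered for name={name!r}, task={task!r}")
    return factory()


class EchoModel:
    """Stand-in model (hypothetical) that returns its input unchanged."""

    def predict(self, code: str) -> str:
        return code


@register("echo", "completion")
def _make_echo() -> EchoModel:
    return EchoModel()


model = load_model("echo", "completion")
output = model.predict("x = 1")
```

With this design, callers depend only on `load_model` and a shared `predict` method, so adding a new model means registering one more factory rather than changing calling code.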