🤖 AI Summary
Football match data suffers from poor interoperability and high analytical overhead because providers differ in what they collect, how they define it, how they represent it, and how they deliver it. To address this, we propose the Common Data Format (CDF), a uniform, standardized format for match data. CDF specifies a minimal schema covering five types of match data: match sheet data, video footage, event data, tracking data, and match metadata. It emphasizes traceability, contextual completeness, and readiness for downstream tasks. Structurally defined via JSON Schema, CDF incorporates semantic naming conventions, a unified pitch coordinate system, provenance annotation, and versioned delivery protocols. The released CDF 1.0 technical specification is intended to enable plug-and-play cross-platform integration, reducing data processing costs and integration timelines for clubs and national football associations. By establishing a foundational interoperability layer, CDF aims to support an industry-wide collaborative data ecosystem.
📝 Abstract
During football matches, a variety of parties (e.g., companies) each collect (possibly overlapping) data about the match, ranging from basic information (e.g., starting players) to detailed positional data. This data is provided to clubs, federations, and other organizations, which are increasingly interested in leveraging it to inform their decision making. Unfortunately, analyzing such data poses significant barriers because each provider may (1) collect different data, (2) use different specifications even within the same category of data, (3) represent the data differently, and (4) deliver the data in a different manner (e.g., file format, protocol). Consequently, working with this data requires a significant investment of time and money. The goal of this work is to propose a uniform and standardized format for football data called the Common Data Format (CDF). The CDF specifies a minimal schema for five types of match data: match sheet data, video footage, event data, tracking data, and match metadata. It aims to ensure that the provided data is clear, sufficiently contextualized (e.g., its provenance is clear), and complete, so that it enables common downstream analysis tasks. Concretely, this paper details the technical specifications of the CDF, the representational choices made to help ensure the clarity of the provided data, and a concrete approach for delivering data in the CDF.