🤖 AI Summary
Problem: Traditional labor market databases (e.g., O*NET) are updated infrequently and rely on small survey samples, while data from online job postings are hard to access and are not built in a standard or transparent manner. Method: This study builds structured labor market data from large-scale job postings, using more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub and adopting the O*NET taxonomy as the framework for extraction via the open-source Job Ad Analysis Toolkit (JAAT). NLP-based extraction captures O*NET tasks, occupation (SOC) codes, tools and technologies, wages, skills, industry, and other attributes, yielding more than 10 billion data points aggregated by monthly active jobs at the occupation, state, and industry level (2015 to 2025). Extraction reliability and accuracy are assessed through out-of-sample and LLM-as-a-Judge testing. Contribution/Results: The resulting open-source tools and dataset offer more timely, broader, and more transparent labor market data, with illustrated potential for research in education and workforce development.
📝 Abstract
Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the dataset's potential for research and for future uses in education and workforce development.
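To make "aggregated by monthly active jobs" concrete, the following is a minimal sketch of one way such an aggregation could work. It is not the authors' pipeline and does not use JAAT's API. The record fields (`soc`, `state`, `posted`, `closed`), the helper names, and the rule that a posting counts as active in every calendar month between its posting and closing dates are all assumptions made for illustration, and the sample records are invented.

```python
from collections import Counter
from datetime import date


def months_active(start: date, end: date):
    """Yield (year, month) for every calendar month from start to end, inclusive."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield (y, m)
        m += 1
        if m > 12:
            y, m = y + 1, 1


def monthly_active_counts(postings):
    """Count active postings per ((year, month), occupation code, state)."""
    counts = Counter()
    for p in postings:
        for ym in months_active(p["posted"], p["closed"]):
            counts[(ym, p["soc"], p["state"])] += 1
    return counts


# Hypothetical records; field names and values are illustrative only.
postings = [
    {"soc": "15-1252", "state": "TX",
     "posted": date(2024, 11, 20), "closed": date(2025, 1, 5)},
    {"soc": "15-1252", "state": "TX",
     "posted": date(2025, 1, 10), "closed": date(2025, 1, 25)},
    {"soc": "29-1141", "state": "OH",
     "posted": date(2025, 1, 3), "closed": date(2025, 2, 1)},
]

counts = monthly_active_counts(postings)
for key in sorted(counts):
    print(key, counts[key])
```

Under this assumed rule, a posting that stays open across several months contributes to each of them, so monthly totals measure the stock of open jobs rather than the flow of new ads. Any extracted feature (a skill flag, a wage value) could be averaged over the same keys.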