Job Skill Extraction via LLM-Centric Multi-Module Framework

📅 2026-04-23

🤖 AI Summary

This work addresses the weaknesses of large language models (LLMs) in extracting skill spans from job postings: malformed output, boundary drift, and hallucinated spans, which are most pronounced on long-tail terms and under cross-domain shift. The authors propose SRICL, a framework that uses semantic retrieval to pull in-domain annotated sentences and ESCO definitions into format-constrained prompts. SRICL combines in-context learning with supervised fine-tuning and adds a deterministic verifier that enforces tag pairing, non-overlap, and BIO legality. Evaluated on six public span-labeled corpora covering multiple sectors and languages, it achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, which supports sentence-level deployment in low-resource, multi-domain settings.
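
For readers unfamiliar with the metric, the following is a minimal sketch of a strict span-level F1, assuming that STRICT-F1 counts a predicted span as correct only when both boundaries match a gold span exactly. The paper's exact definition is not given here, so the function name `strict_f1`, the `(start, end)` span encoding, and the micro-averaging choice are all illustrative assumptions.

```python
from typing import List, Set, Tuple

# Assumed encoding: a span is (start_token_index, end_token_index_exclusive).
Span = Tuple[int, int]


def strict_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1 over sentences; only exact boundary matches count."""
    # True positives: spans present in both gold and prediction for a sentence.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        # Covers empty predictions, empty gold, and zero overlap.
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# One sentence: the first span matches exactly, the second drifts by one token.
gold_spans = [{(0, 2), (4, 5)}]
pred_spans = [{(0, 2), (4, 6)}]
score = strict_f1(gold_spans, pred_spans)
```

Under this assumed definition, a span whose boundary drifts by a single token earns no credit, which is why boundary drift is penalized so heavily by a strict metric.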

📝 Abstract

Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.
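
The verifier-with-retries idea can be illustrated with a small sketch. This is not the authors' implementation: it checks only BIO legality (not the pairing or non-overlap rules), assumes an untyped `B`/`I`/`O` tag set, and uses hypothetical names (`bio_errors`, `extract_with_retries`, and a caller-supplied `generate` function standing in for the LLM call).

```python
from typing import Callable, List, Optional

# Assumed tag set: untyped BIO. The paper's actual label scheme may differ.
VALID_TAGS = {"B", "I", "O"}


def bio_errors(tokens: List[str], tags: List[str]) -> List[str]:
    """Return a list of rule violations; an empty list means the tagging is legal."""
    if len(tokens) != len(tags):
        return [f"length mismatch: {len(tokens)} tokens vs {len(tags)} tags"]
    errors = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag not in VALID_TAGS:
            errors.append(f"invalid tag {tag!r} at position {i}")
            prev = "O"  # treat an unknown tag as outside any span
            continue
        if tag == "I" and prev == "O":
            errors.append(f"I without preceding B at position {i}")
        prev = tag
    return errors


def extract_with_retries(
    tokens: List[str],
    generate: Callable[[List[str]], List[str]],  # hypothetical stand-in for the LLM
    max_retries: int = 2,
) -> Optional[List[str]]:
    """Call the generator until its output passes the checks or retries run out."""
    for _ in range(max_retries + 1):
        tags = generate(tokens)
        if not bio_errors(tokens, tags):
            return tags
    return None  # caller decides how to handle a sentence that never validates
```

Because the checks are deterministic, the same output always yields the same verdict, so a retry is triggered only by an actual rule violation and never by randomness in the validator itself.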

Problem

Research questions and friction points this paper is trying to address.

job skill extraction

span-level extraction

large language models

hallucination

boundary drift

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-centric framework

span-level skill extraction

semantic retrieval