VietJobs: A Vietnamese Job Advertisement Dataset

2026-03-05
AI Summary
This study addresses the scarcity of large-scale, publicly available Vietnamese job posting corpora by introducing VietJobs, the first large-scale public dataset of its kind, comprising 48,092 structured job listings from all 34 provinces and municipalities of Vietnam, with annotated fields including job category, salary, and required skills. The authors construct the dataset through web crawling and cleaning, then evaluate generative large language models, including the instruction-tuned Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT, on job category classification and salary prediction. Instruction-tuned models show notable gains under both few-shot and fine-tuned settings, though the results also highlight remaining challenges for multilingual and Vietnamese-specific models on structured labour market prediction. The dataset and accompanying benchmark address a gap in Vietnamese natural language processing and workforce-related research.

Abstract
VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
Problem

Research questions and friction points this paper is trying to address.

Vietnamese job advertisements
large-scale dataset
natural language processing
labour market analytics
multilingual modelling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vietnamese job advertisement dataset
large language models
labour market analytics
few-shot learning
NLP benchmark
Hieu Pham Dinh
College of Engineering and Computer Science, VinUniversity
Hung Nguyen Huy
College of Engineering and Computer Science, VinUniversity
Mo El-Haj
Associate Professor (Reader) in NLP at VinUniversity. Visiting Researcher at Lancaster University
Natural Language Processing, Text Summarization, Financial NLP, Arabic Natural Language Processing