🤖 AI Summary
In resume-job matching, interaction labels are extremely sparse because each candidate applies to only a few positions, which severely limits recommendation performance. To address this, ConFit v2 combines generative data augmentation with high-quality hard-negative mining in a contrastive learning framework. Specifically, it uses a large language model (LLM) to synthesize hypothetical reference resumes for job posts, alleviating the scarcity of ground-truth labels, and introduces a Runner-Up strategy that mines semantically close hard negatives from unlabeled resume/job pairs. Built on a dual-tower encoder architecture, the method achieves state-of-the-art results on two real-world datasets, outperforming the prior best method ConFit as well as BM25 and OpenAI's text-embedding-003, with average absolute gains of 13.8% in recall and 17.5% in nDCG.
📝 Abstract
A reliable resume-job matching system helps a company recommend suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply to only a few jobs, interaction labels in resume-job datasets are sparse. We introduce ConFit v2, an improvement over ConFit that tackles this sparsity problem. We propose two techniques to enhance the encoder's contrastive training process: augmenting job data with hypothetical reference resumes generated by a large language model, and creating high-quality hard negatives from unlabeled resume/job pairs using a novel hard-negative mining strategy. We evaluate ConFit v2 on two real-world datasets and demonstrate that it outperforms ConFit and prior methods (including BM25 and OpenAI text-embedding-003), achieving an average absolute improvement of 13.8% in recall and 17.5% in nDCG across job-ranking and resume-ranking tasks.
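The two training ideas above can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not ConFit v2's actual implementation: it reads the "Runner-Up" strategy as ranking unlabeled candidates by a base encoder's score, skipping the very top hits (which may be unlabeled true matches), and keeping the next-best as hard negatives, which then enter an InfoNCE-style contrastive loss alongside the positive (the function names, `k_skip`/`k_neg` parameters, and temperature value are illustrative choices, not taken from the paper).

```python
import numpy as np

def runner_up_hard_negatives(scores, k_skip=1, k_neg=2):
    """For each query (row of `scores`), rank candidates best-first,
    skip the top `k_skip` hits (possible unlabeled positives), and
    keep the next `k_neg` indices as hard negatives."""
    order = np.argsort(-scores, axis=1)          # best-first ranking per row
    return order[:, k_skip:k_skip + k_neg]

def info_nce_with_hard_negatives(q, pos, negs, tau=0.05):
    """InfoNCE-style loss for one L2-normalized query embedding `q`,
    its positive `pos`, and mined hard negatives `negs` (rows)."""
    logits = np.concatenate([[q @ pos], negs @ q]) / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive sits at index 0

# Toy base-encoder scores: 2 jobs x 4 unlabeled resumes.
scores = np.array([
    [0.9, 0.7, 0.6, 0.1],
    [0.2, 0.8, 0.75, 0.3],
])
negs = runner_up_hard_negatives(scores)          # [[1, 2], [2, 3]]

# Toy embeddings: query aligned with its positive, negatives orthogonal/opposed.
q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
neg_embs = np.array([[0.0, 1.0], [-1.0, 0.0]])
loss = info_nce_with_hard_negatives(q, pos, neg_embs)
```

In a full pipeline, `scores` would come from a pretrained dual-tower encoder scoring unlabeled resume/job pairs, and the mined negatives would be re-encoded and batched into the contrastive objective together with LLM-generated hypothetical resumes on the positive side.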