🤖 AI Summary
This work investigates the zero-shot keyphrase generation capability of large language models (LLMs), focusing on accurate keyphrase extraction from documents without fine-tuning. To address LLMs' output instability and the limits of task-agnostic prompting, we propose a task-specific instruction template grounded in keyword-identification logic, coupled with a self-consistency-inspired strategy that aggregates multiple sampled responses. We conduct the first systematic evaluation of this approach across both open-weight models (Phi-3, Llama-3) and the closed-weight GPT-4o. Experiments on multiple benchmark datasets demonstrate consistent improvements over zero-shot baselines, with an average F1-score gain of 12.6%. Our core contributions are: (1) a task-aware instruction design that explicitly encodes keyphrase-recognition heuristics; (2) a lightweight, inference-time response-fusion mechanism that mitigates LLM output variance; and (3) empirical evidence that combining instruction engineering with response aggregation unlocks LLMs' zero-shot keyphrase generation potential, establishing a low-cost, training-free paradigm for keyphrase extraction.
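The multi-sample response-fusion idea can be illustrated with a small sketch. The summary does not specify the exact fusion rule, so the majority-vote threshold, the normalization step, and the function name below are assumptions, not the paper's actual method:

```python
from collections import Counter

def aggregate_keyphrases(samples, min_support=0.5):
    """Fuse keyphrase lists from several sampled LLM responses.

    Hypothetical majority-vote fusion: keep a phrase only if it
    appears in at least `min_support` of the sampled responses,
    which damps the run-to-run variance of a single sample.
    """
    n = len(samples)
    counts = Counter()
    for phrases in samples:
        # Case-fold and deduplicate within each individual sample,
        # so one response cannot vote twice for the same phrase.
        counts.update({p.strip().lower() for p in phrases})
    # Rank the surviving phrases by how many samples proposed them.
    return [p for p, c in counts.most_common() if c / n >= min_support]

# Three hypothetical sampled responses for the same document:
samples = [
    ["Keyphrase Generation", "zero-shot learning", "LLMs"],
    ["keyphrase generation", "prompt design", "LLMs"],
    ["keyphrase generation", "zero-shot learning", "LLMs"],
]
print(aggregate_keyphrases(samples))
```

With the 0.5 threshold, "prompt design" (proposed by only one of three samples) is filtered out, while the three phrases proposed by a majority of samples survive.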
📝 Abstract
Keyphrases are the essential topical phrases that summarize a document. Keyphrase generation is the long-standing NLP task of automatically producing keyphrases for a given document. While the task has been explored comprehensively with a variety of models, only a few works offer preliminary analyses of Large Language Models (LLMs) for it. Given the impact of LLMs on the field of NLP, a more thorough examination of their potential for keyphrase generation is warranted. In this paper, we address this need. Specifically, we focus on the zero-shot capabilities of open-source instruction-tuned LLMs (Phi-3, Llama-3) and the closed-source GPT-4o for this task. We systematically investigate the effect of providing task-relevant, specialized instructions in the prompt. Moreover, we design task-specific counterparts to self-consistency-style strategies for LLMs and show significant gains from our proposals over the baselines.