AI Summary
This paper addresses syntactic constituency parsing, a fundamental yet unresolved challenge in NLP, by proposing an LLM-driven parsing paradigm that eschews explicit grammar modeling. Methodologically, it introduces three tree-linearization strategies that encode constituency trees as symbol sequences, so that parsing reduces to pure text generation with LLMs. The approach is evaluated across zero-shot, few-shot, and fully supervised settings using a diverse set of models, including ChatGPT, GPT-4, OPT, LLaMA, and Alpaca. Key contributions include: (i) the first comprehensive empirical analysis revealing both the generalization bottlenecks and the latent capabilities of LLMs for constituency parsing; and (ii) strong cross-domain generalization on multiple benchmarks, with performance approaching that of state-of-the-art specialized parsers in certain configurations. The work provides empirical validation and a methodological foundation for lightweight, grammar-agnostic, and broadly applicable constituency parsing.
Abstract
Constituency parsing is a fundamental yet unsolved natural language processing task. In this paper, we explore the potential of recent large language models (LLMs), which have exhibited remarkable performance across a wide range of domains and tasks, to tackle this task. We employ three linearization strategies to transform output trees into symbol sequences, such that LLMs can solve constituency parsing by directly generating linearized trees. We conduct experiments with a diverse range of LLMs, including ChatGPT, GPT-4, OPT, LLaMA, and Alpaca, and compare their performance against state-of-the-art constituency parsers. Our experiments cover zero-shot, few-shot, and full-training settings, and we evaluate the models on one in-domain and five out-of-domain test sets. Our findings reveal insights into LLMs' performance, generalization abilities, and remaining challenges in constituency parsing.
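To make the idea of tree linearization concrete, the sketch below shows one common way to encode a constituency tree as a bracketed symbol sequence that an LLM could be trained or prompted to generate. The paper's three specific strategies are not detailed in this summary, so this is only an illustrative example of the standard bracketed encoding; the tuple-based tree representation and the `linearize` helper are assumptions for this sketch.

```python
def linearize(tree):
    """Recursively flatten a (label, children...) tuple tree into a
    bracketed token sequence such as '(S (NP (DT The) (NN cat)) ...)'."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Pre-terminal node: a part-of-speech tag dominating a single word.
        return f"({label} {children[0]})"
    # Internal node: recurse over child constituents and join their encodings.
    return "(" + label + " " + " ".join(linearize(c) for c in children) + ")"

# Toy tree for the sentence "The cat sleeps".
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "cat")),
        ("VP", ("VBZ", "sleeps")))

print(linearize(tree))
# (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))
```

Under this framing, parsing becomes a sequence-generation problem: the model receives the raw sentence and must emit the linearized tree, which can then be deserialized and scored against the gold tree with standard constituency metrics.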