Improving Unsupervised Constituency Parsing via Maximizing Semantic Information

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Unsupervised constituency parsing suffers from a misalignment between parsing accuracy and probabilistic objectives such as log-likelihood, because those objectives neglect the coupling between syntactic structure and semantics. This work introduces semantic-structural mutual information (SemInfo), the mutual information between induced parse trees and sentence semantics, as a new training objective. Methodologically: (1) sentence semantics are represented with a bag-of-substrings model; (2) a probability-weighted information metric is applied to estimate SemInfo; and (3) a TreeCRF-based model is developed to apply the objective to PCFG induction. Experiments across five PCFG variants and four languages show that SemInfo correlates more strongly with parsing accuracy than log-likelihood, yielding an average improvement of 7.85 points in F1 score and new state-of-the-art results in three of the four languages.

📝 Abstract
Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective for training unsupervised parsers: maximizing the information between constituent structures and sentence semantics (SemInfo). We introduce a bag-of-substrings model to represent the semantics and apply the probability-weighted information metric to estimate the SemInfo. Additionally, we develop a Tree Conditional Random Field (TreeCRF)-based model to apply the SemInfo maximization objective to Probabilistic Context-Free Grammar (PCFG) induction, the state-of-the-art method for unsupervised constituency parsing. Experiments demonstrate that SemInfo correlates more strongly with parsing accuracy than LL. Our algorithm significantly enhances parsing accuracy by an average of 7.85 points across five PCFG variants and in four languages, achieving new state-of-the-art results in three of the four languages.
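
To make the bag-of-substrings idea concrete, here is a minimal sketch that counts every contiguous token span of a tokenized sentence. The function name `bag_of_substrings`, the `max_len` parameter, and the whitespace tokenization in the usage lines are assumptions made for illustration. The abstract does not specify how the paper builds or weights its substrings, so this should not be read as the authors' implementation.

```python
from collections import Counter


def bag_of_substrings(tokens, max_len=None):
    """Count every contiguous token span of a sentence.

    Illustrative only: `max_len` (an assumed parameter) caps the span
    length; None means spans of any length up to the whole sentence.
    """
    n = len(tokens)
    limit = n if max_len is None else min(max_len, n)
    bag = Counter()
    for start in range(n):
        # `end` is exclusive, so tokens[start:end] has end - start tokens.
        for end in range(start + 1, min(start + limit, n) + 1):
            bag[tuple(tokens[start:end])] += 1
    return bag


# Usage: a 3-token sentence has 3 + 2 + 1 = 6 contiguous spans.
bag = bag_of_substrings("the cat sat".split())
for span, count in sorted(bag.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(" ".join(span), count)
```

A `Counter` keyed by token tuples keeps repeated spans as counts rather than collapsing them into a set, which is what makes the representation a bag.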
Problem

Research questions and friction points this paper is trying to address.

Unsupervised Sentence Structure Analysis
Accuracy Limitations
Structure-Meaning Correlation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SemInfo
TreeCRF
Unsupervised Sentence Structure Analysis
Junjie Chen
Department of Computer Science, the University of Tokyo
Xiangheng He
GLAM – Group on Language, Audio, & Music, Imperial College London
Yusuke Miyao
Department of Computer Science, the University of Tokyo
D. Bollegala
Department of Computer Science, the University of Liverpool