Empirical Evaluation of Embedding Models in the Context of Text Classification in Document Review in Construction Delay Disputes

📅 2024-12-15
🏛️ BigData Congress [Services Society]
🤖 AI Summary
This study addresses the low efficiency and high subjectivity inherent in manual review of legal documents concerning construction schedule disputes. It formulates a binary classification task—“whether the document involves schedule delay”—and systematically evaluates four text embedding methods—Sentence-BERT, Doc2Vec, TF-IDF, and GloVe—for semantic modeling of legal texts. To the best of our knowledge, this is the first empirical comparative study of embedding models specifically in the domain of construction delay disputes, with emphasis on domain-specific semantic adaptability and downstream classification robustness. Experimental results demonstrate that Sentence-BERT paired with logistic regression significantly outperforms traditional methods, achieving an 8.2% improvement in accuracy and a 7.6% gain in F1-score. These findings validate that semantic embedding techniques can substantially enhance the automation level and decision-support capability for schedule delay identification, offering a reusable methodological framework for intelligent legal document review.

📝 Abstract
Text embeddings are numerical representations of text data, where words, phrases, or entire documents are converted into vectors of real numbers. These embeddings capture semantic meanings and relationships between text elements in a continuous vector space. The primary goal of text embeddings is to enable the processing of text data by machine learning models, which require numerical input. Numerous embedding models have been developed for various applications. This paper presents our work in evaluating different embeddings through a comprehensive comparative analysis of four distinct models, focusing on their text classification efficacy. We employ both K-Nearest Neighbors (KNN) and Logistic Regression (LR) to perform binary classification tasks, specifically determining whether a text snippet is associated with 'delay' or 'not delay' within a labeled dataset. Our research explores the use of text snippet embeddings for training supervised text classification models to identify delay-related statements during the document review process of construction delay disputes. The results of this study highlight the potential of embedding models to enhance the efficiency and accuracy of document analysis in legal contexts, paving the way for more informed decision-making in complex investigative scenarios.
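The pipeline described in the abstract (embed each snippet, then classify it as 'delay' or 'not delay' with KNN) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for the embedding models the paper evaluates, the `TRAIN` snippets are invented for the example, and the helper names `embed`, `cosine`, and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words term counts.

    A real pipeline would call an embedding model here instead.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, snippet, k=3):
    """Label a snippet by majority vote among its k most similar training snippets."""
    query = embed(snippet)
    ranked = sorted(
        train,
        key=lambda item: cosine(embed(item[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training snippets, for illustration only.
TRAIN = [
    ("the contractor reported a delay in steel delivery", "delay"),
    ("completion was delayed by late design changes", "delay"),
    ("schedule slipped due to the delay in permits", "delay"),
    ("invoice attached for the monthly equipment rental", "not delay"),
    ("please confirm the meeting room for tuesday", "not delay"),
    ("safety training records are attached for review", "not delay"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, "concrete pour faced a delay in delivery"))
    print(knn_predict(TRAIN, "please send the invoice for equipment rental"))
```

Swapping `embed` for a dense embedding model changes only the vector representation; the nearest-neighbor vote stays the same, which is what makes a side-by-side comparison of embeddings on a fixed classifier possible.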
Problem

Research questions and friction points this paper is trying to address.

Mathematical Models
Automatic Identification and Classification
Legal Document Processing Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Embedding Models
Construction Dispute Resolution
Delay-related Classification
Fusheng Wei
Data & Technology, Ankura Consulting Group, LLC, Washington, D.C. USA
Robert Neary
Data & Technology, Ankura Consulting Group, LLC, Washington, D.C. USA
Han Qin
Ankura Consulting Group, LLC.
Geospatial, AI, Legal
Qiang Mao
Data & Technology, Ankura Consulting Group, LLC, Washington, D.C. USA
Jianping Zhang
Data & Technology, Ankura Consulting Group, LLC, Washington, D.C. USA