🤖 AI Summary
To address the limitation of robotic process automation (RPA) in handling unstructured data—such as emails and scanned documents—this paper introduces UNDRESS, the first end-to-end framework for unstructured document information extraction and retrieval that integrates fuzzy regular expressions, lightweight NLP modules, and large language models (LLMs). UNDRESS overcomes RPA’s traditional reliance on structured inputs by employing multi-granularity text parsing, semantics-enhanced pattern matching, and context-aware information retrieval to achieve high-accuracy, robust parsing of complex document formats. Experiments on real-world enterprise document corpora demonstrate that UNDRESS improves F1 scores by 23.6% on key-field extraction and cross-document question answering, while reducing inference latency by 41%. These advances significantly broaden RPA’s applicability in unstructured-data-intensive domains—including finance and legal operations—and confirm the framework’s strong scalability and practical deployability.
📝 Abstract
The growing volume of unstructured data within organizations poses significant challenges for data analysis and process automation. Unstructured data, which lacks a predefined format, encompasses various forms such as emails, reports, and scans. It is estimated to constitute approximately 80% of enterprise data. Despite the valuable insights it can offer, extracting meaningful information from unstructured data is more complex compared to structured data. Robotic Process Automation (RPA) has gained popularity for automating repetitive tasks, improving efficiency, and reducing errors. However, RPA is traditionally reliant on structured data, limiting its application to processes involving unstructured documents. This study addresses this limitation by developing the UNstructured Document REtrieval SyStem (UNDRESS), a system that uses fuzzy regular expressions, techniques for natural language processing, and large language models to enable RPA platforms to effectively retrieve information from unstructured documents. The research involved the design and development of a prototype system, and its subsequent evaluation based on text extraction and information retrieval performance. The results demonstrate the effectiveness of UNDRESS in enhancing RPA capabilities for unstructured data, providing a significant advancement in the field. The findings suggest that this system could facilitate broader RPA adoption across processes traditionally hindered by unstructured data, thereby improving overall business process efficiency.