🤖 AI Summary
ESG reports are difficult to parse automatically because of their irregular layouts (e.g., slide-style formatting) and implicit semantic hierarchies. To address this, we propose Pharos-ESG, the first unified multimodal parsing framework designed specifically for ESG reports: it models reading order from layout flow, combines table-of-contents-guided hierarchical segmentation with multimodal semantic aggregation, and adds a triple-labeling scheme (ESG, GRI, and sentiment) to improve semantic grounding. We further introduce Aurora-ESG, the first large-scale, cross-market ESG report dataset, comprising over 12,000 documents. Extensive experiments show that our method outperforms both domain-specific document parsers and general-purpose multimodal foundation models across multiple benchmarks, producing high-fidelity structured outputs. These outputs are intended to support downstream ESG quantification and financial governance decision-making.
📝 Abstract
Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic reading order from slide-like irregular layouts and implicit hierarchies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a unified framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal aggregation pipeline that contextually transforms visual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical demands of financial research. Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets, featuring unified structured representations of multimodal content, enriched with fine-grained layout and semantic annotations to better support ESG integration in financial governance and decision-making.