🤖 AI Summary
Problem: Multimodal document structure understanding lacks a formal theoretical foundation. Method: This paper proposes a modeling paradigm grounded in category theory: a document is formalized as a category whose objects are document elements and whose morphisms are question-answer pairs. An orthogonal decomposition of information and composability constraints are introduced to support a mathematical representation and a rate-distortion analysis of document content. Building on this, the paper develops an information-theoretic framework for measuring and enumerating document content, which enables unsupervised exegesis (extension of the original text) and consistency-aware abstractive summarization, and it improves large language models through self-supervision with RLVR (reinforcement learning with verifiable rewards). Contribution/Results: This work is the first to systematically integrate category theory into document structure modeling; it establishes the principle of information orthogonality and unifies structural, semantic, and generative aspects under one quantitative framework. Empirical results demonstrate improvements in summary quality and generation consistency, supporting theory-driven, self-supervised enhancement of multimodal foundation models.
📝 Abstract
We apply category theory to extract multimodal document structure, which leads us to develop information-theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. We first develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure to divide the information contained in one or more documents into non-overlapping pieces. The structures extracted in the first and second steps lead us to develop methods to measure and enumerate the information contained in a document. We also build on those steps to develop new summarization techniques, as well as a solution to a new problem, namely exegesis, which results in an extension of the original document. Our question-answer pair methodology enables a novel rate-distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of our overall mathematical framework. Finally, we develop a novel self-supervised method using RLVR to improve large pretrained models using consistency constraints, such as composability and closure under certain operations, that stem naturally from our category-theoretic framework.
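As a rough illustration of the composability and closure constraints mentioned above, the following minimal Python sketch treats question-answer pairs as morphisms between document elements. The `QAPair` class, its `source`/`target` fields, and the functions `compose` and `closure_violations` are hypothetical names introduced here for illustration; the abstract does not specify how a question-answer pair is attached to elements, so that reading is an assumption, and identity morphisms are omitted for brevity.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QAPair:
    """Hypothetical morphism: a question-answer pair linking two document elements."""
    source: str    # element the question starts from (assumed interpretation)
    target: str    # element the answer lands on (assumed interpretation)
    question: str
    answer: str


def compose(f: QAPair, g: QAPair) -> Optional[QAPair]:
    """Return the composite 'g after f', or None when f.target != g.source."""
    if f.target != g.source:
        return None
    return QAPair(f.source, g.target, f"{f.question} -> {g.question}", g.answer)


def closure_violations(morphisms: List[QAPair]) -> List[Tuple[QAPair, QAPair]]:
    """List composable pairs (f, g) for which no morphism in the set
    connects f.source to g.target, i.e. the set is not closed there."""
    endpoints = {(m.source, m.target) for m in morphisms}
    return [
        (f, g)
        for f, g in product(morphisms, repeat=2)
        if f.target == g.source and (f.source, g.target) not in endpoints
    ]
```

In this sketch, a non-empty result from `closure_violations` marks places where a set of question-answer pairs fails closure under composition. A signal of that kind could in principle feed a consistency check, though how the paper actually turns such constraints into a training signal is not stated in the abstract.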