Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the gap between research and production deployment in large-scale multi-page document processing by proposing a microservice architecture tailored for high-throughput scenarios, integrating a multi-stage pipeline of document classification, optical character recognition (OCR), and large language model (LLM) inference. The system employs a hybrid classification strategy, decouples GPU-based inference from CPU-driven orchestration, leverages asynchronous I/O, and supports independent horizontal scaling, enabling stable processing of thousands of documents per hour. Empirical analysis reveals that OCR constitutes the primary bottleneck in end-to-end latency and that system concurrency is constrained by the inference capacity of shared GPUs rather than the number of nodes. This study offers a reusable, efficient deployment paradigm for industrial-scale document understanding systems.

📝 Abstract

Academic research tends to focus on new models for document understanding, leaving a wide gap in the literature between defining a model and running it at production scale. To close that gap, we present a microservice architecture that encapsulates a multi-model pipeline for classification, optical character recognition (OCR), and structured field extraction with a large language model (LLM), and we report our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions: hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, asynchronous processing for the many I/O-bound operations in the pipeline, and an independent horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
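The saturation finding can be illustrated with a minimal sketch. It is not the authors' implementation: `GpuService`, `process_document`, `run_batch`, the slot count, and the sleep durations are all hypothetical stand-ins, and routing all three stages through one shared service is an assumption made for brevity. The sketch shows only the pattern of asynchronous CPU-side workers calling a capacity-limited shared inference service, where adding workers beyond that capacity does not raise the number of inference calls in flight.

```python
import asyncio

GPU_SLOTS = 2  # assumed capacity of the shared GPU inference service


class GpuService:
    """Hypothetical stand-in for a shared GPU inference endpoint."""

    def __init__(self, slots: int) -> None:
        self._slots = asyncio.Semaphore(slots)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def infer(self, stage: str, payload: str, cost: float) -> str:
        # Callers wait here once every slot is busy, regardless of
        # how many orchestration workers exist.
        async with self._slots:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                await asyncio.sleep(cost)  # simulated inference time
            finally:
                self.in_flight -= 1
        return f"{stage}({payload})"


async def process_document(doc: str, gpu: GpuService) -> str:
    # Stage costs are illustrative placeholders, not measured values.
    doc_type = await gpu.infer("classify", doc, 0.001)
    text = await gpu.infer("ocr", doc_type, 0.005)
    return await gpu.infer("extract", text, 0.002)


async def worker(queue: asyncio.Queue, gpu: GpuService, results: list) -> None:
    # CPU-side orchestration: pull documents until the queue is empty.
    while True:
        try:
            doc = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        results.append(await process_document(doc, gpu))


async def run_batch(docs: list, n_workers: int, gpu_slots: int = GPU_SLOTS):
    gpu = GpuService(gpu_slots)
    queue: asyncio.Queue = asyncio.Queue()
    for doc in docs:
        queue.put_nowait(doc)
    results: list = []
    await asyncio.gather(*(worker(queue, gpu, results) for _ in range(n_workers)))
    return results, gpu.peak_in_flight
```

Calling `asyncio.run(run_batch(docs, n_workers=8))` with the default two slots returns a peak in-flight count of 2, the same as with two workers. In this toy model, throughput is set by the inference capacity rather than by the number of workers.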

Problem

Research questions and friction points this paper is trying to address.

Document AI

production deployment

microservice architecture

OCR

LLM pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

microservice architecture

OCR pipeline

LLM inference