🤖 AI Summary
QuickBooks transaction categorization faces three key challenges: unstructured textual descriptions, a fine-grained category taxonomy, and weak model generalization and cold-start performance stemming from the relational database architecture. To address these, we propose Rel-Cat, the first graph-based classification framework natively designed for SQL relational schemas, which formulates transaction categorization as a link prediction task on a relational graph. Rel-Cat jointly encodes transaction text with BERT and relational structure with Relational Graph Convolutional Networks (R-GCN). The graph is constructed directly over the native database schema, and the model is trained end-to-end without data flattening or manual feature engineering. Experiments on QuickBooks production data show that Rel-Cat significantly outperforms baseline models in accuracy, adapts well in few-shot settings, achieves millisecond-scale inference latency, and scales to millions of users. Rel-Cat establishes a novel paradigm for semantic classification over relational data.
📝 Abstract
Automatic transaction categorization is crucial for enhancing the customer experience in QuickBooks by providing accurate accounting and bookkeeping. The distinct challenges in this domain stem from the unique formatting of transaction descriptions, the wide variety of transaction categories, and the vast scale of the data involved. Furthermore, because transaction data is organized in a relational database, it is difficult to develop a single unified model that covers the entire database. In this work, we develop a novel graph-based model, named Rel-Cat, which is built directly over the relational database. We introduce a new formulation of transaction categorization as a link prediction task within this graph structure. By integrating techniques from natural language processing and graph machine learning, our model not only outperforms the existing production model in QuickBooks but also scales to a growing customer base with a simpler architecture, without compromising accuracy. This design also helps address the cold-start problem, because the model can adapt from minimal data.
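To make the link prediction formulation concrete, the following is a minimal sketch and not the Rel-Cat model itself. It assumes a hypothetical three-table schema (companies, transactions, categories) collapsed into toy rows, turns foreign keys into typed graph edges, and scores a candidate transaction-to-category link with a hand-written heuristic (token overlap plus a small same-company bonus). That heuristic is only a stand-in for learned text and graph encoders; all table names, field names, and the `0.1` bonus weight are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for a relational "transactions" table.
# category_id is None for the transaction we want to categorize.
transactions = [
    {"txn_id": 1, "company_id": "A", "description": "SQ *BLUE BOTTLE COFFEE", "category_id": "meals"},
    {"txn_id": 2, "company_id": "A", "description": "SHELL OIL 5521", "category_id": "fuel"},
    {"txn_id": 3, "company_id": "B", "description": "CHEVRON GAS 0098", "category_id": "fuel"},
    {"txn_id": 4, "company_id": "B", "description": "BLUE BOTTLE COFFEE #12", "category_id": None},
]


def tokens(text):
    """Lowercase alphabetic tokens; store numbers and symbols are dropped."""
    cleaned = text.lower().replace("*", " ").replace("#", " ")
    return {word for word in cleaned.split() if word.isalpha()}


def build_graph(rows):
    """Turn foreign keys into undirected edges between typed nodes."""
    graph = defaultdict(set)
    for row in rows:
        txn = ("txn", row["txn_id"])
        company = ("company", row["company_id"])
        graph[txn].add(company)
        graph[company].add(txn)
        if row["category_id"] is not None:
            category = ("category", row["category_id"])
            graph[txn].add(category)
            graph[category].add(txn)
    return graph


def score_link(graph, rows_by_id, txn_id, category_id):
    """Heuristic score for a candidate (transaction, category) edge.

    Stand-in for a learned scorer: best Jaccard overlap with transactions
    already linked to the category, plus a bonus for sharing a company.
    """
    query = rows_by_id[txn_id]
    query_tokens = tokens(query["description"])
    best = 0.0
    for node_type, node_id in graph.get(("category", category_id), ()):
        if node_type != "txn":
            continue
        neighbour = rows_by_id[node_id]
        neighbour_tokens = tokens(neighbour["description"])
        union = query_tokens | neighbour_tokens
        overlap = len(query_tokens & neighbour_tokens) / len(union) if union else 0.0
        same_company = 1.0 if neighbour["company_id"] == query["company_id"] else 0.0
        best = max(best, overlap + 0.1 * same_company)
    return best


def predict_category(graph, rows_by_id, txn_id):
    """Pick the category node whose candidate link scores highest."""
    candidates = [node_id for node_type, node_id in graph if node_type == "category"]
    return max(candidates, key=lambda c: score_link(graph, rows_by_id, txn_id, c))


graph = build_graph(transactions)
rows_by_id = {row["txn_id"]: row for row in transactions}
predicted = predict_category(graph, rows_by_id, 4)
```

In this sketch, categorizing a transaction means choosing which missing transaction-to-category edge to add, so a new category or a new company only adds nodes and edges rather than changing the model's output layer. A learned model would replace `score_link` with a trained function over node embeddings.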