🤖 AI Summary
Existing genre classification methods predominantly employ flat, single-label modeling. They neglect the inherent hierarchical structure of literary genres and over-rely on noisy, subjective user reviews, which compromises reliability. This paper proposes HiGeMine, the first framework for hierarchical genre mining that jointly leverages authoritative book blurbs and noisy user reviews. It introduces a zero-shot semantic alignment filtering mechanism to enhance data quality and designs a dual-path graph convolutional network to simultaneously model genre hierarchy and label co-occurrence dependencies. The method integrates a pre-trained language model (BERT), a hierarchical label graph, and a cascaded architecture in which a binary classifier is followed by multi-label classifiers. Evaluated on a newly constructed hierarchical dataset, HiGeMine achieves 96.2% accuracy on Level-1 fiction/non-fiction discrimination and improves Level-2 fine-grained genre F1-score by 12.7% over baselines, demonstrating substantial robustness to label noise.
📝 Abstract
Accurate book genre classification is fundamental to digital library organization, content discovery, and personalized recommendation. Existing approaches typically model genre prediction as a flat, single-label task, ignoring hierarchical genre structure and relying heavily on noisy, subjective user reviews, which often degrade classification reliability. We propose HiGeMine, a two-phase hierarchical genre mining framework that robustly integrates user reviews with authoritative book blurbs. In the first phase, HiGeMine employs a zero-shot semantic alignment strategy to filter reviews, retaining only those semantically consistent with the corresponding blurb, thereby mitigating noise, bias, and irrelevance. In the second phase, we introduce a dual-path, two-level graph-based classification architecture: a coarse-grained Level-1 binary classifier distinguishes fiction from non-fiction, followed by Level-2 multi-label classifiers for fine-grained genre prediction. Inter-genre dependencies are explicitly modeled using a label co-occurrence graph, while contextual representations are derived from pretrained language models applied to the filtered textual content. To facilitate systematic evaluation, we curate a new hierarchical book genre dataset. Extensive experiments demonstrate that HiGeMine consistently outperforms strong baselines across hierarchical genre classification tasks. The proposed framework offers a principled and effective solution for leveraging both structured and unstructured textual data in hierarchical book genre analysis.
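The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a bag-of-words cosine similarity as a stand-in for the zero-shot semantic alignment model, and it takes the Level-1 and Level-2 classifiers as plain callables rather than graph-based networks. The names `embed`, `filter_reviews`, `cascade_predict`, and the `threshold` and `cutoff` values are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in encoder (assumption): bag-of-words counts, not a pretrained language model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_reviews(blurb, reviews, threshold=0.2):
    """Phase 1: keep only reviews whose similarity to the blurb meets a hypothetical threshold."""
    blurb_vec = embed(blurb)
    return [r for r in reviews if cosine(blurb_vec, embed(r)) >= threshold]


def cascade_predict(text, level1, level2, cutoff=0.5):
    """Phase 2 decoding: Level-1 picks the branch, then that branch's multi-label head runs.

    level1: callable returning the probability that the text is fiction (assumed interface).
    level2: dict mapping branch name to a callable returning {genre: score}.
    """
    branch = "fiction" if level1(text) >= cutoff else "non-fiction"
    scores = level2[branch](text)
    genres = sorted(g for g, s in scores.items() if s >= cutoff)
    return branch, genres


# Toy inputs, invented for illustration only.
blurb = "A young wizard discovers a hidden school of magic and battles a dark lord."
reviews = [
    "The wizard school and the magic battles kept me hooked.",
    "Arrived late and the cover was torn.",
]
kept = filter_reviews(blurb, reviews)  # only the on-topic first review is retained
```

In this sketch the shipping complaint is dropped because it shares almost no vocabulary with the blurb. A real system would replace `embed` with contextual embeddings and would add the label co-occurrence graph that the abstract describes, which this example omits.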