🤖 AI Summary
Large language models (LLMs) lack human-like hierarchical reasoning capabilities, particularly the ability to capture semantic abstractions at multiple granularities simultaneously. Method: We propose Hierarchical Decoding, a novel paradigm for decoder-only pretrained models in which the language head is replicated and independently fine-tuned atop multiple intermediate transformer layers, enabling concurrent, multi-granularity semantic decoding. This unifies classification and generation within a single architecture, supporting hierarchical text classification, classification-guided generation, and hierarchical text generation. Contribution/Results: Our approach breaks the conventional single-output-layer constraint by endowing intermediate layers with trainable, interpretable semantic output heads for the first time, and further introduces a classification-guided generation mechanism to enforce inter-layer consistency. Evaluated across diverse benchmarks covering hierarchical classification, controlled generation, and reasoning tasks, our method achieves state-of-the-art performance, demonstrating that hierarchical decoding significantly enhances LLMs' cognitive modeling capacity.
📝 Abstract
Decoder-only language models, such as GPT and LLaMA, generally decode only at the last layer. Motivated by humans' hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built in which different layers decode text simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. The language head of the last layer is copied to selected intermediate layers and fine-tuned with different task inputs. Through thorough experiments, we validate that these selected intermediate layers can be adapted to produce meaningful and coherent content, and that this hierarchical-decoder paradigm achieves state-of-the-art performance on multiple tasks, including hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner, pretrained from scratch.
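The head-replication idea described above can be sketched in a few lines of PyTorch. This is a minimal toy illustration, not the paper's actual implementation: the tiny backbone, the chosen tap layers, and all dimensions are hypothetical stand-ins for a pretrained LLM.

```python
import copy
import torch
import torch.nn as nn

# Toy decoder-only backbone; sizes are illustrative, not from the paper.
VOCAB, D_MODEL, N_LAYERS = 100, 32, 6
TAP_LAYERS = [2, 4]  # hypothetical intermediate layers chosen to decode from

class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
            for _ in range(N_LAYERS)
        )
        self.lm_head = nn.Linear(D_MODEL, VOCAB, bias=False)  # last-layer head

    def forward(self, ids):
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h, taps = self.embed(ids), {}
        for i, blk in enumerate(self.blocks):
            h = blk(h, src_mask=mask)
            if i in TAP_LAYERS:
                taps[i] = h  # hidden states exposed for intermediate decoding
        return h, taps

model = ToyDecoder()
# Replicate the final language head onto each selected intermediate layer;
# each copy would be fine-tuned independently while the backbone stays frozen.
inter_heads = nn.ModuleDict(
    {str(i): copy.deepcopy(model.lm_head) for i in TAP_LAYERS}
)
for p in model.parameters():
    p.requires_grad_(False)  # freeze the pretrained backbone

ids = torch.randint(0, VOCAB, (1, 8))
_, taps = model(ids)
# Each tapped layer now emits its own next-token distribution.
logits_per_level = {i: inter_heads[str(i)](h) for i, h in taps.items()}
```

Decoding from every tapped layer in one forward pass is what allows the different granularities (e.g., coarse class labels at shallow layers, full text at the top) to be produced simultaneously.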