H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

📅 2026-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of limited context windows, high prefill latency, and excessive memory consumption in large language models when processing long texts. The authors propose an efficient inference method grounded in semantic hierarchical structure: offline, a bottom-up semantic tree is constructed, with node memory embeddings generated via post-order aggregation; online, coarse-to-fine query routing combined with dynamic pruning discards irrelevant branches early in inference. This approach uniquely integrates structure-aware memory embeddings with hierarchical retrieval, substantially reducing redundant computation. Experimental results demonstrate that the method achieves ROUGE-L and F1 scores comparable to state-of-the-art techniques on LongBench question answering and structured technical document tasks, while significantly lowering peak GPU memory usage and time-to-first-token latency.
📝 Abstract
Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream processing and chunk-based retrieval can therefore spend substantial computation and context budget on text unrelated to the query. Offline-indexed RAG additionally introduces external storage and index management overhead, and typically appends retrieved evidence as raw text, increasing prefill cost and latency. H$^{2}$MT makes long-context inference structure-aware: it builds a semantic hierarchy offline, computes a memory embedding for each node via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference to prune irrelevant branches early. On LongBench QA (NarrativeQA, HotpotQA, QASPER) and two structured technical-document settings, H$^{2}$MT achieves favorable quality-efficiency trade-offs, delivering competitive ROUGE-L and F1 (where applicable) with lower peak GPU memory and time-to-first-token (TTFT) than prompt compression, memory-token methods, and retrieval-augmented generation baselines.
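The two stages named in the abstract (bottom-up post-order aggregation offline, coarse-to-fine routing with pruning online) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Node` class, mean pooling as the aggregation function, cosine similarity with a fixed `threshold` as the pruning rule, and the hand-written 2-D leaf vectors standing in for learned memory embeddings.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; leaves carry a given embedding, parents get one computed."""
    name: str
    embedding: Optional[List[float]] = None
    children: List["Node"] = field(default_factory=list)


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def aggregate(node: Node) -> List[float]:
    """Post-order pass: embed all children first, then set the parent to their normalized mean.

    Mean pooling is an illustrative assumption, not the paper's aggregation function.
    """
    if not node.children:
        node.embedding = normalize(node.embedding)
        return node.embedding
    child_embs = [aggregate(child) for child in node.children]
    dim = len(child_embs[0])
    mean = [sum(e[i] for e in child_embs) / len(child_embs) for i in range(dim)]
    node.embedding = normalize(mean)
    return node.embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity for vectors that are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))


def route(node: Node, query: List[float], threshold: float = 0.5) -> List[str]:
    """Coarse-to-fine routing: descend only into children similar enough to the query.

    A child scoring below `threshold` is pruned together with its whole subtree.
    Returns the names of the leaves that survive, in document order.
    """
    if not node.children:
        return [node.name]
    hits: List[str] = []
    for child in node.children:
        if cosine(query, child.embedding) >= threshold:
            hits.extend(route(child, query, threshold))
    return hits


# Toy document: two sections, each with two leaf chunks.
root = Node("doc", children=[
    Node("sec_a", children=[Node("a1", [1.0, 0.0]), Node("a2", [0.9, 0.1])]),
    Node("sec_b", children=[Node("b1", [0.0, 1.0]), Node("b2", [0.1, 0.9])]),
])

aggregate(root)                               # offline: build node embeddings bottom-up
print(route(root, normalize([1.0, 0.0])))     # online: ['a1', 'a2'], sec_b is pruned
```

In this sketch the query is compared against `sec_a` and `sec_b` before any leaf is touched, so the two chunks under `sec_b` are never scored. That is the source of the savings the abstract describes: work scales with the branches that remain relevant rather than with the full document.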
Problem

Research questions and friction points this paper is trying to address.

long-context inference
prefill latency
memory efficiency
retrieval-augmented generation
context window limitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Memory Transformer
semantic hierarchy
long-context inference
query routing
memory-efficient LLM
🔎 Similar Papers