🤖 AI Summary
This work addresses the high inference cost of existing span-based named entity recognition methods, which enumerate large numbers of candidate spans and process each candidate with marker-augmented inputs, making them hard to scale in industrial settings with strict latency and throughput constraints. To overcome these limitations, the authors propose SpanDec, a framework that computes span representation interactions at the final Transformer stage through a lightweight decoder dedicated to span representations, avoiding redundant computation in earlier layers. A span filtering mechanism applied during enumeration prunes unlikely candidates before the expensive processing step. Across multiple benchmarks, SpanDec matches competitive span-based baselines in accuracy while improving throughput and reducing computational cost, making it suitable for high-volume serving and on-device applications.
📝 Abstract
Named Entity Recognition (NER) is a key component in industrial information extraction pipelines, where systems must satisfy strict latency and throughput constraints in addition to strong accuracy. State-of-the-art NER accuracy is often achieved by span-based frameworks, which construct span representations from token encodings and classify candidate spans. However, many span-based methods enumerate large numbers of candidates and process each candidate with marker-augmented inputs, substantially increasing inference cost and limiting scalability in large-scale deployments. In this work, we propose SpanDec, an efficient span-based NER framework that targets this bottleneck. Our main insight is that span representation interactions can be computed effectively at the final transformer stage, avoiding redundant computation in earlier layers via a lightweight decoder dedicated to span representations. We further introduce a span filtering mechanism during enumeration to prune unlikely candidates before expensive processing. Across multiple benchmarks, SpanDec matches competitive span-based baselines while improving throughput and reducing computational cost, yielding a better accuracy-efficiency trade-off suitable for high-volume serving and on-device applications.
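To make the cost argument concrete, the following is a minimal sketch of span enumeration with pruning. It is not the SpanDec implementation: the function names, the `max_width` cap, the threshold, and the capitalization-based scorer are all assumptions introduced for illustration, standing in for whatever learned filter the paper uses.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, both inclusive


def enumerate_spans(num_tokens: int, max_width: int) -> List[Span]:
    """Enumerate every contiguous span of at most `max_width` tokens."""
    return [
        (start, end)
        for start in range(num_tokens)
        for end in range(start, min(start + max_width, num_tokens))
    ]


def filter_spans(
    spans: List[Span],
    score: Callable[[Span], float],
    threshold: float,
) -> List[Span]:
    """Keep only spans whose score reaches `threshold` (hypothetical filter)."""
    return [span for span in spans if score(span) >= threshold]


# Toy input and a stand-in scorer: the fraction of capitalized tokens in the span.
# A real system would use a learned scorer; this one is purely illustrative.
tokens = ["Barack", "Obama", "visited", "Paris", "yesterday"]


def capitalized_fraction(span: Span) -> float:
    start, end = span
    window = tokens[start : end + 1]
    return sum(tok[0].isupper() for tok in window) / len(window)


candidates = enumerate_spans(len(tokens), max_width=2)
kept = filter_spans(candidates, capitalized_fraction, threshold=1.0)
print(len(candidates), kept)
```

With a width cap of `w`, enumeration yields on the order of `n * w` candidates for `n` tokens, so any per-candidate processing scales with that count. Pruning before the expensive step reduces the work to the surviving spans only, which is the trade-off the abstract describes.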