🤖 AI Summary
Existing syntax highlighting models are language-specific: each relies on a computationally expensive parser-based annotation pipeline to generate large-scale labeled training data, incurs high training costs, and must be deployed and maintained as a separate model per language, leading to prohibitive operational overhead and poor cross-lingual generalization. This paper proposes the first unified, real-time multilingual syntax highlighting model. It achieves language-agnostic semantic modeling through cross-lingual Deep Abstraction and a novel code normalization technique, and it leverages few-shot learning with multi-task optimization to train efficiently from a small number of annotated samples. A lightweight inference architecture keeps response latency low. The model supports six mainstream languages (Python, JavaScript, Java, C++, Go, and Rust), reducing deployment complexity by 6x, generalizes strongly to unseen languages in a zero-shot setting, and significantly outperforms baseline methods in both accuracy and inference speed.
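The summary's code normalization technique maps language-specific surface forms onto a shared representation so one model can serve many languages. The paper does not spell out the scheme here, so the following is a minimal sketch under an assumed convention: identifiers and literals are replaced with placeholder tokens (`<id>`, `<num>`, `<str>`) using Python's standard `tokenize` module; the function name `normalize` and the placeholder names are illustrative, not the paper's.

```python
import io
import keyword
import tokenize

def normalize(source: str) -> str:
    """Replace identifiers and literals with placeholders (hypothetical scheme)."""
    out = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                out.append("<id>")          # user-chosen names collapse to one symbol
            elif tok.type == tokenize.NUMBER:
                out.append("<num>")
            elif tok.type == tokenize.STRING:
                out.append("<str>")
            elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                              tokenize.DEDENT, tokenize.ENDMARKER):
                continue                    # drop layout-only tokens
            else:
                out.append(tok.string)      # keywords and operators pass through
    except tokenize.TokenError:
        pass  # partially valid input: keep whatever was normalized so far
    return " ".join(out)

print(normalize("total = price * 3"))  # -> <id> = <id> * <num>
```

Collapsing vocabulary this way is one plausible route to the cross-lingual generalization the summary claims, since normalized snippets from different languages share far more structure than raw source does.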
📝 Abstract
Syntax highlighting is a critical feature in modern software development environments, enhancing code readability and developer productivity. However, delivering accurate highlighting in real time remains challenging for online and web-based development tools due to strict time and memory constraints on backend services. These systems must serve highlights rapidly and frequently, even when the code is incomplete or syntactically invalid. This has led to on-the-fly syntax highlighting, where visual annotations are generated just before content is served, often at high request rates and under incomplete input conditions. To meet these demands efficiently, state-of-the-art models use deep learning to learn the behavior of brute-force syntax highlighting resolvers: tools that are easy to implement but too slow for production use. Through the Deep Abstraction process, brute-force strategies are encoded into fast statistical models that achieve both high accuracy and low-latency inference. Despite their success, such models face key challenges: they support only one programming language per model, require large datasets produced by slow brute-force generators, and involve resource-intensive training. In multi-language environments, this means maintaining multiple independent models, increasing system complexity and operational cost. This work addresses these issues by introducing a unified model capable of highlighting up to six mainstream programming languages, reducing deployment complexity by a factor of six and improving performance on unseen languages. A novel normalization technique significantly enhances model generalization, while few-shot learning experiments show that a small number of oracle samples can replace large datasets, minimizing dependence on brute-force generators. Combined, these innovations enable efficient, scalable, and cost-effective syntax highlighting across diverse programming languages.
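The brute-force resolvers the abstract describes are accurate but too slow for production; a learned model is trained to reproduce their per-token output. As a minimal sketch of what such a resolver produces, the snippet below labels each token of a Python snippet with a coarse highlight class using the standard `tokenize` and `keyword` modules. The function name `brute_force_highlight` and the class labels are assumptions for illustration, not the paper's actual resolver or label set.

```python
import io
import keyword
import tokenize

def brute_force_highlight(source: str):
    """Return (token, highlight_class) pairs; an illustrative oracle, not the paper's."""
    classes = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            classes.append((tok.string, "keyword"))
        elif tok.type == tokenize.NAME:
            classes.append((tok.string, "identifier"))
        elif tok.type == tokenize.NUMBER:
            classes.append((tok.string, "literal"))
        elif tok.type == tokenize.STRING:
            classes.append((tok.string, "string"))
        elif tok.type == tokenize.OP:
            classes.append((tok.string, "operator"))
        # layout tokens (NEWLINE, INDENT, ...) carry no highlight class
    return classes

print(brute_force_highlight("x = 1 if flag else 'a'"))
```

In the Deep Abstraction setting described above, pairs like these would serve as training labels: the slow oracle annotates a corpus once, and a fast statistical model learns to emit the same classes at serving time.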