Multi Language Models for On-the-Fly Syntax Highlighting

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing syntax highlighting models are language-specific: they rely on computationally expensive parser-based annotation pipelines to generate large-scale labeled datasets, incur high training costs, and require a separate model per language, leading to prohibitive deployment overhead and poor cross-lingual generalization. This paper proposes the first unified, real-time multilingual syntax highlighting model. It achieves language-agnostic semantic modeling through cross-lingual Deep Abstraction and a novel code normalization technique, and leverages few-shot learning and multi-task optimization to train efficiently from a small number of annotated samples. A lightweight inference architecture keeps response latency low. The model supports six mainstream languages (Python, JavaScript, Java, C++, Go, and Rust), reducing deployment complexity by 6×. It demonstrates strong zero-shot generalization to unseen languages and significantly outperforms baseline methods in both accuracy and inference speed.

📝 Abstract
Syntax highlighting is a critical feature in modern software development environments, enhancing code readability and developer productivity. However, delivering accurate highlighting in real time remains challenging for online and web-based development tools due to strict time and memory constraints on backend services. These systems must serve highlights rapidly and frequently, even when code is partially valid or invalid. This has led to on-the-fly syntax highlighting, where visual annotations are generated just before content is served, often at high request rates and under incomplete input conditions. To meet these demands efficiently, state-of-the-art models use deep learning to learn the behavior of brute-force syntax highlighting resolvers, tools that are easy to implement but too slow for production. Through the Deep Abstraction process, brute-force strategies are encoded into fast statistical models that achieve both high accuracy and low-latency inference. Despite their success, such models face key challenges: they support only one programming language per model, require large datasets from slow brute-force generators, and involve resource-intensive training. In multi-language environments, this means maintaining multiple independent models, increasing system complexity and operational cost. This work addresses these issues by introducing a unified model capable of highlighting up to six mainstream programming languages, reducing deployment complexity by a factor of six and improving performance on unseen languages. A novel normalization technique significantly enhances model generalization, while few-shot learning experiments show that a small number of oracle samples can replace large datasets, minimizing dependence on brute-force generators. Combined, these innovations enable efficient, scalable, and cost-effective syntax highlighting across diverse programming languages.
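The Deep Abstraction process described in the abstract, training a fast statistical model to reproduce the output of a slow brute-force resolver, can be illustrated with a toy oracle. The token classes, keyword set, and regex tokenizer below are illustrative assumptions for the sketch, not the paper's actual pipeline:

```python
import re

# Hypothetical highlight classes and keyword set; the paper's real
# class inventory and per-language grammars are not reproduced here.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "import"}

TOKEN_RE = re.compile(r"\w+|\S")

def brute_force_highlight(code: str) -> list[tuple[str, str]]:
    """Slow-but-simple oracle: label every token with a highlight class.
    Deep Abstraction trains a fast model to mimic an oracle like this."""
    labels = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            cls = "keyword"
        elif tok.isdigit():
            cls = "number"
        elif tok.isidentifier():
            cls = "identifier"
        else:
            cls = "operator"
        labels.append((tok, cls))
    return labels
```

At serving time the learned model replaces `brute_force_highlight` behind the same token-to-class interface, avoiding the oracle's per-request cost while keeping its labeling behavior.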
Problem

Research questions and friction points this paper is trying to address.

Enables real-time syntax highlighting for multiple programming languages
Reduces deployment complexity by unifying six languages into one model
Minimizes dataset dependency through few-shot learning techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model supports six programming languages simultaneously
Novel normalization technique enhances model generalization ability
Few-shot learning reduces dataset size and generator dependence
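The normalization idea behind the second bullet can be sketched as collapsing language-specific identifiers and literals into shared placeholder tokens, so a single model sees a language-agnostic stream. The placeholder names and matching rules here are assumptions; the paper's actual normalization technique may differ:

```python
import re

def normalize(tokens: list[str], keywords: set[str]) -> list[str]:
    """Map a token stream to a cross-language normalized form
    (hypothetical scheme: keep keywords and punctuation, collapse the rest)."""
    out = []
    for tok in tokens:
        if tok in keywords:
            out.append(tok)        # language keywords stay verbatim
        elif re.fullmatch(r"\d+", tok):
            out.append("<NUM>")    # numeric literals collapse to one symbol
        elif re.fullmatch(r"\w+", tok):
            out.append("<ID>")     # identifiers collapse to one symbol
        else:
            out.append(tok)        # operators/punctuation unchanged
    return out
```

Because distinct programs in distinct languages map onto a much smaller shared vocabulary, a model trained on one language's normalized streams can transfer to another's, which is the kind of generalization the bullet claims.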
Marco Edoardo Palma
University of Zurich, Zurich, Switzerland
Pooja Rani
University of Zurich
Empirical Software Engineering (SE) · Green SE · Software Documentation · Code Review · Data Science
Harald C. Gall
University of Zurich, Zurich, Switzerland