🤖 AI Summary
Existing information extraction (IE) research predominantly relies on Python-style code simulation, overlooking the synergistic potential of multiple programming languages (PLs) as supervision during supervised fine-tuning (SFT). Method: This paper presents a systematic investigation into leveraging C++, Java, and Python as structured output guidance signals. We propose a function-prompt construction strategy and a virtual-execution mechanism to improve template-modeling efficiency and cross-lingual generalization. Our approach integrates multi-PL template design, large language model (LLM) SFT, and lightweight runtime simulation. Contribution/Results: Evaluated on multiple standard IE benchmarks, our method significantly outperforms single-language (Python-only) baselines, demonstrating both the effectiveness and the robustness of the multi-PL strategy. All code is publicly available.
📄 Abstract
Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition is that programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs), which makes them particularly well suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.
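To make the idea of a code-style IE input concrete, the sketch below renders a named entity recognition instance as a Python function definition. This is a hypothetical illustration of what a function-prompt might look like, not the paper's actual template: the function name, docstring layout, and label set are all assumptions. Under "virtual running," such a prompt is never executed; the code skeleton only serves as structured guidance for the LLM to complete.

```python
def build_function_prompt(task: str, text: str, labels: list[str]) -> str:
    """Render an IE instance as a Python-style function definition.

    The returned string is a prompt, not runnable extraction code:
    the LLM is expected to fill in the extraction results after the
    final comment (a hypothetical sketch of the function-prompt idea).
    """
    label_list = ", ".join(f'"{label}"' for label in labels)
    return (
        f"def {task}(text: str) -> dict:\n"
        f'    """Extract entities of types [{label_list}] from text."""\n'
        f'    text = "{text}"\n'
        f"    results = {{}}\n"
        f"    # LLM completes the extraction below\n"
    )

prompt = build_function_prompt(
    "named_entity_recognition",
    "Steve Jobs founded Apple in California.",
    ["person", "organization", "location"],
)
print(prompt)
```

Analogous templates in C++ or Java would keep the same structure (typed signature, docstring-style comment, empty result container) while changing only the surface syntax, which is the kind of multi-PL supervision the paper studies.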