Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks an open, real-world, Jira-oriented text-to-JQL evaluation benchmark. This paper introduces Jackal, the first large-scale, execution-based JQL generation benchmark grounded in authentic Jira usage, comprising 100,000 pairs of natural language queries and executable JQL spanning diverse user intent categories. Methodologically, we propose a multi-dimensional evaluation framework grounded in dynamic execution against a live Jira instance, measuring exact match, canonical exact match, and execution accuracy. We publicly release a reproducible Jira snapshot and an open-source scoring toolkit. Evaluating 23 large language models on the Jackal-5K subset, Gemini 2.5 Pro achieves the highest execution accuracy (60.3%), yet exhibits substantial performance variance across query types, revealing fundamental limitations in short-text comprehension and semantic similarity reasoning.

📝 Abstract
Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira. Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit, and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results on 23 Large Language Models (LLMs) spanning parameter sizes, open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally across four user request types. Performance varies significantly across user request types: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new, execution-based challenge for future research in Jira enterprise data.
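To make the four user request types concrete, here is a hypothetical illustration of how a single target JQL query might be paired with each type. The query and all four phrasings are invented for illustration, not drawn from the Jackal corpus:

```python
# Hypothetical example: one JQL query paired with the four request types
# described in the abstract. All strings are illustrative assumptions.

target_jql = "project = ACME AND issuetype = Bug AND status = Open"

requests = {
    # (i) Long NL: a verbose, conversational request
    "long_nl": "Show me all of the issues in the ACME project that are "
               "bugs and are still open at the moment.",
    # (ii) Short NL: a terse, keyword-style request
    "short_nl": "open ACME bugs",
    # (iii) Semantically Similar: different vocabulary, same intent
    "semantically_similar": "unresolved defects in ACME",
    # (iv) Semantically Exact: field-by-field restatement of the query
    "semantically_exact": "Issues where project is ACME, issue type is "
                          "Bug, and status is Open.",
}
```

Under this framing, a model sees one of the four request strings and must produce a JQL query whose execution results match those of the target.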
Problem

Research questions and friction points this paper is trying to address.

Lack of execution-based benchmark for natural language to JQL conversion
Evaluating LLMs' ability to generate correct executable JQL queries
Addressing performance gaps across different user request types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale benchmark with 100,000 text-to-JQL pairs
Execution-based scoring toolkit for accuracy evaluation
Four user request types reflecting real-world usage
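The three metrics reported in the paper can be sketched as follows. This is a minimal, hypothetical sketch: the function names and the toy canonicalization rule are assumptions, not the paper's actual scoring toolkit, and real canonical matching of JQL would involve proper parsing rather than whitespace and case normalization.

```python
# Illustrative sketch of the three evaluation metrics; not the official
# Jackal scoring toolkit.

def exact_match(pred: str, gold: str) -> bool:
    """Strict string equality between predicted and gold JQL."""
    return pred == gold

def canonicalize(jql: str) -> str:
    """Toy canonical form: collapse whitespace and lowercase.
    A stand-in for real JQL normalization (assumption)."""
    return " ".join(jql.split()).lower()

def canonical_match(pred: str, gold: str) -> bool:
    """Equality after normalizing superficial formatting differences."""
    return canonicalize(pred) == canonicalize(gold)

def execution_accuracy(pred_issues: set, gold_issues: set) -> bool:
    """Compare the issue sets returned by executing both queries
    (in Jackal, against a live Jira instance)."""
    return pred_issues == gold_issues
```

Note the ordering of strictness: two queries that differ only in spacing fail exact match but pass canonical match, while two syntactically different queries can still agree on execution accuracy if they retrieve the same issues.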
Kevin Frank
PricewaterhouseCoopers, U.S.A.
Anmol Gulati
Researcher, Google DeepMind
Elias Lumer
PricewaterhouseCoopers, U.S.A.
Sindy Campagna
PricewaterhouseCoopers, U.S.A.
Vamse Kumar Subbiah
PricewaterhouseCoopers, U.S.A.