SommBench: Assessing Sommelier Expertise of Language Models

📅 2026-03-12
🤖 AI Summary
This study systematically evaluates how well large language models handle olfactory and gustatory expertise in wine tasting across multiple languages, going beyond conventional assessments that rely solely on textual cultural knowledge. To this end, the authors introduce SommBench, a multilingual benchmark for a sensory-intensive cultural domain, comprising three structured tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). The datasets were curated in collaboration with a professional sommelier and native speakers of each language. Experimental results show that the strongest closed-weights model reaches up to 97% accuracy on theory questions, while feature completion peaks at 65% and food-wine pairing remains weak (MCC between 0 and 0.39), which points to sensory reasoning as a significant challenge. Because the same tasks are offered in several languages, the benchmark helps separate a model's wine expertise from its language skills.

📝 Abstract
With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models, such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
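The abstract reports accuracy for the theory and feature-completion tasks but the Matthews correlation coefficient (MCC) for food-wine pairing. The sketch below illustrates how the two metrics can diverge. It assumes binary labels (1 = good pairing, 0 = poor pairing), which the abstract does not confirm; the function names and the toy labels are hypothetical and are not taken from the SommBench repository.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (assumed encoding: 1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is 0 when any row or column of the
        # confusion matrix is empty (e.g. a constant predictor).
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
always_negative = [0] * len(y_true)

print(accuracy(y_true, always_negative))           # 0.625
print(matthews_corrcoef(y_true, always_negative))  # 0.0
```

In this toy case a predictor that always answers "poor pairing" still scores 62.5% accuracy on imbalanced labels, yet its MCC is 0. Under the binary-label assumption, an MCC near 0 therefore indicates predictions that carry little information about the true pairing quality, regardless of raw accuracy.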
Problem

Research questions and friction points this paper is trying to address.

sommelier expertise
language models
sensory judgment
multilingual benchmark
wine knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

SommBench
sommelier expertise
multilingual benchmark
sensory grounding
wine-language modeling
William Brach
CS PhD student at STU FIIT
Machine Learning, Natural Language Processing, Information Retrieval, Web scraping, Data mining
Tomas Bedej
sommify, Helsinki, Finland
Jacob Nielsen
Associate Professor of Robotics Engineering, University of Southern Denmark
Robotics, Artificial Intelligence, Modular Systems
Jacob Pichna
sommify, Helsinki, Finland
Juraj Bedej
sommify, Helsinki, Finland
Eemeli Saarensilta
sommify, Helsinki, Finland
Julie Dupouy
sommify, Helsinki, Finland
Gianluca Barmina
University of Southern Denmark, Odense, Denmark
Andrea Blasi Núñez
University of Southern Denmark, Odense, Denmark
Peter Schneider-Kamp
Professor of Computer Science, University of Southern Denmark
Artificial Intelligence, Automated Reasoning, Declarative Programming, Programming Languages, Software Verification
Kristian Košťál
Slovak University of Technology, Bratislava, Slovakia
Michal Ries
Researcher
IT, wireless networks, blockchain, QoE
Lukas Galke Poech
University of Southern Denmark, Odense, Denmark