NC-Bench: An LLM Benchmark for Evaluating Conversational Competence

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the prevailing gap in large language model (LLM) evaluation, which has predominantly emphasized content quality while neglecting systematic assessment of natural dialogue form and structure. Drawing on the IBM Natural Conversation Framework, the study operationalizes human-like sequential management principles to construct a lightweight, theory-driven evaluation framework encompassing three representative interaction paradigms: basic dialogue, retrieval-augmented generation (RAG), and complex multi-turn requests. The proposed framework prioritizes formal appropriateness of dialogue acts over factual correctness, offering strong extensibility and generalizability. Experimental results across six open-source models and fourteen interaction scenarios reveal that while models perform adequately in basic question-answering, they exhibit significant deficiencies in repair, closure, and complex multi-turn tasks; among them, Qwen excels in basic tasks, whereas Granite demonstrates superior performance in RAG and complex scenarios.

📝 Abstract
The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs). Unlike prior benchmarks that focus on the content of model behavior, NC-Bench focuses on the form and structure of natural conversation. Grounded in the IBM Natural Conversation Framework (NCF), NC-Bench comprises three distinct sets. The Basic Conversation Competence set evaluates fundamental sequence management practices, such as answering inquiries, repairing responses, and closing conversational pairs. The RAG set applies the same sequence management patterns as the first set but incorporates retrieval-augmented generation (RAG). The Complex Request set extends the evaluation to complex requests involving more intricate sequence management patterns. Each set tests a model's ability to produce contextually appropriate conversational actions in response to characteristic interaction patterns. Initial evaluations across 6 open-source models and 14 interaction patterns show that models perform well on basic answering tasks, struggle more with repair tasks (especially repeat), have mixed performance on closing sequences, and find complex multi-turn requests most challenging, with Qwen models excelling on the Basic set and Granite models on the RAG set and the Complex Request set. By operationalizing fundamental principles of human conversation, NC-Bench provides a lightweight, extensible, and theory-grounded framework for assessing and improving the conversational abilities of LLMs beyond topical or task-specific benchmarks.
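To make the evaluation setup concrete, the scoring loop described in the abstract (checking whether a model's reply performs the formally appropriate dialogue act for a given interaction pattern, independent of factual content) can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual code: the names `InteractionPattern`, `classify_act`, and `evaluate`, and the keyword-rule classifier, are all assumptions standing in for whatever judge NC-Bench really uses.

```python
# Minimal sketch of a pattern-based conversational evaluation harness,
# loosely modeled on the NC-Bench setup described in the abstract.
# All names and the toy keyword classifier are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class InteractionPattern:
    name: str            # e.g. "repair: repeat", "closing sequence"
    context: list[str]   # prior conversational turns fed to the model
    expected_act: str    # dialogue act the reply should perform


def classify_act(reply: str) -> str:
    """Toy dialogue-act classifier: keyword rules stand in for a real judge."""
    text = reply.lower()
    if "to repeat" in text or "i said" in text:
        return "repeat"
    if "goodbye" in text or "you're welcome" in text:
        return "close"
    return "answer"


def evaluate(model, patterns: list[InteractionPattern]) -> dict[str, float]:
    """Score the FORM of each reply (its dialogue act), not its content."""
    scores = {}
    for p in patterns:
        reply = model(p.context)
        scores[p.name] = 1.0 if classify_act(reply) == p.expected_act else 0.0
    return scores
```

A model here is just any callable mapping a turn history to a reply, so the same harness can wrap a plain chat model, a RAG pipeline, or a multi-turn agent, which mirrors how the three NC-Bench sets reuse one pattern-matching scheme across paradigms.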
Problem

Research questions and friction points this paper is trying to address.

conversational competence
large language models
conversation structure
sequence management
dialogue evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational competence
natural conversation framework
sequence management
retrieval-augmented generation
LLM benchmark
Robert J. Moore
Independent Researcher
Sungeun An
IBM Research, Almaden
conversational AI · cognitive & learning science · data analytics · human-computer interaction
Farhan Ahmed
IBM Research, 555 Bailey Ave, San Jose, CA 95141, USA
Jay Pankaj Gala
IBM Research, 555 Bailey Ave, San Jose, CA 95141, USA