KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates how well large language models handle "tip-of-the-iceberg" questions: questions that look simple but require systematic coverage of a bounded knowledge universe and compositional set-based reasoning over it. It introduces KnowledgeBerg, a multilingual benchmark spanning 10 domains and 17 languages, comprising 4,800 multiple-choice questions derived from 1,183 structured enumeration seeds. The work formalizes two evaluation dimensions, knowledge width (the cardinality of the required universe) and reasoning depth (the number of compositional set operations), and identifies three stages of failure: completeness (missing knowledge), awareness (failure to identify what the question requires), and application (incorrect reasoning execution). Representative open-source models achieve enumeration F1 scores of only 5.26–36.88 and reasoning accuracies of 16.00–44.19. Test-time compute and retrieval augmentation yield gains of up to 4.35 and 3.78 points, respectively, yet substantial gaps remain. The benchmark is cross-lingual, and its universes are grounded in authoritative sources to support reproducibility.

📝 Abstract

Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term "the tip of the iceberg." We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains -- up to 4.35 and 3.78 points, respectively -- substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at https://huggingface.co/datasets/2npc/KnowledgeBerg
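
To make the two dimensions concrete, the following is a minimal sketch and not the benchmark's own scoring code. It assumes enumeration F1 is a set-based precision/recall F1 with exact matching after lowercasing and trimming (the abstract does not specify the matching rule), and the function name `enumeration_f1` and the toy universes are illustrative choices that do not come from the dataset.

```python
def enumeration_f1(predicted, gold):
    """Set-based F1 between a predicted enumeration and a gold universe.

    Assumption: items match only if equal after lowercasing and trimming.
    """
    pred = {item.strip().lower() for item in predicted}
    ref = {item.strip().lower() for item in gold}
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy bounded universes (illustrative, not taken from KnowledgeBerg).
UNSC_PERMANENT = {"China", "France", "Russia", "United Kingdom", "United States"}
G7 = {"Canada", "France", "Germany", "Italy", "Japan",
      "United Kingdom", "United States"}
G7_IN_EU = {"France", "Germany", "Italy"}

# Knowledge width: the cardinality of the universe a question depends on.
width = len(UNSC_PERMANENT)

# A model that recalls only part of the universe and adds one wrong item.
model_enumeration = ["France", "China", "Germany"]
score = enumeration_f1(model_enumeration, UNSC_PERMANENT)

# Reasoning depth 2: two composed set operations (a difference, then an
# intersection). "Which G7 members are EU members but not permanent
# members of the UN Security Council?"
answer = (G7 - UNSC_PERMANENT) & G7_IN_EU

print(width, round(score, 2), sorted(answer))
```

In this sketch the final answer is only correct if every universe involved is enumerated completely, which is why an incomplete enumeration can lead to a wrong answer even when each set operation is applied correctly.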

Problem

Research questions and friction points this paper is trying to address.

systematic knowledge coverage

compositional reasoning

knowledge width

reasoning depth

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

systematic knowledge coverage

compositional reasoning

knowledge width