TANQ: An open domain dataset of table answered questions

📅 2024-05-13
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This paper addresses open-domain table question answering, a novel task requiring multi-source retrieval, multi-hop reasoning, and numerical computation to dynamically generate structured table answers with fine-grained provenance. To this end, we introduce TANQ, the first benchmark dataset for this task. Our key contributions are: (1) formalizing the open-domain QA paradigm centered on dynamic table generation; (2) providing cell-level provenance annotations that precisely characterize bottlenecks in multi-hop inference, unit conversion, and arithmetic operations; and (3) conducting comprehensive evaluations across open-book, closed-book, and oracle settings using state-of-the-art LLMs (e.g., GPT-4). Results show that the best-performing model achieves only 29.1 F1, 19.7 points below human performance. Failure analysis further exposes fundamental limitations in information aggregation and answer interpretability, underscoring critical challenges for future research.

πŸ“ Abstract
Language models, potentially augmented with tool usage such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in the form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ, the first open-domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed-book setups. Our best-performing baseline, GPT-4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points. We analyse baselines' performance across different dataset attributes, such as the different skills required for this task, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.
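The abstract scores generated tables with F1 against reference tables. As a minimal illustration of how such a cell-level score can be computed, the sketch below treats each table as a multiset of normalized cell strings and measures their overlap. This is an assumption for illustration only, not the paper's official scorer, which may align rows and columns more carefully.

```python
from collections import Counter

def table_f1(pred_cells, gold_cells):
    """Cell-level F1 between a predicted and a gold table.

    Both tables are flattened into multisets of lowercased,
    whitespace-stripped cell strings; F1 is computed over their
    multiset intersection. A simplified stand-in for TANQ's metric.
    """
    pred = Counter(c.strip().lower() for c in pred_cells)
    gold = Counter(c.strip().lower() for c in gold_cells)
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: the prediction gets two of three gold cells right.
pred = ["France", "67.4M", "Paris"]
gold = ["France", "67.4M", "Berlin"]
print(round(table_f1(pred, gold), 3))  # 0.667
```

Flattening to cell multisets ignores table structure, which is exactly why row/column alignment matters in the real evaluation; mismatched layouts with correct values would still score well here.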
Problem

Research questions and friction points this paper is trying to address.

Language Model Evaluation
Multi-source Information Gathering
Complex Task Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

TANQ Dataset
Complex Task Assessment
Language Model Limitations