🤖 AI Summary
Ethical documentation for multimodal datasets is widely absent or inconsistent, impeding the development of responsible AI. To address this, we propose TEDI, a standardized assessment framework comprising 143 fine-grained, verifiable indicators spanning core dimensions, including informed consent, privacy protection, and harmful-content mitigation. Through manual annotation of dataset documentation, we empirically evaluate over 100 multimodal datasets containing human voices. Our analysis reveals that web-scraped datasets exhibit significantly poorer ethical documentation quality than crowdsourced or directly collected ones, uncovering a strong correlation between data collection method and documentation rigor. Beyond these diagnostic insights, TEDI establishes a reproducible, extensible basis for assessing the ethical completeness of dataset documentation and paves the way toward automated documentation parsing in support of trustworthy AI governance.
📝 Abstract
Dataset transparency is a key enabler of responsible AI, but insights into multimodal dataset attributes that affect the trustworthy and ethical aspects of AI applications remain scarce and are difficult to compare across datasets. To address this challenge, we introduce Trustworthy and Ethical Dataset Indicators (TEDI), which facilitate the systematic, empirical analysis of dataset documentation. TEDI encompasses 143 fine-grained indicators that characterize trustworthy and ethical attributes of multimodal datasets and their collection processes. The indicators are framed to extract verifiable information from dataset documentation. Using TEDI, we manually annotated and analyzed over 100 multimodal datasets that include human voices. We further annotated data sourcing, size, and modality details to gain insight into the factors that shape trustworthy and ethical dimensions across datasets. We find that only a select few datasets document attributes and practices pertaining to consent, privacy, and harmful-content indicators. The extent to which these and other ethical indicators are addressed varies with the data collection method: documentation of datasets gathered via crowdsourcing or direct collection is more likely to mention them. Scraping dominates in scale at the cost of ethical indicators, but it is not the only viable collection method. Our approach and empirical insights contribute to increasing dataset transparency along trustworthy and ethical dimensions and pave the way for automating the tedious task of extracting information from dataset documentation in the future.