Towards Cross-Modal Error Detection with Tables and Images

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses cross-modal inconsistency, a long-overlooked data quality challenge in multimodal data, focusing on the tabular and image modalities. The authors construct a benchmark for cross-modal error detection that exposes the limitations of unimodal methods in real-world e-commerce and healthcare scenarios, where errors often span modalities. They evaluate five baseline approaches across four multimodal datasets; among them, Cleanlab (a label error detection framework) and DataScope (a data valuation method) achieve the highest F1 scores when paired with a strong AutoML framework. Experiments also reveal significant performance degradation under heavy-tailed class distributions, highlighting a critical gap for future research. This work establishes an initial benchmark and methodology for cross-modal inconsistency detection, advancing robustness and reliability in multimodal learning systems.
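The best-performing baseline above, Cleanlab, builds on confident learning: an example is suspect when the model's out-of-sample confidence in its observed label falls below that class's average self-confidence. As a minimal sketch of that thresholding idea (not Cleanlab's actual, more elaborate implementation — its `find_label_issues` also expects cross-validated probabilities and ranks candidates):

```python
def find_label_issues(labels, pred_probs):
    """Flag likely label errors, confident-learning style.

    labels     : observed class index per example
    pred_probs : out-of-sample predicted probabilities per example
    Returns indices whose confidence in the observed label falls
    below the per-class average self-confidence threshold.
    """
    n_classes = len(pred_probs[0])
    # Per-class threshold: mean predicted probability of class c
    # over the examples actually labeled c.
    thresholds = []
    for c in range(n_classes):
        probs_c = [p[c] for p, y in zip(pred_probs, labels) if y == c]
        thresholds.append(sum(probs_c) / len(probs_c))
    return [i for i, (p, y) in enumerate(zip(pred_probs, labels))
            if p[y] < thresholds[y]]

# Toy example: example 2 is labeled 0 but the model is confident it is class 1.
issues = find_label_issues(
    labels=[0, 0, 0, 1, 1, 1],
    pred_probs=[[0.75, 0.25], [0.75, 0.25], [0.25, 0.75],
                [0.25, 0.75], [0.25, 0.75], [0.25, 0.75]],
)
# → [2]
```

In a cross-modal setting, the predicted probabilities would come from a model trained jointly on the tabular and image features (e.g. via an AutoML framework), so label/image disagreements surface as low self-confidence.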

📝 Abstract
Ensuring data quality at scale remains a persistent challenge for large organizations. Despite recent advances, maintaining accurate and consistent data is still complex, especially when dealing with multiple data modalities. Traditional error detection and correction methods tend to focus on a single modality, typically a table, and often miss cross-modal errors that are common in domains like e-Commerce and healthcare, where image, tabular, and text data co-exist. To address this gap, we take an initial step towards cross-modal error detection in tabular data, by benchmarking several methods. Our evaluation spans four datasets and five baseline approaches. Among them, Cleanlab, a label error detection framework, and DataScope, a data valuation method, perform the best when paired with a strong AutoML framework, achieving the highest F1 scores. Our findings indicate that current methods remain limited, particularly when applied to heavy-tailed real-world data, motivating further research in this area.
Problem

Research questions and friction points this paper is trying to address.

Detecting cross-modal errors between tabular and image data
Addressing data quality challenges in multimodal datasets
Benchmarking error detection methods on tabular data paired with images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal error detection using tables and images
Benchmarking Cleanlab and DataScope with AutoML
Evaluating five baseline methods across four datasets
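The other strong baseline, DataScope, scores each training example by how much it contributes to validation performance (it computes KNN-Shapley values efficiently). As a simplified stand-in that conveys the idea, the sketch below uses leave-one-out values with a 1-nearest-neighbour classifier over scalar features; the data and function names are illustrative, not DataScope's API:

```python
def knn_accuracy(train, val):
    """Accuracy of a 1-nearest-neighbour classifier on (x, y) pairs."""
    correct = 0
    for xv, yv in val:
        nearest = min(train, key=lambda p: abs(p[0] - xv))
        correct += (nearest[1] == yv)
    return correct / len(val)

def loo_values(train, val):
    """Leave-one-out value of each training point: how much validation
    accuracy drops when that point is removed. Mislabeled points tend
    to get negative values."""
    base = knn_accuracy(train, val)
    return [base - knn_accuracy(train[:i] + train[i + 1:], val)
            for i in range(len(train))]

# Toy example: the point (0.5, 1) sits in class-0 territory, so removing
# it improves validation accuracy and it receives a negative value.
train = [(0.0, 0), (0.1, 0), (0.9, 1), (1.0, 1), (0.5, 1)]
val = [(0.05, 0), (0.95, 1), (0.45, 0)]
values = loo_values(train, val)
```

In the cross-modal benchmark, low-value points flagged this way are candidates for inconsistency between an item's tabular record and its image.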