TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address accuracy degradation, slow inference, and knowledge staleness in joint analysis of multimodal (structured, semi-structured, unstructured) data within data lakes, this paper proposes an intelligent analytical system based on the Model Context Protocol (MCP). Methodologically, it introduces: (1) a hierarchical semantic operator framework coupled with an AI-agent-driven NL2Operator translation mechanism for precise user intent parsing and modality-specific model orchestration; (2) an MCP server cluster architecture enabling on-demand scheduling of modality-specific foundation models (e.g., for text, tabular, graph, and log data); and (3) a hybrid approach integrating deep research with machine unlearning to keep raw data and LLM knowledge fresh and in sync. Evaluated on real-world data lake workloads, the system achieves a 32% improvement in query accuracy, a 47% reduction in inference latency, sub-second response over TB-scale heterogeneous data, and knowledge base refreshes within minutes.

📝 Abstract

The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data analytics system. Specifically, we propose a novel architecture built upon the Model Context Protocol (MCP), an emerging paradigm that enables LLMs to collaborate with knowledgeable agents. First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes and develop an AI-agent-powered NL2Operator translator to bridge user intent and analytical execution. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities. This design enhances both accuracy and efficiency, while supporting high scalability through modular deployment. Finally, we propose an updating mechanism that harnesses deep research and machine unlearning techniques to refresh the data lakes and LLM knowledge, with the goal of balancing data freshness and inference efficiency.
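The execution idea in the abstract, where a translated query becomes a sequence of operators and each operator is sent to a server specialized for one data modality, can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration: the `Operator` fields, the operator names, the `SERVERS` registry, and the two server functions are not taken from the paper, and the "servers" are plain local functions rather than real MCP servers hosting foundation models. The plan is written by hand where an NL2Operator translator would produce it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Operator:
    """One step of a query plan; `modality` selects the backend that runs it."""
    name: str      # hypothetical operator name, e.g. "filter" or "summarize"
    modality: str  # hypothetical modality tag, e.g. "table" or "text"
    argument: str


# Stand-ins for modality-specific MCP servers (assumption: a real deployment
# would call a remote server hosting a foundation model for that modality).
def table_server(op: Operator) -> str:
    return f"[table] {op.name}({op.argument})"


def text_server(op: Operator) -> str:
    return f"[text] {op.name}({op.argument})"


# Hypothetical registry mapping a modality tag to the server that handles it.
SERVERS: Dict[str, Callable[[Operator], str]] = {
    "table": table_server,
    "text": text_server,
}


def execute(plan: List[Operator]) -> List[str]:
    """Send each operator to the server registered for its modality."""
    results = []
    for op in plan:
        server = SERVERS.get(op.modality)
        if server is None:
            raise ValueError(f"no server registered for modality {op.modality!r}")
        results.append(server(op))
    return results


# A plan that an NL2Operator translator might emit (hand-written here).
plan = [
    Operator("filter", "table", "year = 2024"),
    Operator("summarize", "text", "incident reports"),
]
results = execute(plan)
```

Under these assumptions, supporting a new modality amounts to registering one more entry in `SERVERS`, which is one way to read the abstract's claim about scalability through modular deployment.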

Problem

Research questions and friction points this paper is trying to address.

Precise capture of user intent in multi-modal queries

Efficient processing of diverse data modalities

Ensuring data freshness with external knowledge integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-agent-powered NL2Operator translator for intent bridging

MCP-based execution framework with specialized foundation models

Updating mechanism using deep research and machine unlearning

🔎 Similar Papers

MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake

2024-08-29 · arXiv.org · Citations: 0
