ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation

📅 2024-10-01

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

166K/year

🤖 AI Summary

To address weak representation learning in multimodal (tabular + textual) clustering, this paper proposes a novel tabular-to-text encoding framework synergistically optimized with large language models (LLMs). The method introduces the first clustering-oriented tabular textualization paradigm, integrating numerical verbalization, stochastic feature sequence permutation, and context-aware embedding generation—enabling LLMs to jointly capture semantic and statistical structures of tabular data. After lightweight fine-tuning, the model produces joint embeddings that preserve structural fidelity while enriching semantic expressiveness, directly supporting end-to-end clustering. Evaluated on multiple benchmark datasets, our approach achieves an average 12.7% improvement in clustering accuracy over conventional unimodal and shallow multimodal baselines. It effectively uncovers complex cross-field correlations otherwise missed by prior methods, demonstrating superior representational capacity for multimodal tabular clustering.

Technology Category

Application Category

📝 Abstract

Cluster analysis plays a crucial role in various domains and applications, such as customer segmentation in marketing. These contexts often involve multimodal data, including both tabular and textual datasets, making it challenging to represent hidden patterns for obtaining meaningful clusters. This study introduces ERASMO, a framework designed to fine-tune a pretrained language model on textually encoded tabular data and generate embeddings from the fine-tuned model. ERASMO employs a textual converter to transform tabular data into a textual format, enabling the language model to process and understand the data more effectively. Additionally, ERASMO produces contextually rich and structurally representative embeddings through techniques such as random feature sequence shuffling and number verbalization. Extensive experimental evaluations were conducted using multiple datasets and baseline approaches. Our results demonstrate that ERASMO fully leverages the specific context of each tabular dataset, leading to more precise and nuanced embeddings for accurate clustering. This approach enhances clustering performance by capturing complex relationship patterns within diverse tabular data.

Problem

Research questions and friction points this paper is trying to address.

Integrating multimodal tabular and textual data for clustering

Generating context-aware embeddings from textually encoded tabular data

Enhancing clustering accuracy through language model fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes pretrained language models on textually encoded tabular data

Converts tabular data to textual format using specialized converter

Generates enriched embeddings via feature shuffling and number verbalization

🔎 Similar Papers

ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models