🤖 AI Summary
This study investigates topic-level patterns in user prompts and LLM responses in the LMSYS-Chat-1M dataset and how those patterns correlate with human model preferences.
Method: We pioneer the application of BERTopic to multilingual LLM comparative evaluation data, integrating dialogue cleaning, multilingual preprocessing, and topic distribution visualization to construct a model–topic preference matrix.
Contribution/Results: We identify 29 semantically coherent topics and discover consistent user preference advantages for specific LLMs across domains such as technology, programming, and ethics—revealing a topic-dependent distribution of model strengths. This work establishes an interpretable, topic-level analytical framework for LLM capability assessment and enables domain-aware model selection and targeted fine-tuning, thereby advancing personalized LLM deployment.
📝 Abstract
This study applies BERTopic, a transformer-based topic modeling technique, to the LMSYS-Chat-1M dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label indicating which response the user judged better. The main objective is to uncover thematic patterns in these conversations and examine their relation to user preferences, in particular whether certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed to handle multilingual variation, balance dialogue turns, and clean noisy or redacted data. BERTopic extracted over 29 coherent topics, including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed the relationships between topics and model preferences to identify trends in model–topic alignment, using visualizations such as inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.
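The model-versus-topic matrix described above can be sketched as a per-topic win-rate computation over preference-labeled battles. This is a minimal illustration, not the paper's actual code: the record schema (`topic_id`, `model_a`, `model_b`, `winner`) is assumed from the LMSYS-style pairwise format, and the helper name is hypothetical.

```python
from collections import defaultdict

def model_topic_matrix(records):
    """Build a model -> topic -> win-rate mapping from pairwise battles.

    Each record is a tuple (topic_id, model_a, model_b, winner), where
    winner is "model_a", "model_b", or "tie" (assumed LMSYS-style labels).
    Ties count as a battle for both models but a win for neither.
    """
    wins = defaultdict(lambda: defaultdict(int))
    battles = defaultdict(lambda: defaultdict(int))
    for topic, model_a, model_b, winner in records:
        battles[model_a][topic] += 1
        battles[model_b][topic] += 1
        if winner == "model_a":
            wins[model_a][topic] += 1
        elif winner == "model_b":
            wins[model_b][topic] += 1
    # Win rate per model per topic; a heat-mapped version of this
    # dictionary is the model-versus-topic matrix.
    return {
        model: {t: wins[model][t] / battles[model][t] for t in topics}
        for model, topics in battles.items()
    }
```

In the study's setting, the `topic_id` for each conversation would come from BERTopic's assignment over the cleaned prompts, and the resulting matrix highlights which models are preferred within which topical domains.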