🤖 AI Summary
Existing cultural alignment methods select seeds through manual curation or bias-prone large language model (LLM) extraction, so they lack quantifiable selection criteria and do not scale. The authors propose C-Mining, a framework that formulates cultural seed discovery as an unsupervised data mining problem: cultural specificity is modeled through two signals in pretrained multilingual embedding spaces, geometric isolation and linguistic exclusivity. Without human annotation or LLM intervention, C-Mining combines geometric misalignment analysis, noise filtering, and cultural point extraction to identify high-fidelity seeds at scale. Experiments report a more than 150-fold reduction in seed preparation cost and a 6.03-point gain on CulturalBench-Hard, outperforming current baselines and improving models' cultural understanding and reasoning capabilities.
📝 Abstract
Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. The critical first step in such synthesis is seed curation; however, current methods lack quantifiable standards for selecting seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, and they treat cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that turns cultural seed discovery from a subjective selection process into a computable data mining formulation. Our approach exploits a geometric insight: the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces serves as a quantifiable discovery signal. By systematically identifying regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without human or LLM supervision, reducing preparation costs by more than 150-fold. We further use the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines. C-Mining thus provides a scalable, quantifiable solution for high-quality cultural data synthesis.
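The abstract names two signals, linguistic exclusivity and geometric isolation, but does not give their formulas. As a rough illustration only, the toy sketch below assumes one possible reading: exclusivity as the share of a point's k nearest neighbours that are in the same language, and isolation as one minus the cosine similarity to the nearest point in any other language. The function name `culture_point_scores`, both scoring rules, the value of `k`, and the hand-made 3-dimensional vectors are all assumptions made for this example, not the paper's definitions.

```python
import numpy as np

def culture_point_scores(emb, langs, k=2):
    """Toy (isolation, exclusivity) scores per embedded text.

    emb:   (n, d) array of embeddings from any multilingual encoder.
    langs: length-n sequence of language codes.
    Both scoring rules are illustrative assumptions.
    """
    emb = np.asarray(emb, dtype=float)
    langs = np.asarray(langs)
    # Cosine similarity via L2-normalised dot products.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbour
    # Indices of the k most similar points for each row.
    nn = np.argsort(-sim, axis=1)[:, :k]
    # Assumed exclusivity: fraction of neighbours sharing the point's language.
    exclusivity = (langs[nn] == langs[:, None]).mean(axis=1)
    # Assumed isolation: 1 - similarity to the closest other-language point.
    # (Becomes inf if the corpus contains no other language.)
    other_lang = langs[:, None] != langs[None, :]
    isolation = 1.0 - np.where(other_lang, sim, -np.inf).max(axis=1)
    return isolation, exclusivity

# Hand-made toy data: two concepts aligned across "en" and "ja",
# plus a small "ja"-only cluster with no "en" counterpart.
langs = ["en", "ja", "en", "ja", "ja", "ja", "ja"]
emb = [
    [1.00, 0.00, 0.0],  # en concept A
    [0.99, 0.10, 0.0],  # ja concept A (aligned with en)
    [0.00, 1.00, 0.0],  # en concept B
    [0.10, 0.99, 0.0],  # ja concept B (aligned with en)
    [0.00, 0.00, 1.0],  # ja-only cluster
    [0.05, 0.00, 1.0],
    [0.00, 0.05, 1.0],
]
iso, exc = culture_point_scores(emb, langs, k=2)
for lang, i, e in zip(langs, iso, exc):
    print(f"{lang}: isolation={i:.3f} exclusivity={e:.2f}")
```

In this toy setup the "ja"-only cluster scores high on both signals, while the cross-lingually aligned concepts score low on isolation. A real pipeline would also need the noise filtering step the abstract mentions, which this sketch omits.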