🤖 AI Summary
Existing physical commonsense reasoning benchmarks predominantly reflect Western cultural contexts, overlooking how cultural differences shape physical problem-solving. To address this gap, we introduce EPiK (Everyday Physics in Korean Contexts), a Korean-culture-specific physical commonsense reasoning benchmark comprising 181 binary-choice questions that span nine reasoning subtasks and 84 culturally grounded scenarios, with topics such as kimchi and traditional fermentation. EPiK is built with a two-stage generation and verification pipeline that produces problems directly from Korean contexts rather than by translating existing ones, aiming for both physical accuracy and cultural authenticity. In the reported evaluations, Korean-specialized models consistently outperform general-purpose models of comparable size, which exposes limitations of culturally agnostic language models in culture-specific physical reasoning. EPiK thus adds a non-Western resource for physical commonsense evaluation and supports the case for culturally aware benchmarks when measuring language models’ understanding.
📝 Abstract
Existing physical commonsense reasoning benchmarks predominantly focus on Western contexts, overlooking cultural variations in physical problem-solving. To address this gap, we introduce EPiK (Everyday Physics in Korean Contexts), a novel benchmark comprising 181 binary-choice problems that test physical reasoning within Korean cultural contexts, ranging from kimchi (Korean food) to traditional fermentation. EPiK is constructed using a two-stage generation and verification pipeline to create culturally authentic problems across 9 reasoning subtasks and 84 scenarios. Unlike approaches based on simple translation, our method generates problems organically from Korean contexts while upholding rigorous physical reasoning standards. Our evaluations show that Korean-specialized models consistently outperform general-purpose models of comparable size. This performance gap highlights the limitations of culturally agnostic models and demonstrates the critical need for culturally aware benchmarks to truly measure language understanding. EPiK is publicly available at https://huggingface.co/datasets/jjae/EPiK.
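Since the benchmark is binary-choice, a model's score reduces to plain accuracy against a 50% chance baseline. The following is a minimal sketch of that scoring, not the authors' evaluation code: the `BinaryChoiceItem` fields, the `accuracy` helper, and the toy items are all hypothetical and invented for illustration, and the released dataset's actual schema may differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class BinaryChoiceItem:
    # Hypothetical schema; the released dataset's field names may differ.
    question: str
    option_a: str
    option_b: str
    answer: int  # 0 means option_a is correct, 1 means option_b is correct


def accuracy(items: Iterable[BinaryChoiceItem],
             predict: Callable[[BinaryChoiceItem], int]) -> float:
    """Return the fraction of items where predict() matches the gold index."""
    items = list(items)
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Toy items invented for illustration only; they are NOT taken from EPiK.
toy_items = [
    BinaryChoiceItem(
        question="Which bowl keeps soup hot longer at the table?",
        option_a="A preheated thick stone bowl",
        option_b="A thin metal bowl at room temperature",
        answer=0,
    ),
    BinaryChoiceItem(
        question="Where does a jar of vegetables ferment more slowly?",
        option_a="On a warm kitchen counter",
        option_b="In a cold refrigerator",
        answer=1,
    ),
]

# A trivial baseline that always picks the first option.
always_first = lambda item: 0
print(accuracy(toy_items, always_first))
```

In practice `predict` would wrap a language model call that compares the two options; with balanced gold labels, any constant-answer baseline like `always_first` sits at chance, which is the reference point for interpreting reported accuracies.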