🤖 AI Summary
This study addresses the long-overlooked task of indirect question answering (IQA) polarity classification in natural language processing, presenting the first systematic investigation of its performance across both high-resource languages (English and Standard German) and a low-resource variant (Bavarian German). To this end, we introduce InQA+, a small high-quality multilingual dataset with human annotations, alongside GenIQA, a larger synthetic dataset generated with GPT-4o-mini. We conduct comprehensive experiments with multilingual Transformer models, including mBERT, XLM-R, and mDeBERTa. Our results reveal consistently poor IQA performance and severe overfitting; while large-scale training data offers some gains, current LLM-generated examples lack deep pragmatic understanding and fail to transfer effectively. This work highlights the challenges of multilingual pragmatic reasoning and provides benchmark resources and an analytical framework for future research.
📝 Abstract
Indirectness is a common feature of everyday communication, yet it is underexplored in NLP research for both low- and high-resource languages. Indirect Question Answering (IQA) aims to classify the polarity of indirect answers. In this paper, we present two multilingual IQA corpora of varying quality, both covering English, Standard German, and Bavarian, a German dialect without a standard orthography: InQA+, a small, high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset containing artificial data generated by GPT-4o-mini. Based on several experimental variations with multilingual Transformer models (mBERT, XLM-R, and mDeBERTa), we find that IQA is a pragmatically hard task that poses various challenges, and we suggest and apply recommendations to tackle them. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set, and dataset size. We find that IQA performance is poor in both high-resource (English, German) and low-resource (Bavarian) languages, and that a large amount of training data is beneficial. Furthermore, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of the languages we tested.