🤖 AI Summary
This study identifies a systematic negative judgment bias in large language models (LLMs) induced by response format (binary versus continuous), challenging the implicit assumption that model outputs depend solely on input content. Method: Through controlled experiments across multiple open-source and commercial LLMs using rigorous prompt engineering, we evaluate this effect on value-statement judgment and text sentiment analysis tasks. Contribution/Results: Binary-format responses exhibit significantly higher negative classification rates than continuous formats (12.3%–18.7% higher on average), with high consistency across models and tasks. This is the first empirical demonstration that task framing alone can introduce reproducible, systematic bias in LLM outputs. The findings establish response format as a critical, often overlooked design variable in LLM-based decision-making applications, particularly in high-stakes domains such as psychological text analysis, where reliability and calibration are essential.
📝 Abstract
Large Language Models (LLMs) are increasingly used in tasks such as psychological text analysis and decision-making in automated workflows. However, their reliability remains a concern due to potential biases inherited from their training process. In this study, we examine how different response formats (binary versus continuous) may systematically influence LLMs' judgments. In a value-statement judgment task and a text sentiment analysis task, we prompted LLMs to simulate human responses and tested both formats across several open-source and commercial models. Our findings revealed a consistent negative bias: LLMs were more likely to deliver "negative" judgments in binary formats than in continuous ones. Control experiments further showed that this pattern holds across both tasks. Our results highlight the importance of considering response format when applying LLMs to decision tasks, as small changes in task design can introduce systematic biases.
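To make the experimental manipulation concrete, the sketch below contrasts the two elicitation formats on the sentiment task. This is a minimal illustration, not the paper's actual protocol: `query_model`, the prompt wording, and the 0–100 scale with a midpoint threshold are all hypothetical placeholders standing in for whatever client, prompts, and scale the study used.

```python
# Minimal sketch of a binary-vs-continuous response-format comparison.
# `query_model` is a hypothetical stand-in for any LLM chat-completion
# client; the prompt wording and scale are illustrative assumptions.

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

BINARY_PROMPT = (
    "Classify the sentiment of the following text as exactly one word, "
    "'positive' or 'negative'.\n\nText: {text}\nAnswer:"
)

CONTINUOUS_PROMPT = (
    "Rate the sentiment of the following text on a scale from 0 "
    "(most negative) to 100 (most positive). Reply with the number only."
    "\n\nText: {text}\nAnswer:"
)

def negative_rates(texts: list[str], threshold: float = 50.0) -> tuple[float, float]:
    """Return the share of negative judgments under each response format."""
    binary_neg = continuous_neg = 0
    for text in texts:
        # Binary format: the model must commit to a discrete label.
        if "negative" in query_model(BINARY_PROMPT.format(text=text)).lower():
            binary_neg += 1
        # Continuous format: the model emits a score that is thresholded
        # afterwards, so the discretization happens outside the model.
        score = float(query_model(CONTINUOUS_PROMPT.format(text=text)))
        if score < threshold:
            continuous_neg += 1
    n = len(texts)
    return binary_neg / n, continuous_neg / n
```

Under a setup like this, the paper's finding corresponds to the binary negative rate systematically exceeding the continuous one on the same texts, across models.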