Can Language Models Follow Multiple Turns of Entangled Instructions?

📅 2025-03-17
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Large language models (LLMs) struggle to maintain consistency under multi-turn, entangled instructions involving privacy constraints, preference requirements, and priority conflicts. To address this, we introduce MultiTurnInstruct—a benchmark comprising 1,100 high-quality multi-turn dialogues—and propose the first evaluation framework spanning three hierarchical challenges: information extraction, cross-turn instruction tracking, and conflict resolution. We establish a nine-category, multi-granularity capability taxonomy. Methodologically, we integrate human feedback-based data construction with multi-dimensional evaluation—BLEU scoring, human assessment, and fine-grained behavioral analysis—complemented by attention visualization and diagnostic probing. Key findings reveal that attention mechanisms fundamentally fail to coherently integrate entangled instructions; while GPT-series models exhibit strong memory retention, they perform poorly on privacy enforcement; enhanced reasoning capabilities do not substantially improve conflict resolution; and performance gains plateau with increasing parameter count, underscoring the intrinsic difficulty of multi-turn instruction coordination.

📝 Abstract
Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as keeping secrets, honoring personal preferences, and respecting priorities, which demands sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct, a dataset of around 1.1K high-quality multi-turn conversations built through a human-in-the-loop approach, yielding nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle to resolve conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss: models achieve strong BLEU scores on memorization tasks, yet their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.
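The abstract's point about strong BLEU scores on memorization tasks can be made concrete with a small sketch. The function below is a simplified, hypothetical stand-in (unigram precision with a brevity penalty), not the paper's actual evaluation pipeline; the reference/answer strings are invented for illustration.

```python
# Minimal BLEU-style score: clipped unigram precision times a brevity
# penalty. A real setup would use a full BLEU implementation; this only
# illustrates how surface overlap is rewarded on memorization tasks.
import math
from collections import Counter

def bleu1(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    # Each hypothesis token counts at most as often as it appears
    # in the reference (clipped counts).
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    precision = overlap / len(hyp)
    # Brevity penalty discourages answers shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision

# Turn 1 stated a fact; later we ask the model to repeat it verbatim.
print(bleu1("the codename is falcon", "the codename is falcon"))  # 1.0
print(bleu1("the codename is falcon", "falcon"))  # short answer, heavily penalized
```

A model can score near 1.0 on such recall probes while still failing the harder behaviors the paper studies (withholding the secret when privacy rules apply, or reconciling it with a later conflicting instruction), which is why surface-overlap metrics alone cannot explain the performance gaps.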
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to handle entangled instructions
Evaluating performance in multi-turn instruction scenarios
Identifying gaps in privacy and conflict resolution tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic investigation of multi-turn instruction handling
Human-in-the-loop approach for dataset construction
Analysis of trade-offs in model capabilities