🤖 AI Summary
This work addresses the difficulty of overseeing large language model (LLM) reasoning, where misaligned behaviors often surface too late for real-time intervention. The authors propose Behavior Cue Reasoning (BCR), which trains a model to emit behavior cue tokens that act as both signals and control levers, so that the model explicitly reveals its intent before executing critical actions. This makes the reasoning process monitorable and controllable. BCR is studied with two kinds of monitor: a weaker external monitor fine-tuned via reinforcement learning and a near-optimal rule-based monitor. Evaluated on mathematical reasoning and text-based games, the method lets the weaker monitor prune up to 50% of otherwise wasted reasoning tokens on complex math problems, and lets the rule-based monitor raise the task success rate from 46% to 96% in an environment where excessive constraint violations cause failure. These gains in reasoning efficiency and safety come without compromising the model's original performance.
📝 Abstract
Reasoning in Large Language Models (LLMs) poses a challenge for oversight, as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning, which makes LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When a weaker external monitor is fine-tuned with Reinforcement Learning for reasoning oversight, a compressed view containing only the information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cue Reasoning allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work advances scalable oversight by demonstrating how the monitored model itself can be trained to reason in ways that are more tractable to oversight.
Code to be released at https://github.com/christopherzc/text-games