Time-Reversal Provides Unsupervised Feedback to LLMs

📅 2024-12-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can obtain unsupervised feedback via time-reversed reasoning. To this end, it introduces the Time-Reversed Language Model (TRLM) paradigm: a TRLM takes an LLM's response as input and either generates a plausible query for it or scores the query given the response, enabling unsupervised self-critique and feedback for the forward LLM. The method comprises reverse-token-order pretraining and fine-tuning, response-conditioned query generation and scoring, TRLM-guided best-of-N re-ranking, and input-side safety filtering. Empirically, TRLM re-ranking improves performance by up to 5% on AlpacaEval; reverse scoring significantly outperforms forward scoring in citation generation and passage retrieval; and TRLM-augmented input filters substantially reduce false-negative rates on JailbreakBench attacks with negligible impact on false-positive rates. These results establish TRLM as a complementary mechanism to forward LLMs, particularly for re-ranking, safety filtering, and self-refinement.
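The best-of-N re-ranking step described above can be sketched in a few lines: the forward LLM produces N candidate responses, a time-reversed scorer estimates how well each response predicts the original query, and the top-scoring candidate is returned. This is a minimal illustration, not the paper's implementation; `reverse_log_score` is a hypothetical stand-in for a real reverse-trained model and here just measures query-token coverage.

```python
def reverse_log_score(query: str, response: str) -> float:
    """Toy proxy for score(query | response): the fraction of query
    tokens that also appear in the response. A real TRLM would return
    the log-probability of the query under a reverse-direction model."""
    q_tokens = query.lower().split()
    r_tokens = set(response.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in r_tokens for t in q_tokens) / len(q_tokens)

def best_of_n(query: str, candidates: list[str]) -> str:
    """Re-rank N forward generations by the reverse score and keep the best."""
    return max(candidates, key=lambda r: reverse_log_score(query, r))

query = "What is the capital of France?"
candidates = [
    "I enjoy long walks.",
    "The capital of France is Paris.",
]
print(best_of_n(query, candidates))
```

Swapping the toy scorer for self log-perplexity of the forward model recovers the baseline the paper compares against; the paper's gain comes from scoring in the response-to-query direction instead.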

📝 Abstract
Large Language Models (LLMs) are typically trained to predict in the forward direction of time. However, recent works have shown that prompting these models to look back and critique their own generations can produce useful feedback. Motivated by this, we explore the question of whether LLMs can be empowered to think (predict and score) backwards to provide unsupervised feedback that complements forward LLMs. Towards this, we introduce Time Reversed Language Models (TRLMs), which can score and generate queries when conditioned on responses, effectively functioning in the reverse direction of time. Further, to effectively infer in the response to query direction, we pre-train and fine-tune a language model (TRLM-Ba) in the reverse token order from scratch. We show empirically (and theoretically in a stylized setting) that time-reversed models can indeed complement forward model predictions when used to score the query given response for re-ranking multiple forward generations. We obtain up to 5% improvement on the widely used AlpacaEval Leaderboard over the competent baseline of best-of-N re-ranking using self log-perplexity scores. We further show that TRLM scoring outperforms conventional forward scoring of response given query, resulting in significant gains in applications such as citation generation and passage retrieval. We next leverage the generative ability of TRLM to augment or provide unsupervised feedback to input safety filters of LLMs, demonstrating a drastic reduction in false negative rate with negligible impact on false positive rates against several attacks published on the popular JailbreakBench leaderboard.
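The abstract's reverse-token pretraining (TRLM-Ba) can be illustrated with the data-preparation step: training sequences are reversed at the token level, so a standard left-to-right next-token objective learns the response-to-query direction. This is a hedged sketch under the assumption that whitespace tokenization and a `<sep>` marker stand in for a real subword tokenizer and chat template.

```python
def to_reverse_order(tokens: list[str]) -> list[str]:
    """Reverse a token sequence for reverse-direction pretraining:
    a causal LM trained on this stream predicts earlier tokens
    from later ones."""
    return tokens[::-1]

# Illustrative query/response pair, joined by a hypothetical <sep> token.
forward = "What is 2+2 ? <sep> 2+2 equals 4 .".split()
reverse = to_reverse_order(forward)
print(" ".join(reverse))
# The reversed stream starts with the response tokens, so conditioning
# on a response and generating onward yields a query, i.e. TRLM-Ba's
# response-to-query inference.
```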
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Time-reversed Language Modeling
Performance Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-Reversed Language Models
Performance Enhancement
Safety Filtering