Inverse Language Modeling towards Robust and Grounded LLMs

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

Current defenses for large language models (LLMs) are fragmented, lack robustness, and fail to identify harmful inputs via output-driven reverse inference. To address this, we propose Inverse Language Modeling (ILM), the first framework to leverage inverse modeling for enhancing LLM adversarial robustness and intrinsic interpretability. ILM jointly optimizes input perturbation defense and output-based backward reasoning, enabling native detection of toxic triggers, content-grounded explanations, and behavior analyzability—without external classifiers or model fine-tuning. It supports red-teaming evaluations and real-time safety responses. Experiments demonstrate that ILM significantly improves model stability under adversarial attacks while increasing detection accuracy and traceability of latent unsafe inputs. By unifying defensive and explanatory capabilities within a single forward-inverse modeling paradigm, ILM establishes a novel foundation for developing controllable, trustworthy next-generation LLMs.

Technology Category

Application Category

📝 Abstract

The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations, and, at the same time, 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping RED teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.

Problem

Research questions and friction points this paper is trying to address.

Improving LLM robustness to input perturbations

Enabling native grounding by inverting model outputs

Transforming LLMs into analyzable and controllable systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverse Language Modeling framework improves LLM robustness

Inverts model outputs to identify toxic input triggers

Transforms LLMs into analyzable and robust systems

🔎 Similar Papers

Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning