FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three limitations of molecular optimization driven by natural language prompts: a data-scaling bottleneck, chemical hallucinations, and neglect of the context dependence of fragment effects. It introduces FORGE, a two-stage framework that reframes optimization as a context-aware local editing task. Stage 1 ranks candidate fragments by their contribution to the target property under the full molecular context; Stage 2 generates explicit fragment replacements. Instead of relying on human text annotations, FORGE draws its fragment-level supervision from automatically mined, verified low-to-high edit pairs, which the authors present as a scalable, hallucination-less alternative to natural language training. It also adapts to unseen black-box objectives through in-context demonstrations. Built on a compact 0.6B-parameter language model, FORGE consistently outperforms prior methods, including substantially larger language models and graph methods, on the Prompt-MolOpt, PMO-1k, and ChemCoTBench benchmarks.
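The idea of mining low-to-high edit pairs can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each molecule has already been split into an ordered tuple of fragment labels and scored by some property oracle, and the function name `mine_edit_pairs`, the record fields, and the toy data are all hypothetical. Two molecules are paired when they differ at exactly one fragment position, and the pair is oriented from the lower-scoring to the higher-scoring molecule.

```python
from itertools import combinations


def mine_edit_pairs(scored_molecules):
    """Collect single-fragment edits that raise the property score.

    scored_molecules: list of (fragments, score) where fragments is a tuple
    of fragment labels (an assumed, simplified molecule representation).
    """
    pairs = []
    for (frags_a, score_a), (frags_b, score_b) in combinations(scored_molecules, 2):
        # Only compare molecules with the same fragment count and different scores.
        if len(frags_a) != len(frags_b) or score_a == score_b:
            continue
        diff = [i for i, (x, y) in enumerate(zip(frags_a, frags_b)) if x != y]
        if len(diff) != 1:
            continue  # not a local, single-fragment edit
        # Orient the pair from the low-scoring to the high-scoring molecule.
        if score_a < score_b:
            low, high = frags_a, frags_b
        else:
            low, high = frags_b, frags_a
        pos = diff[0]
        pairs.append(
            {"low": low, "high": high, "position": pos, "old": low[pos], "new": high[pos]}
        )
    return pairs


# Toy data (hypothetical labels and scores, not real chemistry).
toy = [
    (("benzene", "amide", "methyl"), 0.40),
    (("benzene", "amide", "fluoro"), 0.65),
    (("pyridine", "ester", "methyl"), 0.55),
]
for pair in mine_edit_pairs(toy):
    print(pair["old"], "->", pair["new"], "at position", pair["position"])
```

Because every record comes from two existing, scored molecules, no text annotation is needed, which is the property the summary attributes to this kind of supervision.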

📝 Abstract

Molecular optimization seeks to improve a molecule through small structural edits while preserving similarity to the starting compound. Recent language-model approaches typically treat this task as prompt-conditioned sequence generation. However, relying on natural language introduces an inherent data-scaling bottleneck, often leads to chemical hallucinations, and ignores the strong context dependence of fragment effects. We present FORGE, a two-stage framework that reformulates molecular optimization as context-aware local editing. FORGE is trained on automatically mined, verified low-to-high edit pairs instead of expensive human text annotations. Stage 1 ranks candidate fragments by their property contribution under the full molecular context, which injects a chemical prior, and Stage 2 generates explicit fragment replacements. Built on a compact 0.6B language model, FORGE further adapts to unseen black-box objectives through in-context demonstrations. Across Prompt-MolOpt, PMO-1k, and ChemCoTBench, FORGE consistently outperforms prior methods, including substantially larger language models and graph methods. These results highlight the value of explicit fragment-level supervision as a more easily obtainable, scalable, and hallucination-less alternative to natural language training.
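The rank-then-replace structure described in the abstract can be sketched in a few lines. This is a toy stand-in, not FORGE itself: in the paper both stages are handled by a language model, whereas here Stage 1 is approximated by leave-one-out ablation against a hand-written oracle and Stage 2 by a search over a small candidate list. The names `toy_score`, `rank_fragments`, `replace_fragment`, and `optimize`, along with the fragment labels and weights, are assumptions made for illustration.

```python
def toy_score(fragments):
    """Hypothetical property oracle with one context-dependent term."""
    base = {"benzene": 0.3, "pyridine": 0.2, "amide": 0.2,
            "methyl": 0.0, "fluoro": 0.1, "hydroxyl": 0.15}
    score = sum(base.get(f, 0.0) for f in fragments)
    # Context dependence: fluoro is worth more when benzene is present.
    if "fluoro" in fragments and "benzene" in fragments:
        score += 0.2
    return score


def rank_fragments(fragments, score_fn):
    """Stage 1 stand-in: order positions by leave-one-out contribution.

    Positions whose removal costs the least come first, since they are
    the most promising candidates for replacement.
    """
    full = score_fn(fragments)
    contribution = {
        i: full - score_fn(fragments[:i] + fragments[i + 1:])
        for i in range(len(fragments))
    }
    return sorted(contribution, key=contribution.get)


def replace_fragment(fragments, position, candidates, score_fn):
    """Stage 2 stand-in: pick the best-scoring replacement at one position."""
    options = [fragments[:position] + (c,) + fragments[position + 1:] for c in candidates]
    return max(options, key=score_fn)


def optimize(fragments, candidates, score_fn):
    """Apply one local edit at the first ranked position that improves the score."""
    for position in rank_fragments(fragments, score_fn):
        edited = replace_fragment(fragments, position, candidates, score_fn)
        if score_fn(edited) > score_fn(fragments):
            return edited
    return fragments  # no improving single-fragment edit found


candidates = ["fluoro", "hydroxyl"]
print(optimize(("benzene", "amide", "methyl"), candidates, toy_score))
print(optimize(("pyridine", "amide", "methyl"), candidates, toy_score))
```

With this toy oracle the same position is edited in both molecules, but the chosen replacement differs because the value of `fluoro` depends on the rest of the molecule. That is the kind of context dependence the abstract argues prompt-conditioned generation tends to ignore.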

Problem

Research questions and friction points this paper is trying to address.

molecular optimization

chemical hallucinations

context dependence

fragment effects

data-scaling bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

fragment-oriented editing

context-aware molecular optimization

chemical hallucination mitigation