FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization

📅 2026-05-11
🤖 AI Summary
This work addresses three limitations of molecular optimization driven by natural-language prompts: limited data scalability, chemical hallucinations, and neglect of the context dependence of fragment effects. It introduces FORGE, a two-stage framework that reframes optimization as a context-aware local editing task. The first stage ranks candidate fragments by their contribution to the target property under the full molecular context; the second stage generates explicit fragment replacements. Instead of manual text annotations, FORGE uses automatically mined, verified low-to-high edit pairs for fragment-level supervision, which makes training more scalable and less prone to hallucination. It also adapts to unseen black-box objectives through in-context demonstrations. Built on a 0.6B-parameter language model, FORGE consistently outperforms prior methods, including substantially larger language models and graph methods, on the Prompt-MolOpt, PMO-1k, and ChemCoTBench benchmarks.
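The idea of mining low-to-high edit pairs can be illustrated with a small sketch. It is not the paper's pipeline: molecules are represented here as toy tuples of fragment labels, and `EditPair`, `single_fragment_diff`, `mine_edit_pairs`, and the property scores are all hypothetical names and values chosen for illustration. The sketch pairs molecules that differ in exactly one fragment and keeps the pair only when the property value increases, which stands in for the "verified" check.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple, Optional, Tuple

# Toy stand-in (assumption): a molecule is an ordered tuple of fragment labels.
Molecule = Tuple[str, ...]


class EditPair(NamedTuple):
    """Hypothetical record for one low-to-high single-fragment edit."""
    source: Molecule
    target: Molecule
    position: int
    old_fragment: str
    new_fragment: str
    gain: float


def single_fragment_diff(a: Molecule, b: Molecule) -> Optional[int]:
    """Return the index of the only differing fragment, or None otherwise."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def mine_edit_pairs(scored: Dict[Molecule, float], min_gain: float = 0.0) -> List[EditPair]:
    """Collect pairs that differ in one fragment and improve the property."""
    pairs = []
    for a, b in combinations(scored, 2):
        pos = single_fragment_diff(a, b)
        if pos is None:
            continue
        # Orient the pair from the lower-scoring to the higher-scoring molecule.
        low, high = (a, b) if scored[a] < scored[b] else (b, a)
        gain = scored[high] - scored[low]
        if gain > min_gain:  # keep only verified improvements
            pairs.append(EditPair(low, high, pos, low[pos], high[pos], gain))
    return pairs


# Made-up property values, for illustration only.
scored = {
    ("methyl", "phenyl", "nitro"): 0.7,
    ("methyl", "phenyl", "hydroxyl"): 2.7,
    ("ethyl", "phenyl", "hydroxyl"): 2.9,
    ("ethyl", "pyridyl", "amino"): 1.0,
}
pairs = mine_edit_pairs(scored)
for pair in pairs:
    print(pair.old_fragment, "->", pair.new_fragment, "at position", pair.position)
```

Because each record names the edited position and the replacement fragment, supervision of this kind needs no text annotation, which is the scalability argument the summary makes.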
📝 Abstract
Molecular optimization seeks to improve a molecule through small structural edits while preserving similarity to the starting compound. Recent language-model approaches typically treat this task as prompt-conditioned sequence generation. However, relying on natural language introduces an inherent data-scaling bottleneck, often leads to chemical hallucinations, and ignores the strong context dependence of fragment effects. We present FORGE, a two-stage framework that reformulates molecular optimization as context-aware local editing. FORGE is trained on automatically mined, verified low-to-high edit pairs instead of expensive human text annotations. Stage 1 ranks candidate fragments by their property contribution under the full molecular context, which injects a chemical prior, and Stage 2 generates explicit fragment replacements. Built on a compact 0.6B language model, FORGE further adapts to unseen black-box objectives through in-context demonstrations. Across Prompt-MolOpt, PMO-1k, and ChemCoTBench, FORGE consistently outperforms prior methods, including substantially larger language models and graph methods. These results highlight the value of explicit fragment-level supervision as a more easily obtainable, more scalable, and less hallucination-prone alternative to natural language training.
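The rank-then-replace structure described in the abstract can be sketched in a few lines. This is a toy reading under stated assumptions, not FORGE itself: molecules are tuples of fragment labels, `toy_property` is an invented context-dependent score, Stage 1 is approximated by a leave-one-out contribution, and Stage 2 is approximated by an exhaustive search over a small vocabulary where the paper uses a language model. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-in (assumption): a molecule is an ordered tuple of fragment labels.
Molecule = Tuple[str, ...]


def toy_property(mol: Molecule) -> float:
    """Invented score: a hydroxyl is worth more when adjacent to a phenyl."""
    base = {"phenyl": 1.0, "methyl": 0.2, "hydroxyl": 0.5, "nitro": -0.5, "fluoro": 0.8}
    score = sum(base.get(frag, 0.0) for frag in mol)
    for left, right in zip(mol, mol[1:]):
        if {left, right} == {"phenyl", "hydroxyl"}:
            score += 1.0  # context-dependent bonus
    return score


def rank_fragments(mol: Molecule, prop: Callable[[Molecule], float]) -> List[Tuple[int, float]]:
    """Stage 1 stand-in: leave-one-out contribution, lowest contribution first."""
    full = prop(mol)
    contributions = []
    for i in range(len(mol)):
        without = mol[:i] + mol[i + 1:]
        contributions.append((i, full - prop(without)))
    return sorted(contributions, key=lambda item: item[1])


def replace_fragment(mol: Molecule, index: int, vocabulary: Sequence[str],
                     prop: Callable[[Molecule], float]) -> Molecule:
    """Stage 2 stand-in: try every vocabulary fragment at the chosen position."""
    candidates = [mol[:index] + (frag,) + mol[index + 1:] for frag in vocabulary]
    return max(candidates, key=prop)


def optimize(mol: Molecule, vocabulary: Sequence[str],
             prop: Callable[[Molecule], float]) -> Molecule:
    """Edit the fragment that contributes least, leaving the rest untouched."""
    index, _ = rank_fragments(mol, prop)[0]
    return replace_fragment(mol, index, vocabulary, prop)


start = ("methyl", "phenyl", "nitro")
vocabulary = ["methyl", "hydroxyl", "nitro", "fluoro"]
result = optimize(start, vocabulary, toy_property)
print(start, "->", result)
```

In this toy setting the hydroxyl replacement wins only because of its phenyl neighbor, which mirrors the abstract's point that a fragment's effect depends on the surrounding molecular context. Restricting the edit to one position also keeps the result close to the starting compound.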
Problem

Research questions and friction points this paper is trying to address.

molecular optimization
chemical hallucinations
context dependence
fragment effects
data-scaling bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

fragment-oriented editing
context-aware molecular optimization
chemical hallucination mitigation
in-context learning
molecular property ranking
Authors
Qingchuan Zhang
University of Science and Technology of China
He Cao
International Digital Economy Academy
Hao Li
Peking University
Yanjun Shao
Yale University
Zhiyuan Liu
National University of Singapore
Shihang Wang
DAMO Academy, Alibaba Inc.
Natural Language Processing
Shufang Xie
GSAI, Renmin University of China
Machine Learning
Shenghua Gao
The University of Hong Kong
Computer Vision, Pattern Recognition, Machine Learning
Xinwu Ye
The University of Hong Kong
AIDD, LLMs, bioinformatics