arxiv:2509.19586

A Foundation Chemical Language Model for Comprehensive Fragment-Based Drug Discovery

Published on Sep 23, 2025

AI-generated summary

FragAtlas-62M is a GPT-2-based foundation model trained on a large-scale fragment dataset; it generates chemically valid fragments with high coverage of fragment chemical space.

Abstract

We introduce FragAtlas-62M, a specialized foundation model trained on the largest fragment dataset to date. Built on the complete ZINC-22 fragment subset, which comprises over 62 million molecules, it achieves unprecedented coverage of fragment chemical space. Our GPT-2-based model (42.7M parameters) generates fragments of which 99.90% are chemically valid. Validation across 12 descriptors and three fingerprint methods shows that generated fragments closely match the training distribution (all effect sizes < 0.4). The model retains 53.6% of known ZINC fragments while producing 22% novel structures with practical relevance. We release FragAtlas-62M with training code, preprocessed data, documentation, and model weights to accelerate adoption.
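
The distribution-match and novelty figures above can be illustrated with a minimal sketch. The abstract does not say which effect-size statistic was used or exactly how novelty was counted, so the code below assumes Cohen's d with a pooled standard deviation and a simple set difference over SMILES strings. The function names `cohens_d` and `novelty_fraction` and all numbers are hypothetical toy values, not the released code or data.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample variance.

    Assumption: the paper's "effect size" is taken to be Cohen's d here;
    the abstract does not name the statistic.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def novelty_fraction(generated, reference):
    """Share of unique generated strings that do not appear in the reference set.

    Assumption: both inputs are already canonicalized SMILES, so plain string
    comparison is meaningful. Canonicalization itself is not shown.
    """
    unique = set(generated)
    return len(unique - set(reference)) / len(unique)


# Toy descriptor values (e.g. molecular weight) for training and generated sets.
train_mw = [150.0, 180.0, 210.0, 240.0]
gen_mw = [155.0, 185.0, 205.0, 245.0]
d = cohens_d(gen_mw, train_mw)
print(f"effect size: {d:.3f}, below 0.4: {abs(d) < 0.4}")

# Toy SMILES lists; the duplicate "CCO" is counted once.
generated = ["CCO", "c1ccccc1", "CCN", "CCO"]
reference = ["CCO", "CCC"]
print(f"novel fraction: {novelty_fraction(generated, reference):.3f}")
```

In a real evaluation the same comparison would be repeated for each of the 12 descriptors, and the SMILES would first be canonicalized with a cheminformatics toolkit so that two spellings of the same molecule are not counted as distinct.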