[1911.04738v1] SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery
ST fingerprints were shown to work well with any predictive model in MoleculeNet downstream tasks and are especially effective when labeled data are scarce.
Abstract
In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly when paired with shallow prediction models or trained on small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of a sequence-to-sequence language model on a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithm in small-data settings, where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.
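The pre-training scheme the abstract describes can be pictured with a short sketch: a sequence-to-sequence Transformer is trained to reconstruct SMILES strings, and after pre-training the pooled encoder states serve as a fixed-length fingerprint for downstream predictors. Everything below is an illustrative assumption rather than the paper's exact setup: the SmilesTransformer class, character-level tokenization, mean pooling, and all hyperparameters are placeholders chosen only to make the idea concrete and runnable.

```python
# Minimal sketch of seq2seq pre-training on SMILES and fingerprint extraction.
# Architecture sizes, tokenization, and pooling are assumptions, not the
# paper's reported configuration.
import torch
import torch.nn as nn

PAD, BOS, EOS = 0, 1, 2  # special token ids (assumed convention)

class SmilesTransformer(nn.Module):
    """Seq2seq Transformer that reconstructs a SMILES string from itself.

    After pre-training, the encoder output (mean-pooled over tokens) is
    used as a fixed-length molecular fingerprint.
    """

    def __init__(self, vocab_size: int, d_model: int = 256, max_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions (assumption)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=4, num_decoder_layers=4,
            dim_feedforward=512, batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def _embed(self, x: torch.Tensor) -> torch.Tensor:
        # Token embedding plus learned positional embedding.
        return self.embed(x) + self.pos(torch.arange(x.size(1), device=x.device))

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Causal mask so the decoder cannot peek at future tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(
            self._embed(src), self._embed(tgt),
            tgt_mask=tgt_mask,
            src_key_padding_mask=(src == PAD),
            tgt_key_padding_mask=(tgt == PAD),
        )
        return self.out(h)  # (batch, tgt_len, vocab)

    @torch.no_grad()
    def fingerprint(self, src: torch.Tensor) -> torch.Tensor:
        # Mean-pool encoder states over non-padding positions.
        mask = (src != PAD).unsqueeze(-1).float()
        enc = self.transformer.encoder(
            self._embed(src), src_key_padding_mask=(src == PAD)
        )
        return (enc * mask).sum(1) / mask.sum(1)  # (batch, d_model)

def encode_smiles(smiles: str, stoi: dict) -> torch.Tensor:
    """Character-level tokenization: <bos> c1 c2 ... <eos> (an assumption)."""
    return torch.tensor([BOS] + [stoi[c] for c in smiles] + [EOS])

# --- toy usage: one gradient step of the reconstruction objective ---
stoi = {c: i + 3 for i, c in enumerate("CNO()=c1n#[]+-@Hl")}  # tiny demo vocab
model = SmilesTransformer(vocab_size=len(stoi) + 3)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

batch = nn.utils.rnn.pad_sequence(
    [encode_smiles(s, stoi) for s in ["CCO", "c1ccccc1"]],
    batch_first=True, padding_value=PAD,
)
logits = model(batch, batch[:, :-1])  # teacher forcing on the shifted input
loss = loss_fn(logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
loss.backward()
opt.step()

fp = model.fingerprint(batch)  # (2, 256) molecular fingerprints
```

Because the objective is unsupervised reconstruction, pre-training needs only unlabeled SMILES; the pooled encoder output is then frozen and fed to any downstream predictor, which is what makes the fingerprint usable in low-data settings.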
[Figure 2: Illustration of SMILES Transformer pre-training and fingerprint extraction.]
[Figure 5: ROC-AUC scores on groups stratified by SMILES length (left) and the distribution of SMILES lengths (right) for the BBBP dataset.]