Standardizing a molecule using RDKit

“Cheminformatics is hard.” That is a great quote from Prof. Paul Finn. I think part of it is due to the nature of chemistry (e.g. which is the correct tautomer for this molecule?), and part of it is due to the lack of “standard” process definitions.

So I am revisiting the standardization (of the molecule) / normalization (of functional groups) pipeline for ML, and I had to post to the extremely helpful RDKit mailing list for help. Using the excellent sources they pointed me to, I ended up with the following (which will surely come in handy in a few months' time when I go through the whole process again):

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles):
    # follows the steps in
    # https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
    # as described **excellently** (by Greg) in
    # https://www.youtube.com/watch?v=eWTApNX8dJQ
    mol = Chem.MolFromSmiles(smiles)
    
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol) 
    
    # if many fragments, get the "parent" (the actual mol we are interested in) 
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
        
    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger() # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
    
    # note that no attempt is made at reionization at this step
    # nor at ionization at some pH (rdkit has no pKa calculator)
    # the main aim is to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogue, etc.
    
    te = rdMolStandardize.TautomerEnumerator() # as above, no convenience method
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
    
    return taut_uncharged_parent_clean_mol
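
A minimal sketch of how the function might be applied to a batch of SMILES and written out as CSV. Defining `standardize` on its own produces no output; it has to be called on each input and the returned molecule converted back to SMILES. The helper name `standardize_to_csv`, the column names, the output file name, and the `None` check for unparsable SMILES are assumptions made for illustration, not part of the pipeline above:

```python
import csv

from rdkit import Chem, RDLogger
from rdkit.Chem.MolStandardize import rdMolStandardize

# Optional: silence RDKit's log messages (the standardizer is quite chatty).
RDLogger.DisableLog("rdApp.*")


def standardize(smiles):
    """Same steps as above, but returns None when the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # assumption: skip bad input instead of raising
        return None
    mol = rdMolStandardize.Cleanup(mol)
    mol = rdMolStandardize.FragmentParent(mol)
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    return rdMolStandardize.TautomerEnumerator().Canonicalize(mol)


def standardize_to_csv(smiles_list, out_path):
    """Write one row per input: the original SMILES and its standardized SMILES."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["input_smiles", "standardized_smiles"])
        for smiles in smiles_list:
            mol = standardize(smiles)
            # An empty cell marks inputs that RDKit could not parse.
            standardized = Chem.MolToSmiles(mol) if mol is not None else ""
            writer.writerow([smiles, standardized])


if __name__ == "__main__":
    examples = ["CC(=O)[O-].[Na+]", "c1ccccc1C(=O)O", "not_a_smiles"]
    standardize_to_csv(examples, "standardized.csv")
```

Running this writes `standardized.csv` with one row per input; the unparsable string gets an empty second column instead of stopping the run.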

Comments

2 responses to “Standardizing a molecule using RDKit”

May 10, 2022

Molecular Standardization | Oxford Protein Informatics Group

[…] following code derives from Greg Landrum and JP Ebejer, with two variants of the method: one that expects a SMILES string, and another that needs an RDKit […]

August 4, 2022

Rahul

Hello JP, thanks for the great tutorial, very brief and crisp. I am pretty new to Python and a beginner. When I applied your code to my data, it did not return any value, and not even an error. If you could also add a bit of code showing how to write the standardized output to a CSV file, it would be highly appreciated.

Thanks and cheers!
