Cheminformatics is hard. That is a great quote from Prof. Paul Finn. I think part of the difficulty comes from the nature of chemistry itself (e.g. which is the correct tautomer for this molecule?), and part of it from the lack of “standard” process definitions.
So I am revisiting the standardization (of the molecule) / normalization (of functional groups) pipeline for ML, and I had to post to the extremely helpful RDKit mailing list for help. Using the excellent sources they pointed me to, I ended up with the following (which will surely come in handy in a few months' time when I go through the whole process again):
```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(smiles):
    # follows the steps in
    # https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
    # as described **excellently** (by Greg) in
    # https://www.youtube.com/watch?v=eWTApNX8dJQ
    mol = Chem.MolFromSmiles(smiles)

    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol)

    # if many fragments, get the "parent" (the actual mol we are interested in)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)

    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)

    # note that no attempt is made at reionization at this step
    # nor at ionization at some pH (rdkit has no pKa calculator)
    # the main aim is to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogue, etc.

    te = rdMolStandardize.TautomerEnumerator()  # idem: no convenience method exists
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)

    return taut_uncharged_parent_clean_mol
```
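Note that the function only returns an RDKit molecule object: nothing is printed or saved unless you call it and do something with the result. As a minimal sketch of how it might be used, the block below repeats the same steps so that it runs on its own, converts the result back to a SMILES string, and writes a two-column CSV. The helper names `standardized_smiles` and `write_standardized_csv`, the column headers, and the choice to return `None` for an unparsable SMILES are my own assumptions for illustration, not part of RDKit or of the notebook linked above.

```python
import csv
import sys

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(smiles):
    """Same steps as in the post; returns an RDKit Mol, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # assumption: skip unparsable input instead of raising
        return None
    clean_mol = rdMolStandardize.Cleanup(mol)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
    uncharged_mol = rdMolStandardize.Uncharger().uncharge(parent_clean_mol)
    return rdMolStandardize.TautomerEnumerator().Canonicalize(uncharged_mol)


def standardized_smiles(smiles):
    """Hypothetical helper: standardize and convert back to canonical SMILES."""
    mol = standardize(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


def write_standardized_csv(smiles_list, out_file):
    """Hypothetical helper: write (input, standardized) pairs to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["input_smiles", "standardized_smiles"])
    for smi in smiles_list:
        # csv writes None as an empty field, so failures are easy to spot
        writer.writerow([smi, standardized_smiles(smi)])


if __name__ == "__main__":
    # sodium acetate, and a deliberately invalid string
    examples = ["[Na+].CC(=O)[O-]", "not_a_smiles"]
    write_standardized_csv(examples, sys.stdout)
```

To get a file on disk rather than output on the terminal, pass a file handle instead of `sys.stdout`, for example `with open("standardized.csv", "w", newline="") as fh: write_standardized_csv(examples, fh)` (the file name here is just a placeholder).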
2 responses to “Standardizing a molecule using RDKit”
[…] following code derives from Greg Landrum and JP Ebejer, with two variants of the method: one that expects a SMILES string, and another that needs an RDKit […]
Hello JP, thanks for the great tutorial, very brief and crisp. I am a beginner and pretty new to Python. When I applied your code to my data, it did not return any value, and it did not raise an error either. If you could also add a bit of code showing how to write the standardized output to a CSV file, it would be highly appreciated.
Thanks and cheers!