Cheminformatics is hard. That is a great quote from Prof. Paul Finn. I think part of the difficulty comes from the nature of chemistry itself (e.g. which tautomer is the correct one for a given molecule?), and part from the lack of “standard” process definitions.
So I am revisiting the standardization (of the molecule) / normalization (of functional groups) pipeline for ML, and I had to post to the extremely helpful RDKit mailing list for help. Using the excellent sources they pointed me to, I ended up with the following (which will surely come in handy in a few months' time, when I go through the whole process again):
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(smiles):
    # follows the steps in
    # https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
    # as described **excellently** (by Greg) in
    # https://www.youtube.com/watch?v=eWTApNX8dJQ
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when the SMILES cannot be parsed
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol)

    # if many fragments, get the "parent" (the actual mol we are interested in)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)

    # try to neutralize the molecule
    uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)

    # note that no attempt is made at reionization at this step,
    # nor at ionization at some pH (rdkit has no pKa calculator);
    # the main aim is to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogue, etc.
    te = rdMolStandardize.TautomerEnumerator()  # same here: no convenience method
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)

    return taut_uncharged_parent_clean_mol
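Since the whole point is to get molecules from different sources into a single representation, one natural next step is to group the raw inputs by their standardized form and see which ones collapse onto the same structure. Below is a minimal sketch of that idea. The helper name `group_by_standard_form` and its `to_key` parameter are my own illustrative choices, not part of RDKit; `to_key` is any function that turns a raw SMILES into a string key, and the helper itself is plain Python.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def group_by_standard_form(
    smiles_list: Iterable[str],
    to_key: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Group input SMILES by the key that `to_key` returns for each one.

    Inputs for which `to_key` returns None or raises an exception are
    collected under the empty-string key, so failures stay visible
    instead of being silently dropped.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for smi in smiles_list:
        try:
            key = to_key(smi)
        except Exception:
            # treat any failure in the key function as "no key"
            key = None
        groups[key if key is not None else ""].append(smi)
    return dict(groups)
```

With RDKit installed, `to_key` could be something like `lambda s: Chem.MolToSmiles(standardize(s))`, so that each group holds all the raw inputs that end up as the same standardized molecule. Any group with more than one entry is a set of duplicates that would otherwise have been treated as distinct compounds.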