{"id":722,"date":"2021-06-24T10:28:26","date_gmt":"2021-06-24T08:28:26","guid":{"rendered":"https:\/\/bitsilla.com\/blog\/?p=722"},"modified":"2021-10-04T23:23:55","modified_gmt":"2021-10-04T21:23:55","slug":"standardizing-a-molecule-using-rdkit","status":"publish","type":"post","link":"https:\/\/bitsilla.com\/blog\/2021\/06\/standardizing-a-molecule-using-rdkit\/","title":{"rendered":"Standardizing a molecule using RDKit"},"content":{"rendered":"\n<p class=\"has-drop-cap wp-block-paragraph\"><em>Cheminformatics is hard.<\/em>  That is a great quote from <a rel=\"noreferrer noopener\" href=\"https:\/\/www.buckingham.ac.uk\/directory\/professor-paul-finn\/\" target=\"_blank\">Prof. Paul Finn<\/a>.  I think part of it is due to the nature of chemistry (e.g. which is the correct tautomer for this molecule?), and part of it is because of the lack of &#8220;standard&#8221; process definitions.  <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So I am revisiting the standardization (of the molecule)\/normalization (of functional groups) pipeline for ML, and I had to post to the extremely helpful RDKit mailing list for help (<a rel=\"noreferrer noopener\" href=\"https:\/\/sourceforge.net\/p\/rdkit\/mailman\/rdkit-discuss\/thread\/CANjYGkSoZZsTrOLvjM8mN9FymX7nMu3G9iZQL8N_sTA%3DmzKmfw%40mail.gmail.com\/#msg37305148\" target=\"_blank\">here<\/a>).  
Using the excellent sources they pointed me to, I ended up with the following (which will surely come in handy in a few months, when I go through the whole process again):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom rdkit import Chem\nfrom rdkit.Chem.MolStandardize import rdMolStandardize\n\n\ndef standardize(smiles):\n    # follows the steps in\n    # https:\/\/github.com\/greglandrum\/RSC_OpenScience_Standardization_202104\/blob\/main\/MolStandardize%20pieces.ipynb\n    # as described **excellently** (by Greg) in\n    # https:\/\/www.youtube.com\/watch?v=eWTApNX8dJQ\n    mol = Chem.MolFromSmiles(smiles)\n    \n    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule\n    clean_mol = rdMolStandardize.Cleanup(mol) \n    \n    # if many fragments, get the &quot;parent&quot; (the actual mol we are interested in) \n    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)\n        \n    # try to neutralize molecule\n    uncharger = rdMolStandardize.Uncharger() # annoying, but necessary as no convenience method exists\n    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)\n    \n    # note that no attempt is made at reionization at this step\n    # nor at ionization at some pH (rdkit has no pKa calculator)\n    # the main aim is to represent all molecules from different sources\n    # in a (single) standard way, for use in ML, catalogue, etc.\n    \n    te = rdMolStandardize.TautomerEnumerator() # idem\n    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)\n    \n    return taut_uncharged_parent_clean_mol\n<\/pre><\/div>","protected":false},"excerpt":{"rendered":"<p>Cheminformatics is hard. That is a great quote from Prof. Paul Finn. I think part of it is due to the nature of chemistry (e.g. which is the correct tautomer for this molecule?), and part of it is because of the lack of &#8220;standard&#8221; process definitions. 
So I am revisiting the standardization (of the molecule)\/normalization(of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-722","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pap6Kd-bE","_links":{"self":[{"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/posts\/722","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/comments?post=722"}],"version-history":[{"count":6,"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/posts\/722\/revisions"}],"predecessor-version":[{"id":753,"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/posts\/722\/revisions\/753"}],"wp:attachment":[{"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/media?parent=722"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/categories?post=722"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bitsilla.com\/blog\/wp-json\/wp\/v2\/tags?post=722"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}