Category: Uncategorized

  • Standardizing a molecule using RDKit

    Cheminformatics is hard. That is a great quote from Prof. Paul Finn. I think part of it is due to the nature of chemistry (e.g. which is the correct tautomer for this molecule?), and part of it is because of the lack of “standard” process definitions.

    So I am revisiting the standardization (of the molecule) / normalization (of functional groups) pipeline for ML, and I had to post to the extremely helpful RDKit mailing list for help. Using the excellent sources they pointed me to, I ended up with the following (which will surely come in handy in a few months' time when I go through the whole process again):

    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize


    def standardize(smiles):
        # follows the steps in
        # https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
        # as described **excellently** (by Greg) in
        # https://www.youtube.com/watch?v=eWTApNX8dJQ
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            # MolFromSmiles returns None (rather than raising) on unparsable input
            raise ValueError(f"could not parse SMILES: {smiles!r}")

        # remove Hs, disconnect metal atoms, normalize the molecule, reionize the molecule
        clean_mol = rdMolStandardize.Cleanup(mol)

        # if many fragments, get the "parent" (the actual mol we are interested in)
        parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)

        # try to neutralize molecule
        uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
        uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)

        # note that no attempt is made at reionization at this step
        # nor at ionization at some pH (rdkit has no pKa calculator)
        # the main aim is to represent all molecules from different sources
        # in a (single) standard way, for use in ML, catalogue, etc.

        te = rdMolStandardize.TautomerEnumerator()  # idem
        taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)

        return taut_uncharged_parent_clean_mol
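
To see what the pipeline does to a concrete input, here is a minimal sketch that runs the same four RDKit steps and returns a canonical SMILES string. The helper name `standardize_to_smiles` is hypothetical, and returning `None` for unparsable input is my assumption about how one might want failures handled; the sodium acetate input is just an illustrative example.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_to_smiles(smiles):
    """Hypothetical helper: standardize a SMILES string and return canonical SMILES.

    Returns None when RDKit cannot parse the input (an assumed convention).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # same sequence as above: cleanup, parent fragment, neutralize, canonical tautomer
    mol = rdMolStandardize.Cleanup(mol)
    mol = rdMolStandardize.FragmentParent(mol)
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)


if __name__ == "__main__":
    # sodium acetate: the counter-ion is dropped and the acid is neutralized
    print(standardize_to_smiles("[Na+].CC(=O)[O-]"))  # CC(=O)O
```

Comparing the returned strings is a convenient way to check whether two records from different sources end up as the same standardized molecule.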
    
  • Computer Aided Drug Design (CADD) – Reading Lists

    I am always sending the same canned response to students who would like to do an FYP or a dissertation with me on the subjects I dabble in: Computer-Aided Drug Design (Discovery), Virtual Screening (VS), Ligand-Based and Structure-Based methods, Cheminformatics, Bioinformatics and Computational Chemistry. Perhaps the first step for any student is to understand the hierarchy of these fields (and the differences between them). I am including a reading list, which helps you bootstrap the subject and hopefully helps you determine whether this is really something for you. The jargon will be daunting at first (especially if you are a computer scientist), but that is only an initial hurdle, and hopefully you will get familiar with the big words quickly. You do not need to understand everything; you just need to understand enough. Remember, brick walls are there to show us how badly we want things! (Watch this: long and touching).

  • Upgrading R on Ubuntu

    During our DataX course, we recently hit an issue: installing the tm package pulls in the slam package as a dependency (that’s Sparse Lightweight Arrays and Matrices for you). This package requires R >3.3.1, which is a shame, as I asked students to install 3.2 at the beginning of the course. Don’t despair; keep calm and upgrade R.


  • Setting up a Bioinformatics Summer School

    As part of the TrainMALTA EU project activities, I volunteered for (or was tasked with) setting up the IT infrastructure for the HTS (or NGS) bioinformatics summer school. It has been quite an experience, and the whole setup is far from trivial, so I thought I’d document parts of it here. As usual, I turned to Google to see what others in my shoes had done, and nothing turned up. Nothing on Google: this setup must be worth documenting!


  • Installing Cufflinks (RNA-Seq) on Ubuntu

    So, you have a ton of hard disks spinning with RNA-Seq data you need to analyse? Excellent. But first you need some software to do that. This post follows the protocol described by Trapnell et al. in Nature Protocols (something freely accessible from Nature Publishing Group, must be my lucky day today).


  • Bioinformatics Big Data Hackathon

    Note: I should have written this ages ago, but only now has the cold weather caught up with my nocturnal habits.

    We organised the first Keystone summer school over four days in July 2015 at the University of Malta. The summer school was titled Keyword Search over Big Data, and I was asked to help with organising the one-and-a-half-day big data hackathon. With my bioinformatics hat on, it was easy to fish for “big data” for the event.

