Thursday 24 January 2013

You can QSAR that again - Reproducible research with IPython

I've mentioned the IPython Notebook before (here and here). It's an interactive Python session that runs in the web browser, and can capture and display the output including plots. It can be saved, loaded and exported to a static HTML page. Entries in the notebook can be edited, and the whole notebook can be run in order to regenerate the output.

In other words, it's the perfect tool for documenting and presenting an analysis of data, thus bringing us one step closer to the goal of reproducible research. There is one area in which it is a particularly good fit for cheminformatics, and that's QSAR.

Greg Landrum and Nikolas Fechner of Novartis have led the way here. Check out this series of IPython notebooks originally presented at the RDKit UGM in 2012, and in particular the one on Using SciKit-Learn and Descriptors to Build Regression Models. Here's an excerpt:
It's pretty much a complete record of how they went about analysing a particular dataset from start to finish. The only thing that I would add is that I would ask the software used (RDKit, IPython, matplotlib and scikit-learn) to print out their version numbers at the top of the notebook (and add some pretty pictures of outliers too of course).
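
For what it's worth, here is roughly what I have in mind for that first cell. This is only a sketch of my own, not something taken from their notebooks: it assumes Python 3, the helper name report_versions and the PACKAGES list are made up for illustration, and it relies on each package exposing a __version__ attribute, which is a convention rather than a guarantee.

```python
import importlib
import sys

# Import names (not PyPI names) of the tools mentioned above. The list is an
# assumption for illustration; adjust it to whatever the notebook really uses.
PACKAGES = ["rdkit", "IPython", "matplotlib", "sklearn"]


def report_versions(module_names):
    """Return {module name: version string} for each requested module."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Record the gap rather than failing, so the cell always runs.
            versions[name] = "not installed"
            continue
        # __version__ is a common convention, not a guarantee.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, version in report_versions(PACKAGES).items():
        print(name, version)
```

Run as the first cell, this means the saved notebook (and any HTML export) carries a record of the environment that produced the results.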

Hopefully others will follow in these footsteps. It would certainly be something to see such a Notebook included as part of the Methods section in a QSAR paper. Almost makes me want to do some QSAR work again...(almost). :-)
