Noel O'Blog: 2008

Sunday 21 December 2008

Have your hamburger and eat it - Edit molecules in PDFs

Wouldn't it be nice to be able to copy a molecule from a PDF into a molecular drawing package? Well, here are some instructions for doing this on Windows with BKChem, and using OSRA to do the conversion. Click on the image to the right for a screenshot of this in action.

(1) Install Python 2.6 (or just use 2.4 or 2.5 if you have one of these already)
(2) Install the Python Imaging Library 1.1.6 for your version of Python
(3) Download and extract BKChem-0.12.5.zip
(4) Drop convert_clipboard_image.py and convert_clipboard_image.xml into the BKChem plugins folder (Note: if the webserver is down, you can get these files here and here)
(5) Download and extract osra-mingw-1-1-0.zip
(6) Set the environment variable OSRA to the full path to osra.exe
(7) Find the Snapshot tool (it has a picture of a camera) in your version of Adobe Reader. In version 9 it's under Tools/Select and Zoom/Snapshot Tool, and you can add it to the toolbar under Views/Toolbars/More Tools.
(8) Open a PDF of a paper containing a molecular structure (e.g. Figure 4 in this paper of mine), and use the Snapshot tool to draw a box around a molecule and hit CTRL+C to copy (if not done automatically).
(9) Start BKChem by double-clicking on bkchem.py (in the bkchem subfolder)
(10) Click "Plugins", "Paste and Convert Image"

Notes:
(0) Open Source software allows you to implement crazy ideas as fast as you can think of them.
(1) This won't work with the latest exe release of BKChem as py2exe didn't include the ImageGrab module (part of PIL).
(2) For this to work with ChemDraw, I need to know how to place ChemDraw XML (which I can create with OpenBabel) on the clipboard so that ChemDraw will be able to paste it. (To be clear, I know how to place it on the clipboard in general, it's just how to place it in such a way that ChemDraw will recognise it as a chemical structure. Hmmm...just found this...)
(3) Bond angles are perturbed slightly (e.g. vertical bonds can become skewed). Maybe this can be fixed on the OSRA side.
(4) The hamburger reference and some background can be found on PMR's blog.
(5) Thanks to Leonard (see comment below) for the information on the Snapshot tool in Adobe Reader.

Friday 19 December 2008

The structure factor

Gerard Kleywegt does not take kindly to dodgy PDB files. I've been aware of his work in a general sense for a while, but after reading Practical Fragment's review of "Limitations and lessons in the use of X-ray structural information in drug design" by Andrew Davies, Stephen St-Gallay and Gerard Kleywegt I decided to read some of his papers. If you are looking for a broad overview (with excellent examples) of the issues to be considered if using PDB files, this paper is a good place to start.

The PDB is a wonderful resource, but it should be remembered that crystallography provides the electron density (at a specific resolution) not the structure, and everything from that point on is modelling. In other words, mistakes can be made due either due to ambiguities or incorrect assumptions. In most (?) PDB files, the electron density can be reconstructed to some extent from structure factors, and compared to the model to find regions where there is a lack of density, and hence a lack of evidence. In particular, the ligand can get very little love from the crystallographer (see references 43-48 in the paper).

I see that Kleywegt, after 15 years at Uppsala, has recently joined the Macromolecular Structure Database Group at the EBI as head of the PDBe, the European end of the PDB. Does this mean that dodgy PDB files will be a thing of the past?

Image credit: Ethan Hein

Monday 15 December 2008

Mono, C# and OpenBabel

Matt Sprague and I have put in a bit of work making OpenBabel accessible on the .NET platform on Windows (see Matt's recent post, for example). .NET is like Microsoft's JVM. There are several .NET languages (e.g. C#, Visual Basic, IronPython), just like there are several languages for the JVM (e.g. Java, Jython, JRuby). If you write a program in any of the languages and compile it, the resulting classes can be used from any of the other languages.

Having finished the work for Windows, I started to look at Linux and MacOSX. There's an Open Source version of .NET called Mono, which is available cross-platform. After a bit of hacking around, I'm pleased to announce that the next release of OpenBabel will provide support for Mono on Linux and MacOSX. I've included an example C# and IronPython script that uses OpenBabel (see here).

One thing I was wondering though: anyone have any idea how widely Mono is used?

Thursday 4 December 2008

Cinfony paper published in Chemistry Central Journal

My Cinfony paper has just come out:
Cinfony - combining Open Source cheminformatics toolkits behind a common interface NM O'Boyle, GR Hutchison. Chem. Cent. J. 2008, 2, 24.

The paper describes the Why? and How? of Cinfony, shows some examples of use, and discusses performance. To download Cinfony, or for documentation on using Cinfony, please see the Cinfony web site.

Update (05/12/08): Table 3 in the paper caused a stir over at Chem-Bla-ics (Who Says Java is Not Fast and Cheminformatics Benchmark Project #1) and Depth-First (Choose Java for Speed).

Here's the abstract:

Background

Open Source cheminformatics toolkits such as OpenBabel, the CDK and the RDKit share the same core functionality but support different sets of file formats and forcefields, and calculate different fingerprints and descriptors. Despite their complementary features, using these toolkits in the same program is difficult as they are implemented in different languages (C++ versus Java), have different underlying chemical models and have different application programming interfaces (APIs).

Results

We describe Cinfony, a Python module that presents a common interface to all three of these toolkits, allowing the user to easily combine methods and results from any of the toolkits. In general, the run time of the Cinfony modules is almost as fast as accessing the underlying toolkits directly from C++ or Java, but Cinfony makes it much easier to carry out common tasks in cheminformatics such as reading file formats and calculating descriptors.

Conclusion

By providing a simplified interface and improving interoperability, Cinfony makes it easy to combine complementary features of OpenBabel, the CDK and the RDKit.

Wednesday 3 December 2008

CrossRef + Ubiquity = Look up references without the pain

Imagine one of the following scenarios: you're reading a paper on your computer and want to look up a reference, or someone emails you the citation to an interesting paper, or you see some paper referenced on a website. If you want to read the paper that corresponds to the citation, you need to go to the website of the journal publisher, and enter all of the citation details or else click your way through years and journal issues to get to the paper you want. This is just the sort of tedious task that your computer should be doing for you instead.

And now, thanks to Geoffrey Bilder of CrossRef, it can. The details are described on the CrossTech blog. What it boils down to is that you can bring up the Ubiquity command window (CTRL+SPACE on Windows once Ubiquity is installed on Firefox), type "crossref" and use CTRL+V to paste a citation (or else select a citation on the webpage). After a couple of seconds, you'll see several hits for the paper in the CrossRef database, and clicking on Resolve will bring you directly to the online version of the paper. If you're feeling lucky you can just hit Enter after entering the citation and go directly to the first hit.

If you haven't already installed Ubiquity (see here for more info), I think it is worth it just to be able to use this particular CrossRef script.

Thursday 27 November 2008

Does my PNG look bad in this (journal)?

Did you ever submit a carefully-crafted image to a journal only to find that it looks terrible in the proofs from the publisher? This happened to me recently, and it took several iterations with the publisher to figure out what was happening.

The problem turned out to be transparency in PNGs. Basically, when you export a PNG from Inkscape, the background is left transparent and, more importantly, shading (e.g. to remove the jaggedness of diagonal lines) is done by mixing in a transparent layer called the alpha channel. Despite PNGs being around for years, it was only a couple of years ago that Internet Explorer finally got with the programme and handled transparency in PNGs. However, it seems it's still a problem with some publishers.

So, how to "fix" it? Open in Gimp, "Layer", "Transparency", "Remove Alpha Channel", "File", "Save". If you don't have an alpha channel, then the "Remove Alpha Channel" option will not be available.

Update 28/Nov/08: Pierre Lindenbaum suggested an alternative remedy: you can just draw a big white rectangle behind your image in Inkscape and everything will be okay. Paweł Szczęsny points out a better solution: in Inkscape, go to "File", "Document Properties", "Page", "Background", "RGB" and set the four values to 255 (white background, no alpha transparency).

Saturday 8 November 2008

Of OChRe, OSRA and OASA (but not OSCAR)

The field of Optical Chemical Recognition (OChRe) has been around a while with a number of well-established players (see Antony William's post on the subject). There's just a single open source program though, OSRA by Igor Filippov of the NIH, which was released in July 2007. Given an image containing one or several chemical structures, OSRA returns the SMILES string for the structures.

The dependencies of OSRA give an insight into the term "open source ecosystem". Rather than reinvent the wheel, OSRA makes use of open source libraries for optical character recognition (OCRAD, GOCR), bitmap to vector image conversion (POTRACE), messing about with images (ImageMagick, GREYstoration, ThinImage, CImg) and of course, cheminformatics (it uses either OpenBabel or RDKit). Luckily for Windows users, Igor provides a compiled version.

Back in July, I emailed Igor to request a new feature, the ability to output an SDF file containing the coordinates taken from the image. Three hours later he replied that he'd added it. And now, only four months later, I've gotten around to testing it (insert excuse here)...

Anyway, one nice way of testing conversion code is by roundtripping (or "There And Back Again"). So I took the now legendary depiction faceoff test file (see, for example, here), used the coordinates therein to create a PNG image using OASA (via Pybel or cinfony), ran OSRA on the resulting image to get some coordinates, and then used those coordinates to generate a PNG again. By eyeballing the two PNG images, it's possible to discover errors. So, here are the results of the OChRe.

Notes:
(1) If you notice any trends in the errors, comment below and Igor might fix them.
(2) Where there are missing images after the OChRe, this is where OSRA missed a bond (probably reasonably), generated two molecules, and caused OASA a headache (it only handles single molecules).

Here's the code:

import pybel
import popen2

odir = "images"
for mol in pybel.readfile("sdf", "onecomponent.sdf"):
    title = mol.title
    print title
    mol.draw(usecoords=True, show=False, filename=os.path.join(odir, "%s_oasa.png" % title))
    o, i, e = popen2.popen3("../osra-trunk/osra -f sdf %s/%s_oasa.png" % (odir, title))
    osrasdf = o.read()
    newmol = pybel.readstring("sdf", osrasdf)
    try:
        newmol.draw(usecoords=True, show=False, filename=os.path.join(odir, "%s_osra.png" % title))
    except AssertionError:
        print "Unconnected!"

Image credit: Jason.Hudson

Friday 7 November 2008

Food for thought? Citations vs Accesses

Checking out the latest article in Chemistry Central Journal, I was surprised to see that in just 5 days after publication it had 5000 accesses. Today after 8 days, it has around 9000. This already makes it the most accessed paper ever from that journal. The article in question is by DP Naughton and A Petroczi, "Heavy metal ions in wines: meta-analysis of target hazard quotients reveal health risks".

At first, I couldn't figure out what was going on. A quick google, however, shows that the article has been picked up by several bloggers and news sites chasing the "wine is bad for you" angle. I also noticed that the paper is top of the BioMedCentral most viewed articles in the last 30 days. Of the other top 10, 6 are on nutrition or food-related topics.

I guess that this highlights a difference between the number of citations and the number of accesses. The number of citations reflects the impact of the paper within the scientific community, while the number of accesses also includes the impact within the broader community, arguably a more valid measure of its importance :-). Of course, for closed access journals only scientists will be able to access the papers and so the difference will not be so great (except in order of magnitude, that is).

Thursday 30 October 2008

Generating InChI's Mini-Me, the InChIKey

A recent comment about Pybel's inability to calculate the InChIKey lead me to investigate. It's true enough that currently the InChIKey is not one of the available formats in Pybel. However, by accessing OpenBabel directly it's possible to generate InChIkeys. Here's how.

The InChIKey is available as an option on the InChI format. How would you find this out? Well, "babel -Hinchi" gives all of the options, one of which is option "K" indicating "output InChIKey". Here's how to do a SMILES to InChIKey conversion from Python:

import openbabel as ob

conv = ob.OBConversion()
conv.SetInAndOutFormats("smi", "inchi")
conv.SetOptions("K", conv.OUTOPTIONS)

mol = ob.OBMol()
conv.ReadString(mol, "CC(=O)Cl")
inchikey = conv.WriteString(mol)
assert inchikey == "WETWJCDKMRHUPV-UHFFFAOYAQ"

A future version of Pybel will include the InChIKey format directly.

Image credit: Gustty

Tuesday 28 October 2008

Cheminformatics toolkit face-off - Depiction Part 3

Part 1
Part 2

We've got two new additions to the face-off, both of which are still under development.

The first is a structure diagram generator which has been added to OpenBabel. This code comes from the MCDL Java applet, and has been translated from the original Java to C++ by one of its authors, Sergey Trepalin. There are a couple of rough edges so if you want anything sorted out before the next release, now's the time to test it out and get those bug reports in.

On the depiction front, Rich has just released the first beta of ChemPhoto, about which you can read more on his blog.

The same dataset was used as before, and the new additions are in the final columns: [depiction] and [structure diagram generation]. Feedback I'm sure is welcomed.

Notes:
(1) Some OB generated images missing due to taking too long. Remember those rough edges I mentioned...?
(2) OB generated coordinates depicted using OASA

Saturday 18 October 2008

Tip for scripting a workflow

If you are writing several Python scripts that make up a workflow, e.g. if one reads an intermediate output file or pickle from another, then it's a good idea to name each Python script starting with a number. For example, the first script could be 0_parse_dockings.py, and the next one 1_calculate_enrichments.py.

This is handy for a couple of reasons:

When you look at these files in 6 months time, you will know in what order you should run them
When you are running the files, they will autocomplete very easily at the command line (Windows or Linux), e.g. you type "python 0", then hit TAB, and the name of the file will autocomplete. No need to think about the name of the file, (is it calcresults.py or analyseresults.py?) or have the problem of several files which start with the same letter.

Anyone else got any labour-saving tips for the busy scientist?

Friday 17 October 2008

The SCCI - towards a novel metric for analysing your publication record

What's that sound? It's the credit crunch. Long term readers will know that I'm not one to stand by idly while the economy flounders and banks implode. Inspired by recent news that Thomson Scientific is suing Zotero, I've come up with a way to help scientists cope with the realities of the post-deprecession [1] world.

I've devised a new metric to analyse a person's publication record. It's the SCCI, Science Credit Crunch Index, a measure of the fraction of your papers that will not be available post-deprecession assuming that all of the companies/societies holding the copyright go down the tubes. In my case, it's a round 0.75. Any one interested in helping with a bailout?

[1] It's better than a depression - it's worse than a recession.

Tuesday 14 October 2008

Look, Cinfony no longer logo-less

Coming up with a name for a software project is pretty hard. For weeks I meditated on a remote peak trying to think of a name that encompassed the entirety of my vision for a cheminformatics toolkit. Well, that didn't work out so I just called it cinfony.

However, when it comes to a logo, it's basically a question of what's on the web that I can legally cannabalise and bung a benzene ring on top of. The result is displayed above. Not too bad, if I do say so myself, although most of the credit goes to Jean Victor Balin, OpenClipArt.org and Inkscape.

Journal of Cheminformatics - A new Open Access journal from Chemistry Central

The title says it all. Apparently, Chemistry Central is getting together a new Open Access journal, the Journal of Cheminformatics. There's a placeholder website already up, naming David Wild as editor-in-chief. There's no announcement yet on the Chemistry Central website, and no mention of a timeframe, but I guess we'll hear more in the near future...

It's a canny move, I'd say. Cheminformaticians are among the most tech-savvy of chemists, are used to constant change in their toolset (e.g. new programming languages, new libraries, new analysis methods), and thus the most likely to adopt to new paradigms.

Wednesday 8 October 2008

Molecular Graph-ics with Pybel

Graphs are great. There are books and books of algorithms written by generations of computer scientists that take graphs and do interesting things. And better still, there are open source programming libraries available that implement many of these algorithms. So, given a Pybel Molecule, how can it be converted for use by these libraries?

After some googling around, I found three graph libraries accessible from Python (on Windows): networkx, igraph and the Boost Graph Library (BGL). Whatever library is used, the solution is pretty much the same; iterate over all of the atoms and bonds of the molecule, and add nodes and edges to the graph.

Here's a function that takes a Pybel Molecule and returns a networkx graph (remember to install networkx first...):

def mol_to_networkxgraph(mol):
edges = []
bondorders = []
for bond in ob.OBMolBondIter(mol.OBMol):
bondorders.append(bond.GetBO())
edges.append( (bond.GetBeginAtomIdx() - 1, bond.GetEndAtomIdx() - 1) )
g = networkx.Graph()
g.add_edges_from(edges)
return g

What about making an igraph graph?

def mol_to_igraph(mol):
edges = []
bondorders = []
for bond in ob.OBMolBondIter(mol.OBMol):
bondorders.append(bond.GetBO())
edges.append( (bond.GetBeginAtomIdx() - 1, bond.GetEndAtomIdx() - 1) )

atomtypes = [atom.type for atom in mol]

g = igraph.Graph(edges=edges,
vertex_attrs={'atomtype':atomtypes},
edge_attrs={'bondorder': bondorders})
return g

And finally, making a BGL graph:

def mol_to_boostgraph(mol):
edges = []
bondorders = []
for bond in ob.OBMolBondIter(mol.OBMol):
bondorders.append(bond.GetBO())
edges.append( (bond.GetBeginAtomIdx() - 1, bond.GetEndAtomIdx() - 1) )

g = boost.graph.Graph(edges)
bondordermap = g.edge_property_map("integer")
for edge, bondorder in zip(g.edges, bondorders):
bondordermap[edge] = bondorder
g.edge_properties["bondorder"] = bondordermap

atomtypemap = g.vertex_property_map("string")
atomtypes = [atom.type for atom in mol]
for vertex, atomtype in zip(g.vertices, atomtypes):
atomtypemap[vertex] = atomtype
g.vertex_properties["atomtype"] = atomtypemap

return g

Now that you have a graph you can use any of the algorithms provided by the library.

Image credit: Creative Type websites as graphs by Erik Mallinson (CC BY-NC-SA 2.0)

Tuesday 7 October 2008

Ubiquity script for SourceForge

Did you know that for any SourceForge (SF) project, its website can be found at http://whatever.sf.net or that its project page is at http://sf.net/projects/whatever? If you're a big user of SF, remembering random stuff like that can save a lot of time. However, if you want to be able to jump directly to the bugs page for a particular project, you'll need to know more than the project name - you'll need both the project ID (which is a number) and the bug tracker ID (another number).

I've just written a Ubiquity script which makes life easier if you are involved with SF projects. I can bring up the Ubiquity box, type "sf openbabel" (for example), and I'm presented with links to the project page, website, download page, bugs tracker and web SVN for the OpenBabel project. This saves a lot of time clicking around on SF or trying to remember random numbers.

(For some background info on Ubiquity, see the link in my earlier post)

Wednesday 1 October 2008

OpenBabel, IronPython and ADME filtering

Two OpenBabel-related news items...

The first release of OBDotNet is available for download from SourceForge. This allows you to use OpenBabel from C# and IronPython. Read the text file contained therein for instructions on use from IronPython. Hopefully you can extrapolate from there for C#.

Secondly, some of you may be interested in the recent publication: Lagorce, D.; Sperandio, O.; Galons, H.; Miteva, M. A.; Villoutreix, B. O. FAF-Drugs2: free ADME/tox filtering tool to assist drug discovery and chemical biology projects. BMC Bioinformatics. 2008, 9, 396. This describes a Python library for filtering compound collections using simple ADME rules and SMARTS terms. This is of particular interest to me as it is built on top of Pybel.

Saturday 20 September 2008

Overview of cheminformatics toolkits

Yesterday, Andrew Dalke gave me a sneak preview of his EuroQSAR poster entitled "Python for Computational Chemistry". I see that it's now available on the web at his blog and I recommend you check it out.

It has an excellent diagram showing the history of various cheminformatics toolkits and how they relate to each other. I'm particularly pleased with the diagram as it includes some recent work of mine (Pybel and now Cinfony).

Andrew works on implementing cheminformatics systems in pharmaceutical companies and is a Python advocate. In his poster, he answers multiple "How do I do _____ in Python?" questions. If you want to support the use of Python in cheminformatics as well as let other students/coworkers see what toolkits are available, it's a really good poster to print out and stick up somewhere.

Oh yeah, in other news this month, Noel O'Blog is now being broadcasted from a secret location in University College Cork although I'll be back and forth to the CCDC on a regular basis.

Friday 12 September 2008

Ubiquity - it's everywhere!

So I disappear for a month and when I come back everyone has dropped Greasemonkey like a hot potato (mmmmm...potatoes) and adopted Ubiquity (watch the video on that page to get an idea of what it's about). So here are a couple of Ubiquity scripts of mine which translate from DOI to ACS or RSC formatted reference.

If you install the Ubiquity extension for Firefox (use the link above), select a DOI on any page, hit the magic key (CTRL + SPACE for windows), and type one of "doi2acs", "doi2ACS", "doi2rsc" or "doi2RSC", you can get the ACS or RSC formatted reference with or without the title. This could be useful when putting together the references for a paper or so.

Comments welcome as these are my first scripts...

Tuesday 12 August 2008

ANN: cinfony 0.8 - "the one with documentation"

I've just released a new version of cinfony. This one comes with full documentation of the API as well as a description of how the API is used. It should be ready for use on both Windows and Linux.

The website is at http://cinfony.googlecode.com and here's the front page:

Introduction
Cinfony presents a common API to several cheminformatics toolkits. It uses the Python programming language, and builds on top of OpenBabel, RDKit and the CDK.
Documentation
Install on Windows or Linux
A complete description of Cinfony
API documentation. This documentation can also be accessed using the help command, e.g. help(rdkit).
Further information on Pybel and using the OpenBabel Python bindings is available on the OpenBabel wiki

PubChemSR makes it to top 10

In the 3 months since publication in Chemistry Central Journal, Junguk Hur and David Wild's paper on PubChemSR has shot up the Chemistry Central rankings and has now entered the top 10 most highly accessed papers of all time (that is, since Jan 2007). I should know, as I've watched it overtake my Pybel paper and squeeze me out of that same top 10, but not for long I hope. :-)

Here's the abstract:

Background
Recent years have seen an explosion in the amount of publicly available chemical and related biological information. A significant step has been the emergence of PubChem, which contains property information for millions of chemical structures, and acts as a repository of compounds and bioassay screening data for the NIH Roadmap. There is a strong need for tools designed for scientists that permit easy download and use of these data. We present one such tool, PubChemSR.

Implementation
PubChemSR (Search and Retrieve) is a freely available desktop application written for Windows using Microsoft .NET that is designed to assist scientists in search, retrieval and organization of chemical and biological data from the PubChem database. It employs SOAP web services made available by NCBI for extraction of information from PubChem.

Results and Discussion
The program supports a wide range of searching techniques, including queries based on assay or compound keywords and chemical substructures. Results can be examined individually or downloaded and exported in batch for use in other programs such as Microsoft Excel. We believe that PubChemSR makes it straightforward for researchers to utilize the chemical, biological and screening data available in PubChem. We present several examples of how it can be used.

Thursday 31 July 2008

Calculate circular fingerprints with Pybel II

OpenBabel 2.2.0 came out a little while ago with a lot of new features. One of the new features is a breadth-first iterator over a molecule that returns the current depth as well as the current atom (OBMolAtomBFSIter). Nice. Because now it's a doddle to write a Python script to create circular fingerprints.

The script below creates a circular fingerprint for the example molecule in Bender et al., JCICS, 2004, 44, 170 as follows:

D:\>python25 fp.py
0-C.ar;1-C.ar;1-C.ar;1-N.pl3;2-C.2;2-C.ar;2-C.ar        0
-N.pl3;1-C.ar;2-C.ar;2-C.ar     0-C.ar;1-C.ar;1-C.ar;2
-C.ar;2-C.ar;2-N.pl3    0-C.ar;1-C.ar;1-C.ar;2-C.ar;2
-C.ar   0-C.ar;1-C.ar;1-C.ar;2-C.ar;2-C.ar      0-C.ar;1-
C.ar;1-C.ar;2-C.2;2-C.ar;2-C.ar 0-C.ar;1-C.2;1-C.a
r;1-C.ar;2-C.ar;2-C.ar;2-N.pl3;2-O.co2;2-O.co2  0-C
.2;1-C.ar;1-O.co2;1-O.co2;2-C.ar;2-C.ar 0-O.co2;1-
C.2;2-C.ar;2-O.co2      0-O.co2;1-C.2;2-C.ar;2-O.co2

Here's the script, which is considerably shorter than the original one.:

import pybel

# Set up a translator to Sybyl atom types
ttab = pybel.ob.OBTypeTable()
ttab.SetFromType("INT")
ttab.SetToType("SYB")

if __name__ == "__main__":
    D = 3 # default max depth
    mol = pybel.readstring("smi", "c1(N)ccccc1c(=O)[O-]")

    fp = []
    for i in range(len(mol.atoms)):
        ans = []
        for atom, depth in pybel.ob.OBMolAtomBFSIter(mol.OBMol, i + 1):
            if depth > D: break
            atomtype = ttab.Translate(atom.GetType())
            ans.append("%d-%s" % (depth - 1, atomtype))
        fp.append(";".join(sorted(ans)))
    print "\t".join(fp)

Wednesday 23 July 2008

Chemistry in R

For many cheminformaticians, R is the preferred way of analysing multivariate data and developing predictive models. However, it is not so widely known that there are R packages available that are directly aimed at handling chemical data.

Over the last few years, Rajarshi Guha (Indiana University) has been doing some nice work integrating the CDK and R. His publication in J. Stat. Soft., Chemical Informatics Functionality in R, describes the rcdk and rpubchem packages. The rcdk package allows the user to read SDF files directly into R, calculate fingerprints and descriptors, calculate Tanimoto values, view molecules in 2D (JChemPaint) and 3D (Jmol), calculate SMILES strings, and access the property fields of file formats. The rpubchem package is focused on downloading compounds, property values and assay data from PubChem. See also articles in R news and CDK news [1], [2], [3].

A more recent development is ChemmineR, described in the latest issue of Bioinformatics: "ChemmineR: a compound mining framework for R". The authors appear to be unaware of the earlier work by Rajarshi, and so there is no comparison of available features. However, based on the documentation on their website, it seems that much of the functionality revolves around a type of fingerprint called atom-pair descriptors (APD). SDF files, when read in, are converted to a database of APDs and these can be used for similarity searching, clustering, removal of duplicates and so on. Sets of molecules can be visualised using a web connection to the ChemMine portal (I'm not sure what software is used). According to the documentation, future work will include descriptor calculation with JOELib.

So, there you have it. An exhaustive survey of the two available methods for bringing chemistry into R. Is the time ripe for a cheminformatics equivalent to Bioconductor?

Tuesday 15 July 2008

Review of "IronPython in Action"

The three most well-known implementations of Python are the original C implementation (CPython), one in Java (Jython), and one in .NET (IronPython). Jython makes it easy to write Python programs that use Java classes and to test a Java library at an interactive prompt. Similarly, IronPython allows Python programs to be written that use the .NET libraries (these are the same libraries used by C# and Visual Basic).

IronPython (sometimes abbreviated as FePy) is relatively new and had its first release in 2006. It is Open Source and developed by Microsoft (by the same developer who started Jython). Although .NET is strictly for Windows, IronPython can also run on top of Mono, the open source implementation of .NET which is available cross-platform.

As a cheminformatician should you be interested in IronPython or .NET? Well, Dimitris Agrafiotis thinks so. And Eli Lilly recently open sourced a life science platform implemented in .NET (here's the SF website, but where's the mailing list?). I'm not yet convinced, but I'm keeping an open mind. I'm helping Matt Sprague to prepare C# bindings for OpenBabel. I think this will make OpenBabel the first cheminformatics library accessible from C# (although probably someone will contradict me below).

At this point, if you're still interested, you should check out the first chapter of "IronPython in Action", which you can read on the web. This explains in more detail the background to IronPython and why you might want to use it. IronPython in Action will be published later this year by Manning Press, and is written by Michael Foord and Christian Muirhead. Foord (aka Fuzzyman) is a very active blogger on Python and FePy in particular, and works at a company whose main product is implemented in FePy.

Just to make it clear, this is not a book for someone who wants to pick up programming from scratch. There's little hand holding here. The obligatory introduction to Python is quickly and efficiently dealt with before moving on to the main subject - accessing .NET classes and creating a GUI application. For those coming from C# to IronPython, there is a whole chapter on unit testing with Python, as well as another that covers everything from Python magic methods to metaprogramming. A particularly nice feature of the authors' style is to logically link each section to the following, so that you always understand the point of what you've just read and where it fits into the bigger picture.

From my point of view, it's nice to see that explanations are provided for those using the free version of Visual Studio, Visual Studio Express, as this means that I can test out all of the examples myself. Other chapters yet to be finalised cover using IronPython in a webbrowser (apparently IronPython can run in Silverlight, the new browser plugin from Microsoft), and extending IronPython with C# or VisualBasic.

In short, for any serious users of IronPython, this book is a must have. It may also convince those using C# to make the switch to a better and more productive life with Python...

Reacting to questions

Don't blink or you'll miss it. My 15 minutes of fame starts now.* David Bradley tracked me down in cyberspace, and turned me into a Reactive Profile.

*Update: 15 minutes now over.

Thursday 10 July 2008

Pipeline Python - Generate a workflow

Workflow packages such as Pipeline Pilot, Taverna and KNIME allow the user to graphically create a pipeline to process molecular data. A downside of these packages is that the units of the workflow, the nodes, process data sequentially. That is, no data gets to Node 2 until Node 1 has finished processing all of it. Correction (thanks Egon): The previous line is plain incorrect. Both KNIME and Taverna2, at least, pass on partially processed data as soon as it's available.

Wouldn't it be nicer if they worked more like Unix pipes, that is, as soon as some data comes out of Node 1 it gets passed onto the next Node and so on. This would have three advantages: (1) you get the first result quicker, (2) you don't use up loads of memory storing all of the intermediate results, (3) you can run things in parallel, e.g. Node 2 could start processing the data from Node 1 immediately, perhaps even on a different computer.

Luckily, there is a neat feature in Python called a generator that allows you to create a pipeline that processes data in parallel. Generators are functions that return a sequence of values. However, unlike just returning a list of values, they only calculate and return the next item in the sequence when requested. One reason this is useful is because the sequence of items could be very large, or even infinite in length. (For a more serious introduction, see David Beazley's talk at PyCon'08, which is the inspiration for this blog post.)

Let's create a pipeline for processing an SDF file that has three nodes: (1) a filter node that looks for the word "ZINC00" in the title of the molecule, (2) a filter node for Tanimoto similarity to a target molecule, (3) an output node that returns the molecule title. (The full program is presented at the end of this post.)

# Pipeline Python!
pipeline = createpipeline((titlematches, "ZINC00"),
(similarto, targetmol, 0.50),
(moltotitle,))

# Create an input source
dataset = pybel.readfile("sdf", inputfile)

# Feed the pipeline    
results = pipeline(dataset)

The variable 'results' is a generator, so nothing actually happens until we request the values returned by the generator...

# Print out each answer as it comes
for title in results:
print title

The titles of the molecules found will appear on the screen one by one as they are found, just like in a Unix pipe. Note how easy it is to combine nodes into a pipeline.

Here's the full program:

import re
import os
import itertools

# from cinfony import pybel
import pybel

def createpipeline(*filters):
def pipeline(dataset):
piped_data = dataset
for filter in filters:
piped_data = filter[0](piped_data, *filter[1:])
return piped_data
return pipeline

def titlematches(mols, patt):
p = re.compile(patt)
return (mol for mol in mols if p.search(mol.title))

def similarto(mols, target, cutoff=0.7):
target_fp = target.calcfp()
return (mol for mol in mols if (mol.calcfp() | target_fp) >= cutoff)

def moltotitle(mols):
return (mol.title for mol in mols)

if __name__ == "__main__":
inputfile = os.path.join("..", "face-off", "timing", "3_p0.0.sdf")
dataset = pybel.readfile("sdf", inputfile)
findtargetmol = createpipeline((titlematches, "ZINC00002647"),)
targetmol = findtargetmol(dataset).next()

# Pipeline Python!
pipeline = createpipeline((titlematches, "ZINC00"),
(similarto, targetmol, 0.50),
(moltotitle,))

# Create an input source
dataset = pybel.readfile("sdf", inputfile)

# Feed the pipeline    
results = pipeline(dataset)

# Print out each answer as it comes through the pipeline    
for title in results:
print title

So, if in future someone tells you that Python generators can be used to make a workflow, don't say "I never node that".

Image: Pipeline by Travis S. (CC BY-NC 2.0)

Tuesday 8 July 2008

Chemoinformatics p0wned by cheminformatics

The headline says it all. After two weeks, and thousands, er...40 votes, the results are in: 26 for cheminformatics, 14 for that other one.

Next week, bioinformatics versus binformatics...

Sunday 6 July 2008

Cheminformatics toolkit face-off: Speed (Python vs Java vs C++)

It's a bit meaningless to compare speeds of different toolkits performing the same operations. Functionality is really the reason you are going to favour one toolkit over another. Here I'm focusing on comparing the speed of accessing the same toolkit from CPython or Jython versus accessing it directly in its original language. In other words, what price do you pay to be able to work in Python rather than C++ or Java? (I'll discuss the advantages of working in Python in a later post.)

To begin with, let's consider accessing the CDK from Python versus from Java. The test input file for all of these tests is the first subset of the drug-like ligands in ZINC, 3_p0.0.sdf, which contains 24098 molecules. The three test cases are (1) Iterate over all of the molecules, (2) iterate over all of molecules and write out the molecular weight, (3) calculate 25 descriptor values for the first 228 molecules. I implemented these in Java, and using cinfony. For example, here's the cinfony version for test case 2:

import time
from cinfony import cdk

t = time.time()
for mol in cdk.readfile("sdf", "3_p0.0.sdf"):
print mol.molwt
print time.time() - t

Here are the results (times are in seconds):

Method	Test 1	Test 2	Test 3
Java	22.2	38.9	31.7
CPython (cinfony, cdkjpype)	34.0	72.6	38.2
Jython (cinfony, cdkjython)	23.7	44.4	34.0

It's clear that accessing the CDK from Jython is almost as fast as using it in a Java program. However, there is an overhead associated with using it from CPython except where, as in Test 3, most of the time is spent in computation.

Next, let's look at accessing OpenBabel from Python versus from C++. Here I will compare the following tests cases: (1) iterate over all of the molecules, (2) iterate over all of molecules and write out the molecular weight, (3) apply 30 steps of a forcefield optimisation to the first 200 molecules. Here's an example of the cinfony script for (2). Notice any similarities to the one above? :-)

import time
from cinfony import pybel

t = time.time()
for mol in pybel.readfile("sdf", "3_p0.0.sdf"):
print mol.molwt
print time.time() - t

Here are the results (note: measured on a different machine than the tests above):

Method	Test 1	Test 2	Test 3
C++	77.7	126.0	56.8
CPython (cinfony, Pybel)	78.3	132.9	60.0
Jython (cinfony, Jybel)	80.4	135.7	59.4

In short, the cost of using Pybel or Jybel is small.

Technical notes: For the CDK, I calculated the values of all descriptors (except IP) that didn't return an array of values. This came to 25 descriptors. I also skipped over one molecule, #20, that took several seconds to process. Jython can natively access Java libraries such as the CDK, but to access the CDK from CPython, cinfony uses JPype. cinfony uses the Python SWIG wrappers to access OpenBabel from CPython; Jython is using the Java SWIG wrappers for OpenBabel. I need to repeat the runs a few times on a quiet machine to get better figures, but I note that the figures do not include the cost of loading the OpenBabel DLL or starting up the JVM.

Image credit:MacRonin47

Saturday 5 July 2008

ANN: OpenBabel 2.2.0 Released

Here is the official announcement from Geoff Hutchison:

I am very happy to finally announce the release of Open Babel 2.2.0, the latest stable version of the open source chemistry toolbox.

This release represents a major update and should be a stable upgrade, strongly recommended for all users of Open Babel. Highlights include improved force fields and coordinate generation, conformer searching, enhanced plugins including molecular descriptors, filters, and command-line transformations. Many formats are improved or added, including CIF, mmCIF, Gaussian cube, PQR, OpenDX cubes, and more. Improved developer API and scripting support and many, many bug fixes are also included.

What's new? See the full release notes.

To download, see our Install Page.

For more information, see the project website.

I would like to personally thank a few people for making this release a great one. In alphabetical order, Jean Bréfort, Andrew Dalke, Marcus Hanwell, Chris Morley, Noel O'Boyle, Kevin Shepherd, Tim Vandermeersch, and Ugo Varetto.

This is a community project and we couldn't have made this release without you. Many thanks to all the contributors to Open Babel including those of you who submitted feedback, bug reports, and code.

Cheers,
-Geoff

Wednesday 25 June 2008

ANN: cinfony 0.2 - the "easy to install" version

A new version of cinfony is now available. The good news is that it now much easier to install for Windows users.

All you need now is to have Python and Java, download the CDK and the RDKit, edit a configuration file, and away you go using OpenBabel, the RDKit and the CDK from Python. The full instructions are here.

I've updated some of the docstrings, but documentation is still sparse and the Pybel documentation is still the best guide, as linked to on the cinfony web site.

Here's an example of what's possible with cinfony. Suppose you want to convert a SMILES string to 3D coordinates with OpenBabel, create a 2D depiction of that molecule with the RDKit, calculate descriptors with the CDK, and write out an SDF file containing the descriptor values and the 3D coordinates.

from cinfony import rdkit, cdk, pybel
mol = pybel.readstring("smi", "CCC=O")
mol.make3D()
rdkit.Molecule(mol).draw(show=False,
                         filename="aldehyde.png")
descs = cdk.Molecule(mol).calcdesc()
mol.data.update(descs)
mol.write("sdf", filename="aldehyde.sdf")

This new version of cinfony also includes Jybel, so that you can now access both the CDK and OpenBabel from Jython, as well as from CPython.

Tuesday 24 June 2008

The Forced Authorship Licence - Get your users to write papers for you

You are probably familiar with commercial software licences. You may even have heard of open source licences. But are you familiar with the Forced Authorship Licence (FAL) model? Let me give you an example from real life (the name has been removed to focus on the actual licence):

"X is available free of charge for researchers belonging to Academic community. The download and use of X is subject to the X Academic Licence...Every users is associated with one of the X team. This will help the user in installing and using X and will be co-author of the first paper published by the user, containing X results. You must contact one of the three group leaders and discuss your proposed project before applying for the use of X."

I think this is a great idea. Think of all the publications. If I had thought of this a few years ago, I would now have 30 extra papers instead of the 30 citations of GaussSum.

But it's never to late to start. And why stop at software? If I'm going to be competing with people that use the FAL, I need to think smarter. From now on, all of my papers, software, blog posts, personal communications, and any ideas that arose while reading my papers, attending my talks or reading my posters will carry the Noel O'Blog Licence (NOABL - 'A' for apostrophe and don't you forget it). Instead of citing me, any resulting publication must carry my name as sole author, in bold - no, make that in fire in letters thirty feet high - and when applying for grants, I'm allowed to list these publications together with my own work.

Sounds fair, doesn't it? Oh - I almost forgot. It's spelt "Noel M. O'Boyle". And don't forget that apostrophe.

Image: Licences by Martin Deutsch (CC BY-NC-ND 2.0)

Monday 23 June 2008

O No - It's cheminformatics!

It is time to cast your vote in the greatest polarising debate of our times. Yes indeed: should it be cheminformatics or chemoinformatics? Vote now (see poll on right).

Not to sway any undecided voters, but I'm definitely in favour of "cheminformatics". My main reason is that I'm worried that if the other camp win out, they'll probably decide to change more words: we'll end up doing chemoistry, like our Australian cousins.

You have 13 days to cast your vote...

An RSS feed for the CCL list

Apparently some people still read the CCL (Computational Chemistry List) using email. I gave up on that some time ago. Here's an RSS feed I threw together some time ago, and which you might find useful: CCL feed

If you don't know what an RSS feed is, and how it might be useful, it's quicker just to test it out than to explain. First of all, you need a feed reader: for example get an account on Google Reader, a free online RSS reader. Next, subscribe to whatever RSS feeds you want, by clicking on "Add subscription" and copying and pasting the URL of the RSS feed into the box that appears.

Here's how the feed is created:

import sys
import email
import datetime
import pdb

from ftplib import FTP
from StringIO import StringIO

import PyRSS2Gen

def breakdaily(messages):
    """
    >>> import pickle
    >>> a = pickle.load(open("tmp20070105"))
    >>> len(breakdaily(a))
    1
    """
    broken = []
    message = []
    for line in messages.split("\n"):
        if line.startswith("From owner-chemistry@ccl.net"):
            broken.append("\n".join(message))
            message = []
        else:
            message.append(line)
    broken.append("\n".join(message))
    return broken[1:]

def getlatest(N):
    ftp = FTP('ftp.ccl.net')
    ftp.login('anonymous')
    ftp.cwd('/pub/chemistry/archived-messages')
    # ftp.dir()

    listoffiles = ftp.nlst()

    # Get current year
    year = datetime.datetime.now().year

    # Get N most recent messages
    messagetot = 0
    months = [str(x).zfill(2) for x in range(1, 13)]
    months.reverse()
    days = [str(x).zfill(2) for x in range(1, 32)]
    days.reverse()
    msgs = []


    while messagetot<N:
        ftp.cwd(str(year))

        for month in months:
            ftp.cwd(month)
            availabledays = ftp.nlst()
            for day in [x for x in days if x in availabledays]:
                # Go thru in reverse order but exclude days that are
                # non-existent
                messages = StringIO()
                ftp.retrbinary("RETR %s" % day, messages.write)
##                pickle.dump(messages.getvalue(), open("tmp%i%s%s" % (year,month,day), "w"))
                listmsgs = breakdaily(messages.getvalue())
                # print "="*24 + "\n", messages.getvalue()
                for i,msg in enumerate(listmsgs):
                    msg_content = email.message_from_string(msg)
                    text = ""
                    for part in msg_content.walk():
                        if (part.get_content_maintype()=="text" and
                            part.get_content_type()=="text/plain"):
                            text = part.get_payload()
                            text = text.replace("\n","<br/>").decode("iso-8859-1", "strict")
                    msgs.append( (year, month, day, msg_content, i+1, text) )
                messagetot += len(listmsgs)
                if messagetot>=N:
                    break
            if messagetot>=N:
                break
            ftp.cwd("..")
        ftp.cwd("..")
        year -= 1 # Continue into the previous year

    ftp.quit()
    msgs.reverse()

    return msgs

def main():

    print "\nStarting..."

    messages = getlatest(100)
##    outputfile = open("messages.pickle", "w")
##    pickle.dump(messages, outputfile)
##    outputfile.close()
##    import sys
##    sys.exit(1)
##    messages = pickle.load(open("messages.pickle", "r"))

    rssitems = []
    for year, month, day, msg, id, messagetext in messages:

        # Add the new item
        newitem = PyRSS2Gen.RSSItem(
                 title = msg['Subject'],
                 link = "http://ccl.net/cgi-bin/ccl/message-new?%d+%s+%s+%s" % (
                         year, month, day, str(id).zfill(3)),
                 description = messagetext,
        ## What's a guid? A globally unique id...used by RSS readers
        ## to determine whether they've seen a particular news item
        ## already
                 guid = PyRSS2Gen.Guid("http://ccl.net/cgi-bin/ccl/message-new?%d+%s+%s+%s" % (
                         year, month, day, str(id).zfill(3))),
                 pubDate = msg['Date']
                 )
        rssitems.append(newitem)

    rss = PyRSS2Gen.RSS2(
        title = "CCL",
        link = "http://www.ccl.net",
        description = "RSS Feed of the world's greatest computational chemistry "
                      "mailing list, chemistry@ccl.net (the CCL list)",
        lastBuildDate = datetime.datetime.now(),
        items = rssitems)
    rss.write_xml(open("ccl.rss", "w"))

    print "Finishing...\n"

def test():
    import doctest
    doctest.testmod()

def do_debugger(type, value, tb):
    pdb.pm()

if __name__=="__main__":

    # sys.excepthook = do_debugger
    main()

    ## test()

Tuesday 17 June 2008

Jmol gets competition - OpenAstexViewer now available

AstexViewer has just gone open source (LGPL), and is available from http://openastexviewer.net/.

Both Jmol and AstexViewer are 3D chemical structure applets that run in the browser. AstexViewer was originally developed by Mike Hartshorn at Astex Therapeutics for in-house visualisation of protein crystal structures. Although available at no cost for some time, it has not until now been open source.

Jmol is in many ways an open source success story. It has several enthusiastic developers (who make new releases with new features every few days!), a very busy mailing list, a large number of users worldwide, and even a recent book. It will be interesting to see whether OpenAstexViewer is sufficiently different to attract users away from this well-established project.

In any case, here's to diversity. Hopefully both projects will get interesting ideas from each other.

Wednesday 21 May 2008

Cheminformatics toolkit face-off - SMARTS matching

SMARTS strings are really useful for substructure searching in molecules. They are sort of like regular expressions for molecules. The idea and syntax of SMARTS comes from the people at Daylight.

Rajarshi Guha, together with Dazhi Jiao (Uni. Indiana), have put together a test suite for SMARTS matchers, and run the test suite against the CDK, OpenBabel and OEChem. Greg contributed results for RDKit. The overall performance is described on Rajarshi's website.

Here's the current summary of the results for the 158 test cases, but check Rajarshi's page for a more up-to-date picture:

Toolkit	True	False
CDK	149	9
OpenBabel	149	9
RDKit	151	7
OpenEye	150	8

Image: Everybody stand back by Chris Radcliff (CC BY-SA 2.0)

Tuesday 20 May 2008

Cheminformatics toolkit face-off - Depiction Part 2

[See update (28/10/08).]

One of the aims of the previous toolkit face-off was to get some feedback on the best options for drawing images with different toolkits. I also hoped that a comparison of the images would allow bugs to be easily identified. And finally, I thought that a bit of competition might help improve depiction and structure diagram generators across the board.

Since the last post:

Molinspiration have approached me to be included in the face-off
I've learnt that I should remove hydrogens from CDK and OASA depictions and control the size of the generated image
Beda has done amazing work enhancing depiction in OASA, and has in fact now released OASA separately from BKChem as an independent cheminformatics toolkit
Greg is just about to release a new version of the RDKit which has improved depiction, and he has fixed the bug identified by PMR
And I've fixed some bugs myself in cinfony

Now the images themselves (same dataset as before): [depiction] and [structure diagram generation].

Notes:

The face-off involves the open source toolkits the CDK, OASA and the RDKit, and the proprietary toolkits Cactvs and molinspiration.
If you want to help improve these depictions or coordinate generators, why not leave a comment below suggesting specific ways to improve them, or highlighting specific things they could do better.
Double bond stereochemistry doesn't seem to be preserved by the CDK, but this is possibly my fault (I'm awaiting a reply to an email to the cdk-users mailing list)
Several people complained to me last time that I didn't give OASA a fair chance. In order to make it up, this time round I've made OASA the star attraction. The coordinates generated by the three open source toolkits are all depicted by OASA

Monday 12 May 2008

RSS feeds for chemistry projects on SourceForge

Wouldn't it be nice to know whether any of your favourite chemistry projects has released a new version? Or to keep abreast of the latest registrations of chemistry projects on SourceForge? No? I guess it's just me then.

Anyway, here are some RSS feeds I've thrown together to allow me to do just that:

latest chemistry releases
registrations of chemistry projects: latest and two months old

Note: the two-months-old RSS feed is better if you want to find some actual code or a working website when you click through.

Want to do the same for another software category? Here's the code (requires BeautifulSoup and PyRSS2Gen):


import datetime
import urllib

from BeautifulSoup import BeautifulSoup
import PyRSS2Gen

def download():
    urls = ["http://sourceforge.net/search/index.php?words=trove%3A%28384%29"
            "&sort=latest_file_date&sortdir=desc&offset=0&limit=100&"
            "type_of_search=soft&pmode=0",
            "http://sourceforge.net/search/index.php?words=trove%3A%28384%29"
            "&sort=registration_date&sortdir=desc&offset=0&limit=100&"
            "type_of_search=soft&pmode=0"]
    urllib.urlretrieve(urls[0], "releases.html")
    urllib.urlretrieve(urls[1], "registrations.html")

def converttodate(text):
    if text=="(none)":
        return None
    t = map(int, text.split("-"))

    return datetime.datetime(*t)

def makerss(filename, items, sortby, title):
    rss = PyRSS2Gen.RSS2(
        title = title,
        link = "http://baoilleach.blogspot.com/2008/05/rss-feeds-for-chemistry-projects-on.html",
        description = "baoilleach's RSS feed of "
                      "Chemistry projects on SourceForge",

        lastBuildDate = datetime.datetime.now(),

        items = [
           PyRSS2Gen.RSSItem(
             title = item["title"],
             link = item["link"],
             description = item["description"],
             guid = PyRSS2Gen.Guid("%s %s" % (item["title"], item['lastrelease'])),
             pubDate = item[sortby])
           for item in items]
           )

    rss.write_xml(open(filename, "w"))

def analyse(project):
    ans = {}
    ans['title'] = project.a.string
    ans['link'] = "http://sf.net" + project.a['href']

    data = project.parent.parent
    ans['lastrelease']  = converttodate(data('td')[5].string.strip())

    ans['registered'] = converttodate(data('td')[4].string.strip())

    data = project.parent.parent.findNextSibling()
    ans['description'] = data.td.contents[0].strip()
    if not ans['description']:
        ans['description'] = data.td.contents[2].strip()

    return ans

def processfile(filename):
    html = open(filename, "r").read()
    soup = BeautifulSoup(html)
    projects = soup.findAll(lambda tag: tag.name=="h3" and tag.a
                            and tag.a['href'].startswith("/projects/"))

    data = [analyse(project) for project in projects]
    return data

if __name__=="__main__":
    download()

    data = processfile("registrations.html")
    sometimeago = datetime.datetime.now() - datetime.timedelta(days=60)
    olddata = [d for d in data if d['registered'] <= sometimeago]
    makerss("oldregistrations.rss", olddata, "registered", "Registrations 60 days ago on SF")
    makerss("newregistrations.rss", data, "registered", "Latest registrations on SF")

    data = processfile("releases.html")
    makerss("latestreleases.rss", data, "lastrelease", "Latest releases on SF")

Thursday 8 May 2008

Review of "Gnuplot in Action"

"Famous scientific plotting package". This succinct yet accurate description of gnuplot appears on the gnuplot SF project page. Despite being one of the most well-known open source scientific projects around, it has taken 15 years for the first book on gnuplot to be published. "Gnuplot in Action" by Phillip K. Janert will be published by Manning Publications later this year (one chapter available free) and aims to give an overview of how to use gnuplot, describing everything from how to read in data, style graphs, save images to more advanced topics such as macros, using gnuplot for CGI or calling it from scripting languages.

In the past I have used gnuplot to get a quick look at data in text files, and also to look at surfaces. I am used to looking up the gnuplot help system to figure things out, but although it's exhaustive, it's also a bit exhausting especially when you are simply trying to figure out whether something is possible or not. The fact that the gnuplot site has a long list of frequently-asked questions is also a sign of poor documentation. So there's definitely room for a book of this type.

Although I was afraid this book would simply be a long list of commands and switches (of which gnuplot has many, to be sure), "Gnuplot in Action" takes a much more interesting approach, illustrating everything with examples and focusing on the most useful options while simply referring the reader to the gnuplot help for more obscure options. The author isn't afraid of sympathising with the user, e.g. "Nevertheless, it is very different from the behavior we have come to expect from user interfaces in most programs" (in reference to how images are saved in gnuplot). Although the author has been using gnuplot for 15 years, he obviously is still aware of what the most annoying features are, and in this case, describes a handy macro to look after the horrible mess of writing an image to disk and resetting the terminal afterwards.

Somewhat bizarrely, the blurb for the book is pitched at business users (analysts and "Six Sigma Black Belts") and doesn't refer to scientists at all. But don't worry, the examples in the book contain everything from share price analysis to looking at phase transitions (reflecting the author's background in theoretical physics as well as in business), and in any case, the nature of the examples doesn't affect the tone of the book in any way.

So, overall I liked this book. I particularly liked the sidebars and comments on particular techniques. For example, the section on 3d plots includes a short discussion of the pitfalls of using 3d plots and alternative plotting methods that might be more suitable. The section on color, includes a similar warning as well a discussion on palette design (which references Why Should Engineers and Scientists Be Worried About Color?). This turns the book from being a straightforward technical manual to something more of a discussion of techniques. There are actually a couple of chapters still to come (13, 14, 15) which appear to be focused on where/how to use particular graphical analysis techniques.

Any negatives? Well, I wasn't very interested in the section on reading data from disk and transforming it - I think that anyone who knows Perl/Python is going to reshape/filter their data before getting it into gnuplot, but admittedly the book claims to be suitable for non-programmers. Also, the author's favorite interactive terminal, wxt, is not available for Windows and so I was a bit disappointed when I fired up gnuplot and couldn't find it. Regarding the connection between gnuplot and Python, the author doesn't mention the widely-used Gnuplot.py (although I'm not a big fan of this).

But overall, these are minor quibbles. I recommend this book to anyone who wants to make the most of gnuplot, that "famous scientific plotting package".

	Blog	Comm
Me
Rich
Rajarshi
Egon