Saturday, 21 December 2013

Top 5 favourite blogs of chemistry bloggers

I'm always looking for ways to free up time on the webs, and my New Year's resolution is to avoid reading all Top N lists. These ubiquitous lists somehow compel me to read the contents no matter how trivial. No more. Well, no more after this, my own top N list.

A list of all chemistry blogs is maintained by Peter Maas over at Chemical Blogspace (CB). You can subscribe to one of the feeds over there to keep on top of all of them at the same time (or a relevant subset). If you have a chemistry blog and it's not on there, you should submit it to the site and your readership will balloon overnight.

Keeping on top of new chemistry blogs is tricky though. One way I thought of to find new blogs was to collate the blog rolls of all of the existing blogs on CB; to a first approximation one can do this by collating all of the links on each blog's front page. This gives some idea of which blogs other bloggers read.

The Top 5 are Derek Lowe's "In the Pipeline", Egon Willighagen's "Chem-bla-ics", Paul Bracher's "ChemBark", Milkshake's "Org Prep Daily" and Nature Chemistry's "The Sceptical Chymist".

And here are the raw results (code at the end of the post), sorted by frequency of occurrence.
34 http://pipeline.corante.com/
31 https://subscribe.wordpress.com/
27 http://wordpress.org/
19 http://chem-bla-ics.blogspot.com/
18 http://blog.chembark.com/
17 http://wordpress.com/
16 http://orgprepdaily.wordpress.com/
16 http://blogs.nature.com/thescepticalchymist/
15 http://www.chemspider.com/blog/
12 http://gaussling.wordpress.com/
12 http://ashutoshchemist.blogspot.com/
12 http://www.blogger.com/
11 http://depth-first.com/
11 http://wwmm.ch.cam.ac.uk/blogs/murrayrust/
11 http://www.chemistry-blog.com/
10 http://www.thechemblog.com/
10 http://curlyarrow.blogspot.com/
10 http://usefulchem.blogspot.com/
10 http://www.statcounter.com/
9 http://miningdrugs.blogspot.com/
9 http://cultureofchemistry.blogspot.com/
9 http://chemjobber.blogspot.com/
9 http://www.sciencebase.com/science-blog/
9 http://scienceblogs.com/moleculeoftheday/
9 http://www.chemspider.com/
8 http://kinasepro.wordpress.com/
8 http://cenblog.org/
8 http://liquidcarbon.livejournal.com/
8 http://chemistrylabnotebook.blogspot.com/
7 http://www.chemicalforums.com/
7 http://omicsomics.blogspot.com/
7 http://transitionstate.wordpress.com/
7 http://blog.chemicalforums.com/
7 http://homebrewandchemistry.blogspot.com/
6 http://www.fieldofscience.com/
6 http://www.coronene.com/blog/
6 http://blog.everydayscientist.com/
6 http://totallymechanistic.wordpress.com/
6 http://scienceblogs.com/insolence/
6 http://www.bccms.uni-bremen.de/cms/
6 http://www.researchblogging.org/
6 http://www.nature.com/
6 http://outonalims.wordpress.com/
6 http://www.totallysynthetic.com/blog/
6 http://www.tiberlab.com/
6 http://www.rscweb.org/blogs/cw/
5 http://wiki.cubic.uni-koeln.de/cb/
5 http://organometallics.blogspot.com/
5 http://baoilleach.blogspot.com/
5 http://wavefunction.fieldofscience.com/
5 http://atompusher.blogspot.com/
5 http://feedburner.google.com/
5 http://chemical-quantum-images.blogspot.com/
5 http://propterdoc.blogspot.com/
5 http://allthingsmetathesis.com/
5 http://cb.openmolecules.net/
5 http://the-half-decent-pharmaceutical-chemistry-blog.chemblogs.org/
5 http://www.ch.ic.ac.uk/rzepa/blog/
5 http://usefulchem.wikispaces.com/
4 http://www.rsc.org/chemistryworld/
4 http://justlikecooking.blogspot.com/
4 http://synthreferee.wordpress.com/
4 http://mndoci.com/blog/
4 http://scienceblogs.com/pharyngula/
4 http://jmgs.wordpress.com/
4 http://pubchem.ncbi.nlm.nih.gov/
4 http://walkerma.wordpress.com/
4 http://gmc2007.blogspot.com/
4 http://purl.org/dc/elements/1.1/
4 http://coronene.blogspot.com/
4 http://naturalproductman.wordpress.com/
4 http://blog.rguha.net/
4 http://syntheticnature.wordpress.com/
4 http://www.organic-chemistry.org/
4 http://www.emolecules.com/
4 http://www.livejournal.com/
4 http://carbontet.blogspot.com/
4 http://therealmoforganicsynthesis.blogspot.com/
4 http://syntheticenvironment.blogspot.com/
4 http://totallymedicinal.wordpress.com/
4 http://comporgchem.com/blog/
4 http://www.jungfreudlich.de/
4 http://graphiteworks.wordpress.com/
4 http://chemicalcrystallinity.blogspot.com/
4 http://www.ebi.ac.uk/
4 http://scienceblogs.com/principles/
4 http://sanjayat.wordpress.com/
4 http://greenchemtech.blogspot.com/
4 http://www.feedburner.com/
4 https://www.ebi.ac.uk/chembl/
3 http://blogs.nature.com/
3 http://chiraljones.wordpress.com/
3 http://pubs.acs.org/journals/joceah/
3 http://cen07.wordpress.com/
3 http://www.natureasia.com/
3 http://masterorganicchemistry.com/
3 http://chem-eng.blogspot.com/
3 http://pubs.acs.org/journals/orlef7/
3 http://molecularmodelingbasics.blogspot.com/
3 http://chemistswithoutborders.blogspot.com/
3 http://www.typepad.com/
3 http://wordpress.org/extend/ideas/
3 http://www.sciencebase.com/
3 http://codex.wordpress.org/
3 http://chemicalmusings.blogspot.com/
3 http://chemicalblogspace.blogspot.com/
3 http://wordpress.org/extend/themes/
3 http://drexel-coas-elearning.blogspot.com/
3 http://cenblog.org/terra-sigillata/
3 http://planet.wordpress.org/
3 http://l-stat.livejournal.com/
3 http://wordpress.org/extend/plugins/
3 http://jmol.sourceforge.net/
3 http://chembl.blogspot.com/
3 http://theme.wordpress.com/themes/regulus/
3 http://scientopia.org/blogs/ethicsandscience/
3 http://www.google.com/
3 http://youngfemalescientist.blogspot.com/
3 http://www.badscience.net/
3 http://paulingblog.wordpress.com/
3 http://l-api.livejournal.com/
3 http://waterinbiology.blogspot.com/
3 http://gmpg.org/xfn/
3 http://totallysynthetic.com/blog/
3 http://chemicalmusings.wordpress.com/
3 http://liberalchemistry.blogspot.com/
3 http://www.hdreioplus.de/wordpress/
3 http://pubs.acs.org/
3 http://wordpress.org/news/
3 http://verpa.wordpress.com/
3 http://www.opentox.org/
3 http://pubs.acs.org/journals/jacsat/
3 http://www.chemical-chimera.blogspot.com/
3 http://www.kilomentor.com/
3 http://invivoblog.blogspot.com/
3 http://blog.khymos.org/
3 http://www.livejournal.com/search/
3 http://chemicalsabbatical.blogspot.com/
3 http://wordpress.org/support/
3 http://scienceblogs.com/clock/
3 http://profmaster.blogspot.com/
3 http://www.orgsyn.org/
3 http://www.surechembl.org/
3 http://www.ebi.ac.uk/chebi/
3 http://scienceblogs.com/aetiology/
3 http://www.mazepath.com/uncleal/
3 http://syntheticremarks.com/
3 http://www.nature.com/nchem/
2 http://www.syntheticpages.org/
2 http://www.eyeonfda.com/
2 http://genchemist.wordpress.com/
2 http://cenblog.org/newscripts/2013/12/amusing-news-aliquots-128/
2 http://blogs.scientificamerican.com/the-curious-wavefunction/2013/10/09/computational-chemistry-wins-2013-nobel-prize-in-chemistry/
2 http://cenblog.org/just-another-electron-pusher/
2 http://www.uu.se/
2 http://bugs.bioclipse.net/
2 http://www.ch.cam.ac.uk/magnus/
2 http://archive.tenderbutton.com/
2 http://cenblog.org/the-safety-zone/2013/12/csb-report-on-chevron-refinery-fire-urges-new-regulatory-approach/
2 http://pubs.acs.org/cen/
2 http://interfacialscience.blogspot.com/
2 http://www.blogtopsites.com/
2 http://www.amazingcounters.com/
2 http://www.chemheritage.org/
2 http://drexelisland.wikispaces.com/
2 http://intermolecular.wordpress.com/
2 http://www.aldaily.com/
2 http://www.plos.org/
2 http://kashthealien.wordpress.com/
2 http://web.expasy.org/groups/swissprot/
2 http://theme.wordpress.com/themes/enterprise/
2 http://cenboston.wordpress.com/
2 http://infiniflux.blogspot.com/
2 http://laserjock.wordpress.com/
2 http://bkchem.zirael.org/
2 http://www.qdinformation.com/qdisblog/
2 http://syntheticorganic.blogspot.com/
2 http://www.sciencebasedmedicine.org/
2 http://scienceblogs.com/pontiff/
2 http://www.chemistryguide.org/
2 http://pubs.acs.org/journals/jmcmar/
2 http://blog.openwetware.org/scienceintheopen/
2 http://cenblog.org/transition-states/
2 http://theme.wordpress.com/themes/contempt/
2 http://www.fiercebiotech.com/
2 http://www.paulbracher.com/blog/
2 http://scienceblogs.com/ethicsandscience/
2 http://tripod.nih.gov/
2 http://sciencegeist.net/
2 http://spectroscope.blogspot.com/
2 http://kilomentor.chemicalblogs.com/
2 http://pharmagossip.blogspot.com/
2 http://chembioinfo.com/
2 http://philipball.blogspot.com/
2 http://browsehappy.com/
2 http://www.realclimate.org/
2 http://chem.vander-lingen.nl/
2 http://cenblog.org/the-safety-zone/2013/12/lab-safety-is-critical-in-high-school-too/
2 http://joaquinbarroso.com/
2 http://eristocracy.co.uk/brsm/
2 http://daneelariantho.wordpress.com/
2 http://blogs.discovermagazine.com/cosmicvariance/
2 http://johnirwin.docking.org/
2 http://www.scienceblog.com/cms/
2 http://www.chemistry-blog.com/2013/12/11/number-11-hydrogen/
2 http://thebioenergyblog.blogspot.com/
2 http://www.eclipse.org/
2 http://www.ebyte.it/stan/
2 http://scienceblogs.com/
2 http://boscoh.com/
2 http://www.nature.com/nchem/journal/v6/n1/
2 http://cdavies.wordpress.com/
2 http://chemistandcook.blogspot.com/
2 http://www.rheothing.blogspot.com/
2 http://openbabel.org/
2 http://www.metabolomics2012.org/
2 http://cenblog.org/grand-central/
2 http://www.scilogs.es/
2 http://openflask.blogspot.com/
2 http://d3js.org/
2 http://stuartcantrill.com/
2 http://blogs.discovermagazine.com/gnxp/
2 http://www.bioclipse.net/
2 http://rajcalab.wordpress.com/
2 http://bacspublish.blogspot.com/
2 https://cszamudio.wordpress.com/
2 http://scienceblogs.com/sciencewoman/
2 http://theorganicsolution.wordpress.com/2013/12/12/my-top-10-chemistry-papers-of-2013/
2 http://chem242.wikispaces.com/
2 http://icpmassspectrometry.blogspot.com/
2 http://theme.wordpress.com/themes/ocean-mist/
2 http://theme.wordpress.com/themes/andreas09/
2 http://impactstory.org/
2 http://blog.metamolecular.com/
2 http://lamsonproject.org/
2 http://agilemolecule.wordpress.com/
2 http://proteinsandwavefunctions.blogspot.com/
2 http://cenblog.org/newscripts/2013/12/heirloom-chemistry-set/
2 http://beautifulphotochemistry.wordpress.com/
2 http://www.etracker.com/
2 http://chem241.wikispaces.com/
2 http://practicalfragments.blogspot.com/
2 http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2013/
2 http://researchblogging.org/
2 http://retractionwatch.wordpress.com/
2 http://u-of-o-nmr-facility.blogspot.com/
2 http://www3.interscience.wiley.com/cgi-bin/jhome/26293/
2 http://scienceblogs.com/goodmath/
2 http://creativecommons.org/licenses/by-nc-sa/3.0/
2 http://cenblog.org/the-haystack/
2 http://www.simbiosys.com/
2 http://www.steinbeck-molecular.de/steinblog/
2 http://chemistry.about.com/
2 http://cen.acs.org/
2 http://cniehaus.livejournal.com/
2 http://chem.chem.rochester.edu/~nvd/
2 http://www.chemtube3d.com/
2 http://news.google.com/
2 http://theme.wordpress.com/themes/digg3/
2 http://theme.wordpress.com/themes/mistylook/
2 http://www.wordpress.org/
2 http://luysii.wordpress.com/
2 http://disqus.com/
2 http://openbabel.sourceforge.net/
2 http://networkedblogs.com/
2 http://www.chemspy.com/
2 http://www.openphacts.org/
2 http://cic-fachgruppe.blogspot.com/
2 http://weconsent.us/
2 http://cdktaverna.wordpress.com/
2 http://gilleain.blogspot.com/
2 http://scienceblogs.com/scientificactivist/
2 http://synchemist.blogspot.com/
2 http://www.compchemhighlights.org/
2 http://acdlabs.typepad.com/elucidation/
2 http://cdk.sf.net/
2 http://www.phds.org/
2 http://zinc.docking.org/
2 http://www.ch.imperial.ac.uk/rzepa/blog/
2 http://www.cas.org/
2 http://brsmblog.com/
2 http://www.sciencetext.com/
2 http://altchemcareers.wordpress.com/
2 http://theeccentricchemist.blogspot.com/
2 http://www.sciscoop.com/
2 http://www.agile2robust.com/
2 http://mwclarkson.blogspot.com/
2 http://www.jcheminf.com/
2 http://www.tns-counter.ru/V13a****sup_ru/ru/UTF-8/tmsec=lj_noncyr/
2 http://www.slideshare.net/
2 http://scienceblogs.com/eruptions/
2 http://www.scilogs.com/
2 http://scienceblogs.com/seejanecompute/
2 http://blog.tenderbutton.com/
2 http://www.amazingcounter.com/
2 http://creativecommons.org/licenses/by/3.0/
2 http://scienceblogs.com/greengabbro/
2 http://madskills.com/public/xml/rss/module/trackback/
2 http://edheckarts.wordpress.com/
2 http://www.scilogs.fr/
2 http://feed.informer.com/
2 http://scienceblogs.com/chaoticutopia/
2 http://www.chemaxon.com/
2 http://www.rsc.org/mpemba-competition/
2 http://neksa.blogspot.com/
2 http://orgchem.livejournal.com/
2 http://scienceblogs.com/transcript/
2 http://drexel-coas-elearning-transcripts.blogspot.com/
2 http://pmgb.wordpress.com/
2 http://www.acs.org/
2 http://www.scienceblogs.com/
2 http://www.milomuses.com/chemicalmusings/
2 http://www.the-scientist.com/
2 http://calvinus.wordpress.com/
2 http://www.pandasthumb.org/

import re
import os
import urllib
from collections import defaultdict

from bs4 import BeautifulSoup

def seedFromCB():
    if not os.path.isdir("CB"):
        os.mkdir("CB")
        url = "http://cb.openmolecules.net/blogs.php?skip=%d"
        N = 0
        while True:
            page = urllib.urlopen(url % N)
            # Need to remove apostrophes from tag URLs or BeautifulSoup will choke
            html = page.read().replace("you'll", "youll").replace("what's", "whats")
            if html.find("0 total") >= 0: break
            print >> open(os.path.join("CB", "CB_%d.html" % N), "w"), html
            N += 10
    N = 0
    allurls = []
    while True:
        filename = os.path.join("CB", "CB_%d.html" % N)
        if not os.path.isfile(filename): break

        soup = BeautifulSoup(open(filename).read())
        divs = soup.find_all("div", class_="blogbox_byline")
        urls = []
        for div in divs:
            children = div.find_all("a")
            anchor = children[2]
            urls.append(anchor['href'])

        print filename, len(urls)
        allurls.extend(urls)
        N += 10

    notfound = ["http://imagingchemistry.com", "http://www.caspersteinmann.dk"]
    allurls = [url for url in allurls if url not in notfound]
    return allurls

def url2file(url):
    for x in "/:?":
        url = url.replace(x, "_")
    return url

def norm(url):
    for x in [".org", ".com", ".es", ".fr", ".net"]:
        if url.endswith(x):
            return url + "/"
    for txt in ["index.html", "index.php", "index.htm", "blog.html"]:
        if url.endswith(txt):
            return url[:-len(txt)]
    if url == "http://www.corante.com/pipeline/":
        return "http://pipeline.corante.com/"
    return url

def getAllLinks(urls):
    if not os.path.isdir("Blogs"):
        os.mkdir("Blogs")
    for url in urls:
        filename = os.path.join("Blogs", url2file(url))
        print filename
        if not os.path.isfile(filename):
            html = urllib.urlopen(url).read()
            print >> open(filename, "w"), html

    countlinks = defaultdict(int)
    for url in urls:
        filename = os.path.join("Blogs", url2file(url))
        links = re.findall('"((http|ftp)s?://.*?)"', open(filename).read())
        links = set([x[0] for x in links if x[1]=='http'])
        for link in links:
            countlinks[norm(link)] += 1
    return countlinks

if __name__ == "__main__":
    blogURLs = seedFromCB()

    countlinks = getAllLinks(blogURLs)

    tmp = countlinks.items()
    tmp.sort(key=lambda x:x[1], reverse=True)
    err = open("err.txt", "w")
    for x, y in tmp:
        if y > 1:
            if x.endswith("/") and not (y==6 and "accelrys" in x) and not (y==3 and "fieldofscience" in x):
                print y, x
            else:
                print >> err, y, x

Sunday, 15 December 2013

QM Speed Test: ERKALE

I've checked in the first results from the speed test: ERKALE, an open-source QM package by Susi Lehtola. You can find full details at https://github.com/baoilleach/qmspeedtest but here's the summary:

HF

QM Package   Time (min)   Steps   Time per step (min)   Total E          HOMO        LUMO
erkale       810          90      9.0                   -644.67570139    -0.353712   0.074269

B3LYP

QM Package   Time (min)   Steps   Time per step (min)   Total E          HOMO        LUMO
erkale       933          58      16.1                  -686.49566820    -0.260899   -0.064457

ERKALE is a very interesting project as it shows how the existing ecosystem of open source QM and maths libraries can be exploited as part of a modern QM package. But is it fast or slow? Who knows - we'll have to measure some more programs first...

Feel free to play along by forking the project and checking in your own results.

Note to self: The ERKALE version tested was SVN r1013 (2013-10-21). I had to do an extra single point calculation afterwards to find out the energy of the HOMO/LUMO. This time was not included in the assessment (as it shouldn't really have been necessary).

Saturday, 30 November 2013

The shortest route to the longest path

Recently I had to quickly come up with Python code that found the longest path through a weighted DAG (directed acyclic graph). This particular DAG had a single start node and a single end node, but there were multiple weighted paths from the start to the end. "Longest path" in this context meant the path with the highest sum of weights.

To begin with, I coded up an exhaustive algorithm for trying all possible paths. This wouldn't be fast enough in the general case, but it didn't take long to write and is useful for comparison.

Internet rumour (contradicted by Wikipedia I now see) had it that the well-known Dijkstra's algorithm for the shortest path could be adapted for longest paths (but only in a DAG) by using the negative of the weights. However, a test of a couple of libraries for doing so indicated that negative weights were not supported (segfault with R for example), and I didn't want to code Dijkstra myself, so I decided to hunt around the internet for a direct solution to the longest path problem.

Geeksforgeeks (appropriately enough I guess) had a lucid description of a 3-step algorithm which I implemented. The only snag was that Step 1 involved a topological sort. Again, I didn't want to spend any time figuring this out so I looked for an implementation in Python and found one at ActiveState's Python recipes. I note that another popular Python implementation is available at PyPI.

Here's the code (if you are using an RSS reader to read this, you may have to click through to the original blog post):
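In case the embedded code doesn't come through, here's a minimal sketch of the 3-step approach described above (topological sort, then relax edges in topological order, then read off the best path). This is my own illustrative version, not the exact code from the recipe; the node names in the test data are made up:

```python
from collections import defaultdict

def longest_path(edges, start, end):
    """Maximum-weight path through a weighted DAG.

    edges: list of (u, v, weight) tuples.
    Returns (total_weight, path as a list of nodes).
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update([u, v])

    # Step 1: topological sort (Kahn's algorithm)
    order = []
    queue = [n for n in nodes if indegree[n] == 0]
    while queue:
        n = queue.pop()
        order.append(n)
        for v, w in graph[n]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    # Steps 2-3: relax each edge in topological order, remembering
    # the predecessor that gave the best (highest) distance
    dist = dict((n, float("-inf")) for n in nodes)
    dist[start] = 0
    pred = {}
    for u in order:
        if dist[u] == float("-inf"):
            continue  # unreachable from start
        for v, w in graph[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u

    # Walk the predecessors back from the end node to recover the path
    path, n = [end], end
    while n != start:
        n = pred[n]
        path.append(n)
    return dist[end], list(reversed(path))
```

For example, with edges [("s","a",1), ("s","b",2), ("a","c",3), ("b","c",1), ("c","t",1)] the longest path from "s" to "t" is s-a-c-t with weight 5, even though s-b looks more promising at first.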

Thursday, 28 November 2013

Got CPUs to burn? Put 'em to work with GNU parallel

I've just been using GNU parallel for the first time. It makes running jobs over multiple CPUs trivial.

In the past, if I had a large number of single-CPU computationally intensive jobs and multiple CPUs to run them over, I would create separate bash scripts for each CPU with a line for each calculation, e.g. ./runthis input1.smi > output1.txt. This is not super-ideal as different jobs take different lengths of time and so any CPU that finishes its bash script ahead of schedule just sits there idle. It also involves making N separate bash scripts.

Enter GNU parallel. This comes with several Linux distributions but on CentOS I just quickly installed from source. Once done, you just need to put all of the jobs in a single script and pipe it through parallel:
cat myjobs.sh | parallel -j7 # e.g. for 7 CPUs
There are a whole load of complicated ways of using parallel. I'm very happy with this one simple way.

Wednesday, 20 November 2013

Anatomy of a bug fix

There's been a longstanding problem with the 2d layout in Open Babel. When support for correct layout of cis/trans stereo was added, it was done as a correction to the initial layout rather than as an intrinsic part of the layout. This means that after nicely laying out a molecule to avoid overlaps, it may then flip two groups at the end of a double bond, crashing one part of a molecule into another.

At the time, given that the depiction code had just moved from no support for stereo to supporting it correctly (at a small expense of layout), sorting this out was way down my priority list. Anyhoo, a couple of days ago I decided to revisit this issue. Here's an example before and after:
I thought it might be interesting for some (or perhaps not) to see behind-the-scenes how one might go about sorting out something like this. In this case, I already knew the technical fix, but it still took some time to ensure it was done right. It's not quite #realtimechem, but probably as close as I'll get.

Day one
1. Update local git repo by pulling from the main branch
2. Create (and switch to) branch for bugfix
3. Build Open Babel under MSVC (debug and release)
4. Some build issues with recent code (presumably only tested on Linux) for chemicaljson and chemdoodle plugins. Ignore for now as it doesn't impact this work (note to self: fix).
5. Need to generate problematic test cases. Convert several thousand ChEMBL molecules to SMILES and use grep to remove those with "." and keep those with either "\" or "/". Write a Python script to sort by size and add the line number as the title (these are useful for debugging; smaller test cases are easier to handle).
6. Add print statements to the code to print out the measured "overlap error" before and after calling the code to fix the 2D layout
7. Use obabel to convert my test cases to SVG and pipe the output to a file
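The filtering in step 5 could be sketched as follows (the file names are hypothetical, and in practice the SMILES file came from converting ChEMBL with obabel):

```shell
# Hypothetical mini chembl.smi (SMILES + title column)
printf 'C/C=C/C mol1\nCC.CC mol2\nCCO mol3\n' > chembl.smi

# Drop multi-component entries (containing "."), then keep only
# molecules with cis/trans bond markers ("/" or "\")
grep -v '\.' chembl.smi | grep '[/\\]' > testcases.smi
```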
Night came. Morning came. The second day.
8. Coding time: move the double-bond correction to immediately after the initial layout.
9. When doing the somewhat exhaustive rotation around bonds for layout, do not alter bonds with defined double bond stereo.
10. Add print statements to the code to print out the measured "overlap error" afterwards
11. Use obabel to convert the test cases to SVG and pipe the output to a file
12. Write a Python script to compare the overlap errors for the old and new code
13. Visually inspect some of the cases where an improvement is supposed to have occurred.
14. Oh oh! I see an example where the layout has improved but the stereo is now wrong.
Night came. Morning came. The third day.
15. I run the problem case in the debugger and step through the code inspecting the molecule object in the depiction code. Funny thing is, my code appears to be correct but none of the bonds are marked with stereo refs and so it doesn't do anything.
Night came. Morning came. The fourth day.
16. I run the problem case in the debugger with the original code. The stereo refs are there.
17. I update to the new code and run it again. The stereo refs are there originally but must be disappearing later. I step further along. A copy is made of the molecule; it seems like it uses its own special clone method. Looking further along, the bonds are cloned but the stereo refs are not being copied at that point. I fix this and all is well.
18. I run the test cases again, run my comparison Python script and start visually inspecting problem cases.
19. Seems like everything is working better than before.
20. I write a blog post before I forget everything.
21. Still need to remove the print statements, inspect the diff, commit the code, and send a pull request.

Saturday, 16 November 2013

GaussSum 3.0 now out

GaussSum is a GUI that can monitor the progress of comp chem calculations and also calculate predicted spectra for comparison with experimental results.

I've just made a release of GaussSum 3.0. This marks the first time I've released code that uses Python 3 so I'm interested to see whether this will affect take-up.

The files are available at SourceForge, where there are also install instructions.

Here are the release notes:
This release has no new functionality apart from updates to the parser. The main reason for this new 3.0 release is that the codebase has been modernised so that GaussSum will continue to work in future. Specifically, GaussSum is now a Python 3 application, and it uses Python's Matplotlib for all graphs instead of Gnuplot. Also the dependency on Python Imaging Library has been removed.

Unfortunately, because of the move to Python 3, Linux users that are not on fairly recent distributions may not find either python3-numpy or python3-matplotlib in their package manager. If this is the case, you'll either have to update your distro or try compiling them yourself.

The Windows release is now a self-installing .msi courtesy of Python's cx_Freeze.

Since I'm now using Matplotlib instead of Gnuplot, if you want to spruce up your staid scientific plots à la xkcd, uncomment "# pyplot.xkcd()" in mpl.py. Then you can make graphs like the following:
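If you'd rather try the xkcd styling outside GaussSum, something like the following sketch works with a recent Matplotlib (the data here is made up, and the file name is arbitrary; the "Agg" backend writes straight to file, no display needed):

```python
import matplotlib
matplotlib.use("Agg")  # render to file without a display
from matplotlib import pyplot

# pyplot.xkcd() also works as a context manager, scoping the
# hand-drawn style to just this figure
with pyplot.xkcd():
    pyplot.plot([300, 400, 500, 600], [0.0, 0.8, 0.3, 0.1])
    pyplot.xlabel("Wavelength (nm)")
    pyplot.ylabel("Intensity")
    pyplot.savefig("xkcd_spectrum.png")
```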

May the best drug discovery tutorial win

Like a perfect alignment of stars, the Teach-Discover-Treat initiative touches on several areas of interest of mine. Its aim is to encourage the development of tutorials for drug discovery that involve freely-available software. The winners get a cash prize plus partial reimbursement of expenses to attend the ACS National Meeting where they will present the results. The competition is now open over at the TDT website.

I was fortunate to be present at the recent Spring ACS in New Orleans where the winners of the inaugural competition were presenting: Paulo Tosco, David Koes, George Nicola and Khaled Elokely. Unfortunately the tutorials themselves do not seem to be directly available on the web (and so will not be found by Google) but after agreeing not to be unethical you can download them from this page.

The results of the current competition will be presented at the ACS in San Francisco next year. Maybe I'll see you there.

Wednesday, 23 October 2013

Summary slides should be summarily skipped

This is a summary of my blog post. It will begin with an introduction. After this, I will present the meat of it. I will then surprise you by discussing this. Finally, guess what, I will end with a pithy conclusion.

Ok - so you think I'm overdoing it. But this is exactly how summary slides in presentations now appear to me. Once I thought they were essential; now I think they take away from the presentation.

Think about it. The audience is sitting there all ready to hear about the work you're doing; they are at their most alert and ready to engage with your topic and try to understand your results. And what do you do? You show them a slide that sends them to sleep. Either it's that pro forma slide that we've all seen before, or you can't resist talking through every single aspect of the entire talk so that there's nothing to look forward to; no-one wants to hear the same talk twice.

Perhaps there's an argument to be made for a summary slide for longer talks, especially if it's the last 30 years of some professor's work and it's going to ramble about a bit, so some steering could be useful. However (thankfully) the majority of talks these days are of the 25-30 minute variety and that time spent on a zero-information-content summary slide may translate into cramming the last five slides of results into 30 seconds.

Anyhoo, this is on my mind at the moment as I'm putting together a talk for the GCC in Fulda in a few weeks time. Thing is, I'm one slide short at the moment. Hmmmm...

Thursday, 10 October 2013

QM up for testing - The Quantum Chemistry Speed Test

[Image: "History of UK speed enforcement"]

Ever wondered which quantum chemistry package is fastest? No? Well you're not alone - I can't find anywhere on the webs a comparison of the speeds of different QM packages. Enter the Quantum Chemistry Speed Test...

Over a series of weeks (weeks which may be spaced months or years apart depending on the ebb and flow of life), I will carry out the same calculation using a variety of packages. The calculation will be a geometry optimisation of a small organic molecule on a single CPU and without any attempt to tune.

And you can play along at home. I'll be making all of the input and output files available for your viewing pleasure. Commercial software is unlikely to feature in this comparison (as I don't have access to any) but don't let that stop you. Note that for the usual reason, you should avoid publishing Gaussian timings.

The reason I'm interested in this is that it seems that the focus of many QM packages these days is towards carrying out massively-parallel super accurate calculations. But what I'd really like (and I think most users would be with me in this) is faster speeds for standard calculations. For example, in a project with Geoff Hutchison a few years back I carried out 1000s of single-CPU calculations using Gaussian on a 8-cpu per node cluster. If I had run these in parallel it would have been much slower (see Amdahl's Law) and those CPU hours would not have stretched so far.

Maybe if there were more of these speed tests, it would encourage developers to bring more performance to single-CPU calculations. Well, probably not, but it'll be fun to find out.

Image credit: Paul Townsend on Flickr

Sunday, 6 October 2013

2nd RDKit UGM gives rise to Fortran bindings

I've just been attending the 2nd RDKit User Meeting, which was organised by George Papadatos of John Overington's ChEMBL group at the EMBL-EBI. Interesting developments include the new PDB parser, MMFF94 forcefield, and Pandas integration for the IPython notebook.

Lots of great talks, many of which will be available over at the github repo. I lightning-talked about creating portable applications on Windows using MinGW.

Unfortunately I didn't stick around for the third day, the hack day. Roger seems to have had fun; he used his gcc-fu to get Fortran code to compile against RDKit:

This has surely made Greg Landrum happy as he is on record as bemoaning the lack of Fortran examples in his talks. :-)
...Since no talk about programming languages and computational chemistry is complete without an example of handling legacy Fortran code, this talk will be incomplete; and we can all be happy for that.
The snake that fits your brain: Python for computational chemists, COMP 82, 238th ACS

Looking forward to the next meeting already.

Monday, 23 September 2013

Unavailable by request - Do you have to ask?

[Image: "Available on request!"]

I've decided to no longer review papers where the software is "available upon request from the authors". I'm sure the world will keep turning, but I just don't want to spend time reviewing this type of software.

You see, to my mind "available from the authors on request" means "not available at some indeterminate point in the future". If they actually wanted to make it available they would have put it up on the web somewhere. Ergo, since they haven't, it means...what exactly?

I don't get it. Ok, so there's probably an innocent explanation. Maybe this is how it was done in the good ole days ("Hey, the 80s are calling. They want a copy of that software you wrote"). All I know is that I don't want to spend an hour or whatever reviewing a manuscript that describes software which may vanish into thin air.

I'd be interested to know if you have a different view on this.

Image credit: Iain Farrell

Sunday, 22 September 2013

Marketing and Scientists

[Image: Social Media Marketing Madness cartoon by HubSpot]

Having left academia just over a year ago and joined NextMove Software, I am reminded of a comment that a postdoc made to me about why she would not like the idea of working for a software vendor: "Marketing". I once had the same idea, but after several years postdocing I realise that a big part of a scientist's job, and perhaps especially a postdoc's, is all about marketing. We can call it self-promotion or networking, if that's more palatable, but it's essentially selling oneself and one's ideas.

Let's start with the obvious stuff. Papers: they promote your ideas, and also yourself. To get the paper in, you first need to market it to the editor with the submission letter. To encourage people to read it, you need a punchy abstract and title. For further encouragement, you need some marketing: a poster, a talk, a write-up on the webs.

And yes, blogs do help. Sure, marketing is not the only reason to have a blog, but it is a positive side-effect (it's a mystery to me why more cheminformaticians don't write them). Similarly, putting talks and posters up on the web ensures a wider reach. As I've said before, more people will read it online than will ever hear you give it in person.

And then there's the pure marketing: grant applications and job applications. If like me, you are somewhat reluctant to write a page of text on how awesome you are, you'll need to overcome this handicap pretty quickly in order to fill in the various bits of grant and job applications. As for the rest of the sections, it's a fine balance between promoting your ideas as solving all of the world's problems and being scientifically realistic.

"Marketing" may not be a scientist's favourite word, but if your scientific study is buried in the literature and forgotten, it neither advances the field nor your career. So if you feel you have something worth talking about, get out there and get marketing!

Image credit: Hubspot

Wednesday, 11 September 2013

Poll answer: Time for Python 3?

Results are in for the Python poll: 33 are using Python 2 (of which 15 are thinking about moving to Python 3) while 8 are already on Python 3.

So still very much a Python 2 world, at least among respondents.

It'd be interesting to know what exactly is keeping people at Python 2. Inertia? Dependencies on legacy code? Deep-seated aversion to 'print' as a function?
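For anyone wondering what the fuss is about, the headline incompatibilities are small but pervasive. Here's a quick Python 3 snippet (purely illustrative) of the two that seem to bite people most:

```python
# Two of the most visible Python 2 -> 3 changes (toy illustration):

# 1. print is a function, not a statement.
print("Hello from Python 3")   # Python 2 allowed: print "Hello"

# 2. / between ints is true division; // is floor division.
print(3 / 2)    # 1.5 (Python 2 printed 1)
print(3 // 2)   # 1 under both
```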

Tuesday, 3 September 2013

ASCII depiction for Accelrys Draw

At work I've been creating some plugins for Accelrys Draw, and the thought occurred to me: why not enhance the colourful razor-sharp depictions generated by the good folks at Accelrys with ASCII art versions beamed direct from the 80s?

So I wrote a plugin that uses Open Babel (through its .NET dll) to generate an ASCII depiction of the displayed structure and paste it as a text object. The image below gives some idea of it in action, and since you ask, I found that an aspect ratio of 1.7 worked well for Courier New.
Unfortunately, I'm not 100% sure that I can make either the plugin or sourcecode available, and so I'm going to err on the side of caution and leave it at the screenshots.
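By way of consolation, here's a toy sketch of the general idea (nothing to do with the actual plugin or Open Babel's painter; all names and numbers below are invented for illustration): project 2D atom coordinates onto a character grid, compressing the vertical axis by the aspect ratio to account for character cells being taller than they are wide.

```python
ASPECT = 1.7  # the value found to work for Courier New

def ascii_depict(atoms, width=20):
    # atoms: list of (symbol, x, y) tuples, coordinates in arbitrary units
    xs = [x for _, x, _ in atoms]
    ys = [y for _, _, y in atoms]
    # Scale x to fill the grid width; compress y by the aspect ratio
    scale = (width - 1) / ((max(xs) - min(xs)) or 1.0)
    height = int((max(ys) - min(ys)) * scale / ASPECT) + 1
    grid = [[" "] * width for _ in range(height)]
    for symbol, x, y in atoms:
        col = int((x - min(xs)) * scale)
        row = int((y - min(ys)) * scale / ASPECT)
        grid[row][col] = symbol[0]
    return "\n".join("".join(row) for row in grid)

# A three-atom "molecule" laid out as a triangle
print(ascii_depict([("C", 0.0, 0.0), ("O", 1.0, 0.0), ("N", 0.5, 0.87)]))
```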

Thursday, 22 August 2013

Conference etiquette - Don't mention the ----!

Conferences are all about communication. But what can and cannot be communicated at a conference, or rather what should and should not?

Probably top of the list of "should not" is to use inappropriate examples or images to liven up proceedings. Recently I attended the 6th Joint Sheffield Conference on Chemoinformatics (more here at NM) and saw red when a speaker talking about ways to rank docking programs decided to use contestants in Miss World as an illustrative example, complete with pictures, references to "lovely girls" and "let's just treat them as objects". I noticed someone got up and walked out at this point.

In other fields such as technology, some conferences have found it necessary to develop a Code of Conduct to spell out appropriate behaviour for people who need it spelled out; e.g. here's an excerpt from PyCon's:
All communication should be appropriate for a professional audience including people of many different backgrounds. Sexual language and imagery is not appropriate for any conference venue, including talks.

Be kind to others. Do not insult or put down other attendees. Behave professionally. Remember that harassment and sexist, racist, or exclusionary jokes are not appropriate for PyCon.

Attendees violating these rules may be asked to leave the conference without a refund at the sole discretion of the conference organizers.
Meanwhile over in Vermont, the Gordon Research Conference on CADD was taking place at the same time. This particular meeting had a focus on the use or abuse of statistics, with a goal of improving the situation. The thing is, GRCs are subject to the Chatham House Rule. Before I checked the Wikipedia link I thought that this meant that everything discussed at such a meeting is confidential, and I think this is the common understanding (see for example Peter Kenny's reference to this over at his blog). However, apparently it simply means that the discussions must not be attributed to anyone. Either way, probably more people misunderstand the Rule (like me) than understand it.

I understand that this may be out of the control of the organisers, but it seems to me that this rule holds back the communication of the information to a wider audience: can you tweet the talks? Can you post the slides? Can you write a blog post about the conference? Can you even mention it to co-workers over coffee? Do people really present controversial work that needs protecting by the Rule?

I guess what I'm really asking is, could everyone just follow Craig Bruce and Peter Kenny's lead and post their slides on Lanyrd already? (Just don't mention the Rule) :-)

Poll question: Time for Python 3?

This year's poll question is...if you use Python, which Python do you use? (See the sidebar on the left.)

By way of background, Python 3.0 was first released back in Dec 2008. Almost five years later, many are still using a Python 2.x version. For scientists, the main blocker was numpy, which has only supported Python 3 since Sep 2010; additionally, IPython took until July 2011, and matplotlib only added Python 3 support this January.

In cheminformatics, Open Babel has supported Python 3 since March 2009 but neither RDKit nor OEChem has yet added Python 3 support. However Greg is currently thinking about it (if you're for it, you can support this idea on Google+).

It's interesting to look in more detail at the Windows downloads for the OB Python releases. For OB 2.3.1 released in Oct 2011, the ratio of Python 2 to Python 3 downloads is 4:1 (2.5:148, 2.6:238, 2.7:983, 3.1:41, 3.2:301) while for OB 2.3.2 released in Nov 2012, the ratio is 2.5:1 (2.6:81, 2.7:424, 3.2:73, 3.3:129). The trend is obvious but it seems that change is slow.
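For anyone who wants to check the arithmetic, the ratios fall straight out of the download counts; a quick sketch:

```python
# Windows download counts for the Open Babel Python bindings, taken
# from the figures quoted in the post (keys are Python versions).
ob_231 = {"2.5": 148, "2.6": 238, "2.7": 983, "3.1": 41, "3.2": 301}  # OB 2.3.1
ob_232 = {"2.6": 81, "2.7": 424, "3.2": 73, "3.3": 129}               # OB 2.3.2

def py2_to_py3_ratio(counts):
    # Sum downloads for 2.x versions and divide by the 3.x total
    py2 = sum(n for version, n in counts.items() if version.startswith("2"))
    py3 = sum(n for version, n in counts.items() if version.startswith("3"))
    return float(py2) / py3

print(round(py2_to_py3_ratio(ob_231), 1))  # 4.0
print(round(py2_to_py3_ratio(ob_232), 1))  # 2.5
```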

Image credit: Fox Fotography (Oliver on Flickr)

Thursday, 25 July 2013

So shall we be social?

In case you are upset that I ignore your LinkedIn requests, don't follow you back on G+, do not RT your Ts, won't answer your Open Babel emails, mark your blog comments as spam, throw your letters into the recycling bin, and never buy you flowers, let me explain:

  • If we have never met in person, I will ignore your LinkedIn request. I use LinkedIn to remember the names/faces/bios of people I've met.
  • I only follow people with public posts on Google Plus. Non-public posts cannot be reshared publicly, so it's just a pain to follow people who only generate content for Google.
  • I don't log into Twitter very often.
  • Anyone who emails me personally (even if I know you) with questions about Open Babel is politely requested to resend to the mailing list. Open Babel has a lot of users and I don't scale.
  • If I spammed your blog comment by mistake, get in touch. Like, you know, leaving a non-spammy comment below.
  • Any communications from the ACS marked Important, Very Important, Your Eyes Only, This Week Only, or Find-a-Member-and-Get-a-Periodic-Table-Thing will automatically be recycled without opening.

None of these social media requests upset me though - don't get me wrong. I'm just trying to keep my digital life manageable.

A new home for Linux4Chemistry - Part II

Back in February, I was looking for someone to take over Linux4Chemistry. Thanks to those of you who got in touch.

I'm happy to announce that Riccardo Vianello and Gianluca Sforna have stepped forward to take over the site, and have established a new home for it at http://www.linux4chemistry.info.

I wish them all the best. Good luck guys!

Saturday, 20 July 2013

Cheminformatics in Science Fiction

Ever read a science fiction story, or indeed any story, that featured cheminformatics as a major plot point?

I've just come across a short story by Asimov that involves a murder mystery in a chemistry library where the Beilstein catalogue plays a major role. Strictly speaking this isn't science fiction, but let's go with science-based fiction. The story in question is Asimov's "What's in a name?". Since it's a mystery story, better not to google the plot unless you like spoilers. It's available as part of a collection, Asimov's Mysteries.

(And a shout out to last year's Alpha Shock by Murcko and Walters - there's a bit of cheminformatics in there too.)

Sunday, 14 July 2013

Using Open Babel to package chemistry software

Let's suppose you want to write a piece of C++ code that manipulates molecules to do something or other, and generates an output, either more molecules (or indeed the original molecules filtered), some descriptors, or some text (a report, or table, or something).

I'm going to propose that you should write or adapt this code as an Open Babel plugin. I've just done this for Confab, the conformer generator I wrote some time back.

If you do this, you don't need to consider how to put together the build infrastructure, write the code for reading/writing file formats, or for handling command-line options and arguments (in fact, you get a lot of additional functionality for free). More generally, the software will compile cross-platform, be included in every major Linux distribution and be available to a very large number of people. It will also have a lifetime beyond the end of the grant that funded it.

I'm by no means the first to see the advantages of this. For example, Jiahao Chen added the QTPIE charge model he developed as one of Open Babel's charge model plugins.

Coming back to Confab, the original code was written as a modified version of Open Babel. I don't quite know what I was thinking but I meant to integrate it properly with the main Open Babel code at some point. In the end, the required push was provided by David Hall, who started off this integration early this year.

If you want to try it out now, clone the development version of Open Babel on github. The "--confab" option to obabel will invoke the Confab operation and "-oconfabreport" will replace calcrmsd in the original release. For command-line help on either, use "obabel -L confab" or "obabel -L confabreport".

Image credit: Sebastian Anthony (Mr Seb) on Flickr

Thursday, 20 June 2013

Least Publishable Unit #3: Novel Approach to Pharmacophore Discovery

Here's an idea I had towards the end of my last postdoc, though I only played around with it a little. It relates to the pharmacophore search software Pharmer.

I've mentioned David Koes's Pharmer here before. For the purposes of this discussion the key point is that it can do ultra-fast pharmacophore searching. The catch is that you need to generate conformers and index them in advance, but once done the search software whizzes through them. For example, check out the ZINC Pharmer applet over at the Camacho Lab's website.

Once something becomes ultra-fast, it opens up new possibilities for its use. In particular, we can consider using an ultra-fast pharmacophore search tool to discover pharmacophores in the first place.* So imagine tuning the pharmacophore definition by searching against a set of inactives seeded with actives; you could have a GA handling the tweaking of the pharmacophore and so on.

This method has a couple of nice features. If you are interested in selectivity (think of the 5-HT2A versus 5-HT2C I mentioned in an earlier installment) you could penalise matches against molecules known to bind to only the wrong target and reward matches to molecules known to bind to only the right one. This was why I was interested in this method, but in the end I didn't have enough data to proceed down this road; I did put Pharmer through its paces though and tried some basic optimisations of pharmacophore definitions using this approach.
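To make the GA idea concrete, here's a toy sketch. Everything in it is invented for illustration: each "molecule" is boiled down to a single number, the matches function is a stand-in for a real pharmacophore match, and the fitness simply rewards retrieved actives and penalises retrieved inactives, as described above:

```python
import random

random.seed(42)

# Invented toy data: each "molecule" is reduced to one feature distance.
actives = [4.8, 5.1, 5.0, 4.9]          # known binders
inactives = [2.0, 3.1, 6.4, 7.5, 8.0]   # decoys

def matches(pharm, mol):
    # Hypothetical ultra-fast matcher: the molecule "fits" if it lies
    # within the pharmacophore's distance tolerance.
    dist, tol = pharm
    return abs(mol - dist) <= tol

def fitness(pharm):
    # Reward retrieved actives, penalise retrieved inactives
    return (sum(matches(pharm, m) for m in actives)
            - sum(matches(pharm, m) for m in inactives))

def random_pharm():
    return (random.uniform(2, 8), random.uniform(0.1, 2.0))

def mutate(pharm):
    dist, tol = pharm
    return (dist + random.gauss(0, 0.3),
            max(0.05, tol + random.gauss(0, 0.3)))

# A bare-bones GA: keep the best five, mutate them, and mix in some
# fresh random candidates each generation.
population = [random_pharm() for _ in range(20)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)
    survivors = population[:5]
    population = (survivors
                  + [mutate(random.choice(survivors)) for _ in range(10)]
                  + [random_pharm() for _ in range(5)])

best = max(population, key=fitness)
print(fitness(best))  # 4: all four actives matched, no inactives
```

In a real run the candidate would be a full pharmacophore definition and the fitness would come from searching a seeded decoy set, but the loop would look much the same.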

In short, as far as I know no-one has ever carried out pharmacophore discovery in this way, so it could be something interesting to try especially in the context of developing a selective pharmacophore.

* I think this possibility is mentioned at the end of the Pharmer paper. But as far as I remember I thought of this independently. Makes no difference in any case.

Sunday, 16 June 2013

(Almost) Translate the InChI code into JavaScript Part IV

Readers may remember that I made an attempt to convert the InChI code to Javascript some time ago.

Recently Michał Nowotka of the ChEMBL group used the same technique (Emscripten) but a different approach to achieve the same thing. Check out the results over at the ChEMBLog.

Saturday, 15 June 2013

Are you a Google Reader reader?

If you are using Google Reader to read this, you have under three weeks to migrate to another platform (Feedly looks like a good bet) before Google pulls the plug on it.

I try not to get too involved with monitoring stats, but since Google Reader is about to disappear, it's somewhat interesting to compare the popularity of different RSS feeds (note that one blog may be available through several RSS feeds) based on the number of Google Reader subscribers. This is available from Google Reader if you subscribe to a URL and then click 'Feed Settings' and 'View Details and Statistics' (I think this value may even be available programmatically).

For example, for this blog, the URL http://baoilleach.blogspot.com/feeds/posts/default has 132 subscribers. That's a good number when you consider that In The Pipeline's http://www.corante.com/pipeline/index.xml has 1543. But to put things in perspective, xkcd's http://xkcd.com/atom.xml has 246,701. :-)

Thursday, 13 June 2013

Using CTest with multiple tests in a single executable

Open Babel uses CTest for its testing. While admittedly fairly basic, its integration with the CMake build system and its support for dashboard builds make it a reasonable choice for a C++ project. Out of the box though, the test framework appears to favour a separate executable for every test, which is something of a nuisance; these take some time to build, and also don't provide much granularity over the tests (each executable is either Pass or Fail even if it contains multiple tests).

I've just spent some time modifying Open Babel's tests so that (a) they are all bundled in a single executable and (b) the parts of multi-part tests are called separately. The CTest docs are sparse on how to do this, so here is an attempt to describe the general gist of what I've done.

First of all, the relevant CMakeLists.txt is here. The following discussion changes some details for clarity.

This shows how to define the test names and constituent parts:
set (cpptests
     automorphism builder ...
    )
set (automorphism_parts 1 2 3 4 5 6 7 8 9 10)
set (builder_parts 1 2 3 4 5)
...
For tests where a list of parts has not been defined we add a default of 1:
foreach(cpptest ${cpptests})
  if(NOT DEFINED "${cpptest}_parts")
     set(${cpptest}_parts "1")
  endif()
endforeach()
Don't forget the .cpp files for each test:
foreach(cpptest ${cpptests})
  set(cpptestsrc ${cpptestsrc} ${cpptest}test.cpp)
endforeach()
Each of these .cpp files should have a function with the same name as the file as follows:
int automorphismtest(int argc, char* argv[])
{
  int defaultchoice = 1;
  
  int choice = defaultchoice;

  if (argc > 1) {
    if(sscanf(argv[1], "%d", &choice) != 1) {
      printf("Couldn't parse that input as a number\n");
      return -1;
    }
  }

  switch(choice) {
  case 1:
    testAutomorphisms();
    break;
  // Add case statements to handle values of 2-->10
  default:
    cout << "Test #" << choice << " does not exist!\n";
    return -1;
  }
  return 0;
}
Now here's the magic. Using the filenames of the .cpp files, CMake can generate a test_runner executable:
create_test_sourcelist(srclist test_runner.cpp ${cpptestsrc})
add_executable(test_runner ${srclist} obtest.cpp)
target_link_libraries(test_runner ${libs})
When it's compiled you can run the test_runner executable and specify a particular test and subtest:
./test_runner automorphismtest 1
All that's left is to tell CMake to generate the test cases:
foreach(cpptest ${cpptests})
  foreach(part ${${cpptest}_parts})
    add_test(test_${cpptest}_${part}
             ${TEST_PATH}/test_runner ${cpptest}test ${part})
    set_tests_properties(test_${cpptest}_${part} PROPERTIES
      FAIL_REGULAR_EXPRESSION "ERROR;FAIL;Test failed")
  endforeach()
endforeach()
Now, when you run "make test" or CTest directly, you will see the test output:
C:\Tools\openbabel\mySmilesValence\windows-vc2008\build>ctest -R automorphism
Test project C:/Tools/openbabel/mySmilesValence/windows-vc2008/build
      Start  6: test_automorphism_1
 1/10 Test  #6: test_automorphism_1 ..............   Passed    2.83 sec
      Start  7: test_automorphism_2
 2/10 Test  #7: test_automorphism_2 ..............   Passed    0.31 sec
      Start  8: test_automorphism_3
 3/10 Test  #8: test_automorphism_3 ..............   Passed    0.13 sec
      Start  9: test_automorphism_4
 4/10 Test  #9: test_automorphism_4 ..............   Passed    0.10 sec
      Start 10: test_automorphism_5
 5/10 Test #10: test_automorphism_5 ..............   Passed    0.15 sec
      Start 11: test_automorphism_6
 6/10 Test #11: test_automorphism_6 ..............   Passed    0.12 sec
      Start 12: test_automorphism_7
 7/10 Test #12: test_automorphism_7 ..............   Passed    0.11 sec
      Start 13: test_automorphism_8
 8/10 Test #13: test_automorphism_8 ..............   Passed    0.12 sec
      Start 14: test_automorphism_9
 9/10 Test #14: test_automorphism_9 ..............   Passed    0.10 sec
      Start 15: test_automorphism_10
10/10 Test #15: test_automorphism_10 .............   Passed    0.12 sec

100% tests passed, 0 tests failed out of 10

Total Test time (real) =   4.45 sec

Monday, 10 June 2013

Least Publishable Unit #2: Find corresponding Raman peaks for deuterated species

Here's a little trick I came up with during my PhD to help figure out which resonance Raman peaks corresponded to which when comparing a non-deuterated to a deuterated species. This never made it into a publication, but it's time to get it out there.

The group to which I belonged, the Han Vos group, studied the photophysics of Ru and Os polypyridyl species. A useful tool to probe the excited state is Raman combined with deuteration of specific ligands. When you do a vibrational frequency calculation, you specify what isotopes to use when calculating the frequencies. In fact, you can instantly repeat the calculation for other isotopes without redoing the whole analysis. In this way you can calculate the frequencies for the deuterated and undeuterated forms.

This gives you a set of peaks but you don't exactly know which correspond to which. For example, in the diagram below for [Ru(bpy)3]2+, if you just consider the 1H and 2H spectra, it's fairly obvious which corresponds to v5 but you're relying on guesswork for the peaks on the left. However, it's useful to know which peaks correspond to which; one reason for this is to develop forcefields for IR calculation (I forget the details).

Anyhoo, the trick I thought of is to use isotopes with fractional masses intermediate between 1H and 2H. Then it's fairly easy to trace the shift in the peaks using the diagram above. Another way to figure out the correspondence would be to look at the wiggle vectors for the frequencies and match up the ones with corresponding wiggles. I used this to verify my results.
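The trick can be illustrated with the simplest possible model. This sketch has nothing to do with the actual DFT calculations (the partner mass and force constant are arbitrary), but it shows why intermediate masses leave a smooth trail of peaks to follow: a harmonic mode's frequency goes as one over the square root of the reduced mass.

```python
import math

PARTNER = 12.0  # mass of the atom the H is bonded to (amu), illustrative

def relative_frequency(m_h, k=1.0):
    mu = PARTNER * m_h / (PARTNER + m_h)  # reduced mass
    return math.sqrt(k / mu)              # frequency in arbitrary units

# Stepping the "hydrogen" mass from 1.0 to 2.0 amu in small increments
# moves each calculated peak a little at a time, so it can be traced
# from the 1H spectrum all the way to the 2H one.
for m_h in (1.0, 1.25, 1.5, 1.75, 2.0):
    print("m(H) = %.2f  freq = %.3f" % (m_h, relative_frequency(m_h)))
```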

And it turned out that according to my awesome diagram which was never published, the peaks had been misassigned in the literature. v9/v'9 and v10/v'10 were not corresponding; instead it was v9/v'10 and v10/v'9. Take that literature!!

For more info, check out Chapter 5 of my thesis which I've just found someone has scanned in and deposited in DCU's Institutional Repository. Nice.

More (and more) Open Source cheminformatics toolkits

When I were a lad, there weren't many Open Source cheminformatics toolkits out there. Things are different nowadays - you can't navigate the web without falling over new cheminformatics toolkits left, right and centre. This of course has its own problems, but having interesting new ideas floating around is a sign of a healthy community.

So here is a belated update on some new (and not-so-new) developments:
  • ChemKit - This came out some time ago now with the first release in June 2012. It is a C++ cheminformatics toolkit with Python bindings developed by Kyle Lutz (now working with Marcus Hanwell at Kitware).
  • The Avalon toolkit - Also first released publicly in June 2012, this is a Java toolkit developed internally at Novartis by Bernhard Rohde. It was used, for example, in the recent Molecule Cloud paper by Peter Ertl and Bernhard.
  • Rubabel - Not quite a new toolkit, but a Ruby-friendly wrapper around the Open Babel Ruby bindings from the Prince Lab at Brigham Young University. The general idea is similar to Pybel (i.e. make it easy to use from Ruby by using Ruby idioms and language features) but it takes a completely distinct approach and hopefully will find a following among OB's Ruby users.

Is there anything I've missed? Leave a comment to let me know...

Thursday, 9 May 2013

Least Publishable Unit: Selective GPCR Agonists

After a certain length of time, there comes a point when you realise that that idea of yours you always planned to pursue, well, you're never going to get around to it. Such is life and all that.

But maybe if I post some ideas here, it might spark someone else's imagination and lead to something. A mention in your Nobel speech is all I ask (preferably close to the start).

So here is something I worked on last (academic) year...

In collaboration with Dr JJ Keating (University College Cork) I was interested in finding some selective 5-HT2A receptor agonists. Turns out everyone else wants to find selective 5-HT2C agonists and so 2A is neglected. As far as I could tell, there are no known 2A-selective agonists. Such a compound would be useful both as a tool compound and for its therapeutic potential in glaucoma.

To begin with I turned to the literature to see what had been done. In short, I never found any paper describing the search for a selective 2A agonist.

But then I thought of ChEMBL. Despite the fact that no-one had been looking for a selective 2A agonist, maybe there are some accidental examples to be found in papers on selective 2C agonists? So I downloaded the entire database, and searched for all instances where activity (EC50) for the same compound in the same paper was measured against both 2A and 2C. And lo and behold I found a few examples, the best of which was compound 5 in Bioorg. Med. Chem. Lett., 2005, 15, 4555.

So in short, your mission (if you choose to accept it) is to repeat the analysis, take these compounds as a starting point and to develop selective 5-HT2A agonists through a medicinal chemistry strategy. Note that a similar approach towards receptor subtype selectivity might yield useful results for other receptors.

Image credit: Xavier Béjar on Flickr

Thursday, 25 April 2013

How fast is chemfp? We investigate!

That is, um, I investigate. I described chemfp in an earlier blogpost. Simply put, it comprises software by Python/C guru Andrew Dalke to handle and generate binary fingerprints for molecules. It enables easy comparison of fingerprints from different toolkits, and provides super-fast similarity methods. With a new even faster version out in Feb 2013, I thought I should put it through its paces.

Pre-processing

The standard way to carry out a fast similarity search with Open Babel is the so-called fastsearch method, which precalculates an index file of fingerprints for an sdf file, and then linearly scans through this for hits against a single query structure. With chemfp, the initial step is similar; you convert everything to fps files, a standard file format for fingerprints developed by Andrew. Previously you needed to use chemfp's ob2fps to do this, but OB 2.3.2 added support for fps files and can generate them directly, 2 to 3 times faster.
$ obabel chembl_15mod.sdf -O chembl.fs # for fastsearch
$ obabel chembl_15mod.sdf -O chembl.fps # for chemfp

Single-molecule query

Let's take the first molecule in the database as a query, and find those molecules in ChEMBL that are within 0.8 similarity (Tanimoto). simsearch is the name of chemfp's tool for similarity searching:
$ obabel chembl.fs -osmi -s query.sdf -aa -at 0.8 > threshold.fs.txt
$ simsearch --threshold 0.8 -q query.fps chembl.fps > threshold.fps.txt
They both take about 1.2 seconds. I'm being a bit vague because the exact value doesn't matter; for some query molecules fastsearch takes longer, for some less. It's just to give an idea. But in short, fastsearch appears to be in the same ballpark as simsearch for single molecule queries.

Multi-molecule query

fastsearch doesn't provide an easy way to do multiple queries. If you want to do it yourself, you would just have to do each search one-by-one. So 1000 searches would take 1000s, let's say. In contrast, with simsearch 1000 searches on ChEMBL takes only 8s:
$ simsearch --threshold 0.8 -q largequery.fps chembl.fps > thresholdb.fps.txt
How does it manage this? Well, first of all, it runs in parallel on 4 available CPUs by default (...see Andrew's comment below). These figures come from a server which has 4 hyperthreading CPUs. The rest of the improvement comes from the use of in-memory data structures and algorithmic magic, some of which Andrew has described on his blog over the last year or so.

With speeds like this, it brings all-against-all similarity searches within reach, and simsearch provides an option just for this. The following finds the nearest 8 molecules for each molecule in the dataset:
$ simsearch -k 8 --NxN largequery.fps > NxN.fps.txt
I tried this for the first 10000 molecules in ChEMBL, and it took about 1s; 50K took 11s; 100K took 32s; 250K took 144s; and the whole of ChEMBLdb (1.2 million) took 2900s (48m 20s). To put this in context, an all-against-all search of ChEMBLdb using fastsearch would take something like 14 days.

Single-molecule query revisited

The alert reader may be wondering why, if the multi-molecule query is so darned fast, the single-molecule query is no faster than Open Babel's fastsearch. The answer is rather simple (...but not quite right - see Andrew's comment below): 99% of the time spent on the single-molecule query is setup. Once the search datastructure has been initialised, the response for a query comes within milliseconds. You can see timings for these over at the chemfp website. To achieve these timings in practice, you would need to write a Python script that used the chemfp library and was accessed via some sort of client/server architecture, e.g. a local webservice. As Andrew points out, this would allow search results to be returned instantly as a chemist sketches a structure. Trying this out is left as an exercise for the reader. :-)
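Here's a pure-Python toy of that setup-once/query-many principle. It only illustrates the idea (chemfp's internals are far cleverer): decode the hex fingerprints to integers up front, after which each Tanimoto query is just bitwise ANDs and popcounts over an in-memory list. All the data below is made up.

```python
fps_records = [  # (id, hex-encoded fingerprint), as in an fps file
    ("mol1", "c1f0"),
    ("mol2", "c1f1"),
    ("mol3", "0003"),
]

def popcount(x):
    # Number of set bits in the fingerprint
    return bin(x).count("1")

# One-off setup cost: decode every fingerprint to an integer
index = [(name, int(hexfp, 16)) for name, hexfp in fps_records]

def search(query_hex, threshold):
    # Each query is now just bit operations over the in-memory index
    q = int(query_hex, 16)
    qbits = popcount(q)
    hits = []
    for name, fp in index:
        common = popcount(q & fp)
        tanimoto = common / (qbits + popcount(fp) - common)
        if tanimoto >= threshold:
            hits.append((tanimoto, name))
    return sorted(hits, reverse=True)

print(search("c1f0", 0.8))  # [(1.0, 'mol1'), (0.875, 'mol2')]
```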

In conclusion

So, in short, if you have any interest in running multiple queries against a database, comparing two large datasets, or finding pairwise similarity within a dataset, check out chemfp.

Notes:
1. I should point out that Andrew is making chemfp available as open source but with delayed release. Commercial licensees get support and the latest features.
2. For ease of use with Open Babel, the ChEMBLdb SDF file was modified to add the chembl_id to the title. Oh ye gods of ChEMBL please consider doing this for your humble users.

Image credit: Russell J Watkins on Flickr

Monday, 15 April 2013

Talk on Universal SMILES at New Orleans ACS

Early on Wednesday I presented my recent paper on Universal SMILES at the New Orleans ACS. This is a canonical SMILES string that uses the InChI canonical labels. Usually I tell the audience that the slides will be made available, but this time there was someone in the audience who was standing up every so often and taking photos; I thought this was so awesome I said nothing.

Anyway, here are the slides. They are very wordy, partly because I used up all my visualisation skills fiddling around with the other formal talk I was giving (appearing soon on the NextMove blog), and partly because I was aware that for web readers a picture of a donkey surfing might not spell out how to create canonical SMILES quite as well as traditional bulletpoints.


Universal (and Inchified) SMILES are available right now in Open Babel. Rumour has it that the CDK and RDKit are considering supporting Universal SMILES. If you use these or any other toolkits, and think that having support for Universal SMILES would be nothing short of paradigm-shifting awesome, ask them to add support.

Thursday, 11 April 2013

Talk on Open Babel at New Orleans ACS

Now that Open Babel 2.3.2 has been out for 6 months, I thought it might almost be time to talk about what's new, and also mention what's under development.

Here's the talk I gave at Rajarshi's CINF Flash session last Sunday. It was my first time at the flash session but I'm definitely going to make a point of attending and presenting at this in future; it is preceded by a free lunch!

Friday, 29 March 2013

See you at ACSNoelA?

I'll be presenting at the Spring ACS National Meeting in New Orleans in just over a week. The last ACS I was at was three years ago so I'm looking forward to catching up with what's been going on, and meeting up with some familiar faces.

I've got three talks lined up, the slides for which I'll post after the event:

1. I'll be talking about "What's new and cooking in Open Babel?", as part of Rajarshi's CINF Flash talks on April 7 (Sunday) sometime between 12.30 and 2.00 in Room 350 (Morial Convention Center). I think Rajarshi is still accepting CINF Flash talks so get them in.

2. "Universal SMILES: Finally, a canonical SMILES string?"
This presents the work described in a recent paper.
DIVISION: CINF: Division of Chemical Information
SESSION: Public Databases Serving the Chemistry Community
DAY & TIME OF PRESENTATION: April 10 (Wed) from 8:35 am to 9:05 am
LOCATION: Morial Convention Center, Room: 350

3. "Roundtripping between small-molecule and biopolymer representations".
This describes some of the issues and challenges that have arisen in the development of the Sugar & Splice software, a toolkit for perceiving biopolymer structures, depicting them and converting them between formats.
DIVISION: CINF: Division of Chemical Information
SESSION: Linking Bioinformatic Data and Cheminformatic Data
DAY & TIME OF PRESENTATION: April 09 (Tues) from 3:10 pm to 3:35 pm
LOCATION: Morial Convention Center, Room: 349
To any first-time ACSers reading this, don't miss the various CINF functions. (I seem to recall that they are not exactly easy to find in the printed programme but details as ever are on the webs.)

Monday, 25 March 2013

Time and the InChI


Notes:
1. Created using a Google spreadsheet and Timeline JS.
2. If anyone wants to send me updates (e.g. for commercial software like ChemDraw I found it hard to find dates and versions), feel free.

Monday, 18 March 2013

Police your code with fuzz testing

I've just become a convert to Fuzz testing. This is a rather simple idea for testing software that processes some user input: just send in some random junk and see what happens (and keep repeating until something does).

I tried this for some software I've been working on, thinking it was so dumb it couldn't possibly flush out anything useful, but quickly changed my mind. Even more interesting, where the space of all possible user input can be stepped through systematically, fuzz testing can be used to map out the space of allowed input (which may or may not be what you were expecting). For example, you could find all 4-letter words acceptable to OPSIN by using the code below to generate fuzz, and then use OPSIN (through Cinfony, for example) to find out whether an error is raised.

import random

class Fuzzer:
    def __init__(self, allowed, length):
        self.allowed = allowed  # the alphabet of allowed characters
        self.length = length    # the length of each generated string

    def systematic(self, text=""):
        # Yield every possible string of the required length, in turn
        if len(text) == self.length:
            yield text
        else:
            for x in self.allowed:
                for y in self.systematic(x + text):
                    yield y

    def random(self):
        # Return a single random string of the required length
        return "".join(random.choice(self.allowed) for _ in range(self.length))

if __name__ == "__main__":
    allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    fuzzer = Fuzzer(allowed, 3)
    for fuzz in fuzzer.systematic():
        print(fuzz)
    fuzzer = Fuzzer(allowed, 5)
    for i in range(10):
        print(fuzzer.random())
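As an aside, the systematic generator is equivalent to the standard library's itertools.product, which also lets you count the size of the input space up front before committing to an exhaustive run:

```python
import itertools

# Equivalent to Fuzzer.systematic(): every string of length N over an
# alphabet of size K, giving K**N strings in total.
allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
length = 3

words = ("".join(letters) for letters in itertools.product(allowed, repeat=length))
count = sum(1 for _ in words)
print(count)  # 52**3 = 140608 three-letter words to try
```

At four letters that grows to 52**4, about 7.3 million candidates, so it's worth doing this arithmetic before deciding whether stepping through the whole space is feasible.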

Sunday, 3 March 2013

Avogadro hydrogen-clicking craziness with Sikuli

There's something that's always annoyed me when using Avogadro. On Windows, it often takes several clicks on a hydrogen before I hit the sweet spot and it sprouts a methyl group. This is only mildly annoying when building a structure, but very frustrating for operations where a misclick causes it to lose the selection (e.g. rotating around a dihedral). It's awkward to report a bug about this though, because it's hard to prove.

Enter Sikuli. It's a Java application running Jython that's used for testing GUIs (among other things). It uses a sort of visual programming style to match regions of the screen based on screenshots and carry out operations based on whether or not a match exists. (At this point you may want to check out the videos on the Sikuli website to figure out what I'm talking about.)

My original idea was to get Sikuli to click all over the offending hydrogen and see which points worked and which didn't. In the interests of time, I reduced this down to all points on a horizontal line through the centre of the hydrogen. First of all, here is the result, with the 'allowed area' indicated by the red line:
I would argue that that red line is neither long enough nor in the expected location. In any case, check out the sort of code needed to run this test:
You mightn't understand everything here, but you should get the gist. On every iteration of the loop it clears the Avogadro window, clicks to create a methane, and then clicks at point x, y on a particular hydrogen (where x, y are relative to the centre of the image of the matched hydrogen) trying to sprout a methyl. The "success test" checks that a methyl sprouted. The "fail test" checks whether a methane was added (i.e. the region to the right of the red bar in the diagram of the atom above). Note that fuzzy matching is used to match screenshots against the actual contents of the screen or a region.
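The Sikuli code itself appeared as a screenshot in the original post, but the scanning logic is simple enough to sketch in plain Python, with a hypothetical sprouts_methyl standing in for the Sikuli steps (clear the window, click to create a methane, click at the offset, screenshot-match for success):

```python
def scan_clickable_region(half_width, sprouts_methyl):
    """Try clicking at each x offset along a horizontal line through the
    centre of the matched hydrogen, and record which offsets succeed."""
    return [x for x in range(-half_width, half_width + 1) if sprouts_methyl(x)]

# With a fake GUI where only offsets 2 to 6 work, the 'allowed area' comes
# back short and off-centre, much like the red line in the figure above:
hits = scan_clickable_region(10, lambda x: 2 <= x <= 6)
print(hits)  # [2, 3, 4, 5, 6]
```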

Saturday, 23 February 2013

cclib 1.1 and GaussSum 2.2.6 released

cclib is a Python library for parsing and analysing comp chem log files from many different QM packages. GaussSum is a GUI that uses cclib to monitor the progress of comp chem calculations and calculate predicted spectra for comparison with experimental results. cclib is the work of Adam Tenderholt, Karol Langner and myself, while GaussSum is just by me.

New releases of both of these are now available for download at their respective websites: http://cclib.sf.net and http://gausssum.sf.net. For help, email cclib-users@lists.sf.net or gausssum-help@lists.sf.net. Here's what's new in these releases:

cclib 1.1
New Features:
  • Add progress info for all parsers
  • Support ONIOM calculations in Gaussian (Karen Hemelsoet)
  • New attribute atomcharges extracts Mulliken and Lowdin atomic charges if present
  • New attribute atomspins extracts Mulliken and Lowdin atomic spin densities if present
  • New thermodynamic attributes: freeenergy, temperature, enthalpy (Edward Holland)
  • Extract PES information: scanenergies, scancoords, scanparm, scannames (Edward Holland)
Bugfixes:
  • Handle coupled cluster energies in Gaussian 09 (Björn Dahlgren)
  • Vibrational displacement vectors missing for Gaussian 09 (Björn Dahlgren)
  • Fix problem parsing vibrational frequencies in some GAMESS-US files
  • Fix missing final scfenergy in ADF geometry optimisations
  • Fix missing final scfenergy for ORCA where a specific number of SCF cycles has been specified
  • ORCA scfenergies not parsed if COSMO solvent effects included
  • Allow spin unrestricted calculations to use the fragment MO overlaps correctly for the MPA and CDA calculations
  • Handle Gaussian MO energies that are printed as a row of asterisks (Jerome Kieffer)
  • Add more explicit license notices, and allow LGPL versions after 2.1
  • Support Firefly calculations where nmo != nbasis (Pavel Solntsev)
  • Fix problem parsing vibrational frequency information in recent GAMESS (US) files (Chengju Wang)
  • Apply patch from Chengju Wang to handle GAMESS calculations with more than 99 atoms
  • Handle Gaussian files with more than 99 atoms having pseudopotentials (Björn Baumeier)
GaussSum 2.2.6
New Features:
  • A patch from Thomas Pijper was integrated to enable calculation of Raman intensities (from Raman activity).
  • Support has been added for calculating charge density changes for unrestricted calculations (requested by Phil Schauer).
  • Blank lines in Groups.txt are now ignored.
  • Parser updated to cclib 1.1

Sunday, 17 February 2013

A new home for Linux4Chemistry?

Are you interested in taking over the stewardship of the Linux4Chemistry website? This website was set up by Nikodem Kuznik in 2001 to promote the use of Linux for chemistry by listing chemistry software available for Linux, whether commercial, free, or open source. In 2005 I took over its maintenance, moved it to its present location and made some changes including the awesome logo. I am currently on the lookout for someone interested and willing to take over the running of this site. Update (25/02/2013): A new home has been found.

Why am I looking for someone new? Well, I think I am no longer the right person to maintain this, as the goals I wished to achieve have already been met. To me the idea of promoting Linux as a platform for chemistry software today seems obsolete when almost all chemistry software vendors target Linux at a minimum. Every docking program is on Linux, every QM package runs on Linux, etc.

In taking over the website I also had the goal of clarifying the distinction between Open Source software and software available for free. To this end, I added license information to all of the software, and made it possible to filter the results by license type (I was very proud of my tongue-in-cheek logo for Shareware software). I think that today more people are aware of this distinction and in any case, I'm not sure that L4C is really playing a role here.

It is also true that while at the time my desktop machine ran Linux (Debian Sarge, I think), these days I do most of my work on Windows and use a VM for Linux. And it's harder to maintain something that sits outside my day-to-day usage.

So I'm looking for someone to bring a new vision to Linux4Chemistry and shake it up a bit. If you are this person, get in touch.

Wednesday, 6 February 2013

A compilation of speeds - Compiler face-off

Compilers cake!
Let's get right into this one. I've compiled Open Babel with g++ in various ways, and am going to compare the speed with the MSVC++ release. Specifically I'm going to compare the wallclock time to convert 10000 molecules (the first 10000 in ChEMBL 13) from an SDF file to SMILES.
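For the record, the timings are wall-clock, best of three (see the notes at the end). A sketch of the methodology, assuming obabel is on the PATH; the SDF filename here is just a placeholder, and -l 10000 stops the import after the 10000th molecule:

```python
import subprocess
import time

def time_command(cmd):
    """Return the wall-clock time in seconds taken to run a command."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Best of three, e.g.:
#   times = [time_command(["obabel", "chembl_13.sdf", "-O", "out.smi",
#                          "-l", "10000"]) for _ in range(3)]
#   print("%.1fs" % min(times))
```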

Our starting point is the time for the MSVC++ compiled release:
29.6s (MSVC++ 2010 Express 32-bit)

I have a Linux Mint 12 VM (VMWare) on the same machine, so let's run the same executable under Wine on Linux:
37.3s (MSVC++ 32-bit under Wine/Linux)
...so it's slower, pretty much as expected. The not-an-emulation layer slows things down a bit.

How about the MinGW compilation described in the previous post?:
24.1s (MinGW g++ 4.6.2 32-bit)
g++ beats MSVC++. To be honest, I was a bit surprised to see this, although I understand from Roger that g++ is surprisingly highly-optimised for cheminformatics toolkits. Maybe we should look into an official MinGW release in future.

What about Open Babel compiled with Cygwin's g++?:
39.5s (Cygwin g++ 4.5.3 32-bit)
As expected it runs like a pig compared to the MinGW version. Cygwin's handy, but when you're in a hurry it's maybe not the best choice.

So far, nothing very unexpected. Now we enter the realm of weirdness. Let's compile it on Linux in the VM and run it there:
14.8s (Linux Mint 12 g++ 4.6.1 64-bit)

So, in short, the fastest way to run Open Babel on Windows is to use a VM to run Linux. Huh? The like-with-like comparison of MinGW's 24.1s versus Linux's 14.8s is the most intriguing. It suggests that the slowdown is either due to rubbish file I/O on Windows, or to sub-optimal platform-specific code in Open Babel's I/O handling.

Either way, it's a pretty interesting result.

Notes:
1. Hardware was a Dell Latitude E6400 bought 3 years ago (Core 2 Duo 2.4 GHz, 4GB RAM) running Win 7 64-bit. The timing was the best of three after timings had stabilised (the first run or two is usually a second or two slower).
2. After the initial post, I compiled clang on Linux, and then used it to compile Open Babel. Running the conversion took 15.3s.
3. Also, I ran the MinGW compiled version under Linux, and it took 30.7s.

Image credit: Venkatesh Srinivas (Extrudedaluminiu on Flickr)

Compiling Open Babel with MinGW on Windows

If you want to compile on Windows using GCC, you have two alternatives: Cygwin's GCC and MinGW's. The one from Cygwin is easier to use (easier installation) but has the disadvantage that the resulting software does not run natively on Windows: various system calls go through Cygwin's emulation layer, which slows things down. Here I'll show how to compile Open Babel with MinGW.

Installing MinGW

I've previously found this a bit confusing. This time I did a manual installation by creating a folder C:\MinGW, and then downloading all the relevant archives from the installation page. To do this quickly, just middle-click on several links, wait a few seconds, and then hit Save on all the dialog boxes. Once they are all downloaded, move them to C:\MinGW and unzip them there.

Installing MSYS

No need to install MSYS (a kind of build environment for MinGW) for a project such as Open Babel that uses CMake to build. Why do I mention it then? Because the MinGW page talks all about it.

Compiling Open Babel

1. Add C:\MinGW\bin to the PATH
2. Get Cygwin's stuff off the PATH (if it's there). This is most easily accomplished by renaming C:\Cygwin to C:\oldCygwin or so.
3. Configure CMake to create makefiles for MinGW. I had some problems (at runtime) with a shared library version, so I went with the static one:
cmake -G "MinGW Makefiles" ../openbabel-2.3.2 -DWITH_INCHI=FALSE -DBUILD_SHARED=FALSE
4. Build it with MinGW's make.
mingw32-make
Hmmm...I wonder if it's as fast as the MSVC-compiled version we distribute?