Pre-processingThe standard way to carry out a fast similarity search with Open Babel is the so-called fastsearch method, which precalculates an index file of fingerprints for an sdf file, and then linearly-scans through this for hits against a single query structure. With chemfp, the initial step is similar; you convert everything to fps files, a standard file format for fingerprints developed by Andrew. Previously you needed to use chemfp's ob2fps to do this, but OB 2.3.2 added support for fps files and can generate them directly 2 to 3 times faster.
$ obabel chembl_15mod.sdf -O chembl.fs # for fastsearch $ obabel chembl_15mod.sdf -O chembl.fps # for chemfp
Single-molecule queryLet's take the first molecule in the database as a query, and find those molecules in ChEMBL that are within 0.8 similarity (Tanimoto). simsearch is the name of chemfp's tool for similarity searching:
$ obabel chembl.fs -osmi -s query.sdf -aa -at 0.8 > threshold.fs.txt $ simsearch --threshold 0.8 -q query.fps chembl.fps > threshold.fps.txtThey both take about 1.2 seconds. I'm being a bit vague because the exact value doesn't matter; for some query molecules fastsearch takes longer, for some less. It's just to give an idea. But in short, fastsearch appears to be in the same ballpark as simsearch for single molecule queries.
Multi-molecule queryfastsearch doesn't provide an easy way to do multiple queries. If you want to do it yourself, you would just have to do each search one-by-one. So 1000 searches would take 1000s, let's say. In contrast, with simsearch 1000 searches on ChEMBL takes only 8s:
$ simsearch --threshold 0.8 -q largequery.fps chembl.fps > thresholdb.fps.txtHow does it manage this? Well, first of all, it runs in parallel on 4 available CPUs by default (...see Andrew's comment below). These figures come from a server which has 4 hyperthreading CPUs. The rest of the improvement comes from the use of in-memory data structures and algorithmic magic, some of which Andrew has described on his blog over the last year or so.
With speeds like this, it brings all-against-all similarity searches within reach, and simsearch provides an option just for this. The following finds the nearest 8 molecules for each molecule in the dataset:
$ simsearch -k 8 --NxN largequery.fps > NxN.fps.txtI tried this for the first 10000 molecules in ChEMBL, and it took about 1s; 50K took 11s; 100K took 32s; 250K took 144s; and the whole of ChEMBLdb (1.2 million) took 2900s (48m 20s). To put this in context, an all-against-all search of ChEMBLdb using fastsearch would take something like 14 days.
Single-molecule query revisitedThe alert reader may be wondering why, if the multi-molecule query is so darned fast, the single-molecule query is no faster than Open Babel's fastsearch. The answer is rather simple (...but not quite right - see Andrew's comment below): 99% of the time spent on the single-molecule query is setup. Once the search datastructure has been initialised, then the response for a query is within milliseconds. You can see timings for these over at the chemfp website. To achieve these timings in practice, you would need to write a Python script that used the chemfp library and was accessed via some sort of client/server architective, e.g. a local webservice. As Andrew points out, this would allow search results to be returned instantly as a chemist sketches a structure. Trying this out is left as an exercise for the reader. :-)
In conclusionSo, in short, if you have any interest in running multiple queries against a database, comparing two large datasets, or finding pairwise similarity within a dataset, check out chemfp.
1. I should point out that Andrew is making chemfp available as open source but with delayed release. Commercial licensees get support and the latest features.
2. For ease of use with Open Babel, the ChEMBLdb SDF file was modified to add the chembl_id to the title. Oh ye gods of ChEMBL please consider doing this for your humble users.
Image credit: Russell J Watkins on Flickr