Thursday, 2 November 2017

Announcing more closures

It is a good idea to avoid extending an existing format. If additional information needs to be stored, many formats provide a means to store it separately so that software that does not support the extension will still be able to read and process the core information. For example, with SMILES, ChemAxon have famously used the title field to store additional info.

However, this is not always possible if one runs up against a fundamental limitation of a format.

I recently ran into a situation where I needed to deal with the fact that the SMILES syntax only supports ring closure numbers of at most two digits, via the %nn notation. The problem is that %111 does not mean ring closure 111 but rather ring closures 11 and 1. If you need a three-digit number you are out of luck - it simply can't be expressed as a SMILES string.
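To make the ambiguity concrete, here is a minimal sketch of how a SMILES reader tokenizes a run of ring-closure digits (`ring_closures` is a hypothetical helper for illustration, not any toolkit's actual code). The "%nn" form always consumes exactly two digits, which is why %111 falls apart:

```python
import re

# One token per ring closure: "%(n)" (the extension discussed further on),
# "%nn" (exactly two digits), or a single bare digit.
RING_CLOSURE = re.compile(r"%\(\d+\)|%\d\d|\d")

def ring_closures(digits):
    """Return the ring-closure numbers in a run of closure tokens."""
    return [int(tok.strip("%()")) for tok in RING_CLOSURE.findall(digits)]

print(ring_closures("%111"))    # [11, 1] -- not closure 111
print(ring_closures("%(111)"))  # [111] with the extended syntax
```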

So, is it really necessary to have this ability in the first place? Daylight didn't think so. And they were mainly right - when writing SMILES, if you reuse the numbers and pick a traversal order that closes rings as soon as possible then it may be possible to avoid requiring three digits (though a diamond or graphene substructure may cause problems - a refutable hypothesis).

But the thing is, any particular implementation may not be reusing ring closure numbers, or may be favoring a different traversal order (e.g. nicer looking SMILES are made by following ring substituents first). Indeed, remembering that this syntax can be used to create a bond between any two atoms, you may wish to use the ring closure syntax as a means to construct a molecule, and there you don't want to be limited to just 100 bonds; e.g. by stitching together a large molecule by combining shorter SMILES strings (see SmiLib for example, and the note under Restrictions), or via this SMILES hack of Andrew's.
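As a toy illustration of that stitching idea (a hypothetical `stitch` helper, not how SmiLib actually does it), mark each fragment's attachment point with "*", give both points the same ring-closure number, and join the fragments with a dot:

```python
def stitch(frag_a, frag_b, closure=90):
    """Bond the '*'-marked atom of frag_a to the '*'-marked atom of
    frag_b: the '.' separates the fragments, and the shared ring-closure
    number becomes a bond when the combined string is parsed."""
    tag = "%%%02d" % closure   # e.g. "%90"
    return frag_a.replace("*", tag) + "." + frag_b.replace("*", tag)

print(stitch("CC(*)C", "OC*"))  # CC(%90)C.OC%90
```

A SMILES reader will see the result as a single molecule with a bond between the two marked atoms. With the two-digit form there are only 100 closure numbers to hand out, which is exactly the limit that bites when stitching together a large molecule.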

When Bob Hanson of Jmol had to deal with this issue he settled on the syntax "%(number)" (discussed here in 2012, and more recently in J. Cheminf. 2016, 8, 50). One alternative might be "%%", but again this would be limited, this time to three digits. Might as well come up with a general solution as Bob has done. It uses an extra character for the three-digit case but this is going to be rare in the first place, and these SMILES are going to be so big that having a few more characters is not going to break the bank.

Support for this syntax has recently been added to RDKit and Open Babel. For more info see the associated pull requests (here and here). Hopefully we will start to see support for this more generally in other toolkits.

Notes:
1. According to the Jmol docs, "%(n)" supports any non-negative number (integer, presumably?). My implementation supports up to 5 digits.

Image credit:
image by InExtremiss on Flickr licensed CC-BY

Sunday, 1 October 2017

Open Babel in a snap

Maintainers of Linux distributions do an amazing job packaging applications for users, so that "apt/dnf install openbabel" makes software packages available for use within seconds. The only downside is that the version of the software may be older than desired, due to the infrequent release schedules of many distros combined with the infrequent release schedule of Open Babel itself.

To bridge the gap for such users, I have created a snap package for the latest release of Open Babel. For those distros that support snapd, "snap install openbabel" will install it and typing "openbabel." and hitting tab a few times will list the executables, e.g. "openbabel.obgui". Um, this is all in theory. If you actually get this to work on a distro other than Ubuntu, please leave a comment below.

Notes:
1. Snaps run in some sort of security sandbox. As far as I understand, the executables I have created can only read/write files in the user's HOME directory (and subdirectories).
2. Right now, I don't include the libraries or include files. This may come with time.
3. Similarly, right now I've only packaged the release version. Snaps are a good way to distribute nightly-builds or snapshot releases and this might be something to look into in future.

Saturday, 23 September 2017

How many cheminformaticians does it take to read a SMILES string?

A few days ago I posted the following poll question on Twitter:
How many Hs are on the N in the molecule described by this SMILES string? N(C)(C)(C)C
I provided four possible answers:
  • Zero
  • One
  • Depends
  • Can't say as no such molecule
Please take a moment to choose your own answer. To guide you, here is Inigo Montoya.
The results of the survey were interesting. I selected the answers to highlight common misconceptions, and indeed each was chosen at least as much as the correct answer. In fact, only two people got the correct answer, and one of those people shares the office with me.

First of all, it's not "can't say as no such molecule" (8/25 people). To even say whether any such molecule exists, one needs to read the SMILES string, and so could answer the question. Also, the scope of SMILES strings is any molecule that can be represented by a valence model; that is, it is not restricted to only molecules that exist. I think what people were really indicating here is that they work back from the molecule to answer how many hydrogens it must have; given that they couldn't work out what molecule it is, they weren't able to count the hydrogens.

And what about "depends"? (2 people) I'm guessing that people chose this answer to indicate that it's either 0 or 1 (or something else) but it's not possible to be certain. This might be where people are trying to second-guess what the correct molecule might have been, given that the one presented is clearly wrong in some way. Here, the key point that people are missing is that a SMILES string exactly represents a molecule, including all of its hydrogens.

But what is that hydrogen count? 0 or 1? Well, 13 people went for zero and 2 for one.

Which highlights the final misconception, that the number of hydrogens on an atom in a SMILES string is the same as the answer to "how many hydrogens would I guesstimate if I didn't know how many were attached?". The answer to that is probably zero here. However, when reading SMILES, the number of hydrogens on the nitrogen is exactly described by a set of rules outlined by Daylight at [1], and somewhat more clearly by OpenSMILES at [2] (a project which was spearheaded by former Daylight developer, Craig James). The rules are simple and have no special cases. For nitrogen, the rule is add hydrogens to round up the valence to 3 (if valence <= 3), or up to 5 (if 3 < valence <= 5), and zero otherwise. In short, one hydrogen should be added.
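The rule is easy to state in code. The sketch below is my own paraphrase of the OpenSMILES rule for atoms written without square brackets (organic-subset elements only; neutral, non-aromatic atoms): round the explicit valence up to the next "normal" valence for the element, or add nothing if it already exceeds the highest one.

```python
# Normal valences for the organic subset, per the OpenSMILES spec.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(element, explicit_valence):
    """Hydrogens implied by a bracketless atom: round the explicit
    valence up to the next normal valence; zero if above the highest."""
    for v in NORMAL_VALENCES[element]:
        if explicit_valence <= v:
            return v - explicit_valence
    return 0

# The N in N(C)(C)(C)C has four single bonds, so its explicit valence
# is 4; rounding up to 5 gives exactly one implicit hydrogen.
print(implicit_hydrogens("N", 4))  # 1
```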

If at this point you're thinking, "but that doesn't make sense - you shouldn't add a hydrogen", you are missing the point. The structure that the SMILES writer was presented with had a hydrogen there. We can argue about whether it was intended, but it did have it. The SMILES writer then used the SMILES valence rule described above to decide whether or not it could omit square brackets. It worked out that it could, and then decided to omit them. Any SMILES reader that respects the rules would then read it in and perfectly reproduce the original structure.

...Or would they? I've just been testing the ability of a number of cheminformatics toolkits and drawing programs to agree on the number of hydrogens on each atom of a given SMILES string. The goal is not to give anyone a kicking, but to improve SMILES interoperability across the board, and in the last two weeks I've given feedback to (I think) 7 different toolkits with this goal in mind. So far I have test results for the Avalon toolkit, BIOVIA Draw, CACTVS, the CDK, ChemDraw, Indigo, IWToolkit, JChem, OEChem, Open Babel, OpenChemLib, the RDKit, and Weininger's CEX. If you would like to add another program to this list, then I encourage you to get in touch.

References:
[1] http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html (section 3.2.1)
[2] http://opensmiles.org/opensmiles.html (section 3.1.5)

Monday, 28 August 2017

My ACS talk on Kekulization and aromatic SMILES

Here are the slides for the talk I presented last week at the ACS meeting in Washington. It describes my understanding of the Daylight toolkit as deduced by John:
The fundamental reason for attempting to describe the behaviour of the Daylight toolkit is the assumption that this is the correct way to read/write SMILES, and that any deviation is either wrong or at least should be justified. The background to this is that I recently worked on the handling of kekulization and reading SMILES in Open Babel, but found that many of the details were not present in the Daylight documentation. As I say on the final slide, please let me know if you feel I have made any mistake. You can do this by email or by leaving a comment below.

Over at the NextMove blog, Peter Shenkin brought up the biphenylene case, which (to my mind) illustrates alternative approaches to reading aromatic SMILES. Consider the SMILES string c12ccccc1c3ccccc23. Some toolkits may read this, work out that only the two six-membered rings can be aromatic, and then make sure that the double bonds are not placed in the four-membered ring. I refer to this approach as dearomatisation, an approach that Open Babel used to use. It involves ring detection, 4n+2 counting and so forth. Apart from taking some time, an obvious problem is that the reader and writer may use different aromaticity models, leading the reader to drop aromaticity from a particular ring (typically by setting those bonds to single bonds and adjusting hydrogens) and so produce a different structure than intended.

In any case, this is not the approach used by the Daylight toolkit, which did not consider 4n+2, or even detect cycles. The approach is described in the talk above so I won't repeat it here. For the SMILES above, I believe that it would generate one of two Kekulé forms depending on the atom order; one with the two double bonds in the four-membered ring, and one with two benzenes. It's for this reason that Daylight would never generate that SMILES for biphenylene ("don't generate aromatic SMILES that you can't kekulize"), but always write a single bond symbol for the bonds connecting the phenyl rings (e.g. something like c12-c3c(-c2cccc1)cccc3). When written that way, kekulization always gives the desired form.
This use of a single bond symbol is rather subtle. When writing an aromatic SMILES, the rule is "use a single bond symbol when a ring bond is between aromatic atoms but is not itself aromatic". This corrects for the fact that all default bonds joining aromatic atoms are themselves considered aromatic (except where the bond is not in a ring but the atoms are).

Following up on a comment by Rajarshi, while differences in aromaticity models are a problem for 'dearomatisation' algorithms, they are not a problem for the kekulization algorithm used by Daylight. So long as the structure is kekulizable (and appropriate single-bond symbols are used), any structure can be read in without loss of information, no matter what aromaticity model was used to write it.

Sunday, 23 July 2017

Faster toolkit, faster!

After an extended hiatus, I've been back doing a bit of work on Open Babel, and specifically I've been working on improving performance. Basic reading and writing of molecules should be limited by disk I/O speed and not CPU, but this is not (yet) the case with Open Babel. Through a combination of avoiding unnecessary work, trimming inefficiencies in existing code and using more efficient algorithms, I'm hoping to speed things up.

1. Replace slower algorithms with faster algorithms.

This is perhaps the hardest one, as it takes time to get to grips with the existing code and figure out whether and where it can be improved. So far the only thing I've done along these lines is replace our previous kekulisation procedure with a perfect matching algorithm. Though also in this category, I guess, are the changes I made to replace the original set of SMARTS patterns for aromaticity 'atom-typing' with logic to do the typing directly in code.
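For the curious, kekulization-as-matching boils down to pairing up every aromatic atom that needs a double bond with exactly one of its neighbours. The sketch below uses naive backtracking to find such a pairing; a real implementation would use an efficient maximum-matching algorithm, which is what makes it fast.

```python
def kekulize(atoms, bonds, needs_double):
    """Assign double bonds by finding a perfect matching over the atoms
    that each need one double bond (naive backtracking; fine as a sketch,
    too slow for real use). Returns a dict pairing each matched atom with
    its partner, or None if no Kekule structure exists."""
    adj = {a: set() for a in atoms}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    todo = [a for a in atoms if needs_double.get(a)]
    matched = {}

    def solve(i):
        while i < len(todo) and todo[i] in matched:
            i += 1
        if i == len(todo):
            return True  # every atom that needed a double bond has one
        a = todo[i]
        for b in adj[a]:
            if needs_double.get(b) and b not in matched:
                matched[a] = b
                matched[b] = a
                if solve(i + 1):
                    return True
                del matched[a], matched[b]  # undo and try the next neighbour
        return False

    return matched if solve(0) else None

# Benzene: six aromatic carbons in a ring, each needing one double bond.
benzene = kekulize(range(6), [(i, (i + 1) % 6) for i in range(6)],
                   {i: True for i in range(6)})
# An odd ring where every atom needs a double bond can't be matched.
impossible = kekulize(range(5), [(i, (i + 1) % 5) for i in range(5)],
                      {i: True for i in range(5)})  # None
```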

2. Streamline existing code.

This can be tedious and doesn't give a big win, but it's a case of avoiding what Roger refers to as Death by a Thousand Cuts. Individually, they don't count for much (and Stack Overflow would have you believe you shouldn't worry about them), but you should think of reading a molecule as the inner loop and consider that it might be done millions of times (e.g. when processing ChEMBL).

In particular, in the context of file format reading and writing, unnecessary string copies are to be avoided. This can be anything from a function that takes a std::string parameter by value, to copying part of the input buffer unnecessarily, to concatenating strings with strcat.

3. Avoid unnecessary work by considering the OBMol to be a container for the contents of the file format.

This is a roundabout way of saying that the file format reader should not worry too much about the chemical content of the described molecule and shouldn't spend time checking and validating it. If there's a carbon atom with a +20 charge, fine. If there's a septuple bond between hydrogens, sure, go right ahead. Just read it in and bung it in an OBMol. That's not to say that there isn't a role for validation, but it should be an option as it takes time, may be completely unnecessary (e.g. you have just written out this molecule yourself) and is, strictly speaking, distinct from file format reading.

We already did this to a certain extent, but we didn't follow through completely. If the user said an atom was aromatic, there was no way to preserve this and avoid reperception. This has now been fixed in the current master, and the SMILES reader has an option to preserve aromaticity. Similarly, we currently reperceive stereocenters rather than accept them at face value as present in SMILES for example. This is next on my list of things to change.

Related pull requests:
* Improve performance of element handling
* Improve performance of SMILES parser
* Keep count of implicit hydrogens instead of inferring them
* Change the OBAromTyper from using SMARTS patterns to a switch statement

Image credit:
Renato Carvalho on Flickr

Friday, 5 May 2017

Using WebLogo3 to create a sequence logo from Python

A sequence logo is a way to display the variation at particular positions of multiple-aligned sequences. Here I used WebLogo 3 to do this. The documentation is not great, even once you find it, and working out how to do my specific use case required a certain amount of looking at the source code of the library. Hence this post for future me.

My sequences are peptides, and use lowercase to indicate D forms of the amino acids. So I needed to create my own 'alphabet' as all of those provided by the library uppercase everything. I also wanted to highlight a reference sequence in a particular colour. This is made easy by the RefSeqColor rule, but I needed to override it, as again it wanted to uppercase everything.
import StringIO  # Python 2; on Python 3 use io.StringIO instead
import weblogolib as w

class RefSeqColor(w.ColorRule):
    """
    Color the given reference sequence in its own color, so you can easily see 
    which positions match that sequence and which don't.
    """

    def __init__(self, ref_seq, color, description=None):
        self.ref_seq = ref_seq
        self.color = w.Color.from_string(color)
        self.description = description

    def symbol_color(self, seq_index, symbol, rank):
        if symbol == self.ref_seq[seq_index]:
            return self.color

baserules = [
            w.SymbolColor("GSTYC", "green", "polar"),
            w.SymbolColor("NQ", "purple", "neutral"),
            w.SymbolColor("KRH", "blue", "basic"),
            w.SymbolColor("DE", "red", "acidic"),
            w.SymbolColor("PAWFLIMV", "black", "hydrophobic")
        ]

protein_alphabet = w.Alphabet('ACDEFGHIKLMNOPQRSTUVWYBJZX*-adefghiklmnopqrstuvwybjzx', [])

def plotseqlogo(refseq, mseqs, name):
    fasta = "> \n" + "\n> \n".join(mseqs)
    seqs = w.read_seq_data(StringIO.StringIO(fasta), alphabet=protein_alphabet)

    colorscheme = w.ColorScheme([RefSeqColor(refseq, "orange", "refseq")] + baserules,
                                alphabet = protein_alphabet)

    data = w.LogoData.from_seqs(seqs)
    options = w.LogoOptions()
    # options.logo_title = name
    options.show_fineprint = False
    options.yaxis_label = ""
    options.color_scheme = colorscheme
    mformat = w.LogoFormat(data, options)

    fname = "%s.pdf" % name
    with open(fname, "wb") as f:
        f.write(w.pdf_formatter(data, mformat))

if __name__ == "__main__":
    testdata = ["ACDF", "ACDF", "ACDE", "CCDE"]
    plotseqlogo("ACDF", testdata, "testdata")
Notes: The code above writes the logo out to a PDF file, which I subsequently converted to SVG with Inkscape at the commandline:
"C:\Program Files (x86)\Inkscape\inkscape.com" --without-gui --file=fname.pdf --export-plain-svg=svg/fname.svg

Monday, 23 January 2017

Whiskas statistics and the pitfalls of mean rank

In the fingerprint similarity paper I published last year, I made the following observation: "However the use of the mean rank is itself problematic as the pairwise similarity of two methods can be altered (and even inverted) by adding additional methods to an evaluation." I thought about providing an example illustrating this, but I didn't want to digress too much. And besides, that's perfect blog fodder...

Let's say we want to compare N methods, and we have their scores for 10 diverse datasets. Because they are diverse, we think to ourselves that it wouldn't be an accurate evaluation if we used the scores directly, e.g. taking the mean scores or some such, because some of the datasets are easy (all of the scores are high but spread out) and some are hard (all of the scores are low, but close together). As I reference in the paper, Bob Sheridan has discussed this problem, and among other solutions suggested using the mean rank.

So here's the thing. Consider two methods, A and B, where A outperforms B on 9 out of the 10 datasets, i.e. A has rank 1 nine times and rank 2 once, while vice versa for B. So the mean rank for A is 1.1 while that for B is 1.9, implying that A is better than B. So far so good.

Now let's suppose I start tweaking the parameters of B to generate 9 additional methods (B1 to B9). Unfortunately, while they have similar performance to B, they are all slightly worse on each dataset. So the rank order for 9 of the 10 datasets is now A > B > B1...B9, and that for the 10th dataset is B > B1...B9 > A. So A has rank 1 nine times (as before) and rank 11 once (giving a mean rank of 2.0), while B has the same ranks as before. So now B is better than A.
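A few lines of Python make the arithmetic concrete. The scores below are invented, chosen so that A beats B on nine of the ten datasets and each tweaked copy of B is slightly worse than B everywhere:

```python
def mean_ranks(scores):
    """scores maps method name -> per-dataset scores (higher is better);
    returns the mean rank of each method across the datasets."""
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0 for m in methods}
    for i in range(n_datasets):
        ordered = sorted(methods, key=lambda m: scores[m][i], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: t / float(n_datasets) for m, t in totals.items()}

A = [0.9] * 9 + [0.5]   # A beats B on nine of the ten datasets
B = [0.8] * 10
print(mean_ranks({"A": A, "B": B}))  # A: 1.1, B: 1.9 -- A looks better

# Add nine tweaked copies of B, each slightly worse than B everywhere.
variants = {"B%d" % i: [s - 0.01 * i for s in B] for i in range(1, 10)}
both = dict({"A": A, "B": B}, **variants)
print(mean_ranks(both))              # A: 2.0, B: 1.9 -- now B looks better
```

Neither A's nor B's scores changed between the two calls; only the field of competitors did.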

Wait a second. Neither A nor B has changed, so how can our evaluation of their relative performance have changed? And therein lies the problem.

So what's the solution? Even a cat knows the answer: as I stated at the start, on 9 out of 10 datasets A outperformed B. So A is better than B, or A>B. It's as simple as that. We don't need to calculate the mean anything. Our goal is not to find a summary value for the overall performance of a method, and then to compare those. It is to evaluate relative performance, and for diverse datasets each dataset has an equal vote towards that answer. So on dataset 1 (of a different imaginary evaluation, this time with four methods), each method dukes it out giving A>B, A>C, A>D, B>C, B>D, C=D. Then round 2, on dataset 2. Totting up the values over all datasets will yield something like A>B 10 times, while B>A 5 times, and A=B twice, so A>B overall (and similar results for the other pairwise comparisons).
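The dataset-as-voter tally can be sketched in a few lines (again with invented scores; the significance testing mentioned further down is omitted):

```python
from itertools import combinations

def pairwise_votes(scores):
    """Each dataset casts one vote per pair of methods; returns
    {(winner, loser): wins, ...}, with ties under (a, b, 'tie')."""
    votes = {}
    n_datasets = len(next(iter(scores.values())))
    for i in range(n_datasets):
        for a, b in combinations(sorted(scores), 2):
            if scores[a][i] > scores[b][i]:
                key = (a, b)
            elif scores[b][i] > scores[a][i]:
                key = (b, a)
            else:
                key = (a, b, "tie")
            votes[key] = votes.get(key, 0) + 1
    return votes

v = pairwise_votes({"A": [0.9] * 9 + [0.5], "B": [0.8] * 10})
print(v)  # {('A', 'B'): 9, ('B', 'A'): 1} -- so A > B overall
```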

The nice thing about all of these pairs is that you can then plot a Hasse diagram to summarise the info. There are some niggly details like handling incomparable values (e.g. A>B, B>C, C>A) but when it's all sorted out, the resulting diagram is quite information-rich.

I didn't mention statistical significance, but the assessment of whether you can say A>B overall requires a measure of significance. In my case, I could generate multiple datasets from the same population, and so I used a basic t-test over the final results from each of the datasets (and corrected for multiple testing).

I've gone on and on a bit about what's really quite a simple idea. In fact, it's so simple that I was sure it must be how people already carry out performance evaluations across diverse datasets. However, I couldn't find any evidence of this, nor could I find any existing description of this method. Any thoughts?