Monday, 28 August 2017

My ACS talk on Kekulization and aromatic SMILES

Here are the slides for the talk I presented last week at the ACS meeting in Washington. It describes my understanding of the Daylight toolkit as deduced by John:
The fundamental reason for attempting to describe the behaviour of the Daylight toolkit is the assumption that this is the correct way to read/write SMILES, and that any deviation is either wrong or at least should be justified. The background to this is that I recently worked on the handling of kekulization and reading SMILES in Open Babel, but found that many of the details were not present in the Daylight documentation. As I say on the final slide, please let me know if you feel I have made any mistake. You can do this by email or by leaving a comment below.

Over at the NextMove blog, Peter Shenkin brought up the biphenylene case, which (to my mind) illustrates alternative approaches to reading aromatic SMILES. Consider the SMILES string c12ccccc1c3ccccc23. Some toolkits may read this, work out that only the two six-membered rings can be aromatic, and then make sure that the double bonds are not placed in the four-membered ring. I refer to this approach as dearomatisation, an approach that Open Babel used to use. It involves ring detection, 4n+2 counting and so forth. Apart from taking some time, an obvious problem is different aromaticity models may be used by the reader and writer, thus leading the reader to drop aromaticity from a particular ring, typically by setting those bonds to single bonds and adjusting hydrogens, resulting in a different structure than intended.

In any case, this is not the approach used by the Daylight toolkit, which did not consider 4n+2, or even detect cycles. The approach is described in the talk above so I won't repeat it here. For the SMILES above, I believe that it would generate one of two Kekulé forms depending on the atom order; one with the two double bonds in the four-membered ring, and one with two benzenes. It's for this reason that Daylight would never generate that SMILES for biphenylene ("don't generate aromatic SMILES that you can't kekulize"), but always write a single bond symbol for the bonds connecting the phenyl rings (e.g. something like c12-c3c(-c2cccc1)cccc3). When written that way, kekulization always gives the desired form.
This use of a single bond symbol is rather subtle. When writing an aromatic SMILES, the rule is "use a single bond symbol when a ring bond is between aromatic atoms but is not itself aromatic". This corrects for the fact that all default bonds joining aromatic atoms are themselves considered aromatic (except where the bond is not in a ring but the atoms are).

Following up on a comment by Rajarshi, while differences in aromaticity models are a problem for 'dearomatisation' algorithms, they are not a problem for the kekulization algorithm used by Daylight. So long as the structure is kekulizable (and appropriate single-bond symbols are used) then it can read in any structure without loss of information no matter what aromatic model is used.

Sunday, 23 July 2017

Faster toolkit, faster!

After an extended hiatus, I've been back doing a bit of work on Open Babel, and specifically I've been working on improving performance. Basic reading and writing of molecules should be limited by disk I/O speed and not CPU, but this is not (yet) the case with Open Babel. Through a combination of avoiding unneccessary work, trimming inefficiencies in existing code and using more efficient algorithms, I'm hoping to speed things up.

1. Replace slower algorithms with faster algorithms.

This is perhaps the hardest one, as it takes time to get grips with the existing code and figure out whether and where it can be improved. So far the only thing I've done in this line is replace our previous kekulisation procedure with a perfect matching algorithm. Though also, I guess, in this category are the changes I made to replace the original set of SMARTS patterns for aromaticity 'atom-typing' with logic to do the typing directly in code.

2. Streamline existing code.

This can be tedious and doesn't give a big win, but it's a case of avoiding what Roger refers to as Death by a Thousand Cuts. Individually, they don't count for much (and Stack Overflow would have you believe you shouldn't worry about them), but you should think of reading a molecule as the inner loop and consider that it might be done millions of times (e.g. when processing ChEMBL).

In particular, in the context of file format reading and writing, unnecessary string copies are to be avoided. This can be everything from a function that takes a std::string as a parameter, copying part of the input buffer unneccessarily, or concatenating strings with strcat.

3. Avoid unneccesary work by considering the OBMol to be a container for the contents of the file format.

This is a roundabout way of saying that the file format reader should not worry too much about the chemical content of the described molecule and shouldn't spend time checking and validating it. If there's a carbon atom with a +20 charge, fine. If there's a septuple bond between hydrogens, sure, go right ahead. Just read it in and bung it in an OBMol. That's not to say that there isn't a role for validation, but it should be an option as it takes time, may be completely unneccessary (e.g. you have just written out this molecule yourself) and is, strictly speaking, distinct from file format reading.

We already did this to a certain extent, but we didn't follow through completely. If the user said an atom was aromatic, there was no way to preserve this and avoid reperception. This has now been fixed in the current master, and the SMILES reader has an option to preserve aromaticity. Similarly, we currently reperceive stereocenters rather than accept them at face value as present in SMILES for example. This is next on my list of things to change.

Related pull requests:
* Improve performance of element handling
* Improve performance of SMILES parser
* Keep count of implicit hydrogens instead of inferring them
* Change the OBAromTyper from using SMARTS patterns to a switch statement

Image credit:
Renato Carvalho on Flickr

Friday, 5 May 2017

Using WebLogo3 to create a sequence logo from Python

A sequence logo is a way to display the variation at particular positions of multiple-aligned sequences. Here I used WebLogo 3 to do this. The documentation is not great, even once you find it, and working out how to do my specific use case required a certain amount of looking at the source code of the library. Hence this post for future me.

My sequences are peptides, and use lowercase to indicate D forms of the amino acids. So I needed to create my own 'alphabet' as all of those provided by the library uppercase everything. I also wanted to highlight a reference sequence in a particular colour. This is made easy by the RefSeqColor rule, but I needed to override it, as again it wanted to uppercase everything.
import os
import StringIO
import weblogolib as w

class RefSeqColor(w.ColorRule):
    Color the given reference sequence in its own color, so you can easily see 
    which positions match that sequence and which don't.

    def __init__(self, ref_seq, color, description=None):
        self.ref_seq = ref_seq
        self.color = w.Color.from_string(color)
        self.description = description

    def symbol_color(self, seq_index, symbol, rank):
        if symbol == self.ref_seq[seq_index]:
            return self.color

baserules = [
            w.SymbolColor("GSTYC", "green", "polar"),
            w.SymbolColor("NQ", "purple", "neutral"),
            w.SymbolColor("KRH", "blue", "basic"),
            w.SymbolColor("DE", "red", "acidic"),
            w.SymbolColor("PAWFLIMV", "black", "hydrophobic")

protein_alphabet = w.Alphabet('ACDEFGHIKLMNOPQRSTUVWYBJZX*-adefghiklmnopqrstuvwybjzx', [])

def plotseqlogo(refseq, mseqs, name):
    fasta = "> \n" + "\n> \n".join(mseqs)
    seqs = w.read_seq_data(StringIO.StringIO(fasta), alphabet=protein_alphabet)

    colorscheme = w.ColorScheme([RefSeqColor(refseq, "orange", "refseq")] + baserules,
                                alphabet = protein_alphabet)

    data = w.LogoData.from_seqs(seqs)
    options = w.LogoOptions()
    # options.logo_title = name
    options.show_fineprint = False
    options.yaxis_label = ""
    options.color_scheme = colorscheme
    mformat = w.LogoFormat(data, options)

    fname = "%s.pdf" % name
    with open(fname, "wb") as f:
        f.write(w.pdf_formatter(data, mformat))

if __name__ == "__main__":
    testdata = ["ACDF", "ACDF", "ACDE", "CCDE"]
    plotseqlogo("ACDF", testdata, "testdata")
Notes: The code above writes the logo out to a PDF file, which I subsequently converted to SVG with Inkscape at the commandline:
"C:\Program Files (x86)\Inkscape\" --without-gui --file=fname.pdf --export-plain-svg=svg/fname.svg

Monday, 23 January 2017

Whiskas statistics and the pitfalls of mean rank

In the fingerprint similarity paper I published last year, I made the following observation: "However the use of the mean rank is itself problematic as the pairwise similarity of two methods can be altered (and even inverted) by adding additional methods to an evaluation." I thought about providing an example illustrating this, but I didn't want to digress too much. And besides, that's perfect blog fodder...

Let's say we want to compare N methods, and we have their scores for 10 diverse datasets. Because they are diverse, we think to ourselves that it wouldn't be an accurate evaluation if we used the scores directly, e.g. taking the mean scores or some such, because some of the datasets are easy (all of the scores are high but spread out) and some are hard (all of the scores are low, but close together). As I reference in the paper, Bob Sheridan has discussed this problem, and among other solutions suggested using the mean rank.

So here's the thing. Consider two methods, A and B, where A outperforms B on 9 out of the 10 datasets, i.e. A has rank 1 nine times and rank 2 once, while vice versa for B. So the mean rank for A is 1.1 while that for B is 1.9, implying that A is better than B. So far so good.

Now let's suppose I start tweaking the parameters of B to generate 9 additional methods (B1 to B9). Unfortunately, while they have similar performance to B, they are all slightly worse on each dataset. So the rank order for 9 of the 10 datasets in now A > B > B1...B9, and that for the 10th dataset is B > B1...B9 > A. So A has rank 1 nine times (as before) and rank 11 once (giving a mean rank of 2.0), while B has the same rank as before. So now B is better than A.

Wait a second. Neither A nor B has changed, so how can our evaluation of their relative performance have changed? And therein lies the problem.

So what's the solution? Even a cat knows the answer: as I stated at the start, on 9 out of 10 datasets A outperformed B. So A is better than B, or A>B. It's as simple as that. We don't need to calculate the mean anything. Our goal is not to find a summary value for the overall performance for a method, and then to compare those. It is to evaluate relative performance, and for diverse datasets each dataset has an equal vote towards that answer. So on dataset 1 (taking a different imaginary dataset), each method dukes it out giving A>B, A>C, A>D, B>C, B>D, C=D. Then round 2, on dataset 2. Toting up the values over all datasets will yield something like A>B 10 times, while B>A 5 times, and A=B twice, so A>B overall (and similar results for other pairwise comparisons).

The nice thing about all of these pairs is that you can then plot a Hasse diagram to summarise the info. There are some niggly details like handling incomparable values (e.g. A>B, B>C, C>A) but when it's all sorted out, the resulting diagram is quite information-rich.

I didn't mention statistical significance, but the assessment of whether you can say A>B overall requires a measurement of significance. In my case, I could generate multiple datasets from the same population, and so I used a a basic T-test over the final results from each of the datasets (and corrected for multiple testing).

I've gone on and gone on a bit about what's really quite a simple idea. In fact, it's so simple that I was sure it must be how people already carry out performance evaluations across diverse datasets. However, I couldn't find any evidence of this, nor could I find any existing description of this method. Any thoughts?

Saturday, 14 January 2017

Counting hydrogens in a SMILES string - The Rules

Hydrogens are not usually listed explicitly in a SMILES string, but instead can be inferred from a set of rules. To be precise, when reading a SMILES, the position of every hydrogen is known unambiguously once you know the rules. Oh - did I forget to mention the catch? - those rules are not written down anywhere. Nice.

So let's fix the world.

What is written down (i.e. on the Daylight website, the OpenSMILES spec) is the following:
1. For atoms in brackets, the number of hydrogens is either listed or zero. e.g. [Na] has 0 zeros, [NaH5] has 5.
2. For atoms outside brackets (which means they must be in the so-called 'organic subset'), if they are not written lowercase (indicating aromaticity), the number of hydrogens is described by the SMILES valence model (see the OpenSMILES specification for the details), e.g. the carbons in CC have 3 implicit hydrogens.

What is not written down (apart from references to pyrrole-/pyridine-type hydrogens) is what to do about (unbracketed) aromatic atoms when reading. These should be handled as follows: [Updated 03/Aug/2017]
1. Calculate the bond order sum by treating aromatic bonds as single bonds
2. Apply normal SMILES implicit valence rules using this sum
3. Subtract one from the number of implicit hydrogens, if there are any

As an example, consider aromatic C and N as in pyridine, c1ccncc1. The bond order sum for both the c and the n is 2. Their implicit valence is 4 and 3 respectively, and so the number of implicit hydrogens would be 4-2 and 3-2 respectively. After subtracting one, this gives 1 and 0, respectively. As another example, consider 'cn'. Without worrying too much about whether this is a sensible SMILES string, we can still read off the number of hydrogens based on the rules above: 4-1 on the carbon, and 3-1 on the nitrogen, then subtract one from each, and so [CH2] and [NH] connected by an aromatic bond (if kekulized, this should give H2C=NH). As a final example, consider the nitrogen of c1ccn(C)c1: 3-3 on the nitrogen gives 0, and we can't do any further subtraction.

Probably because of the lack of clarity around these latter rules, not every writer or reader follows them, and Wikipedia in particular is rife with generated SMILES from REDACTED which writes lowercase n whether or not the nitrogen has a hydrogen present. Some toolkits fail to read such SMILES, or interpret them differently than intended.

This raises the question as to what to do when presented with, for example, the SMILES c1cncc1? A correct SMILES for pyrrole would be c1c[nH]cc1. As written, the first SMILES is not kekulisable according to the Daylight aromaticity model (a neutral 'n' without a hydrogen must have a double bond, or alternatively, radicals contibute only a single electron) but one could infer that the structure intended was pyrrole. This is a slippery slope, though, once you consider aromatic rings with multiple nitrogens where it may not be possible to unambiguously assign hydrogens. Also, I would argue that it is not the job of a reader to change the structure because "it knows best".

For related reading, see John Mayfield's posts on SMILES implicit valence of aromatic atoms and New SMILES behaviour - parsing (CDK 1.5.4). Also worth noting is that at no point in identifying hydrogen locations was determination of aromatic systems or kekulisation required. Of course, if you go down the route of editing erroneous structures that may be a different story.

Image credit: Licensed CC-BY-NC by Sean Davis (image on Flickr)

Friday, 30 December 2016

The clockwisdom of SMARTS

In earlier posts I discussed/investigated how stereochemistry is represented in SMILES. Here I'm going to try to figure out what I thought would be relatively simple, how to write a SMARTS pattern that matches a chiral molecule and all of its superstructures. For example, consider wanting to search for glucopyranose-containing molecules in a database that also contained other hexopyranose epimers. As background, see the Mischievous SMARTS post by John Mayfield.

As a simple example, let's take the molecule represented by the SMILES F[C@@H](Br)Cl. I want to write a SMARTS pattern that matches this, as well as all superstructures. In this context, given that the halides typically are single valent, such superstructures are replacements of the hydrogen by arbitrary R groups.

Now, you may be aware that every SMILES string is also a valid SMARTS pattern. Unfortunately, it is also true that this is rarely the SMARTS pattern that you want. In this particular case, the original SMILES string, when interpreted as a SMARTS query, requires that the C has exactly 1 hydrogen attached. In other words, it won't match any superstructures (except for the elusive 5-valent carbon).

So let's leave out the H to give F[C@@](Br)Cl. This cannot be read as SMILES (at least not without warnings) since a chiral carbon requires four neighbours, but it is a valid SMARTS pattern. The question is what does it match?

The answer is more subtle than I, at least, expected. It will only match molecules that correspond to the following pseudosmiles, F[C@@](X)(Br)Cl, where X is anything including an implicit H.

Equally relevant is what it won't match. If X is F, thereby losing the chirality, then you are out of luck, but I would consider that a perfectly reasonable superstructure. And following on from this, it also won't match any other cases where the stereo is not defined at the carbon.

So in the end, I have come to the view that the best SMARTS pattern to use is F[C@@?](Br)Cl, which also matches the case where stereo is undefined. Better to cast the net wide and if someone really doesn't want to match those cases it is easy to do a search-and-replace.

Friday, 23 September 2016

Open Babel 2.4.0 released

As announced by Geoff on the mailing list, Open Babel 2.4.0 is now available to download:
I'm pleased to announce that Open Babel 2.4.0 has finally been released.

This release represents a major update and should be a stable upgrade, strongly recommended for all users.

We intend to move to an annual major release every September, with bug fix releases as needed.

A sample of major new features:
  • Integration of the confab conformer generator
  • Improved partial charge models, including EEM methods, EQeq
  • ECFP radial fingerprints
  • Initial support for ring rotamer / conformer sampling
  • Improved GAFF atom typing and parameterization
  • New PHP scripting bindings
  • Many new file formats, features and bug fixes

For a full list of changes and to download:

Thanks to a cast of many for this release, including:
Alexandr Fonari, Anders Steen Christensen, Andreas Kempe, arkose, Benoit Leblanc, Björn Grüning, Casper Steinmann, Chris Morley, Christoph Willing, Craig James, Dagmar Lenk, David Hall, David Koes, David Lonie, David van der Spoel, Dmitriy Fomichev, Fulvio Ciriaco, Fredrik Wallner, Geoff Hutchison, Heiko Becker, Itay Zandbank, Jean-Noel Avila, Jeff Janes, Joaquin Peralta, Joshua Swamidass, Julien Nabet, Karol Langner, Karthik Rajagopalan, Katsuhiko Nishimra, Kevin Horan, Kirill Okhotnikov, Lee-Ping, Matt Harvey, Maciej Wójcikowski, Marcus Hanwell, Mathias Laurin, Matt Swain, Mohamad Mohebifar, Mohammad Ghahremanpour, Noel O'Boyle, Patrick Avery, Patrick Fuller, Paul van Maaren, Peng Bai, Philipp Thiel, Reinis Danne, Roger Sayle, Ronald Cohen, Scott McKechnie, Stefano Forli, Steve Roughley, Steffen Moeller, Tim Vandermeersch, Tomas Racek, Tomáš Trnka, Tor Colvin, Torsten Sachse, Yi-Shu Tu, Zhixiong Zhao

On a personal note, in particular I'd like to thank Stefano, Steve and Steffen for contributing; not to mention Matt, Matt, Mathias and Maciej and David, David, David and David. Without them, ordering the authors by first name would not have so richly paid off.

As Chris Morley has taken a step back from the project, I did the Windows release for the first time. One change is that now we have a 64-bit version along with the 32-bit. I've also moved to supporting "pip install openbabel" as the primary means of installing the Python bindings - there's already been some work on this for Mac/Linux by Matt Swain and others.

If you have any comments/criticisms or need help, the best place to go is to our mailing list ( or file bugs on Github (click the green "New Issue").