Tuesday 5 October 2010

Measuring information loss in file format conversion

If you've followed my posts on parsing stereochemistry in SMILES, you'll have realised that every conversion between chemical file formats carries a very real risk of losing or garbling information. There are several ways to identify such problems. One is to compare the results of a particular conversion against an independent standard.

Here I'll calculate the error rate of conversion from SDF to InChI using Open Babel, compared to doing the same conversion using the official InChI binary. Note that what we are actually calculating is the error rate of conversion from SDF -> Open Babel's internal chemical model -> InChI; it's not that Open Babel just hands the InChI code the raw SDF file.

This test uses the Open Babel (OB) 2.3 development code. The test file is the first file in PubChem3D, which contains 18084 3D structures.

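First, run the InChI binary (I'm using Windows). The /AuxNone option leaves out the auxiliary information, and any messages on stderr go to errors.txt: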
inchi-1.exe Conformers_00000001.sdf /AuxNone 2> errors.txt

Next, convert from SDF to InChI with obabel.exe (InChI format options described here; -xw tells Open Babel to ignore the less important InChI warnings):
obabel Conformers_00000001.sdf -oinchi -xw -O ob_results.txt

Clean up the InChI binary's output, which ended up in Conformers_00000001.sdf.txt, so that only the InChI lines remain (I have Cygwin installed on Windows):
grep "^InChI=" Conformers_00000001.sdf.txt > official_results.txt
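
If you don't have grep to hand, the same clean-up can be sketched in Python. This is only a rough equivalent of the command above; the function names are my own invention for illustration, and it assumes each InChI sits on its own line starting with "InChI=".

```python
def extract_inchis(lines):
    """Keep only the lines that start with 'InChI=' (like grep "^InChI=")."""
    return [line.rstrip("\r\n") for line in lines if line.startswith("InChI=")]


def extract_inchis_to_file(src_path, dst_path):
    """Copy the InChI lines from src_path to dst_path; return how many were kept."""
    with open(src_path) as src:
        inchis = extract_inchis(src)
    with open(dst_path, "w") as dst:
        dst.writelines(inchi + "\n" for inchi in inchis)
    return len(inchis)
```

For the files in this post, the call would be extract_inchis_to_file("Conformers_00000001.sdf.txt", "official_results.txt").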

Finally, compare the results:
C:\> diff official_results.txt ob_results.txt

C:\>
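
An empty diff only tells you that nothing differs. When something does, it helps to know which molecules are affected. Here is a minimal sketch of a position-by-position comparison in Python. The function name is hypothetical, and it assumes both tools wrote exactly one InChI line per molecule, in the same order; if either tool skipped a molecule, every later position would be reported as a mismatch.

```python
from itertools import zip_longest


def find_disagreements(official_lines, ob_lines):
    """Return (position, official, open_babel) for each position that differs.

    Positions are 1-based. They only match molecule numbers if both inputs
    contain one line per molecule in the same order (an assumption).
    """
    mismatches = []
    pairs = zip_longest(official_lines, ob_lines, fillvalue="")
    for position, (official, ob) in enumerate(pairs, start=1):
        official, ob = official.strip(), ob.strip()
        if official != ob:
            mismatches.append((position, official, ob))
    return mismatches
```

To use it on the two result files, read each one into a list of lines (for example with open("official_results.txt").readlines()) and pass both lists in.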

Too easy, huh? Let's try the first 10 files in PubChem3D instead: 166735 molecules.

Ah...now we have something. I found 15 molecules where the two InChIs disagree. Hmmmm...but 13 of these involve molecules with isotopes of Br...one quick bug fix later (SVN r4134), I have 2 errors left: molecules 144031 and 144132. Both have multiple double bonds in ring systems, and I think there may be a difference of opinion between Open Babel and InChI over the ring size below which the stereochemistry of double bonds is ignored...but that's a problem for another day.
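
To put those counts in terms of an error rate, the arithmetic is just disagreements divided by molecules. A tiny sketch using the numbers above (the function name is mine, for illustration only):

```python
def error_rate_percent(disagreements, total):
    """Percentage of molecules whose InChIs disagree."""
    return 100.0 * disagreements / total


TOTAL_MOLECULES = 166735  # first 10 PubChem3D files, as counted above

before_fix = error_rate_percent(15, TOTAL_MOLECULES)  # before SVN r4134
after_fix = error_rate_percent(2, TOTAL_MOLECULES)    # after the Br isotope fix

print(f"before: {before_fix:.4f}%  after: {after_fix:.4f}%")
```

That works out at roughly 0.009% before the fix and roughly 0.001% after it.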

So how does the current release compare to this? Not so well, not so well at all. We started reimplementing stereochemistry in Open Babel about 1.5 years ago, and it's only now we're getting such good performance. In short, if stereochemistry in InChI is important for your application, you should wait for the 2.3 release (or run the development code).

2 comments:

Richard Hall said...

That is a useful analysis Noel - have you tried other intermediate steps to see if you lose more information - eg sdf->smiles->inchi?

when does 2.3 get released?

Noel O'Boyle said...

@Richard: I understand that's all covered in Part II of this series (from __future__ insert link).

"When does 2.3 get released?" Sometime between 2.2 and 2.4 I guess. :-) Oh, OpenBabel 2.3? Maybe before the end of this month (resisting writing "for some value of month").