Friday 8 October 2010

Measuring information loss in file format conversion Part III

In this (final) post on the topic (see Parts I, II), I will look at the error rate when going directly from SDF->Canonical SMILES (using Open Babel) versus going via an intermediate SMILES file, SDF->SMI->Canonical SMILES. If no information is lost in the SMILES step, the two routes should give identical output. Note that unlike InChI, there is no standard 'canonical SMILES' (although there has been some talk about this in the context of OpenSMILES); each toolkit has its own way of creating it.
C:\>obabel Conformers_00000001.sdf -osmi -O sdf_to_smi.txt
18084 molecules converted

C:\>obabel Conformers_00000001.sdf -ocan -O sdf_to_can.txt
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
18084 molecules converted

C:\>obabel -ismi sdf_to_smi.txt -ocan -O smi_to_can.txt
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
18084 molecules converted

C:\>diff sdf_to_can.txt smi_to_can.txt
136c136
< c12=NCCN=c1ncnc2 167
---
> C12=NCCN=C1NCNC2 167
5716c5716
< N12CC[C@@H](CC1)CC2 7527
---
> N12CC[C@H](CC1)CC2 7527
6928c6928
< c12c3c(cc4c1c1c(nn2)c2c(cc1cc4)cccc2)cccc3 9107
---
> c12c3c(cc4c1c1c([nH][nH]2)c2c(cc1cc4)cccc2)cccc3 9107
16377c16377
< c12c(c(c[nH]1)C[C@@H]1N3CC[C@H](C1)CC3)cccc2 21918
---
> c12c(c(c[nH]1)C[C@@H]1N3CC[C@@H](C1)CC3)cccc2 21918
17517c17517
< c\1(=c/2\[n+](=O)cccc2)/n(cccc1)[O-] 23699
---
> C1(C2[N+](=O)CCCC2)N(CCCC1)[O-] 23699

C:\>

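So, a failure rate of five out of 18084 (roughly 0.03%). 7527 and 21918 seem to be the same problem (the same substructure is involved, and in both cases a stereo descriptor flips between @@ and @), which is related to graph symmetry, while 167, 9107 and 23699 are kekulization issues: 167 and 23699 lose their aromaticity on the way through SMILES, and 9107 gains explicit hydrogens on its two ring nitrogens. (Update 11/10/2010: 167 and 23699 fixed.)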
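For anyone without a diff tool to hand, the same comparison can be sketched in a few lines of Python. This is only an illustration, not part of Open Babel: the function names `parse_line` and `find_mismatches` are made up for this example, it assumes each output line is a SMILES string followed by whitespace and then the title, and it assumes both files list the molecules in the same order. The sample data uses two records from the diff above plus one invented record (ethanol, titled "demo") that agrees in both files.

```python
def parse_line(line):
    """Split one line of SMILES output into (smiles, title).

    Assumes the SMILES string comes first and the title is whatever
    follows the first run of whitespace (tab or space).
    """
    parts = line.split(None, 1)
    smiles = parts[0] if parts else ""
    title = parts[1].strip() if len(parts) > 1 else ""
    return smiles, title


def find_mismatches(lines_a, lines_b):
    """Compare two outputs record by record.

    Returns (number of records compared, list of (title, smiles_a, smiles_b)
    for every record whose SMILES strings differ). Records are paired by
    position, so the comparison stops at the end of the shorter input.
    """
    total = 0
    mismatches = []
    for line_a, line_b in zip(lines_a, lines_b):
        total += 1
        smi_a, title = parse_line(line_a)
        smi_b, _ = parse_line(line_b)
        if smi_a != smi_b:
            mismatches.append((title, smi_a, smi_b))
    return total, mismatches


# Records 167 and 7527 are copied from the diff above; the "demo" record
# is invented for illustration and is identical in both lists.
direct = ["c12=NCCN=c1ncnc2\t167", "N12CC[C@@H](CC1)CC2\t7527", "CCO\tdemo"]
via_smi = ["C12=NCCN=C1NCNC2\t167", "N12CC[C@H](CC1)CC2\t7527", "CCO\tdemo"]

total, mismatches = find_mismatches(direct, via_smi)
for title, smi_a, smi_b in mismatches:
    print(title, smi_a, smi_b)

# Failure rate using the numbers reported in this post (5 of 18084).
rate_percent = 100 * 5 / 18084
print(f"{rate_percent:.3f}%")
```

To run it on the real output, pass two open file objects (for example `open("sdf_to_can.txt")` and `open("smi_to_can.txt")`) instead of the lists. Comparing strings is only meaningful here because both files were written by the same toolkit's canonicalization.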

Credits: The main Open Babel developers are Geoff Hutchison, Chris Morley, Tim Vandermeersch, Craig James and myself. Code contributions from many more.
