Monday, May 19, 2008

Introduction from chemistry notes

Hello world!

With this post I would like to start tracking the progress of my Google Summer of Code project. The idea of the project is to integrate chemistry and biology knowledge into the KDE desktop. Think of (bio)chemical meta data extraction, indexing and search and this is where you meet Strigi.

Based on the powerful concept of Jstreams, Strigi is a high-performance desktop search engine, which is now an integral part of KDE4. Strigi can use different backends (clucene, sqlite3, ...) and is built around a simple yet very powerful idea of pluggable stream analyzers. This architecture leads to a very small number of dependencies and places an emphasis on interfaces. The interface is of your choice: link directly, use sockets, dbus or even command line utilities. The main Strigi developers, Jos van den Oever and Flavio Castelli, have already done a great job by providing a stable engine, and now Strigi is moving towards integration with the Nepomuk semantic desktop project and the Freedesktop.org Xesam specifications.

Nepomuk focuses on meta data ontologies and relations. Sebastian Trueg is the leader of the KDE-Nepomuk project and there is also one GSoC student, Dmitriy Soloduhin, involved in it. And thanks to Phreedom (Evgeny Egorochkin), we now have Nepomuk ontologies in Strigi.

Xesam provides unified API specifications for search and metadata services, the result of a collaboration between Freedesktop.org and the Strigi, Beagle, Tracker, Pinot, Recoll and Nepomuk-KDE projects.

Now back to chemistry. Blue Obelisk establishes interaction between open projects dealing with chemical systems and cultivates standards such as InChI and CML. Blue Obelisk was born in the US at an ACS meeting, but has many of its roots in the University of Cambridge, in the group of Peter Murray-Rust, and the University of Cologne (CUBIC). Christoph Steinbeck's group at CUBIC brought to life open projects such as CDK, Bioclipse and NMRShiftDB. I am happy that Egon Willighagen, who was a member of Steinbeck's group and is an active contributor to numerous open source projects, is now my mentor and supervisor in this GSoC project.

I was lucky to study Bioinformatics in CUBIC for the last year. I am very excited about my ion channel project which is now over, and I hope to stay with the topic during my PhD studies. By the way, if you have any open PhD positions for bioinformaticians, please let me know.

It is hard to resist the temptation to tell you some interesting facts on ion channels, but returning to the main topic I should tell you about the key projects that are very important for my GSoC project. These are BODR, Chemical MIME, OpenBabel, InChI, CML, chemical structures, Avogadro and Kalzium.

BODR stands for Blue Obelisk Data Repository and is a shared repository for many important chemoinformatics data.

Chemical MIME expands the list of standard MIME types with chemical file formats and provides example files for each format. Daniel Leidert maintains the chemical-mime-data database in Linux distributions. It conforms to David Faure's specifications for MIME type databases in KDE4, and the automagical type detection relies on it. The file extension is not enough to uniquely identify the MIME type: e.g. ".sdf" stands for the SD chemical format and at the same time for a StarOffice Math Document.

OpenBabel is both a library and a command line toolbox for manipulating chemical data in different formats. It fully supports Chemical MIME. Jerome Pansanel maintains the KOpenBabel wrapper (a Qt and KDE GUI for the OpenBabel converter) and also a large set of molecules in CML format, called Chemical Structures 2.0.1. This is also very important, because CML (Chemical Markup Language) is an XML-based chemical format which is supposed to become the standard.

Some public databases, like PubChem and BODR Chemical Structures, provide the InChI identifier, which is an IUPAC standard. InChI represents a chemical structure in an unambiguous way. OpenBabel can generate InChIs from chemical structures. NCI and Kegg databases in CML with InChIs generated can be viewed at NCI and Kegg.

The BKchem chemical drawing program by Beda Kosata can regenerate structures from InChIs. Other interesting chemical drawing programs, which at the moment cannot import InChIs, are GChemPaint and Molsketch. BKchem uses Tk widgets, and GChemPaint is a part of the GNOME desktop. Molsketch by Harm van Eersel is a molecular drawing tool for KDE. If supplied as a KPart, Molsketch can find a bright future in different KDE4 applications.

Kalzium is a part of kdeedu. It started as a periodic table of the elements program by Carsten Niehaus and is now gaining momentum and attracting more hackers who want better chemistry support in KDE. Kalzium/Avogadro is a 3D molecular visualization library maintained by Benoît Jacob. It uses Eigen, a lightweight linear algebra C++ template library which is already a part of KDE4. Kalzium/Avogadro has acquired another GSoC student -- Marcus Hanwell. The leader of another interesting chemical KDE project, KryoMol, Armando Navarro Vázquez, has recently sent a patch to separate the Kalzium Molecular Viewer as a KPart.

Kfile-chemical is a project started by Egon and later supported by Jerome and Daniel. Initially it was a set of kfile plugins that allowed chemical meta data extraction. But with the initiative to port kfile plugins to Strigi, kfile-chemical now provides Strigi with chemistry-aware stream analyzers. It is hosted in KDE SVN Playground, and since it aims to have a low number of dependencies it has the potential to become a part of kdeedu, for example.

Since kfile-chemical is where I make my first efforts, I'll briefly describe what I am doing now and what you can expect by the end of the project.

  1. Make all kfile-chemical analyzers compatible with the Strigi/KDE/Nepomuk chemical ontology. This means that there are chemical field properties defined in Strigi, and during the metadata extraction process stream analyzers are supposed to fill in the relevant fields. The chemical field properties at the moment are: chemistry.inchi, chemistry.molecular_formula, chemistry.molecular_weight, chemistry.pdbid, chemistry.xray_resolution. Other properties are supposed to be stored in generic field properties, such as content.title and container.items;
  2. Generate InChI (chemistry.inchi) for structures, which do not have it already, using OpenBabel library;
  3. Provide a test suite for the analyzers to make sure nothing breaks when one of the libraries in this mixture is updated;
  4. Expand the list of supported chemical file types to cover as many of Chemical MIMEs as possible.
If it goes smoothly, I will try to integrate OSCAR3 to process plain text and create InChIs for molecules found in that text. This will allow indexing and semantic linking of the literature and the chemical files.

On our first meeting, Egon suggested that I should provide a KDE4 GUI chemical search tool, which could possibly be expanded to more generic purposes, like querying abstract field properties from the KDE-Nepomuk ontology. This is also great to test all the technologies and libraries involved. I won't bloat this post with the mockups and screenshots, because it is already quite long, but I will certainly come back to it later this week. So, the idea is to have the following workflow implemented:
  1. While indexing, InChI string is extracted or generated with the help of libOpenBabel by one of the kfile-chemical analyzers;
  2. InChI string is stored in chemistry.inchi field property in Strigi storage;
  3. It can be queried directly by issuing a "chemistry.inchi:" query in strigiclient;
  4. The GUI tool can use the Molsketch KPart to input the structure with the mouse. The structure is then converted using OpenBabel to InChI and used as a search key;
  5. The name of the compound, or the synonym, can be specified as a search key;
  6. The search query is sent to Strigi via dbus and the search results received in response;
  7. Search results are either text documents or chemical structures. To visualize chemical structures the Kalzium/Avogadro KPart can be used.
BTW, is it possible to do a substructure search using InChI, not talking about ignoring some InChI layers?

This project is a good stress test for all Strigi technologies. But I also hope to be useful to Strigi by extending some functionality and writing unit testcases. And I am sure Jos won't let me go just like that :-) also because my primary affiliation here is KDE/Strigi.

I am happy to have this opportunity to work side-by-side with very skilled open source developers and to teach myself a good style, and of course, to have this project right on the intersection of my interests: open source, Linux, KDE and Bio(Chemo)Informatics.

The initial project proposal can be found here.

Now a few words about my current progress. While trying to get all the tools and libraries listed above working on my machine, I was surprised by a crash in KDE/Avogadro, which was caused by a bug in my radeon Mesa DRI drivers. Fortunately I was able to trace the problem and fixed this annoying crash in the Avogadro OpenGL initialization. Then I switched to kfile-chemical and started adapting the CML stream analyzer to the current standard and ontology. To get my testcases done, I had to introduce passing filters to the command line strigicmd tool. While working on the tests, involving all CML structures from Chemical Structures 2.0.1, I realized that because the current CML metadata extraction is not XML-aware I have to rewrite it using StreamSaxAnalyzer to make it work as desired. This is what I am doing at the moment.

My next steps would be: to generate InChI for CML lacking the identifier. Then I will improve all other available chemical analyzers and create tests for them. Then I will run some productivity tests involving the mirror of the PDB database. And, of course, I will start implementing the GUI chemical query tool.

Progress report and back on track from chemistry notes

Preamble.
I'm happy to get back to hacking this week.
The last two weeks were almost completely lost due to some urgent real-life issues back in my home country, so I had no choice but to move my flight and solve the problems. This was completely unplanned and left me for two weeks without a single commit, making my supervisors nervous about the outcome of my project. Now I can be on the channel, commit daily and blog twice a week, as Egon recommended and as he now insists.

CML2 SAX streamanalyzer in kfile-chemical.
While reading the CML specifications, I thought that there is too much flexibility in them, which makes CML hard to parse. To start, I took a few CML2 samples from Jerome's Chemical Structure 2.0 project, which is a part of the BlueObelisk data repository. These files already contain the information; it just had to be extracted. I wrote the analyzer based on streamsaxanalyzer. I used the xmlindexer and strigicmd tools to see how the analyzer works. I will try to extend the analyzer to support the variety of CML I can find in the wild. To distribute sample test files together with kfile-chemical I need them to be free, i.e. to have a proper license. I am not sure whether the test files from the Chemical MIME project can be included. Please comment on that if you have any clue.

Test suite in kfile-chemical.
I have added a python test suite, and the first testcase of 20+ tests is for the CML analyzer. Strigi is interfaced via xmlindexer and strigicmd. I find these command line tools useful for testing, since they do not use any central storage or daemons to work. The test fixtures prepare a clean directory and the list of sample CML files, so that every test in the testcase is executed in a clean environment. For all the tests I have used the clucene backend. All tests run pretty fast, except for the valgrind test for memory leaks.

The analyzers which are not covered by tests and which are not compliant with the current Strigi ontology fieldproperties have been temporarily disabled. You can expect most of them to be fixed and enabled again later this week.

The CML testcase showed that querying by InChI (chemistry.inchi=...) gives me false positives. So the question now is whether FieldRegister::stringType is suitable for handling exact identifiers like InChI or whether it is better to make the field binary.

I was also wondering why chemistry.name (content.title is its parent) in xmlindexer is turned to content (exactly, not content.title) in strigicmd with clucene backend.

The search by the content.version field returns no results, and querying a float molecular weight (chemistry.molecular_weight:58.1222) gives me no results either.

This leaves me with 3/20 tests failing.

InChI generator.
An InChI uniquely identifies a chemical structure. That is why it is a good idea to have InChIs for all the analyzed chemical files, where possible. OpenBabel can convert any recognised format to InChI strings. I made a working example to see if it is easy and fast enough to generate InChIs in a Strigi streamanalyzer. It is called inchi-generator and it works for valid CML2 files only. I had to buffer the contents of the Strigi stream to pass it to the OpenBabel converter, but I feel there could be a more elegant solution, since OpenBabel works with streams as well; they are just not compatible with Strigi streams.

Linking to OpenBabel.
I had very strange problems with unresolved symbols in OpenBabel format plugins until Geoffrey helped me. It's all about plugins! Strigi loads streamanalyzers with dlopen() on Linux and so does libopenbabel when it needs a format plugin. The solution was simple: add RTLD_GLOBAL to the code which loads libopenbabel. Since libopenbabel is linked to inchi-generator, RTLD_GLOBAL had to be added to the Strigi loader. I wonder if it can cause problems for other analyzers. Another solution would be to load libopenbabel from inchi-generator at runtime.
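The difference the flag makes can be tried from Python, whose ctypes module passes the same modes to dlopen(). This is only a sketch, assuming a POSIX system: load_shared is a name made up for this example, and the C library stands in for libopenbabel.

```python
import ctypes
import ctypes.util


def load_shared(name, share_symbols=False):
    """Load a shared library the way a plugin loader would call dlopen().

    share_symbols=True corresponds to RTLD_GLOBAL: the symbols of the
    library become visible to libraries loaded later (e.g. format plugins).
    share_symbols=False corresponds to RTLD_LOCAL: the symbols stay private.
    """
    mode = ctypes.RTLD_GLOBAL if share_symbols else ctypes.RTLD_LOCAL
    return ctypes.CDLL(name, mode=mode)


# The C library is used only because it is present everywhere;
# in this post the library in question would be libopenbabel.
libc_name = ctypes.util.find_library("c")
libc = load_shared(libc_name, share_symbols=True)
print(libc.abs(-7))
```

With share_symbols=False a plugin loaded afterwards cannot resolve symbols against the library, which matches the unresolved symbols described above.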

Openbabel 2.1 (SVN) Debian packages
The FindOpenBabel2.cmake script by Carsten is used in KOpenBabel, Kalzium and now in kfile-chemical. It requires --atleast-version=2.1.0. In Debian unstable you can only find version 2.0.2. Michael Banck, Debian maintainer, provides build rules in debichem repository. I do not know what could be the reason for the new version not to be available in SID. Probably it's related to the patches to provide a better version abstraction, e.g. to have two OpenBabel versions installed at the same time.
Anyway, as OB 2.1 is a requirement and can be packaged, I have put the x86 Debian libinchi, libopenbabel and openbabel packages here http://neksa.net/debian/.

Strigi chemical fieldproperties.
Talking to Phreedom at the very beginning of my project, I thought that the chemical fieldproperties should represent the minimal set of metadata attributes, but in practice, taking into account the variety of chemical formats, it is hard to define the list once and for all. That is why I have added a few other chemistry.* fieldproperties. Among these are IUPAC Name, PubChem Compound ID, experimental method of structure elucidation, some physicochemical properties which are to my mind most queried in PDB, and a few more statistical counters. It is better to remove unused properties later than to leave out metadata that could have been extracted.

Further steps.
The test suite has to be expanded to cover all the formats currently existing in kfile-chemical. The analyzers need to be fixed to match current Strigi ontology. This could be done during this week.

Openbabel integration requires more attention, since there is no "magic" MIME detection. I will try to employ Chemical MIME patterns to do the detection. InChI generation is only possible if we know the source format. Strigi is stream-based, hence we can not look at the file extension in streamanalyzers.

And of course, as eye candy, a GUI chemical search tool is one of my deliverables.

I would also love to spend some time on Strigi, perhaps Jos will find the kfile-chemical testsuite good for testing the built-in analyzers.

Offtopic.
July 18th -- 25th I will be attending the annual conference of the International Society for Computational Biology (ISCB) and the satellite meetings. This time in Vienna, Austria. If you happen to be there at the same time please contact me.

19-20 : 3DSig Structural Bioinformatics and Computational Biophysics meeting.
21 : 3rd ISCB Student Council Symposium
21-25 : ISMB/ECCB conferences.

My submission has been accepted, so I will be presenting some of the results of my CUBIC project.

One piece of good news from CUBIC. Thanks to the project, my final grade is now A = First Class Honours. I hope this will improve my chances of getting a nice place for the PhD.

Strigi website

I am not yet addicted to blogging and it is still a pain. It is more like you are preparing the food and every now and then people enter and ask you "what's cooking?". But there actually is much in common between Open Source development and TV cooking show. So this time I will split my reports into small posts: about website, testsuite, analyzers and GUI. Hope it works better this way.

Strigi website has been broken for some weeks. Not that broken, but people could not log in and post updates. For dynamic projects, like Strigi, staying on-air means a lot. And even much more with aKademy 2007 around the corner.

I have spent too much time programming PHP in the last years (I wish that was C++), but I did not expect this experience to be helpful in my chemical GSoC project. Well, I fixed Drupal and now the site is alive again. But there won't be a story without a mystery. In this case it's the mysterious mail service at Sourceforge. Many content management systems, and Drupal is no exception, want to send mail to the users. For abuse/security reasons Sourceforge web hosting does not provide access to sendmail, nor does it allow outgoing network connections. But there is a workaround.

Sourceforge shell servers and web servers are different machines sharing the same disk partitions. Though you cannot send any mail from the web servers, you can do it from your account on your Sourceforge shell server. Thus, put your outgoing mail from the web server in a queue in a mysql database, fetch it regularly with a cron script running on your shell server and feed it to sendmail. It works well, except for the cron part: there is a crontab on the shell server, but unfortunately no crond running. Another workaround is to fetch mail from the mysql queue with a cron script running on a remote mailserver. A PHP XML RPC call does the job.
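The queue idea itself is small. Here is a sketch of it in Python, with an in-memory sqlite3 database standing in for mysql and a plain callable standing in for sendmail; the table and function names are invented for the example and are not the ones used on the Strigi site.

```python
import sqlite3


def create_queue(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail_queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "recipient TEXT NOT NULL, subject TEXT NOT NULL, body TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )


def enqueue(conn, recipient, subject, body):
    """Web server side: store the message instead of calling sendmail."""
    conn.execute(
        "INSERT INTO mail_queue (recipient, subject, body) VALUES (?, ?, ?)",
        (recipient, subject, body),
    )
    conn.commit()


def flush(conn, deliver):
    """Cron side: hand every unsent message to `deliver`, then mark it sent."""
    rows = conn.execute(
        "SELECT id, recipient, subject, body FROM mail_queue "
        "WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, recipient, subject, body in rows:
        deliver(recipient, subject, body)
        conn.execute("UPDATE mail_queue SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
create_queue(conn)
enqueue(conn, "user@example.org", "Welcome", "Your account is ready.")
outbox = []
flush(conn, lambda to, subject, body: outbox.append((to, subject)))
print(outbox)
```

Because flush() only marks a row as sent after deliver() returns, a failure in the middle leaves the remaining messages in the queue for the next run.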

Résumé: moving the website to another hosting is not a bad idea after all. What do you think?

kfile-chemical/STRIGI -> strigi-chemical from chemistry notes

Kfile-chemical had three branches:

  • KDE3, where all chemical analyzers were KFilePlugin's and provided KFileMetaInfo;
  • KDE4, where nothing happened since it was branched;
  • STRIGI, all the metadata extractors were Strigi StreamLineAnalyzers.
I have started my project in kfile-chemical, but it will end up in a different tree. As it was recently proposed by Egon (my GSoC mentor) and confirmed by Jerome (kfile-chemical maintainer) and Jos (Strigi core developer), the STRIGI branch was separated from kfile-chemical.

Now it is called strigi-chemical. The reason is that it has no KDE dependencies at the moment. Strigi-chemical also lives in playground /utils/strigi-chemical/.

The situation at the moment is:
  • kfile-chemical is now what was previously kfile-chemical/KDE3 branch
  • kfile-chemical/KDE4 branch removed
  • strigi-chemical is now what was previously kfile-chemical/STRIGI branch

Strigi-chemical test suite from chemistry notes

The test suite of strigi-chemical deserves special attention, because it will probably be used for all other Strigi analyzers; moreover, it could be useful for Blue Obelisk projects.

The test suite is a set of python scripts using the python unittest infrastructure. The suite provides StrigiTestCase as a base class for all test cases. It is a wrapper around the Strigi command line tools strigicmd and xmlindexer and ensures that the fixtures have proper isolation. The test runner executes all testcases it can find in the current directory.
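To give an impression of what fixture isolation means here, below is a stripped-down sketch. IsolatedIndexTestCase is a hypothetical stand-in for StrigiTestCase, the sample file content is a placeholder, and the calls to strigicmd and xmlindexer are left out because their arguments are not spelled out in this post.

```python
import os
import shutil
import tempfile
import unittest


class IsolatedIndexTestCase(unittest.TestCase):
    """Hypothetical stand-in for StrigiTestCase: fresh directories per test."""

    # file name -> contents of the sample files copied into every fixture
    samples = {"water.cml": "<molecule id='water'/>"}

    def setUp(self):
        # Each test gets its own data and index directories.
        self.workdir = tempfile.mkdtemp(prefix="strigi-test-")
        self.datadir = os.path.join(self.workdir, "data")
        self.indexdir = os.path.join(self.workdir, "index")
        os.mkdir(self.datadir)
        os.mkdir(self.indexdir)
        for name, text in self.samples.items():
            with open(os.path.join(self.datadir, name), "w") as handle:
                handle.write(text)

    def tearDown(self):
        # Nothing written by one test survives into the next one.
        shutil.rmtree(self.workdir)


class CmlFixtureTest(IsolatedIndexTestCase):
    def test_samples_are_present(self):
        self.assertEqual(os.listdir(self.datadir), ["water.cml"])

    def test_index_starts_empty(self):
        self.assertEqual(os.listdir(self.indexdir), [])


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CmlFixtureTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

A real test case would index self.datadir into self.indexdir with the command line tools inside the test method and then assert on the query results.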

Each test focuses on a certain data format. Sample data is very important to have in hand before writing and testing the analyzers. Egon started a project recently to provide a central repository of chemical test files for Blue Obelisk. The problem at the moment is that the repository is incomplete. CALL FOR DATA is announced -- any chemical file with a free license can enter the repository. If you want your chemical data files to be recognized by Strigi, please release samples of your files under an OSI-approved license!

Subversion provides a very nice trick for including external repositories in the project tree. Once set up, it requires no further action. For interested developers: use svn propset, propget and proplist to manipulate the svn:externals property. In our case, after checkout, ctfr will appear as a subdirectory in /test:

ctfr http://blueobelisk.svn.sourceforge.net/svnroot/blueobelisk/ctfr/trunk/

TestFileRepository is a python class for generalized access to the contents of the XML-based CTFR repository. Every testcase inherits from StrigiTestCase an initialized self.ct object, which you can ask to getTOC(), getFiles() or getFileByName() without taking care of the CTFR internals.

Strigi-chemical testcases have already helped me to detect problems with text tokenizers, float values and keyword queries. Fortunately, all these problems have already found their solutions in the Strigi core. And now the tests won't allow these problems to appear again without being noticed.

Strigi plugins, tokenizers and ontology from chemistry notes

This is a belated post on changes in Strigi I made in the beginning of July.
The tests in Strigi-chemical have highlighted some problems in Strigi. Below I list each problem and its solution.

Problem with plugins.

Strigi analyzers are loaded as plugins with dlopen(). I noticed random and strange effects when multiple copies of one analyzer were initialized. The explanation was simple: never use the RTLD_GLOBAL flag to load Strigi analyzers. A while ago I blogged about linking to OpenBabel and enabled this flag in Strigi. The solution was to implement a loader for libopenbabel in the strigi-chemical analyzer. LibOpenBabel is a highly recommended runtime dependency, although the chemical analyzers can work without it. A question to packagers would be: is it better to specify OB as a dependency, as suggested or as recommended? There is also the option of a metapackage with the dependency. We don't want users to miss key features just because they overlooked a soft dependency in apt, for example.

Later, OB loader was wrapped in a singleton with mutex locking. I had some random core dumps of the unit tests due to thread safety violation. I had no idea what could cause it inside OpenBabel, so I just protected the instance.
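The pattern is the usual one: a single shared instance, and a lock around every call into the library. The real wrapper lives in C++ inside strigi-chemical; the Python sketch below only shows the idea, and ConverterGuard and fake_backend are names invented for it.

```python
import threading


class ConverterGuard:
    """Process-wide single instance whose calls are serialised by a mutex."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, backend):
        self._backend = backend              # the non-thread-safe callable
        self._call_lock = threading.Lock()

    @classmethod
    def instance(cls, backend=None):
        # Creation is guarded so two threads cannot both build the instance.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls(backend)
            return cls._instance

    def convert(self, data):
        # Only one thread at a time may enter the wrapped library.
        with self._call_lock:
            return self._backend(data)


calls = []


def fake_backend(data):
    # Stands in for the library call that is not safe to run concurrently.
    calls.append(data)
    return data.upper()


guard = ConverterGuard.instance(fake_backend)
threads = [
    threading.Thread(target=guard.convert, args=("c%d" % i,)) for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(calls))
```

Serialising every call costs parallelism, but it is a reasonable trade when the source of the thread safety violation inside the library is unknown.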

Problem with tokenizers.

One of the key features of chemical analyzers is to perform an exact structure match search by chemical identifiers. The recommended IUPAC identifier is InChI, although it is not as widespread as SMILES, for example. The power of InChI is that it represents a chemical structure with all layers of information (like charges, fixed hydrogens, isotopes and stereochemistry) as a string. And the reverse transformation is possible.
The problem here was that an InChI string was tokenized, i.e. cut into pieces. A few words about how it works. Each token represents a "word" in the dictionary of the search engine's backend. Strigi has an indexreader/indexwriter abstraction over the search backends; it can even support a hybrid backend, e.g. a mixture of clucene and sqlite. The clucene backend is well supported, while the support of other backends is still rudimentary (developers welcome!). So, each field is processed by an analyzer/tokenizer during indexing, and the same tokenizer has to be used to process the field during search query analysis. Some values, like InChI, deserve special treatment.

Have a look at the Caffeine InChI to have an impression of what I'm talking about:

InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

The solution was to add a special flag to chemistry.inchi ontology field property that would indicate that a special tokenizer is required. I added special index control flags, that could tune the behavior of fields in the index (at the moment supported only in clucene backend). These flags are boolean: Binary, Compressed, Indexed, Stored, Tokenized. By default Stored|Indexed|Tokenized are enabled.
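To see what the Tokenized flag changes, here is a toy model in Python. The five flag names and the default combination are taken from the paragraph above; the IndexFlag class, the index_terms function and the regular-expression tokenizer are assumptions made for this sketch and are much simpler than what the clucene backend does.

```python
import enum
import re


class IndexFlag(enum.Flag):
    BINARY = enum.auto()
    COMPRESSED = enum.auto()
    INDEXED = enum.auto()
    STORED = enum.auto()
    TOKENIZED = enum.auto()


DEFAULT_FLAGS = IndexFlag.STORED | IndexFlag.INDEXED | IndexFlag.TOKENIZED
# An identifier field keeps the defaults but switches tokenizing off.
IDENTIFIER_FLAGS = DEFAULT_FLAGS & ~IndexFlag.TOKENIZED


def index_terms(value, flags):
    """Return the terms a backend would put into its dictionary."""
    if not flags & IndexFlag.INDEXED:
        return []
    if flags & IndexFlag.TOKENIZED:
        # Toy tokenizer: split on anything that is not a letter or digit.
        return [term for term in re.split(r"[^A-Za-z0-9]+", value) if term]
    return [value]


caffeine = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
print(index_terms(caffeine, DEFAULT_FLAGS)[:4])
print(index_terms(caffeine, IDENTIFIER_FLAGS))
```

With the default flags the identifier falls apart into short terms such as "InChI" and "1" that many structures share, which is where false positives come from; without Tokenized the whole string is a single term and only an exact match hits.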

Ontology database.

This would not be worth a blog post if it were so simple. Current Strigi uses fieldproperties files as a draft ontology database. The registerField() API did not respect the database and passed cardinality and child-parent relations as call parameters. To make my index control flags work, this behavior was changed and the values are now loaded from the database. This left the registerField() API call with only one parameter: the field name. Loading and control of MinCardinality and MaxCardinality from the database was implemented as well.

Why is the fieldproperties database obsolete? Well, XESAM is here and Strigi has to be 100% XESAM compatible. Jos implemented the new dbus XESAM interface, Flavio added a new query parser, and Phreedom is hacking an RDF parser for the new ontology database. Add new user query support and you will get a completely XESAM-compatible Strigi (cross-)desktop search engine.
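The change can be pictured with a toy registry in Python. Only the idea comes from the paragraph above (the caller passes a field name, everything else is looked up); the FieldRegistry class, the dictionary layout and the cardinality values are made up for the illustration and do not mirror the real fieldproperties files.

```python
class FieldRegistry:
    """Toy registry: field attributes come from a database, not from callers."""

    def __init__(self, database):
        self._database = database
        self._registered = {}

    def register_field(self, name):
        # The only argument is the field name; everything else is looked up.
        if name not in self._database:
            raise KeyError("unknown field: %s" % name)
        self._registered[name] = self._database[name]
        return self._registered[name]

    def check_cardinality(self, name, values):
        props = self._registered[name]
        low = props.get("min_cardinality", 0)
        high = props.get("max_cardinality")      # None means unbounded
        return len(values) >= low and (high is None or len(values) <= high)


# Made-up entries; the parent of chemistry.name is the one named in an
# earlier post, the cardinalities are assumptions.
database = {
    "content.title": {"parent": None, "max_cardinality": 1},
    "chemistry.name": {"parent": "content.title"},
    "chemistry.inchi": {"parent": None, "max_cardinality": 1, "tokenized": False},
}

registry = FieldRegistry(database)
inchi = registry.register_field("chemistry.inchi")
print(inchi["tokenized"])
print(registry.check_cardinality("chemistry.inchi", ["a", "b"]))
```

Keeping the attributes in one place means that an analyzer cannot register the same field with conflicting parents or flags.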

Strigi's-eye view on chemistry support (aKademy slides) from chemistry notes

In his aKademy 2007 talk Jos van den Oever explained why Strigi is more than searching. I prepared two slides on strigi-chemical for this presentation. There was definitely not enough time for Jos to give all the details during his talk, so I decided to repeat the two slides here, because I think they give quite a good picture of the components and would be nice supplements to my previous posts on Strigi analyzers.


August 10, 2007

Strigi-chemical analyzers inside

Ten things to have a better understanding of Strigi analyzers:

  1. the only data source is an input stream (think jstreams, not std);
  2. stream providers take care about the embedded substreams, unzip, decode, deflate, etc;
  3. every analyzer gets the stream and takes a bite (the bite is 1024 bytes at the moment);
  4. if it does not like the taste, the analyzer gives the stream back, e.g. a JPEG file does not sound nice to the MP3 analyzer, so it won't take another bite;
  5. analyzers work in parallel, drawback: one analyzer can not use the results of analysis from the other one;
  6. analyzer is parsing the stream and takes the decisions to index some data as fields of the ontology;
  7. ontology describes the (meta)data fields, all the attributes, flags and the hierarchy;
  8. according to ontology field description, indexwriters handle the data passed by the analyzers;
  9. indexing process can be controlled by analyzer configurations, the configuration can be represented as a config file;
  10. analyzers can be distributed as plugins and loaded/included/excluded at runtime
Two problems have to be taken into account:
  1. some greedy analyzers have a bad habit of reading every stream till the very end (e.g. cppanalyzer, yes C++ source code), which can hurt performance;
  2. in some cases, the extracted data is not enough and additional meta data has to be calculated or generated. A very simple example would be to count the comments in the C++ source file, a more advanced example: to generate an InChI identifier based on the chemical structure extracted.
If a greedy analyzer performs some calculations and it is slow, this will slow down the whole indexing procedure.

The obvious solution would be to make the analyzers more selective, and if possible -- not greedy. As for the slow analyzers, they should be optional and highly selective about what they process.
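Points 3 and 4 of the list above can be sketched in a few lines of Python. The 1024-byte bite is the number given in the list; HeaderAnalyzer and interested_analyzers are hypothetical names, and real Strigi analyzers are C++ classes with a richer interface.

```python
import io

BITE_SIZE = 1024  # bytes offered to each analyzer, as stated in the list above


class HeaderAnalyzer:
    """Hypothetical analyzer that accepts a stream by its leading bytes."""

    def __init__(self, name, signature):
        self.name = name
        self.signature = signature

    def check_header(self, bite):
        return bite.startswith(self.signature)


def interested_analyzers(stream, analyzers):
    """Give every analyzer the same first bite; keep those that accept it."""
    bite = stream.read(BITE_SIZE)
    return [a.name for a in analyzers if a.check_header(bite)]


analyzers = [
    HeaderAnalyzer("jpeg", b"\xff\xd8\xff"),
    HeaderAnalyzer("png", b"\x89PNG\r\n\x1a\n"),
]
stream = io.BytesIO(b"\x89PNG\r\n\x1a\n" + b"\x00" * 2000)
print(interested_analyzers(stream, analyzers))
```

An analyzer that rejects the bite costs almost nothing, which is why selective header checks matter so much for indexing speed.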

Black magic of MIME

Shared-mime-info is a specification, a freedesktop.org standard for MIME description databases. Chemical-mime-data follows the same specification, but adds support for chemical formats. MIME is a contents-based identifier of the stream type. Sometimes MIME is mistakenly assigned by file extension only. The rules in shared-mime-info and chemical-mime-data check not only the file extension, but also look at the contents of the stream. Telling the MIME type from a few certain bytes in a certain position range is considered to be black magic. A lot of formats, text formats in particular, cannot be identified easily with these magic rules.

For some file formats, header checks in analyzers perform exactly the same procedure as MIME checks. For analyzers of this kind it would be a good idea to rely on the mimetype. Because you cannot use the result of one analyzer in another (explained in the ten points above), you cannot expect the mimetype(s) to be there when you start the analysis. Actually, there is a workaround, but it's tricky and potentially unstable.
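A magic rule of this kind boils down to "pattern P must start somewhere between offset A and offset B". The Python sketch below shows only that mechanism; match_magic is a made-up name and the rule table is invented for the example, not copied from any real MIME database (in particular, the chemical/x-cml rule is a guess at what such a rule could look like).

```python
def match_magic(data, rules):
    """Return the first MIME type whose pattern starts within its offset range."""
    for mime, first, last, pattern in rules:
        for offset in range(first, last + 1):
            if data[offset:offset + len(pattern)] == pattern:
                return mime
    return None


# (MIME type, first offset, last offset, byte pattern); illustrative only.
RULES = [
    ("application/pdf", 0, 0, b"%PDF-"),
    ("image/png", 0, 0, b"\x89PNG\r\n\x1a\n"),
    ("chemical/x-cml", 0, 64, b"<molecule"),
]

print(match_magic(b"%PDF-1.4\n", RULES))
print(match_magic(b'<?xml version="1.0"?>\n<molecule id="m1">', RULES))
print(match_magic(b"just some text", RULES))
```

Binary formats with a fixed signature match reliably; for text formats the pattern can sit almost anywhere or be missing, so the offset range has to be wide and the rule becomes both slower and weaker.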

For chemical formats, it is essential to know exactly the MIME type of the stream. Chemical-mime-data could partly help here. That's why I took the code from MIME type analyzer to make a helper out of it, just like in external library case.

Concerning the greediness of the analyzers, there could be inverse logic involved. Jos has already introduced some optional constraints in analyzerconfiguration to limit the amount of data a certain analyzer can consume: not more than a limited number of bytes. This could help. Those analyzers which do not look for a fingerprint of the file format in the header could be guided by a negative-control rule: if a non-ASCII character is encountered -- stop; if an invalid UTF-8 sequence is found -- stop; i.e. if something unexpected is found, we stop processing.
Another technique, which is already employed, is to look for the necessary minimum of data and then stop the analysis. If all the fields we are looking for are already found, why ask for more?
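The three stop conditions (byte budget, negative control, everything found) fit into one small loop. The sketch below is in Python and parses a made-up "key: value" text format; scan_text_stream, its parameters and the key names are assumptions for the illustration, not part of Strigi.

```python
import io


def scan_text_stream(stream, wanted, max_bytes=4096, chunk_size=1024):
    """Collect `key: value` lines until done, over budget, or surprised.

    Returns (fields, reason), where reason is 'complete', 'limit',
    'unexpected' or 'eof'. A last line without a trailing newline is
    ignored in this sketch.
    """
    fields = {}
    consumed = 0
    pending = b""
    while consumed < max_bytes:
        chunk = stream.read(min(chunk_size, max_bytes - consumed))
        if not chunk:
            return fields, "eof"
        consumed += len(chunk)
        if any(byte > 0x7F for byte in chunk):
            return fields, "unexpected"      # negative control: not ASCII
        pending += chunk
        *lines, pending = pending.split(b"\n")
        for line in lines:
            key, sep, value = line.partition(b":")
            if sep and key.decode() in wanted:
                fields[key.decode()] = value.strip().decode()
        if wanted.issubset(fields):
            return fields, "complete"        # everything found: stop early
    return fields, "limit"


sample = io.BytesIO(b"formula: C4H10\nweight: 58.1222\n" + b"x" * 10000)
print(scan_text_stream(sample, {"formula", "weight"}))
```

In this run the scan stops after the first chunk, so roughly nine tenths of the stream is never read.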

Helpers

The helpers in strigi-chemical are shared libs with thread-safe wrappers around the MIME analyzer and libOpenBabel.

The typical workflow is:
  1. detect and check the MIME type;
  2. if it does not match -- stop
  3. if MIME matches or could not be detected -- perform further analysis
  4. look for recognizable data, but do not index it yet
  5. if something strange found, stop and discard the stream, we do not need false positives
  6. if all the data collected -- stop
  7. if something is missing and could be generated without using OpenBabel -- generate it
  8. if something is missing and could be generated by OpenBabel -- call OB helper
  9. add data to index
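The nine steps can be written down as one function. The Python sketch below takes the helpers as arguments so that it stays self-contained; analyze and all toy_* functions are hypothetical, and the placeholder string returned by toy_ob_generate is not a real InChI.

```python
import re


def analyze(data, expected_mime, detect_mime, parse, derive, ob_generate, required):
    """Return the fields to index, or None when the stream is rejected."""
    mime = detect_mime(data)                          # step 1
    if mime is not None and mime != expected_mime:    # step 2
        return None
    fields = parse(data)                              # steps 3-4: nothing indexed yet
    if fields is None:                                # step 5: something strange
        return None
    for name in required:
        if name in fields:                            # step 6: already collected
            continue
        value = derive(name, fields)                  # step 7: no OpenBabel needed
        if value is None:
            value = ob_generate(name, data)           # step 8: call the OB helper
        if value is not None:
            fields[name] = value
    return fields                                     # step 9: caller adds to index


def toy_detect_mime(data):
    return "chemical/x-cml" if b"<molecule" in data[:1024] else None


def toy_parse(data):
    if b"\x00" in data:
        return None
    match = re.search(rb'formula="([^"]+)"', data)
    return {"chemistry.molecular_formula": match.group(1).decode()} if match else {}


def toy_derive(name, fields):
    return None   # nothing can be computed cheaply in this toy setup


def toy_ob_generate(name, data):
    return "<from OB helper>" if name == "chemistry.inchi" else None


REQUIRED = ["chemistry.molecular_formula", "chemistry.inchi"]
print(analyze(b'<molecule formula="C4H10"/>', "chemical/x-cml",
              toy_detect_mime, toy_parse, toy_derive, toy_ob_generate, REQUIRED))
```

Collecting everything into a dictionary first and indexing only at the end is what makes step 5 possible: a stream can still be discarded without leaving half of its fields in the index.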

Optical Structure Recognition in Strigi-chemical?

More people have blogged about the GPL Optical Structure Recognition tool OSRA (1, 2, 3, 4) since its first release. OSRA is a young project, and though its recognition quality is poor at the moment, its open license promises bright prospects.

Journal articles, patent documents, textbooks, etc. represent chemical structures as graphics. The idea to have an OCR analyzer to extract chemical structures from graphical files in Strigi-chemical is natural, but even with OSRA there is a long way to go until a decent implementation. There are obstacles of different kinds.

OSRA deployment

At the moment, building an OSRA binary takes some effort. It has no automake/autoconf or cmake build system and a long list of compile-time and runtime dependencies. ImageMagick, POTRACE, GOCR and OpenBabel are the major dependencies. Since it is out of the scope of my GSoC, I would have to wait for the upstream maintainer to ship OSRA as a library with an API. From my side I can make a strigi-chemical OSRA-helper (described before) with an optional runtime dependency, but it would take some time to figure out an API first.

Performance

I did some benchmarks with OSRA. It takes 1'30'' to process the sample patent document:


The general overview of the OCR workflow is as follows:

  • ImageMagick is used to detect the file type
  • PDF and PS files are rendered as images
  • the resolution is set; it is fixed (150 dpi) for PDF/PS files
  • OSRA iterates over the pages and detects the minimal boxes that most probably contain molecular structures. Here is the first box from the sample patent document:

  • the box is traced to obtain a vector representation
  • atoms, chars, fixed chars are detected
  • bonds are fixed, broken bonds are removed
  • valency-check is performed
  • the structure is converted to SMILES (could be an InChI though):
SMILES: OC(=O)C(C)(C)CCCCOCCCCC(C)(C)C(=O)O
InChI: InChI=1/C16H30O5/c1-15(2,13(17)18)9-5-7-11-21-12-8-6-10-16(3,4)14(19)20/h5-12H2,1-4H3,(H,17,18)(H,19,20)

  • continue with the next box

I tried to see how it scales and created a PDF with 64 compounds. OSRA detected 60 boxes in 1'5" with, unfortunately, quite a high error rate. This benchmark shows that OSRA cannot cope well with rotated (landscape) pages, that recognizing 9 compounds can take much longer than recognizing 60, and that the quality of recognition is low.

A simple example from the PDF with 64 compounds; the second image is rendered by PubChem from the SMILES that OSRA produced:




Jos suggested that we take the images from a PDF as substreams. There is a substream provider for that in Strigi.

I think that, taking into account the CPU time required to process a document, we should carefully control and select what we pass to the OSRA helper.

We could do it like this, for example:
  • create a chemical image analyzer, which would first check whether structural information is already embedded in the image itself (yes, it is possible; more in my next blog post),
  • then it should carefully check the context to see whether this is a chemical paper (by looking at the DOI, for example), and
  • then it can pass the extracted substream to the OSRA helper.
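
As a rough illustration of such a gate, here is a minimal Python sketch. It is not part of strigi-chemical; the function name and the prefix list are assumptions. The only DOI prefix listed is 10.1021 (the ACS one), purely as an example of what a "chemistry context" check could look like.

```python
# Illustrative only: 10.1021 is the ACS prefix; a real list would be longer.
CHEMISTRY_DOI_PREFIXES = ("10.1021/",)


def should_call_osra(has_embedded_structure, doi=None,
                     prefixes=CHEMISTRY_DOI_PREFIXES):
    """Return True only when the expensive OCR step looks worthwhile."""
    # 1. Structural data already embedded in the image: no OCR needed.
    if has_embedded_structure:
        return False
    # 2. Without a DOI there is no context to judge by, so skip.
    if not doi:
        return False
    normalized = doi.strip().lower()
    if normalized.startswith("doi:"):
        normalized = normalized[4:].strip()
    # 3. Pass the substream on only for documents from a listed source.
    return normalized.startswith(prefixes)
```

The checks are ordered from cheapest to most selective, so most images are rejected before any CPU time is spent on recognition.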

Strigi-chemical GSoC final timeline from chemistry notes

On August 20th all GSoC students and mentors are supposed to start the final evaluation. This means that the major goals of the projects should be complete and working code submitted.

The environment of my Strigi-chemical project is very dynamic: Strigi is under heavy development and is undergoing XESAM reforms at the moment, and there is also a lot happening in the open source cheminformatics world, such as the new OpenBabel release, OSRA optical structure recognition, and structural information embedded in PNG images. This post outlines the final TODO list of my project, which you can expect to be ready by the end of this SoC.

  1. The JStreams SDF substream provider is a nice feature which represents an SDF file as a virtual folder of MOL files. I am trying to make it stable at the moment. It makes SDF analysis transparent, as if the file really were a set of MOL files. It also allows browsing the contents of an SDF with the jstream:// KIO;
  2. I need to fix the CML2 analyzer and make sure it correctly recognizes the newly generated CMLs from Jerome's chemical-structures-2 repository;
  3. Use the MIME helper to decide whether to continue with stream analysis or to skip the stream. This will probably allow some greedy analyzers to be optimized;
  4. Make sure that the OpenBabel helper works well and does not crash with parallel analyzers;
  5. Create a chemical PNG analyzer to extract chemical information (MOL or InChI) from images; unit tests are based on samples generated by the Firefly and gchempaint software;
  6. Make sure all strigi-chemical analyzers conform to the Strigi PLUGIN architecture and have unit tests with files taken from the Blue Obelisk CTFR (Chemical Test File Repository);
  7. Update the chemical ontology to prepare it for migration to the XESAM ontology (fix types, cardinalities, child-parent links and indexing flags);
  8. And finally, build a sample GUI application using molsKetch and avogadro as KParts. This application should be able to accept a structure drawn by the user, represent it as an OpenBabel molecule, convert it to InChI using OB, make a XESAM query over dbus to strigidaemon, and display the list of results with a structural preview powered by Avogadro (Kalzium).

It is quite a lot of work for the 5 days that are left, but I have drafts of everything listed above, so it should be feasible.

MDL SD file support from chemistry notes

Chemical MDL SD files are now powered by advanced KDE/Strigi technologies.

Jstreams is a lightweight C++ streams library and gives us a powerful notion of substreams. Substream providers are very fast and feed Strigi analyzers with data.

One of the main goals of my GSoC project was to powertest Strigi. SD files are good examples of that. They could be really large containers of MOL molecules. It is a natural idea to access them like normal folders with MOLs inside.

I will make a tutorial-like dissection of the implementation here to encourage others to implement support of their favorite file formats in a similar way.

SdfInputStream is providing MOL entries as substreams. The important thing is to make sure it does not mistake any other format for SD. An SdfInputStreamTest testcase checks some basic stream operations.
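
To make the idea concrete, here is a minimal Python sketch of splitting an SD file into its entries. It is not the SdfInputStream code (that is C++ inside Strigi); it relies only on two facts of the SD format: every record ends with a line containing `$$$$`, and the molfile part of a record ends with `M  END`. The function name is made up for this example.

```python
def iter_sd_records(lines):
    """Yield (molblock, data) for each record; both are lists of lines."""
    record = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":
            # End of record: split the molfile from the trailing data items.
            cut = len(record)
            for i, text in enumerate(record):
                if text.startswith("M  END"):
                    cut = i + 1
                    break
            yield record[:cut], record[cut:]
            record = []
        else:
            record.append(line)
    # Lines after the last $$$$ form an incomplete record and are dropped.


# Two tiny (empty) molecules, the first with one data item attached.
sample = [
    "mol1", "", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    ">  <ID>", "1", "", "$$$$",
    "mol2", "", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    "$$$$",
]
records = list(iter_sd_records(sample))
```

Because the function is a generator over lines, it never needs the whole file in memory, which matters for containers with hundreds of thousands of entries.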

ArchiveReader is a facility which enables kio_jstream to represent files as directories and dive deep inside archives, email attachments and now SD files. ArchiveReader checks the stream header by calling one InputStream provider after another and, if one matches, tries to recurse into the tree. This is a greedy approach and it hurts. There is probably some room for improvement.

A test case with a sample file (I used a 10-compound SD file) can give you a basic idea of whether your substream provider works, but it does not test it as thoroughly as ArchiveReader does. My major troubles were with ArchiveReader. Once those are solved, you can enjoy access through the KDE interface.

I have taken a large 500 MB SD file with ~250,000 compounds and some smaller files of 40, 75 and 150 MB. On the screenshots you can see the 40 MB file with ~11,000 compounds in Dolphin.



Note that not all the entry information is propagated to KIO at the moment; for example, the size of the "directory" is missing. The interface also gradually slows down to an unusable state when trying larger and larger files (1, 2, 3, ... 20 million lines of text). It is probably not the best idea to put thousands of virtual files into one virtual folder. One possibility is to introduce virtual subfolders with <=100 molecules each. Naming is also a problem, because the title is optional in MOL files; I used "MoleculeN" as a substitute name for molecule #N. Another nice test is to read the sub"files" in kwrite. Below are examples of a 10-compound SD file in the file open dialog and of the Molecule2 file opened. Of course all screenshots show KDE4, in my case running in a Xephyr session.
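
The subfolder idea and the "MoleculeN" fallback could be combined roughly as follows. This is a Python sketch under assumptions: the "first-last" folder naming and the function name are invented here and are not what kio_jstream does today.

```python
def virtual_path(index, title="", per_folder=100):
    """Map molecule number `index` (1-based) to 'first-last/name'."""
    if index < 1:
        raise ValueError("molecule numbers start at 1")
    # The title line is optional in MOL files, so fall back to MoleculeN.
    # A slash in a title would create an unwanted extra path level.
    name = title.strip().replace("/", "_") or f"Molecule{index}"
    # Folder boundaries: 1-100, 101-200, ... for per_folder=100.
    first = ((index - 1) // per_folder) * per_folder + 1
    last = first + per_folder - 1
    return f"{first}-{last}/{name}"
```

With 100 entries per folder, a file with ~11,000 compounds would show about 110 subfolders instead of one folder with 11,000 entries.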

Now switching to data analysis.

SdfEndAnalyzer uses SdfInputStream to explore SD files, executes an indexChild() per molecule found and stores the number of molecules in chemistry.molecule_count field.

This is all done in Strigi rather than in Strigi-chemical, because I had some trouble writing and using external substream providers; hopefully this can be solved with the help of Jos. Since it does not add much overhead, it is not a problem.

indexChild() starts a new chain of analysis, and this is where MOL files are indexed. MdlMolFileLineAnalyzer is completely unaware of where the data stream comes from; moreover, it does not even have direct access to the input stream and only analyzes the text lines in sequential order. It now detects the MOL signature, makes sure it is not an SD file, and collects (and calculates) the chemical meta data, so far: chemistry.name, content.comment, chemistry.molecular_formula, chemistry.atom_count, chemistry.bond_count, chemistry.chirality.
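
For readers who have not looked inside a MOL file: several of these fields can be read straight from the four header lines. The sketch below is Python, not the MdlMolFileLineAnalyzer code, and the mapping of header lines to Strigi field names is my assumption. It relies on the V2000 layout, where line 1 is the title, line 3 is a comment, and line 4 (the counts line) holds three-character fields for atoms, bonds and, at columns 13-15, the chiral flag.

```python
def parse_mol_header(lines):
    """Read name, comment and counts from the first four lines of a V2000 molfile."""
    if len(lines) < 4:
        raise ValueError("a molfile header has four lines")
    counts = lines[3]
    if "V2000" not in counts:
        raise ValueError("not a V2000 counts line")

    def field(start):
        # Counts-line fields are three characters wide; blank means 0.
        return int(counts[start:start + 3].strip() or 0)

    return {
        "chemistry.name": lines[0].strip(),
        "content.comment": lines[2].strip(),
        "chemistry.atom_count": field(0),
        "chemistry.bond_count": field(3),
        "chemistry.chirality": field(12),
    }


# Header of a hypothetical water molfile (3 atoms, 2 bonds, not chiral).
water_counts = "%3d%3d%3d%3d%3d" % (3, 2, 0, 0, 0) + "  0" * 5 + "999 V2000"
water = parse_mol_header(["water", "  sketch", "example comment", water_counts])
```

Since only sequential lines are needed, this kind of parsing fits a line analyzer that never sees the underlying stream.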

Xmlindexer is a handy command line tool for checking the outcome. Testing a 10-compound SD file:

xmlindexer ligs3d.sdf




3609
MFCD02681585
C28N4O4
36
39
1
1


3362
FCD01567969
C28N3O2
33
37
1
1


[:skip:]


34150
10
0




Test suites using the sample files from the Blue Obelisk Chemical Test File Repository will make sure the analyzers do not get broken in the future: SDFTestCase, MOLTestCase.

Strigi now extracts chemical information from PNG files from chemistry notes

Many people blogged recently about storing molecular connectivity tables in images (Egon summarized it). Strigi-chemical now can extract and index this data.

This is how it works: PngChemicalEndAnalyzer is an endAnalyzer which takes control over the stream. It detects a chemical chunk in the PNG (Molfile, CML, InChI, ...) and creates a substream to pass to indexChild(). Then the whole chain of analyzers is executed again, and the chemical data is extracted by the respective stream analyzer.

It does not replace the normal PNG endAnalyzer, which is in charge of extracting all image-related information from the stream.
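
To show what "detecting a chunk" involves, here is a small Python sketch that walks the chunks of a PNG and returns the text chunks. It is not the PngChemicalEndAnalyzer code. It assumes the chemical data sits in an uncompressed tEXt chunk, and the "InChI" keyword in the test data is hypothetical, since each producing program chooses its own keywords. The CRC is not verified.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_text_chunks(data):
    """Yield (keyword, text) for every tEXt chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Chunk layout: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length
        if ctype == b"IEND":
            break
        if ctype == b"tEXt" and b"\x00" in body:
            # tEXt body: Latin-1 keyword, a null byte, then Latin-1 text.
            keyword, text = body.split(b"\x00", 1)
            yield keyword.decode("latin-1"), text.decode("latin-1")


def make_chunk(ctype, body):
    """Build one PNG chunk; only used to create the test data below."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# A deliberately incomplete "PNG" (no IHDR/IDAT), enough to exercise the parser.
caffeine = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
sample = (PNG_SIGNATURE
          + make_chunk(b"tEXt", b"InChI\x00" + caffeine.encode("latin-1"))
          + make_chunk(b"IEND", b""))
chunks = dict(iter_text_chunks(sample))
```

Each text found this way would then be handed on as a substream, so that the ordinary Molfile, CML or InChI analyzers do the actual chemistry.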

By the way, the InChI analyzer was upgraded and can now detect InChIs in various text sources; it can fix spaces and, in some cases, even line breaks.

The PNG chemical analyzer has a test case. Let's have a look at the sample files and the xmlindexer output:

Caffeine with embedded InChI (thanks Jean):






66
1
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
1
1
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3



image/png
3323
1
Jean Brefort
171
193
32
RGB/Alpha
Deflate
None
Public domain
0


Rosiglitazone with Molfile (thanks Rich):






2411
name
C18N3O3S1
25
27
comments
0
1


image/png
7984
1
109
327
32
RGB/Alpha
Deflate
None
0

Saturday, May 17, 2008

Mechanophores: a force to be reckoned with from chemistry notes

The concept is logical and easy to follow--if reactions can be triggered by light, heat, pressure, or electrical potential, then why can't mechanical force also be harnessed to distort molecules in a way that promotes reaction? Actually, using molecules appropriately termed mechanophores, the Moore group has succeeded in employing the mechanical forces generated from ultrasound to promote and influence chemical reaction pathways. Although ultrasound generally has no effect on small molecules, the collapse of cavitation bubbles produced during sonication can agitate polymers in solution, generating friction, otherwise known as mechanical force. By incorporating small molecule mechanophores (either trans- or cis-1,2-dimethoxybenzocyclobutenes, BCBs) into a larger polymer, researchers were able to take advantage of this force and promote ring opening. According to the Woodward-Hoffmann rules, in a reaction promoted through light energy, both cis- and trans-BCBs undergo a disrotatory ring opening, while thermal activation produces conrotatory products. On the other hand, computational studies indicated that under mechanical influences, the cis-BCB would produce the disrotatory product, but the trans-BCB would generate the conrotatory product.



In order to test this hypothesis, the BCB-containing polymer was sonicated at 6-9 degrees C with an excess of a pyrene-functionalized maleimide, which was meant to function as a dienophile trap. Indeed, the mechanophore did behave as predicted: by distorting bond lengths and angles, sonication opened the trans-BCB through a conrotatory process, while the mechanical force worked to reduce the energy barrier of the disrotatory reaction pathway for the cis-BCB. As the article title indicates, reactions can certainly be biased through the use of mechanical force.

So what will the next mechanophore be?



doi:10.1038/nature05681

Friday, May 16, 2008

Beware of the Portuguese Man-of-War from chemistry notes


Since I'll be traveling at the end of this week (unfortunately not to the ACS meeting), I decided to post something relevant to my destination. According to the people I'm visiting, there is currently an infestation of Portuguese Men-of-War on the beaches and in the surrounding water of this sunny locale. As I consider myself equal parts organic chemist and chemical biologist, I thought that this might make for an informative post. Interestingly, the Portuguese Man-of-War is a siphonophore; thus, it is a colonial species, made up of four different types of polyps. While each is an individual, they are completely integrated with each other, and the colony is often mistaken for one large jellyfish. Its tentacles can be up to 50 meters long! Although I couldn't find much information about the chemical composition of Physalia physalis venom, I did learn that it consists of ATPase, RNase, AMPase, and nonspecific aminoesterases [1], most of which work to degrade cellular content, producing extreme pain in the process. 28% of the venom protein consists of physalitoxin, which is a large heterotrimeric glycoprotein that hemolyses mammalian erythrocytes. Cells treated with man-of-war poison generally release histamine. Research has shown that the venom itself creates pores in the cell membrane, allowing for the free transport of mono- and divalent cations [2], which has a major impact on the cardiac system of poisoned animals. Usually man-of-war stings are not fatal to humans, unless one is stung while swimming in extremely deep water. According to many websites, treatment with either hot or cold water best relieves pain from stings, while vinegar may cause the tentacles to release more venom and should be avoided.

Also, I shouldn't fail to mention that Richet won the Nobel Prize for his work with the Portuguese Man-of-War, in which he discovered and characterized anaphylaxis (an extreme allergic reaction).

Can anyone guess where I am headed?

photo taken from: Lilactree

Review: Molecular Modeling Kits from chemistry notes

Check out this review of 4 of the most common molecular modeling kits out there (in German):

Molecular Modeling Kits

As a chemist can never have too many molecule building kits, I actually own kits number 1, 2, & 4 as they are presented in the link.

Alphabet Soup from chemistry notes

Image from Carlos J. Hernandez/Thomas G. Mason, UCLA Chemistry. This image is published in the Journal of Physical Chemistry C.


"Colloidal Alphabet Soup: Monodisperse Dispersions of Shape-Designed LithoParticles" by the Mason group at UCLA is the cover article for the Journal of Physical Chemistry C this week. Although this article was brought to my attention by my husband (a physical chemist of course), I found it interesting nonetheless. Taking advantage of high-throughput automated stepper lithography, graduate student Hernandez literally generated a soup of three-dimensional colloidal particles--including letters of the alphabet, crosses capable of alignment and formation of columnar structures, donut particles that can aggregate to form tubes, as well as various combinations of the above. LithoParticles, as the authors call these miniature polymeric colloids, range in size from micron to submicron units and can be colored through the incorporation of green, red, or blue fluorescent dyes.

So what exactly is "stepper lithography"? First, both a sacrificial layer and a polymeric photoresist layer are placed on top of a silicon wafer through a process known as spin coating. Using the "stepper," a fully automated lithographic projection exposure system, UV light is shone through a stencil-like "mask" and through the stepper's lens onto the photoresist layer. This step crosslinks the part of the photoresist that was exposed to UV light (the part not covered by the "mask"). Exposure to an organic developing solvent removes the unexposed photoresist, while leaving behind both the sacrificial layer and the crosslinked LithoParticles. Finally, the sacrificial layer is dissolved in water, and the alphabet soup is lifted off the surface of the silicon wafer into an aqueous solution. Once they are in solution, the particles are relatively stable, and the aqueous solvent can be exchanged for something organic.

For me the final paragraphs of the article were the most exciting part to read, as the authors discussed possible applications of designed LithoParticles. By incorporating fluorescent molecules or other probes such as DNA or charged molecules, LithoParticles might be useful for studying microstructures inside of cells. Tiny tweezers made out of lasers can be used to move letters of the "alphabet soup," (which is nicely illustrated by the UCLA below) and in this fashion cells could be identified with a unique symbol. Could it be possible to use this technology to mark cancer cells with an "X" and thus facilitate their elimination?

Well, with this technology, that dream might be one step closer to reality.



DOI: 10.1021/jp0672095

Bioethics and DCA from chemistry notes

Earlier this week I got an email from a boy in China, asking me to send him a compound that was synthesized by one of my lab-mates and was subsequently shown to kill cancer cells. At first I thought the email was spam, but after closer inspection I realized that it wasn't; he wasn't well informed, but had obviously read an article related to our lab's work and wanted the compound to give to his mother. Unfortunately, because this drug is still in pre-clinical phase my lab can't do anything to help cancer patients like this Chinese boy's mother. I felt awful and didn't know what an appropriate response would be to his email.

Anyway, this relates to an article that I read at the end of this week entitled "Cancer patients opt for unapproved drug." It was pretty fascinating and made me think. Basically, in January, Bonnet and coworkers [1] at the University of Alberta demonstrated that the small molecule dichloroacetate (DCA) can force cancer cells to undergo apoptosis and decrease tumor growth with limited toxicity. It sounds too good to be true, but the science behind it makes sense. Cancer cells have a unique metabolic profile, as the glucose oxidation that normally takes place in the mitochondria is not functional; thus, the mitochondria are considered "inactive" and the cells rely on cytoplasmic aerobic glycolysis for energy production. As a result of this mitochondrial damage, tumors have increased glucose uptake and metabolism, and this is considered one of the better markers of cancer cells. Studies have shown that several human cancer cell lines have hyperpolarized mitochondria and reduced oxidative metabolism; through inhibition of the mitochondrial enzyme pyruvate dehydrogenase kinase (PDK), DCA is able to reverse these changes to the mitochondria, which in turn allows tumor cells to be killed through apoptosis. Most amazingly, tumor volumes were reduced in nude rats that drank DCA dissolved in drinking water, and no toxicity was observed. As DCA has been used in clinical trials for the treatment of mitochondrial diseases and has a patented structure, big pharma wasn't interested in developing it as a drug.

This is where things start to get a little more interesting. After a little research on DCA, Jim Tassano, the owner of a pest control company in California, teamed up with chemist Joseph Ryan to make DCA. After they came up with a suitable synthesis, Tassano set up two websites: one is devoted to selling this homemade DCA for veterinary use, and the other contains excerpts from the Bonnet paper as well as a DCA discussion forum with over 1,000 posted messages. Although the FDA has not approved the use of DCA in humans, many of the posts on the forum are from cancer patients taking DCA and reporting on its effectiveness, a "clinical trial" of sorts. Researchers are worried that these patients are not only endangering themselves by taking an unapproved drug, but also hindering attempts to complete a real clinical trial. Approximately 95% of cancer drugs in clinical trials don't get approved for human use, usually due to ineffectiveness or undesirable side effects. Sadly, many patients don't have time to wait for clinical trials to be completed, and therefore they are willing to subject themselves to the unknown in hopes of beating cancer.

I certainly see both sides of the issue. While the chemist in me cringes at the thought of ingesting any non-pharmaceutical grade chemical (the website that sells DCA claims a purity of more than 99%, with impurities of 0.5% monochloroacetic acid and/or trichloroacetic acid, which doesn't come close to the purity requirements for pharmaceuticals), my compassionate side wants to offer a ray of hope to those suffering.

Reactions that work from chemistry notes

Looking for a way to prepare nucleoside-5'-carboxylic acids? Then I have just the reaction for you. Using a procedure adapted from Epp and Widlanski [1], one can easily make these carboxylic acids in just under three hours. In my hands, this reaction has worked wonderfully every single time, producing relatively pure product without much effort on my part; over the years it has become one of my favorite reactions.


Here is a sample procedure, exactly as it appears in my lab notebook:

Place 5g of acetonide protected adenosine in a 100ml round bottom flask. Add stir bar, 11.53g of DIB and 0.51g of TEMPO. Add 15ml of acetonitrile to 15ml of water and add to the reaction flask. Stir. After about 15 minutes the reaction turns a deep brown-orange color and the components begin to dissolve. Shortly after this a white precipitate forms. Stir for an additional 3 hours. Filter the solid and triturate sequentially with acetone and diethyl ether (3x each, 15ml). Dry the resulting solid under vacuum. No further purification necessary.

Yield: 4.98g, 96.8%

Large or small scale, the yield for this reaction is usually in the 90% range.

Organometallics, the final frontier?

For the past few months I've been following the work coming out of the Meggers lab at the University of Pennsylvania. A nice summary of their ruthenium based protein kinase inhibitors was recently published in Synlett [1], and while I'm certainly a little rusty on organometallic chemistry, I find their approach fascinating nonetheless.


Exploring chemical space with organometallics makes complete sense; carbon-based molecules can only form linear, trigonal planar, or tetrahedral geometries, so why not explore elements that are pentavalent or hexavalent and can thus form unique bioactive scaffolds? In a recent lecture, Meggers pointed out that an asymmetric tetrahedral carbon can form 2 stereoisomers, but an octahedral center with six substituents can form 30 different stereoisomers. (For those non-believers, Meggers had actually drawn out all 30 different stereoisomers on a slide).
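
For those who would rather check the count than draw it, the numbers follow from dividing the ways of placing the ligands by the number of rotations of the shape: 6!/24 = 30 for an octahedron and 4!/12 = 2 for a tetrahedron. The Python sketch below verifies this by brute force. It is my own illustration, not anything from the lecture; the vertex numbering and function names are arbitrary choices, and the count treats all substituents as different and includes mirror-image pairs.

```python
from itertools import permutations


def closure(generators):
    """All rotations reachable by composing the given vertex permutations."""
    n = len(generators[0])
    identity = tuple(range(n))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for h in generators:
            composed = tuple(h[g[i]] for i in range(n))  # apply g, then h
            if composed not in group:
                group.add(composed)
                frontier.append(composed)
    return group


def count_stereoisomers(generators):
    """Count ligand arrangements that no rotation maps onto each other."""
    group = closure(generators)
    n = len(generators[0])
    seen = set()
    for arrangement in permutations(range(n)):
        images = []
        for g in group:
            moved = [None] * n
            for site, ligand in enumerate(arrangement):
                moved[g[site]] = ligand  # ligand travels with its site
            images.append(tuple(moved))
        seen.add(min(images))  # one canonical representative per isomer
    return len(seen)


# Octahedron sites: 0=+x, 1=-x, 2=+y, 3=-y, 4=+z, 5=-z.
# Generators: quarter turns about the z axis and about the x axis.
OCTAHEDRON = [(2, 3, 1, 0, 4, 5), (0, 1, 4, 5, 3, 2)]
# Tetrahedron sites 0..3; generators: third turns about two different vertices.
TETRAHEDRON = [(0, 2, 3, 1), (1, 2, 0, 3)]

octahedral = count_stereoisomers(OCTAHEDRON)
tetrahedral = count_stereoisomers(TETRAHEDRON)
```

Here `octahedral` comes out as 30 and `tetrahedral` as 2, in agreement with the figures quoted above.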

In their search for an octahedral carbon-substitute, Meggers and coworkers concentrated on ruthenium because of its low cost, low toxicity (in the II and III oxidation states), high stability, and synthetic tractability. Using the ATP-competitive protein kinase inhibitor staurosporine as the basis for a ruthenium ligand, a small library of complexes was synthesized (100 members total, with 12 different ligands overall). Screening against several kinases revealed that several of the ruthenium-based inhibitors were quite potent, with IC50 values in the nanomolar range. What I find most amazing is the fact that the staurosporine-based pyridocarbazole ligand is 19,000 times less potent than the ruthenium complex containing the same ligand. Further application of this strategy has led to the discovery of highly selective protein kinase inhibitors (for Pim-1, GSK-3, MSK-1), some with picomolar binding constants. A crystal structure confirmed the initial hypothesis that the ruthenium center is not involved in any direct interactions with the protein; the metal center works to orient the organic ligands in a conformation that favors binding.

Photovoltaic devices from viruses from chemistry notes

Viral capsids are attractive scaffolds for the preparation of nanomaterials; they are highly robust in nature, monodisperse, easy to assemble, and small in size. In particular, the hollow tube-like capsid of the tobacco mosaic virus (TMV) provides an intriguing template for the development of organic nanowires. When fully assembled, each TMV particle is 300nm in length and is made up of over 2000 identical protein subunits that can be assembled into other aggregate structures, depending on pH and ionic strength conditions during assembly.

Thus, the idea to create light-harvesting systems out of assembled TMV capsids is not a surprising one. In fact, in the literature numerous methods have been developed to modify both the interior and exterior of the viral capsid with inorganic substrates,[1,2,3] but little success has been achieved with organic ones. So you can imagine that I was especially excited when I saw the title "Self-Assembling Light-Harvesting Systems from Synthetically Modified Tobacco Mosaic Virus Coat Proteins" in JACS quite recently. Ever since Prof. Matthew Francis gave a seminar here, I've had his group website bookmarked and have been watching for new developments.

In nature, sunlight is converted into chemical bonds with a high efficiency, mostly due to the fact that photosynthetic systems incorporate several types of chromophores (covering a large spectral bandwidth) spaced precisely to optimize energy transfer. In an attempt to mimic the ingenuity of nature, Miller and coworkers utilized a mutant TMV monomer bearing a reactive cysteine residue. At pH 7 in a phosphate buffer, this reactive cysteine was coupled with maleimide-functionalized Oregon Green (primary donor), tetramethylrhodamine (secondary acceptor), or Alexa Fluor 594 (acceptor). In order for this to work, FRET must occur between the selected donors and acceptor, and these dyes were chosen for their high degree of overlap in the solar spectrum as well as their high extinction coefficients and stability. By mixing various ratios of donor and acceptor monomers together and then adjusting ionic strength and pH, both disk and long rod aggregate structures were formed; the attached chromophores apparently had no effect on the system's ability to self-assemble. A 33:1 ratio of donor (Oregon Green) to acceptor (Alexa Fluor 594) produced an overall efficiency of 47%, while the 3-chromophore system containing 8:4:1 Oregon Green: tetramethylrhodamine: Alexa Fluor 594 resulted in a stunning 90% efficiency!



Though it is a simple concept, the combination of self-assembling biological scaffolds and synthetic organic chromophores seems to have great potential for the development of new solar cells.

Gypsum megacrystals from chemistry notes


picture taken from Garcia-Ruiz, J.M. et al. 2007 Geology 35(4), 327.

Although this isn't exactly a chemistry article, it is most certainly chemistry related, and I hope that you will agree that these pictures are too awesome to believe. The gigantic crystals pictured above made the cover of this month's Geology. Almost 80 years ago, the excavation of caves and tunnels at the Naica mine (112km Southeast of Chihuahua, Mexico) led to the discovery of meter-sized single crystals of selenite, which is one of the four crystal forms of gypsum. (The other three forms are satin spar, desert rose, and gypsum flower. As a side note, when I was younger I had a great collection of rocks and minerals that included a very nice sample of desert rose). Often these crystals of calcium sulfate dihydrate are found coated in calcite (calcium carbonate), celestite (strontium sulfate), or trace amounts of iron oxide, which give the crystals either a white or slightly red hue; selenite is colorless/transparent in its pure form. Amazingly, the Cueva de los Cristales (Cave of Crystals) contains selenite crystals up to 11 meters in length and 1 meter thick, with minimal contamination from other minerals.

While several have made conjectures as to how these crystals formed, none had been investigated carefully until now. Garcia-Ruiz and coworkers set out to explain the formation and growth of the Naica megacrystals after closely considering several factors. First, gypsum is slightly soluble in water, with a maximal solubility observed at 58 degrees C; conveniently, water samples from the Naica mines have temperatures ranging from 48-59 degrees C. Thus, the water found in the area is slightly supersaturated for gypsum and slightly undersaturated for the anhydrite form of calcium sulfate, suggesting a self-feeding mechanism. In other words, crystal growth might have been driven by a solution controlled anhydrite-gypsum phase transition. Calculation of the nucleation rate indicated that this suggested mechanism is a probable one, but only within a very narrow range of temperatures--46 to 60 degrees C. Such calculations indicate that these crystals have been growing in the caves at Naica for over one million years!

For more information:

The Largest Crystals on Earth

More pictures

Protein folding from chemistry notes

Another interesting link that my husband recently pointed out:

Folding@Home project (FAH)

Basically, using a technique called "distributed computing," researchers in the Pande group at Stanford hope to better understand protein folding and mis-folding. This of course is a noble cause, as incorrect protein folding or aggregation might be responsible for a variety of disease states; Alzheimer's, Huntington's, Parkinson's, and the big one--cancer (as related to p53)--have all been linked to protein misbehavior. Instead of using a supercomputer for all of these protein folding calculations, FAH relies on people like us to download and run software devoted to their cause. While there are almost 200,000 active CPUs in FAH, a typical supercomputer has only 5000. So far FAH has been quite successful; as of March 21, 2007, over 40 publications have been attributed to FAH calculations.

Would you be willing to donate your computer's down time to a good cause?

Impact factors from chemistry notes

Have you ever taken a few seconds to explore the impact factors of your favorite journals? If you've never done it before, I highly recommend taking a closer look at the ISI Web of Knowledge, especially the Journal Citation Reports (JCR). Whether or not you believe in impact factors doesn't really matter--it's pretty interesting nonetheless.

For instance, the first article ever published with my name on it was in Organic Letters, which has an impact factor of 4.368 according to JCR. More recently, some of my work could be read in the international edition of Angewandte Chemie--impact factor 9.596. Does this mean I am slowly moving up the ladder of scientific respect? Well, there is actually a lot of debate about this subject, and some people believe that journal impact factors don't accurately represent the real importance of journals; would it be better to just use actual article citation numbers?

Before I move on, I think it is pretty important to understand how impact factor is calculated. Here is what goes into an impact factor calculation:


Using Angewandte Chemie International Edition as a real life example--in 2005 there were 11384 other articles citing articles from the year 2004, and 10620 other articles citing articles from the year 2003, for a grand total of 22004 citations. Divide this by the total number of articles published in 2003 and 2004 (2293) to get 9.596, the impact factor. Pretty simple, right? Well, the JCR reports a number of other interesting factors including the immediacy index (number of cites to "current" articles divided by number of current articles), journal cited half life (the median age of articles that are cited in the current year), and several graphs that condense some of this information.
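
The arithmetic in this example is easy to reproduce. Here is a minimal Python sketch of it; the function and parameter names are mine, not JCR terminology, and the numbers are the ones quoted above.

```python
def impact_factor(cites_to_previous_year, cites_to_year_before,
                  articles_in_both_years):
    """Citations this year to the two preceding years, per article published in them."""
    total_cites = cites_to_previous_year + cites_to_year_before
    return total_cites / articles_in_both_years


# Angewandte Chemie International Edition, 2005 figures from the text.
angewandte_2005 = impact_factor(11384, 10620, 2293)
print(round(angewandte_2005, 3))  # 9.596
```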

Does the impact factor really measure the quality of a journal (or the importance of the articles published in the journal)? Well, it is true that some of the journals that I consider to be the best in the field have some of the highest impact factors. On the other hand, it's important to keep in mind that these numbers also reflect the latest trends in the literature. Availability of journals can be an issue, along with the amount of current interest and publication in a particular area.

Below is a condensed list of my favorite journals and their 2005 JCR impact factors:


I want one... from chemistry notes

A few days ago jungfreudlich posted pictures of his Element Collection. Very cool, don't you think? You can buy them online here, but probably not on a graduate student's salary :o)

Janus Disks from chemistry notes


What exactly is a Janus disk? Well, with a quick internet search you can easily find several references to Janus, the Roman god of doorways, gates, and beginnings (hence the word January for the first month of the year), but a picture search is actually most revealing. Usually Janus is shown with two different faces that look in opposite directions; one represents the sun and the other symbolizes the moon. Interesting--but what does this have to do with chemistry??

Well, after that brief review of Roman mythology, one can easily imagine that a Janus particle is composed of two fused hemispheres of different materials--similar to the bust of Janus pictured above. Depending on their actual shape, Janus particles are placed into three categories: spheres, disks and cylinders. Several potential applications of these two-sided particles have been envisioned. For instance, in solar cells two very different types of molecules (donors and acceptors) must work together to convert light into electron movement; thus, using Janus particles within light-harvesting devices might increase solar cell efficiencies. One could also imagine a Janus scaffold as a drug delivery system; one half of the disk might target cancer cells, while the other half would deliver a cytotoxic drug.

Synthesis of Janus structures is a daunting task and only a few examples of non-spherical Janus particles exist in the literature; thus, when I came across this article in JACS today, it caught my attention. Researchers at the University of Bayreuth in Germany have recently succeeded in producing Janus disks utilizing a template-assisted synthesis. Polymers made of polystyrene-block-polybutadiene-block-poly(tert-butyl methacrylate) were self-assembled and then treated with either AIBN or S2Cl2 to crosslink the inner polybutadiene layer; this step preserves the orientation of the polystyrene and poly(tert-butyl methacrylate). Finally, after sonication the Janus disks are obtained in their final form; the size of the disks is tunable and ranges from the micro- to nanometer scale. As Janus structures have also been proposed to have potential as surfactants, the effect of these Janus disks on the interfacial tension of liquid-liquid interfaces was studied as well. Compared to their un-crosslinked starting materials, the Janus disks have a remarkable ability to decrease interfacial tension, and therefore future technological applications might include the stabilization of emulsions or encapsulation of molecules.

Picture taken from http://dx.doi.org/10.1021/ja068153v

Aggravating Aggregation from chemistry notes

Anyone interested in the field of high-throughput screening shouldn't miss this article, which appeared online in the ASAP section of J. Med. Chem. last week. Generally, medicinal chemists can avoid false positives in screens by utilizing the well-known Lipinski's Rule of Five or other computational methods that identify potentially problematic molecules. Unfortunately, compounds that form colloidal aggregates are particularly troublesome; through sequestration of an enzyme from its substrate, these molecules usually appear to be good inhibitors (with IC50 values as low as 1 micromolar) with rather steep dose-response curves. As aggregate-based inhibition is abrogated through the use of moderate concentrations of non-ionic detergents such as Triton X-100 (0.01 to 0.1%), Feng and coworkers developed an assay to test 70,563 compounds for detergent-sensitive inhibition. This screen has really opened my eyes to the prevalence of aggregators among screening hits. Astonishingly, of 1274 beta-lactamase inhibitors identified, 1204 were detergent sensitive, indicating an aggregation-based mechanism of inhibition for 1.7% of the library! Anyone who has sorted through hundreds or thousands of initial hits will see the advantage of being able to identify or eliminate these artifacts from screens.
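To put those numbers in perspective, here is a quick back-of-the-envelope check in Python. The variable names are mine; the counts are the ones quoted from the screen above.

```python
library_size = 70563         # compounds screened
hits = 1274                  # beta-lactamase inhibitors identified
detergent_sensitive = 1204   # hits whose inhibition disappeared with detergent

# Share of the initial hits that behave like aggregators
fraction_of_hits = detergent_sensitive / hits
# Share of the whole library that behaves like aggregators
fraction_of_library = detergent_sensitive / library_size

print(f"{fraction_of_hits:.1%} of hits")        # 94.5% of hits
print(f"{fraction_of_library:.1%} of library")  # 1.7% of library
```

In other words, only about 1 in 18 of the initial hits survives the detergent test.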

Discovering that a molecule is an aggregator is not a death sentence for its future use; as aggregation is concentration and condition dependent, molecules known to aggregate in one screen might not in a different setting. Additionally, several known drugs are aggregators at concentrations below 100 micromolar, including clotrimazole, nicardipine, delavirdine, and benzyl benzoate as pointed out by this 2003 article in J. Med. Chem.

Chemistry and......Sports??? from chemistry notes

Imagine my surprise this morning when I took a peek at the sports section of the daily newspaper here on campus:


The article doesn't have anything to do with chemistry (you can read it here if you are interested), but I still thought it was cool to see a periodic table on the front page of the sports section. All press is good press, right?

Fair use?

I just read about this over on Chemistry Central. Basically, a graduate student blogger at the University of Michigan was threatened with legal action for using some copyrighted figures in her blog. Fortunately the matter has been resolved, but it still opens up the question: What is fair use?

Anyway, I'd almost prefer an email like that over this kind of unpleasantness. I guess I'm lucky that my boss is a nice guy.

Space, the final frontier... from chemistry notes

With a great title like "Chemical Space Travel" I just couldn't pass up this early view article in ChemMedChem. Though I'm not sure that I totally buy into this as a method for discovering new drugs, it is an interesting concept nonetheless. Currently, it is estimated that there are 10^20 to 10^200 "druggable" organic molecules. As it is impossible to sift through all of these structures when searching for new lead compounds, knowing what region of chemical space to explore beforehand might be beneficial. Thus, researchers in the Reymond group at the University of Berne in Switzerland have developed a computer program that serves as a "spaceship" for chemical space travel; a point mutation generator serves as a "propulsion device," and a similarity score serves as a "compass." In simpler terms, starting from any molecular structure "A", this program first completes one of eight possible mutations on each atom/bond in the molecule: atom exchange, atom inversion, atom removal, atom addition, bond saturation, bond unsaturation, bond rearrangement, or aromatic ring addition. Then, the similarity between each mutant and the target compound "B" is measured. The 10 mutants that are most similar to the target "B" and 20 random mutant molecules are carried on for another round of mutation/selection. This continues until one arrives at the target molecule "B," and along the way thousands of unique structures are generated.
One easy example is illustrated below: Starting from methane, 12 mutations produced cubane--but along the way 6638 unique compounds were generated, taking the 10 most similar to the target (in this case cubane) and 20 random compounds at each mutation step. All compounds that were unstable or not synthetically feasible were eliminated. In the same fashion, from cubane to methanol, there were only 7 steps necessary, and during the process almost 1000 new molecules were generated.
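The mutate-and-select loop is easy to play with in a toy setting. The sketch below is only an illustration under heavy assumptions, not the Reymond group's program: short strings over a made-up four-letter "atom" alphabet stand in for molecules, only three of the eight mutation types are mimicked (exchange, removal, addition), edit distance stands in for the similarity score, and there is no filter for stability or synthetic feasibility. All function names are hypothetical.

```python
import random

ALPHABET = "CNOH"  # hypothetical "atom types"; strings stand in for molecules


def edit_distance(a, b):
    """Levenshtein distance, a stand-in for a (dis)similarity score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def mutants(s):
    """All single-point mutations of s: addition, exchange, removal."""
    out = set()
    for i in range(len(s) + 1):
        for c in ALPHABET:
            out.add(s[:i] + c + s[i:])          # addition at position i
            if i < len(s):
                out.add(s[:i] + c + s[i + 1:])  # exchange at position i
        if i < len(s) and len(s) > 1:
            out.add(s[:i] + s[i + 1:])          # removal at position i
    out.discard(s)
    return out


def travel(start, target, n_best=10, n_random=20, seed=0, max_generations=100):
    """Return (generations needed, number of unique structures seen)."""
    if start == target:
        return 0, 1
    rng = random.Random(seed)
    population, visited = {start}, {start}
    for generation in range(1, max_generations + 1):
        pool = set()
        for parent in population:
            pool |= mutants(parent)
        visited |= pool
        if target in pool:
            return generation, len(visited)
        # Keep the mutants closest to the target plus a random handful
        ranked = sorted(pool, key=lambda m: (edit_distance(m, target), m))
        best, rest = ranked[:n_best], ranked[n_best:]
        population = set(best) | set(rng.sample(rest, min(n_random, len(rest))))
    raise RuntimeError("target not reached")


generations, n_visited = travel("C", "CNCCOC")
print(generations, n_visited)
```

Because every possible single edit is tried in each round, the closest mutant always moves one step nearer to the target, so this toy version arrives in exactly as many generations as the edit distance between start and target. Real molecules and real similarity scores give no such guarantee, which is presumably why the random mutants are carried along.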


So how could this be used for drug discovery? Well, to do this, the authors investigated the chemical space between AMPA and CNQX (shown below); both are known ligands of the AMPA receptor (AMPA is an agonist, CNQX a competitive antagonist), which is a glutamate receptor in the central nervous system. Using these two compounds, over 559,656 compounds were obtained after 500 runs, which created this cool looking graph. Colors for the graphs are as follows: AMPA to CNQX in green, CNQX to AMPA in blue, run-away compounds in gray, AMPA to CNQX mutant series in orange, CNQX to AMPA mutant series in pink, and in red are the best docking compounds--or in other words, compounds that are predicted to bind in the active site of the AMPA receptor (this was determined through computational docking studies). If you haven't noticed, the novel inhibitor with the best predicted affinity for the AMPA receptor is a combination of an amino acid group from AMPA and an aromatic group originating from CNQX.

Image taken from ChemMedChem 2(5), 636.


So the next time you are looking for novel chemical inhibitors, why don't you just take a ride in a chemical spaceship...