This is a belated post on changes in Strigi that I made at the beginning of July.
The tests in strigi-chemical have highlighted several problems in Strigi. Below I list each problem and its solution.
Problem with plugins.
Strigi analyzers are loaded as plugins with dlopen(). I noticed random and strange effects when multiple copies of one analyzer had been initialized. The explanation was simple: never use the RTLD_GLOBAL flag to load Strigi analyzers. A while ago I blogged about linking to OpenBabel and enabled this flag in Strigi. The solution was to implement a loader for libopenbabel in the strigi-chemical analyzer. LibOpenBabel is a highly recommended runtime dependency, although the chemical analyzers can work without it. A question to packagers: should OpenBabel be listed as a hard dependency, or only as suggested or recommended? There is also the option of making a metapackage with the dependency. We don't want users to miss key features just because they overlooked a soft dependency in apt, for example.
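For illustration, here is a minimal sketch of loading a plugin with RTLD_LOCAL instead of RTLD_GLOBAL; the library and symbol names are made up for the example, and this is not the actual Strigi loader code.

#include <dlfcn.h>
#include <cstdio>

int main() {
    // RTLD_LOCAL keeps the plugin's symbols private to this handle;
    // RTLD_GLOBAL would merge them into the global symbol table, which is
    // how two copies of one analyzer can start stepping on each other.
    void* handle = dlopen("libchemicalanalyzer.so", RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    // Look up the factory entry point exported by the plugin (name is
    // hypothetical here).
    void* factory = dlsym(handle, "createAnalyzerFactory");
    if (!factory) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
    }
    dlclose(handle);
    return 0;
}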
Later, the OB loader was wrapped in a singleton with mutex locking. I had some random core dumps in the unit tests due to thread-safety violations. I had no idea what could cause them inside OpenBabel, so I simply protected the instance.
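Roughly, the idea looks like this (a sketch with assumed names, not the strigi-chemical code): one shared loader instance, and a mutex serializing every call into the library.

#include <mutex>

class OpenBabelLoader {
public:
    static OpenBabelLoader& instance() {
        static OpenBabelLoader loader;  // created once for the whole process
        return loader;
    }

    // Serialize every call into the library, since its thread safety
    // is unknown.
    template <typename Fn>
    void withLock(Fn fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        fn();
    }

private:
    OpenBabelLoader() {
        // dlopen() libopenbabel and resolve the needed entry points here
        // (omitted in this sketch); the analyzer falls back to reduced
        // functionality if the library is not found.
    }
    std::mutex mutex_;
};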
Problem with tokenizers.
One of the key features of the chemical analyzers is performing an exact structure-match search by chemical identifier. The recommended IUPAC identifier is InChI, although it is not as widespread as SMILES, for example. The power of InChI is that it represents a chemical structure, with all layers of information (such as charges, fixed hydrogens, isotopes and stereochemistry), as a string, and the reverse transformation is possible.
The problem here was that an InChI string was tokenized, i.e. cut into pieces. A few words about how this works. Each token represents a "word" in the dictionary of the search engine's backend. Strigi has an indexreader/indexwriter abstraction over the search backends; it can even support a hybrid backend, e.g. a mixture of CLucene and SQLite. The CLucene backend is well supported, while support for the other backends is still rudimentary (developers welcome!). So, each field is processed by an analyzer/tokenizer during indexing, and the same tokenizer must be used to process the field during search-query analysis. Some tokens, like InChI, deserve special treatment.
Have a look at the caffeine InChI to get an impression of what I'm talking about:
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
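To see what a generic tokenizer does to such a string, here is a toy example (not CLucene's actual analyzer) that splits on everything non-alphanumeric; an exact-match query for the full identifier cannot work against fragments like these.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> naiveTokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            current += c;
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

int main() {
    const std::string inchi =
        "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3";
    for (const std::string& t : naiveTokenize(inchi)) {
        std::cout << t << '\n';   // "InChI", "1", "C8H10N4O2", "c1", ...
    }
    // A field flagged as not-to-be-tokenized would instead index the
    // whole string as a single term.
}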
The solution was to add a special flag to the chemistry.inchi ontology field property to indicate that a special tokenizer is required. I added index-control flags that tune the behaviour of fields in the index (at the moment supported only in the CLucene backend). These flags are boolean: Binary, Compressed, Indexed, Stored, Tokenized. By default, Stored|Indexed|Tokenized are enabled.
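Conceptually the flags combine like ordinary bit flags; the enum below only illustrates the defaults described above and is not the actual Strigi/CLucene declaration.

#include <cstdint>
#include <iostream>

enum IndexFlag : std::uint8_t {
    Binary     = 1 << 0,
    Compressed = 1 << 1,
    Indexed    = 1 << 2,
    Stored     = 1 << 3,
    Tokenized  = 1 << 4
};

int main() {
    // Default behaviour for an ordinary field.
    std::uint8_t defaultFlags = Stored | Indexed | Tokenized;

    // A field like chemistry.inchi keeps Stored | Indexed but drops
    // Tokenized, so the whole identifier is indexed as one term.
    std::uint8_t inchiFlags = Stored | Indexed;

    std::cout << int(defaultFlags) << ' ' << int(inchiFlags) << '\n';
    return 0;
}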
Ontology database.
This would not be worth a blog post if it were that simple. Current Strigi uses fieldproperties files as a draft ontology database. The registerField() API did not respect that database and took cardinality and child-parent relations as call parameters. To make my index-control flags work, this behaviour was changed and the values are now loaded from the database. That left the registerField() API call with only one parameter: the field name. Loading and checking of MinCardinality and MaxCardinality from the database were implemented as well.
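A before/after sketch of the call shape; AnalyzerConfiguration and RegisteredField are stand-in stubs here, not the real Strigi classes, and only the shape of the calls is taken from the description above.

#include <string>

struct RegisteredField {};

struct AnalyzerConfiguration {
    // Old style (illustrative): type, cardinality and parent were passed
    // with the call, bypassing the fieldproperties database.
    const RegisteredField* registerField(const std::string& name,
                                         const std::string& type,
                                         int maxCardinality,
                                         const RegisteredField* parent) {
        (void)name; (void)type; (void)maxCardinality; (void)parent;
        return &field_;
    }

    // New style: only the field name is given; type, MinCardinality,
    // MaxCardinality, parent-child relations and index-control flags are
    // looked up in the ontology (fieldproperties) database.
    const RegisteredField* registerField(const std::string& name) {
        (void)name;
        return &field_;
    }

    RegisteredField field_;
};

int main() {
    AnalyzerConfiguration config;
    const RegisteredField* inchi = config.registerField("chemistry.inchi");
    (void)inchi;
    return 0;
}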
Why is the fieldproperties database obsolete? Well, XESAM is here, and Strigi has to be 100% XESAM-compatible. Jos implemented the new D-Bus XESAM interface, Flavio added the new query parser, and Phreedom is hacking on an RDF parser for the new ontology database. Add new user-query support and you will get a completely XESAM-compatible Strigi (cross-)desktop search engine.
Monday, May 19, 2008
Strigi plugins, tokenizers and ontology from chemistry notes
Labels: GSoC, Strigi, strigi-chemical