
OfflineSearch

Quite a while ago I started my first Ruby project – OfflineSearch. The name may not sound fancy, but at least it tells you what it does: OfflineSearch is a semantics-aware search, mainly for offline HTML documentation.

Some history

At the company I work for, our product is delivered with offline HTML documentation that includes the ability to search for keywords across all documents. This search was powered by Webetiser, a tool that generates an index from HTML files for a JavaScript-powered search. It had several problems – at least in the version we used:

  1. It could not handle UTF-8
  2. It ranked the results in a very illogical way – I do not remember the exact algorithm, but the quality of the results was rather poor
  3. It produced a rather large search index. It was obviously not designed to hold several thousand documents

The main problems were points 1 and 2. On the one hand, I had to convert the documentation to ISO-8859-1 and then back again, otherwise the search index would be slightly corrupted. This led to quite a lot of other errors that I had to fix. On the other hand, I was not very fond of the search result quality.

I searched the net a lot for offline search programs but could not find anything better than Webetiser. In the end I decided to write one myself, and that was the birth of OfflineSearch. To improve the result quality, I decided to look at the semantics of the documents: I gave some tags (e.g. strong, em) more weight than others (e.g. p or span) and accumulated the values when tags were nested. I also wanted to rate the importance of a document by the links it receives from other pages – as far as I know, this is one of the signals search engines like Google use to compute their page rank.
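The tag-weighting idea can be sketched roughly like this. The weight values, the `Node` structure and the method names here are made up for illustration – in OfflineSearch the actual weights are configurable:

```ruby
# Hypothetical weights -- the real values are tweakable in OfflineSearch.
TAG_WEIGHTS = { "h1" => 8, "h2" => 6, "strong" => 4, "em" => 3, "p" => 1, "span" => 1 }

# A toy document node; a real indexer would walk a parsed HTML tree.
Node = Struct.new(:tag, :children, :text)

# Accumulate weights down the tree: a term inside <strong> inside <p>
# scores 1 (p) + 4 (strong) = 5.
def score_terms(node, inherited = 0, scores = Hash.new(0))
  weight = inherited + TAG_WEIGHTS.fetch(node.tag, 1)
  node.text.to_s.downcase.scan(/\w+/).each { |term| scores[term] += weight }
  node.children.each { |child| score_terms(child, weight, scores) }
  scores
end

doc = Node.new("p", [Node.new("strong", [], "ruby")], "plain text")
scores = score_terms(doc)
scores["ruby"]   # => 5
scores["plain"]  # => 1
```

Because nested weights add up, an emphasized term inside a heading ends up scoring much higher than the same term in body text.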

Goals

I set these goals for OfflineSearch:

  • The top 10 of the result set should contain the page that the user was searching for
  • The search index should be optimized for large document sets
  • A search result should be returned in a decent amount of time

I did not try to optimize the indexing time in this first approach, because it is negligible compared to the time a user has to wait for a result to display, or the time spent looking for the right document. You only index once, but many users search often. The index should be optimized for fast searching.

A first prototype

After about three weeks of programming in my leisure time, I had a prototype ready for testing. The first tests on the documentation were quite promising, because my search generated better results than the old tool. At that point the quality was still far from good, but by tweaking the tag values it quickly went in the right direction. This is one of the main advantages of OfflineSearch: you can tweak it until it fits your situation. My search index was optimized for a large number of documents and was smaller than what came from Webetiser. The prototype was such a motivation that I pushed version after version until the program got more stable and faster.

At first, indexing performance was really poor: indexing 1,300 documents took about 5 minutes. Of course you cannot really compare a Ruby program against a C++ one, but a factor of 5-6 was not what I wanted. Several parts of the program were simply not optimized. After playing around with the Ruby profiler and the benchmarking tool, I found the bottlenecks and quickly removed them. In the end, OfflineSearch performed really well.

Did you mean … ?

A feature I also missed in Webetiser was the ability to suggest 'Did you mean…' terms when the user misspells a term or when a term is not in the search index. I did a lot of research and decided to try the double metaphone algorithm. The decision was not easy, but it seemed to be the only useful algorithm if you need support for languages other than English and cannot put too much load on the processor. With double metaphone you only need to compare two strings; you do not need to calculate the distance between them. That is a lot faster and works fine even on a large index – JavaScript is too slow to calculate the distance between lots of terms.
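The principle of a phonetic-key comparison can be illustrated with the much simpler Soundex algorithm – a stand-in here, since a full double metaphone implementation runs to hundreds of lines: each term is reduced to a short code, and a suggestion is made whenever the codes match.

```ruby
# Soundex shown only as a simple stand-in for the phonetic-key idea;
# OfflineSearch itself uses double metaphone.
SOUNDEX_CODES = {
  "B" => "1", "F" => "1", "P" => "1", "V" => "1",
  "C" => "2", "G" => "2", "J" => "2", "K" => "2", "Q" => "2",
  "S" => "2", "X" => "2", "Z" => "2",
  "D" => "3", "T" => "3", "L" => "4", "M" => "5", "N" => "5", "R" => "6"
}

def soundex(word)
  letters = word.upcase.gsub(/[^A-Z]/, "").chars
  return "" if letters.empty?
  # Encode every letter; vowels and H/W/Y become "0"
  encoded = letters.map { |c| SOUNDEX_CODES[c] || "0" }
  # Collapse runs of the same digit, drop the run that contains the
  # first letter, then throw the zeros away
  digits = encoded.chunk_while { |a, b| a == b }.map(&:first)
  digits = digits.drop(1).reject { |d| d == "0" }
  (letters.first + digits.join)[0, 4].ljust(4, "0")
end

# Terms that sound alike map to the same key:
soundex("Smith")  # => "S530"
soundex("Smyth")  # => "S530"
```

Comparing two four-character keys is a constant-time string comparison, which is why this family of algorithms stays cheap even on a large index.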

The Ruby version was easy to find; it is located in the Text gem – where else? But I did not find a JavaScript version. With Ruby I could produce a cached phonetic key for every term that went into the search index, but I needed to transform the user input on the client side. So again I had to write it myself, which was quite a tedious job: character after character, I rewrote the Ruby version in JavaScript. The tests provided by the author were a great help and a bit of motivation as more and more of them passed. When I finally got all tests to pass, I was able to try it out, and was not too pleased with the results. It was better than nothing, but the ranking – again – was not too good: the algorithm does not provide any ranking mechanism, it just tells you whether two terms are similar or not. Finally I ranked the results of the double metaphone algorithm with the Levenshtein distance, and voilà, the results were better. It is still not the best way, but it is the best one I found.
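The ranking step can be sketched in a few lines of plain Ruby (the Text gem ships a Levenshtein implementation too; `did_you_mean` is a made-up name for illustration):

```ruby
# Classic dynamic-programming edit distance, two-row variant
def levenshtein(a, b)
  prev = (0..b.length).to_a
  (1..a.length).each do |i|
    curr = [i] + Array.new(b.length, 0)
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      curr[j] = [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev[b.length]
end

# The phonetic pass yields an unordered set of similar-sounding
# candidates; rank them by how close they really are to the input.
def did_you_mean(input, candidates)
  candidates.sort_by { |c| levenshtein(input, c) }
end

did_you_mean("serch", ["stretch", "search", "starch"])
# => ["search", "starch", "stretch"]
```

The expensive distance calculation only runs on the handful of candidates that survived the cheap phonetic comparison, which keeps the client-side load small.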

The double metaphone algorithm does not produce really good results in general, and on long words it is rather useless. If you are interested in trying it out, you can find the JavaScript version of the algorithm in my toolbox on GitHub.

Problems

Here are some of the problems I ran into while developing OfflineSearch that I did not foresee.

Search index

I first decided to nest arrays three layers deep in the search index. This turned out to be really bad for performance in IE6. I solved the problem by using only two layers and splitting the contained strings instead. Here is an example:

Bad:

var ranks = [[[29,4],[47,3],[77,3],[52,2],[58,1]],[[63,62],[66,6],[3,4]]]

Good:

var ranks = [['29-4','47-3','77-3','52-2','58-1'],['63-62','66-6','3-4']]
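On the Ruby side, the indexer can emit the flattened format in a couple of lines. This is only a sketch – the variable names and the exact layout of the intermediate data are assumptions:

```ruby
# Per-term list of [document_id, rank] pairs, as built by the indexer
ranks = [[[29, 4], [47, 3], [77, 3], [52, 2], [58, 1]],
         [[63, 62], [66, 6], [3, 4]]]

# Flatten the innermost layer into 'doc-rank' strings ...
flat = ranks.map { |term| term.map { |doc, rank| "#{doc}-#{rank}" } }

# ... and emit the JavaScript literal for the search index
js = "var ranks = #{flat.inspect.gsub('"', "'")}"
```

The client side then only has to split each string at the dash to get the pair back, instead of walking a third array layer.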

Indexing

I first thought REXML would be faster if the document was valid XHTML, but in the end I just used Hpricot for all document crawling.

Caching

Cache everything you can. If you use a regular expression more than once, cache it. If you use the same array more than once, cache it, and so on.

Do not initialize these things more often than really needed. This gives you an immense performance boost with nearly no effort.
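A typical case is a dynamically built regular expression inside a loop. A small illustration (the function names and data here are made up):

```ruby
docs = ["Ruby rocks", "Offline search in Ruby", "No match here"]

# Bad: builds (and compiles) a new Regexp object for every document
def count_slow(docs, term)
  docs.count { |doc| doc =~ Regexp.new(Regexp.escape(term), Regexp::IGNORECASE) }
end

# Good: compile the pattern once and reuse it across all documents
def count_fast(docs, term)
  re = Regexp.new(Regexp.escape(term), Regexp::IGNORECASE)
  docs.count { |doc| doc =~ re }
end

count_fast(docs, "ruby")  # => 2
```

Both versions return the same result; the second one simply avoids re-creating an identical object thousands of times while indexing.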

Profiling and Benchmarking

Ruby and JavaScript both provide excellent tools for profiling and benchmarking, just use them and learn to interpret the results.

Ruby has a good built-in tool for benchmarks and a profiling gem that is better than the built-in version. Firebug includes a JavaScript profiler.
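A minimal sketch with Ruby's built-in Benchmark module, timing a hot spot in isolation:

```ruby
require "benchmark"

# Some throwaway workload standing in for a real indexing hot spot
text = "offline search " * 100

result = Benchmark.measure do
  1_000.times { text.split(" ") }
end

# Wall-clock time of the measured block, in seconds
puts format("split: %.4f s", result.real)
```

Wrapping suspected bottlenecks like this, before and after a change, is usually enough to tell whether an optimization actually paid off.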

Future

What OfflineSearch is missing right now are some fancy templates for easy integration in various situations. For now it is written as a jQuery plugin, but the code base is not heavily bound to jQuery, so it should be rather easy to port it to Prototype, MooTools or whatever. Maybe someone wants to fork it and provide this functionality.

Lots of other features are on my mind to improve OfflineSearch. They are all documented in the project’s ToDo list.

Conclusion

I am glad that my first Ruby project turned out that well. The code base is often not as Ruby-esque as I wish it was, so I still find myself refactoring code to use more Ruby features and to make it more readable. One of the gems born out of OfflineSearch is FancyLog, which I presented in a previous post. The next big step is to get nearly all configuration out of the code and replace it with JigSaw.


October 27, 2008 at 6:37 am

