Quite a while ago I started my first Ruby project: OfflineSearch. I know that the name does not sound fancy, but at least it tells you what it does. OfflineSearch is a semantic search, mainly for offline HTML documentation.
- It was not capable of UTF-8
- It ranked the results in a very illogical way – I do not remember the exact algorithm, but the quality of the results was rather poor
- It produced quite a large search index. Obviously it was not designed to hold several thousand documents
The main problems were points 1 and 2. On the one hand I had to convert the documentation to ISO-8859-1 and then back again, otherwise the search index ended up corrupted. This led to quite a lot of other errors that I had to fix. On the other hand I was not very fond of the search result quality.
I searched the net a lot for offline search programs but was not able to find anything better than Webetiser. In the end I decided to write one myself, and this was the birth of OfflineSearch. I decided to look at the semantics of the documents to improve the result quality. I gave some tags (e.g. strong, em) more weight than others (e.g. p or span) and added the weights together when tags were nested. I also wanted to evaluate the importance of a document by the links it gets from other pages. As far as I know, this is one of the ideas that search engines like Google use to compute their PageRank.
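The nested tag weighting can be sketched in a few lines of Ruby. This is only an illustration: the weights, the `TAG_WEIGHTS` table and the `score_words` helper are made up for this example and are not taken from the OfflineSearch code, and the regex tokenizer is deliberately naive (a real indexer would use an HTML parser).

```ruby
# Hypothetical weights; the values OfflineSearch really uses are not given here.
TAG_WEIGHTS = { "strong" => 3, "em" => 2, "p" => 1, "span" => 1 }.freeze

# Matches either an opening/closing tag or a run of text between tags.
TOKEN = /<(\/)?(\w+)[^>]*>|([^<]+)/

# Returns a hash of word => accumulated score for one HTML fragment.
# A word's score is the sum of the weights of all tags it is nested in.
def score_words(html)
  scores = Hash.new(0)
  stack = [] # tags that are currently open
  html.scan(TOKEN) do |closing, tag, text|
    if text
      weight = stack.sum { |t| TAG_WEIGHTS.fetch(t, 0) }
      text.downcase.scan(/\w+/) { |word| scores[word] += weight }
    elsif closing
      stack.pop if stack.last == tag
    else
      stack.push(tag)
    end
  end
  scores
end

scores = score_words("<p>plain <strong>bold <em>both</em></strong></p>")
# "plain" scores 1 (p), "bold" 4 (p + strong), "both" 6 (p + strong + em)
```

Tweaking the result quality then comes down to changing the numbers in the weight table.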
I set these goals for OfflineSearch:
- The top 10 of the result set should contain the page that the user was searching for
- The search index should be optimized for large document sets
- A search result should be returned in a decent amount of time
I was not trying to optimize the indexing time in my first approach, because this time is negligible compared to the time a user has to wait for a result to display or the time the user spends looking for the right document. You only have to index once, but many users search often. The index should provide a fast way of searching.
A first prototype
After about three weeks of programming in my leisure time, I had a prototype ready for testing. The first tests on the documentation were quite promising, because my search generated better results than the previous one. At that time the quality was far from good, but by tweaking the tag values it quickly went in the right direction. This is one of the main advantages of OfflineSearch: You can tweak it until it fits your situation. My search index was optimized for a large number of documents and was smaller than what came from Webetiser. The prototype was such a motivation that I pushed version after version until the program got more stable and faster.
At first I had really poor indexing performance. Indexing 1300 documents took about 5 minutes. Of course you cannot really compare a Ruby program against a C++ one, but a factor of 5 to 6 was not what I wanted. Several parts of the program were not optimized. After playing around with the Ruby profiler and the benchmarking tool I found the bottlenecks and quickly removed them. In the end OfflineSearch had really good performance.
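Timing a suspicious step is easy with the Benchmark module from Ruby's standard library. A minimal sketch: the `tokenize` method is a made-up stand-in for one indexing step, and the sample documents are generated on the spot (the count only mirrors the 1300 mentioned above).

```ruby
require "benchmark"

# Hypothetical stand-in for one step of the indexer.
def tokenize(text)
  text.downcase.scan(/\w+/)
end

# Generated sample data, not real documentation.
docs = Array.new(1300) { |i| "Document number #{i} with some sample text" }

# Benchmark.realtime returns the elapsed wall-clock time in seconds as a Float.
elapsed = Benchmark.realtime do
  docs.each { |doc| tokenize(doc) }
end

puts format("tokenized %d documents in %.4f s", docs.size, elapsed)
```

Wrapping each stage of the indexer like this shows quickly which one is worth optimizing first.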
Did you mean … ?
I want to list here some of the problems that I ran into while developing OfflineSearch and did not foresee.
I first decided to nest the arrays in the search index 3 layers deep. This was a really bad idea for performance in IE6. I solved this problem by using only 2 layers and storing each innermost pair as a string that gets split when it is needed. Here is an example:
var ranks = [[[29,4],[47,3],[77,3],[52,2],[58,1]],[[63,62],[66,6],[3,4]]]
var ranks = [['29-4','47-3','77-3','52-2','58-1'],['63-62','66-6','3-4']]
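On the Ruby side, producing the two-layer form could look roughly like this. It is a sketch only: `flatten_ranks`, `to_js` and `unpack_pair` are hypothetical helper names, and what the two numbers of each pair stand for is left open because it is not spelled out here.

```ruby
# The three-layer structure from the first example line.
ranks = [[[29, 4], [47, 3], [77, 3], [52, 2], [58, 1]], [[63, 62], [66, 6], [3, 4]]]

# Turns every innermost [first, second] pair into the string "first-second".
def flatten_ranks(ranks)
  ranks.map { |entries| entries.map { |first, second| "#{first}-#{second}" } }
end

# Renders the two-layer structure as a JavaScript variable declaration.
def to_js(flat)
  inner = flat.map { |entries| "[" + entries.map { |e| "'#{e}'" }.join(",") + "]" }
  "var ranks = [#{inner.join(',')}]"
end

# Reverses the packing of a single entry, as the search side has to do.
def unpack_pair(entry)
  entry.split("-").map(&:to_i)
end

puts to_js(flatten_ranks(ranks))
# prints: var ranks = [['29-4','47-3','77-3','52-2','58-1'],['63-62','66-6','3-4']]
```

The search script then only splits the few entries it really needs, instead of making the browser build thousands of tiny arrays while loading the index.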
Cache everything you can. If you use a regular expression more than once, cache it. If you use the same array more than once, cache it, and so on.
Do not initialize these things more often than is really needed. This gives you an immense performance boost with nearly no effort.
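The same advice holds on either side of the program. As a Ruby illustration (an assumption of this example, since the text does not say which side is meant): a regular expression that is built from a string inside a method gets compiled again on every call, while one stored in a constant is compiled once. The stop word list and the `strip_stop_words` helper are invented for this sketch.

```ruby
STOP_WORDS = %w[the a an and or of].freeze # hypothetical list

# Compiled once when the file is loaded and reused for every document.
STOP_WORD_PATTERN =
  /\b(?:#{STOP_WORDS.map { |w| Regexp.escape(w) }.join("|")})\b/i

# Slow variant for comparison: rebuilds the same pattern on every call.
def strip_stop_words_uncached(text)
  pattern = Regexp.new("\\b(?:#{STOP_WORDS.join('|')})\\b", Regexp::IGNORECASE)
  text.gsub(pattern, "").split.join(" ")
end

# Cached variant: only looks up the precompiled constant.
def strip_stop_words(text)
  text.gsub(STOP_WORD_PATTERN, "").split.join(" ")
end

puts strip_stop_words("The index of a search")
# prints: index search
```

Both variants return the same result; the difference only shows up once the method runs for every word of every document.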
Profiling and Benchmarking
What OfflineSearch is missing right now are some fancy templates for easy integration in various situations. For now it is programmed as a jQuery plugin, but the code base is not heavily bound to jQuery, so it should be rather easy to port it to Prototype, MooTools or whatever. Maybe someone wants to fork it and provide this functionality.
Lots of other features are on my mind to improve OfflineSearch. They are all documented in the project’s ToDo list.
I am glad that my first Ruby project turned out so well. The code base is often not as rubyesque as I wish it was, so I still find myself refactoring code to use more Ruby features and to make the code more readable. One of the gems that were born out of OfflineSearch was FancyLog, which I presented in a previous post. The next big step is to get nearly all configuration out of the code and replace it with JigSaw.