Posts tagged ‘ruby’
Quite a while ago I started my first Ruby project: OfflineSearch. I know that the name does not sound fancy, but at least it tells you what it does. OfflineSearch is a semantic search, mainly for offline HTML documentation.
- It was not capable of UTF-8
- It ranked the results in a very illogical way – I do not remember the exact algorithm, but the quality of the results was rather poor
- It produced quite a large search index. Obviously it was not designed to hold several thousand documents
The main problems were points 1 and 2. On the one hand, I had to convert the documentation to ISO-8859-1 and then back again, otherwise the search index ended up slightly corrupted. This led to quite a lot of other errors that I had to fix. On the other hand, I was not very fond of the search result quality.
I searched the net a lot for offline search programs but was not able to find anything better than Webetiser. In the end I decided to write one myself, and this was the birth of OfflineSearch. I decided to look at the semantics of the documents to improve the result quality. I gave some tags (e.g. strong, em) more weight than others (e.g. p or span) and added the weights up when tags were nested. I also wanted to evaluate the importance of a document by the links it gets from other pages. As far as I know, this is one of the algorithms that search engines like Google use to compute their page rank.
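The tag weighting described above can be sketched in a few lines. This is only an illustration: the weight values, the constant names and the term_weight helper are assumptions, not the values or code OfflineSearch actually uses.

```ruby
# Hypothetical weights per tag; the real values are tunable and not shown in this post.
TAG_WEIGHTS = { 'strong' => 3, 'em' => 2, 'p' => 1, 'span' => 1 }.freeze
DEFAULT_WEIGHT = 1

# tag_path lists the tags that enclose a word, outermost first.
# Nested tags accumulate: a word inside <p><strong> gets both weights.
def term_weight(tag_path)
  tag_path.inject(0) { |sum, tag| sum + TAG_WEIGHTS.fetch(tag, DEFAULT_WEIGHT) }
end

puts term_weight(%w[p])           # 1
puts term_weight(%w[p strong])    # 4
puts term_weight(%w[p strong em]) # 6
```

Keeping the weights in one table is what makes the ranking easy to tune for a particular set of documents.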
I set these goals for OfflineSearch:
- The top 10 of the result set should contain the page that the user was searching for
- The search index should be optimized for large document sets
- A search result should be returned in a decent amount of time
I was not trying to optimize the indexing time in my first approach, because this time is negligible in comparison to the time a user has to wait for a result to display or the time the user spends looking for the right document. You only have to index once, but many users search often. The index should provide a fast way of searching.
A first prototype
After about three weeks of programming in my leisure time, I had a prototype ready for testing. The first tests on the documentation were quite promising, because my search generated better results than the former one. At that time the quality was far from good, but by tweaking the tag values it quickly went in the right direction. This is one of the main advantages of OfflineSearch: you can tweak it until it fits your situation. My search index was optimized for a large number of documents and was smaller than the one produced by Webetiser. The prototype was such a motivation that I pushed out version after version until the program became more stable and faster.
At first the indexing performance was really poor. Indexing 1300 documents took about 5 minutes. Of course you cannot really compare a Ruby program with a C++ one, but being slower by a factor of 5-6 was not what I wanted. Several parts of the program were not optimized. After playing around with the Ruby profiler and the benchmarking tool I found the bottlenecks and quickly removed them. In the end OfflineSearch had really good performance.
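To give an idea of what such a measurement looks like, here is a minimal sketch using Benchmark from Ruby's standard library. The tokenize method and the sample text are stand-ins invented for this example; the real indexer code is not shown in this post.

```ruby
require 'benchmark'

# Hypothetical stand-in for one step of the indexer.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

sample = 'OfflineSearch is a semantic search for offline HTML documentation. ' * 1_000

# Benchmark.realtime returns the elapsed wall-clock time of the block in seconds.
elapsed = Benchmark.realtime { tokenize(sample) }
puts format('tokenize took %.4f seconds', elapsed)
```

Timing each suspicious step separately like this is a quick way to see where the time goes before reaching for a full profiler run.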
Did you mean … ?
Here I want to list some of the problems I ran into while developing OfflineSearch that I did not foresee.
I first decided to nest arrays in the search index in 3 layers. This was a really bad idea for performance in IE6. I solved this problem by using only 2 layers and splitting the string that is contained. Here is an example:
var ranks = [[[29,4],[47,3],[77,3],[52,2],[58,1]],[[63,62],[66,6],[3,4]]]
var ranks = [['29-4','47-3','77-3','52-2','58-1'],['63-62','66-6','3-4']]
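On the Ruby side, turning the 3-layer structure into the 2-layer one could look roughly like this. It is a sketch under assumptions: the flatten_ranks name is made up, and the post does not say what the two numbers in each pair stand for.

```ruby
require 'json'

# Three layers: a list of entries, each holding pairs of numbers.
nested = [[[29, 4], [47, 3], [77, 3], [52, 2], [58, 1]], [[63, 62], [66, 6], [3, 4]]]

# Join every innermost pair into one string such as "29-4",
# which removes one level of array nesting.
def flatten_ranks(ranks)
  ranks.map { |pairs| pairs.map { |pair| pair.join('-') } }
end

flat = flatten_ranks(nested)
puts "var ranks = #{JSON.generate(flat)}"
```

The search script then only has to split each string at the hyphen when it actually needs the two numbers.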
Cache everything you can. If you use a regular expression more than once, cache it. If you use the same array more than once, cache it, and so on.
Do not initialize these things more often than is really needed. This gives you an immense performance boost with nearly no effort.
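The same advice applies to the Ruby indexer. A small sketch of the idea follows; the stop word list and the method name are invented for the example. The point is that the array and the regular expression built from it are created once when the file is loaded, not on every call.

```ruby
# Built once at load time and reused by every call below.
STOP_WORDS = %w[a an and of the].freeze
STOP_WORD_PATTERN = /\A(?:#{Regexp.union(STOP_WORDS).source})\z/

# Hypothetical helper: split a text into lowercase terms and drop stop words.
def index_terms(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| word =~ STOP_WORD_PATTERN }
end

p index_terms('The birth of OfflineSearch')
```

Had the pattern been built inside index_terms, Ruby would have to assemble and compile it again for every document.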
Profiling and Benchmarking
What OfflineSearch lacks right now are some fancy templates for easy integration in various situations. For now it is programmed as a jQuery plugin, but the code base is not heavily bound to jQuery, so it should be rather easy to port it to Prototype, Mootools or whatever. Maybe someone wants to fork it and provide this functionality.
Lots of other features are on my mind to improve OfflineSearch. They are all documented in the project’s ToDo list.
I am glad that my first Ruby project turned out so well. The code base is often not as rubyesque as I wish it were, so I still find myself refactoring code to use more Ruby features and to make the code more readable. One of the gems that were born out of OfflineSearch was FancyLog, which I presented to you in a previous post. The next big step is to get nearly all configuration out of the code and replace it with JigSaw.
Today I had a really nasty Ruby error that bugged me for quite a while.
I wanted to extract parts of an XML file and insert them into another one using the libxml-ruby gem. From time to time I got the following segmentation fault error.
[BUG] Segmentation fault
ruby 1.8.7 (2008-05-31 patchlevel 0) [i386-mswin32]
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
I tracked down the place in my code where it happened, but could not really explain why it did. I rewrote my code several times, always with the same result. The only thing that helped in the end was to make sure that the Ruby garbage collector was enabled with GC.enable and to run it explicitly with GC.start after each XML file was processed.
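In code, the workaround looks roughly like the sketch below. The process_all method and the file names are placeholders; the actual libxml-ruby calls are left out and would go into the block.

```ruby
# Hypothetical driver: runs the given block for every file and
# forces a garbage collection run after each one.
def process_all(paths)
  GC.enable # make sure the garbage collector is not switched off
  paths.each do |path|
    yield path # the XML extraction and insertion would happen here
    GC.start   # collect garbage right after each file
  end
end

processed = []
process_all(%w[first.xml second.xml]) { |path| processed << path }
puts processed.inspect
```

This does not explain the crash; it only changes when garbage is collected, so treat it as a workaround and not as a fix.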
What bugs me most is the fact that I am not sure what the real problem is. Otherwise I could file a bug report with Ruby or libxml.
A fun way to learn Ruby
Ruby Warrior is a role-playing game where you have to guide a warrior through a dungeon. It starts quite easy, but quickly gets harder. With each level the brave warrior gains abilities that help him get through all the trouble. You control your warrior with Ruby commands like warrior.attack! (see the readme for all commands).
On higher levels life gets really hard for the warrior, and developing an artificial intelligence for the hero gets tricky. Of course you can cheat by defining the exact moves for each level, because the game does not involve any randomness. The code you write to guide your warrior to the end of a level has to be changed quite often. Methods are defined and refactored on nearly every level: because you do not know what awaits you on the next level, it is hard to predict how your methods should be defined.
I encourage you to give the game a try, even if you have no Ruby knowledge at all.
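To give an impression, a very simple player could look like the sketch below. It assumes the usual Ruby Warrior setup, where the game calls play_turn on your Player class once per turn; which abilities (feel, walk!, attack!) are available depends on the level, so check the readme before relying on them.

```ruby
class Player
  # Called by the game once per turn; only one action (a method ending in !) is allowed.
  def play_turn(warrior)
    if warrior.feel.empty?
      warrior.walk!   # nothing in front of us: move on
    else
      warrior.attack! # something is in the way: fight
    end
  end
end
```

Code like this gets rewritten quickly once new abilities and new enemies show up on the next level.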
This is the second post in the toolbox series. FancyLog is basically a proxy for Ruby’s Logger that solves some problems that occurred in my development and slightly enhances the Logger API. All original methods should work as expected, so you can use FancyLog as a Logger replacement without any changes to your code.
But let’s start from the beginning. In my first project I faced the problem that I did not see errors on the console when they occurred. They were nicely written to the log file or the console, but the log could get really large and I did not feel too comfortable searching a large log for possible errors. I wanted an easy way to see the errors on my console while developing, and to log them to the log file in production mode. Replacing the calls to the logger with puts seemed a bad idea, and so did managing two logger instances. So I decided to factor the logger out into a separate gem, and that was the birth of FancyLog.
FancyLog uses the singleton pattern, so you always have only one instance flying around your application, no matter where and when it was first invoked. At the first call you can pass options to it to make it behave the way you want, and you can change most of them later. The only things you cannot change at run time are the log devices. Maybe this will be included in a future version.
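A singleton that accepts options on its first call can be sketched as follows. This is not FancyLog's actual code: the class name SingleLog is invented, and the sketch ignores thread safety.

```ruby
class SingleLog
  attr_reader :options

  # Options only take effect on the very first call;
  # every later call returns the object created back then.
  def self.instance(options = {})
    @instance ||= new(options)
  end

  def initialize(options)
    @options = options
  end

  # Nobody outside the class may create a second instance.
  private_class_method :new
end

first  = SingleLog.instance(:default_level => :info)
second = SingleLog.instance
puts first.equal?(second) # true
```

Ruby's standard library also ships a Singleton module, but its instance method takes no arguments, which is why the sketch rolls its own.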
Want an example? Here it is:
log = FancyLog.instance
log.error('An undefined error occurred')
log.just_an_information('I like FancyLog!!!')
log.information_to_err('This is an information')
What happens here?
- In line 1, we get an instance of FancyLog. Because we do not specify any options, all informational log messages (debug, info, warn) go to STDOUT and all errors (error, fatal) go to STDERR. This is FancyLog’s default behavior, which can be overridden.
- Line 2 sends an error to the logger that is printed to STDERR, just like Ruby’s Logger would.
- Line 3 gets interesting, because this is no longer the standard Logger API. FancyLog uses Ruby’s method_missing to determine what you wanted it to do. It scans the name of the called method for debug, info, warn, error and fatal and calls the appropriate Logger method (in this case info), passing the arguments on to it. If it cannot find a match, it falls back to the default_level specified in the options.
- Line 4 sends – as it explicitly says – an information to the error log. How this is done, you ask? You can always tell FancyLog to log messages to err or out with the appropriate ending of your method (
As you can see, FancyLog enables you to specify log methods that tell you as a programmer what really happens, regardless of the message printed to the user.
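The method_missing trick can be illustrated with a stripped-down proxy. This is a sketch of the technique, not FancyLog's implementation: the class name NameSniffingLog, the constructor arguments and the exact _to_err / _to_out suffixes are assumptions made for this example.

```ruby
require 'logger'

class NameSniffingLog
  LEVELS = %w[debug info warn error fatal].freeze
  ERROR_LEVELS = %w[error fatal].freeze

  def initialize(out = $stdout, err = $stderr, default_level = 'info')
    @out = Logger.new(out)
    @err = Logger.new(err)
    @default_level = default_level
  end

  # Every call to an undefined method ends up here.
  def method_missing(name, *args, &block)
    text = name.to_s
    # The first level name found inside the method name wins.
    level = LEVELS.find { |candidate| text.include?(candidate) } || @default_level
    target =
      if text.end_with?('_to_err') then @err
      elsif text.end_with?('_to_out') then @out
      elsif ERROR_LEVELS.include?(level) then @err
      else @out
      end
    target.public_send(level, *args, &block)
  end
end

log = NameSniffingLog.new
log.error('An undefined error occurred')         # goes to STDERR
log.just_an_information('I like this!')          # goes to STDOUT as info
log.information_to_err('This is an information') # info, but sent to STDERR
```

Because the level is guessed from the method name, a badly chosen name can end up at an unexpected level, so descriptive names that contain exactly one level word work best.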
If you want to switch from development to production mode, there is just one option you have to specify in the setup of your logger: log = FancyLog(:errs_to_err => false).instance and you are done. Of course, FancyLog does not perform as well as the original Ruby Logger, but I do not expect logging to be a performance-critical task.
To summarize: FancyLog lets you easily define where your log messages are logged to, and it lets you name the log methods in a way that makes your code more meaningful and more readable.
- Ruby’s Logger
- Singleton pattern
The beginning of a gem factory
Emerald is a good starting point, as it is a RubyGem that creates a bare directory structure for another gem. The default structure it generates is:
[gemname]
  - bin
      [gemname]
  - lib
      [gemname].rb
  - tests
      tc_[gemname].rb
  rakefile
  README
  [gemname].gemspec
If you do not want to generate all directories, you simply specify the directories you want on the command line, e.g. emerald gemname lib tests to create only the lib and tests directories.
- The bin directory contains the executable (named after the gem) which requires the class in the lib directory.
- The main class lies in the lib directory and contains a bare structure – the class definition and an initialize method
- The test for the main class is in the tests directory. It is named after the main class and contains a require statement for the main class, so it can be used in the tests. It contains a basic method structure (setup, tc_method and teardown). All you have to do is write your tests.
Let’s do an example and see what happens:
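The directory layout described above can be reproduced with a few lines of Ruby. The sketch below is not Emerald's code: the scaffold_gem helper and the LAYOUT table are made up for illustration, and it only creates empty files instead of filling them with a class or test skeleton.

```ruby
require 'fileutils'
require 'tmpdir'

# Directory => file name pattern; %s is replaced by the gem name.
LAYOUT = { 'bin' => '%s', 'lib' => '%s.rb', 'tests' => 'tc_%s.rb' }.freeze

# Hypothetical helper: creates the bare structure below root and returns its path.
def scaffold_gem(name, root, dirs = LAYOUT.keys)
  base = File.join(root, name)
  FileUtils.mkdir_p(base)
  dirs.each do |dir|
    FileUtils.mkdir_p(File.join(base, dir))
    FileUtils.touch(File.join(base, dir, format(LAYOUT.fetch(dir), name)))
  end
  ['rakefile', 'README', "#{name}.gemspec"].each do |file|
    FileUtils.touch(File.join(base, file))
  end
  base
end

# Try it in a temporary directory, creating only lib and tests.
Dir.mktmpdir do |tmp|
  base = scaffold_gem('gemname', tmp, %w[lib tests])
  puts Dir.glob(File.join(base, '**', '*')).map { |path| path.sub("#{tmp}/", '') }.sort
end
```

Passing a shorter directory list mirrors the command line call where only lib and tests are requested.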
When we type emerald FancyTest in the console, Emerald generates the following structure for us:
- gem directory
- binaries directory, used to invoke your gems from the console
- this is our executable after installing the gem
- this is a bare gemspec, where you can define dependencies, tests, documentation, …
- the lib directory is the application’s main directory where all classes and modules are found
- the main class of the application
- with the rakefile you can perform tasks that are described in the next section in detail
- the readme is used in the rdoc and is a starting point for users of your application
- directory for all your tests
- this is the test case for our application that contains a bare unit test case layout
The real magic of Emerald is in the Rakefile. It has predefined tasks to
- create new classes + associated tests
- create the gem from the gemspec
- generate rdoc documentation for development or release
- run your tests.
The latest gem is stored in the pkg directory; all previous gems are deleted. It is assumed that old gems are no longer needed, because you have them stored in your SCM (don’t you?), and that they would otherwise just litter this directory. All this saves a developer a lot of typing and leads to a standardized gem layout in which you can easily find your way around.
Here is a complete listing of all rake tasks:
Tasks:
  rake class n=Klass                 # generate class with test
GEMS
  rake clear_packages                # clear packages
  rake clobber_package               # Remove package products
  rake clobber_rdoc                  # Remove rdoc products
  rake clobber_rdoc_dev              # Remove rdoc products
  rake gem                           # create gem package / Build the gem file
  rake package                       # Build all the packages
DOCS
  rake rdoc                          # Build the rdoc HTML Files
  rake rdoc_dev                      # Build the rdoc_dev HTML Files
  rake repackage                     # Force a rebuild of the package files
  rake rerdoc                        # Force a rebuild of the RDOC files
  rake rerdoc_dev                    # Force a rebuild of the RDOC files
TESTS
  rake test                          # run tests normally
  rake test TEST=just_one_file.rb    # run just one test file.
  rake test TESTOPTS="-v"            # run in verbose mode
Emerald is designed to keep repetitive tasks away from the developer. I do not say that the structure it generates is perfect for everyone. It is just the way I like to write my gems. It does encourage you to write tests and not to leave them out of your programs, because each class you generate has an associated test case.
In my case this gem design pattern has led to a cleaner programming style.