Posts tagged ‘rubygems’

OfflineSearch

Quite a while ago I started my first Ruby project – OfflineSearch. I know that the name does not sound fancy, but at least it tells you what it does. OfflineSearch is a semantic search mainly for offline html documentation.

Some history

In the company I work our product is delivered with an offline html documentation, that includes a possibility to search for keywords in all documents. This search was powered by Webetiser, which is a tool to generate an index out of html files for a JavaScript powered search. It had several problems – at least in the version that we used:

  1. It was not capable of UTF-8
  2. It ranked the results in a very illogical way – I do not remember the exact algorithm, but the quality of the results was rather poor
  3. It produced a quite large search index. Obviously it was not designed to hold some thousand documents

The main problems where point 1 and 2. On the one hand I had to convert the documentation to ISO-8895-1 and than back again, otherwise the search index was a bit corrupt. This led to quite a lot of other errors that I had to fix. On the other hand I was not very fond of the search result quality.

I searched the net a lot for offline search programs but was not able to find anything better than Webetiser. In the end I decided to write one myself and this was the birth of OfflineSearch. I decided to search for semantics in the documents to improve the result quality. I gave some tags (e.g. strong, em) more weight than others (e.g. p or span) and accumulated the value, if the tags were nested. I also wanted to evaluate the importance of a document by the links it got from other pages – as far as I know this is one algorithm that search engines like google use to generate their page rank.

Goals

I set these goals for OfflineSearch:

  • The top 10 of the result set should contain the page that the user was searching for
  • The search index should be optimized for large document sets
  • A search result should be returned in a decent amount of time

I was not trying to optimize the indexing time in my first approach, because this time is neglect able in comparison to the time a user has to wait for a result to display or the time the user spends for looking for the right document. You only have to index once, but many users search often. The index should provide a fast way for searching.

A first prototype

After about three weeks of programming in my leisure time, I had a prototype ready for testing. The first tests on the documentation were quite promising, because my search generated better results than the former. At that time the quality was far from good, but by tweaking the tag values it quickly went in the right direction. This is one of the main advantages of OfflineSearch: You can tweak it until it fits your situation. My search index was optimized for a large number of documents and was smaller than what came from Webetiser. The prototype was such a motivation, that I pushed version after version until the program got stabler and faster.

At first I had a really poor indexing performance. Indexing 1300 Documents took about 5 minutes. Of course you can not really compare a Ruby program against a C++ one, but a multiplication factor of 5-6 was not what I wanted. Several parts of the program were not optimized. After playing around with the Ruby profiler and the Benchmarking tool I found the bottlenecks and quickly removed them. In the end OfflineSearch had a really good performance.

Did you mean … ?

A feature I also missed from Webetiser was the possibility to suggest some ‘Did you mean…’ terms, if the user misspelled a term or if a term was not in the search index. I did a lot of research and decided to try the double metaphone algorithm. The decision was not easy but this algorithm seemed the only useful one, if you needed support for more languages than English and if you could not have too much load on the processor. With the double metaphone algorithm you only need to compare two strings. You don’t need to calculate the distance between two strings – that is a lot faster and works fine even on a large index. JavaScript is too slow for calculating the distance between lots of terms.

The Ruby version was easy to find, it is located in the Text gem – where else? But I did not find a JavaScript version. With Ruby I could produce a cached version for all terms that went in the search index. But I needed to transform the user input on the client side. Again I had to write it myself, what was quite a tedious job. Character after character I rewrote the Ruby version into JavaScript code. The tests provided by the author were a great help and a bit of motivation as more and more passed. When I finally succeeded in passing all tests, I was able to try it out, and was not too pleased with the results. It was better than nothing, but especially the ranking – again – was not too good. The algorithm does not provide any ranking mechanism, it just tells you whether to terms are similar or not. Finally I ranked the result of the double metaphone algorithm with the Levenshtein distance and voilá the results were better. It is still not the best way but the best way that I found.

The double metaphone algorithm does not provide really good results especially on long words it is rather useless. You can find the JavaScript version of the algorithm in my toolbox on github, if you are interested in trying it out.

Problems

I want to list here some of the problems that I had during developing OfflineSearch, that I did not foresee.

Search index

I first decided to nest arrays in the search index in 3 layers. This was a really bad idea for performance in IE6. I solved this problem by using only 2 layers and splitting the string that is contained. Here is an example:

Bad:

var ranks = [[[29,4],[47,3],[77,3],[52,2],[58,1]],[[63,62],[66,6],[3,4]]]

Good:

var ranks = [[‘29-4’,‘47-3’,‘77-3’,‘52-2’,‘58-1’],[‘63-62’,‘66-6’,‘3-4’]]

Indexing

I first thought REXML would be faster, if the document was valid XHTML. In the end I just used Hpricot for all document crawling.

Caching

Cache everything you can. If you use a regular expression more than once, cache it. If you are using the same array more than once cache it,…

Do not initialize these things more often than it is really needed. This gives you an immense performance boost with nearly no effort.

Profiling and Benchmarking

Ruby and JavaScript both provide excellent tools for profiling and benchmarking, just use them and learn to interpret the results.

Ruby has a good built in tool for benchmarks and a gem for profiling, which is better than the built in version. In Firebug a JavaScript profiler is included.

Future

What my OfflineSearch misses right now are some fancy templates for easy integration in various situations. For now it is programmed as a jQuery plugin but the code base is not heavily bound to jQuery, so it should be rather easy to transform it to Prototype, Mootools or whatever. Maybe someone wants to do a fork and provide this functionality.

Lots of other features are on my mind to improve OfflineSearch. They are all documented in the project’s ToDo list.

Conclusion

I am glad that my first Ruby project turned out that good. The code base is often not as rubyesque as I wish it was, so I find myself still refactoring code to use more Ruby features and to make the code more readable. One of the gems that were born out of OfflineSearch was FancyLog, which I presented you in a previous post. The next big step is to get nearly all configuration out of the code and replace it with JigSaw.

October 27, 2008 at 6:37 am 1 comment

JigSaw – Part I

The option you missed

JigSaw is a project that I’m currently working on. It is still under heavy development and far from being a release candidate. It is supposed to become an all in one config file or option parser and option provider.

What I miss in Ruby out of the box is an easy way to handle the configuration of your application. You have lots of options but you have to write most of the code yourself. My vision is to set up a possible configuration either with an application config file or by passing options to JigSaw and let it handle all the heavy stuff like reading the user configuration, validation and an easy way to access the provided options.

I like the way Javascript provides for accessing hashes with the dot-operator. Like

var fruits = {
  green:'apple',
  yellow:'banana'
};

alert(fruits.yellow)

This notation is easy to read, to understand and to write. I do not know what it is like on an english keyboard, but on a german keyboard it is a pain to write something like this: fruits['curved']['yellow'].... Instead I want to be able to type fruits.curved.yellow and get my banana. This is easily done in Javascript as you have seen, in Ruby it is a bit difficult.

The dot-operator in Ruby is reserved for methods. For example:

class Fruits
  def yellow
    'banana'
  end
end

f = Fruits.new
f.yellow == 'banana'

This works but is very far from where we want to be.

class Fruits
  attr_accessor :yellow, :green
  def initialize
    @yellow = 'banana'
    @green = 'apple'
  end
end

f = Fruits.new
f.yellow != f.green  #bananas are not apples

This gets a bit closer. But still I do not want to have a class, where all methods are predifined. Ruby is a dynamic language. So we can add methods to a class at run time. We can define instance variables and attr_accessors on the fly and we can then access the instance variables with their accessor methods. Sounds easy, lets do it.

class Options
  def initialize
    @stored_options = []
  end
  
  #get the stored options (only names no values)
  def get_options
    @stored_options
  end
  
  #set the class to be immutable - no options can be added after this method is invoked.
  #the function applies recursive to all children
  #all variables that have not recieved a value are set to nil
  def set_immutable
    self.class.class_eval do
      define_method :method_missing do |meth, *args|
        raise NoOptionsEntry, "no option #{meth} defined"
      end
    end
    set_unset_values_to_nil
  end

  private
  #set option accessor methods
  def method_missing(meth, *args)
    self.class.class_eval do
      attr_accessor meth.skip_equal
    end
    @stored_options << meth.skip_equal unless @stored_options.include? meth.skip_equal
    if args.empty?
      __send__("#{meth}=", Options.new)
    else
      __send__(meth, *args)
    end
  end
  
  #set options to nil that have been initialized but have not recieved a value
  def set_unset_values_to_nil
    @stored_options.each do |o|
      if __send__(o).class == JigSaw::Options::Options
        if __send__(o).get_options.empty?
          __send__("#{o}=", nil)
        else
          __send__(o).set_immutable
        end
      end
    end
  end
  
end

This is an excerpt of the options class (you can find the complete on github). The important part is done in the method_missing. When a class method is called that does not exist, it creates attribute accessors for an instance variable named after this missing method and it applies the value, that is supplied as an argument, using the newly created accessor method. For example op.awesome = true would create the accessor methods for awesome and apply the value true to the instance variable awesome.

But this is only one half of the story. We also want to be able to chain options, like op.green.apple = true. This is done by checking if the arguments given to the method are empty. If they are, an options tree is build by setting the return value of the current method to be a new option class, that will hold the sub-options of the current. This goes on recursively until an argument is not empty.

After initialization of the options tree, it is recommended to make it immutable. This sets all variables that have no value set to nil recursively and does not allow any modification of the tree. If not done, you can not distinguish between an unset option or an option that should not exist due to your configuration.

First I thought that this kind of class manipulation could be a bit slow, but my benchmarks showed that setting about 1000 options – even recursively in two levels – was pretty fast. About 0.1s for setting 1000 values on the first level, and about 0.2s for the second level. As you probably will not have 1000 options, I believe that the time is neglect-able. Data access performance is no problem either, because the classes are not modified while accessing their instance variables, they just return values. The manipulation is only performed, when first accessed.

This initialization process is of course a bit complicated and should not be bothering the developer, therefore the JigSaw gem will have a setter class for converting a hash or a yaml file easily to an options class with initialized accessor methods. This will be part of a future post, because the setter part is not ready yet.

October 4, 2008 at 8:15 pm Leave a comment

FancyLog – a proxy for Ruby’s Logger

This is the second post in the toolbox series. FancyLog is basically a proxy for Ruby’s Logger that solves some problems that occurred in my development and enhances slightly the logger API. All original methods should work as expected, so you can use FancyLog as a Logger replacement without any changes to your code.

But lets start from the beginning. In my first project I faced the problem that I did not see errors on the console when they occurred. They were nicely written to the log file or the console, but the log could get really large and I felt not to comfortable searching a large log for possible errors. I wanted an easy way to see the errors on my console, when I was developing, and to log them to the log file when in production mode. Replacing the calls to the logger with puts seemed a bad idea, managing two logger instances, too. So I decided to factor out the logger to a separate gem and that was the birth of FancyLog.

FancyLog uses the singleton pattern, so you always have only one instance flying around your application, agnostic to where and when it was first invoked. At the first call you can pass options to it, to let it behave the way you want it to, but you can change most of them later. The only thing you can not change at run time are the log devices. Maybe this will be included in a future version.

Example

Want an example? Here it is:

log = FancyLog.instance
log.error('An undefined error occurred')
log.just_an_information('I like FancyLog!!!')
log.information_to_err('This is an information')

What happens here?

  • In line 1, we get an instance of FancyLog. Because we do not specify any options, all informational log messages (debug, info, warn) go to STDOUT and all errors (error, fatal) go to STDERR. This is FancyLog’s default behavior, that can be overridden.
  • Line 2 sends an error to the logger, that is printed to STDERR – just like Ruby’s Logger would.
  • Line 3 gets interesting, because this is no more the standard Logger API. FancyLog uses Ruby’s method_missing to determine what you wanted it to do. It scans the provided method for debug, info, warn, error and fatal and calls the appropriate Logger method – in this case info – and passes the arguments to it. If it can not find a match it defaults to the default_level specified in the options.
  • Line 4 sends – as it explicitly says – an information to the error log. How this is done, you ask? You can always tell FancyLog to log messages to err or out with the appropriate ending of your method (_to_err or _to_out).

As you can see, FancyLog enables you to specify log methods that tell you as a programmer what really happens, regardless of the message printed to the user.

If you want to switch from development to production mode, it is just one option, you have to specify in the setup of your logger – log = FancyLog(:errs_to_err => false).instance and you are done. Of course, FancyLog does not perform as well as the original Ruby Logger, but I do not expect logging to be a performance critical task.

Conclusion

To summarize: FancyLog lets you easily define were your log messages are logged to and it lets you name the log method in a way that your code makes more sense and is better readable.

Further reading

October 2, 2008 at 8:36 pm Leave a comment

Emerald

The beginning of a gem factory

A while ago I started to collect my development tools and snippets and put them in my toolbox at github. I will now start to present each of my tools starting with Emerald.

Emerald is a good starting point as it is a RubyGem that creates a bare directory structure for another gem. The default structure, it generates, is:

  [gemname]
    - bin
      [gemname]
    - lib
      [gemname].rb
    - tests
      tc_[gemname].rb
    rakefile
    README
    [gemname].gemspec

If you do not want to generate all directories you simply specify the directories you want from the command line, e.g. emerald gemname lib tests to create only the lib and tests directories.

  • The bin directory contains the executable (named after the gem) which requires the class in the lib directory.
  • The main class lies in the lib directory and contains a bare structure – the class definition and an initialize method
  • The test for the main class is in the tests directory. It is named after the main class and sets a require statement for the main class, so it can be used in the tests. It contains a basic (setup, _tc_method and _teardown_) method structure. All you have to do, is write your tests.

Example

Lets do an example an see what happens:

When we type emerald FancyTest in the console. Emerald generated the following structure for us:

fancy_test
gem directory
fancy_test/bin
binaries directory, used to invoke your gems from the console
fancy_test/bin/fancy_test
this is our executable after installing the gem
fancy_test/fancy_test.gemspec
this is a bare gemspec, where you can define dependencies, tests, documentation, …
fancy_test/lib
the lib directory is the applications main directory where all classes and modules are found
fancy_test/lib/fancy_test.rb
the main class of the application
fancy_test/rakefile
with the rakefile you can perform tasks that are described in the next section in detail
fancy_test/README
the readme is used in the rdoc and is a starting point for users of your application
fancy_test/tests
directory for all your tests
fancy_test/tests/tc_fancy_test.rb
this is the test case for our application that contains a bare unit test case layout

Rakefile

The real magic of Emerald is in the Rakefile. It has predefined tasks to

  • create new classes + associated tests
  • create the gem from the gemspec
  • generate rdoc documentation for development or release
  • run your tests.

The latest gem is stored in the pkg directory, all previous gems are deleted. It is assumed that old gems are no longer needed, because you have them stored in your SCM (don’t you?) and would otherwise just litter up this directory. All this saves a developer a lot of typing and leads to a standard gem development, where you can easily find your way through a gem.

Here is a complete listing of all rake tasks:

  Tasks:
    rake class n=Klass # generate class with test
   GEMS
    rake clear_packages # clear packages
    rake clobber_package # Remove package products
    rake clobber_rdoc # Remove rdoc products
    rake clobber_rdoc_dev # Remove rdoc products
    rake gem # create gem package / Build the gem file
    rake package # Build all the packages
   DOCS
    rake rdoc # Build the rdoc HTML Files
    rake rdoc_dev # Build the rdoc_dev HTML Files
    rake repackage # Force a rebuild of the package files
    rake rerdoc # Force a rebuild of the RDOC files
    rake rerdoc_dev # Force a rebuild of the RDOC files
   TESTS
    rake test # run tests normally
    rake test TEST=just_one_file.rb # run just one test file.
    rake test TESTOPTS="-v" # run in verbose mode

Conclusion

Emerald is designed to keep repetive tasks away from the developer. I do not say that the structure it generates is perfect for everyone. It is just the way I like to write my gems. It does encourage you to write tests and not to leave them out of your programs, because each class you generate has an associated test case.

In my case this gem design pattern has led to a cleaner programming style.

October 1, 2008 at 9:59 pm Leave a comment


Recent Posts

Archives

del.icio.us

Feeds