Snowball stemmers for C#
A stemmer pack for the .NET Framework

Every field has basics that any specialist working in it should know. In computational linguistics, one such essential operation is reducing a word to its base form: lemmatization. It makes it possible to aggregate the different forms of a word so that they share common statistics, to search for a word regardless of its grammatical form, and so on.

Dictionary-based morphological engines give the highest-quality results. For most languages other than English, however, such engines are usually commercial. They are also often written in C++, or have other limitations that make them awkward to use from C#.

Luckily, many research projects (and often non-research ones) can get by with simplified morphological analyzers: stemmers. As a rule, these analyzers do not use huge dictionaries, only a set of heuristics that reduce different forms of a word to the same string. In many cases that is enough. For example, classifying texts by topic works well with these simplified analyzers.
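To make the idea of "a set of heuristics" concrete, here is a deliberately tiny suffix-stripping sketch. It is not the Snowball English algorithm, which is far more elaborate; the class name ToySuffixStemmer, the suffix list, and the minimum stem length of three are all assumptions chosen for illustration.

```csharp
using System;

// Toy illustration only: NOT the Snowball English algorithm.
static class ToySuffixStemmer
{
    // Hypothetical suffix list, tried in order (longest first).
    private static readonly string[] Suffixes = { "ing", "ed", "s" };

    public static string Stem(string word)
    {
        string lower = word.ToLowerInvariant();
        foreach (string suffix in Suffixes)
        {
            // Strip the suffix only if at least three characters remain,
            // so that very short words are returned unchanged.
            if (lower.EndsWith(suffix, StringComparison.Ordinal)
                && lower.Length - suffix.Length >= 3)
            {
                return lower.Substring(0, lower.Length - suffix.Length);
            }
        }
        return lower;
    }
}

static class ToyDemo
{
    static void Main()
    {
        foreach (string word in new[] { "jump", "jumping", "jumps", "jumped" })
            Console.WriteLine(word + " --> " + ToySuffixStemmer.Stem(word));
    }
}
```

All four forms reduce to "jump" here, but such naive rules break quickly: "running" becomes "runn", for instance. Real stemmers add many more rules and conditions to limit errors of this kind.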

The best-known stemmer project is Snowball. A very important feature of the project is its BSD license, which lets anyone use the code with very few restrictions. A small weakness is the limited set of programming languages for which stemmers are generated: only ANSI C and Java. Some time ago, Iveonik Systems adapted the Snowball stemmers for C#. Some of them had already been ported by others and were collected from various sites. The rest we ported ourselves from Java, which was chosen because, in terms of functionality, it is essentially a subset of C#. Then the stemmers were “smoothed out” so that they can all be used comfortably in a uniform style.

Now it is time to share this set with other developers, just as the Snowball authors (and other, often nameless, authors) shared their work with us.

The source code of the stemmers and a simple usage example can be found at this link.

The demo is split into two projects: the stemmer set (StemmersNet.csproj) and a console application that uses three of the stemmers (Russian, English, and German). In total, the set contains stemmers for 14 languages: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian and Spanish.

To use any of the stemmers, follow these steps:
  1. Add a reference to StemmersNet to your project.
  2. Import the Iveonik.Stemmers namespace.
  3. Create an instance of the stemmer for each language you need.
  4. Call the IStemmer.Stem() method to get the base form.

The demo sample shows these steps:

using System;
namespace Iveonik.Stemmers
{
  class Program
  {
    static void Main(string[] args)
    {
      TestStemmer(new RussianStemmer(), "краcОта", "красоту", "красоте", "КрАсОтОй");
      TestStemmer(new EnglishStemmer(), "jump", "jumping", "jumps", "jumped");
      TestStemmer(new GermanStemmer(), "mochte", "mochtest", "mochten", "mochtet");
    }
    // Prints each word next to the stem produced by the given stemmer.
    private static void TestStemmer(IStemmer stemmer, params string[] words)
    {
      Console.WriteLine("Stemmer: " + stemmer);
      foreach (string word in words)
        Console.WriteLine(word + " --> " + stemmer.Stem(word));
    }
  }
}

The program prints:
Stemmer: Iveonik.Stemmers.RussianStemmer
краcОта --> краcот
красоту --> красот
красоте --> красот
КрАсОтОй --> красот
Stemmer: Iveonik.Stemmers.EnglishStemmer
jump --> jump
jumping --> jump
jumps --> jump
jumped --> jump
Stemmer: Iveonik.Stemmers.GermanStemmer
mochte --> mocht
mochtest --> mocht
mochten --> mocht
mochtet --> mochtet

The output shows how stemmers differ from dictionary-based lemmatizers:
  1. As a rule, a stemmer’s output is a truncated substring of the source word, not a real base form (lemma). The substring resembles the lemma, but nothing more.
  2. The heuristics can fail even on common words. The German example shows this: the stem of the last form, “mochtet”, differs from the stem of the other forms.
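
One way to spot such inconsistencies is to group word forms by the stem they receive. The sketch below is only an illustration: the IStemmer interface is a local stand-in whose signature (a string in, a string out) is assumed from the demo, and ReplayStemmer is a hypothetical stub that simply replays the German stems printed in the demo output, so the code runs without StemmersNet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-in for the library interface; the signature is an assumption.
interface IStemmer
{
    string Stem(string word);
}

// Hypothetical stub: replays the German stems printed by the demo output,
// so the sketch runs without the StemmersNet library.
class ReplayStemmer : IStemmer
{
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>
        {
            { "mochte", "mocht" }, { "mochtest", "mocht" },
            { "mochten", "mocht" }, { "mochtet", "mochtet" }
        };

    public string Stem(string word)
    {
        string stem;
        // Unknown words are returned unchanged.
        return Known.TryGetValue(word, out stem) ? stem : word;
    }
}

static class StemGrouping
{
    // Groups word forms by the stem that the given stemmer assigns to them.
    public static Dictionary<string, List<string>> GroupByStem(
        IStemmer stemmer, IEnumerable<string> words)
    {
        return words.GroupBy(w => stemmer.Stem(w))
                    .ToDictionary(g => g.Key, g => g.ToList());
    }

    static void Main()
    {
        var forms = new[] { "mochte", "mochtest", "mochten", "mochtet" };
        foreach (var pair in GroupByStem(new ReplayStemmer(), forms))
            Console.WriteLine(pair.Key + ": " + string.Join(", ", pair.Value));
    }
}
```

With these stems the four forms fall into two groups instead of one, which is exactly the kind of split a dictionary-based lemmatizer would avoid. With the real library, a GermanStemmer instance could presumably be passed in place of the stub.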

And yet stemmers have their place. They are small and, in our case, written in pure C#. Moreover, their quality is good enough for many practical tasks in computational linguistics. Enjoy!

To sum up, the most important links are:
  1. The original Snowball project’s stemmers.
  2. The stemmers ported to C#, with the usage example.

Last edited Sep 2, 2012 at 1:11 PM by ssotnyk, version 8