Friday, 3 September 2010

Something about Lucene.NET - How to highlight terms.

As many of you know, Lucene.NET is a class-by-class, API-by-API port of Java Lucene. It is an open source search engine library and one of the best available.

To work with highlighting we need the Highlighter.Net module, which is available from the Lucene.Net website. I have put a link to the highlighter's SVN repository in the references at the end of this post; you can get the highlighter from there.

Here I am going to discuss highlighting in Lucene.NET. To implement highlighting we need two kinds of objects in our project:
1. A Query object
2. The text string, from which we produce a token stream

I will take you through the code line by line. First we create a QueryScorer object, which is available in the Lucene.Net.Highlight namespace. The QueryScorer class scores each text fragment by the number of unique query terms found in it. Here 'q' is the Query object.

QueryScorer scorer = new QueryScorer(q);

Then we use SimpleHTMLFormatter to format the matched terms. It puts an HTML tag before and after each matched term to highlight it. By default the start and end tags are the bold tags "<B>" and "</B>".

Formatter formatter = new SimpleHTMLFormatter();
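If bold tags are not what you want, SimpleHTMLFormatter also has a constructor that takes the start and end tags. The snippet below is only a small sketch of that: the class name, the method name and the "hit" CSS class are made up for illustration, and it assumes the Highlighter.Net assembly is referenced in your project.

```csharp
using Lucene.Net.Highlight;

// Hypothetical helper class; the names here are illustrative only.
public static class FormatterExamples
{
    // Returns a formatter that wraps each matched term in a <span>,
    // so the highlight can be styled from CSS instead of being bold.
    public static Formatter CreateSpanFormatter()
    {
        // "hit" is an assumed CSS class name, not something Lucene.NET defines.
        return new SimpleHTMLFormatter("<span class=\"hit\">", "</span>");
    }
}
```

The returned formatter can be passed to the Highlighter constructor in the same way as the default one.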

After initialising the formatter, we can initialise the Highlighter object using the formatter and the scorer.

Highlighter highlighter = new Highlighter(formatter, scorer);

Now we need to tokenize the text of our Lucene Document. Here we are using the TokenStream method of the StandardAnalyzer class to get the tokens of our text. "title" is the name of the field the text belongs to.

TokenStream stream = new StandardAnalyzer().TokenStream("title", new StringReader(text));

Now we have most of the things we require for getting the best fragments from our text. By calling the GetBestFragments method of the highlighter object and passing our token stream, our text, the maximum number of fragments to return and a separator to put between fragments, we get our text back in highlighted form.


highlighter.GetBestFragments(stream, text, 50, "....");


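Here is the complete code. It also sets the text fragmenter explicitly; SimpleFragmenter breaks the text into fixed-size fragments.
// Score fragments by the unique query terms they contain ('q' is the Query object).
QueryScorer scorer = new QueryScorer(q);
// Wrap each matched term in the default bold tags.
Formatter formatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(new SimpleFragmenter());
// Tokenize the text of the "title" field.
TokenStream stream = new StandardAnalyzer().TokenStream("title", new StringReader(text));
// Return at most 50 of the best fragments, joined by "....".
return highlighter.GetBestFragments(stream, text, 50, "....");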
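
The code above assumes that 'q' and 'text' already exist. As a rough sketch of how it could be wrapped into one method, the example below builds 'q' with QueryParser and falls back to the plain text when nothing matches. The class name, method name and fallback are my own assumptions, not part of Highlighter.Net. It targets the Lucene.Net 2.9.x API used in this post; the parameterless StandardAnalyzer constructor and the two-argument QueryParser constructor may be marked obsolete or missing in other releases.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Highlight;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Hypothetical helper class; the name and signature are illustrative only.
public static class HighlightHelper
{
    public static string HighlightTitle(string queryText, string text)
    {
        // Assumption: the text comes from a field named "title"
        // that was indexed with StandardAnalyzer.
        Analyzer analyzer = new StandardAnalyzer();

        // Build 'q' by parsing the user's query against the same field and analyzer.
        Query q = new QueryParser("title", analyzer).Parse(queryText);

        QueryScorer scorer = new QueryScorer(q);
        Formatter formatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(formatter, scorer);
        highlighter.SetTextFragmenter(new SimpleFragmenter());

        TokenStream stream = analyzer.TokenStream("title", new StringReader(text));
        string fragments = highlighter.GetBestFragments(stream, text, 50, "....");

        // Assumption: show the original text when no query term was found.
        return string.IsNullOrEmpty(fragments) ? text : fragments;
    }
}
```

A call such as HighlightHelper.HighlightTitle("lucene", "Apache Lucene is a search library") should return the text with the matched word wrapped in the formatter's tags. Note that SimpleHTMLFormatter does not HTML-encode the rest of the text, so encode it yourself if it comes from users.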




For further details, please read the following.

References:
1. http://lucene.apache.org/lucene.net/docs/2.4.0/
2. https://svn.apache.org/repos/asf/lucene/lucene.net/tags/Lucene.Net_2_9_2/contrib/