Lower-bounding Term Frequency Normalization

(Source code download)

This is an implementation of a series of retrieval models which improve state-of-the-art retrieval models (including BM25, language modeling with Dirichlet prior smoothing, PL2, etc.) by properly lower-bounding the score of term frequency normalization by document length. Please refer to the following two papers for more details of the algorithms:

[1] Yuanhua Lv and ChengXiang Zhai. "Lower-Bounding Term Frequency Normalization". In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM'11), pages 7-16, 2011. Best Student Paper Award

[2] Yuanhua Lv and ChengXiang Zhai. "When Documents Are Very Long, BM25 Fails!".  Poster paper. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'11), pages 1103-1104, 2011.


The main source code file "LBTFEval.cpp" is implemented in C++ and works with the Lemur toolkit. It currently does not support Indri search engine, but an Indri-based implementation will be available soon. Most of the codes were used in the experiments for our CIKM'11 and SIGIR'11 papers. The algorithm has only been tested on Lemur 4.10 (probably it can also work with other versions of lemur, but we haven't tested it yet) in a Linux environment, where the index type is "key", built using the BuildIndex application provided by Lemur.

Our current implementation includes BM25L [2], BM25+ [1], Dir+ [1], Piv+ [1], and PL2+ [1]. A sample parameter file is also provided to show how to use each retrieval function in details. For example, the key parameters of BM25+ in file param.BM25Plus is shown as follows, where every parameter is self-explainable with the comments:

<!-- Retrieval model selection:  -1   BM25/BM25L;  0   BM25/BM25+;  1   Dir/Dir+;  2   Piv/Piv+;  3   PL2/PL2+  -->
<retModel>0</retModel>
<!--   Standard BM25 parameters -->
<BM25K1>1.6</BM25K1>
<BM25B>0.6</BM25B>
<BM25K3>1000</BM25K3>
<!--   Lower bound parameter; setting it to 0 will degenerate the retrieval model to its traditional version (i.e., BM25) -->
<delta>1.0</delta>

Other parameters in the param.BM25Plus file are standard parameters used in Lemur. Besides, we support standard Lemur query format, as shown below:
<DOC 301>
intern
organ
crime
</DOC>
<DOC 302>
poliomyel
post
polio
</DOC>
<DOC 303>
hubbl
telescop
achiev
</DOC>

where 301-303 are query topic ids. (Please note that for the above query topics, we have done stemming and stopword removal.)


To run our algorithm, you need to first install the Lemur toolkit. See http://sourceforge.net/apps/trac/lemur/wiki/Compiling and Installing on Linux and Mac OS X for more details regarding compiling and installing Lemur toolkit on Linux and/or Mac OS X. After that, change the "prefix" value in the Makefile file to your installation path.

Finally, you can compile our algorithm and run it like this: LBTFEval param.BM25Plus


If you have more questions, please email me (Yuanhua Lv, ylv2@uiuc.edu)