This is an implementation of a series
of retrieval models which improve state-of-the-art retrieval models (including
BM25, language modeling with Dirichlet prior smoothing, PL2, etc.) by properly
lower-bounding the score of term frequency normalization by document length. Please refer to the
following two papers for more details of the algorithms:
 Yuanhua Lv and ChengXiang Zhai. "Lower-Bounding Term Frequency Normalization". In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM'11), pages 7-16, 2011. Best Student Paper Award
 Yuanhua Lv and ChengXiang Zhai.
"When Documents Are Very Long,
BM25 Fails!". Poster paper. In Proceedings of the 34th Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR'11), pages 1103-1104, 2011.
The main source code file "LBTFEval.cpp" is implemented in C++ and works with the Lemur toolkit. It currently does not support Indri search engine, but an Indri-based implementation will be available soon. Most of the codes were used in the experiments for our CIKM'11 and SIGIR'11 papers. The algorithm has only been tested on Lemur 4.10 (probably it can also work with other versions of lemur, but we haven't tested it yet) in a Linux environment, where the index type is "key", built using the BuildIndex application provided by Lemur.
Our current implementation includes
BM25L , BM25+ , Dir+ , Piv+ ,
and PL2+ . A sample parameter file is also provided to show how to
use each retrieval function in details. For example, the key parameters of BM25+
in file param.BM25Plus is shown as follows, where
every parameter is self-explainable with the comments:
<!-- Retrieval model selection: -1 BM25/BM25L; 0 BM25/BM25+; 1 Dir/Dir+; 2 Piv/Piv+; 3 PL2/PL2+ -->
<!-- Standard BM25 parameters -->
<!-- Lower bound parameter; setting it to 0 will degenerate the retrieval model to its traditional version (i.e., BM25) -->
Other parameters in the param.BM25Plus file are standard parameters used in Lemur. Besides, we support standard Lemur query format, as shown below:
where 301-303 are query topic ids. (Please note that for the above query topics, we have done stemming and stopword removal.)
To run our algorithm, you need to first install the Lemur toolkit. See http://sourceforge.net/apps/trac/lemur/wiki/Compiling and Installing on Linux and Mac OS X for more details regarding compiling and installing Lemur toolkit on Linux and/or Mac OS X. After that, change the "prefix" value in the Makefile file to your installation path.
Finally, you can compile our algorithm
and run it like this: LBTFEval param.BM25Plus
If you have more questions, please email me (Yuanhua Lv, firstname.lastname@example.org)