This is an implementation of a series
of retrieval models which improve state-of-the-art retrieval models (including
BM25, language modeling with Dirichlet prior smoothing, PL2, etc.) by properly
lower-bounding the score of term frequency normalization by document length. Please refer to the
following two papers for more details of the algorithms:
[1] Yuanhua Lv and ChengXiang Zhai. "Lower-Bounding Term Frequency
Normalization". In Proceedings of the 20th ACM International
Conference on Information and Knowledge Management (CIKM'11), pages
7-16, 2011. Best Student Paper Award
[2] Yuanhua Lv and ChengXiang Zhai.
"When Documents Are Very Long,
BM25 Fails!". Poster paper. In Proceedings of the 34th Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR'11), pages 1103-1104, 2011.
The main source code file "LBTFEval.cpp" is
implemented in C++ and works with the
Lemur toolkit. It currently does not support Indri search engine, but an
Indri-based implementation will be available soon. Most of the
codes were used in the experiments for our CIKM'11 and SIGIR'11 papers. The algorithm has
only been tested on Lemur 4.10 (probably it can also work with other
versions of lemur, but we haven't tested it yet) in a Linux environment, where the index
type is "key", built using the BuildIndex application provided by Lemur.
Our current implementation includes
BM25L [2], BM25+ [1], Dir+ [1], Piv+ [1],
and PL2+ [1]. A sample parameter file is also provided to show how to
use each retrieval function in details. For example, the key parameters of BM25+
in file param.BM25Plus is shown as follows, where
every parameter is self-explainable with the comments:
<!-- Retrieval model selection: -1 BM25/BM25L; 0
BM25/BM25+; 1 Dir/Dir+; 2 Piv/Piv+; 3
PL2/PL2+ -->
<retModel>0</retModel>
<!-- Standard BM25 parameters -->
<BM25K1>1.6</BM25K1>
<BM25B>0.6</BM25B>
<BM25K3>1000</BM25K3>
<!-- Lower bound parameter; setting it to 0 will degenerate the
retrieval model to its traditional version (i.e., BM25) -->
<delta>1.0</delta>
Other parameters in the param.BM25Plus file are
standard parameters used in Lemur. Besides, we support standard
Lemur query format, as shown below:
<DOC 301>
intern
organ
crime
</DOC>
<DOC 302>
poliomyel
post
polio
</DOC>
<DOC 303>
hubbl
telescop
achiev
</DOC>
where 301-303 are query topic ids.
(Please note that for the above query topics, we have done stemming and stopword
removal.)
To run our algorithm, you need to first install the
Lemur toolkit. See
http://sourceforge.net/apps/trac/lemur/wiki/Compiling and Installing on Linux
and Mac OS X for more details regarding compiling and installing Lemur
toolkit on Linux and/or Mac OS X. After that, change the "prefix"
value in the Makefile file to your
installation path.
Finally, you can compile our algorithm
and run it like this: LBTFEval param.BM25Plus
If you have more questions, please email me (Yuanhua
Lv, ylv2@uiuc.edu)