CS410
Assignment #2: Pivoted Normalization vs. BM25 (Okapi)
(due
Wednesday, Feb 27, 2008, in Class)
Introduction
The goal of this assignment is to implement two basic traditional retrieval
algorithms with the Lemur toolkit (in
C++) and to experiment with the algorithms on some sample text data. You will be
using a "light" version of the Lemur toolkit, and are given an
incomplete implementation of a simple retrieval system, which has
everything implemented but the scoring function. The code along with the
Lemur can be downloaded from here.
Note that while the regular Lemur can run on both Windows and Unix, this light
version is only ready for running on Unix, and has only been tested on Linux..
(The source code is portable, so you can create a Windows project manually by
adding all the source files to your project space if you use visual c++.) All of
you should have an account in the Computer Science Instructional Laboratories
(CSIL), and you are strongly encouraged to use the machines in the Linux lab. Contact CSIL if you have any
problem with your account.
The usage of Lemur can be found online here. You will need to know how to
compile the toolkit and how to modify an application. Basically, you go to the
root directory, and type in "./configure" and then "gmake". Every time you
change a program, just go to the root directory and do "gmake".
The code you need to complete is in a file named "RetrievalEval.cpp"
in the directory "app/src". You need to implement two scoring functions
computeWeight and computeAdjustedScore, and to complete the
main scoring function "retrieval", which is already implemented except
that one line is missing. You will need to completely understand how the
function retrieval works in order to be able to insert the weighting
code.
The provided code implements the following general scoring function:
s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d)
where,
q =(q1,q2,...,qN) is a query,
d=(d1,d2,...,dN) is a document, and
q1,...,qN and
d1,...,dN are terms. That is, the score of a
document d against a query q is a function g of the
accumulated weight w for each matched term. The score is thus
determined by two functions g and w; both may depend on the
whole query or document. The function w gives the weight of each
matched term, and is implemented by computeWeight in the code; the
function g makes it possible to perform any further transformation of
the sum of the weight of all matched terms based on the "summary" information of
a query or a document (e.g., document length), and is implemented by
computeAdjustedScore in the code.
The function retrieval is the main scoring function. Basically, for
each query, it would go through each query term, and look up the inverted list.
Then, it would iterate over all the entries in the inverted list and for each
entry, compute the contributed weight to the score of the corresponding
document. The contributed weight is to be computed by computeWeight. At
the end, depending on the scoring formula, you may need to adjust the score of
each document by some document-dependent (but term-independent) quantity. This
is the task of computeAdjustedScore.
You will need to understand how to access information such as the length of a
document, the average length of documents in the collection, the total counts of
a term in the collection, and the total counts of all the terms in the
collection. (Some of these counts have an integer type, so you may need to do
type conversion when they are involved in division. ) All these information is
available through an abstract Index class interface. See Section 2.4 of the Lemur documentation.
Tasks
- [15 points] Compare the TF-IDF pivoted normalization formula and Okapi
BM25 formula analytically. Both formulas are given in Table 1 of Singhal's review paper.(Note that
there is an error in the Okapi formula.). What are the
common statistical information about documents and queries that they both use?
How are the two formulas similar to each other, and how are they different?
- [35 points] Implement the TF-IDF pivoted normalization method and the BM25
(i.e., Okapi) retrieval formula given in Table 1 of Singhal's review paper.(Note that
there is an error in the Okapi formula.) For the Okapi formula, set
k3=1000 and k1=1.2 (you can hard code it, if you want), so
that you have only one parameter b to vary. In this way, both
algorithms have precisely one parameter to tune.
- [20 points] Test both algorithms with the CACM test collection included in
the data directory of the Lemur toolkit.
The CACM collection is a collection of titles
and abstracts from the journal CACM. There are about 3,000 documents, and 64
test queries. For each query, human assessors have read all the documents and
judged which of them are relevant. To test your code with the CACM collection,
go to the data directory and do the following:
- First, build the index by running "../app/obj/BuildIndex
build_param source".
- Then, try the retrieval evaluation program by running
"../app/obj/RetrievalEval ret_param". The parameter file
ret_param specifies all the parameters needed for the retrieval
experiment, including information such as where the queries are, where the
index is, which retrieval method to use, and what weighting parameter values
to use if any. After running RetrievalEval, you should get
retrieval results written in a file named "res". Use "perl
../app/src/ireval.pl -j qrel3column < res " to generate
precision/recall performance of the results in "res". The file
qrel3column is the relevance judgments. By tuning the parameters,
you should be able to easily get an average precision (over all the topics)
above 0.25. The file "ret_param" provides the parameters for the
evaluation program. You can change the values in this file to use different
weighting parameters or different queries/databases. The length
normalization parameter (i.e., s or b) is set through a
variable named "param" in the parameter file. Inside the code, the value is
available to weighting functions through a variable in the class
"LocParam" (i.e., LocParam::weightParam).
Change the values of the parameter in each algorithm (between 0 and 1), and
evaluate the performance using the evaluation perl script included in Lemur (
app/src/ireval.pl). Plot how the non-interpolated average precision
(averaged over all the queries) changes as the parameter value changes.
Compare the two algorithms' performance and behavior. Is the performance
sensitive to the setting of the parameter in each of them? Do they perform
similarly or differently? When optimally tuned, which method gives the best
performance?
- [15 points] In both retrieval formulas, we intend to penalize long
documents. To avoid over-penalizing a long document, we may expect any
reasonable retrieval formula to satisfy the following length normalization
constraint: If we duplicate a document to obtain a new document that is twice
as long as the original one, the score of the new document w.r.t. the query
should be no less than the score of the original document. Formally, let
c(w,d) be the counts of word w in document d. Our constraint is: given any two
documents d1 and d2, if for any word w, c(w,d2)=2*c(w,d1) (thus d2 is twice as
long as d1), then we require that the score of d2 w.r.t. any query should be
at least as high as the score of d1 w.r.t. the same query. Check both
retrieval formulas you implemented to see if they satisfy this constraint
unconditionally. If not, can they satisfy the constraint for some values of
the parameter? Is there any relation between this constraint and the empirical
performance of the corresponding retrieval formula?
- [15 points] Can you think of any way to further improve the performance
for either of the two methods? Try some of the following ideas, or, better,
any idea of your own.
- Alternative TF-IDF formulas: E.g., a different way of normalizing
document TF; a different IDF formula (e.g., log(N/n),where N is the total
counts of all words in the collection, and n is the total counts of a
particular word); a different way of achieving document length
normalization. Read [Fang & Zhai 05]
for other effective retrieval functions and read
[Zobel & Moffat 98] for many different ways of doing TF-IDF weighting.
- Alternative indexing units:
E.g., remove some stop words in the source file, or remove tokens that are
clearly noise. Group a word pair that occurs frequently as one single
phrase. Or use ltchunk from assignment one to identify noun phrases
and then indexing both noun phrases and single component words.
Describe the ideas that you tried and how they work.
What to turn in
A. Please turn in a hardcopy of your written answers at the class.
B. Please pack all the following files into one single zip file or tar file,
name it as "assign2-YOURNETID", then submit the package to our TA, Maryam
Karimzadehgan by email (maryamcs410@gmail.com)
- Your complete code (i.e.,RetrievalEval.cpp): If you have a separate file
for each retrieval method, then include all the files. In general, include any
code that you have modified.
- Three retrieval results, i.e., ranked list of document IDs with scores as
generated by "RetrievalEval.cpp" (NOT the precision/recall files
generated by "ireval.pl"!) : (1) Result for pivoted normalization (s=0.2); (2)
Result for Okapi (k1=1.2, b=0.75, k3=1000); (3) The best result you have ever
got with any method explored in this assignment.