CS410 Assignment #3: Dirichlet prior and model-based feedback
(due Wednesday, March 12, 2008, in Class)
Introduction
The goal of this assignment is to implement and experiment with two retrieval algorithms using language models. You will be using a "light" version of the Lemur toolkit, which is available here. Note that while the regular Lemur can run on both Windows and Unix, this light version is only ready for running on Unix. (The source code is portable,
so you can create a Windows project manually by adding all the source files to
your project space if you use visual c++.) All of you should have an account
in the Computer Science Instructional Laboratories (CSIL), and you are strongly encouraged to use the machines in the Linux lab. Contact CSIL if you have any problem with your account.
The general usage of Lemur
can be found online here. You
will need to know how to compile the toolkit and how to modify an application.
Basically, you go to the root directory, and type in "./configure" and then "gmake". If it is the first time you type "gmake" for this package, it will take a few minutes. If you get any error message during the compiling, please try to do it again at some other off-peak time.
Every time you change a program, just go to the root directory and do "gmake".
The code you need to complete is in a file named "FBEval.cpp" in
the directory "app/src". You need to implement two scoring functions
computeWeight and computeAdjustedScore, and to complete
the feedback model computation function "computeFeedbackModel", which is almost complete except that a few lines are missing. You will
need to completely understand how the function computeFeedbackModel works and how the EM algorithm is supposed to work, in order to be able to insert the missing code.
The scoring part of the code implements the following general scoring function:
s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d)
where,
q =(q1,q2,...,qN) is a query,
d=(d1,d2,...,dN) is a document,
and q1,...,qN and d1,...,dN are terms.
That is, the score of a document d against a query q is a
function g of the accumulated weight w for each matched term.
The score is thus determined by
two functions g and w; both may depend on the whole query or
document. The function w gives the weight of each matched term,
and is implemented by computeWeight in the code;
the function g makes it possible to perform any
further transformation of the sum of the weight of all matched terms based on the
"summary" information of a query or a document (e.g., document length), and is
implemented by computeAdjustedScore in the code.
The function retrieval is the main scoring function. Basically, for each query, it would go through
each query term, and look up the inverted list. Then, it would iterate over
all the entries in the inverted list and for each entry, compute the
contributed weight to the score of the corresponding document. The
contributed weight is to be computed by computeWeight.
At the end, depending on the scoring formula, you may need to adjust the score of each document by some document-dependent (but term-independent) quantity.
This is the task of computeAdjustedScore.
You will need to understand how
to access information such as the length of a document, the average length
of documents in the collection, the total counts of a term in the collection,
and the total counts of all the terms in the collection. (Some of these counts have an integer type, so you may need to do type conversion when they are involved in division. ) All these information
is available through an abstract Index class interface. See the Lemur documentation.
Tasks
- [25 points] Implement the KL-divergence scoring formula with Dirichlet prior for
smoothing the document language model as described on
this note (formula 2). Here are some explanations about the code. First, the function computeWeight gets the probability of a term given by the query model.
Second, the function computeAdjustedScore
gets a sum of all the query term probabilities, which is always 1 in your assignment,
because we re-normalize the probabilities after truncating the model by taking only
terms with probabilities larger than 0.001.
A detailed description of the Dirichlet prior smoothing method can be found
in this paper. You can use the variable LocParam::dirPrior provided in
the code for passing in a value for the smoothing parameter. You can also
use any other method that you like to pass in parameter values. Your implementation would involve completing two functions computeWeight and
computeAdjustedScore. There is no need to complete the function
computeFeedbackModel at this point.
Test your code with the CACM test collection included in the data directory of the Lemur toolkit. Specifically, go to the data directory and do the following:
- First, build the index by running "../app/obj/BuildIndex build_param source".
- Then, try the retrieval evaluation program by running "../app/obj/FBEval init_ret_param". You should get retrieval results written in a file
named "initres". Use "perl ../app/src/ireval.pl -j qrel3column < initres " to generate precision/recall performance of the results in "initres".
By tuning the parameters, you should be able to easily get an average precision (over all the topics) above 0.25,
so if you can not get it, you probably have a bug in the implementation.
The file "init_ret_param" provides the parameters for the evaluation program to do an initial retrieval with the Dirichlet prior smoothing method.
That is, it uses the query empirical distribution to estimate the query model, so the KL-divergence scoring formula would be equivalent to query likelihood scoring. You can change the values in this file to use different queries/databases as well as to use different values for the Dirichlet prior parameter, which
is set through the variable dirPrior; inside the code, the value is available to weighting functions through a variable in the class "LocParam" (i.e., LocParam::dirPrior).
- [15 points]
Evaluate the Dirichlet prior query likelihood retrieval formula with the CACM test collection included in the data directory of the Lemur toolkit. Change the values of the smoothing parameter (e.g., ranging from 100 to 10,000), and evaluate the performance using the evaluation perl script included
in Lemur ( app/src/ireval.pl). Plot how the average retrieval precision (over all the topics) changes as the parameter value changes. Note that the Dirichlet prior query likelihood method is the one
presented in Zhai and Lafferty 2001, sigir. In terms of scoring, this formula is
equivalent to using the relative query term frequency to compute the query
model in the KL-divergence retrieval method. See the notes for a detailed explanation .
Is the performance sensitive to
the setting of the smoothing parameter?
How is the best performance (measured by the mean average precision) of Dirichlet prior compared with
the best performance that you obtained for the pivoted normalization method and the BM25 method?
- [20 points]
Complete the implementation of the EM algorithm for mixture model estimation
in the function computeFeedbackModel. The locations of the missing
lines are indicated in the body of the function. The code tests whether
a set of feedback documents has been specified (parameter feedbackDocs), and decides whether it should do feedback accordingly.
Assuming that you have generated some initial retrieval results in
the file initres, you can use the parameter file pseudo_fb_param to test your code by typing in " FBEval pseudo_fb_param".
If your implementation is correct, you should see the log likelihood value
(written to stderr) increases at each iteration of the EM algorithm and eventually converges.
- [15]
The feedback retrieval results obtained using pseudo_fb_param are based on using top 5 documents in initres. The new feedback results are in psfbres. Use the evaluation perl script to compute the precision/recall figures for psfbres.
Compare this pseudo feedback performance with the initial retrieval performance. Does pseudo feedback improve the overall average precision? Does it improve the performance for every topic? Plot the precision-recall curve
(i.e., recall levels as x-axis and precision at different levels of recall as y-axis) for both the initial retrieval and the pseudo feedback (use the average performance figures over all topics). If you are not sure about what a precision-recall curve is, read
more about "TREC measures".
Compare the two curves, and discuss where at the curves pseudo feedback
improves the performance most significantly. Can you explain why?
- [10 points]
Now, try using rel_fb_param. This parameter file specifies the
relevance judgment file (qrel3column) as the feedback documents,
so we are now doing relevance feedback (with 5 relevant
documents). The results will
be written in a file named relfbres. Again, use the evaluation
perl script to compute the precision/recall figures for relfbres.
How is the performance of relevance feedback compared with that of
pseudo feedback, where we simply assume that the top 5 documents to
be relevant. Try to use only 2 relevant documents for relevance feedback
(by changing the parameter fbDocCount). How is the performance
compared with that of pseudo feedback?
- [5 points]
Pick any of the feedback runs, and examine the file query.update
which is a dump of the updated query model. The terms with a prefix of
"+" are the new terms introduced through feedback. Look at
a few topics. Are those new terms closely related to the topic?
Try a few different values of lambda in the parameter file (i.e. noise parameter in the mixture model). How does it affect the nature of new terms
introduced to the query model?
- [10 points]
There are at least four parameters that may need to be tuned for
pseudo feedback: (1) the interpolation parameter alpha;
(2) the noise parameter lambda; (3) the number of top
documents to use; (4) the Dirichlet prior smoothing parameter.
Pick at least one of the parameters, try
out a few different values, plot how the retrieval performance
(as measured by both precision at 20 documents and the average
precision) varies according to the parameter(s) that you chose.
Try to provide an explanation for any pattern that you observe.
What to turn in
A. Please turn in a hardcopy of your written answers at the class.
B. Please pack all the following files into one single zip file or tar file, name it as "assign3-YourNetID", then email
the package to "maryamcs410 AT gmail DOT com".
- Your complete code (i.e.,FBEval.cpp).
- Three retrieval results (i.e., ranked list of document IDs with scores)
as generated by "FBEval.cpp" when using
init_ret_param, pseudo_fb_param, and rel_fb_param,
respectively.