CS410 Assignment #3: Dirichlet prior and model-based feedback
(due Wednesday, March 12, 2008, in Class)

Introduction

The goal of this assignment is to implement and experiment with two retrieval algorithms using language models. You will be using a "light" version of the Lemur toolkit, which is available here. Note that while the regular Lemur can run on both Windows and Unix, this light version is only ready for running on Unix. (The source code is portable, so you can create a Windows project manually by adding all the source files to your project space if you use visual c++.) All of you should have an account in the Computer Science Instructional Laboratories (CSIL), and you are strongly encouraged to use the machines in the Linux lab. Contact CSIL if you have any problem with your account.

The general usage of Lemur can be found online here. You will need to know how to compile the toolkit and how to modify an application. Basically, you go to the root directory, and type in "./configure" and then "gmake". If it is the first time you type "gmake" for this package, it will take a few minutes. If you get any error message during the compiling, please try to do it again at some other off-peak time. Every time you change a program, just go to the root directory and do "gmake".

The code you need to complete is in a file named "FBEval.cpp" in the directory "app/src". You need to implement two scoring functions computeWeight and computeAdjustedScore, and to complete the feedback model computation function "computeFeedbackModel", which is almost complete except that a few lines are missing. You will need to completely understand how the function computeFeedbackModel works and how the EM algorithm is supposed to work, in order to be able to insert the missing code.

The scoring part of the code implements the following general scoring function:

s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d)
where,

q =(q1,q2,...,qN) is a query, d=(d1,d2,...,dN) is a document, and q1,...,qN and d1,...,dN are terms. That is, the score of a document d against a query q is a function g of the accumulated weight w for each matched term. The score is thus determined by two functions g and w; both may depend on the whole query or document. The function w gives the weight of each matched term, and is implemented by computeWeight in the code; the function g makes it possible to perform any further transformation of the sum of the weight of all matched terms based on the "summary" information of a query or a document (e.g., document length), and is implemented by computeAdjustedScore in the code.

The function retrieval is the main scoring function. Basically, for each query, it would go through each query term, and look up the inverted list. Then, it would iterate over all the entries in the inverted list and for each entry, compute the contributed weight to the score of the corresponding document. The contributed weight is to be computed by computeWeight. At the end, depending on the scoring formula, you may need to adjust the score of each document by some document-dependent (but term-independent) quantity. This is the task of computeAdjustedScore.

You will need to understand how to access information such as the length of a document, the average length of documents in the collection, the total counts of a term in the collection, and the total counts of all the terms in the collection. (Some of these counts have an integer type, so you may need to do type conversion when they are involved in division. ) All these information is available through an abstract Index class interface. See the Lemur documentation.

Tasks

  1. [25 points] Implement the KL-divergence scoring formula with Dirichlet prior for smoothing the document language model as described on this note (formula 2). Here are some explanations about the code. First, the function computeWeight gets the probability of a term given by the query model. Second, the function computeAdjustedScore gets a sum of all the query term probabilities, which is always 1 in your assignment, because we re-normalize the probabilities after truncating the model by taking only terms with probabilities larger than 0.001.

    A detailed description of the Dirichlet prior smoothing method can be found in this paper. You can use the variable LocParam::dirPrior provided in the code for passing in a value for the smoothing parameter. You can also use any other method that you like to pass in parameter values. Your implementation would involve completing two functions computeWeight and computeAdjustedScore. There is no need to complete the function computeFeedbackModel at this point.

    Test your code with the CACM test collection included in the data directory of the Lemur toolkit. Specifically, go to the data directory and do the following:

  2. [15 points] Evaluate the Dirichlet prior query likelihood retrieval formula with the CACM test collection included in the data directory of the Lemur toolkit. Change the values of the smoothing parameter (e.g., ranging from 100 to 10,000), and evaluate the performance using the evaluation perl script included in Lemur ( app/src/ireval.pl). Plot how the average retrieval precision (over all the topics) changes as the parameter value changes. Note that the Dirichlet prior query likelihood method is the one presented in Zhai and Lafferty 2001, sigir. In terms of scoring, this formula is equivalent to using the relative query term frequency to compute the query model in the KL-divergence retrieval method. See the notes for a detailed explanation . Is the performance sensitive to the setting of the smoothing parameter? How is the best performance (measured by the mean average precision) of Dirichlet prior compared with the best performance that you obtained for the pivoted normalization method and the BM25 method?
  3. [20 points] Complete the implementation of the EM algorithm for mixture model estimation in the function computeFeedbackModel. The locations of the missing lines are indicated in the body of the function. The code tests whether a set of feedback documents has been specified (parameter feedbackDocs), and decides whether it should do feedback accordingly. Assuming that you have generated some initial retrieval results in the file initres, you can use the parameter file pseudo_fb_param to test your code by typing in " FBEval pseudo_fb_param". If your implementation is correct, you should see the log likelihood value (written to stderr) increases at each iteration of the EM algorithm and eventually converges.
  4. [15] The feedback retrieval results obtained using pseudo_fb_param are based on using top 5 documents in initres. The new feedback results are in psfbres. Use the evaluation perl script to compute the precision/recall figures for psfbres. Compare this pseudo feedback performance with the initial retrieval performance. Does pseudo feedback improve the overall average precision? Does it improve the performance for every topic? Plot the precision-recall curve (i.e., recall levels as x-axis and precision at different levels of recall as y-axis) for both the initial retrieval and the pseudo feedback (use the average performance figures over all topics). If you are not sure about what a precision-recall curve is, read more about "TREC measures". Compare the two curves, and discuss where at the curves pseudo feedback improves the performance most significantly. Can you explain why?
  5. [10 points] Now, try using rel_fb_param. This parameter file specifies the relevance judgment file (qrel3column) as the feedback documents, so we are now doing relevance feedback (with 5 relevant documents). The results will be written in a file named relfbres. Again, use the evaluation perl script to compute the precision/recall figures for relfbres. How is the performance of relevance feedback compared with that of pseudo feedback, where we simply assume that the top 5 documents to be relevant. Try to use only 2 relevant documents for relevance feedback (by changing the parameter fbDocCount). How is the performance compared with that of pseudo feedback?
  6. [5 points] Pick any of the feedback runs, and examine the file query.update which is a dump of the updated query model. The terms with a prefix of "+" are the new terms introduced through feedback. Look at a few topics. Are those new terms closely related to the topic? Try a few different values of lambda in the parameter file (i.e. noise parameter in the mixture model). How does it affect the nature of new terms introduced to the query model?
  7. [10 points] There are at least four parameters that may need to be tuned for pseudo feedback: (1) the interpolation parameter alpha; (2) the noise parameter lambda; (3) the number of top documents to use; (4) the Dirichlet prior smoothing parameter. Pick at least one of the parameters, try out a few different values, plot how the retrieval performance (as measured by both precision at 20 documents and the average precision) varies according to the parameter(s) that you chose. Try to provide an explanation for any pattern that you observe.

What to turn in

A. Please turn in a hardcopy of your written answers at the class.

B. Please pack all the following files into one single zip file or tar file, name it as "assign3-YourNetID", then email the package to "maryamcs410 AT gmail DOT com".

  1. Your complete code (i.e.,FBEval.cpp).
  2. Three retrieval results (i.e., ranked list of document IDs with scores) as generated by "FBEval.cpp" when using init_ret_param, pseudo_fb_param, and rel_fb_param, respectively.