CS410 Assignment #2: Pivoted Normalization vs. BM25 (Okapi)
(due Wednesday, Feb 27, 2008, in Class)

Introduction

The goal of this assignment is to implement two basic traditional retrieval algorithms with the Lemur toolkit (in C++) and to experiment with the algorithms on some sample text data. You will be using a "light" version of the Lemur toolkit, and are given an incomplete implementation of a simple retrieval system, which has everything implemented but the scoring function. The code along with the Lemur can be downloaded from here. Note that while the regular Lemur can run on both Windows and Unix, this light version is only ready for running on Unix, and has only been tested on Linux.. (The source code is portable, so you can create a Windows project manually by adding all the source files to your project space if you use visual c++.) All of you should have an account in the Computer Science Instructional Laboratories (CSIL), and you are strongly encouraged to use the machines in the Linux lab. Contact CSIL if you have any problem with your account.

The usage of Lemur can be found online here. You will need to know how to compile the toolkit and how to modify an application. Basically, you go to the root directory, and type in "./configure" and then "gmake". Every time you change a program, just go to the root directory and do "gmake".

The code you need to complete is in a file named "RetrievalEval.cpp" in the directory "app/src". You need to implement two scoring functions computeWeight and computeAdjustedScore, and to complete the main scoring function "retrieval", which is already implemented except that one line is missing. You will need to completely understand how the function retrieval works in order to be able to insert the weighting code.

The provided code implements the following general scoring function:

s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d)
where,

q =(q1,q2,...,qN) is a query, d=(d1,d2,...,dN) is a document, and q1,...,qN and d1,...,dN are terms. That is, the score of a document d against a query q is a function g of the accumulated weight w for each matched term. The score is thus determined by two functions g and w; both may depend on the whole query or document. The function w gives the weight of each matched term, and is implemented by computeWeight in the code; the function g makes it possible to perform any further transformation of the sum of the weight of all matched terms based on the "summary" information of a query or a document (e.g., document length), and is implemented by computeAdjustedScore in the code.

The function retrieval is the main scoring function. Basically, for each query, it would go through each query term, and look up the inverted list. Then, it would iterate over all the entries in the inverted list and for each entry, compute the contributed weight to the score of the corresponding document. The contributed weight is to be computed by computeWeight. At the end, depending on the scoring formula, you may need to adjust the score of each document by some document-dependent (but term-independent) quantity. This is the task of computeAdjustedScore.

You will need to understand how to access information such as the length of a document, the average length of documents in the collection, the total counts of a term in the collection, and the total counts of all the terms in the collection. (Some of these counts have an integer type, so you may need to do type conversion when they are involved in division. ) All these information is available through an abstract Index class interface. See Section 2.4 of the Lemur documentation.

Tasks

  1. [15 points] Compare the TF-IDF pivoted normalization formula and Okapi BM25 formula analytically. Both formulas are given in Table 1 of Singhal's review paper.(Note that there is an error in the Okapi formula.). What are the common statistical information about documents and queries that they both use? How are the two formulas similar to each other, and how are they different?
  2. [35 points] Implement the TF-IDF pivoted normalization method and the BM25 (i.e., Okapi) retrieval formula given in Table 1 of Singhal's review paper.(Note that there is an error in the Okapi formula.) For the Okapi formula, set k3=1000 and k1=1.2 (you can hard code it, if you want), so that you have only one parameter b to vary. In this way, both algorithms have precisely one parameter to tune.
  3. [20 points] Test both algorithms with the CACM test collection included in the data directory of the Lemur toolkit.

    The CACM collection is a collection of titles and abstracts from the journal CACM. There are about 3,000 documents, and 64 test queries. For each query, human assessors have read all the documents and judged which of them are relevant. To test your code with the CACM collection, go to the data directory and do the following:

    Change the values of the parameter in each algorithm (between 0 and 1), and evaluate the performance using the evaluation perl script included in Lemur ( app/src/ireval.pl). Plot how the non-interpolated average precision (averaged over all the queries) changes as the parameter value changes. Compare the two algorithms' performance and behavior. Is the performance sensitive to the setting of the parameter in each of them? Do they perform similarly or differently? When optimally tuned, which method gives the best performance?

  4. [15 points] In both retrieval formulas, we intend to penalize long documents. To avoid over-penalizing a long document, we may expect any reasonable retrieval formula to satisfy the following length normalization constraint: If we duplicate a document to obtain a new document that is twice as long as the original one, the score of the new document w.r.t. the query should be no less than the score of the original document. Formally, let c(w,d) be the counts of word w in document d. Our constraint is: given any two documents d1 and d2, if for any word w, c(w,d2)=2*c(w,d1) (thus d2 is twice as long as d1), then we require that the score of d2 w.r.t. any query should be at least as high as the score of d1 w.r.t. the same query. Check both retrieval formulas you implemented to see if they satisfy this constraint unconditionally. If not, can they satisfy the constraint for some values of the parameter? Is there any relation between this constraint and the empirical performance of the corresponding retrieval formula?
  5. [15 points] Can you think of any way to further improve the performance for either of the two methods? Try some of the following ideas, or, better, any idea of your own.

    Describe the ideas that you tried and how they work.

What to turn in

A. Please turn in a hardcopy of your written answers at the class.

B. Please pack all the following files into one single zip file or tar file, name it as "assign2-YOURNETID", then submit the package to our TA, Maryam Karimzadehgan by email (maryamcs410@gmail.com)

  1. Your complete code (i.e.,RetrievalEval.cpp): If you have a separate file for each retrieval method, then include all the files. In general, include any code that you have modified.
  2. Three retrieval results, i.e., ranked list of document IDs with scores as generated by "RetrievalEval.cpp" (NOT the precision/recall files generated by "ireval.pl"!) : (1) Result for pivoted normalization (s=0.2); (2) Result for Okapi (k1=1.2, b=0.75, k3=1000); (3) The best result you have ever got with any method explored in this assignment.