Information Retrieval

CS 6501 Fall 2014

MP2—Retrieval Functions

Posted: September 25th, 2014

In this assignment, you will implement several standard retrieval functions in the Lucene toolkit. We are aware that some of these functions already exist in Lucene, but we'd like you to implement them yourself for this assignment. This assignment must be completed individually.

The assignment is composed of the following parts, totaling 100 points (plus a 20-point bonus):

The 60 points for the scoring functions are for coding completion -- the implementation has to be correct, and you need to explain it in your report (pasting the modified code section is fine).

The 40 points are for correct, justified answers to the questions at the end, and for reasonable search performance from your implementations.

The 20-point bonus question requires you to read a related paper before answering it.

Please download the provided project here. Inside, you will find all the necessary files.

In the next section, we'll give an overview of the six functions you'll need to implement.

Boolean Models

$$ r(q,d)=\sum_{w\in q,d} 1 $$
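In Lucene, each matched query word is scored through the per-word score callback described in the Scoring Functions in Lucene section below. For the Boolean model every match contributes a constant, so a minimal sketch is simply:

protected float score(BasicStats stats, float termFreq, float docLength)
{
    // Every (query word, document) match contributes exactly 1,
    // so r(q,d) just counts how many query words appear in d.
    return 1f;
}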

Vector Space Models

TF-IDF dot product

$$ r(q,d)=\sum_{w\in q,d} c(w,d)\cdot \log\left(\frac{N + 1}{df}\right) $$

where $c(w,d)$ is the count of word $w$ in document $d$, $N$ is the total number of documents in the collection, and $df$ is the document frequency of $w$, i.e., the number of documents containing $w$.

NOTE: this is simply an unnormalized version of the TF-IDF dot product. Because the term frequencies in Lucene's inverted index are not normalized, we only ask you to implement this simpler version.
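A minimal sketch of this formula in the per-word score callback (Math.log is the natural logarithm; the base only rescales scores and does not change the ranking):

protected float score(BasicStats stats, float termFreq, float docLength)
{
    float N = stats.getNumberOfDocuments();  // total number of documents
    float df = stats.getDocFreq();           // document frequency of this word
    // termFreq is c(w,d); weight it by the IDF factor log((N+1)/df).
    return termFreq * (float) Math.log((N + 1) / df);
}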

Okapi BM25

Parameters: $k_1\in[1.2,2]$, $k_2\in(0,1000]$, $b\in[0.75,1.2]$.

$$ r(q,d)=\sum_{w\in q,d} \ln\left(\frac{N-df+0.5}{df+0.5}\right) \cdot \frac{(k_1 + 1)\cdot c(w;d)}{k_1(1 - b + b\frac{n}{n_{avg}}) + c(w;d)} \cdot \frac{(k_2 + 1)\cdot c(w;q)}{k_2+c(w;q)} $$

where $c(w;d)$ and $c(w;q)$ are the counts of word $w$ in document $d$ and in query $q$ respectively, $n$ is the length of document $d$, $n_{avg}$ is the average document length in the collection, $N$ is the total number of documents, and $df$ is the document frequency of $w$.
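A minimal sketch, with illustrative parameter values from the ranges above. We assume $c(w;q)=1$, since the per-word score callback does not expose the query term frequency (a reasonable assumption for the short NPL queries):

protected float score(BasicStats stats, float termFreq, float docLength)
{
    float k1 = 1.5f, k2 = 750f, b = 1.0f;     // illustrative values; tune these
    float N = stats.getNumberOfDocuments();
    float df = stats.getDocFreq();
    float avgLen = stats.getAvgFieldLength(); // n_avg
    float cwq = 1f;                           // assumed c(w;q)

    float idf = (float) Math.log((N - df + 0.5) / (df + 0.5));
    float tf = ((k1 + 1) * termFreq)
             / (k1 * (1 - b + b * docLength / avgLen) + termFreq);
    float qtf = ((k2 + 1) * cwq) / (k2 + cwq);
    return idf * tf * qtf;
}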

Pivoted Length Normalization

Parameter: $s\in [0,1]$.

$$ r(q,d)=\sum_{w\in q,d} \frac{1+\ln(1 + \ln(c(w;d)))}{1 - s + s\frac{n}{n_{avg}}} \cdot c(w;q) \cdot \ln\left(\frac{N+1}{df}\right)$$

This is another version of TF normalization, which we did not cover in the course lectures.
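A minimal sketch, with an illustrative value of $s$ and again assuming $c(w;q)=1$ (note $c(w;d)\ge 1$ whenever this callback fires, so the nested logarithm is well defined):

protected float score(BasicStats stats, float termFreq, float docLength)
{
    float s = 0.5f;                           // illustrative pivot value; tune
    float N = stats.getNumberOfDocuments();
    float df = stats.getDocFreq();
    float avgLen = stats.getAvgFieldLength(); // n_avg
    float cwq = 1f;                           // assumed c(w;q)

    float tf = (float) ((1 + Math.log(1 + Math.log(termFreq)))
             / (1 - s + s * docLength / avgLen));
    return tf * cwq * (float) Math.log((N + 1) / df);
}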

Language Models

As we have discussed in class, language models rank documents according to query likelihood:

$$ r(q,d) = \sum_{w\in q} \log p(w|d) $$

After proper smoothing, the scoring function for language models becomes

$$ r(q,d) = \sum_{w\in q, d} \log\frac{p_s(w|d)}{\alpha_d p(w|C)} + |q|\log\alpha_d $$

Here $p_s(w|d)$ is the smoothed probability of a word seen in document $d$, and $\alpha_d$ controls the probability mass assigned to words not occurring in $d$, so that an unseen word receives probability $\alpha_d p(w|C)$. The following two retrieval functions define different smoothing strategies; plug each smoothed document language model into the general formula above. We will only use unigram language models.

Jelinek-Mercer

Parameter: $\lambda\in[0,1]$.

$$p_s(w|d) = (1-\lambda)p_{ml}(w|d)+\lambda p(w|C)$$

where $p_{ml}$ is the maximum likelihood estimate, i.e., $p_{ml}(w|d)=c(w;d)/n$. Accordingly, $\alpha_d=\lambda$.
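A minimal sketch, with an illustrative $\lambda$; $p(w|C)$ comes from the collection model described in the Scoring Functions in Lucene section below. The $|q|\log\alpha_d$ term is omitted: for Jelinek-Mercer it equals $|q|\log\lambda$, which is identical for every document and so does not affect the ranking:

protected float score(BasicStats stats, float termFreq, float docLength)
{
    float lambda = 0.1f;                          // illustrative value; tune
    float pwc = model.computeProbability(stats);  // p(w|C)
    float pml = termFreq / docLength;             // p_ml(w|d)
    float ps = (1 - lambda) * pml + lambda * pwc; // smoothed p_s(w|d)
    // alpha_d = lambda, so each matched word contributes
    // log(p_s(w|d) / (lambda * p(w|C))).
    return (float) Math.log(ps / (lambda * pwc));
}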

Dirichlet Prior

Parameter: $\mu>0$. Try a value around 2000 or 3000.

$$p_s(w|d) = \frac{c(w;d) + \mu p(w|C)}{n + \mu}$$

Accordingly, $\alpha_d=\frac{\mu}{\mu+n}$.
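A minimal sketch, with an illustrative $\mu$. Unlike Jelinek-Mercer, here $\alpha_d$ depends on the document length, so the $|q|\log\alpha_d$ term is not constant across documents; this sketch covers only the per-word part of the sum, and you should think about how to handle the remaining term:

protected float score(BasicStats stats, float termFreq, float docLength)
{
    float mu = 2500f;                             // illustrative value; tune
    float pwc = model.computeProbability(stats);  // p(w|C)
    float ps = (termFreq + mu * pwc) / (docLength + mu); // smoothed p_s(w|d)
    float alpha = mu / (mu + docLength);          // alpha_d
    // Per-word contribution log(p_s(w|d) / (alpha_d * p(w|C))); the
    // |q| log(alpha_d) term from the general formula is not included here.
    return (float) Math.log(ps / (alpha * pwc));
}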

Scoring Functions in Lucene

In Lucene, all the retrieval functions have the following function signature to score an individual word in the query:

protected float score(BasicStats stats, float termFreq, float docLength)
{
   return 0;
}

This corresponds to one term in each of the sums above: the function is called once per query word for each document in which that word occurs. Once all the documents are scored, Lucene outputs a list of documents ranked by score.

The BasicStats object has the following functions that will be useful:

stats.getNumberOfDocuments()   total number of documents in the collection ($N$)
stats.getAvgFieldLength()      average document length ($n_{avg}$)
stats.getDocFreq()             document frequency of the current word ($df$)
stats.getTotalTermFreq()       total count of the current word across the collection

For the language models, you will need the additional functionality of the member variable model, which is of type LMSimilarity.DefaultCollectionModel. It has the following function that will be of use:

model.computeProbability(stats)   the collection (background) language model probability $p(w|C)$ of the current word

For the language models, also note that this score function is only called for words occurring in both the query and the document, and the query length $|q|$ is not available inside it, so the $|q|\log\alpha_d$ term of the general formula cannot be computed here; think about whether and how that term affects the final ranking under each smoothing method.

Your task is to complete the score function for each of the six retrieval models listed above. All the retrieval models are located in the package edu.illinois.cs.index.similarities.

Implementation Details

Running Lucene

Creating an index

There is a small data set distributed with this assignment. It is the NPL dataset of physics paper titles located in the data/ folder in the root of the project.

Two different main functions are provided in the edu.illinois.cs.index.Runner.java file. You can read the comments and decide which one you will use.

Searching the index

Two different main functions are provided in the edu.illinois.cs.index.Runner.java file for you to interactively search the index. You should read the comments and decide which one you will use.

Keep in mind the documents you're searching are physics paper titles. You can also specify which retrieval function to use when starting the search engine.

The complete list of options is:

--dp     Dirichlet Prior
--jm     Jelinek-Mercer
--ok     Okapi BM25
--pl     Pivoted Length Normalization
--tfidf  TF-IDF Dot Product
--bdp    Boolean Dot Product
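For example, assuming the interactive searcher is launched through the Runner class and takes the option as a command-line argument (check the comments in Runner.java for the exact usage; the classpath below is hypothetical), an invocation might look like:

java -cp bin edu.illinois.cs.index.Runner --ok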

Evaluation

To test your retrieval functions, you should use either one of the two main functions provided in edu.illinois.cs.eval.Evaluate.java. You should read the comments and decide which one you will use.

Notice the option to use a specific retrieval function. These are the same options as in the interactive search engine.

For each query, the average precision is printed; we'll discuss this measure later in the course. The final number printed is the mean average precision, or MAP, which is the overall score across all the queries. Simply put, the higher the MAP score, the better the retrieval function performed. Using the given code before any modifications, you will probably get a MAP of around 0.001.

Once you implement the functions, you should get a MAP of at least 0.10 for each one. Some will be better than others. Hint: the language model-based methods did not perform very well on this dataset (but you can still get them over 0.10), while the two stronger vector space models reached around 0.25 in our runs.

Questions

  1. Copy and paste your implementation of each ranking algorithm into your report, together with the final MAP performance you get from the evaluation function. Briefly explain your implementations, and prove that the provided $\alpha_d$ in each smoothing setting is correct (i.e., show how you derived it). (5 pts)

  2. Please carefully tune the parameters in BM25 and the Dirichlet Prior smoothed language model. Report the best MAP you achieved and the corresponding parameter settings. (10 pts)

  3. In edu.illinois.cs.index.SpecialAnalyzer.java, we define a special document analyzer to process documents and queries for retrieval purposes. Basically, we built a pipeline with the filters LowerCaseFilter, LengthFilter, StopFilter, and PorterStemFilter. Please disable some of these filters, e.g., stopword removal or stemming, and test the new analyzer with the BM25 model (with your best parameters). What is your conclusion about the effect of the document analyzer on retrieval effectiveness? (10 pts)

  4. With the default document analyzer, choose one or two queries where the TF-IDF dot-product model performed significantly better than the Boolean dot-product model, i.e., achieved better average precision, and analyze the major reason for the improvement. Do the same analysis for the TF-IDF dot-product model vs. BM25, and BM25 vs. the Dirichlet Prior smoothed language model (using your best parameters for BM25 and the Dirichlet Prior smoothed language model). (15 pts)

  5. (Bonus question: 20 pts) Pick one of the following scoring functions and analyze under what circumstances the chosen function will mistakenly favor a less relevant document (i.e., rank a less relevant document above a more relevant one):

    • Okapi BM25
    • Pivoted Length Normalization
    • Language Model with Dirichlet Smoothing

     After reading the paper An Exploration of Axiomatic Approaches to Information Retrieval, how do you think you can fix the problem? Please describe your solution in your report.

Submission

Answer the above questions in your report and submit it as a PDF via Collab.

Deadline for MP2

The deadline is 11:59pm, Friday, October 10th.

Sample solutions

I have selected two sample solutions, from Muhammad Yanhaona and Christian Kümmerle, for your reference. Please carefully read their solutions to Question 4.