# MP2: Retrieval Functions

Posted: September 25th, 2014

In this assignment, you will implement some standard retrieval functions in the Lucene toolkit. We are aware that some of these functions already exist in Lucene, but we'd like you to add them yourself for this assignment. The assignment must be completed individually.

The assignment is composed of the following parts, worth 100 total points (plus a 20-point bonus):

- **Overview of Retrieval Models**
  - Boolean model: boolean dot product
  - Vector space models: TF-IDF dot product, Okapi BM25, and pivoted length normalization
  - Language models: Jelinek-Mercer and Dirichlet prior
- **Scoring functions in Lucene** *(60 points: 10 points each)*
- **Running Lucene**
- **Questions** *(40 points, with a 20-point bonus question)*

The 60 points for the scoring functions are for coding completion -- the implementations must be correct, and you need to explain them in your report (pasting the modified code sections is fine).

The 40 points are for correct, justified answers to the questions at the end, and reasonable search performance generated by your implementations.

The 20-point bonus question requires you to read a related paper before answering it.

Please download the provided project **here**. Inside, you will find all the necessary files.

In the next section, we'll give an overview of the six functions you'll need to implement.

# Boolean Models

$$ r(q,d)=\sum_{w\in q,d} 1 $$
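The formula simply counts the query terms that appear in the document. A minimal sketch of this math in plain Java (an illustration only, not the Lucene code you will write; the example terms are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the boolean dot product: each distinct query
// term that appears in the document contributes 1 to the score.
public class BooleanDotProduct {
    static int score(String[] query, String[] doc) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(doc));
        int score = 0;
        for (String w : new HashSet<>(Arrays.asList(query))) {
            if (docTerms.contains(w)) score++;  // sum over w in both q and d of 1
        }
        return score;
    }

    public static void main(String[] args) {
        String[] q = {"quantum", "field", "theory"};
        String[] d = {"quantum", "theory", "of", "fields"};
        System.out.println(score(q, d));  // 2: "quantum" and "theory" match
    }
}
```

Note that "field" does not match "fields" here; in the actual project, the analyzer's stemming handles such variants before scoring.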

# Vector Space Models

## TF-IDF dot product

$$ r(q,d)=\sum_{w\in q,d} c(w;d)\cdot \log\left(\frac{N + 1}{df}\right) $$

where

- $c(w;d)$ is the count of word $w$ in the document $d$
- $N$ is the total number of documents, and
- $df$ is the document frequency of $w$, i.e., the number of documents containing $w$.

**NOTE**: this is simply an unnormalized version of the TF-IDF dot product. Because term frequencies in Lucene's inverted index are not normalized, we only ask you to implement this simpler version.
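One term of this sum can be sketched as follows in plain Java (an illustration of the formula, not Lucene code; the counts in `main` are made up):

```java
public class TfIdfTerm {
    // One term of the TF-IDF dot product: c(w;d) * ln((N + 1) / df).
    static double score(double termFreq, double numDocs, double docFreq) {
        return termFreq * Math.log((numDocs + 1) / docFreq);
    }

    public static void main(String[] args) {
        // A term occurring twice in the document, appearing in 1 of 9 documents:
        System.out.println(score(2, 9, 1));  // 2 * ln(10) ≈ 4.605
    }
}
```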

## Okapi BM25

Parameters: $k_1\in [1.2,2],k_2\in (0,1000],b\in[0.75,1.2]$.

$$ r(q,d)=\sum_{w\in q,d} \ln\left(\frac{N-df+0.5}{df+0.5}\right) \cdot \frac{(k_1 + 1)\cdot c(w;d)}{k_1(1 - b + b\frac{n}{n_{avg}}) + c(w;d)} \cdot \frac{(k_2 + 1)\cdot c(w;q)}{k_2+c(w;q)} $$

where

- $c(w;q)$ is the count of word $w$ in query $q$
- $n$ is the document length, and
- $n_{avg}$ is the average document length.
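One term of the BM25 sum can be sketched as below in plain Java (an illustration only, not Lucene code; the statistics and parameter values in `main` are arbitrary examples):

```java
public class Bm25Term {
    // One term of Okapi BM25 as given above; k1, k2, b are free parameters.
    static double score(double tf, double qtf, double docLen, double avgLen,
                        double numDocs, double docFreq,
                        double k1, double k2, double b) {
        double idf = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
        double tfPart = (k1 + 1) * tf / (k1 * (1 - b + b * docLen / avgLen) + tf);
        double qtfPart = (k2 + 1) * qtf / (k2 + qtf);
        return idf * tfPart * qtfPart;
    }

    public static void main(String[] args) {
        // With c(w;q) = 1 the query factor is (k2 + 1) / (k2 + 1) = 1,
        // which is why the query-term-frequency simplification below is safe.
        System.out.println(score(3, 1, 120, 100, 1000, 50, 1.5, 500, 0.75));
    }
}
```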

## Pivoted Length Normalization

Parameter: $s\in [0,1]$.

$$ r(q,d)=\sum_{w\in q,d} \frac{1+\ln(1 + \ln(c(w;d)))}{1 - s + s\frac{n}{n_{avg}}} \cdot c(w;q) \cdot \ln\left(\frac{N+1}{df}\right)$$

This is another form of TF normalization, which we did not cover in our course lectures.
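One term of this sum can be sketched as below in plain Java (an illustration only, not Lucene code; the values in `main` are arbitrary examples):

```java
public class PivotedTerm {
    // One term of pivoted length normalization; s is the free parameter.
    static double score(double tf, double qtf, double docLen, double avgLen,
                        double numDocs, double docFreq, double s) {
        double tfPart = 1 + Math.log(1 + Math.log(tf));
        double norm = 1 - s + s * docLen / avgLen;
        double idf = Math.log((numDocs + 1) / docFreq);
        return tfPart / norm * qtf * idf;
    }

    public static void main(String[] args) {
        // With c(w;d) = 1 the numerator is 1 + ln(1 + ln 1) = 1, and with
        // s = 0 the length normalization is 1, leaving c(w;q) * ln((N+1)/df).
        System.out.println(score(1, 1, 80, 100, 9, 1, 0));  // ln(10) ≈ 2.303
    }
}
```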

# Language Models

As we have discussed in class, language models rank documents according to query likelihood:

$$ r(q,d) = \sum_{w\in q} \log p(w|d) $$

After proper smoothing, the language model scoring function becomes

$$ r(q,d) = \sum_{w\in q, d} \log\frac{p_s(w|d)}{\alpha_d p(w|C)} + |q|\log\alpha_d $$

The following two language model retrieval functions use different smoothing strategies for the document language model. You can plug each smoothed model into the general language model formula above; we will only use unigram language models.

## Jelinek-Mercer

Parameter: $\lambda\in[0,1]$.

$$p_s(w|d) = (1-\lambda)p_{ml}(w|d)+\lambda p(w|C)$$

where $p_{ml}$ is the maximum likelihood estimate. Accordingly, $\alpha_d=\lambda$.
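The smoothing itself is a simple linear interpolation, sketched below in plain Java (an illustration only; it assumes $p_{ml}(w|d)$ and $p(w|C)$ have already been computed, and the numbers in `main` are made up):

```java
public class JelinekMercer {
    // Jelinek-Mercer smoothing: linear interpolation between the document
    // model p_ml(w|d) and the collection model p(w|C).
    static double smoothed(double pMl, double pC, double lambda) {
        return (1 - lambda) * pMl + lambda * pC;
    }

    public static void main(String[] args) {
        // prints the smoothed probability (≈ 0.3 for these values)
        System.out.println(smoothed(0.4, 0.2, 0.5));
    }
}
```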

## Dirichlet Prior

Parameter: $\mu>0$. Try a value like 2000 or 3000.

$$p_s(w|d) = \frac{c(w;d) + \mu p(w|C)}{n + \mu}$$

Accordingly, $\alpha_d=\frac{\mu}{\mu+n}$.
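A sketch of the Dirichlet-smoothed probability in plain Java (an illustration only; it assumes $p(w|C)$ has already been computed, and the values in `main` are arbitrary examples):

```java
public class DirichletPrior {
    // Dirichlet prior smoothing: p_s(w|d) = (c(w;d) + mu * p(w|C)) / (n + mu).
    static double smoothed(double tf, double docLen, double pC, double mu) {
        return (tf + mu * pC) / (docLen + mu);
    }

    public static void main(String[] args) {
        // c(w;d) = 2, n = 100, p(w|C) = 0.001, mu = 2000:
        System.out.println(smoothed(2, 100, 0.001, 2000));  // (2 + 2) / 2100 ≈ 0.0019
    }
}
```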

# Scoring Functions in Lucene

In Lucene, all the retrieval functions have the following function signature to score an individual word in the query:

```java
protected float score(BasicStats stats, float termFreq, float docLength) {
    return 0;
}
```

This would be equivalent to one term in each sum above; this function is called
once per word in the query for each document **where that word occurs**. Once all
the documents are scored, Lucene outputs a list of documents ranked by their
score.

The `BasicStats` object has the following methods that will be useful:

- `getAvgFieldLength()`: average document length
- `getNumberOfDocuments()`: total number of documents in the index
- `getDocFreq()`: the number of documents the current term appears in

For the language models, you will also need the member variable `model`, which is of type `LMSimilarity.DefaultCollectionModel`. It has the following method that will be of use:

- `computeProbability(stats)`: computes $p(w|C)$, taking the `BasicStats` object described above as a parameter

For the language models, also note:

- To compute $p_{ml}(w|d)$, you can use two existing variables
- There is a member variable `queryLength` that you can use for the value of $|q|$

Your task is to complete the `score` function for each of the six retrieval models listed above. All the retrieval models are located in the package `edu.illinois.cs.index.similarities`.

## Implementation Details

- If the scoring function has parameters (e.g., as in BM25), it's probably easiest to add them as member variables (e.g., as constants)
- You can use any logical values for parameters that you'd like
- You may assume that the queries are short -- that is,
**you may assume that the query term frequency is always one**. This simplifies your code a bit.

# Running Lucene

## Creating an index

There is a small data set distributed with this assignment. It is the NPL dataset of physics paper titles, located in the `data/` folder in the root of the project.

Two different main functions are provided in the `edu.illinois.cs.index.Runner.java` file. You can read the comments and decide which one you will use.

## Searching the index

Two different main functions are provided in the `edu.illinois.cs.index.Runner.java` file for you to interactively search the index. You should read the comments and decide which one you will use.

Keep in mind the documents you're searching are physics paper titles. You can also specify which retrieval function to use when starting the search engine.

The complete list of options is

```
--dp Dirichlet Prior
--jm Jelinek-Mercer
--ok Okapi BM25
--pl Pivoted Length Normalization
--tfidf TFIDF Dot Product
--bdp Boolean Dot Product
```

## Evaluation

To test your retrieval functions, use one of the two main functions provided in `edu.illinois.cs.eval.Evaluate.java`. You should read the comments and decide which one you will use.

Notice the option to use a specific retrieval function. These are the same options as in the interactive search engine.

For each query, the *average precision* is printed. We'll discuss this measure later in the course. The final number printed is the *mean average precision*, or *MAP*: the overall score across all the queries. Simply put, the higher the MAP score, the better the retrieval function performed. Using the given code before any modifications, you will probably get a MAP of around 0.001.

Once you implement the functions, each one should achieve a MAP of at least 0.10; some will do better than others. Hint: the language-model-based methods did not perform very well on this dataset (but you can still get them over 0.10), while our two good vector space models reached around 0.25.

# Questions

1. Copy and paste your implementation of each ranking algorithm into your report, together with the final MAP performance you get from the evaluation function. Briefly explain your implementations, and prove that the provided $\alpha_d$ in each smoothing setting is correct (i.e., show how you derived it). (5 pts)

2. Carefully tune the parameters of BM25 and the Dirichlet prior smoothed language model. Report the best MAP you achieved and the corresponding parameter settings. (10 pts)

3. In `edu.illinois.cs.index.SpecialAnalyzer.java`, we defined a special document analyzer to process documents and queries for retrieval. Basically, we built a pipeline with the filters `LowerCaseFilter`, `LengthFilter`, `StopFilter`, and `PorterStemFilter`. Please disable some of the filters, e.g., stopword removal or stemming, and test the new analyzer with the BM25 model (with your best parameters). What is your conclusion about the effect of the document analyzer on retrieval effectiveness? (10 pts)

4. With the default document analyzer, choose one or two queries where the TF-IDF dot-product model performed significantly better than the Boolean dot-product model (i.e., achieved better average precision), and analyze the major reason for the improvement. Do the same analysis for the TF-IDF dot-product model vs. BM25, and for BM25 vs. the Dirichlet prior smoothed language model (using your best parameters for BM25 and the Dirichlet prior smoothed language model). (15 pts)

(Bonus question: 20 pts) Pick one of the previously implemented scoring functions out of

- Okapi BM25
- Pivoted Length Normalization
- Language Model with Dirichlet Smoothing

and analyze under what circumstances the chosen scoring function will mistakenly favor a less relevant document (*i.e.*, rank a less relevant document above a more relevant one). After reading the paper *An Exploration of Axiomatic Approaches to Information Retrieval*, how do you think you could fix the problem? Please describe your solution in the report.

# Submission

Answer the above questions in your report and submit it as a PDF via Collab.

## Deadline for MP2

The deadline is 11:59pm, Friday, October 10th.

# Sample solutions

I have selected two sample solutions from Muhammad Yanhaona and Christian Kümmerle for your reference. Please carefully read their solutions for Question 4.