CS410 Assignment #3: Search Engine Competition
Part 1 is due on Mar 04, 2013, 11:59pm
Part 2 is due on Mar 15, 2013, 11:59pm
Part 3 is due on Mar 30, 2013, 11:59pm

Updates: The schedule of the assignments has been changed; see the due dates above. See also the "Important Update for What to Turn In" under Part 2 and the "Performance Ranking of Part 2" linked at the bottom of this page.

Introduction

The goal of this assignment is to run a TREC-like competition and evaluate any ideas you can propose to improve the ranking accuracy for the forum search problem. This assignment has three parts: Part 1 is an individual assignment, while Part 2 and Part 3 are group assignments.

Part 2 and Part 3 are group assignments, and you are strongly encouraged to form a team of up to four students. Graduate students and undergraduate students are encouraged to work together. Of course, if you choose to, you may also work on your own (i.e., a single-person team). All members of the same team will receive the same grade for this assignment, provided that your report clearly indicates the contribution of each member (e.g., xxx implemented this feature; xxx ran preliminary experiments to test this idea; xxx proposed this idea). We suggest you start forming groups and brainstorming ideas as early as possible. We have finished testing your search engines for Assignment #2, so you can now change them however you want.

Part 1: Generate query judgements (20 points)

We ask each of you to enter the keyword query you proposed in Assignment #2 (not the question from Yahoo! Answers) into the search engine we provide, and judge the top 50 documents as relevant or not. (If fewer than 50 documents are returned, judge all of them.)

Create a text file in your home folder and name it with your student ID, e.g. xwang95.txt, so that the file path is /home/xwang95/xwang95.txt. Make sure the file is readable to everyone (we will locate this file directly for grading). The first line should be your question, and the second line your query words delimited by a tab '\t'. Then, starting from the first result, judge whether each result (a forum page) is relevant to the query. A forum page is regarded as "relevant" if it can at least partially answer the question. If a forum page is relevant, append its DOC_ID (shown at the bottom of each search result) to the file as a separate line, so that the original order of the results is preserved.

We asked you to make similar judgements in Task 3 of Assignment #2 on your own forums, for only 10 documents. The purpose of that was merely to guarantee the quality of your proposed query. The judgements you make in this assignment, however, will be used to evaluate the performance of each team's search engine in the end.

What to turn in for part 1

A text file in your home folder, named with your student ID, e.g. xwang95.txt. Make sure the file is readable to everyone and double-check its name, as we will retrieve the file automatically. Nothing needs to be submitted to Compass.

EXAMPLE: xwang95.txt

	Ln1: What is the best rustic camping site in illinois?
	Ln2: rustic	camping	site	illinois
	Ln3: xwang95_0
	Ln4: xwang95_2
	Ln5: jwang112_3
	Ln6: duan11_4
	   :  ...
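
Before the deadline, you may want to sanity-check that your file follows the layout above. Below is a minimal self-check sketch (assuming Python 3); the script name, function name, and warnings are purely illustrative and are not part of the official grading pipeline.

    # Minimal self-check for the Part 1 judgement file described above (illustrative only).
    # Expected layout: line 1 = question, line 2 = tab-delimited query words,
    # remaining lines = DOC_IDs of relevant results (at most 50, in the original result order).
    import sys

    def check_judgement_file(path):
        with open(path, encoding="utf-8") as f:
            lines = [line.rstrip("\n") for line in f if line.strip()]
        if len(lines) < 2:
            sys.exit("file needs at least a question line and a query line")
        question, query = lines[0], lines[1]
        if "\t" not in query:
            print("warning: query words on line 2 should be delimited by tabs")
        doc_ids = lines[2:]
        if len(doc_ids) > 50:
            print("warning: you only need to judge the top 50 results")
        if len(doc_ids) != len(set(doc_ids)):
            print("warning: duplicate DOC_IDs found")
        print("question:     ", question)
        print("query terms:  ", query.split("\t"))
        print("relevant docs:", len(doc_ids))

    if __name__ == "__main__":
        check_judgement_file(sys.argv[1])   # e.g. python check_file.py /home/xwang95/xwang95.txt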

Part 2: Search engine competition: warming up (40 points)

Step 1: Form a team of up to four people.

When the team is finalized, write down your team name, the students' names, and their student IDs on the group list wiki page linked below. Your team will remain the same in Part 2 and Part 3.

In the following, we assume each team has four members. Teams with fewer members only need to submit as many versions of their search engine as they have members (e.g., a two-person team submits two versions).

GROUP LIST PAGE: https://wiki.engr.illinois.edu/display/timan/CS410S13+Assign+3+Group+List

Step 2: Implement new ideas on your search engine

SOLR SETUP:

Note: You can reuse the compiled Solr from Assignment #2 (you also need to download the source files), or you can set up a new Solr from scratch.

Suggestion: Use the compiled Solr package provided in Assignment 2 (you can download it from the Assignment 2 page), together with the source package given below. Some students have reported problems with the new package; we will look into the cause, but the old package from Assignment 2 works fine.

  1. Download the Solr source code (and compiled files) and uncompress them into the same folder. In addition to modifying solr-4.1.0/example/etc/jetty.xml, you also need to change the file solr-4.1.0/example/solr/solr.xml to the following (assuming the port number is 8983):
      <?xml version="1.0" encoding="UTF-8" ?>
      <solr persistent="true">
        <cores defaultCoreName="cs410" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8983" hostContext="solr">
          <core schema="schema.xml" loadOnStartup="true" instanceDir="cs410/" transient="false" name="cs410" config="solrconfig.xml" dataDir="data"/>
        </cores>
      </solr>
    	
  2. Compile:
    cd solr-4.1.0
    ant ivy-bootstrap     # install Ivy so ant can resolve dependencies
    ant compile           # build Lucene/Solr from source
    cp dist/solr-4.1.0.war example/solr-webapp/                    # deploy the Solr web application
    cp ./lucene/build/core/lucene-core-4.1-SNAPSHOT.jar ./example/solr-webapp/webapp/WEB-INF/lib/lucene-core-4.1.0.jar   # replace the bundled Lucene core jar with your build
    cd example
    java -Dsolr.clustering.enabled=true -jar start.jar   # start Solr (runs in the foreground)
    cd ..
    	

You should start with the basic search engine that you set up in Assignment #2, whose performance has been tested using the first dataset. (After we collect all students' judgements, we will merge them and release the first half of the judgements to you, along with a script that helps you evaluate the performance of your search engine.) This is to make sure that your basic search engine works properly and is bug-free.
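
As a quick, optional reachability check before you start experimenting, you can confirm that your Solr instance answers on your assigned port. The sketch below is only an illustration: it assumes Python 3, the course server hostname, and the /solr/cs410/browse endpoint used elsewhere on this page.

    # Quick, optional sanity check that a Solr instance answers on a given port (illustrative).
    import sys
    import urllib.request

    def solr_is_up(port, host="cs410-server.cs.illinois.edu", timeout=10):
        url = "http://%s:%s/solr/cs410/browse?q=test" % (host, port)
        try:
            return urllib.request.urlopen(url, timeout=timeout).getcode() == 200
        except Exception as exc:
            print("request failed:", exc)
            return False

    if __name__ == "__main__":
        port = sys.argv[1] if len(sys.argv) > 1 else "8983"
        print("Solr reachable" if solr_is_up(port) else "Solr not reachable")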

Then try out new ideas to improve your baseline search engine (the Lucene default) and test them using the first half of the test collection to see whether they outperform the baseline. The following are some general ideas you can consider trying:

  1. Try existing approaches in solr-4.1.0/lucene/core/src/java/org/apache/lucene/search/similarities, such as BM25, Language Models with smoothing, etc.
  2. Try pseudo relevance feedback: you may try Rocchio in the vector space model or the mixture model method for the KL-divergence retrieval method (a small sketch of Rocchio appears after this list). You may also try multiple iterations of feedback, or use heuristics to improve the quality of the feedback documents (e.g., ignore extremely long documents when using the top-k documents for pseudo feedback).
  3. Try pseudo feedback using a general Web search engine such as Google (i.e., sending the query to Google to fetch, say top 10 snippets, and use these top 10 snippets for feedback).
  4. Try query expansion based on some online thesaurus (e.g., http://thesaurus.reference.com/).
  5. You may also try different smoothing methods, different term weighting heuristics, removing stop words (vs. not removing).
  6. You may also try different ways of combining different strategies (e.g., combining passage retrieval with document retrieval).
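
To make idea 2 concrete, here is a minimal, engine-agnostic sketch of Rocchio-style pseudo feedback over TF-IDF vectors (assuming Python 3). The function names, the alpha/beta weights, and the cutoff of 10 expansion terms are all illustrative assumptions; you would still need to plug the expanded term list back into your own Solr queries.

    # Illustrative Rocchio pseudo-relevance-feedback sketch (not tied to Solr's API).
    import math
    from collections import Counter

    def tf_idf_vector(text, df, n_docs):
        # Turn raw text into a TF-IDF weighted term vector.
        tf = Counter(text.lower().split())
        return {t: (1 + math.log(c)) * math.log(n_docs / (1.0 + df.get(t, 0)))
                for t, c in tf.items()}

    def rocchio_expand(query_vec, feedback_vecs, alpha=1.0, beta=0.75, top_terms=10):
        # Move the query vector toward the centroid of the top-ranked (pseudo-relevant) documents.
        centroid = Counter()
        for vec in feedback_vecs:
            for term, w in vec.items():
                centroid[term] += w / len(feedback_vecs)
        expanded = Counter({t: alpha * w for t, w in query_vec.items()})
        for term, w in centroid.items():
            expanded[term] += beta * w
        # Keep only the strongest terms as the expanded query.
        return [t for t, _ in expanded.most_common(top_terms)]

In practice, feedback_vecs would be built from the top-k documents (or snippets) returned by your baseline run, optionally skipping extremely long documents as suggested above, and the returned term list would be issued as a new query.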

Step 3: Finalize the four versions of your search engines based on your preliminary experiments

Since each student was assigned a unique port number in Assignment #2, you are expected to reuse your port number in this assignment.

Important Update for What to Turn In

Before the due date for Part 2, submit a Report.txt file to Compass (each team only needs to submit one report). The top four lines of your Report.txt should contain nothing but the URLs of your search engines and their names. After that, state who the main contributor for each search engine is and what its function/feature is. In addition, include the absolute path to the folder that contains the start.jar used to run Solr. For example:

http://cs410-server.cs.illinois.edu:PORT_NUMBER_1/solr/cs410/browse TARUN1
http://cs410-server.cs.illinois.edu:PORT_NUMBER_2/solr/cs410/browse TARUN2
http://cs410-server.cs.illinois.edu:PORT_NUMBER_3/solr/cs410/browse TARUN3
http://cs410-server.cs.illinois.edu:PORT_NUMBER_4/solr/cs410/browse TARUN4
TARUN1: net_id_1, BM25 Similarity
		/home/netid_1/cs410/assign3/solr-4.1.0/example
TARUN2: net_id_2, Language Model Similarity with Dirichlet Smoothing
		/home/netid_2/cs410/assign3/solr-4.1.0/example
TARUN3: net_id_3, Pseudo Feedback
		/home/netid_3/cs410/assign3/solr-4.1.0/example
TARUN4: net_id_4, Pivoted Length Normalization
		/home/netid_4/cs410/assign3/solr-4.1.0/example

Note: For each run, add the absolute path to the folder containing the start.jar used to run Solr.

Make sure the name of each search engine is unique, as we will use these names to post the performance ranking of your search engines. Avoid exposing your identity in the names if privacy is a concern for you. In the rest of the report, describe what was done for each search engine, and state clearly who did what.

The performance ranking in Part 2 will NOT be used for grading. It is released only so that you can get a sense of how well your team is doing. You will get 40 points if you implement the baseline search engine, propose new ideas for improving it, and have those ideas implemented.

What to turn in for part 2

The Report.txt file and all of your code in one zip file, submitted to Compass. Each team only needs to submit one file.


Part 3: Search engine competition: final competition (40 points)

In Part 3, you will further improve your search engines using the forum data and query judgements released after your Part 2 submission. You can implement new ideas, improve your original implementations, or simply tune parameters based on the forum data we release. In the end, we again ask you to submit the four search engines that you believe will perform best on the forum data. You do not need to include the baseline search engine among these four. The names of your search engines should be unique and different from the names you used in Part 2.

Submit a Report.txt file with the same structure as the one you submitted for Part 2. In the descriptions, mention what you did to improve the search engines and who did what exactly. You will get 20 points if you propose ideas to further improve your search engines and have them implemented. The remaining points will be graded based on the ranking of the best search engine from your team. We will produce a ranking of teams based on your best run; suppose there are n teams and your team is ranked at k, then your score will be 20*(n-k+1)/n. So the best team will get the full 20 points and the last team will get 20/n points.

In the end, we will release the latter half of the data to you (without judgements), and you are expected to replace the indexed data in your Solr engines. After we receive your search engine URLs, we will test your engines and then release the results.

What to turn in for part 3

The Report.txt file and all of the code you modified in one zip file, submitted to Compass. Each team only needs to submit one file.

HINT:

  1. Some Tutorials/Materials

  2. Key classes in Lucene

  3. Coding Suggestions

The TA will continue to update this assignment page in the near future and provide as much help as possible...

Supplementary Information

How to build your index?

You do NOT need to build an XML file and upload it to Solr as in Assignment 2. Instead, you will "link" our prepared index into your Solr.

It is simple!

In your folder "solr-4.1.0/example/solr/cs410/data":

How to test your search engine?

The TA has done most of the work for you!

A script named map.py is provided; you can download it together with the partial relevance judgements, which are placed in the folder rel_jdg. After downloading the package, uncompress it and make sure the Python file and the relevance judgement folder are in the same folder (they will already be after you uncompress the package).

To assess the performance of the search engine running at port "port_number" (the search engine should be accessible at the URL "http://cs410-server.cs.illinois.edu:port_number/solr/cs410/browse"), refer to the usage below.

The output contains the average precision for each query (a set of query words) as well as the mean average precision (MAP). We will use the MAP to measure and rank your search engine's performance in the competition.

DOWNLOAD: judge.tar

USAGE: python map.py port_number
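
For intuition, the metric reported by map.py is standard average precision; the sketch below (assuming Python 3) shows how AP and MAP can be computed from a ranked result list and the set of judged-relevant DOC_IDs. map.py remains the authoritative scorer.

    # Illustrative computation of average precision (AP) and mean average precision (MAP).
    def average_precision(ranked_doc_ids, relevant_ids):
        # AP = average of precision@k over the ranks k at which a relevant document appears.
        hits, precisions = 0, []
        for k, doc_id in enumerate(ranked_doc_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

    def mean_average_precision(runs):
        # runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query.
        aps = [average_precision(ranked, rel) for ranked, rel in runs]
        return sum(aps) / len(aps) if aps else 0.0

    # Example: if the 1st and 3rd results are relevant and there are 2 relevant documents in total,
    # AP = (1/1 + 2/3) / 2 = 0.8333...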

After you finish Part 2, we will run all of your search engines during grading and publish the MAP of each of them (under the search engine names you provide) on Piazza, so that you know how well you have done. However, this ranking will not affect the grading.

Performance Ranking of Part 2

LINK: http://sifaka.cs.uiuc.edu/course/410s13/ranking.htm