CS410 Assignment #4: MapReduce and kNN
(due on Thursday, April 04, 2013 11:59pm)

Introduction

This is an individual assignment. Your tasks will be:
(1) complete the implementation of the MapReduce indexer, build an index on a Hadoop cloud-computing cluster, and run retrieval experiments using the basic TF-IDF retriever of a simple Java retrieval toolkit on a given retrieval test collection to obtain a baseline retrieval accuracy;
(2) complete the implementation of the k-Nearest Neighbor (kNN) text categorization algorithm and experiment with AP news data.
In this assignment, we will use the following server: altocumulus.cloud.cs.illinois.edu

Task 1. [50 points] Complete the implementation of a simple retrieval toolkit based on Hadoop

The purpose of this task is to give you hands-on experience with implementing a simple IR system using MapReduce/Hadoop. As the data sets in many applications keep growing, parallel indexing using a framework such as MapReduce/Hadoop is becoming increasingly important both for research in IR and for the development of practical applications. Through this assignment, you will become familiar with the Hadoop File System (HDFS), learn how to create an inverted index using MapReduce, and learn how to run a standard retrieval model using an inverted index stored in HDFS. You are provided with an almost-complete simple retrieval toolkit, and your first task is to complete its implementation. Starting with such a toolkit minimizes your coding work and lets you focus on learning the most essential knowledge about this topic. The toolkit has all the essential components of a retrieval system, except that two of the Java source files have a few lines of code missing; your task is to fill in those few lines in each file to make it work. Here is exactly what you need to do:

First, do the following to set up the toolkit.

  1. Log in to your Cloud-Computing Testbed (CCT) account by using the following command:
     ssh YourID@altocumulus.cloud.cs.illinois.edu 
    
    Create a directory under your home directory for finishing this assignment, and name it "cs410" by doing the following:
     
    mkdir cs410
    
  2. Enter the "cs410" directory and download the toolkit from the HDFS path "/home/zhou18/cs410/simir.tar.gz" by doing the following:
     
    hadoop dfs -copyToLocal /home/zhou18/cs410/simir.tar.gz ./
    
    This command is one of the DFS Shell commands of HDFS; it copies the toolkit file from HDFS to your local directory (i.e., /home/YourID/cs410/). If you aren't already familiar with the HDFS DFS Shell, you should take a look at this tutorial to at least get familiar with a few commonly used commands such as "-ls", "-cat", "-copyFromLocal", "-copyToLocal", "-mkdir", "-rmr", and "-cp".

  3. Next, you should unpack the .tar.gz package:
     
    tar -zxvf simir.tar.gz
    
    You should see the "assign4" directory under "cs410". If you enter "assign4", you will see three sub-directories: (1) assign4/src: all the Java source files and the Perl scripts for evaluation; (2) assign4/obj: the directory for the compiled object code (i.e., *.class files); (3) assign4/exp: the directory you will use for storing and analyzing retrieval results. You will be working mostly in the "assign4" directory.
Second, take a look at the following two data files provided in HDFS. Together with the "qrel" file in the "assign4/exp" directory, they form a test collection that you can use to evaluate the baseline retrieval algorithm.

To view these files, use the following commands:

 
hadoop dfs -cat /home/zhou18/cs410/docsrc/apsrc.txt | more
hadoop dfs -cat /home/zhou18/cs410/query | more
You will see that both files are formatted such that each document (or query) occupies one line, with the document ID (or query ID) at the beginning, followed by a sequence of terms. All the terms have been stemmed with a Porter stemmer.
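
For concreteness, here is a minimal Java snippet showing how one line of either file could be split into its ID and its terms. It is purely illustrative and not part of the toolkit; the class and method names are hypothetical:

import java.util.Arrays;

// Hypothetical helper, not part of the toolkit: splits one line of the
// document (or query) file, "<ID> term1 term2 ...", into its ID and terms.
class LineFormatExample {
    static void parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        String id = tokens[0];                                          // document ID or query ID
        String[] terms = Arrays.copyOfRange(tokens, 1, tokens.length);  // the (stemmed) terms
        System.out.println(id + " has " + terms.length + " terms");
    }
}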

Third, study the four Java source files in the "src" subdirectory to understand how the toolkit works:

All the code is fairly well documented, so if you read through it carefully, you should be able to understand how it works.

Fourth, fill in the missing lines in InvertedIndex.java and IndexGeneration.java. Each file has only a few lines missing, so once you understand how the code works, it won't take long to fill in the missing statements. If you aren't familiar with parts of the Java API, you may need to look up the relevant functions/classes using this website to understand how to use a particular function. The places where you need to add missing statements are all marked with comments of the following format, so they are easy to spot:

 
 //#########################################################//
 // add a statement (statements) here so that ... 
 //
 // Hint: ....
 //#########################################################//
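
For orientation, the following is a generic sketch of the map/reduce pattern behind inverted indexing, written against the old Hadoop 0.19 API. It is NOT the toolkit's InvertedIndex.java: the class names, the "docID:count" posting format, and the overall structure are assumptions made purely for illustration, and the toolkit may organize things differently.

// Generic sketch only -- NOT the toolkit's InvertedIndex.java.
// Class names and the "docID:count" posting format are illustrative.
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndexSketch {

    // Map: each input line is "docID term1 term2 ...".
    // Emit (term, "docID:termCount") for every distinct term in the document.
    public static class IndexMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String[] tokens = value.toString().trim().split("\\s+");
            if (tokens.length < 2) return;
            String docId = tokens[0];
            HashMap<String, Integer> tf = new HashMap<String, Integer>();
            for (int i = 1; i < tokens.length; i++) {
                Integer c = tf.get(tokens[i]);
                tf.put(tokens[i], c == null ? 1 : c + 1);
            }
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                output.collect(new Text(e.getKey()),
                               new Text(docId + ":" + e.getValue()));
            }
        }
    }

    // Reduce: concatenate all "docID:count" postings that share the same term.
    public static class IndexReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text term, Iterator<Text> postings,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            StringBuilder sb = new StringBuilder();
            while (postings.hasNext()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(postings.next().toString());
            }
            output.collect(term, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(InvertedIndexSketch.class);
        conf.setJobName("inverted-index-sketch");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IndexMapper.class);
        conf.setReducerClass(IndexReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g., the document source directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory must not already exist
        JobClient.runJob(conf);
    }
}

In this sketch, the mapper counts term frequencies within one document and the reducer merges the per-term postings. The toolkit already provides this kind of scaffolding (only a few statements are missing), so when filling in the blanks, follow the surrounding code and its comments rather than this sketch.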

1.1: Complete the file InvertedIndex.java

Start with the file InvertedIndex.java. After you finish this file, do the following to test the file:

  1. Create a directory for this assignment in HDFS by using the following command:
     
    hadoop dfs -mkdir /home/YourID/cs410
    
    Further create a subdirectory to store the inverted index that you will build.
     
    hadoop dfs -mkdir /home/YourID/cs410/index
    
  2. Go to the "assign4" directory. All the following commands should be run while you are in "assign4".
    1. Compile it using the following command (make sure to run it in "assign4")
       
      javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/InvertedIndex.java
      
      This would generate the .class file and put it in the directory "obj".
    2. Generate a Java Archive (.jar) file using the following command
       
      jar -cvf simir.jar -C obj ./ 
      
      This would generate a file called "simir.jar" in the directory of "assign4".
    3. Execute InvertedIndex using the following command
       
      hadoop jar simir.jar InvertedIndex /home/zhou18/cs410/docsrc/ /home/YourID/cs410/tmp1
      
    This will probably take about 10 minutes. The program puts its output (i.e., the raw inverted index file) in /home/YourID/cs410/tmp1/. Note that if the directory /home/YourID/cs410/tmp1/ already exists, you should delete it before running InvertedIndex, or specify an output directory that doesn't already exist. The output is usually written to a file named "part-00000". You may view the file by using
     
    hadoop dfs -cat /home/YourID/cs410/tmp1/part-00000 |more
    
    To verify whether your program generates the results correctly, you may use a toy test data set to test your program:
     
    hadoop jar simir.jar InvertedIndex /home/zhou18/cs410/test/ /home/YourID/cs410/tmp0
    
    The file /home/zhou18/cs410/test/apsrc.txt has just 8 documents with a few words in each. (You may view them by using "hadoop dfs -cat /home/zhou18/cs410/test/apsrc.txt | more".) Don't forget that after you test with these 8 documents, you still need to run InvertedIndex on /home/zhou18/cs410/docsrc/. So if you write the result to the same place (e.g., /home/YourID/cs410/tmp1), you need to first remove the results generated from the 8 test documents.

Once your InvertedIndex works well, you can compile and run the program ComputeDocLen to generate a document length file. Again, make sure that you are in "assign4". Do the following:

 
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/ComputeDocLen.java
jar -cvf simir.jar -C obj ./ 
hadoop jar simir.jar ComputeDocLen /home/YourID/cs410/tmp1/part-00000  /home/YourID/cs410/tmp2
This generates a document length file and puts it in /home/YourID/cs410/tmp2/part-00000. You may take a look at the file to see if it looks right and then copy it to the final index directory:
 
hadoop dfs -cp /home/YourID/cs410/tmp2/part-00000 /home/YourID/cs410/index/ind.dlen

1.2: Complete the file IndexGeneration.java

Next, work on IndexGeneration.java and add the missing statements. Go to the "assign4" directory and test your implementation by doing:

 
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/IndexGeneration.java
jar -cvf simir.jar -C obj ./ 
hadoop jar simir.jar IndexGeneration /home/YourID/cs410/tmp1/part-00000  /home/YourID/cs410/index/ind
Again, you may want to test your program using the toy data set first to make sure it works correctly. Once your IndexGeneration program works correctly, it generates "ind.lex" and "ind.pos" and puts them in the "cs410/index" directory; together with the "ind.dlen" file you generated earlier, these form the complete inverted index for the provided news data set.

1.3: Experiment with the retrieval toolkit

Finally, experiment with the retrieval toolkit based on Retrieval.java. Go to the "assign4" directory and do:

 
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/Retrieval.java
jar -cvf simir.jar -C obj ./
hadoop jar simir.jar Retrieval /home/YourID/cs410/index/ind  /home/zhou18/cs410/query > exp/result
Note that when you compile Retrieval.java, you would see the following warning. This isn't a problem, so please ignore it.
 
Note: src/Retrieval.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

If your implementation is correct, you should get retrieval results in the file "result" under the "exp" sub-directory. Note that the retrieval results are now in your local directory (i.e., "result" is not an HDFS file). You can view the file normally, e.g., with "more exp/result". The results are a sequence of tuples of the form "queryID docID score".

Now, go to the directory "exp" and evaluate this result by doing the following:

 
perl ../src/ireval.pl  -j qrel -o pr < result 
This generates a TREC-style evaluation result and stores it in the file "pr" in the "exp" directory. From the file "pr", you can extract the Mean Average Precision (MAP), precision at 10 documents, and other measures. Specifically, the "pr" file contains these results for each query and, at the end, the averages of all the figures over all the queries. The MAP over all the queries is the most important measure; it is reported in the line "Set average (non-interpolated) precision = ...".
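
If you want to check the numbers in "pr" yourself, the following is a minimal sketch of how non-interpolated average precision for a single query is conventionally computed. This is the standard textbook definition, not code taken from ireval.pl:

// Standard non-interpolated average precision for one query (sketch only,
// not extracted from ireval.pl).
class AveragePrecisionSketch {
    // rankedIsRelevant[i]: whether the document at rank i+1 is relevant;
    // totalRelevant: number of relevant documents for the query (from qrel).
    static double averagePrecision(boolean[] rankedIsRelevant, int totalRelevant) {
        double sum = 0.0;
        int relevantSeen = 0;
        for (int rank = 1; rank <= rankedIsRelevant.length; rank++) {
            if (rankedIsRelevant[rank - 1]) {
                relevantSeen++;
                sum += (double) relevantSeen / rank;  // precision at this rank
            }
        }
        return totalRelevant == 0 ? 0.0 : sum / totalRelevant;
    }
}
// MAP is the mean of this value over all queries.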

Since the toolkit implements only a naive TF-IDF retrieval model, its retrieval accuracy is not very good, but you should be able to get a MAP above 0.1 if your implementation is correct.
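
For reference, one common form of such a naive TF-IDF score is sketched below. This is only one standard variant; the exact formula in Retrieval.java may differ, and the class and parameter names here are made up for illustration:

import java.util.Map;

// Sketch only: one common TF-IDF variant, not necessarily the exact formula
// implemented in Retrieval.java.
//   score(q, d) = sum over query terms t of  c(t,q) * c(t,d) * log(N / df(t))
// where c(t,q) and c(t,d) are term counts, N is the collection size, and
// df(t) is the number of documents containing t.
class TfIdfSketch {
    static double score(Map<String, Integer> queryTf, Map<String, Integer> docTf,
                        Map<String, Integer> docFreq, int numDocs) {
        double s = 0.0;
        for (Map.Entry<String, Integer> e : queryTf.entrySet()) {
            Integer cd = docTf.get(e.getKey());    // c(t,d): term count in the document
            Integer df = docFreq.get(e.getKey());  // df(t): document frequency of the term
            if (cd == null || df == null || df == 0) continue;
            s += e.getValue() * cd * Math.log((double) numDocs / df);
        }
        return s;
    }
}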

Task 2. [50 points] Implementing and experimenting with the kNN text categorization algorithm

In this task, you are going to implement the k-Nearest Neighbor (kNN) categorization algorithm. The basic steps of kNN are: (1) keep all training examples; (2) find the k examples that are most similar to the new document (its "neighbor" documents); (3) assign the category that is most common among these neighbor documents (the neighbors vote for the category). You will be given a list of training documents with their category tags (use this command to see its content: hadoop dfs -cat /home/zhou18/cs410/train.list | more) and a list of test documents (use this command to see its content: hadoop dfs -cat /home/zhou18/cs410/kNN.query | more). During testing, in order to find the k nearest neighbors of a test document, you will use the retrieval function implemented in Task 1 to rank all the documents. The basic idea is to use the test document as a "long" query, so that the top-ranked retrieval results are its nearest neighbors. Since only documents in the training data set have category labels, the algorithm scans the ranked list from the top down until it has collected k labeled documents. The majority category among these k labels is assigned as the category of the test document. In this task, you will use the news data from Task 1 as the experimental data set, so the index you built in the previous task can still be used here.
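
To make the voting step concrete, here is a small sketch of the idea, assuming the ranked retrieval results and the training labels are already available in memory. It is NOT the toolkit's "categorization" function; the names and signatures are purely illustrative:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the kNN voting step (NOT the toolkit's categorization function;
// names and signatures here are illustrative only).
class KnnSketch {
    // rankedDocIds: documents ranked by similarity to the test document
    //               (the test document used as a "long" query);
    // trainLabels:  docID -> category for the labeled training documents;
    // k:            number of labeled neighbors to collect.
    static String categorize(List<String> rankedDocIds,
                             Map<String, String> trainLabels, int k) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        int collected = 0;
        for (String docId : rankedDocIds) {
            String label = trainLabels.get(docId);
            if (label == null) continue;           // skip unlabeled (non-training) documents
            Integer c = votes.get(label);
            votes.put(label, c == null ? 1 : c + 1);
            if (++collected == k) break;           // stop once k labeled neighbors are found
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        }
        return best;                               // majority category (ties broken arbitrarily)
    }
}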

You are going to use the Cloud-Computing Testbed (CCT) to finish this task. To begin the experiments:

  1. Log in to your CCT account by using the following command:
    ssh YourID@altocumulus.cloud.cs.illinois.edu
  2. Enter your "cs410" directory created in Task 1. You will find ./assign4/src/kNN.java, evaluation script ./assign4/src/knneval.pl, and the gold standard for the test data ./assign4/exp/test.ans under your local directory.

2.1: Complete the file kNN.java

  1. Complete the kNN.java file. The core function you need to complete in this file is the "categorization" function. The input and the output of the function are well documented in the file. You only need to fill in a few statements.
  2. Go to the "assign4" directory and test your implementation by doing:
    javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/kNN.java
    jar -cvf simir.jar -C obj ./ 
    hadoop jar simir.jar kNN /home/YourID/cs410/index/ind /home/zhou18/cs410/train.list /home/zhou18/cs410/kNN.query 5 > exp/kNNresult
    
    The last argument "5" is the value for k. It will take about 10 minutes for the program to finish.

2.2 Experiment with kNN.java

Now, go to the directory "exp" and evaluate the result by doing the following:
perl ../src/knneval.pl test.ans kNNresult
Here, we use "accuracy" as the metric to evaluate the classification results. It is defined as:
accuracy = correctly_classified_cases / total_number_of_test_cases
What is the accuracy of the classification result when k is set to 5? Please also set k to 1, 5, 15, 25, 50. Report the accuracy of the classification results for the different k values. What do you find from the results? (Hint: if your accuracy is lower than 50%, there are probably bugs in your program.)
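
The accuracy computation itself amounts to something like the following sketch, assuming the predicted and gold labels are keyed by test-document ID. This is not knneval.pl's code, just an illustration of the formula above:

import java.util.Map;

// Sketch of the accuracy computation (not knneval.pl itself).
class AccuracySketch {
    static double accuracy(Map<String, String> gold, Map<String, String> predicted) {
        int correct = 0;
        for (Map.Entry<String, String> e : gold.entrySet()) {
            if (e.getValue().equals(predicted.get(e.getKey()))) correct++;
        }
        return (double) correct / gold.size();  // correctly classified / total test cases
    }
}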

What to turn in

Please pack all the following files into one single zip or tar file and upload the package to Compass by midnight of the due date:
  1. Task 1: the two completed Java source files (i.e., InvertedIndex.java and IndexGeneration.java) and the precision file for the basic TF-IDF retrieval function as implemented in the original toolkit (i.e., the file generated by "ireval.pl").
  2. Task 2: your completed kNN.java file.
  3. An accuracy report for Task 2.2 covering the different values of k, together with your findings from Task 2.2, in a PDF or DOC file.