This is an individual assignment. Your tasks will be:
(1) complete the implementation of the MapReduce indexer, build an index on a Hadoop cloud-computing cluster, and run retrieval experiments on a given retrieval test collection with the basic TF-IDF retriever of a simple Java retrieval toolkit to obtain a baseline retrieval accuracy;
(2) complete the implementation of the k-Nearest Neighbor (kNN) text categorization algorithm and experiment with the AP news data.

In this assignment, we will use this server: altocumulus.cloud.cs.illinois.edu
First, do the following to set up the toolkit.
ssh YourID@altocumulus.cloud.cs.illinois.edu

Create a directory under your home directory for finishing this assignment, name it "cs410", and enter it:

mkdir cs410
cd cs410

Then copy the toolkit file into it:
hadoop dfs -copyToLocal /home/zhou18/cs410/simir.tar.gz ./

This command is one of the DFS shell commands of HDFS; it copies the toolkit file from HDFS to your local directory (i.e., /home/YourID/cs410/). If you aren't already familiar with the HDFS DFS shell, take a look at this tutorial to get familiar with at least a few commonly used commands such as "-ls", "-cat", "-copyFromLocal", "-copyToLocal", "-mkdir", "-rmr", and "-cp".
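For example, typical invocations look like the following (the paths here are only illustrations, not files you are expected to have):

hadoop dfs -ls /home/YourID/cs410
hadoop dfs -cat /home/YourID/cs410/somefile | more
hadoop dfs -rmr /home/YourID/cs410/tmpdir

Now unpack the toolkit: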
tar -zxvf simir.tar.gz

You should see the "assign4" directory under "cs410", and if you enter "assign4", you will see three sub-directories: (1) assign4/src: all the Java source files and Perl scripts for evaluation; (2) assign4/obj: the directory for all the object code (i.e., *.class files); (3) assign4/exp: the directory you will use for storing and analyzing retrieval results. You will be working mostly in the directory "assign4".
Second, take a look at the data: the document source file /home/zhou18/cs410/docsrc/apsrc.txt and the query file /home/zhou18/cs410/query. To view these files, use the following commands:
hadoop dfs -cat /home/zhou18/cs410/docsrc/apsrc.txt | more
hadoop dfs -cat /home/zhou18/cs410/query | more

You will see that both are formatted such that each document (or query) occupies one separate line, with the document ID (or query ID) at the beginning, followed by a sequence of terms. All the terms were stemmed with a Porter stemmer.
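For instance, a document line might look like the following (a made-up illustration, not actual data from the collection):

AP890101-0001 presid sign tax bill presid veto ...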
Third, study the four Java source files in the "src" subdirectory (InvertedIndex.java, ComputeDocLen.java, IndexGeneration.java, and Retrieval.java) to understand how the toolkit works.
Fourth, fill in the missing lines in InvertedIndex.java and IndexGeneration.java. Each file only has a few lines missing, so once you understand how the code works, it won't take long to fill in the missing statements. If you aren't familiar with some Java constructs, you may need to look up the relevant functions/classes on this website to understand how to use a particular function. The places where you need to add missing statements are all marked with comments of the following format, so they are easy to spot:
//#########################################################//
// add a statement (statements) here so that ...
// Hint: ....
//#########################################################//
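To orient yourself before filling in the blanks, here is the general shape of an inverted-index job written against the Hadoop 0.19 "mapred" API. This is only an illustrative sketch, not the toolkit's actual code: the class names and the "docID:tf" posting format are assumptions made for the example.

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class IndexSketch {
  // Mapper: each input line is "docID term term ...".
  // Emits (term, "docID:tf") for every distinct term in the document.
  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text line,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] tok = line.toString().split("\\s+");
      if (tok.length < 2) return;
      // Count the frequency of each term within this document.
      HashMap<String, Integer> tf = new HashMap<String, Integer>();
      for (int i = 1; i < tok.length; i++) {
        Integer c = tf.get(tok[i]);
        tf.put(tok[i], c == null ? 1 : c + 1);
      }
      for (java.util.Map.Entry<String, Integer> e : tf.entrySet())
        out.collect(new Text(e.getKey()), new Text(tok[0] + ":" + e.getValue()));
    }
  }

  // Reducer: concatenates all "docID:tf" postings for a term onto one line,
  // which is exactly the "term docID1:tf1 docID2:tf2 ..." shape an
  // inverted index needs.
  public static class PostingsReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text term, Iterator<Text> values,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      StringBuilder postings = new StringBuilder();
      while (values.hasNext())
        postings.append(values.next().toString()).append(' ');
      out.collect(term, new Text(postings.toString().trim()));
    }
  }
}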
Start with the file InvertedIndex.java. After you finish this file, do the following to test the file:
hadoop dfs -mkdir /home/YourID/cs410

Further, create a subdirectory to store the inverted index that you will build:
hadoop dfs -mkdir /home/YourID/cs410/index
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/InvertedIndex.java

This generates the .class file and puts it in the directory "obj".
jar -cvf simir.jar -C obj ./

This generates a file called "simir.jar" in the "assign4" directory.
hadoop jar simir.jar InvertedIndex /home/zhou18/cs410/docsrc/ /home/YourID/cs410/tmp1
hadoop dfs -cat /home/YourID/cs410/tmp1/part-00000 | more

To verify whether your program generates the results correctly, you may use a toy data set to test your program:
hadoop jar simir.jar InvertedIndex /home/zhou18/cs410/test/ /home/YourID/cs410/tmp0

The file /home/zhou18/cs410/test/apsrc.txt has just 8 documents with a few words in each. (You may list the directory with "hadoop dfs -ls /home/zhou18/cs410/test/" and view the file with "hadoop dfs -cat /home/zhou18/cs410/test/apsrc.txt".) Don't forget that after you test with these 8 documents, you still need to run InvertedIndex on /home/zhou18/cs410/docsrc/. So if you write the result to the same place (e.g., /home/YourID/cs410/tmp1), you first need to remove the results generated from the 8 test documents.
Once your InvertedIndex works well, you can compile and run the program ComputeDocLen to generate a document length file. Again, make sure that you are in "assign4". Do the following:
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/ComputeDocLen.java
jar -cvf simir.jar -C obj ./
hadoop jar simir.jar ComputeDocLen /home/YourID/cs410/tmp1/part-00000 /home/YourID/cs410/tmp2

This generates a document length file and puts it in /home/YourID/cs410/tmp2/part-00000. You may take a look at the file to see if it looks right, and then copy it to the final index directory:
hadoop dfs -cp /home/YourID/cs410/tmp2/part-00000 /home/YourID/cs410/index/ind.dlen
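Conceptually, ComputeDocLen turns the postings back around: every "docID:tf" entry credits tf occurrences to that document, and a document's length is the sum of these tf values over all terms. Here is a small sketch of that core step, assuming the illustrative posting format used in the sketch above (the toolkit's actual format may differ):

// Accumulate document lengths from one inverted-index line of the form
// "term docID1:tf1 docID2:tf2 ..." (illustrative format, as above).
static void collectLengths(String indexLine, java.util.Map<String, Integer> docLen) {
  String[] tok = indexLine.split("\\s+");
  for (int i = 1; i < tok.length; i++) {            // tok[0] is the term; skip it
    int sep = tok[i].lastIndexOf(':');
    String docID = tok[i].substring(0, sep);
    int tf = Integer.parseInt(tok[i].substring(sep + 1));
    Integer old = docLen.get(docID);
    docLen.put(docID, old == null ? tf : old + tf); // docLen = sum of tf over terms
  }
}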
Next, work on IndexGeneration.java and add the missing statements. Go to the "assign4" directory and test your implementation by doing:
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/IndexGeneration.java
jar -cvf simir.jar -C obj ./
hadoop jar simir.jar IndexGeneration /home/YourID/cs410/tmp1/part-00000 /home/YourID/cs410/index/ind

Again, you may want to test your program on the toy data set first to make sure it works correctly. Once your IndexGeneration program works correctly, it will generate "ind.lex" and "ind.pos" and put them in the "cs410/index" directory; together with the "ind.dlen" file you generated earlier, these form the complete inverted index for the provided news data set.
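The lex/pos split is a standard inverted-index layout: the lexicon maps each term to its summary statistics and to the location of its postings in the positions file. Purely as an illustration of that layout (the actual ind.lex/ind.pos format used by the toolkit may differ):

// A typical lexicon entry in a lex/pos index layout (illustrative only;
// the toolkit's actual ind.lex format may differ).
class LexEntry {
  String term;  // the indexed term
  int df;       // number of documents containing the term
  long offset;  // where the term's postings start in ind.pos
}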
Finally, experiment with the retrieval toolkit based on Retrieval.java. Go to the "assign4" directory and do:
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/Retrieval.java
jar -cvf simir.jar -C obj ./
hadoop jar simir.jar Retrieval /home/YourID/cs410/index/ind /home/zhou18/cs410/query > exp/result

Note that when you compile Retrieval.java, you will see the following warning. This isn't a problem, so please ignore it:
Note: src/Retrieval.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
If your implementation is correct, you should get retrieval results in the file "result" under the "exp" sub-directory. Note that the retrieval results are now in your local directory (i.e., not an HDFS file), so you can view the file normally, e.g., with "more exp/result". The results are a sequence of tuples of the form "queryID docID score".
Now, go to the directory "exp" and evaluate this result by doing the following:

perl ../src/ireval.pl -j qrel -o pr < result

This generates a TREC-style evaluation result and stores it in the file "pr" in the directory "exp". From the file "pr", you can extract the Mean Average Precision (MAP), precision at 10 documents, and other measures. Specifically, the "pr" file reports these measures for each query and, at the end, their averages over all the queries. The MAP over all the queries is the most important measure; it is reported in the line "Set average (non-interpolated) precision = ...".
Since the toolkit implements only a naive TF-IDF retrieval model, its retrieval accuracy is not very good, but you should be able to get a MAP above 0.1 if your implementation is correct.
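For reference, a naive TF-IDF score of the kind this toolkit uses is typically some variant of the following. This is one common formulation, offered only as a sketch; it is not necessarily the exact formula in Retrieval.java:

// score(q, d) = sum over query terms t of  c(t, q) * c(t, d) * log(N / df(t))
// where c(t, q) and c(t, d) are the counts of t in the query and the document,
// N is the total number of documents, and df(t) is t's document frequency.
static double tfIdfScore(int[] countInQuery, int[] countInDoc, int[] df, int N) {
  double score = 0.0;
  for (int t = 0; t < countInQuery.length; t++)
    if (countInQuery[t] > 0 && countInDoc[t] > 0)
      score += countInQuery[t] * countInDoc[t] * Math.log((double) N / df[t]);
  return score;
}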
Part (2) asks you to complete the implementation of the kNN text categorization algorithm in kNN.java. You are going to use the Cloud-Computing Testbed (CCT) to finish this task. Make sure you are in the "assign4" directory, then begin the experiments:
javac -classpath /hadoop/hadoop-0.19.0-core.jar -d obj src/kNN.java
jar -cvf simir.jar -C obj ./
hadoop jar simir.jar kNN /home/YourID/cs410/index/ind /home/zhou18/cs410/train.list /home/zhou18/cs410/kNN.query 5 > exp/kNNresult

The last argument, "5", is the value for k. It will take about 10 minutes for the program to finish.
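While the job runs, it may help to recall what the classifier is doing: for each test document, it scores all the training documents by similarity, keeps the k highest-scoring ones, and assigns the label that the majority of them carry. Below is a minimal sketch of the voting step only; the method name and the way labels are represented are assumptions for illustration, not the toolkit's actual code.

import java.util.HashMap;

// Given the category labels of the k nearest training documents,
// return the majority-vote category (ties broken by first-seen label).
static String majorityVote(String[] topKLabels) {
  HashMap<String, Integer> votes = new HashMap<String, Integer>();
  String best = null;
  int bestCount = 0;
  for (String label : topKLabels) {
    Integer c = votes.get(label);
    int count = (c == null) ? 1 : c + 1;
    votes.put(label, count);
    if (count > bestCount) { bestCount = count; best = label; }
  }
  return best;
}

When the job finishes, go to the "exp" directory and evaluate the result: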
perl ../src/knneval.pl test.ans kNNresult

Here, we use "accuracy" as the metric to evaluate the classification results. It's defined as:

accuracy = correct_classified_cases / total_number_of_test_cases

For example, if 800 of 1000 test cases are labeled correctly, the accuracy is 0.8. What is the accuracy of the classification result when k is set to 5? Repeat the experiment with k set to 1, 5, 15, 25, and 50, and report the accuracy of the classification results for the different values of k. What do you find from the results? (Hint: if you get an accuracy lower than 50%, there are probably bugs in your program.)