CS397CXZ Assignment #4: Naive Bayes for Spam Filtering
(due Nov 12, 2003, 2:00pm)

Introduction

Naive Bayes is one of the simplest, yet effective learning algorithms for text classification. The goal of this assignment is to implement the Naive Bayes algorithm for text classification and to experiment with it for junk email filtering. Junk email filtering can be regarded as a classification problem involving two categories of which one represents legitimate messages and the other spams.

You will be using a "light" version of the Lemur toolkit, which is available here. The provided code has an incomplete implementation of Naive Bayes using the multinomial document model. You need to complete the implementation by adding a few lines of code at several places. Note that while the regular Lemur can run on both Windows and Unix, this light version is only ready for running on Unix. (The source code is portable, so you can create a Windows project manually by adding all the source files to your project space if you use visual c++.) All of you should have an account in the Computer Science Instructional Laboratories (CSIL), and you are strongly encouraged to use the machines in the Linux lab, where I have requested a 100MB disk quota for each of you. Contact CSIL if you have any problem with your account.

The usage of Lemur can be found online here. You will need to know how to compile the toolkit and how to modify an application. Basically, you go to the root directory, and type in "./configure" and then "gmake". Every time you change a program, just go to the root directory and do "gmake".

The code you need to complete is in a file named "NB.cpp" in the directory "app/src". You need to complete the implementation of two functions training and testing. See the source code for the location where you need to insert new code. You need to thoroughly understand how the Naive Bayes classifier works in order to be able to complete the implementation. The provided code uses two arrays to represent the two multinomial models (i.e., p(w|LEGI) and p(w|SPAM)). The training function first goes through each term in each training document and accumulates the counts that are necessary for estimating these two models. The training function then normalize these counts to obtain an estimate of probabilities. The testing function goes through each test document, and for each test document, goes through all the terms and computes log p(LEGI|D) and log p(SPAM|D). It then makes a classification decision by comparing these two quantities. It prints out each classification decision in the following format:

docid "log P(LEGI|D)" "log P(SPAM|D)" Tag
where, Tag is either "L" for "legitimate" or "S" for "spam".

The Data Set

The data set to be used for this assignment is the ``bare word'' version of the Ling-Spam test collection used by Ion Androutsopoulos in his study of Naive Bayes classifier applied to spam filtering. The original data is available at http://www.iit.demokritos.gr/~ionandr. We have rearranged and preprocessed the data slightly so as to make it easier to finish this assignment. Each message is treated as a document and, and all the documents are concatenated as one single source file, just like the CACM collection. A legitimate email message has a document ID starting witha digit, while a spam has a document ID with a prefix "spm". Thus, one can tell whether a message or a document is a spam by just looking at the first character of the document ID.

Tasks

  1. [30 points] Complete the implementation of the Naive Bayes classifier by completing the functions training and testing. Note that we use the multinomial document model. In training you need to fill in appropricate code to (1) accumulate counts for estimating the two multinomial models; (2) estimate the category prior probability; (3) add Laplace smoothing (i.e., the "add-one" smoothing as described on the lecture slides). In testing, you need to fill in appropriate code to compute the log posterior probability of each category (i.e., LEGI and SPAM) given a test document.

    Test your code with the SPAM test collection included in the data directory of the Lemur toolkit. Specifically, go to the data directory and do the following:

  2. [15 points]The NB program would print out various kinds of information about the trained model after finishing training. (1) Examine the top 20 words with the highest word probabilities in the spam category as well as in the legitimate category (i.e., p(w|SPAM) and p(w|LEGI)). Do the two lists look different? Are there any overlapping words? In general, what kind of words are they? (2) Now examine the top 20 words with the highest ratio p(w|SPAM)/p(w|LEGI) and p(w|LEGI)/p(SPAM). What kind of words are they? Are there any overlapping words between the two lists? How are these words different from what you printed out in the previous question?
  3. [15 points] Try different training sets. The file "nb_param" provides the following information for the Naive Bayes program: (1) The index; (2) The training doc id file; (3) The test doc id file; and (4) a cutoff for removing stop words. See nb_param for the actual variable names for each of the four pieces of information. In the data directory, you will find three training sets -- train1, train2, train3, and one test set test. Test your program using all the three training sets, and compare their classification performance in terms of all three measures. Which training set performs the best and which performs the worst? Can you explain why one training set is better than the other?
  4. [15 points] Try removing stop words. For all the three training sets, try to vary the parameter stopWordCutoff, and study how this parameter may or may not affect the classification performance. This parameter controls how many of the highest frequency words (in the whole collection) should be treated as stop words and thus ignored for both training and testing. E.g., when the value is "0.01", it means the top 1% highest frequency words would be regarded as stop words. Try the following five values: "0.0001", "0.001", "0.01", "0.1", and "0.5". What did you find? Did removing stop words affect the performance? Why?
  5. [10 points] Error analysis. Examine some of the messages that the NB program has mis-labeled. Why do you think these messages are not classified correctly? Look at both types of errors, i.e., a legitimate message mis-labeled as a spam and a spam mis-labeled as legitimate. Is there any way to improve the Naive Bayes classifier to avoid such mistakes?
  6. [10 points] The Naive Bayes classifier can make two types of errors, which are not distinguished when we computing the classification accuracy. However, mis-labeling a legitimate message as a spam is a much more serious mistake than mis-labeling a spam as legitimate. Can you design an evaluation metric that can reflect our preference for making one type of mistake rather than the other way? How can you change our NB program to reflect this preference? (You do not need to actually change the code; you just need to describe the idea.)
  7. [5 points] Open exploration. Try out some ideas to improve the multinomial Naive Bayes classifier, and test how they work with this data. Some possibilities are: (1) Based on the error analysis, you may think of some way to correct those mistakes; (2) You may want to try a different smoothing method; (3) You may want to try a different feature selection method. Try at least one of these.

What to turn in

A. Please turn in a hardcopy of your written answers at the class. Computer printout is strongly recommended and will entitle you to get an extra credit of 5 points.

B. Please copy the completed file NB.cpp and any other code (if any) to your handin directory on machines of CSIL, which is /home/class/cs397cxz/handin/assign4/YOUR_ID.