CS410 Assignment #5: Mixture Models for Clustering
(due Friday, April 18, 2008, in Class)

Introduction

The goal of this assignment is to implement and experiment with a clustering algorithm. You will be using a "light" version of the Lemur toolkit. Assignment #4 contains all necessary files for this assignemnt. Note that while the regular Lemur can run on both Windows and Unix, this light version is only ready for running on Unix. (The source code is portable, so you can create a Windows project manually by adding all the source files to your project space if you use visual c++.) All of you should have an account in the Computer Science Instructional Laboratories (CSIL), and you are strongly encouraged to use the machines in the Linux lab. Contact CSIL if you have any problem with your account.

The general usage of Lemur can be found online here. You will need to know how to compile the toolkit and how to modify an application. Basically, you go to the root directory, and type in "./configure" and then "gmake". If it is the first time you type "gmake" for this package, it will take a few minutes. Every time you change a program, just go to the root directory and do "gmake".

The code you need to complete is in a file named "aspect-cluster.cpp" in the directory "app/src". "aspect-cluster.cpp" is implementing a simple mixture model for clustering an arbitrary collection of documents. We assume there are k latent common themes in the collection of documents where each theme is characterized by a multinomial word distribution (unigram language model). A document is regarded as a sample of a mixture model with these theme models as components. We fit such a mixture model to the set of documents to obtain component multinomial models. The details of the clustering algorithm are given in the paper:

ChengXiang Zhai, Atulya Velivelli, Bei Yu, A cross-collection mixture model for comparative text mining, Proceedings of ACM KDD 2004 ( KDD'04 ), pages 743-748, 2004. pdf

You should read the first three sections of the paper and make sure you fully understand the clustering algorithm. In this assignment, you only need to fill in one line of code to complete the implementation. For that, you have to completely understand how the EM algorithm is supposed to work.

In the implemented code, we have also included support for incorporating priors for clusters. Thus what we have actually implemented in the code is:


Here mu is the prior coefficient and p_0(w|Theta_j) are given as priors. Setting mu to the default value of zero will give us the results corresponding to the clustering method discussed in the paper. By setting mu to larger values, we can incorporate cluster priors.

Tasks

  1. [35 points] Complete the code by filling in the missing line of code. The location of the missing line is indicated in the program.

    Test your code with the laptop reviews test collection included in the data directory of the Lemur toolkit. The data set is composed of 3 review sets downloaded from epinions.com: Apple iBook, Dell Inspiration and IBM ThinkPad. Specifically, go to the data directory and do the following:

    Take a look at the results. Do you see any interesting clusters?

  2. [15 points] Try different values for the background coefficient (e.g. 0.1 and 0.5). Change the value of the background coefficient “lambdaB” in the parameter file and execute the clustering program. How do the clusters change when you vary the background coefficient?

  3. [15 points] Try different number of clusters (e.g. 3 and 20). Change the number of clusters by changing the value of “cluster” in the original parameter file to different values and execute the clustering program. Are the result clusters coherent? How do the results change when you vary the number of clusters?

  4. [15 points] Do the clustering on a subset of the documents. The file ibm-reviews in the data directory provide the sub list of the reviews corresponding to IBM laptops. Cluster the documents in this subset by changing “docset” in the original parameter file. Are the results different from the results of clustering all the documents?

  5. [20 points] Do the clustering with cluster priors. Set mu to a positive value (e.g. 60) and use a prior file for clustering. Here is the format of each line of the prior file:

    cID w prob

    where cID is the cluster ID (a number between 0 and k – 1 where k is the number of clusters), w is a word in the vocabulary and prob is the prior probability of w in cluster cID. Here is an example of a prior file:

    0

    speaker

    0.5

    0

    sound

    0.5

    1

    drive

    0.5

    1

    dvd

    0.3

    1

    cd

    0.2

    In the original parameter file, set “mu” to a positive value (e.g. 60) and change “prior” to the "review-prior" file in the data directory. Now do the clustering with priors. How are the clusters different from the original clusters? Take a look at the "review-prior" file. Are the clustering results similar to what you expected to get using these priors?

What to turn in

A. Please turn in a hardcopy of your written answers at the class.

B. Please pack all the following files into one single zip file or tar file, name it as "assign5-YourNetID", then email the package to "maryamcs410  AT gmail DOT com".

  1. Your complete code (i.e., aspect-cluster.cpp).

  2. For each task: