Information Retrieval

CS 6501 Fall 2014

MP1— Web Crawling and Basic Text Analysis

Posted: September 02nd, 2014

This assignment consists of two parts totaling 100 points:

You need to use the CS lab servers to store the crawled web documents. Please make sure you have an account properly set up on the following three servers before the deadline:

If you encounter any technical difficulty in accessing the above servers, e.g., loss of data, software installation requirements, or running out of disk space, you should send an email to for help.

NOTE: Since part 2 directly depends on the data collected in part 1, there are two separate deadlines for this homework assignment. Be aware of each deadline.

Part 1: Crawl the web

This semester, we are going to build a vertical search engine for medical forums, e.g., finding relevant online discussions to satisfy users' health-related information needs. Online health communities constitute an important source of information for patients and doctors alike: 59% of the adult U.S. population consults online health resources, and nearly 50% of U.S. physicians rely on online resources for professional use. It is therefore critical for us computer scientists to build effective infrastructure to support their daily practice.

The first step in building the system is to crawl online discussions from medical forums. In this assignment, you are required to crawl discussion data from the following four online medical forums:

This is a class-level collaborative effort to collect the data. The crawled data will be used in the next two machine problems for building our medical forum search engine. To avoid duplicated effort and maximize the coverage of our crawled content, every student is required to register the entry point of his or her crawling target here (please log in with your UVa email account). Each entry point will be handled by only one student.

Once you have registered an entry point, make sure you exhaust all the discussions under that topic, so that when we merge the data from all students we will have good coverage of those medical forums.


Each student is required to cover at least three different topics from a single medical forum, e.g., choosing the topics of cancer, obesity and allergies from HealthBoards. However, to guarantee the coverage of different forums, each forum can have at most 10 students registered for crawling. Registration is first come, first served; once a forum is no longer available for registration, you have to move on to the next available forum.

In addition, to make it fair to all students (some topics might only have a handful of discussions), each student is required to crawl at least 3000 different threaded discussions. If your three selected topics cannot provide 3000 threaded discussions, you should add additional (and available) topics from that forum.

The instructor has provided a set of basic HTML wrappers for those four medical forums. The wrappers are written in Java with jsoup, and they can be downloaded here. The package includes a sample Java implementation of a wrapper for each of the four selected medical forums, sample input HTML files, and the corresponding json output. The Eclipse project configuration files are also included for your reference.

The provided wrappers only support basic parsing of the crawled HTML files, e.g., extracting the author, timestamp, and discussion content of each post, and converting the extracted content to a standard json output. However, they cannot identify links to other discussions, nor process threaded discussions that span multiple pages. You need to extend them to support the full function of a web crawler.

For those who prefer Python, you can use Beautiful Soup to implement a set of similar wrappers for your selected medical forum. Please make sure the output json format is the same as the instructor's definition, which is explicitly stated below.

json format definition:
1. thread object
2. post object
  'title':'discussion-title-of the-post',
  'content':'discussion-content-of the-post',
A field will be missing if it does not occur in the corresponding discussion, e.g., the first post of a threaded discussion does not have a replyToID.

[Clarification] Please note that only one json output is expected for each threaded discussion. In other words, when a threaded discussion spans multiple html pages, please crawl all the html pages, save them individually, extract all the discussion posts from them, and store the posts as a single json object (i.e., a single json file). The purpose of this requirement is to facilitate the later indexing step: by putting all the related posts in a single file, we can easily explore the discussion structure via the reply-to relation and the temporal relation. For example, if one threaded discussion, e.g., topic123 in WebMD, spans 3 pages, please name your crawled html files YourcomputingID-topic123-p1.html, YourcomputingID-topic123-p2.html, and YourcomputingID-topic123-p3.html, and name the corresponding json file YourcomputingID-topic123.json.
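
The naming convention and the one-json-file-per-thread rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names and the 'posts' key are hypothetical, and the actual thread and post fields must follow the instructor's json definition.

```python
import json


def html_name(computing_id, thread_id, page):
    """File name for one crawled page, e.g. hw5x-topic123-p1.html."""
    return "{}-{}-p{}.html".format(computing_id, thread_id, page)


def json_name(computing_id, thread_id):
    """File name for the single json file of a thread."""
    return "{}-{}.json".format(computing_id, thread_id)


def merge_pages(title, pages_of_posts):
    """Combine the posts parsed from every page into one thread object.

    `pages_of_posts` holds one entry per html page, in page order; each
    entry is the list of post dicts extracted from that page.
    The 'posts' key is an assumption made for this sketch.
    """
    posts = [post for page in pages_of_posts for post in page]
    return {"title": title, "posts": posts}


def thread_to_json(thread):
    """Serialize the merged thread as one json document."""
    return json.dumps(thread, indent=2)
```

Keeping the page order while flattening preserves the temporal order of the posts, which the later indexing step relies on.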

Packages for web crawling

If you are using Java, jsoup provides reasonable support for scraping web pages. In addition, working directly with the URL class in Java is also convenient. Here is a short tutorial about it.

If you are using Python, urllib2 provides all the necessary support for you to perform web crawling. One powerful package for web crawling in Python is mechanize, which provides stateful programmatic web browsing simulation. It provides mechanisms to support complex web access, e.g., user login and cookies. All the CS lab servers should already have this package installed.

Packages for HTML parsing

jsoup is again the recommended package for Java users, and Beautiful Soup is recommended for Python users. Sample usage of jsoup can be found in the instructor's provided wrapper implementation. Hand-crafted regular expressions are also a good choice for this specific homework assignment, given that the page layout is fixed and known a priori in our selected medical forums.

Important notes

  1. Be polite: please design a mechanism to slow down your crawler, e.g., sleep 1-3 seconds after every web access. Because all the students' crawlers are concurrently accessing those web sites from our department's servers, it is necessary to avoid any potential crawling issue, e.g., IP blocking. Using a proxy together with a proper waiting scheme is highly preferred. It is pretty easy to set up a proxy in both the aforementioned Java and Python libraries.
  2. Be focused: since we are building a vertical search engine for medical discussions, we only need the threaded discussions; all the other content and links, e.g., content and links on the sidebar or links to related materials and external documents, are of no interest to us. You should avoid crawling that content. The general workflow should look like this: a) identify the pointers to the detailed discussions on the entry page you have registered; b) crawl all the posts under each threaded discussion (if it spans several pages, find all the posts from those pages), and apply your modified wrapper to convert the discussion content into the standard json format; c) repeat steps a) and b) for all the following pages of your entry point.
  3. Be organized: please properly organize and name your crawled web documents and output json files such that when we merge the documents across all the students' results, we won't face any naming conflict. The suggested folder layout is as follows:
                |- hw5x-thread1.html
                |- hw5x-thread2.html
                |- hw5x-thread1.html
                |- hw5x-thread2.html
    The topic name should be the same as listed on the discussion board, e.g., cancer, and you can name the threads in the following fashion: YourcomputingID-UniqueThreadID.html.
    Accordingly, the output json file folder should be organized in the same manner, and the file names should differ only in the suffix, e.g., '.html' vs. '.json'.
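
As a rough sketch of the "be polite" and "be focused" notes, the workflow a) to c) could look as follows in Python 3. The three callables `fetch`, `find_threads` and `find_next_page` are hypothetical and forum specific; they are passed in so that the waiting scheme stays independent of any particular HTML layout or download library.

```python
import random
import time


def polite_fetch(url, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch one URL, then pause for a random 1-3 seconds."""
    html = fetch(url)
    sleep(random.uniform(min_delay, max_delay))
    return html


def crawl_entry_point(entry_url, fetch, find_threads, find_next_page,
                      sleep=time.sleep):
    """Walk the listing pages of one entry point and yield thread pages.

    fetch(url)            -> HTML string of that page
    find_threads(html)    -> thread URLs found on a listing page
    find_next_page(html)  -> URL of the next page, or None on the last page
    All three callables are assumptions of this sketch.
    """
    seen = set()
    listing_url = entry_url
    while listing_url is not None:
        listing_html = polite_fetch(listing_url, fetch, sleep=sleep)
        for thread_url in find_threads(listing_html):       # step a)
            if thread_url in seen:
                continue                                    # skip duplicates
            seen.add(thread_url)
            pages = []
            page_url = thread_url
            while page_url is not None:                     # step b)
                page_html = polite_fetch(page_url, fetch, sleep=sleep)
                pages.append(page_html)
                page_url = find_next_page(page_html)
            yield thread_url, pages
        listing_url = find_next_page(listing_html)          # step c)
```

Only links returned by `find_threads` and `find_next_page` are ever followed, so sidebar and external links are never crawled. In Python 3, `fetch` could be a thin wrapper around `urllib.request.urlopen`, which takes over the role of urllib2.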

Submission guideline

The instructor will announce a shared directory on the CS lab server shortly. Properly pack your crawled data, e.g., make a tarball for your whole data folder and source code folder, and copy it to the shared directory before the deadline.

Please make sure your submitted code is executable on the CS lab server.

Deadline for part 1

The deadline for part 1 is 11:59pm, Thursday, September 11th (updated).

Part 2: Analyze the documents

The second part of MP1 is to help you get familiar with basic text analysis techniques and procedures. We will go through the basic processing steps of tokenization, stemming, stopword removal and N-gram document representation. We will also visualize Zipf's law on our crawled medical forum data.

Let's first go over several important concepts and techniques for basic text analysis.


Tokenization is the process of breaking a stream of text into meaningful units. Simple tokenization can be achieved by regular expressions. For example, the following statement in Java splits the input string into tokens:

"I've practiced for 30 years in pediatrics, and I've never seen anything quite like this.".split("[\\W]+")

In this statement, we define the boundary of a token to be any sequence of non-word characters. The corresponding output of this statement is,

*I*, *ve*, *practiced*, *for*, *30*, *years*, *in*, *pediatrics*, *and*, *I*, *ve*, *never*, *seen*, *anything*, *quite*, *like*, *this*

where ** indicate the boundary of a token.
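
For Python users, a comparable regular-expression tokenizer can be sketched with the standard `re` module. The function name `simple_tokenize` is only illustrative. Unlike Java's `split`, Python's `re.split` keeps the empty string that follows the final period, so empty strings are filtered out explicitly.

```python
import re


def simple_tokenize(text):
    """Split on runs of non-word characters and drop empty strings."""
    return [token for token in re.split(r"\W+", text) if token]


sentence = ("I've practiced for 30 years in pediatrics, "
            "and I've never seen anything quite like this.")
tokens = simple_tokenize(sentence)
```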

A more advanced solution is to use statistical machine learning based approaches. In this assignment, we will learn how to use the tokenizer in OpenNLP (in Java) and NLTK (in Python) to perform tokenization.

1. Tokenizer in OpenNLP

The detailed documentation for this tokenizer can be found here. You can download the library here and the trained model file here (please choose the English tokenizer).

Once you have properly loaded the model from file, tokenization can be performed simply by,

String tokens[] = tokenizer.tokenize("An input sample sentence.");

2. Tokenizer in NLTK

NLTK provides several implementations of tokenization modules, and many of them are actually regular expression based.

They all share the same simple usage,

>>> import nltk
>>> tokenizer = nltk.tokenize.treebank.TreebankWordTokenizer()
>>> tokenizer.tokenize("I've practiced for 30 years in pediatrics, and I've never seen anything quite like this.")
['I', "'ve", 'practiced', 'for', '30', 'years', 'in', 'pediatrics', ',', 'and', 'I', "'ve", 'never', 'seen', 'anything', 'quite', 'like', 'this', '.']


Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form. For example, "ladies" would be mapped to "ladi" as a result of stemming (although "lady" would be a more desirable result).

1. Stemmers in Java

Unfortunately, OpenNLP does not currently support stemming. There are several existing implementations of stemmers in Java, including the Snowball Stemmer and the Porter Stemmer. The Snowball Stemmer package contains both of these popular stemmers.

The usage of the stemmers in the Snowball package is very simple,

SnowballStemmer stemmer = new englishStemmer(); // get an instance of SnowballStemmer for English
stemmer.setCurrent(token); // set the input
stemmer.stem(); //perform stemming
String stem = stemmer.getCurrent(); // get the output

2. Stemmers in NLTK

NLTK provides several implementations of stemming modules, including the Porter Stemmer and the Snowball Stemmer.

The usage of either stemmer in NLTK is very simple. For example, to use Snowball Stemmer,

>>> from nltk.stem.snowball import EnglishStemmer # load the stemmer module from NLTK
>>> stemmer = EnglishStemmer() # get an instance of SnowballStemmer for English
>>> stemmer.stem('ladies') # call stemmer to stem the input

Stopword removal

In Information Retrieval, stopwords are the words that are filtered out before or after the processing of natural language text data, based on the assumption that such words do not carry specific semantic meaning. However, there is no single definitive list of stopwords that all systems use, and the definition of stopwords is always domain specific.

Here is a popularly used list of stopwords: Smart system's stopword list. And we will use it as the major reference to create our own stopword list for our medical retrieval system.


An N-gram is a contiguous sequence of n items from a given sequence of text or speech. For example the bigram (2-gram) representation of the sentence "Information retrieval is helpful for everyone." would be ["information-retrieval", "retrieval-is", "is-helpful", "helpful-for", "for-everyone"].

To generate the N-grams, you scan through the list of split tokens and concatenate the consecutive tokens into N-grams.
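
The scan-and-concatenate procedure can be sketched in a few lines of Python. The function name `ngrams` and the "-" separator (taken from the bigram example above) are illustrative choices, not a required interface.

```python
def ngrams(tokens, n=2, sep="-"):
    """Join every run of n consecutive tokens into one N-gram string."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["information", "retrieval", "is", "helpful", "for", "everyone"]
bigrams = ngrams(tokens, n=2)
```

A token list shorter than n simply yields an empty list.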


In this part of the assignment, you will be asked to perform tokenization, stemming and N-gram construction on the collaboratively collected forum discussion data from part 1. All the crawled json files can be found on the CS lab servers in the following directory: /home/hw5x/HW/MP1/Part2. A json object reader has been implemented in the same Java project as the forum discussion wrappers. You can use it to load the json files for analysis. This implementation can be downloaded here. Sample usage of the OpenNLP tokenizer and the Snowball stemmer is given there.

WARNING: some students' crawled json files did not strictly follow the instructor's definition of json objects. As a result, properly handling the exceptions in json parsing is necessary.
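
For Python users, one way to handle malformed files is to catch the parsing exception per file and keep going. The sketch below assumes Python 3 and UTF-8 encoded files; the function name `load_threads` is hypothetical.

```python
import json
import os


def load_threads(directory):
    """Load every .json file under `directory`, skipping malformed ones.

    Returns (threads, skipped): the parsed objects keyed by file path,
    and the list of paths that could not be read or parsed.
    """
    threads, skipped = {}, []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.endswith(".json"):
                continue
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    threads[path] = json.load(handle)
            except (ValueError, OSError):
                # json.JSONDecodeError and UnicodeDecodeError are both
                # subclasses of ValueError.
                skipped.append(path)
    return threads, skipped
```

Returning the skipped paths makes it easy to report how many files were dropped from the analysis.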

There are several steps you need to take in this part of the assignment.

  1. Tokenization: tokenize the content and title (if the post has a title field) of each post into tokens. Java users please use OpenNLP, and Python users please use NLTK's treebank tokenizer for this step;
  2. Normalization: normalize the tokens from step 1 by removing individual punctuation marks (here is a list of English punctuation marks) and converting the tokens to lower case.
  3. Stemming: stem the tokens back to their root form. Both Java and Python users please use the Snowball stemmer.
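
Steps 2 and 3 can be sketched in Python as below. Two assumptions are made: `string.punctuation` (ASCII only) stands in for the linked list of punctuation marks, and the stemmer is passed in as a plain callable so that the sketch does not depend on a specific library. In practice, `stem` would be the `stem` method of the Snowball stemmer.

```python
import string

PUNCTUATION = set(string.punctuation)


def normalize(tokens):
    """Drop tokens made only of punctuation and lowercase the rest."""
    kept = []
    for token in tokens:
        if all(ch in PUNCTUATION for ch in token):
            continue  # a pure punctuation token such as ',' or '.'
        kept.append(token.lower())
    return kept


def analyze(tokens, stem=lambda token: token):
    """Normalize, then stem; `stem` is any callable mapping token -> stem."""
    return [stem(token) for token in normalize(tokens)]
```

Note that a token such as "'ve" survives normalization, because it is not made of punctuation alone.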

Based on these processing steps, all the documents are represented as bag-of-words, and we are ready to do some word counting.

First, let's validate Zipf's law with our medical forum discussion data. This can be achieved by the following steps.

  1. For each word, go over all the documents containing it and accumulate its counts in those documents (i.e., total term frequency);
  2. Order the words by their total term frequency in descending order;
  3. Create a dot plot by treating each word's order as x-axis and its frequency as y-axis. And please use log-log scale for this plot.

Hint: basically, you can maintain a look-up table in memory while you are scanning the documents, so that you only need to go through the corpus once to get the counts for all the tokens.
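
A minimal Python sketch of the single-pass counting described in the hint, followed by the rank-frequency list used for the plot. The least-squares fit in log-log space is one possible way to estimate the slope, chosen here for illustration; it is not prescribed by the assignment.

```python
import math
from collections import Counter


def total_term_frequency(documents):
    """One pass over the corpus; `documents` is an iterable of token lists."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


def rank_frequency(counts):
    """(rank, frequency) pairs, with rank 1 for the most frequent word."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

The same functions apply to bigrams if each document is first converted to its list of bigrams.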

From the resulting plot, do we find a strong linear relationship between the x and y axes in log-log scale? If so, what is the slope of this linear relationship? To answer these questions, you can dump the data into Excel and use the plot function there. (For Python users, matplotlib would be a good choice.)

Second, let's validate whether Zipf's law still holds for N-grams (personally I haven't heard of any study about this before). In this assignment, we will study bigrams. Repeat the same steps above, but this time count only the bigrams.

From the resulting plot, do we still see a linear relationship between the x and y axes in log-log scale? If so, what is the slope in this case? Compared to the plot for single tokens (i.e., unigrams), which fits Zipf's law better? And can you explain why?

Third, according to the word frequency you just got for the unigrams, can you manually compile a more appropriate stopword list for our medical forum data? You should compare the standard stopwords provided above against the high frequency words in our corpus, and decide which words should be added into the stopword list and which words should not be considered as stopwords. A suggested routine is,

  1. Select the top k most frequent unigrams from our corpus, where k is the number of stopwords listed in Smart system's stopword list, and order them in descending order of frequency;
  2. Get the frequency of the stopwords listed in Smart system's stopword list, and order them in descending order of frequency;
  3. Compare these two lists and give your suggestions of which words should be added to and which words should be removed from the standard stopword list.
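
The suggested routine can be sketched in Python as below. The function name `compare_stopwords` is hypothetical, and `counts` is assumed to be a `collections.Counter` of unigram frequencies. The output is only a pair of candidate lists; the final decision about each word is still a manual one.

```python
from collections import Counter


def compare_stopwords(counts, standard_stopwords):
    """Compare the corpus's top-k words with a standard stopword list.

    Returns two lists of (word, frequency) pairs, both in descending
    order of corpus frequency:
      - frequent corpus words missing from the standard list
        (candidates to add), and
      - standard stopwords that are not among the top-k corpus words
        (candidates to remove).
    """
    standard = set(standard_stopwords)
    top_k = counts.most_common(len(standard))            # step 1
    top_words = {word for word, _ in top_k}
    to_add = [(w, f) for w, f in top_k if w not in standard]
    standard_freq = sorted(((w, counts[w]) for w in standard),
                           key=lambda pair: (-pair[1], pair[0]))  # step 2
    to_remove = [(w, f) for w, f in standard_freq if w not in top_words]
    return to_add, to_remove                             # step 3
```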

If you have any better ideas for compiling this specialized stopword list, you are more than welcome to try them and report the results.

Submission guideline

For this part of the assignment, you are required to submit a written report, which should include two plots visualizing Zipf's law with unigrams and bigrams based on our medical forum data, your brief answers to the questions listed in the Requirement section, and your suggested modification of the stopword list for our corpus.

Please send the written report in PDF to the instructor before the deadline.

Deadline for part 2

The deadline for part 2 is 11:59pm, Friday, September 19th.