CS410 Assignment #1: Initial Project Ideas and Statistical NLP
(due Tuesday, Feb. 5, 2013)

  1. >Initial Project Ideas [30 points]

    Innovations are often motivated by a disatisfaction with our current living environment and a vision about how to address this disatisfaction and change our world for the better. For example, the visionary article "As We May Think" written by Vannevar Bush in 1945 stimulated research in information retrieval in early days. The Memex machine vision laid out in this article was based on an analysis of the world then, which led to the recognition of many inadequancies of the conventional library systems in helping a researcher to collect, select, and digest all the accumulated knowledge in the world. The Memex machine was then "designed" to remove these inadequancies.

    Do a similar analysis of the world today in terms of information services provided to people by a search engine like Google, identify some major limitations, and write a short essay (with 500~1000 words) to analyze the limitations and discuss possible ways to break these limitations and thus improve a current search engine. While there are many ways to improve the current search engines, you are also encouraged to envision a new system very different from the current search engines with ideas that go beyond search to help people access information in other ways or better support users in decision making.

    A good way to think about a project topic is to note that an information system is usually determined by three elements: (1) users; (2)data/information; (3) task support (i.e., functions). For example, in the case of a general web search engine, (1) users=everyone; (2) data = the whole Web; (3) function = keyword search. If we vary these three variables, we will get interesting variations of systems, where often both new challenges and new opportunities for improving the system service quality occur. For example, we may assume users=children, then we have something like "Google Kids" where we should filter out pages that are not appropriate for kids to read or difficult pages that are not understandable (readable) by kids. If we assume that (1) users=everyone; (2) data= text descriptions of all computer games (e.g., constructed by crawling the web); (3) task=search, we'll have a "computer game search engine". Also, if we assume that function=learning, we would have a system that helps users learn about a subject from the Web. It is not immediately clear how we should design such a Web-based tutoring system, but one can clearly imagine many ways to improve over Google for this purpose. For example, allowing a user to see definitions/explanations of any term would be very useful. Also, tutorials or ppt slides may be more useful than regular home pages. In general, Web search can be improved by considering specialized users, specialized data/information, and more advanced functions for information access. A good way to brainstorm topics is to start with the current Web search engines and identify where users are not satisfied. If you can come up with a "wish list" of features for Web search, then you can easily convert them into interesting project topics. Another way to think about it is to consider the needs of yourself and people around you. Can we do a better job to improve the way we access, manage, and utilize information? We all need to write research papers/reports, can we develop some tool to help us specifically in writing a paper? For example, the tool may automatically fetch related work from the Web or a digital library. How to improve our own UIUC library system is another direction to think about. In terms of functions, in general, we can think about how to move beyond search toward information access. Search is just one way to access information. What about information recommendation? What about allowing a user to find information flexibly in multiple ways (browsing+search+recommendation)? How can we better present search results? We can also further think about how to move beyond information access to task support or decision-making support. For example, many users use a Web search engine to find information about products to help optimizing their purchasing decisions. How can we best help such users more directly to finish their tasks? Systems like Google Products and Orbitz have been designed for this purpose. Can we do better? Many people want to compare products, can we pull out reviews about similar products and generate a "comparative summary" to help people compare these products? Information integration is another major line of directions. We currently have two extremes - either search the entire Web or search a specific website. Can we provide some federated search portal that would allow us to search all the computer science department websites? How do we integrate the search results and present them in a useful way to users in such a case? For example, can we generate a comprehensive summary of all the Ph.D. programs in Computer Science?

    If you cannot seem to come up with ideas, you can try to identify a specific (difficult) query that doesn't seem to work well on Google. You can then try to analyze the reason why the query doesn't work well, and discuss how you can possibly improve the search engine based on your analysis.

    This exercise is intended to lead you to start thinking about what you can do for your course projects. Your real course project doesn't have to follow the idea(s) that you discuss in this essay, though; as time goes and you learn more about the topic, you might have other ideas for your project or you might want to work with others on a joint project.

    Please submit your essay via the course wiki. First go to Your Student Space; then click on the "personal space" link under your name to get to your personal wiki page; and finally, click on the "Edit" button at the top-right corner of that page to post your essay.

    Your essay will be graded as either fail or pass based on the length of the essay. Your essay should contain at least 500 words to get full marks.

  2. Getting familiar with text [40 points]

    Here are two current news articles downloaded from the Web: article1 and article2. Download these two articles and save them on your local disk, and perform some simple statistical analysis of the text without reading the contents of the articles.

    Some Unix commands can be very useful for manipulations of text. Here is an example of how you can use commands such as "awk", "uniq", and "sort" to produce word counts. Suppose the text you want to count is named "mysource". You can do the following to generate counts of each word in "mysource":

    1. Do "tr '[A-Z]' '[a-z]' < mysource > mysource.lowercase". This will convert all capital letters to lower cases.
    2. Do "tr -dc '[:alpha:][0-9] ' < mysource.lowercase | awk '{for (i=1;i<=NF;i++) print $i;}' >mysource.onewordperline". This will split the words on a line so that each line has only one word.
    3. Do "sort mysource.onewordperline | uniq -c > mysource.count". This will first sort all the words and then count the number of occurrences of each word.
    4. Do "sort -rn -k1 mysource.count > mysource.countsorted". This will sort the words in descending order of counts so that you can easily see the high frequency words.

    These steps can also be combined together as one single command:

    "tr '[A-Z]' '[a-z]' < mysource |tr -dc '[:alpha:][0-9] ' | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c |sort -rn -k1 > mysource.countsorted".

    A. [10 points] Without reading the articles, do a simple word count for each article with the commands given above. Visually examine the word distribution for each article. Does the word distribution give you some idea about what each news article might be about?

    B. [10 points] Compare the word distributions of these two articles. How are they similar to each other and how are they different?

    C. [10 points] Do part of speech tagging and partial parsing on each article. A POS tagger called LBJ Part of Speech Tagger can be downloaded here. A chunker called LBJ Chunker can be downloaded here. The LBJ library can be downloaded here. You can put them into your directory on CSIL(EWS) Linux machines (remlnx.ews.illinois.edu). You can also view the documentation at this page.

    First do POS tagging with the following command. This will assign a POS tag to each word; see the explanation of Penn Treebank TAGS.

    "java -classpath /Your_directory/LBJPOS.jar:/Your_directory/LBJ2Library.jar edu.illinois.cs.cogcomp.lbj.pos.POSTagPlain mysource.lowercase > mysource.tagged"

    Then do chunking with the following command, which will break sentences to phrases which are shown in brackets.

    "java -classpath /Your_directory/LBJPOS.jar:/Your_directory/LBJChunk.jar:/Your_directory/LBJ2Library.jar LBJ2.nlp.seg.SegmentTagPlain edu.illinois.cs.cogcomp.lbj.chunk.Chunker mysource.lowercase > mysource.chunked"

    Examine the results. Did LBJ POS tagger and chunker make any mistakes in tagging and grouping words? If so, give a few examples.

    D. [10 points] Use the output of the chunker to generate a list of noun phrases (i.e. phrases tagged as "NP") for each article, and use the "sort" and "uniq" commands to generate counts of noun phrases for each article. Sort the counts so that the phrases with high frequency would be on the top. Are these phrase counts more informative than the single word counts in suggesting what each article might be about? Now, read the two articles. Are their contents similar to what you were expecting?

  3. Unigram Language Models [30 points]

    A unigram Language Model is a probabilistic distribution over words. We will use the CACM test collection to do some real computation of Language Models. You can download the data from here. The CACM collection is a collection of titles and abstracts of computer science papers from the journal CACM. There are about 3,000 documents in the collection. The data set has been processed into lines. Each line contains one document, and the terms of each document are separated by blank space.

    A. Collection Background Language Model [9 points] Use all the documents in the dataset to compute the collection (background) unigram Language Model, pc(w). Rank all words according to their probabilities in descending order and list the top 10 words with their probabilities.

    B. Topic Language Model [10 points] Compute a unigram language model for documents containing word "system" and word "data", respectively. That is, compute p(w|"system") and p(w|"data"). For both models, rank the words according to their probabilities in descending order and list the top 10 words with their probabilities.

    C. Normalization of Topic Language Models [5 points] Normalize the topic language models you have computed in Question B by the collection background language model that you computed in Question A. Let's use pc(w) to denote the probability of word w in the collection LM, and p(w|w0) to denote the probability of word w in the topic LM for word w0, where w0 is either "data" or "system". Now compute the normalized topic LM for w0 as p(w|w0)/pc(w) for each word w. For both "data" and "system", rank all words according to their normalized topic language model scores in descending order and list the top 10 words with their scores. Compare them with the top words in question B. How does normalization change the ranking order of words?

    D. Smoothing of Background Language Model [6 points] When we estimate a unigram language model using the Maximum Likelihood (ML) Estimator, the probability of a word is the relative frequency of the word in the text data. Such an estimator would assign zero probability to any unseen word, which is often unreasonable. Imagine if we had used a much larger computer science literature collection to estimate the background language model, some of the unseen words would almost for sure be observed. One way to solve the problem is to "smooth" a Maximum Likelihood estimator by pretending that we have observed an extra pseudo count of every word, including unseen words. Thus the formula for computing a smoothed background language model would be

             c(w,B) + 1
    p(w|B) = -----------
             T + V 
    
    where c(w,B) is the count of word w in the background collection B (i.e., our CACM collection), T is the total count of all the words in collection B, and V is the size of the complete vocabulary set. Note that the variable V in the denominator is the total pseudo counts we have added to all the words in the vocabularly. With such a formula, an unseen word would not have a zero probability, and the estimated probability is, in general, more accurate. Compute this smoothed background language model by setting V to 20,000. List the top 10 words of this smoothed background language model. Are they very different from the top 10 words in the unsmoothed background language model? Next, repeat the normalization that you have done for Question C by using this smoothed background language model p(w|B) to replace the unsmoothed one, pc(w). List the top 10 words based on the normalized topic language model for both "data" and "system". Are the top 10 words very different from those that you have obtained from Question C? Check the top 10 words for "data". Which set of words seem to be more semantically related to the word "data"?

What and How to submit

A. Please submit your essay for question 1 to Your Student Space by composing the wiki page under your name.

B. Please pack all the following files into one single zip file or tar file and upload the package to Compass. All homework are due 11:59PM of the due date.

  1. Your answers to question 2 and question 3 in pdf file;
  2. All your code and one noun phrase count file for problem 2(D). This is to show that you did actually try those commands to generate the counts.
  3. All your code for problem 3. No documentation is necessary; this is just to show that you did actually compute the values.