Review List for CS410 Midterm



Note: The midterm exam will be a closed book exam and last for 1 hour 15 minutes. You may consider bringing a calculator as there will be some calculations, though they are all simple enough so that doing them manually won't necessarily take much more time than using a calculator. You are not allowed to bring any paper or written material. (You will be able to use the back side of the exam booklet as scratch paper. )

General advice

The best way to prepare for the midterm is to (1) go over all the lecture slides and make sure that you can follow them (use the associated notes to help you follow the slides); (2) read all the assigned readings (use the directions given in the Readings page as guidance); (3) pay special attention to the topics and important slides listed below; (4) ask questions if you have trouble with following the materials (I strongly encourage you to use the newsgroup to discuss the materials).

Part I: Background

In this part of the course, you are expected to
  1. have a good understanding of some of the very basic concepts in probability, statistics, and information theory. In particular, you should know the basic rules for conditional probabilities, especially the Bayes rule.
  2. know the basic idea of maximum likelihood estimation and why it's problematic when the data sample is small.
  3. know how to compute the maximum likelihood estimate of a multinomial distribution (i.e., relative frequency).
  4. know how to compute entropy, cross entropy, mutual information, and KL-divergence, and know their relations. You need to remember the formulas of entropy, KL-divergence, and mutual information.

Important slides: slides 3-7, slides 11-12 in the prob & stat lecture; slides 5-6, 8-11 in the information theory lecture.

Part II: Understanding text -- Natural Language Processing

In this part of the course, you are expected to
  1. have a general picture of what we can do and what we can't do with today's NLP techniques.
  2. know what is POS tagging, what is parsing, and what is syntactic/structural ambiguity.
  3. know what is a statistical language model, what is a unigram/bigram language model
  4. know why smoothing is necessary when estimating a language model and know the formulas for Laplace smoothing, Dirichlet prior smoothing, and linear interpolation smoothing and their similarities and differences.

Important slides: Slides 4, 7, 10-12, 14, 18 in the Intro to NLP lecture; slides 1-2, 9-14, 17-19.

Part III: Accessing Text -- Text Retrieval and Filtering

In this part of the course, you are expected to
  1. know how text retrieval is different from database retrieval
  2. know the distinction of long term vs. short-term information need and how they can be satisfied in different ways (short-term = ad hoc retrieval; long-term= filtering).
  3. know why ranking (without an explicit cutoff) is often preferred to selecting a subset of documents for the user.
  4. know how to compute the basic retrieval evaluation measures (ie, precision, recall, and mean average precision, NDCG). know the relative advantages and disadvantages of these measures (e.g., why isn't precision@10documents as good as average precision for comparing different ranking results? what's the advantage of NDCG over Mean Average Precision?)
  5. know what is stemming, what is a stop word, and the Zipf's law.
  6. know what is relevance feedback and what is pseudo feedback and how they are different.
Important slides in the "Text Retrieval Overview" Lecture: Slides 4, 8, 10-14, 19-14, 31-33.
  1. know the basic idea of the vector space model.
  2. know the major term weighting heuristics (i.e., TF, IDF, and document length normalization).
  3. know the idea and formula for Rocchio feedback.
Important slides of the "Vector space model" lecture: slides 4-21.
  1. know what is an inverted index and how to build a large inverted index with only a limited amount of memory. know how to score documents quickly using an inverted index (i.e., how to use scoring accumulators for scoring).
  2. know the basic compression methods for integers (i.e., unary, gamma, delta, and gap).
Important slides of the "Implementation issues & IR systems" lecture: slides 5-15.
  1. know the general retrieval formula of the query-likelihood retrieval method when the document language model is smoothed with a collection language model, and know why smoothing with a collection language model leads to a retrieval formula that is similar to a traditional TF-IDF retrieval formula with length normalization.
  2. know why we need to do two-stage smoothing and the two different roles of smoothing
Important slides of the "Language models for text retrieval" lecture: slides 6-7, 9-12, 15.
  1. know the KL-divergence retrieval formula and why it covers the query likelihood method as a special case.
  2. know how to use a simple two component mixture model for feedback.
  3. know the general idea of EM and how to use EM to estimate the component model in the simple two-component mixture model for feedback.
Important slides of the "Feedback in language models" lecture: slides 4-5, 8-13.
  1. know what is a Web crawler. know the names "Google File System", "Big Table", "MapReduce", and "Hadoop". know what is "anchor text" and why it is especially useful for web search.
  2. know how PageRank works and why it is helpful for Web search. know the basic idea of HITS.
Important slides of the "Web search engines" lecture: slides 2-3, 5, 10-15, 18, 20-21.
  1. know the difference between content-based filtering (i.e., adaptive information filtering) and collaborative filtering. know how memory-based collaborative filtering approach works.
Important slides of the "Information Filtering" lecture: slides 3, 5, 6, 9, 19, 26-31, 33-34.