Review List for CS410 Midterm
Note: The midterm exam will be a closed book exam and last for 1 hour 15 minutes (2pm-3:15pm).
Calculator is not allowed. You
are not allowed to bring any paper or written material either. (Scratch paper will be provided to you. )
The exam will start promptly at 2pm, so please arrive before 2pm.
General advice
The best way to prepare for the midterm is to (1) go over all the lecture slides and make sure that you can follow
them (use the associated notes to help you follow the slides); (2) read all the assigned readings (use the directions given in the Readings page as guidance); (3) pay special attention to the topics and important slides listed below; (4) ask questions if you have trouble with following the materials (I strongly encourage you to use the newsgroup to discuss the materials).
Part I: Background
In this part of the course, you are expected to
- have a good understanding of some of the very basic concepts in probability, statistics, and information theory.
In particular, you should know the basic rules for conditional probabilities, especially
the Bayes rule.
- know the basic idea of maximum likelihood estimation and why it's problematic when the data sample is small.
- know how to compute the maximum likelihood estimate of a multinomial distribution (i.e., relative frequency).
- know how to compute entropy, cross entropy, mutual information, and KL-divergence, and know their
relations. You need to remember the formulas of entropy, KL-divergence, and mutual information.
Important slides: slides 3-7, slides 11-12 in the prob & stat lecture; slides 5-6, 8-12 in the information theory lecture.
Part II: Understanding text -- Natural Language Processing
In this part of the course, you are expected to
- have a general picture of what we can do and what we can't do with today's NLP techniques.
- know what is POS tagging, what is parsing, and what is syntactic/structural ambiguity.
- know what is a statistical language model, what is a unigram/bigram language model
- know Zipf's law.
- know why smoothing is necessary when estimating a language model and know the formulas for Laplace smoothing,
Dirichlet prior smoothing, and linear interpolation smoothing and their similarities and differences.
Important slides: Slides 4, 7, 10-12, 14, 18 in the Intro to NLP lecture;
slides 2-3, 8-11, 13, 16-19 in the Statistical Language Models lecture.
Part III: Accessing Text -- Text Retrieval and Filtering
In this part of the course, you are expected to
- know how text retrieval is different from database retrieval
- know the distinction of long term vs. short-term information need and how
they can be satisfied in different ways (short-term = ad hoc retrieval; long-term= filtering).
- know why ranking (without an explicit cutoff) is often preferred to selecting
a subset of documents for the user.
- know how to compute the basic retrieval evaluation measures (e.g., precision, recall, and
mean average precision, NDCG, MRR, F1). know the relative advantages and disadvantages of these measures
(e.g., why isn't precision@10documents as good as average precision for comparing different ranking results?
what's the advantage of NDCG over Mean Average Precision?)
- know what is stemming, what is a stop word.
- know what is relevance feedback and what is pseudo feedback and how they are different.
Important slides in the "Text Retrieval Overview" Lecture: Slides 4, 8, 10-14, 19-27, 32-34.
- know the basic idea of the vector space model.
- know the major term weighting heuristics (i.e., TF, IDF, and document length normalization).
- know the idea and formula for Rocchio feedback.
Important slides of the "Vector space model" lecture: slides 4-21.
- know what is an inverted index and how to build a large inverted index
with only a limited amount of memory. know how to score documents quickly using an inverted index (i.e.,
how to use scoring accumulators for scoring).
- know the basic compression methods for integers (i.e., unary, gamma, delta, and gap).
Important slides of the "Implementation issues & IR systems" lecture: slides 5-15.
- know the general retrieval formula of the query-likelihood retrieval method when the document
language model is smoothed with a collection language model, and know why smoothing with a collection
language model leads to a retrieval formula that is
similar to a traditional TF-IDF retrieval formula with length normalization.
- know why we need to do two-stage smoothing and the two different roles of smoothing
Important slides of the "Language models for text retrieval" lecture: slides 6-12, 15.
- know the KL-divergence retrieval formula and why it covers the query likelihood method as a special case.
- know how to use a simple two component mixture model for feedback.
- know the general idea of EM and how to use EM to estimate the component model in the simple two-component mixture model for feedback.
Important slides of the "Feedback in language models" lecture: slides 4-5, 8-14.
- know the basic idea of the axiomatic approach.
- know how the common retrieval heuristics (TF-IDF weighting, document length normalizatio) can be formalized as constraints on retrieval functions.
Important slides of the "Axiomatic retrieval models" lecture: slides 8-10, 12-16, 18, 20.
- know what is a Web crawler. know the names "Google File System", "Big Table", "MapReduce", and "Hadoop".
know what is "anchor text" and why it is especially useful for web search.
- know how PageRank works and why it is helpful for Web search. know the basic idea of HITS.
- know the basic idea of learning a retrieval function involving combinations of multiple features (particularly logistic regression).
Important slides of the "Web search engines" lecture: slides 2-3, 5, 10-15, 19, 21-25, 27, 30-33.
- know the difference between content-based filtering (i.e., adaptive information filtering) and collaborative filtering. know how memory-based collaborative filtering approach works.
Important slides of the "Information Filtering" lecture: slides 3, 5, 6, 9, 19, 26-31, 33-34.
Part IV: Organizing and Mining Text Information : Categorization, Clustering, and Extraction
In this part of the course, you are expected to
- know how basic retrieval techniques can be used to perform text categorization (prototype-based classifier and k-nearest neighbor).
- know how basic retrieval techniques can be used to perform clustering of documents/terms (single-link, complete-link, and average-link).
- know how basic retrieval techniques can be used to summarize a document.
Important slides of the "Applications of Retrieval Techniques" lecture: slides 2-3, 5-9, 12-26, 21-22.
- know how text mining differs from and supplements search.
- know how probabilistic latent semantic analysis (PLSA) generalizes the two-component mixture model for feedback.
- know how to write down the log-likelihood function of PLSA as specified on slide 13 of the lecture and know the meanings of all the parameters.
- know how the E-step and M-step formulas computationally update parameters (i.e., in E-step, the old parameter values are used in a formula to predict the latent/hidden topic label for each word in each document based on Bayes rule, while in M-step, the predicted topic labels are used to split the word counts to re-estimate all the parameters). you don't need to remember the formulas for the EM algorithm, but if you are shown these formulas, you should know how to interpret them.
- know how to obtain term clusters and document clusters from the estimated parameter values of PLSA.
Important slides of the "Probabilistic Topic Models" lecture: slides 2, 10-21.