As the amount of online textual information (e.g., web pages, email, news articles, office documents, and scientific literature) grows explosively, it is increasingly important to develop tools that help us manage and exploit this information. Web search engines, such as Google, Yahoo!, and MSN, are good examples of such tools, and they are now an essential part of everyone's life. In this course, you will learn the underlying technologies of these and other powerful tools for managing text information. You will learn the basic principles and algorithms for managing text information and gain hands-on experience using existing information retrieval toolkits to set up your own search engines and improve their search accuracy.
Unlike structured data, which is typically managed with a relational database, textual information is unstructured and poses special challenges because it is difficult to understand natural language and users' information needs precisely. In this course, we will introduce a variety of techniques for accessing and mining text information. The course emphasizes basic principles and practically useful algorithms. Topics to be covered include, among others, text analysis, retrieval models (e.g., vector space and probabilistic models), text categorization, text filtering, clustering, retrieval system design and implementation, and applications to web information management.
The course is lecture-based. Grading is based on regular assignments, a late midterm examination, several two-minute in-class quizzes, and a course project, which is optional except for students who registered for the course for 4 credit hours, for whom it is required. For more information about the course policy, please see "Basic Information" of the course.