The mass collaboration annotation experiment is over. Thank you all for your participation!
Recent years have seen explosive growth in the volume of online textual information (e.g., web pages, email, news articles, office documents, and scientific literature). Managing such textual information effectively and making it useful to people is a significant challenge. For example, how do we search for interesting information in large collections of text? How do we automatically categorize a document according to a hierarchy of subject categories? Can we automatically discover the major complaints about a product from a large set of customer email messages sent to a company? These are only a few of the many interesting and challenging questions that we can ask.
Unlike structured information, which is typically managed with a relational database, textual information is unstructured and poses special challenges because it is difficult to understand natural language and users' information needs precisely. In this course, we will introduce a variety of techniques for managing textual information, including algorithms for retrieval, filtering, clustering, and categorization. The course emphasizes basic principles and practically useful algorithms. Topics to be covered include text analysis, retrieval models (e.g., Boolean, vector space, probabilistic), text categorization, text filtering, clustering, retrieval system design and implementation, and applications to web information management.
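To give a flavor of the retrieval models mentioned above, here is a minimal sketch of vector space ranking in Python. It is an illustration only, not course material: the whitespace tokenizer, the particular TF-IDF weighting (raw term count times log(N/df) + 1), and the function names are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive assumption: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vector(tokens, idf):
    # Weight each term by raw count times its IDF; terms unseen in the
    # collection are dropped because they have no IDF value.
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    # Compute document frequencies, then one sparse TF-IDF vector per document.
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Return document indices ordered by decreasing similarity to the query;
    # ties are broken by document position.
    idf, vectors = build_index(docs)
    query_vector = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(query_vector, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Calling `rank("text retrieval", docs)` on a small list of strings returns the indices of the documents with the best matches first. A Boolean model would instead test only whether the query terms are present, and a probabilistic model would replace the cosine score with an estimate of the probability of relevance.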
The course is lecture-based and has a midterm and a final examination. There will also be regular assignments, which often involve implementing an algorithm and/or experimenting with real text data.