Comparative Text Mining



Text mining is concerned with extracting knowledge and patterns from text. Most existing research in text mining focuses on a single collection of text. The goals are often to extract basic semantic units such as named entities, to extract relations between information units, or to extract topic themes. In this project, we study a novel text mining problem referred to as comparative text mining. Given a set of comparable text collections, the task of comparative text mining is to discover the latent common themes across all collections and to summarize the similarities and differences of these collections along each common theme.

Depending on the collections to compare, comparative text mining covers many interesting text mining problems as special cases, such as spatiotemporal text mining, cross-language text mining, novelty/trend detection, and opinion mining. It has many applications such as opinion summarization, business intelligence, text federation, and customer relationship management.

One general methodology is to use mixture language models to model themes or subtopics in text. By designing these mixture models appropriately, we can extract latent themes in an unsupervised way by fitting a mixture model to the text. So far we have developed models for generating comparative summaries, discovering evolutionary theme patterns, and spatiotemporal analysis of Weblog data. We are currently applying these techniques to multiple domains to develop applications such as a Customer Service Support System, an Opinion Tracker, and an Analyst's Portal.
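
As a rough illustration of fitting a mixture language model to text without supervision, the sketch below fits a plain PLSA-style mixture of unigram themes with the EM algorithm. It is not the comparative model developed in this project, which additionally separates common themes from collection-specific ones. The function name `fit_mixture`, its parameters, and the toy documents are hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def fit_mixture(docs, num_themes, iterations=50, seed=0):
    """Fit a PLSA-style mixture of unigram themes with EM (illustrative only).

    Returns (themes, mixing): themes[k] maps word -> p(word | theme k),
    mixing[d] maps theme index k -> p(theme k | document d).
    """
    rng = random.Random(seed)
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted({word for c in counts for word in c})

    def random_dist(keys):
        # Random strictly positive weights, normalized to sum to 1.
        weights = [rng.random() + 0.1 for _ in keys]
        total = sum(weights)
        return {key: w / total for key, w in zip(keys, weights)}

    themes = [random_dist(vocab) for _ in range(num_themes)]
    mixing = [random_dist(range(num_themes)) for _ in counts]

    for _ in range(iterations):
        theme_acc = [dict.fromkeys(vocab, 0.0) for _ in range(num_themes)]
        mix_acc = [[0.0] * num_themes for _ in counts]

        # E-step: split each word count across themes in proportion to
        # p(theme | doc) * p(word | theme).
        for d, doc_counts in enumerate(counts):
            for word, n in doc_counts.items():
                joint = [mixing[d][k] * themes[k][word] for k in range(num_themes)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for k in range(num_themes):
                    expected = n * joint[k] / total
                    theme_acc[k][word] += expected
                    mix_acc[d][k] += expected

        # M-step: renormalize the expected counts into distributions.
        for k in range(num_themes):
            total = sum(theme_acc[k].values())
            if total > 0.0:
                themes[k] = {w: v / total for w, v in theme_acc[k].items()}
        for d in range(len(counts)):
            total = sum(mix_acc[d])
            if total > 0.0:
                mixing[d] = {k: v / total for k, v in enumerate(mix_acc[d])}

    return themes, mixing


if __name__ == "__main__":
    # Toy documents, invented for this example.
    toy_docs = [
        "battery life battery screen",
        "screen display screen battery",
        "price cheap price shipping",
        "shipping price cheap cheap",
    ]
    themes, mixing = fit_mixture(toy_docs, num_themes=2)
    for k, theme in enumerate(themes):
        top = sorted(theme, key=theme.get, reverse=True)[:3]
        print("theme", k, "top words:", top)
```

Each fitted theme is a word distribution, and the highest-probability words serve as a summary of that theme. EM only finds a local optimum, so different seeds can produce different themes.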