Qiaozhu Mei
-----------------------------------------------------------------------

Web: Personalized Conceptual Search Engine
Email: Dynamic Topic Clustering
Literature: Evolutive Text Mining

Brief description:

Web: Personalized Conceptual Search Engine:
Consider a scenario in which a user wants to know "the state of the art of NLP". An ideal personalized system should first figure out that, for this user, NLP means Natural Language Processing, and then figure out that NLP here is a concept, which refers to POS taggers, parsers, etc. These steps are hard for a general search engine but doable for personalized search. Interestingly, a person's name can be a very good example of a concept.

Email: Dynamic Topic Clustering:
In the real world, a very significant need is to retrieve emails on a certain topic or about a certain issue. For example, one may want to trace back last year's email discussion of "set up a new workshop" with some potential collaborators. These emails may be spread across time, senders, and titles, which makes them difficult to find, especially when the user wants to find all of them (high recall). If we can organize emails by topic/event automatically, it will help significantly. This could be framed as a problem of "Dynamic Topic Clustering". One key issue, which distinguishes this problem from common text clustering (LSI, CTM, etc.), is that we have to maintain the clusters dynamically. When a new email arrives, we need to merge it into an existing group or, under some circumstances, generate a new group.

Literature: Evolutive Text Mining:
This is somewhat related to the topic switching model I discussed with you before. However, it is a bit different and has an even larger impact. Both can be viewed as components of temporal text mining.

If we can model the evolution of concepts/problems/technologies in one field, we can understand the evolution of this field well, and sometimes even predict how the field will change.

For an even more ambitious scenario, suppose A, B, C, ... are techniques in field 1, and A', B', C' are their analogous techniques in field 2. Suppose we discover two evolutive paths in fields 1 and 2:

Field 1: A -> B -> (+D) -> C -> (+E) -> F
Field 2: A' -> B' -> (+D') -> C'

C and C' share a similar evolutive process in fields 1 and 2. Does this indicate that introducing a technique E' (analogous to E in field 1) might bring about the next development of C' in field 2?

Details (including answers to homework questions and possible solutions for the problems):

Web Topic:

Personalized Conceptual Search Engine:
Task 1: An identical query may have different latent meanings. In real-world searching, people usually have their own preference for a certain aspect of a query. For example, "apple" may mean "computer" or "fruit". "Java" may mean "island", "coffee", or "programming language". In general search, it is hard to tell which aspect of a concept a user wants, but in personalized search, one user is likely to prefer one aspect.

Task 2: Sometimes it is hard to formulate a good query for a given need. For example, a user wants to know "what did the American government say about ...". Most articles may mention "Bush said ...", "Bill Clinton said ...", "Powell said ...". A query like "American government ..." may not get satisfactory results. This is because "American government" is a concept, which stands for a group of terms. Again, in general search, modeling a concept is hard, because each concept may have different meanings. In personalized search, a user may have stable components for each concept.

A better scenario: when a user wants to know "the state of the art of NLP", an ideal personalized system should first figure out that NLP means Natural Language Processing for this user, and then figure out that it refers to POS taggers, parsers, etc. These steps are hard for a general search engine but doable for personalized search.
Interestingly, a person's name can be a very good example of a concept.

The training data could be any kind of text with personalized properties (it need not be restricted to query history). For example, word-usage statistics from the user's articles, chat records, and other collections can be very useful. All of this can be done on the client side, which avoids the privacy problem.

User: common users of search engines.
Data: query history, personal collections of texts, articles, and chat records.
Functions: concept clustering and summarization from texts; personal preference learning; query modification by concept selection and splitting.
Challenge: How to cluster terms into concepts from personalized texts. How to represent a concept. How to do query expansion with concept information.

Email Topic:

Dynamic Topic Clustering:
People always complain about the volume of email. It is easy to lose track of some topics among a large number of emails. Managing email is a hard task. Current email clients let users manage emails by their explicit attributes, such as time, sender, title, etc. People can also manually group emails that share some properties. In the real world, a very significant need is to retrieve emails on a certain topic or about a certain issue. For example, one may want to trace back last year's email discussion of the idea "set up a new workshop" with some potential collaborators. It is hard, however, to retrieve those emails by time, sender, title (they may vary a lot), or even keywords (keywords are sometimes difficult to define, and one may want to get ALL of the emails instead of only the most relevant ones). If we can automatically group and manage emails that share a topic, or that implicitly discuss the same issue, we can easily find the emails about a specific event, and yes, all of them.

This could be framed as a problem of "Dynamic Topic Clustering".
One key issue, which distinguishes this problem from common text clustering (LSI, CTM, etc.), is that we have to maintain the clusters dynamically. When a new email arrives, we need to merge it into an existing group (modifying the structure of the existing groups) or, under some circumstances, generate a new group.

User: all email users, especially those handling large volumes of email (professors, customer service, etc.).
Data: emails, specifically the content of emails.
Functions: incremental grouping, browsing by topic group, retrieval by topic.
Challenges: already discussed above.

Literature Topic:

Evolutive Text Mining:
In literature collections, hundreds of papers appear in each area every year. Concepts, problems, and technologies not only evolve over time within each field but are also involved in interdisciplinary interactions. Taking concepts as an example, as time goes by, some concepts die out, some emerge, some are borrowed from other fields, some merge together, and some split. Some concepts in different fields (collections, communities) may have different names but share analogous content and similar evolution paths. If we can model the evolution of concepts/problems/technologies in one field, we can understand the evolution of this field well, and sometimes even predict how the field will change.

For an even more ambitious scenario, suppose A, B, C, ... are techniques in field 1, and A', B', C' are their analogous techniques in field 2. Suppose we discover two evolutive paths in fields 1 and 2:

Field 1: A -> B -> (+D) -> C -> (+E) -> F
Field 2: A' -> B' -> (+D') -> C'

C and C' share a similar evolutive process in fields 1 and 2. Does this indicate that introducing a technique E' (analogous to E in field 1) might bring about the next development of C' in field 2? This would be very useful for scientific researchers.
Using Comparative Text Mining (CTM), we are able to find analogous concepts in different fields, and if we can model the evolution of concepts well, this task becomes possible.

User: scientists, researchers.
Data: scientific literature, for example, Honeybee data and Flybase data.
Functions: finding analogous concepts across collections; modeling the evolutive paths in each collection; comparing the paths in different collections and making predictions from them.
Challenge: How to find a good model of concept evolution. How to use CTM to define analogous concepts.
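
To make the path comparison in the ambitious scenario concrete, here is a minimal sketch in Python. It assumes the evolutive paths and the analogy mapping between the two fields have already been discovered somehow; finding them is the open challenge listed above. The step encoding (a leading "+" marks an added technique) and the names `translate` and `suggest_next_step` are illustrative assumptions, not part of any existing system.

```python
from typing import Dict, List, Optional


def translate(step: str, analog: Dict[str, str]) -> Optional[str]:
    """Map a field-1 step to its field-2 analogue.

    A leading '+' (an added technique, as in '(+D)') is preserved.
    Returns None when no analogue is known.
    """
    prefix = "+" if step.startswith("+") else ""
    target = analog.get(step[len(prefix):])
    return None if target is None else prefix + target


def suggest_next_step(
    path1: List[str], path2: List[str], analog: Dict[str, str]
) -> Optional[str]:
    """If path2 mirrors a proper prefix of path1, return the analogue
    of path1's next step as a candidate next step for field 2."""
    if len(path2) >= len(path1):
        return None  # field 2 is not behind field 1, nothing to suggest
    for s1, s2 in zip(path1, path2):
        if translate(s1, analog) != s2:
            return None  # the two paths diverge
    return translate(path1[len(path2)], analog)


# The example paths from the text (hypothetical data).
field1 = ["A", "B", "+D", "C", "+E", "F"]
field2 = ["A'", "B'", "+D'", "C'"]
analog = {"A": "A'", "B": "B'", "C": "C'", "D": "D'", "E": "E'"}

print(suggest_next_step(field1, field2, analog))  # prints +E'
```

The sketch only handles exact prefix matches. A real model would presumably need soft matching between concepts and some measure of confidence in both the paths and the analogies.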