Benjamin Lambert
----------------------------------------------------------------

Web:
Searches for technical information can often be formulated as a very precise query. For example, "sorting PERL hashtable keys" specifies the data sought very precisely. In quotes, however, this phrase is not found by Google, because if the data exists on the Web it is phrased differently from the query. Without quotes, the query has very low precision because each of the terms is likely to appear in both relevant and non-relevant documents. This may be because Web sites with technical information often have navigational links to the "Perl" section of the website or to "sorting". Perhaps one way to increase precision is to discount some occurrences of the search terms on a page. This might be accomplished by examining their context within the page. Do they appear next to many other proper nouns (Java, C++), and are they links, perhaps navigational links? Does the context appear to be written language (i.e., is the word in a roughly grammatical English sentence)? Perhaps we could ignore all text on the page that does not occur in a (roughly) grammatical sentence.

Email:
Email may be able to serve as an excellent "push" form of information retrieval. Mailing lists are an example of this: I want information about IR conferences, so I am on an IR mailing list. For many users this may be a better way to get subject-specific information than an RSS feed because all the "pushed" information (email itself is also "pushed" information) comes in through one channel, the email client. However, not all information is created equal; when a user is very busy with other things, mailing list emails may be as much a nuisance as spam. Perhaps a semi-supervised, interactive clustering/filtering system would allow email reading to be more focused for some users. Folder-based systems of email organization may not be sufficient because emails may belong in multiple classes.
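One way to picture the difference from folders is a label index in which a single message can sit under several labels at once. The sketch below is only an illustration under that assumption; the class name LabelStore, the message ids, and the label names are hypothetical and do not come from any real email client.

```python
from collections import defaultdict


class LabelStore:
    """Toy multi-label mail index: one message may carry several labels."""

    def __init__(self):
        self._labels_by_msg = defaultdict(set)  # message id -> set of labels
        self._msgs_by_label = defaultdict(set)  # label -> set of message ids

    def add_label(self, msg_id, label):
        # Record the label in both directions so either lookup is cheap.
        self._labels_by_msg[msg_id].add(label)
        self._msgs_by_label[label].add(msg_id)

    def remove_label(self, msg_id, label):
        # discard() is a no-op if the label was never attached.
        self._labels_by_msg[msg_id].discard(label)
        self._msgs_by_label[label].discard(msg_id)

    def labels(self, msg_id):
        return sorted(self._labels_by_msg.get(msg_id, ()))

    def messages(self, label):
        return sorted(self._msgs_by_label.get(label, ()))


if __name__ == "__main__":
    store = LabelStore()
    # A conference announcement from an IR mailing list belongs to two classes.
    store.add_label("msg-1", "mailing-list")
    store.add_label("msg-1", "conferences")
    store.add_label("msg-2", "mailing-list")
    print(store.labels("msg-1"))           # ['conferences', 'mailing-list']
    print(store.messages("mailing-list"))  # ['msg-1', 'msg-2']
```

Unlike a folder, removing one label here leaves the message reachable under its other labels.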
An email might both serve as a "receipt" of purchase and contain a "serial number". Lost or missing emails should be avoided as much as possible because a serial number might be worth the price of the product. Perhaps, to minimize the chance of emails being misclassified, the user should always supervise the classification of emails. To minimize user effort, the email client might suggest categories, and the user could manually add or remove class labels for an email. A system like this might easily employ machine learning to make classifications. Manual rules might help to classify emails sent from people in the user's address book (e.g., from friends or family into the personal category).

Literature:
IR for literature may be the most important since the information is authoritative. Google and Citeseer can index papers in PS and PDF form, and Citeseer appears to automatically extract special fields from the document (e.g., title, author, bibliography). Perhaps an interesting next step would be to make browsing through the documents more tractable by automatically identifying related literature. Possible ways to find related literature would be word-level similarity (common keywords), bibliographic similarity, and the medium in which the work appeared (same conference, same workshop, same author, etc.). For the suggestions not to overwhelm the user, some user feedback would seem necessary. If suggested literature from the same workshop is not relevant, the system might suggest documents using a different heuristic.
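The related-literature heuristics above could be sketched roughly as follows. This is a minimal illustration, not a description of how Citeseer or Google work: the Paper record, the HEURISTICS table, the suggest function, and the choice of Jaccard overlap as the similarity measure are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical record; real extracted fields would be noisier.
    title: str
    venue: str
    keywords: set = field(default_factory=set)
    references: set = field(default_factory=set)


def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# One scoring function per heuristic named in the text.
HEURISTICS = {
    "keywords": lambda p, q: jaccard(p.keywords, q.keywords),
    "bibliography": lambda p, q: jaccard(p.references, q.references),
    "venue": lambda p, q: 1.0 if p.venue == q.venue else 0.0,
}


def suggest(target, candidates, heuristic, top_n=3):
    """Return up to top_n titles related to target under one heuristic."""
    score = HEURISTICS[heuristic]
    scored = [(score(target, c), c.title) for c in candidates if c is not target]
    # Highest score first; ties broken alphabetically by title.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [title for s, title in ranked if s > 0][:top_n]


if __name__ == "__main__":
    a = Paper("A", "SIGIR", {"ranking", "email"}, {"r1", "r2"})
    b = Paper("B", "SIGIR", {"parsing"}, {"r9"})
    c = Paper("C", "WWW", {"ranking", "web"}, {"r1", "r2", "r3"})
    papers = [a, b, c]
    print(suggest(a, papers, "venue"))         # ['B']
    print(suggest(a, papers, "bibliography"))  # ['C']
```

Keeping each heuristic behind its own key means that, if the user marks the same-venue suggestions as irrelevant, a caller can simply retry with "keywords" or "bibliography".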