As a first step in many information retrieval and natural language processing tasks, tokenization is the process of separating text into individual tokens, each conveying some semantic meaning. For English, tokens are in most cases equivalent to words. Biomedical text, however, often contains names and symbols of various biomedical entities, such as genes, proteins, and chemicals. The special characters in these names and symbols make it harder to identify meaningful tokens than in ordinary English text.
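A small sketch can illustrate the difficulty. The entity names below (e.g. "IL-2", "NF-kappaB", "Ca(2+)") are hypothetical examples chosen for illustration, not drawn from this study; a generic punctuation-based tokenizer breaks such names apart, while a whitespace-only tokenizer leaves trailing punctuation attached.

```python
import re

# Illustrative biomedical sentences (hypothetical examples) whose
# special characters make token boundaries ambiguous.
texts = [
    "IL-2 activates NF-kappaB.",
    "The Ca(2+) channel binds p53.",
]

def naive_tokenize(text):
    # Split on any run of non-alphanumeric characters, as a generic
    # English tokenizer might do; this fragments entity names.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def whitespace_tokenize(text):
    # Split on whitespace only; entity names survive intact, but
    # sentence punctuation stays glued to the last token.
    return text.split()

for t in texts:
    print(naive_tokenize(t))
    print(whitespace_tokenize(t))
```

Running this shows, for instance, that the punctuation-based tokenizer splits "IL-2" into "IL" and "2" and "NF-kappaB" into "NF" and "kappaB", losing the entity names, while the whitespace tokenizer yields "NF-kappaB." with a trailing period. Neither strategy is satisfactory on its own, which motivates an empirical comparison.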
This page presents an empirical study of tokenization strategies for biomedical text.
Please contact Jing Jiang (jiang4 AT cs DOT uiuc DOT edu) if you have any questions, comments, or suggestions.