Domain adaptation of statistical classifiers is the problem that arises when the data distribution in the test domain differs from that in the training domain. The need for domain adaptation is prevalent in many real-world classification problems. For example, a spam filter can be trained on some public collection of spam and ham emails. But when the filter is applied to an individual person's inbox, we may want to ``personalize'' it, i.e., to adapt it to that person's own distribution of emails in order to achieve better performance.
Although domain adaptation is a fundamental problem in machine learning, it has only recently started to attract much attention (Blitzer et al., 2008; Daumé III and Marcu, 2006; Satpal and Sarawagi, 2007; Daumé III, 2007; Jiang and Zhai, 2007b; Ben-David et al., 2007; Blitzer et al., 2006; Jiang and Zhai, 2007a). However, some special kinds of domain adaptation problems have been studied before under different names, including class imbalance (Japkowicz and Stephen, 2002), covariate shift (Shimodaira, 2000), and sample selection bias (Zadrozny, 2004; Heckman, 1979). There are also some closely related but not equivalent machine learning problems that have been studied extensively, including multi-task learning (Caruana, 1997) and semi-supervised learning (Chapelle et al., 2006; Zhu, 2005).
In this literature survey, we review existing work related to domain adaptation in both the machine learning and the natural language processing communities. The goal of this survey is twofold. First, a number of methods have been proposed to address domain adaptation, but it is not clear how these methods are related to each other. This survey thus tries to organize the existing work and lay out an overall picture of the domain adaptation problem and its possible solutions. Second, a systematic literature survey naturally reveals the limitations of current work and points out promising directions that should be explored in the future.
Because domain adaptation is a relatively new topic that is still actively attracting attention, our survey is necessarily incomplete. Nevertheless, we try to cover the major lines of work that we are aware of at the time of writing. This survey will also be updated periodically.