One simple assumption we can make about the connection between the distributions of the source and the target domains is that given the same class label, the conditional distributions of $X$ are the same in the two domains. However, the class distributions may be different in the source and the target domains. Formally, we assume that $P_s(X \mid Y = y) = P_t(X \mid Y = y)$ for all $y \in \mathcal{Y}$, but $P_s(Y) \ne P_t(Y)$. This difference is referred to as the class imbalance problem in some work (Japkowicz and Stephen, 2002).
When this class imbalance assumption is made, the ratio $\frac{P_t(x, y)}{P_s(x, y)}$ that we derived in Equation () can be rewritten as follows:
$$\frac{P_t(x, y)}{P_s(x, y)} = \frac{P_t(y)}{P_s(y)} \cdot \frac{P_t(x \mid y)}{P_s(x \mid y)} = \frac{P_t(y)}{P_s(y)},$$
where the last equality follows from the assumption that the class-conditional distributions are the same in the two domains.
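Since the ratio reduces to $\frac{P_t(y)}{P_s(y)}$, each source instance can be reweighted using only the two class distributions. A minimal sketch (the function name and dictionary-based interface are illustrative, not from the cited work; the target class distribution is assumed known here):

```python
from collections import Counter

def class_weights(source_labels, target_class_dist):
    """Per-class instance weights P_t(y) / P_s(y).

    source_labels: labels of the labeled source training set.
    target_class_dist: dict mapping class -> P_t(y), assumed known.
    """
    n = len(source_labels)
    counts = Counter(source_labels)
    # P_s(y) is estimated from source label frequencies: counts[y] / n
    return {y: target_class_dist[y] / (counts[y] / n) for y in counts}

# Source labels are 75% "a", 25% "b"; target is balanced,
# so "b" instances get upweighted and "a" instances downweighted.
weights = class_weights(["a", "a", "a", "b"], {"a": 0.5, "b": 0.5})
```

Instances of the under-represented class in the source domain receive weights above 1, mimicking a sample drawn under the target class distribution.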
For classification algorithms that directly model the conditional probability $P(Y \mid X)$, such as logistic regression classifiers, it can be shown theoretically that the estimated probability $P_s(y \mid x)$ can be transformed into $P_t(y \mid x)$ in the following way (Lin et al., 2002; Chan and Ng, 2005):
$$P_t(y \mid x) = \frac{\frac{P_t(y)}{P_s(y)} \, P_s(y \mid x)}{\sum_{y'} \frac{P_t(y')}{P_s(y')} \, P_s(y' \mid x)}.$$
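Assuming the source-trained posteriors and the two class distributions are available as NumPy arrays, the transformation can be sketched as follows (the function and variable names are illustrative):

```python
import numpy as np

def correct_posteriors(p_s, source_prior, target_prior):
    """Rescale source-trained posteriors P_s(y|x) into target posteriors.

    p_s: array of shape (n_samples, n_classes) holding P_s(y|x).
    source_prior, target_prior: arrays of shape (n_classes,).
    """
    w = target_prior / source_prior            # P_t(y) / P_s(y)
    unnorm = p_s * w                           # numerator of the formula
    # Denominator: sum of the reweighted posteriors over all classes
    return unnorm / unnorm.sum(axis=1, keepdims=True)

# A source classifier biased toward class 0 (source prior 0.75/0.25);
# under a balanced target prior, the corrected posterior is balanced too.
p_t = correct_posteriors(np.array([[0.75, 0.25]]),
                         np.array([0.75, 0.25]),
                         np.array([0.5, 0.5]))
```

The rescaling only touches the classifier's output probabilities, so no retraining on target data is needed.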
For other classification algorithms that do not directly model $P(Y \mid X)$, such as naive Bayes classifiers and support vector machines, if $P_s(y \mid x)$ can be obtained through careful calibration, the same trick can be applied. Chan and Ng (2006) applied this method to the domain adaptation problem in word sense disambiguation (WSD) using naive Bayes classifiers.
In practice, one needs to know the class distribution in the target domain in order to apply the methods described above. In some studies, it is assumed that this distribution is known a priori (Lin et al., 2002). However, in reality, we may not have this information. Chan and Ng (2005) proposed to use the EM algorithm to estimate the class distribution in the target domain.
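The EM idea can be sketched as alternating between correcting the posteriors under the current prior estimate and re-estimating the prior from the corrected posteriors. The sketch below follows this general scheme; it is not the exact procedure of Chan and Ng (2005), and the function name and fixed iteration count are illustrative:

```python
import numpy as np

def em_target_priors(p_s, source_prior, n_iter=100):
    """Estimate target class priors by EM on unlabeled target data.

    p_s: (n_samples, n_classes) posteriors P_s(y|x) of a source-trained
         classifier evaluated on unlabeled target instances.
    source_prior: (n_classes,) class distribution of the source data.
    """
    prior = source_prior.copy()
    for _ in range(n_iter):
        # E-step: posteriors corrected under the current prior estimate
        unnorm = p_s * (prior / source_prior)
        p_t = unnorm / unnorm.sum(axis=1, keepdims=True)
        # M-step: new prior = average of the corrected posteriors
        prior = p_t.mean(axis=0)
    return prior

# When the average source posterior already matches the source prior,
# the estimate is a fixed point and stays at that distribution.
est = em_target_priors(np.array([[0.6, 0.4], [0.4, 0.6]]),
                       np.array([0.5, 0.5]))
```

Once the priors converge, the final corrected posteriors can be used directly for classification on the target domain.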