
## 3.1 Class Imbalance

One simple assumption we can make about the connection between the distributions of the source and the target domains is that, given the same class label, the conditional distributions of $X$ are the same in the two domains. However, the class distributions may be different in the source and the target domains. Formally, we assume that $P_s(X \mid Y = y) = P_t(X \mid Y = y)$ for all $y \in \mathcal{Y}$, but $P_s(Y) \neq P_t(Y)$. This difference is referred to as the *class imbalance* problem in some work (Japkowicz and Stephen, 2002).

When this class imbalance assumption is made, the ratio $\frac{P_t(x, y)}{P_s(x, y)}$ that we derived in Equation () can be rewritten as follows:

$$\frac{P_t(x, y)}{P_s(x, y)} = \frac{P_t(y)}{P_s(y)} \cdot \frac{P_t(x \mid y)}{P_s(x \mid y)} = \frac{P_t(y)}{P_s(y)}.$$

Therefore, we only need to use $\frac{P_t(y)}{P_s(y)}$ to weight the instances. This approach has been explored by Lin et al. (2002). Alternatively, we can re-sample the training instances from the source domain so that the re-sampled data has roughly the same class distribution as the target domain. In re-sampling methods, under-represented classes are over-sampled, and over-represented classes are under-sampled (Chawla et al., 2002; Zhu and Hovy, 2007; Kubat and Matwin, 1997).
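
The two options above, weighting by $P_t(y)/P_s(y)$ and re-sampling toward the target class distribution, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `class_weights` and `resample_to_target` are hypothetical, the source prior is estimated from label frequencies, and the target prior is assumed to be known.

```python
import random
from collections import Counter


def class_weights(source_labels, target_prior):
    """Return r(y) = P_t(y) / P_s(y) for each class seen in the source labels."""
    counts = Counter(source_labels)
    n = len(source_labels)
    # P_s(y) is estimated by the empirical label frequency counts[y] / n.
    return {y: target_prior[y] / (counts[y] / n) for y in counts}


def resample_to_target(instances, labels, target_prior, size, seed=0):
    """Draw `size` instances with replacement so classes roughly follow target_prior."""
    rng = random.Random(seed)
    r = class_weights(labels, target_prior)
    # Each instance is drawn with probability proportional to r(y) of its label,
    # which over-samples under-represented classes and under-samples the rest.
    per_instance = [r[y] for y in labels]
    idx = rng.choices(range(len(labels)), weights=per_instance, k=size)
    return [instances[i] for i in idx], [labels[i] for i in idx]


# Hypothetical data: the source is 80% "pos", the target is assumed balanced.
source_labels = ["pos"] * 80 + ["neg"] * 20
target_prior = {"pos": 0.5, "neg": 0.5}
weights = class_weights(source_labels, target_prior)
```

Here `weights` could be passed as per-instance weights to a learner that supports them, while `resample_to_target` suits learners that do not.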
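For classification algorithms that directly model the probability distribution $P(Y \mid X)$, such as logistic regression classifiers, it can be shown theoretically that the estimated probability $P_s(y \mid x)$ can be transformed into $P_t(y \mid x)$ in the following way (Lin et al., 2002; Chan and Ng, 2005):

$$P_t(y \mid x) = \frac{r(y) \, P_s(y \mid x)}{\sum_{y' \in \mathcal{Y}} r(y') \, P_s(y' \mid x)},$$

where $r(y)$ is defined as

$$r(y) = \frac{P_t(y)}{P_s(y)}.$$

Now we can first estimate $P_s(y \mid x)$ from the source domain, and then derive $P_t(y \mid x)$ using $P_s(y)$ and $P_t(y)$.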
For other classification algorithms that do not directly model $P(Y \mid X)$, such as naive Bayes classifiers and support vector machines, if $P_s(y \mid x)$ can be obtained through careful calibration, the same trick can be applied. Chan and Ng (2006) applied this method to the domain adaptation problem in word sense disambiguation (WSD) using naive Bayes classifiers.
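
As a small worked example of this posterior transformation, the sketch below rescales a source-domain posterior by the ratio of target to source class priors and renormalizes. The function name `adjust_posterior` and the numbers are hypothetical, and both priors are assumed to be given.

```python
def adjust_posterior(source_posterior, source_prior, target_prior):
    """Map P_s(y|x) to P_t(y|x) by reweighting with r(y) = P_t(y) / P_s(y)."""
    unnormalized = {
        y: (target_prior[y] / source_prior[y]) * p
        for y, p in source_posterior.items()
    }
    # Renormalize so the adjusted probabilities sum to one over the classes.
    z = sum(unnormalized.values())
    return {y: v / z for y, v in unnormalized.items()}


# Hypothetical values: the classifier output equals the source prior,
# i.e. the instance x carries no evidence beyond the class frequencies.
source_prior = {"pos": 0.8, "neg": 0.2}
target_prior = {"pos": 0.5, "neg": 0.5}
source_posterior = {"pos": 0.8, "neg": 0.2}
adjusted = adjust_posterior(source_posterior, source_prior, target_prior)
```

In this case the adjusted posterior comes out at approximately the target prior, which is the behaviour one would expect when the instance itself is uninformative.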

In practice, one needs to know the class distribution in the target domain in order to apply the methods described above. In some studies, it is assumed that this distribution is known a priori (Lin et al., 2002). However, in reality, we may not have this information. Chan and Ng (2005) proposed to use the EM algorithm to estimate the class distribution in the target domain.
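
To illustrate how an EM procedure might estimate the target class distribution from unlabeled target instances, here is a minimal sketch. It is an assumption-laden illustration of one standard EM scheme for prior re-estimation, not necessarily the exact procedure of Chan and Ng (2005): the function name `em_target_prior` is hypothetical, and the source-domain posteriors $P_s(y \mid x_i)$ for the target instances are assumed to be available and reasonably calibrated.

```python
def em_target_prior(source_posteriors, source_prior, iterations=100, tol=1e-8):
    """Estimate P_t(y) from source-model posteriors on unlabeled target instances."""
    classes = list(source_prior)
    prior = dict(source_prior)  # start from the source prior as the initial guess
    for _ in range(iterations):
        totals = {y: 0.0 for y in classes}
        for post in source_posteriors:
            # E-step: adjust P_s(y|x) with the current prior estimate and normalize.
            unnorm = {y: prior[y] / source_prior[y] * post[y] for y in classes}
            z = sum(unnorm.values())
            for y in classes:
                totals[y] += unnorm[y] / z
        # M-step: the new prior is the average adjusted posterior.
        new_prior = {y: totals[y] / len(source_posteriors) for y in classes}
        change = max(abs(new_prior[y] - prior[y]) for y in classes)
        prior = new_prior
        if change < tol:
            break
    return prior


# Hypothetical, fully confident posteriors on ten unlabeled target instances.
posteriors = [{"pos": 1.0, "neg": 0.0}] * 3 + [{"pos": 0.0, "neg": 1.0}] * 7
estimated = em_target_prior(posteriors, {"pos": 0.8, "neg": 0.2})
```

The estimated prior can then be plugged into the weighting or posterior-adjustment formulas in place of a known $P_t(y)$.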

Jing Jiang
2008-03-06