3.2 Covariate Shift

Another assumption one can make about the connection between the source and the target domains is that given the same observation $X = x$, the conditional distributions of $Y$ are the same in the two domains. However, the marginal distributions of $X$ may be different in the source and the target domains. Formally, we assume that $P_t(Y \mid X = x) = P_s(Y \mid X = x)$ for all $x \in \mathcal{X}$, but $P_t(X) \neq P_s(X)$. This difference between the two domains is called *covariate shift* (Shimodaira, 2000).

At first glance, it may appear that covariate shift is not a problem. For classification, we are only interested in $P(Y \mid X)$. If $P_t(Y \mid X = x) = P_s(Y \mid X = x)$, why would the classifier learned from the source domain not perform well on the target domain even if $P_t(X) \neq P_s(X)$? Shimodaira (2000) showed that covariate shift becomes a problem when *misspecified* models are used. Suppose we consider a parametric model family $\{P(Y \mid X, \theta)\}_{\theta \in \Theta}$ from which a model $P(Y \mid X, \hat{\theta})$ is selected to minimize the expected classification error. If none of the models in the family can exactly match the true relation between $X$ and $Y$, that is, there does not exist any $\theta \in \Theta$ such that $P(Y \mid X = x, \theta) = P(Y \mid X = x)$ for all $x$, then we say that we have a misspecified model family. The intuition for why covariate shift under model misspecification becomes a problem is as follows. With a misspecified model family, the optimal model we select depends on $P(X)$, and if $P_t(X) \neq P_s(X)$, then the optimal model for the target domain will differ from that for the source domain. The reason is that the selected model fits the true relation better in dense regions of $P(X)$ than in sparse regions, because the dense regions dominate the average classification error, which is what we want to minimize. If the dense regions of $X$ are different in the source and the target domains, the optimal model for the source domain will no longer be optimal for the target domain.

Under covariate shift, the ratio $\frac{P_t(x, y)}{P_s(x, y)}$ that we derived in Equation () can be rewritten as follows:

$$\frac{P_t(x, y)}{P_s(x, y)} = \frac{P_t(y \mid x)\, P_t(x)}{P_s(y \mid x)\, P_s(x)} = \frac{P_t(x)}{P_s(x)}.$$

We therefore want to weight each training instance with $\frac{P_t(x)}{P_s(x)}$.

Shimodaira (2000) first proposed to re-weight the log likelihood of each training instance by $\frac{P_t(x)}{P_s(x)}$ in maximum likelihood estimation under covariate shift. It can be shown theoretically that if the support of $P_t(X)$ (the set of $x$'s for which $P_t(X = x) > 0$) is contained in the support of $P_s(X)$, then the optimal model that maximizes this re-weighted log likelihood function asymptotically converges to the optimal model for the target domain.
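The re-weighting can be sketched as follows (a hypothetical setup with 1-D Gaussian domains where the true density ratio is known in closed form, not an example from the cited papers; weighted least squares plays the role of the re-weighted maximum likelihood estimate under Gaussian noise). Note that the support condition holds here, since both domains have full support on the real line.

```python
# Sketch: re-weighting source instances by the known density ratio
# p_t(x)/p_s(x) makes a misspecified fit on source data approximate the
# target-optimal model (assumed setup: source X ~ N(0,1), target X ~ N(2,1),
# same quadratic P(Y|X) in both domains).
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Labeled data from the source domain only.
xs = rng.normal(0.0, 1.0, size=50_000)
ys = xs**2 + rng.normal(scale=0.1, size=xs.size)

# Instance weights: the (here, known) density ratio p_t(x)/p_s(x).
w = gauss_pdf(xs, 2.0, 1.0) / gauss_pdf(xs, 0.0, 1.0)

# Weighted least-squares fit of a misspecified line on source data.
# np.polyfit multiplies residuals by w before squaring, so sqrt(w) yields
# squared errors weighted by the density ratio itself.
slope_w, _ = np.polyfit(xs, ys, deg=1, w=np.sqrt(w))

print(slope_w)  # close to the target-optimal slope, not the source-optimal one
```

The unweighted fit on this source sample has slope near 0, while the weighted fit recovers (up to sampling noise) the slope that is optimal under the target density.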

A major challenge is how to estimate the ratio $\frac{P_t(x)}{P_s(x)}$ for each $x$ in the training set. Some work explores a principled method based on non-parametric kernel density estimation (Sugiyama and Müller, 2005; Shimodaira, 2000). Other work transforms this density ratio estimation into the problem of predicting whether an instance comes from the source domain or from the target domain (Zadrozny, 2004; Bickel and Scheffer, 2007). Huang et al. (2007) transformed the problem into a kernel mean matching problem in a reproducing kernel Hilbert space. Bickel et al. (2007) proposed to learn this ratio together with the classification model parameters.
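The domain-classifier approach can be sketched as follows (an assumed 1-D Gaussian setup; the logistic model and gradient-descent fit are illustrative, not the formulations of the cited papers). With equal-sized unlabeled samples from the two domains, the classifier's odds $P(\text{target} \mid x) / P(\text{source} \mid x)$ estimate $P_t(x)/P_s(x)$ without any explicit density estimation.

```python
# Sketch of the "domain classifier" trick for density ratio estimation:
# train a classifier to predict whether an instance came from the source
# or the target domain, then use its odds as the density ratio.
import numpy as np

rng = np.random.default_rng(2)

# Equal-sized unlabeled samples from the two domains (assumed Gaussians).
xs = rng.normal(0.0, 1.0, size=20_000)   # source: N(0, 1)
xt = rng.normal(1.0, 1.0, size=20_000)   # target: N(1, 1)

x = np.concatenate([xs, xt])
d = np.concatenate([np.zeros(xs.size), np.ones(xt.size)])  # 0=source, 1=target

# Logistic regression P(d=1|x) = sigmoid(a*x + b), fit by gradient descent
# on the average log loss (the problem is convex, so this converges).
a, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(a * x + b)))
    a -= 0.5 * np.mean((p - d) * x)
    b -= 0.5 * np.mean(p - d)

# With equal sample sizes, the odds exp(a*x + b) estimate p_t(x)/p_s(x).
# For these two Gaussians the true log-ratio is x - 0.5, so (a, b) should
# come out near (1.0, -0.5).
print(a, b)
```

If the two samples have unequal sizes $n_s$ and $n_t$, the odds must be rescaled by $n_s / n_t$ to recover the density ratio.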