
3.2 Covariate Shift

Another assumption one can make about the connection between the source and the target domains is that given the same observation $ X = x$, the conditional distributions of $ Y$ are the same in the two domains. However, the marginal distributions of $ X$ may be different in the source and the target domains. Formally, we assume that $ P_s(Y \vert X = x) = P_t(Y \vert X = x)$ for all $ x \in \mathcal{X}$, but $ P_s(X) \ne P_t(X)$. This difference between the two domains is called covariate shift (Shimodaira, 2000).

At first glance, covariate shift may not appear to be a problem. For classification, we are only interested in $ P(Y \vert X)$. If $ P_s(Y \vert X) = P_t(Y \vert X)$, why would a classifier learned from the source domain not perform well on the target domain even if $ P_s(X) \ne P_t(X)$? Shimodaira (2000) showed that covariate shift becomes a problem when misspecified models are used. Suppose we consider a parametric model family $ \{P(Y \vert X, \theta)\}_{\theta \in \Theta}$ from which a model $ P(Y \vert X, \theta^*)$ is selected to minimize the expected classification error. If no model in the family exactly matches the true relation between $ X$ and $ Y$, that is, there does not exist any $ \theta \in \Theta$ such that $ P(Y \vert X = x, \theta) = P(Y \vert X = x)$ for all $ x \in \mathcal{X}$, then we say the model family is misspecified. The intuition for why covariate shift under model misspecification becomes a problem is as follows. With a misspecified model family, the optimal model we select depends on $ P(X)$, so if $ P_t(X) \ne P_s(X)$, the optimal model for the target domain will differ from that for the source domain. Specifically, the optimal model fits the true relation better in dense regions of $ X$ than in sparse regions, because the dense regions dominate the average classification error, which is what we want to minimize. If the dense regions of $ X$ differ between the source and the target domains, the optimal model for the source domain will no longer be optimal for the target domain.
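To make this concrete, here is a small numpy sketch (our own toy construction, not from the text): the true relation is $ E[Y \vert X = x] = x^2$, the model family is linear and hence misspecified, and we compute the best linear fit under two different distributions of $ X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (nonlinear) relation: E[Y|X=x] = x^2. The model family is linear,
# y ~ a*x + b, so it is misspecified: no (a, b) matches x^2 everywhere.
def true_mean(x):
    return x ** 2

def best_linear_fit(x):
    """Least-squares linear fit of the true mean under the empirical density of x."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, true_mean(x), rcond=None)
    return coef  # (slope, intercept)

# Source domain: X concentrated near 0; target domain: X concentrated near 2.
x_src = rng.normal(0.0, 0.5, size=100_000)
x_tgt = rng.normal(2.0, 0.5, size=100_000)

slope_src, _ = best_linear_fit(x_src)
slope_tgt, _ = best_linear_fit(x_tgt)

# The optimal model depends on where the density of X is concentrated:
# near 0 the best linear fit of x^2 is almost flat, near 2 its slope is ~4.
print(slope_src, slope_tgt)
```

The source-optimal and target-optimal models disagree sharply, even though $ P(Y \vert X)$ is identical in both domains.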

Under covariate shift, the ratio $ \frac{P_t(x, y)}{P_s(x, y)}$ that we derived in Equation ([*]) can be rewritten as follows:

$\displaystyle \frac{P_t(x, y)}{P_s(x, y)} = \frac{P_t(x)}{P_s(x)} \frac{P_t(y \vert x)}{P_s(y \vert x)} = \frac{P_t(x)}{P_s(x)}.$

The second equality holds because $ P_s(y \vert x) = P_t(y \vert x)$ under covariate shift. We therefore want to weight each source-domain training instance with $ \frac{P_t(x)}{P_s(x)}$.

Shimodaira (2000) first proposed to re-weight the log likelihood of each training instance $ (x, y)$ by $ \frac{P_t(x)}{P_s(x)}$ in maximum likelihood estimation under covariate shift. It can be shown theoretically that if the support of $ P_t(X)$ (the set of $ x$'s for which $ P_t(X = x) > 0$) is contained in the support of $ P_s(X)$, then the model that maximizes this re-weighted log likelihood converges asymptotically to the optimal model for the target domain.
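A minimal numpy sketch of this re-weighted maximum likelihood idea for logistic regression (the function name and the toy data below are our own, chosen for illustration):

```python
import numpy as np

def weighted_logreg(X, y, w, lr=0.5, n_iter=2000):
    """Fit logistic regression by gradient ascent on the instance-weighted
    log likelihood  sum_i w_i * log P(y_i | x_i, theta).
    Setting w_i = P_t(x_i) / P_s(x_i) gives the re-weighted maximum
    likelihood estimator for covariate shift."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))         # P(y = 1 | x, theta)
        theta += lr * X.T @ (w * (y - p)) / len(y)   # weighted score function
    return theta

# Toy check: with uniform weights this reduces to ordinary maximum likelihood.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])  # feature + bias
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = weighted_logreg(X, y, np.ones(4))
print(theta)  # positive slope: the model predicts y = 1 for x > 0
```

Passing non-uniform weights $ w_i = \frac{P_t(x_i)}{P_s(x_i)}$ tilts the fit toward the regions of $ \mathcal{X}$ that are dense under the target distribution.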

A major challenge is how to estimate the ratio $ \frac{P_t(x)}{P_s(x)}$ for each $ x$ in the training set. One line of work estimates the two densities separately using non-parametric kernel density estimation (Shimodaira, 2000; Sugiyama and Müller, 2005). Another line of work transforms density ratio estimation into the problem of predicting whether an instance comes from the source domain or from the target domain (Zadrozny, 2004; Bickel and Scheffer, 2007). Huang et al. (2007) transformed the problem into a kernel mean matching problem in a reproducing kernel Hilbert space, and Bickel et al. (2007) proposed to learn the ratio jointly with the classification model parameters.
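The domain-classifier idea can be sketched as follows (a toy numpy construction of our own, with both domains Gaussian so the true ratio is known in closed form): train a probabilistic classifier to predict the domain of an instance, then read the density ratio off its odds, rescaled by the sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled samples from the two domains. Source: N(0, 1); target: N(1, 1),
# so the true ratio P_t(x)/P_s(x) = exp(x - 0.5) is known for checking.
x_src = rng.normal(0.0, 1.0, size=5000)
x_tgt = rng.normal(1.0, 1.0, size=5000)

# Train a logistic "domain classifier": d = 1 if x came from the target.
X = np.concatenate([x_src, x_tgt])
d = np.concatenate([np.zeros(len(x_src)), np.ones(len(x_tgt))])
Xb = np.column_stack([X, np.ones(len(X))])   # feature + intercept

theta = np.zeros(2)
for _ in range(3000):                         # full-batch gradient ascent
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))
    theta += 0.5 * Xb.T @ (d - p) / len(d)

def density_ratio(x):
    """P_t(x)/P_s(x) estimated as (n_s/n_t) * P(d=1|x) / P(d=0|x)."""
    p = 1.0 / (1.0 + np.exp(-(theta[0] * x + theta[1])))
    return (len(x_src) / len(x_tgt)) * p / (1.0 - p)

# True ratio exp(x - 0.5) equals 1 at x = 0.5; the estimate should be close.
print(density_ratio(0.5))
```

No explicit density estimation is needed: the classifier's odds carry the ratio directly, which is why this reduction scales better to high-dimensional $ x$ than kernel density estimation.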

Jing Jiang 2008-03-06