

3 Instance Weighting

One general approach to the domain adaptation problem is to assign instance-dependent weights to the terms of the loss function when minimizing the expected loss over the distribution of the data. To see why instance weighting may help, let us first briefly review the empirical risk minimization framework for standard supervised learning (Vapnik, 1999), and then informally derive an instance weighting solution to domain adaptation. Let $\Theta$ be a model family from which we want to select an optimal model $\theta^*$ for our classification task, and let $l(x, y, \theta)$ be a loss function. In principle, we want to minimize the following objective in order to obtain the optimal model $\theta^*$ for the distribution $P(X, Y)$:

\[
\theta^* = \argmin_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} P(x, y)\, l(x, y, \theta).
\]

Because $P(X, Y)$ is unknown, we approximate it with the empirical distribution $\tilde{P}(X, Y)$. Let $\{(x_i, y_i)\}_{i = 1}^N$ be a set of training instances randomly sampled from $P(X, Y)$. We then minimize the following empirical risk in order to find a good model $\hat{\theta}$:
\begin{align*}
\hat{\theta} &= \argmin_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \tilde{P}(x, y)\, l(x, y, \theta) \\
&= \argmin_{\theta \in \Theta} \sum_{i = 1}^N l(x_i, y_i, \theta).
\end{align*}
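To make the framework concrete, here is a minimal sketch (not from the survey) of empirical risk minimization with a logistic loss, fit by plain gradient descent; the function names and hyperparameters are illustrative assumptions.

\begin{verbatim}
import numpy as np

def logistic_loss(theta, X, y):
    """Per-instance loss l(x_i, y_i, theta) for labels y_i in {-1, +1}."""
    return np.log1p(np.exp(-y * (X @ theta)))

def erm_fit(X, y, lr=0.1, steps=2000):
    """Minimize the empirical risk, the average of logistic_loss
    over the training sample, by gradient descent."""
    _, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        sig = 1.0 / (1.0 + np.exp(y * (X @ theta)))    # sigmoid(-y x.theta)
        grad = -(X * (y * sig)[:, None]).mean(axis=0)  # gradient of mean loss
        theta -= lr * grad
    return theta
\end{verbatim}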

Now consider the setting of domain adaptation. Ideally, we want to find an optimal model for the target domain that minimizes the expected loss over the target distribution:

\[
\theta^*_t = \argmin_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} P_t(x, y)\, l(x, y, \theta).
\]

However, our training instances $D_s = \{(x^s_i, y^s_i)\}_{i = 1}^{N_s}$ are randomly sampled from the source distribution $P_s(X, Y)$. We can rewrite the equation above as follows:
\begin{align*}
\theta^*_t &= \argmin_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \frac{P_t(x, y)}{P_s(x, y)}\, P_s(x, y)\, l(x, y, \theta) \\
&\approx \argmin_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \frac{P_t(x, y)}{P_s(x, y)}\, \tilde{P}_s(x, y)\, l(x, y, \theta) \\
&= \argmin_{\theta \in \Theta} \sum_{i = 1}^{N_s} \frac{P_t(x^s_i, y^s_i)}{P_s(x^s_i, y^s_i)}\, l(x^s_i, y^s_i, \theta). \tag{1}
\end{align*}

As Equation (1) shows, weighting the loss for the instance $(x^s_i, y^s_i)$ with the density ratio $\frac{P_t(x^s_i, y^s_i)}{P_s(x^s_i, y^s_i)}$ provides a well-justified solution to the domain adaptation problem.
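For illustration, the ERM sketch above extends naturally to the weighted objective in Equation (1), assuming per-instance estimates of the density ratio are already available; the weights in the usage example below are a hypothetical stand-in for an estimated ratio, not an actual estimator.

\begin{verbatim}
import numpy as np

def weighted_erm_fit(X, y, weights, lr=0.1, steps=2000):
    """Importance-weighted ERM: scale each instance's logistic loss by
    weights[i], an estimate of P_t(x_i, y_i) / P_s(x_i, y_i)."""
    _, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        sig = 1.0 / (1.0 + np.exp(y * (X @ theta)))
        grad = -(X * (weights * y * sig)[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta

# Hypothetical usage: source instances that look more target-like get
# larger weights; exp(x_0) is a stand-in for an estimated density ratio.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 0, 1.0, -1.0)
weights = np.exp(X[:, 0])
weights /= weights.mean()  # normalize so the average weight is 1
theta_hat = weighted_erm_fit(X, y, weights)
\end{verbatim}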

It is not possible to compute the exact value of $\frac{P_t(x, y)}{P_s(x, y)}$ for a pair $(x, y)$, in particular because we usually have few or no labeled instances in the target domain; the ratio must instead be estimated under simplifying assumptions. Section 3.1 reviews one line of work in which $P_t(X \vert Y) = P_s(X \vert Y)$ is assumed, while Section 3.2 reviews another line of work in which $P_t(Y \vert X) = P_s(Y \vert X)$ is assumed.
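To see why these assumptions simplify the estimation problem, note that the joint ratio factorizes in two standard ways (this decomposition is made explicit here for clarity; it is implicit in the discussion above):

\[
\frac{P_t(x, y)}{P_s(x, y)}
= \frac{P_t(y)}{P_s(y)} \cdot \frac{P_t(x \vert y)}{P_s(x \vert y)}
= \frac{P_t(x)}{P_s(x)} \cdot \frac{P_t(y \vert x)}{P_s(y \vert x)}.
\]

Under the first assumption the ratio reduces to $\frac{P_t(y)}{P_s(y)}$, so only the class proportions in the two domains need to be estimated; under the second it reduces to $\frac{P_t(x)}{P_s(x)}$, which can in principle be estimated from unlabeled data alone.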


