

5 Change of Representation

As has been pointed out, the cause of the domain adaptation problem is the difference between $ P_t(X, Y)$ and $ P_s(X, Y)$. Note that while the representation of $ Y$ is fixed, the representation of $ X$ can change if we use different features. Such a change of representation of $ X$ can affect both the marginal distribution $ P(X)$ and the conditional distribution $ P(Y \vert X)$. One can then assume that under some suitable change of representation of $ X$, the source and target joint distributions of the transformed observation and the label become the same.

Formally, let $ g: \mathcal{X} \rightarrow \mathcal{Z}$ denote a transformation function that maps an observation $ x$ in its original representation to a new representation $ z = g(x) \in \mathcal{Z}$. Define a random variable $ Z$ with the induced distribution $ P(z) = \sum_{x \in \mathcal{X}, g(x) = z} P(x)$. The joint distribution of $ Z$ and $ Y$ is then

$\displaystyle P(z, y) = \sum_{x \in \mathcal{X}, g(x) = z} P(x, y).$
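
As a concrete illustration, the following sketch computes the induced joint distribution for a toy joint distribution $ P(x, y)$ and a simple many-to-one mapping $ g$; the numerical values and the mapping are made up purely for illustration.

    from collections import defaultdict

    # Toy illustration of the induced joint distribution
    # P(z, y) = sum over {x : g(x) = z} of P(x, y).
    # The joint P(x, y) and the mapping g below are made-up values.

    # Toy joint distribution P(x, y) over X = {x1, ..., x4} and Y = {0, 1}.
    P_xy = {
        ("x1", 0): 0.20, ("x1", 1): 0.05,
        ("x2", 0): 0.10, ("x2", 1): 0.15,
        ("x3", 0): 0.05, ("x3", 1): 0.25,
        ("x4", 0): 0.05, ("x4", 1): 0.15,
    }

    # Toy many-to-one transformation g: X -> Z.
    g = {"x1": "z1", "x2": "z1", "x3": "z2", "x4": "z2"}

    # Induced joint distribution P(z, y).
    P_zy = defaultdict(float)
    for (x, y), p in P_xy.items():
        P_zy[(g[x], y)] += p

    for (z, y), p in sorted(P_zy.items()):
        print(f"P({z}, {y}) = {p:.2f}")

The same construction, applied separately to the source and the target distributions, yields $ P_s(Z, Y)$ and $ P_t(Z, Y)$.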

If we can find a transformation function $ g$ so that under this transformation, we have $ P_t(Z, Y) = P_s(Z, Y)$, then we no longer have the domain adaptation problem because the two domains have the same joint distribution of the observation and the class label. The optimal model $ P(Y \vert Z, \theta^*)$ we learn to approximate $ P_s(Y \vert Z)$ is still optimal for $ P_t(Y \vert Z)$.

Note that with a change of representation, the entropy of $ Y$ conditional on $ Z$ is likely to be higher than the entropy of $ Y$ conditional on $ X$, because $ Z$ is usually a simpler representation of the observation than $ X$ and thus encodes less information. In other words, the Bayes error rate usually increases under a change of representation. Therefore, a good transformation function should be judged not only by the distance between the induced distributions $ P_t(Z, Y)$ and $ P_s(Z, Y)$ but also by how much it increases the Bayes error rate.
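
To make these two criteria concrete, the following sketch (again with made-up numbers) measures both the distance between the two domains and the conditional entropy of $ Y$ before and after a transformation; the use of total variation distance here is an illustrative assumption, not the distance used in any particular paper.

    import math
    from collections import defaultdict

    def cond_entropy(joint):
        """H(Y | V) for a joint distribution given as {(v, y): prob}."""
        marg = defaultdict(float)
        for (v, _), p in joint.items():
            marg[v] += p
        return -sum(p * math.log2(p / marg[v])
                    for (v, _), p in joint.items() if p > 0)

    def tv_distance(p, q):
        """Total variation distance between two distributions."""
        events = set(p) | set(q)
        return 0.5 * sum(abs(p.get(e, 0.0) - q.get(e, 0.0)) for e in events)

    def induce(joint, g):
        """Induced joint distribution of (g(X), Y)."""
        out = defaultdict(float)
        for (x, y), p in joint.items():
            out[(g[x], y)] += p
        return dict(out)

    # Toy source and target joints over (X, Y), and a transformation that
    # collapses every observation to a single value z.
    P_s_xy = {("x1", 0): 0.3, ("x1", 1): 0.1, ("x2", 0): 0.1, ("x2", 1): 0.5}
    P_t_xy = {("x1", 0): 0.1, ("x1", 1): 0.3, ("x2", 0): 0.3, ("x2", 1): 0.3}
    g = {"x1": "z", "x2": "z"}

    P_s_zy, P_t_zy = induce(P_s_xy, g), induce(P_t_xy, g)
    print("distance before:", tv_distance(P_s_xy, P_t_xy))   # 0.4
    print("distance after: ", tv_distance(P_s_zy, P_t_zy))   # 0.0
    print("H(Y|X) source:  ", cond_entropy(P_s_xy))          # ~0.71 bits
    print("H(Y|Z) source:  ", cond_entropy(P_s_zy))          # ~0.97 bits

For this extreme transformation the distance between the two domains drops to zero, but the conditional entropy of $ Y$ increases, which illustrates the trade-off just described.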

Ben-David et al. (2007) first formally analyzed the effect of representation change for domain adaptation. They proved a generalization bound for domain adaptation that is dependent on the distance between the induced $ P_s(Z, Y)$ and $ P_t(Z, Y)$.
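
Schematically, and omitting finite-sample terms, bounds of this type relate the target domain error $ \epsilon_t(h)$ of a hypothesis $ h \in \mathcal{H}$ to its source domain error $ \epsilon_s(h)$:

$\displaystyle \epsilon_t(h) \leq \epsilon_s(h) + d_{\mathcal{H}}\big(P_s(Z), P_t(Z)\big) + \lambda,$

where $ d_{\mathcal{H}}$ is a divergence between the induced distributions of $ Z$ in the two domains and $ \lambda$ measures how well a single hypothesis in $ \mathcal{H}$ can perform on both domains at once; this is only a schematic form and not the exact statement of their theorem.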

A special and simple kind of transformation is feature subset selection. Satpal and Sarawagi (2007) proposed a feature subset selection method for domain adaptation, in which features are selected to minimize an approximated distance function between the distributions in the two domains. Note that measuring the distance between $ P_s(Z, Y)$ and $ P_t(Z, Y)$ still requires class labels in the target domain; Satpal and Sarawagi (2007) address this by using predicted labels for the target domain instances.
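
The following sketch conveys the general idea of distance-minimizing feature selection; it is not the algorithm of Satpal and Sarawagi (2007). The distance used here is a simple gap between per-class mean feature vectors, and the target labels are replaced by the predictions of a model trained on the source domain.

    import numpy as np

    def domain_distance(Xs, ys, Xt, yt_pred, features):
        """Distance between per-class mean feature vectors of the two domains,
        restricted to the selected features. Target labels are predictions of a
        source-trained model; every class is assumed to be predicted at least once."""
        dist = 0.0
        for c in np.unique(ys):
            mu_s = Xs[ys == c][:, features].mean(axis=0)
            mu_t = Xt[yt_pred == c][:, features].mean(axis=0)
            dist += np.linalg.norm(mu_s - mu_t)
        return dist

    def greedy_feature_selection(Xs, ys, Xt, yt_pred, n_keep):
        """Greedily drop the feature whose removal most reduces the distance."""
        features = list(range(Xs.shape[1]))
        while len(features) > n_keep:
            scores = [domain_distance(Xs, ys, Xt, yt_pred,
                                      [f for f in features if f != drop])
                      for drop in features]
            features.pop(int(np.argmin(scores)))
        return features

In practice the source classifier would be retrained on the selected features, and the procedure possibly iterated, since the predicted target labels change as the representation changes.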

Blitzer et al. (2006) proposed a structural correspondence learning (SCL) algorithm that makes use of the unlabeled data from the target domain to find a low-rank representation that is suitable for domain adaptation. It is empirically shown in (Ben-David et al., 2007) that the low-rank representation found by SCL indeed decreases the distance between the distributions in the two domains. However, SCL does not directly try to find a representation $ Z$ that minimizes the distance between $ P_s(Z, Y)$ and $ P_t(Z, Y)$. Instead, SCL tries to find a representation that works well for many related classification tasks for which labels are available in both the source and the target domains. The assumption is that if a representation $ Z$ gives good performance for these related classification tasks in both domains, then $ Z$ is also a good representation for the main classification task of interest in both domains. The core algorithm in SCL is from (Ando and Zhang, 2005).
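
The following sketch outlines the core steps of SCL under several simplifying assumptions: the pivot features are taken as given, and a ridge-regularized linear predictor is used as a stand-in for the linear pivot predictors of the original method.

    import numpy as np

    def scl_projection(X_unlabeled, pivot_idx, k):
        """Learn a low-rank projection from pivot-prediction weight vectors.

        X_unlabeled: (n, d) feature matrix of unlabeled data.
        pivot_idx:   indices of the pivot features.
        k:           dimensionality of the learned representation.
        """
        _, d = X_unlabeled.shape
        pivot_set = set(pivot_idx)
        non_pivot = [j for j in range(d) if j not in pivot_set]
        A = X_unlabeled[:, non_pivot]
        W = []
        for p in pivot_idx:
            y = X_unlabeled[:, p]          # target: does the pivot occur?
            # Ridge-regularized least squares as a simple stand-in for the
            # linear classifiers used in the original method.
            w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(len(non_pivot)), A.T @ y)
            W.append(w)
        W = np.array(W).T                  # (|non_pivot|, |pivots|)
        U, _, _ = np.linalg.svd(W, full_matrices=False)
        theta = U[:, :k].T                 # (k, |non_pivot|) projection
        return theta, non_pivot

The learned projection is applied to the non-pivot part of each instance, and the resulting low-dimensional features are appended to the original representation before the supervised classifier is trained on the labeled source data.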

