5 Change of Representation

As has been pointed out, the cause of the domain adaptation problem is the difference between $P_s(X, Y)$ and $P_t(X, Y)$. Note that while the representation of $Y$ is fixed, the representation of $X$ can change if we use different features. Such a change of representation of $X$ can affect both the marginal distribution $P(X)$ and the conditional distribution $P(Y|X)$. One can assume that under some change of representation of $X$, $P_s(X, Y)$ and $P_t(X, Y)$ will become the same.

Formally, let $g : \mathcal{X} \rightarrow \mathcal{Z}$ denote a transformation function that transforms an observation represented in the original form $x$ into another form $z = g(x)$. Define a variable $Z$ and an induced distribution of $Z$ that satisfies $P(Z = z) = \sum_{x : g(x) = z} P(X = x)$. The joint distribution of $Z$ and $Y$ is then
$$P(Z = z, Y = y) = \sum_{x : g(x) = z} P(X = x, Y = y).$$
If we can find a transformation function $g$ such that under this transformation we have $P_s(Z, Y) = P_t(Z, Y)$, then we no longer have the domain adaptation problem, because the two domains have the same joint distribution of the observation and the class label. The optimal model we learn to approximate $P_s(Y|Z)$ is still optimal for $P_t(Y|Z)$.
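The induced joint distribution above can be illustrated on a small discrete example. The following sketch uses hypothetical toy numbers (not taken from any paper): the source and target joint distributions over $X$ and $Y$ differ, but a transformation $g$ that merges values of $X$ makes the induced joint distributions of $Z$ and $Y$ coincide.

```python
from collections import defaultdict

# Toy joint distributions P_s(X, Y) and P_t(X, Y) over X in {0,1,2,3}, Y in {0,1}.
# The probabilities are illustrative assumptions, chosen so that g collapses the gap.
P_s = {(0, 0): 0.2, (1, 0): 0.1, (2, 1): 0.3, (3, 1): 0.4}
P_t = {(0, 0): 0.1, (1, 0): 0.2, (2, 1): 0.4, (3, 1): 0.3}

def g(x):
    """Hypothetical transformation: merge x=0,1 into z=0 and x=2,3 into z=1."""
    return x // 2

def induced_joint(P, g):
    """Compute P(Z=z, Y=y) = sum over {x : g(x) = z} of P(X=x, Y=y)."""
    Q = defaultdict(float)
    for (x, y), p in P.items():
        Q[(g(x), y)] += p
    return dict(Q)

Q_s = induced_joint(P_s, g)
Q_t = induced_joint(P_t, g)
print(Q_s == Q_t)  # True: under g, the two domains have the same joint distribution
```

A classifier for $Y$ learned from the source domain in the $Z$ representation would therefore transfer to the target domain without any distribution mismatch.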

Note that with a change of representation, the entropy of $Y$ conditional on $Z$ is likely to increase from the entropy of $Y$ conditional on $X$, because $Z$ is usually a simpler representation of the observation than $X$, and thus encodes less information. In other words, the Bayes error rate usually increases under a change of representation. Therefore, the criteria for good transformation functions include not only the distance between the induced distributions $P_s(Z, Y)$ and $P_t(Z, Y)$ but also the amount of increase of the Bayes error rate.
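The information loss can be made concrete by computing conditional entropies on a toy distribution (again with illustrative numbers): below, $Y$ is fully determined by $X$, but after a coarsening transformation $Z = g(X)$, the conditional entropy of $Y$ rises from 0 to 1 bit.

```python
import math
from collections import defaultdict

# Toy joint distribution P(X, Y) in which Y is deterministic given X.
P = {(0, 0): 0.25, (1, 1): 0.25, (2, 0): 0.25, (3, 1): 0.25}

def g(x):
    return x // 2  # hypothetical coarsening: z retains less about x

def cond_entropy(joint):
    """H(Y | V) in bits for a joint distribution given as {(v, y): p}."""
    marg = defaultdict(float)
    for (v, y), p in joint.items():
        marg[v] += p
    h = 0.0
    for (v, y), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / marg[v])
    return h

# Induce the joint distribution of Z and Y.
P_z = defaultdict(float)
for (x, y), p in P.items():
    P_z[(g(x), y)] += p

print(cond_entropy(P))          # 0.0: Y is fully determined by X
print(cond_entropy(dict(P_z)))  # 1.0: under Z, Y is a coin flip
```

Since the Bayes error rate is bounded in terms of the conditional entropy, this jump illustrates why a transformation must be judged on the error it introduces as well as on the distribution distance it removes.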

Ben-David et al. (2007) first formally analyzed the effect of representation change for domain adaptation. They proved a generalization bound for domain adaptation that depends on the distance between the induced $P_s(Z)$ and $P_t(Z)$.
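To give a feel for the quantity the bound depends on, the sketch below measures an $L_1$ distance between induced marginals on a discrete toy example. This is a simplification: Ben-David et al. (2007) use a classifier-based distance, not $L_1$, and the distributions here are made-up numbers. The point is only that a transformation can shrink the distance between the induced marginals.

```python
# Hypothetical toy marginals over X in the source and target domains.
P_s_x = {0: 0.2, 1: 0.1, 2: 0.3, 3: 0.4}
P_t_x = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

def l1_distance(P, Q):
    """L1 distance between two discrete distributions given as dicts."""
    keys = set(P) | set(Q)
    return sum(abs(P.get(k, 0.0) - Q.get(k, 0.0)) for k in keys)

def induce(P, g):
    """Marginal of Z = g(X): P(Z=z) = sum over {x : g(x) = z} of P(X=x)."""
    Q = {}
    for x, p in P.items():
        Q[g(x)] = Q.get(g(x), 0.0) + p
    return Q

g = lambda x: x // 2  # hypothetical transformation merging {0,1} and {2,3}

d_x = l1_distance(P_s_x, P_t_x)                        # distance before transforming
d_z = l1_distance(induce(P_s_x, g), induce(P_t_x, g))  # distance after transforming
print(d_x, d_z)  # d_z is (numerically) zero: the induced marginals coincide
```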

A special and simple kind of transformation is feature subset selection. Satpal and Sarawagi (2007) proposed a feature subset selection method for domain adaptation, where the criterion for selecting features is to minimize an approximated distance function between the distributions in the two domains. Note that to measure the distance between $P_s(Z, Y)$ and $P_t(Z, Y)$, we still need class labels in the target domain. To solve this problem, Satpal and Sarawagi (2007) use predicted labels for the target domain instances.
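A minimal sketch of distance-driven feature selection is given below. It is not the Satpal and Sarawagi (2007) procedure: their method uses predicted target labels and an approximated conditional distance, whereas this toy version only compares per-feature means across the two domains and keeps the $k$ features with the smallest cross-domain gap. All data and the gap measure are illustrative assumptions.

```python
def feature_gap(src_rows, tgt_rows, j):
    """|mean of feature j on source - mean on target|: a crude per-feature distance."""
    ms = sum(r[j] for r in src_rows) / len(src_rows)
    mt = sum(r[j] for r in tgt_rows) / len(tgt_rows)
    return abs(ms - mt)

def select_features(src_rows, tgt_rows, k):
    """Return indices of the k features whose marginals differ least across domains."""
    n_feat = len(src_rows[0])
    ranked = sorted(range(n_feat), key=lambda j: feature_gap(src_rows, tgt_rows, j))
    return sorted(ranked[:k])

# Toy binary feature matrices: feature 1 is always on in the source
# and always off in the target, so it is the one dropped.
src = [[1, 1, 0], [1, 1, 0], [0, 1, 1], [1, 1, 1]]
tgt = [[1, 0, 0], [0, 0, 1], [1, 0, 1], [1, 0, 0]]
print(select_features(src, tgt, 2))  # [0, 2]
```

Dropping the domain-specific feature shrinks the distance between the induced distributions, at the cost of whatever predictive information that feature carried.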

Blitzer et al. (2006) proposed a structural correspondence learning (SCL) algorithm that makes use of the unlabeled data from the target domain to find a low-rank representation that is suitable for domain adaptation. It is empirically shown in (Ben-David et al., 2007) that the low-rank representation found by SCL indeed decreases the distance between the distributions in the two domains. However, SCL does not directly try to find a representation $Z$ that minimizes the distance between $P_s(Z, Y)$ and $P_t(Z, Y)$. Instead, SCL tries to find a representation that works well for many related classification tasks for which labels are available in both the source and the target domains. The assumption is that if a representation gives good performance for these many related classification tasks in both domains, then it is also a good representation for the main classification task of interest in both domains. The core algorithm in SCL comes from (Ando and Zhang, 2005).
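The SCL idea can be caricatured in a few lines. The sketch below is a rough simplification, not the actual algorithm of Blitzer et al. (2006): the "related tasks" are predictions of pivot features from the remaining features on pooled unlabeled data, least squares stands in for the loss they actually use, and the pivot indices, data, and dimensionality are all arbitrary assumptions. The low-rank basis of the learned weight vectors then serves as the shared representation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((200, 20)) < 0.3).astype(float)  # unlabeled data from both domains, pooled
pivots = [0, 1, 2]                               # hypothetical pivot feature indices
rest = [j for j in range(20) if j not in pivots]

# One auxiliary task per pivot: predict the pivot from the remaining features.
W = []
for p in pivots:
    A, y = X[:, rest], X[:, p]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares stand-in predictor
    W.append(w)
W = np.array(W).T                                # shape: (n_rest_features, n_pivots)

# Low-rank basis of the weight vectors gives the shared representation.
U, _, _ = np.linalg.svd(W, full_matrices=False)
theta = U[:, :2]                                 # keep a 2-dimensional projection
Z = X[:, rest] @ theta                           # new features for every instance
print(Z.shape)                                   # (200, 2)
```

Because the pivot predictors are trained on unlabeled data from both domains, the directions in `theta` capture feature correspondences that hold across domains, which is why the projected features can help a classifier trained only on source labels.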