As has been pointed out, the cause of the domain adaptation problem is the difference between $P_t(X, Y)$ and $P_s(X, Y)$. Note that while the representation of $Y$ is fixed, the representation of $X$ can change if we use different features. Such a change of representation of $X$ can affect both the marginal distribution $P(X)$ and the conditional distribution $P(Y|X)$. One can assume that under some change of representation of $X$, $P_t(X, Y)$ and $P_s(X, Y)$ will become the same.
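Formally, let $g: \mathcal{X} \rightarrow \mathcal{Z}$ denote a transformation function that transforms an observation $x$ represented in the original form into another form $z = g(x) \in \mathcal{Z}$. Define variable $Z$ and an induced distribution of $Z$ that satisfies $P(z) = \sum_{x \in \mathcal{X},\, g(x) = z} P(x)$. The joint distribution of $Z$ and $Y$ is then

$$P(z, y) = \sum_{x \in \mathcal{X},\, g(x) = z} P(x, y).$$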
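
The induced joint distribution can be made concrete with a small discrete example: the probability of a transformed value together with a label is the sum of the original joint probabilities over every observation that the transformation maps to that value. The sketch below assumes a finite observation space stored as a dictionary; the toy distribution, the labels, and the helper names `induced_joint` and `drop_word` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def induced_joint(p_xy, g):
    """Return P(z, y), the sum of P(x, y) over all x with g(x) == z."""
    p_zy = defaultdict(float)
    for (x, y), prob in p_xy.items():
        p_zy[(g(x), y)] += prob
    return dict(p_zy)


# Hypothetical toy joint distribution P(x, y).
# Each observation x is a (word, is_capitalized) pair; y is a class label.
p_xy = {
    (("bank", 1), "ORG"): 0.25,
    (("bank", 0), "O"): 0.25,
    (("river", 0), "O"): 0.375,
    (("river", 1), "ORG"): 0.125,
}


def drop_word(x):
    """Transformation g: keep only the capitalization feature."""
    return x[1]


p_zy = induced_joint(p_xy, drop_word)
print(p_zy)
```

Because the transformation only merges observations, the induced probabilities still sum to one.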
Note that with a change of representation, the entropy of $Y$ conditional on $Z$ is likely to increase from the entropy of $Y$ conditional on $X$, because $Z$ is usually a simpler representation of the observation than $X$, and thus encodes less information. In other words, the Bayes error rate usually increases under a change of representation. Therefore, the criteria for good transformation functions include not only the distance between the induced distributions $P_t(Z, Y)$ and $P_s(Z, Y)$ but also the amount by which the Bayes error rate increases.
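
The cost of a simpler representation can be checked numerically on a toy problem. The sketch below compares the conditional entropy of the label and the Bayes error rate before and after two transformations. It assumes a discrete joint distribution stored as a dictionary; the distribution and the function names are hypothetical. In this example the label depends only on the second feature, so dropping the first feature costs nothing while dropping the second removes all label information.

```python
import math
from collections import defaultdict


def induce(p_xy, g):
    """Joint distribution of (g(x), y) induced by the transformation g."""
    p_zy = defaultdict(float)
    for (x, y), p in p_xy.items():
        p_zy[(g(x), y)] += p
    return dict(p_zy)


def conditional_entropy(p_joint):
    """H(Y | V) in bits for a joint distribution given as {(v, y): prob}."""
    p_v = defaultdict(float)
    for (v, _), p in p_joint.items():
        p_v[v] += p
    return sum(p * math.log2(p_v[v] / p) for (v, _), p in p_joint.items() if p > 0)


def bayes_error(p_joint):
    """Error of the optimal classifier: 1 - sum over v of max_y P(v, y)."""
    best = defaultdict(float)
    for (v, _), p in p_joint.items():
        best[v] = max(best[v], p)
    return 1.0 - sum(best.values())


# Hypothetical toy distribution: x = (a, b), and the label is determined by b alone.
p_xy = {
    ((0, 0), "neg"): 0.25,
    ((0, 1), "pos"): 0.25,
    ((1, 0), "neg"): 0.25,
    ((1, 1), "pos"): 0.25,
}

keep_a = induce(p_xy, lambda x: x[0])  # discards the informative feature
keep_b = induce(p_xy, lambda x: x[1])  # discards the uninformative feature

h_x, err_x = conditional_entropy(p_xy), bayes_error(p_xy)
h_a, err_a = conditional_entropy(keep_a), bayes_error(keep_a)
h_b, err_b = conditional_entropy(keep_b), bayes_error(keep_b)
print(h_x, err_x, h_a, err_a, h_b, err_b)
```

A transformation like `keep_b` is the desirable case: it simplifies the representation without raising the Bayes error rate.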
Ben-David et al. (2007) first formally analyzed the effect of representation change for domain adaptation. They proved a generalization bound for domain adaptation that is dependent on the distance between the induced source-domain and target-domain distributions.
A special and simple kind of transformation is feature subset selection. Satpal and Sarawagi (2007) proposed a feature subset selection method for domain adaptation, where the criterion for selecting features is to minimize an approximated distance function between the distributions in the two domains. Note that to measure the distance between $P_s(Z, Y)$ and $P_t(Z, Y)$, we still need class labels in the target domain. To solve this problem, Satpal and Sarawagi (2007) use predicted labels for the target domain instances.
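
The general idea of distance-based feature selection with predicted target labels can be sketched as follows. This is not the method of Satpal and Sarawagi (2007), whose distance function is not reproduced here; it is a simplified stand-in that scores each binary feature independently by how much its empirical joint frequency with each label differs between the two domains, and keeps the features with the smallest difference. The data, the function names, and the assumption that target labels were predicted by some source-trained classifier are all hypothetical.

```python
def feature_distance(source, target, j, labels):
    """Sum over labels of |P_s(x_j = 1, y) - P_t(x_j = 1, y)| (empirical estimates)."""
    def joint_rate(data, y):
        return sum(1 for x, lab in data if lab == y and x[j] == 1) / len(data)

    return sum(abs(joint_rate(source, y) - joint_rate(target, y)) for y in labels)


def select_features(source, target_pred, num_features, labels, k):
    """Keep the k features whose joint statistics differ least across domains."""
    scored = sorted(
        (feature_distance(source, target_pred, j, labels), j)
        for j in range(num_features)
    )
    return sorted(j for _, j in scored[:k])


# Source instances with true labels.
source = [
    ((1, 0, 1), "pos"),
    ((1, 1, 1), "pos"),
    ((0, 1, 1), "neg"),
    ((0, 0, 1), "neg"),
]
# Target instances with labels PREDICTED by a source-trained model (assumed given).
target_pred = [
    ((1, 0, 0), "pos"),
    ((1, 1, 0), "pos"),
    ((0, 1, 0), "neg"),
    ((0, 0, 0), "neg"),
]

selected = select_features(source, target_pred, num_features=3,
                           labels=["pos", "neg"], k=2)
print(selected)
```

Feature 2 fires on every source instance and on no target instance, so it is the one dropped. Scoring features one at a time ignores interactions between them, which a distance defined on the full joint distribution would capture.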
Blitzer et al. (2006) proposed a structural correspondence learning (SCL) algorithm that makes use of the unlabeled data from the target domain to find a low-rank representation that is suitable for domain adaptation. Ben-David et al. (2007) showed empirically that the low-rank representation found by SCL indeed decreases the distance between the distributions in the two domains. However, SCL does not directly try to find a representation $Z$ that minimizes the distance between $P_s(Z, Y)$ and $P_t(Z, Y)$. Instead, SCL tries to find a representation that works well for many related classification tasks for which labels are available in both the source and the target domains. The assumption is that if a representation $Z$ gives good performance for the many related classification tasks in both domains, then $Z$ is also a good representation for the main classification task of interest in both domains. The core algorithm in SCL is from (Ando and Zhang, 2005).
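
A rough sketch of how a shared low-rank representation can be learned from many related tasks is given below. It assumes the related tasks are predicting the presence of a few chosen "pivot" features from the remaining features, using unlabeled data pooled from both domains, and it fits each predictor by ordinary least squares purely for brevity, which is a substitution and not the loss used in the original work. The random data, the pivot indices, and the function names `scl_projection` and `augment` are hypothetical.

```python
import numpy as np


def scl_projection(X_unlabeled, pivot_idx, h):
    """Learn an h-dimensional projection from pivot-prediction tasks.

    X_unlabeled: (n, d) binary feature matrix pooled from both domains.
    pivot_idx:   indices of the pivot features, one auxiliary task per pivot.
    """
    pivots = set(pivot_idx)
    nonpivot_idx = [j for j in range(X_unlabeled.shape[1]) if j not in pivots]
    A = X_unlabeled[:, nonpivot_idx]            # inputs: non-pivot features
    B = 2.0 * X_unlabeled[:, pivot_idx] - 1.0   # targets: pivot present (+1) or absent (-1)
    # One linear predictor per pivot; each column of W is a weight vector.
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    # The top left singular vectors of W span the shared predictive subspace.
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :h].T                          # shape (h, number of non-pivot features)
    return theta, nonpivot_idx


def augment(X, theta, nonpivot_idx):
    """Append the low-rank features to the original feature vectors."""
    return np.hstack([X, X[:, nonpivot_idx] @ theta.T])


# Hypothetical unlabeled data: 40 instances, 8 binary features.
rng = np.random.default_rng(0)
X = (rng.random((40, 8)) < 0.4).astype(float)

theta, nonpivot_idx = scl_projection(X, pivot_idx=[0, 1, 2], h=2)
X_aug = augment(X, theta, nonpivot_idx)
print(theta.shape, X_aug.shape)
```

A classifier for the main task would then be trained on the augmented source-domain features and applied to target-domain instances augmented with the same projection.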