
770 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 6, JUNE 2010

Bridging Domains Using World Wide Knowledge for Transfer Learning

Evan Wei Xiang, Bin Cao, Derek Hao Hu, and Qiang Yang, Fellow, IEEE

Abstract—A major problem of classification learning is the lack of ground-truth labeled data. It is usually expensive to label new data
instances for training a model. To solve this problem, domain adaptation in transfer learning has been proposed to classify target
domain data by using some other source domain data, even when the data may have different distributions. However, domain
adaptation may not work well when the differences between the source and target domains are large. In this paper, we design a novel
transfer learning approach, called BIG (Bridging Information Gap), to effectively extract useful knowledge in a worldwide knowledge
base, which is then used to link the source and target domains for improving the classification performance. BIG works when the
source and target domains share the same feature space but different underlying data distributions. Using the auxiliary source data, we
can extract a “bridge” that allows cross-domain text classification problems to be solved using standard semisupervised learning
algorithms. A major contribution of our work is that with BIG, a large amount of worldwide knowledge can be easily adapted and used
for learning in the target domain. We conduct experiments on several real-world cross-domain text classification tasks and
demonstrate that our proposed approach can outperform several existing domain adaptation approaches significantly.

Index Terms—Data mining, transfer learning, cross-domain, text classification, Wikipedia.

1 INTRODUCTION

Text classification, which aims to assign a document to one or more categories based on its content, is a fundamental task for Web and document data mining applications, ranging from information retrieval, spam detection, and online advertisement to Web search. Traditional supervised learning approaches for text classification require sufficient labeled instances in a problem domain in order to train a high-quality model. However, it is not always easy or feasible to obtain new labeled data in a domain of interest (hereafter referred to as the target domain). The lack of labeled data can seriously hurt classification performance in many real-world applications.

To solve this problem, transfer learning techniques, in particular domain adaptation techniques in transfer learning, have been introduced: they capture the shared knowledge from some related domains where labeled data are available, and use this knowledge to improve the performance of data mining tasks in a target domain. In transfer learning terminology, one or more auxiliary domains are identified as the source of knowledge transfer, and the domain of interest is known as the target domain. Much effort has been dedicated to this problem in recent years in machine learning, data mining, and information retrieval [1], [2], [3], [4], [5].

However, transfer learning may not work well when the difference between the source and target domains is large. In particular, when the distribution gap between the source and target domains is large, transfer learning can hardly be used to benefit learning in the target domain [6]. For example, when we use some financial documents as the source domain and information technology documents as the target domain, the differences are so large that the performance in the target domain may decrease. Another problem arises when the source and target domains have a large divergence in feature space; for example, the source data might be written for one audience and the target data for another. In these situations, traditional transfer learning might not work well.

We note that the above difficulty is caused by a so-called information gap in domain adaptation tasks, which essentially is a combination of feature space and data distribution differences between the training and test data. Since previous domain adaptation methods only focused on the data in the source and target domains, they may fail to connect the related but indirectly shared parts of the domains. Our observation is that such a gap can potentially be found and bridged using knowledge from other domains. To solve this problem, we introduce a bridge between the two different domains by leveraging additional knowledge sources that are readily available and have wide coverage in scope. This knowledge source can be a third domain, such as Wikipedia or the Open Directory Project (ODP). For example, the connection between commutative algebra and geometry can be found through a large knowledge base on algebraic geometry topics. Once we find such a knowledge bridge, we can use the auxiliary data and semisupervised learning methods to fill in the information gap.

In this paper, we apply semisupervised learning (SSL) to domain adaptation problems based on the use of the auxiliary data. We take the labeled data from the source domain and the unlabeled data from the target domain, as well as an auxiliary data source such as Wikipedia. We then apply

The authors are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong. E-mail: {wxiang, caobin, derekhh, qyang}@cse.ust.hk.
Manuscript received 31 Mar. 2009; revised 22 Sept. 2009; accepted 30 Oct. 2009; published online 4 Feb. 2010.
Recommended for acceptance by C. Zhang, P.S. Yu, and D. Bell.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDESI-2009-03-0256.
Digital Object Identifier no. 10.1109/TKDE.2010.31.
1041-4347/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY MADRAS. Downloaded on July 29, 2010 at 13:16:22 UTC from IEEE Xplore. Restrictions apply.

SSL to utilize the information contained in the unlabeled data to help the classification task on the target data.

Although domain adaptation (DA) and SSL share similar problem settings, directly using SSL to solve DA problems may result in poor performance, as validated in [3], due to the fact that SSL assumes that the distribution of the unlabeled data is similar to that of the labeled data. However, the existence of the information gap can make this assumption invalid. Using our approach, we show that with the extracted bridge for filling in the information gap, SSL-based algorithms can be applied successfully to the classification problems.

This paper introduces a novel domain adaptation algorithm called BIG (Bridging Information Gap). Our BIG algorithm requires that the source domain and the target domain share the same feature space, but the distributions between domains can be highly different. A major contribution of our work is that we make use of a large amount of worldwide knowledge to build a bridge for linking the source and target domains, even when their distribution differences are large. We conduct thorough experiments on several real-world cross-domain text classification tasks, and demonstrate that our proposed approach can outperform a number of existing approaches for classification, including nontransfer learning approaches, traditional domain adaptation approaches that do not use the auxiliary data, and approaches that use the auxiliary data in a naive manner. We show that, especially in situations when the source and target domains are far away from each other, our approach can outperform the baseline methods significantly.

The remainder of the paper is organized as follows: In Section 2, we first survey related works in domain adaptation. In Section 3, we define the concept of information gap and propose our algorithm BIG for filling in the information gap. In Section 4, we demonstrate its effectiveness through several cross-domain text classification tasks. In Section 5, we conclude the paper and discuss some directions for future research.

2 RELATED WORK

In this section, we briefly review some previously proposed methods for solving the task of domain adaptation. Since Wikipedia is used as our auxiliary data for building the information bridge, we also review some methods that extract useful knowledge from Wikipedia and other similar knowledge bases such as ODP. Finally, since our approach to the domain adaptation problem is highly related to semisupervised learning, we also briefly review this topic.

2.1 Domain Adaptation

Domain adaptation has attracted more and more attention in recent years. In general, previous domain adaptation approaches can be classified into two categories [7]: instance-based approaches [1], [2] or feature-based approaches [4], [5], [8], [9]. Instance-based methods try to seek some reweighting strategies on the source data, such that the source distribution can match the target distribution. Feature-based methods try to discover a shared feature space on which the distributions of different domains are pulled closer. Both types try to discover the relation between the source and target domains within the scope of the two domains. For example, instance-based transfer learning models assume that there is a subset of instances sharing similar distributions in different domains, and then emphasize the impact of these data in the models since they are more "similar." Feature-based domain adaptation models assume that different domains may share some features, for instance, a subset of explicit features or implicit features.

Here, we consider some well-known instance-based domain adaptation methods. Jiang et al. [2] use an instance weighting method for natural language processing, a type of importance sampling method for solving sample selection bias problems [10]. Dai et al. [1] propose a boosting-style reweighting method and provide different weighting schemes for data in different domains. Other feature-based methods have been developed and compared to instance-based methods. Daume et al. [11] propose a simple feature augmentation method for NLP tasks. Blitzer et al. [12] use the Structural Correspondence Learning (SCL) model to identify correspondences among features from different domains by modeling their correlations with pivot features that behave in the same way for discriminative learning; they choose the pivot features that are used to bridge the two domains. Lee et al. [13] use transfer learning on an ensemble of related tasks to construct an informative prior on feature relevance. They assume that features themselves have metafeatures that are predictive of their relevance to the prediction task, and model their relevance as a function of the metafeatures. Raina et al. [5] describe an approach to self-taught learning that uses sparse coding to construct high-level features from the unlabeled data: they first express each unlabeled data instance as a sparse weighted linear combination of basis vectors under an L1-norm penalty, and then use these features as input to standard supervised classification algorithms. In [9], a co-clustering-based classification algorithm called CoCC is proposed to classify out-of-domain documents. The class structure is passed through word clusters from the in-domain data to the out-of-domain data; additional class-label information given by the in-domain data is extracted and used for labeling the word clusters for out-of-domain documents. However, one drawback of many previous works is that the knowledge shared between the source and target domains sometimes may be quite limited, so the relation between the two domains cannot be fully exploited.

2.2 Data Mining with Online Knowledge Repositories

A major component of our approach is to use online knowledge repositories as auxiliary information sources to help bridge the gap between the source domain and the target domain. Therefore, we review some recent approaches to data mining with online knowledge repositories.

In recent years, understanding and using online knowledge repositories to aid real-world data mining tasks has become a hot research topic. More and more works try to use Wikipedia for feature enrichment. Gabrilovich and Markovitch [14], [15] try to use the Open Directory Project (ODP) for feature enrichment in the text


classification problem. They also show that using Wikipedia as the external Web knowledge resource for feature enrichment performs better than using ODP [16]. Gabrilovich and Markovitch [18] try to explicitly represent the meaning of texts in terms of a weighted vector of Wikipedia-based concepts. Their semantic analysis is explicit in the sense that they manipulate manifest concepts grounded in human cognition, rather than the latent concepts used by LSA. In [19], a general framework is proposed for building classifiers with hidden topics discovered from large-scale data collections. The framework is mainly based on latent topic analysis models like PLSA [20] and LDA [21], and on machine learning methods like maximum entropy and SVMs. The underlying idea of such a framework is that for each classification task, a very large external data collection, called a "universal data set," is collected; a classification model is then built on both a small set of labeled training data and a rich set of hidden topics discovered from that data collection. Currently, few previous approaches have used auxiliary knowledge, such as online knowledge databases, for transfer learning or domain adaptation. Wang et al. make an extension [22], [23] to the feature-based transfer learning models by incorporating a semantic kernel [17], [24] learned from Wikipedia. However, building a semantic kernel from the whole knowledge base is costly; a huge cost can be saved by considering only the "most useful" concepts to bridge the information gap. Moreover, instance-based transfer, and especially our method, adds more interpretability to the transfer scheme: it is easy to study "what kind" of instances are useful for bridging the gap, in contrast to the more compact and abstract semantic kernel. In this paper, we propose to incorporate the background knowledge efficiently from the instance-based transfer perspective, which can also establish a connection between transfer learning problems and traditional semisupervised learning problems.

Another collection of related works is by Zelikovitz and Hirsh [25], [26]. These works use some unlabeled data from a background knowledge document collection to enhance document similarities [10] or dimensionality reduction [26]. However, their target is to reduce data sparsity, while we focus on connecting different domains that are far apart in the problem of transfer learning.

2.3 Semisupervised Learning

Domain adaptation could be viewed as transductive transfer learning if the source domain and the target domain had no information gap; in this case, the problem can be reduced to a semisupervised learning problem. However, when there is an information gap, how to exploit semisupervised learning is not clear. In this section, we first review some semisupervised learning research. Semisupervised learning addresses the problem when the labeled data are too few to build a good classifier; it makes use of a large amount of unlabeled data, together with a small amount of labeled data, to enhance the classifiers. Many semisupervised learning algorithms have been developed in the past 10 years. Some notable models include co-training, transductive SVM, and graph-based regularization methods like mincut [27]. Co-training [28] assumes that the features can be split into two sets, each subset sufficient to train a good classifier; unlabeled data in co-training help to reduce the size of the version space. Transductive SVM [29] builds the connection between class distribution and decision boundary by putting the boundary in low-density regions. The goal is to find a labeling of the unlabeled data such that a linear boundary has the maximum margin on both the original labeled data and the unlabeled data; it can be viewed as an SVM with an additional regularization term on the unlabeled data. Graph-based semisupervised methods define a graph where the nodes are labeled and unlabeled examples and the edges reflect the similarities between examples. Blum and Chawla [27] view semisupervised learning as a graph mincut problem, where positive labels act as sources, negative labels act as sinks, and the mincut problem aims to find a minimum set of edges whose removal would block all flow from sources to sinks; nodes connected to the sources are labeled as positive and those connected to the sinks as negative. In this work, we will exploit the ability of semisupervised learning to aid the problem of domain adaptation.

3 METHODOLOGY

3.1 Preliminaries

In this section, we provide several definitions to clarify our terminology and present our analysis of the problem of domain adaptation.

Definition 1 (Feature Space). A feature space is an abstract space where each data instance is represented as a point in an n-dimensional space. Its dimension is determined by the number of features used to describe the instances.

In the problem of text classification, the most commonly used feature space is the term space used in the vector space model [30]. The term space is usually of very high dimension. For example, in the 20Newsgroup data set,¹ the dimension of the vector space model is over 40,000. To address the high dimensionality issue, topic models were introduced for text modeling [20]; documents can then be represented in a low-dimensional topic space. However, one disadvantage of the topic space is that the topics are hidden and it is not easy to obtain their semantic meanings. Therefore, the space of Wikipedia concepts was recently introduced as the feature space for text modeling [18]. Due to the rich background knowledge in the Wikipedia concept space, even short texts can be accurately modeled [18].

1. http://people.csail.mit.edu/jrennie/20Newsgroups/.

Definition 2 (Domain). A domain D is a probabilistic distribution $P_D$ over the data instances in a feature space.

Let x represent the feature vector of one data instance in the feature space. When we say a data instance x is in the domain, $x \in D$, we mean that it is sampled from the distribution $P_D$. A data set in the domain D is a set of data instances sampled from $P_D$. We slightly overload the notation D to represent both the domain and the data set in the domain.


In traditional supervised learning problems, we are given a set of labeled instances from a specific domain as a training set. A learning machine is trained on the training set and is then applied to newly incoming instances from the same domain to obtain their labels. The condition that the training and test sets are drawn from the same domain guarantees the consistency and generalization ability of the learning machine [31]. However, in practice we may not be able to ensure that the training and test sets are from the same domain.

Definition 3 (Domain Adaptation). In this paper, we refer to domain adaptation as the following problem: We are given a set of labeled instances $D_l^s = \{(x_i, y_i)\}_{i=1}^{N} \in D^s \times \{\pm 1\}$ from a source domain, and we need to make predictions for some unlabeled data $D_u^t = \{x_j\}_{j=1}^{M} \in D^t$ from a target domain. The source domain and the target domain are different domains in the same feature space.

Such a problem setting for domain adaptation is quite common in a variety of applications, such as text categorization [1], image classification [5], sentiment analysis [12], localization [32], [33], activity recognition [34], and so on. Since all data instances in the source domain are labeled and all data instances in the target domain are unlabeled, we can drop the subscripts in the notations $D_l^s$ and $D_u^t$ and use $D^s$ and $D^t$ without introducing any ambiguity.

The problem of domain adaptation is likely to be encountered in many real-world applications. For example, we may have trained a sentiment classifier for reviews on movies, but we want to use it to classify reviews from other domains such as books or music [12]. Another example is that we may have trained a classifier to classify news into topical categories, but we want to use it also on blogs. In these cases, we do not want to relabel the data in the new domains, but hope to borrow the knowledge from the old domains.

When the differences between the source and target domains are large, the model trained on the source domains cannot generalize well to the target domain data [2], [9]. A natural approach to follow is to consider transductive (or semisupervised) learning, since unlabeled data from the target domain are available. However, some previous works have found that after introducing some unlabeled data from the target domain, transductive learning is still not sufficient to improve the performance [1]. The reason may be that transductive or semisupervised learning generally assumes that the decision boundary lies in the low-density region of the feature space.² When the distributions of the source and target domains are different, there may exist a low-density region between the domains, a gap that disconnects the same-class data in different domains. We refer to this gap as the information gap in domain adaptation. For example, Fig. 1 illustrates a case where the feature space is the Wikipedia concept space; from Fig. 1, we can see that there exist some information gaps between the source domain and the target domains.

2. Some semisupervised learning algorithms have slightly different assumptions.

Fig. 1. Measuring similarities between documents based on the knowledge base.

To solve the problem of domain adaptation under large information gaps, an intuitive idea is to find the knowledge shared between the domains and ignore the differences. One instantiation of this idea is to make use of the abundant and potentially useful information sources that are around, and use them to connect the information separated by the gap. Such an intuition motivates us to think of a different way of solving the domain adaptation problem, i.e., through finding an information bridge.

3.2 Margin as Information Gap

An intuitive way to understand the concept of information gap is to consider the separability of the source and target domains. Consider the simplest case where we want to transfer knowledge from a single source domain to a target domain. Intuitively, the difficulty of separating these domains shows how large the information gap between them is. If the two domains can be easily separated, then there exists a large information gap between them, which may prevent adapting the model learned on the source to the target domain. On the contrary, if the two domains cannot be separated from each other easily, then the information gap is small, in which case we can treat the two domains as data sampled from a single underlying distribution. In other words, the original "domain adaptation problem" is transformed into a classification problem under the supervised setting or a semisupervised (transductive) setting. A similar idea is used in [6], where a classifier is trained to distinguish the source and target domains and the classification error is used as an empirical estimate of the domain distance. Although this idea is useful, it does not consider the existence of auxiliary information sources that can be used to bridge the two domains.

Following the above intuitive idea, we use $G(D^s, D^t; K)$ to denote the information gap between source domain $D^s$ and target domain $D^t$ when a knowledge base $K$ is available. In the case where no other knowledge is obtainable besides the two domains, we use the notation $G(D^s, D^t)$. Previous works on domain adaptation only focused on modeling $G(D^s, D^t)$ and ignored $K$; in contrast, we model $G(D^s, D^t; K)$ directly in this work.

We first consider the case where $K$ is not available. Since the concept of margin in the SVM can be used to measure the separability between two classes in classification problems, we can define one form of the information gap of two domains, $G(D^s, D^t)$, to be the margin between the source domain and the target domain, when treating them as two classes.


Definition 4 (Information Gap with No Background Knowledge Available). The information gap between $D^s$ and $D^t$ without background knowledge is given by

$$G(D^s, D^t) = \frac{2}{\|w\|}, \qquad (1)$$

$$w = \arg\min \|w\|^2 \quad \text{s.t.} \quad y_i \left(w^T x_i - b\right) \ge 1 \;\; \text{for all } i, \qquad (2)$$

where $x_i \in D^s \cup D^t$, $y_i = 1$ if $x_i \in D^s$ and $y_i = -1$ if $x_i \in D^t$.

Given a knowledge base in the form of a set of auxiliary data $K = \{x'_i\}$ from other domains, the margin between the two domains depends not only on the data from the source and target domains, which are treated as labeled data, but also on the auxiliary data $K = \{x'_i\}$, which are treated as unlabeled data. The information gap can then be defined as the maximum margin under the transductive learning setting.

Definition 5 (Information Gap with Background Knowledge). The information gap between $D^s$ and $D^t$ with background knowledge $K$ is given by

$$G(D^s, D^t; K) = \frac{2}{\|w\|}, \qquad (3)$$

where

$$w = \arg\min \|w\|^2 \quad \text{s.t.} \quad y_i \left(w^T x_i - b\right) \ge 1, \;\; \left|w^T x'_i - b\right| \ge 1 \;\; \text{for all } i. \qquad (4)$$

Given the above definition, reducing the information gap can be expressed as selecting a set of unlabeled data $\{x'_i\}$ from $K$ to minimize the margin. Therefore, it can be formulated as

$$\max_{\{x'_i\} \subseteq K} \left[\min \|w\|^2\right] \quad \text{s.t.} \quad y_i \left(w^T x_i - b\right) \ge 1, \;\; \left|w^T x'_i - b\right| \ge 1 \;\; \text{for all } i. \qquad (5)$$

We can also extend the problem by adding slack variables for inseparable problems, where the optimal boundary is obtained by

$$w = \arg\min_f \left[\|w\|^2 + C \left(\sum_{i=1}^{l} \left(1 - y_i f(x_i)\right)_+ + \sum_{i=l+1}^{n} \left(1 - |f(x'_i)|\right)_+\right)\right], \qquad (6)$$

where $(z)_+ = \max(z, 0)$. However, the above max-min problem is difficult to solve directly: finding the optimal solution is NP-hard and impractical. In the next section, we propose an algorithm that optimizes (5) in a greedy manner.

3.3 BIG: A Min-Margin Algorithm to Reduce the Information Gap

Algorithm 1 shows our proposed solution. Its inputs include the source domain data set $D^s$, the target domain data set $D^t$, and the auxiliary domain data $K$. The output of the algorithm consists of some unlabeled data chosen so that we can apply semisupervised learning algorithms to train a classifier. These unlabeled data carry important information about the distribution between the two domains.

Algorithm 1. BIG: A Greedy Algorithm for Min-Margin Domain Adaptation

Input: Source domain data set $D^s$, target domain data set $D^t$, a knowledge base $K$, terminating threshold $t$, relevance threshold $\delta$
Output: A subset $D$ from $K$
Initialize: Set $D = \emptyset$.
Preprocess: Set $D' = \mathrm{CandSelect}(D^s, D^t, K, \delta)$.
Train an SVM model using $(x_i, y_i)$, where $x_i \in D^s \cup D^t$, $y_i = 1$ if $x_i \in D^s$ and $y_i = -1$ if $x_i \in D^t$.
while $D' \ne \emptyset$ do
  for $j = 1$ to $k$ do
    Select $x_i$ from $D'$ satisfying $|w^T x_i - b| < 1$ by $x_i = \arg\min_{x_i} |w^T x_i - b|$.
    Let $D = D \cup \{x_i\}$, $D' = D' \setminus \{x_i\}$.
  end for
  $w_{old} = w$
  Train a TSVM model using $(x_i, y_i)$ and $D$.
  $\Delta G = \left(\|w\|^2 - \|w_{old}\|^2\right) / \|w_{old}\|^2$
  if $\Delta G \le t$ then
    Output $D$ and exit.
  end if
end while

A comprehensive example is illustrated in Fig. 2, where three domains are shown. Domain A is the source domain, domain B is the target domain, and domain C is an auxiliary domain used to bridge domains A and B. Without any data from domain C, the problem cannot be solved directly using a transductive SVM method, since the information gap is large. In applying the algorithm, some data instances from the auxiliary domain C are selected into the unlabeled data set, which reduces the information gap between the two domains. Finally, when the algorithm converges, some data are selected from domain C; these data can be used to fill the gap between domains A and B.

3.4 Algorithm Convergence

In this section, we show that the BIG algorithm is guaranteed to converge. We first prove several properties of the margin-based information gap. Given Definitions 4 and 5, we show the following properties of the information gap.

Lemma 1 (Nonincreasing Margin Lemma). Given the definition of the information gap, we have

$$G(D^s, D^t) \ge G(D^s, D^t; K).$$

Furthermore, if $K_1 \subseteq K_2$, then

$$G(D^s, D^t; K_1) \ge G(D^s, D^t; K_2).$$

Proof. If the conclusion does not hold, i.e., $G(D^s, D^t) < G(D^s, D^t; K)$, then $G(D^s, D^t)$ is not the maximum margin, which conflicts with the definition of $G(D^s, D^t)$. □
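As a rough numerical sketch of Algorithm 1's greedy loop, the fragment below labels source points +1 and target points -1, trains a linear max-margin separator by subgradient descent on the slack objective (6) (a crude stand-in for a real SVM/TSVM solver), and repeatedly pulls the auxiliary points lying inside the current margin into the unlabeled set. All data, function names, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def train_margin(X, y, U=None, C=1.0, steps=2000, lr=0.01):
    """Subgradient descent on the soft-margin objective of (6):
    ||w||^2 + C * [sum (1 - y_i f(x_i))_+ + sum (1 - |f(u_j)|)_+],
    with f(x) = w.x - b. A crude stand-in for an SVM/TSVM solver."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        f = X @ w - b
        act = y * f < 1                          # active labeled hinges
        gw = 2 * w - C * (y[act] @ X[act])
        gb = C * y[act].sum()
        if U is not None and len(U):
            fu = U @ w - b
            s = np.where(fu >= 0, 1.0, -1.0)
            au = np.abs(fu) < 1                  # active unlabeled hinges
            gw -= C * (s[au] @ U[au])
            gb += C * s[au].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

def big_select(Xs, Xt, K, k=2, rounds=3):
    """Greedy BIG sketch: repeatedly move the auxiliary points inside the
    current margin (|f(x)| < 1) into the unlabeled set and retrain,
    shrinking the margin (i.e., the information gap) between domains."""
    X = np.vstack([Xs, Xt])
    y = np.r_[np.ones(len(Xs)), -np.ones(len(Xt))]
    w, b = train_margin(X, y)
    selected = []
    for _ in range(rounds):
        f = np.abs(K @ w - b)
        inside = [i for i in np.argsort(f)
                  if i not in selected and f[i] < 1]
        if not inside:
            break
        selected.extend(inside[:k])
        w, b = train_margin(X, y, U=K[selected])
    return selected

rng = np.random.default_rng(0)
Xs = rng.normal([-3.0, 0.0], 0.3, (20, 2))       # source domain
Xt = rng.normal([+3.0, 0.0], 0.3, (20, 2))       # target domain
K = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0],   # candidates in the gap
              [8.0, 0.0], [-8.0, 0.0]])              # irrelevant auxiliary points
bridge = big_select(Xs, Xt, K)
```

On this toy data, the auxiliary points placed in the gap between the two clusters are selected as bridge points, while the far-away points are ignored, which is the behavior that Lemma 2 below motivates.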

Fig. 2. An illustration of the BIG algorithm. Domain A is the source domain; Domain B is the target domain. After adding the new auxiliary data during
each iteration, the information gap between domain A and B is reduced. After reducing the information gap, TSVM can successfully solve the domain
adaptation problem.

The above lemma indicates that the information gap between two domains is nonincreasing when more and more knowledge is available. This property is consistent with our intuition. We can further obtain a stronger conclusion, as shown in the following lemma.

Lemma 2 (Decreasing Margin Lemma). We have

$$G(D^s, D^t) > G(D^s, D^t; K),$$

if there exists $x_i \in K$ satisfying $|w^T x_i - b| < 1$. Furthermore, if $K_1 \subseteq K_2$, then

$$G(D^s, D^t; K_1) > G(D^s, D^t; K_2),$$

if there exists $x_i \in K_2$, $x_i \notin K_1$, satisfying $|w^T x_i - b| < 1$.

Proof. We prove the contrapositive of this proposition. Suppose we have

$$G(D^s, D^t; K_1) \le G(D^s, D^t; K_2).$$

Let $w_1$ and $w_2$ be the separating planes for $G(D^s, D^t; K_1)$ and $G(D^s, D^t; K_2)$, respectively. It is easy to validate that $w_2$ is also a separating plane for $K_1$, with margin $2/\|w_2\| = G(D^s, D^t; K_2)$. Therefore, there exists no $x_i$ that satisfies $x_i \notin K_1$ and $|w^T x_i - b| < 1$ but $x_i \in K_2$. □

This lemma shows that selecting the data that fall within the margin can reduce the size of the margin. Algorithm BIG is motivated by this lemma. We can further guarantee its convergence, as shown in the following theorem:

Theorem 1 (Convergence Theorem). The iteration process in Algorithm 1 always increases the objective in (5) until convergence.

The proof of the above theorem follows easily from Lemma 2.

3.5 Selecting Candidates from the Auxiliary Knowledge Base

In this section, we make our algorithm more concrete by considering the data as text documents. Based on our proposed algorithm in the previous section, theoretically we are able to directly retrieve unlabeled documents from the knowledge base to bridge the information gap; however, this requires involving a large auxiliary knowledge base for better coverage, and it is very costly to run the TSVM algorithm to search for the bridging documents over a large knowledge base. Therefore, it is necessary to identify some possibly related subsets as candidates first. For example, in Fig. 2, we are only interested in retrieving those relevant unlabeled documents that are "helpful" for connecting domain A with domain B. An intuitive way to estimate the helpfulness is to check whether an unlabeled document stays "close" to both of the domains; this can be done by calculating the similarity between the documents. The most intuitive idea for identifying the related domain data is to measure the similarity between the domain documents and each article in the knowledge base [18].

However, such a method is computationally expensive, since it requires pairing up these documents with each of the more than one million articles in the knowledge base. To address this problem, we introduce a novel method that unifies the idea of topic models [21] and the knowledge-base document space projection [18]. To this end, we apply Latent Dirichlet Allocation (LDA) [21] to construct a topic space over the knowledge base documents. LDA is a generative graphical model that can be used to model and discover the underlying topic structures of any discrete data such as texts. In LDA, a document $d_m = \{w_{m,n}\}_{n=1}^{N_m}$ is generated by first picking a distribution over topics $\theta_m$ from a Dirichlet distribution $\mathrm{Dir}(\alpha)$, which determines the topic assignment of the document $d_m$. Then, the topic assignment for each word placeholder $[m, n]$ is performed by sampling a particular topic $z_{m,n}$ from the multinomial distribution $\mathrm{Mult}(\theta_m)$. Finally, a particular word $w_{m,n}$ is generated for the word placeholder $[m, n]$ by sampling from the multinomial distribution $\mathrm{Mult}(\phi_{z_{m,n}})$. In order to estimate the parameters $\alpha$ and $\beta$ for LDA, we need to maximize the likelihood of the whole data collection, i.e., the entire knowledge base $K$:

$$p(K \mid \alpha, \beta) = \prod_{d_m \in K} p(d_m \mid \alpha, \beta) = \prod_{d_m \in K} \int p(\theta_m \mid \alpha) \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(w_{m,n} \mid z_{m,n}, \beta)\, p(z_{m,n} \mid \theta_m)\, d\theta_m. \qquad (7)$$
domains. However, in practice, in order to fill in the
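The selection rule behind Lemma 2 and Algorithm BIG keeps only auxiliary points that fall inside the current margin, i.e., those with $|w^T x_i - b| < 1$. The sketch below assumes a fixed linear decision function (w, b) standing in for the TSVM trained at the current iteration; `select_in_margin` is our illustrative name, not a routine from the paper:

```python
import numpy as np

def select_in_margin(X_aux, w, b=0.0, top_k=100):
    """Indices of auxiliary points inside the margin |w^T x - b| < 1,
    ordered from closest to the separating plane outward (cf. Lemma 2)."""
    dist = np.abs(X_aux @ w - b)          # unsigned functional margin of each point
    inside = np.flatnonzero(dist < 1.0)   # only these points can shrink the margin
    return inside[np.argsort(dist[inside])][:top_k]

# toy check: three points at functional margins 0.5, 2.0, and 0.2 from the plane
X = np.array([[0.5], [2.0], [-0.2]])
w = np.array([1.0])
print(select_in_margin(X, w).tolist())  # -> [2, 0]
```

Only points 0 and 2 lie inside the margin; point 2 is closer to the boundary, so it is selected first, mirroring the greedy in-margin selection that drives the iterations of BIG.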

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY MADRAS. Downloaded on July 29,2010 at 13:16:22 UTC from IEEE Xplore. Restrictions apply.
776 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 6, JUNE 2010

However, the two integrations in (7) are intractable. We use Gibbs sampling for approximation [35] here (due to space limitations, the details of the Gibbs sampling algorithm for LDA are omitted). When learning the parameters $\alpha$ and $\beta$, we obtain the topic association of each word in the documents of our knowledge base. Thus, we are able to obtain the document-topic and word-topic associations, represented as $p(z_j \mid d_i)$ and $p(w \mid z_j)$, respectively. In effect, LDA summarizes the documents $\{d_i\}_{i=1}^{N}$ in the whole knowledge base into a few latent topics $\{z_j\}_{j=1}^{k}$. It can be viewed as a soft clustering over the documents in the knowledge base, where each hidden topic $z_j$ can be viewed as a cluster of documents. We can also obtain the conditional probability of generating document $d_i$ from a topic $z_j$ using Bayes' rule:

$$p(d_i \mid z_j) = \frac{p(z_j \mid d_i)\, p(d_i)}{\sum_{i=1}^{N} p(z_j \mid d_i)\, p(d_i)}, \quad (8)$$

where $p(z_j \mid d_i)$ is the topic proportion provided by the topic model. We assume a uniform prior distribution for $p(d_i)$. Using the word-topic association $p(w \mid z_j)$, we can also infer the hidden topics of a newly incoming document $d'$:

$$p(z_j \mid d') = \frac{p(d' \mid z_j)\, p(z_j)}{\sum_{j=1}^{k} p(d' \mid z_j)\, p(z_j)} = \frac{p(z_j) \prod_{w \in d'} p(w \mid z_j)}{Z}, \quad (9)$$

where $p(z_j)$ is the prior of hidden topic $z_j$ and $Z$ is the normalization factor. Then we are able to obtain the similarity between a new document $d'$ and a document $d_i$ in the knowledge base:

$$p(d_i \mid d') = \sum_{j=1}^{k} p(d_i \mid z_j)\, p(z_j \mid d'). \quad (10)$$

Consider an example of how to calculate the similarity between a Wikipedia article and a newly incoming document. Assume that the latent topics refer to the following four aspects: <"Wearing", "Outdoor Sports", "Game Hardware", "Maths">. Consider an article titled "Xbox NBA Live 2008," which refers to a video game. During the learning process of the topic model, each word in this article is assigned to one of the four latent topics. After normalization, the topic distribution of the article can be represented by the vector <0.3, 0.5, 0.7, 0.2>, which denotes the strength of the soft association between the article and the four latent topics. Similarly, when a newly incoming document about "Jordan Shoes" arrives, it can also be represented, using (9), by the latent topic vector <0.8, 0.6, 0.1, 0.1>. Finally, we can calculate the similarity between "Xbox NBA Live 2008" and "Jordan Shoes" in the latent topic space using (10): 0.3 × 0.8 + 0.5 × 0.6 + 0.7 × 0.1 + 0.2 × 0.1 = 0.63.

Our method differs from the method used in [18] in that the inner product is calculated in the topic space instead of the term space. Therefore, the computational cost of document projection can be greatly reduced.

Finally, in order to evaluate the relatedness of a document to a domain, we aggregate the conditional distributions over all the data in the same domain $D_t$ or $D_s$:

$$p(d_i \mid D_s) = \frac{1}{Z} \sum_{x_j^s \in D_s} p(d_i \mid x_j^s), \quad (11)$$

and we form the related domain data set $C$ by collecting the top $M$ ranked candidates that are related to both $D_t$ and $D_s$ according to their relevance score:

$$\mathrm{Score}(d_i \mid D_s, D_t) = p(d_i \mid D_s)\, p(d_i \mid D_t). \quad (12)$$

We can now select the most relevant documents based on their scores. Algorithm 2 shows how to preprocess the entire knowledge base to obtain a set of candidate documents.

Algorithm 2. CandSelect: Algorithm for Candidate Selection
Input: Source domain data set $D_s$, target domain data set $D_t$, entire knowledge base $K$, relevance threshold $\delta$
Output: A candidate set $D'$ from $K$
Initialize: Set $D' = \emptyset$.
for each $d_i$ in $K$ do
  if $\mathrm{Score}(d_i \mid D_s, D_t) \ge \delta$ then
    Let $D' = D' \cup \{d_i\}$.
  end if
end for

4 EXPERIMENTS

In this section, we demonstrate the effectiveness of our semisupervised learning approach on several real-world domain adaptation tasks. To examine our approach in more detail, we analyze the basic step of our algorithm to illustrate that we can successfully discover the information gap. Furthermore, we demonstrate that our proposed algorithm can reduce the information gap within a small number of iterations and converge to a local optimum efficiently.

4.1 Tasks and Data Sets

We conduct our experiments on three different domain adaptation tasks. The first is cross-domain text classification on the 20 Newsgroups data set; the second is a domain adaptation problem in sentiment classification; and the third is a cross-domain short text classification task.

TABLE 1
Data Description—20 Newsgroups

TABLE 2
Data Description—Amazon Product Reviews

TABLE 3
Data Description—AOL Queries

4.1.1 20 Newsgroups

The 20 Newsgroups data set3 [36] is a text collection of approximately 20,000 newsgroup documents partitioned nearly evenly across 20 different newsgroups. Since we compare our algorithm's performance with that of [9], we follow exactly the same experimental settings as given in [9]. We create six different data sets for evaluating cross-domain classification algorithms; for each data set, two top categories are chosen, one as positive and the other as negative. Data are split based on subcategories, and different subcategories are considered as different domains. The six data sets (shown in Table 1) are comp versus sci, rec versus talk, rec versus sci, sci versus talk, comp versus rec, and comp versus talk. Detailed descriptions of these data sets can also be found in [9]. For preprocessing, all letters are converted to lower case and stemmed using the Porter stemmer. Stop words are removed, and the document frequency (DF) feature is used to cut down the number of features.

4.1.2 Sentiment Reviews

The data for sentiment domain adaptation [12] consist of Amazon product reviews for four different product types: books, DVDs, electronics, and kitchen appliances.4 Each review consists of a rating with a score ranging from 0 to 5, a reviewer name and location, a product name, a review title and date, and the review text. Reviews with ratings higher than three are labeled as positive, and reviews with ratings lower than three are labeled as negative; the rest are discarded, since the polarity of these reviews is ambiguous. The details of the data in the different domains are summarized in Table 2. These experimental settings are the same as those of [12]. To study the performance of our approach on this task, we construct 12 pairs of cross-domain sentiment classification tasks; e.g., we use the reviews from domain A as training data and then predict the sentiment of the reviews in domain B.

4.1.3 Query Classification

We also construct a set of tasks on cross-domain query classification for a search engine. We use a set of search snippets crawled from Google as our training data and some incoming unlabeled queries as the test data. A detailed description of the process can be found in [19]. We use the labeled queries from AOL provided by [37] (shown in Table 3) for evaluation.5 We consider queries from five classes that appear in both the training and test data sets: Business, Computer, Entertainment, Health, and Sports. We form 10 binary classification tasks for query classification, and queries are enriched according to [38], [39], where we gather the top 50 search snippets for query enrichment.

For projecting the domain data onto the latent topic space shared with the documents in the knowledge base, we train a topic model on the knowledge-base documents using the GibbsLDA++6 package. In [19], a similar idea of using a topic model learned from the knowledge base to reduce the data sparseness problem was adopted, and the effect of different parameter settings for learning a topic model from Wikipedia was also investigated. Following their conclusion, we set the topic dimension to 100 and use the default hyperparameter settings, that is, $\alpha = 50/K$ (where $K$ is the number of topics) and $\beta = 0.1$. We follow these empirical parameters described in [19] to simplify our parameter settings.

4.2 Evaluation Metrics

To compare the performance of the classification methods, we use classification accuracy, defined as the percentage of correct predictions among all test examples. In order to validate the robustness of our method, each time we randomly sample 90 percent of the training data for training the model. Each experiment is repeated ten times, and both the mean and variance of the accuracies are reported.

4.3 Experimental Results

In this section, we empirically answer the following two questions. 1) Can our min-margin-based semisupervised learning approach outperform traditional transfer learning approaches on the domain adaptation tasks? 2) Can our algorithm automatically identify the most important documents connecting the source domain to the target domain?

4.3.1 Question 1: Comparison with Traditional Transfer Learning Methods

To answer the first question, we compare our algorithm's performance against baseline algorithms presented in previous transfer learning papers (shown in Tables 4, 5, and 6). To evaluate the effectiveness of our approach, we adopt three classification models for domain adaptation: the first is Support Vector Machines (SVM), which is usually used for supervised learning; the second is Transductive SVM (TSVM) [29], a semisupervised learning model; and the last is Co-Cluster-based Classification (CoCC) [9], a transfer learning model designed for cross-domain classification. For all three baseline models, we use only the labeled documents in the source domain and unlabeled documents in the target domain for training, and evaluate each model on the test documents in the target domain.

3. http://people.csail.mit.edu/jrennie/20Newsgroups/.
4. http://www.seas.upenn.edu/~mdredze/data-sets/sentiment/.
5. http://gregsadetsky.com/aol-data/.
6. http://gibbslda.sourceforge.net/.
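The evaluation protocol of Section 4.2 (resampling 90 percent of the training data, ten repetitions, reporting mean and variance of accuracy) can be sketched as below; `train_and_test` is a hypothetical callback for whichever classifier is being evaluated, not an API from the paper:

```python
import random
import statistics

def evaluate(train, test, train_and_test, runs=10, frac=0.9):
    """Protocol of Section 4.2: resample 90 percent of the training data,
    retrain, and report mean and variance of accuracy over ten runs."""
    accs = []
    for seed in range(runs):
        # fixed per-run seeds keep the resampling reproducible
        sample = random.Random(seed).sample(train, int(frac * len(train)))
        accs.append(train_and_test(sample, test))
    return statistics.mean(accs), statistics.variance(accs)
```

Reporting the variance alongside the mean is what allows the significance tests over the baselines described below.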


TABLE 4
Accuracy on 20NG Data

TABLE 5
Accuracy on Sentiment Data

TABLE 6
Accuracy on Web Query Data

Since our BIG algorithm is based on TSVM, in order to obtain the optimal parameter settings for SVM, TSVM, and BIG, we first tune the parameter C to achieve the optimal accuracy via 10-fold cross validation on each data set individually; we report the optimal results of the different models on the different data sets in the following tables.

We first compare the experimental results on the 20 Newsgroups data, shown in Table 4.7 Each row in Table 4 represents a different cross-domain text classification task: comp versus sci, rec versus talk, rec versus sci, sci versus talk, comp versus rec, and comp versus talk, respectively. Each column gives the mean and variance for a different model. We use bold face to mark the best classifier for each data set. We also add a column on the right to denote our improvement over the best baseline result. For this and all subsequent experiments, the Student's t-test was applied to assess statistical significance with respect to the best baseline method. We use ** to mark rows in the improvement column to indicate significance with p < 0.01, and * to indicate significance with p < 0.05; unmarked results indicate no significant differences.

Comparing the results of SVM and TSVM, we find that TSVM boosts the performance of SVM almost all the time.

7. For all the tables, the best model for each data set is marked in bold face. We also measure the statistical significance of the improvement using Student's t-test. * and ** mean that the improvement hypothesis is accepted at significance levels p = 0.05 and p = 0.01, respectively.


When we involve the top 500 important Wikipedia articles, selected by BIG, as unlabeled data to fill in the information gap, we are able to boost the performance of TSVM even further. Furthermore, on all six tasks, our algorithm outperforms the CoCC algorithm.

Table 5 shows the experimental results on the sentiment classification tasks. Each row corresponds to a different cross-domain review sentiment classification task, where D, B, E, and K correspond to reviews on DVDs, books, electronics, and kitchen appliances, respectively. A → B means we use reviews in domain A as training data and reviews in domain B as test data. Polarity classification is quite a different problem from traditional genre categorization. We find that, unlike on 20NG, when the information gap between the domains becomes larger, TSVM does not help much. This phenomenon might be caused by the fact that simply pooling the data across different domains is not reasonable. We find that our BIG algorithm, after filling in the information gap, boosts the performance of TSVM. On almost all tasks, BIG outperforms TSVM, again suggesting that we are adding valuable information to the unlabeled data set.

In Table 6, we show our results on the cross-domain query classification tasks. The results are similar to those of the previous tasks: BIG outperforms TSVM most of the time, while SVM and CoCC perform at the same level as TSVM.

4.3.2 Question 2: Comparison with Random Selection Methods

To answer the second question, we compare our BIG algorithm with two Random Selection baselines. The "Random 1" method randomly selects 500 data instances from the entire knowledge base; the "Random 2" method first uses the CandSelect algorithm described in Section 3 to generate a candidate set, and then randomly selects 500 instances from that candidate set. Once the 500 instances have been selected by either Random method, they are added to the training set as unlabeled data, and we apply the TSVM model for semisupervised learning. From Tables 4, 5, and 6, we observe that the performance of our approach is much better than that of both Random baselines, indicating that we have really found the most important nodes of the domain drift process through which the distribution drifts from the source domain to the target domain. We also observe that "Random 2" is better than "Random 1," since the topic-model-based CandSelect algorithm can filter out a huge amount of irrelevant data from the knowledge base.

The experiments on the different tasks are consistent and validate the effectiveness of our proposed algorithm. In the following, we investigate where our proposed method finds the bridging documents in the knowledge base, and verify that our BIG algorithm converges and performs stably.

4.4 Information Gap

In this section, we further analyze the effectiveness of our algorithm by showing directly what "related domains" we have actually found in our experiments. Here, we provide the related domains, as top-ranked Wikipedia concepts extracted by BIG, for the sentiment domain adaptation task. Detailed results are shown in Table 7.

TABLE 7
Related Concepts in the Sentiment Reviews Task

From our results, we can see that the top-ranked Wikipedia concepts contain rich content that can be used to fill in the information gap between the two domains. Take the cross-domain text classification task "D → B" as an example: the top-ranked Wikipedia concepts extracted by the algorithm include "James Bond" and "Batman," good examples of concepts that contain knowledge about both the DVD and book domains. Therefore, the detailed results on extracted related concepts validate, from another perspective, the effectiveness of our algorithm in filling in the information gap.

4.5 Convergence and Stability

We first demonstrate that our algorithm reduces the information gap between domains while including the unlabeled data from the related domains. We randomly sample three tasks from each of the three data sets and display their performance (shown in Fig. 4), together with their corresponding margin sizes (shown in Fig. 3). In each iteration, we include the top 100 unlabeled data points closest to the decision boundary of TSVM. The x-axis is the iteration count. We find that our algorithm is able to reduce the information gap and converge quickly. We also list another group of figures to demonstrate the trade-off between CPU time and accuracy (shown in Fig. 5).

Since BIG is based on the semisupervised learning model TSVM, a very sensitive parameter is C, the error tolerance factor. The performance of BIG may differ under different settings of C. Another parameter of our algorithm is the margin threshold t. Here, we discuss how to set the two parameters together. Since our margin is closely related to the error tolerance factor C, we also investigate how stably the algorithm performs under different settings of C.

A very interesting observation is that the optimal margin threshold is relatively stable across different data sets. Fig. 6 shows the optimal threshold under different C values. This makes our method robust and reliable when applied to different tasks. Thus, for each domain adaptation task, we first use 10-fold cross validation on traditional TSVM to determine the optimal parameter C. Then, based on that parameter C, we use 10-fold cross validation again to seek the optimal threshold t for halting the BIG algorithm.
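The two-stage tuning just described can be sketched as a nested grid search; `cv_accuracy` is a hypothetical callback that returns mean 10-fold cross-validation accuracy for a given (C, t) setting, with t = None meaning plain TSVM without the halting threshold:

```python
def tune_two_stage(c_grid, t_grid, cv_accuracy):
    """Two-stage model selection used for BIG: first pick the TSVM error
    tolerance C by cross validation, then, holding C fixed, pick the
    margin threshold t that halts the iterations."""
    best_c = max(c_grid, key=lambda c: cv_accuracy(c, None))   # stage 1: plain TSVM
    best_t = max(t_grid, key=lambda t: cv_accuracy(best_c, t)) # stage 2: BIG halting threshold
    return best_c, best_t
```

Decoupling the two searches keeps the tuning cost linear in the grid sizes instead of quadratic, which matters because each cross-validation run involves retraining TSVM.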


Fig. 3. The change of the information gap (margin) over iterations. We can observe that the information gap decreases during the iterations. (a) shows the results on 20 Newsgroups; (b) shows the results on sentiment; (c) shows the results on the AOL query set. The positions of the data sets are the same in the following figures.

Fig. 4. The change of accuracy during iterations. We observe that the performance converges within a few iterations.

Fig. 5. The trade-off between accuracy and CPU time.

Fig. 6. The relation between optimal threshold and parameter C.

Fig. 7. The relation between the number of auxiliary data and the best performance.


TABLE 8
Different Knowledge Bases on 20NG Data

TABLE 9
Different Knowledge Bases on Sentiment Data

TABLE 10
Different Knowledge Bases on Web Query Data

We also investigate the impact of the scope of related domains. We randomly choose one task from each of the three data sets and vary the size of the candidate related documents from 4,000 to 10,000. We observe that the performance is quite stable (shown in Fig. 7). This may be due to the fact that we use the topic model to retrieve the data from the related domains as candidates. The data selection for filling in the information gap is performed by BIG, which means we can retrieve as much relevant data from related domains as possible, since the computational cost is low. The BIG algorithm is able to discover the gap and choose the useful data automatically.

4.6 Choice of Different Auxiliary Knowledge Bases

Since our algorithm does not depend on any specific structure of the knowledge base, we are free to choose any common knowledge repository available online, such as Wikipedia8 or ODP.9 In this paper, we use Wikipedia as our external data source. Wikipedia is currently the largest knowledge repository on the Web, and the quality of its article content is remarkable due to its open editing strategy. In our experiments, we use the snapshot of Wikipedia from 30 Nov. 2006, which contains 1,614,132 articles and 34,172,627 hyperlinks between them. We preprocess the articles by stemming and stop-word removal. We also filter out short articles with length less than 500, after which 53,803 articles remain as our knowledge base.

8. http://en.wikipedia.org.
9. http://www.dmoz.org/.
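The knowledge-base preprocessing described above (dropping articles shorter than 500 characters, lowercasing, stop-word removal, and stemming) can be sketched as follows; the tiny stop-word list and the crude suffix stripper are illustrative stand-ins for a full stop-word list and the Porter stemmer:

```python
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list

def crude_stem(token):
    """Stand-in for the Porter stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(articles, min_len=500):
    """Drop short articles, lowercase, remove stop words, and stem,
    mirroring the knowledge-base preprocessing described above."""
    kept = {}
    for title, text in articles.items():
        if len(text) < min_len:          # filter out short articles
            continue
        kept[title] = [crude_stem(t) for t in text.lower().split()
                       if t not in STOPWORDS]
    return kept
```

The length filter is what shrinks the 1.6 million raw articles to the tens of thousands actually used as the knowledge base, keeping topic-model training tractable.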


For ODP, we download its snapshot from 18 Oct. 2006, which covers 1,733,500 pages. We also filter out short articles with length less than 1,000, after which 59,986 articles remain as our knowledge base.

From Tables 8, 9, and 10, we can observe that our BIG algorithm consistently boosts the performance of TSVM. We also compare our BIG algorithm with the Random baselines, and the results suggest that we are selecting useful unlabeled data in the semisupervised learning process. This further demonstrates that it is necessary to design an algorithm to carefully "pick out" the bridging documents.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a novel framework for tackling the problem of domain adaptation under large information gaps. We model the learning problem as a semisupervised learning problem aided by a method for filling in the information gap between the source and target domains with the help of an auxiliary knowledge base (such as Wikipedia). Through experiments on several difficult domain adaptation tasks, we show that our algorithm can significantly outperform several existing domain adaptation approaches when the source and target domains are far from each other. In each case, an auxiliary domain can be used to fill in the information gap efficiently.

We make three major contributions in this paper. 1) Instead of the traditional instance-based or feature-based perspective on domain adaptation, we view the problem from a new angle: we consider transfer learning as the task of filling in the information gap based on a large document corpus. We show that we can obtain useful information to bridge the source and target domains from auxiliary data sources. 2) Instead of devising new models for tackling domain adaptation problems, we show that we can successfully bridge the source and target domains using well-developed semisupervised learning algorithms. 3) We propose a min-margin algorithm that can effectively identify and reduce the information gap between two domains.

We plan to continue our research in this direction by pursuing several avenues. First, we plan to validate our approach with other semisupervised learning algorithms and other relational knowledge bases to demonstrate its effectiveness more extensively. Second, in this paper, we only investigate the case where the source, target, and auxiliary data sources share the same feature space; we plan to extend our approach to heterogeneous transfer learning [40]. Finally, since our current approach is an iterative model based on TSVM, which is quite slow for large learning tasks, we will try to develop online TSVM methods for incremental cross-domain transductive learning.

ACKNOWLEDGMENTS

The authors thank the support from Hong Kong CERG grant 621307 and a grant from NEC China Lab.

REFERENCES

[1] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, "Boosting for Transfer Learning," Proc. 24th Ann. Int'l Conf. Machine Learning (ICML '07), pp. 193-200, June 2007.
[2] J. Jiang and C. Zhai, "Instance Weighting for Domain Adaptation in NLP," Proc. 45th Ann. Meeting of the Assoc. for Computational Linguistics (ACL '07), June 2007.
[3] G.-R. Xue, W. Dai, Q. Yang, and Y. Yu, "Topic-Bridged PLSA for Cross-Domain Text Classification," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '08), pp. 627-634, July 2008.
[4] A. Argyriou, C.A. Micchelli, M. Pontil, and Y. Ying, "A Spectral Regularization Framework for Multi-Task Structure Learning," Proc. 21st Ann. Conf. Neural Information Processing Systems (NIPS '07), Dec. 2007.
[5] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng, "Self-Taught Learning: Transfer Learning from Unlabeled Data," Proc. 24th Ann. Int'l Conf. Machine Learning (ICML '07), pp. 759-766, June 2007.
[6] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, "Learning Bounds for Domain Adaptation," Proc. 21st Ann. Conf. Neural Information Processing Systems (NIPS '07), Dec. 2007.
[7] S.J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Trans. Knowledge and Data Eng., preprint, 12 Oct. 2009, doi: 10.1109/TKDE.2009.191.
[8] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, "Analysis of Representations for Domain Adaptation," Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp. 137-144, Dec. 2006.
[9] W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, "Co-Clustering Based Classification for Out-of-Domain Documents," Proc. 13th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '07), pp. 210-219, Aug. 2007.
[10] B. Zadrozny, "Learning and Evaluating Classifiers under Sample Selection Bias," Proc. 21st Ann. Int'l Conf. Machine Learning (ICML '04), p. 114, July 2004.
[11] H. Daumé III, "Frustratingly Easy Domain Adaptation," Proc. 45th Ann. Meeting of the Assoc. for Computational Linguistics (ACL '07), June 2007.
[12] J. Blitzer, M. Dredze, and F. Pereira, "Biographies, Bollywood, Boom-Boxes and Blenders: Domain Adaptation for Sentiment Classification," Proc. 45th Ann. Meeting of the Assoc. for Computational Linguistics (ACL '07), pp. 440-447, June 2007.
[13] S.I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller, "Learning a Meta-Level Prior for Feature Relevance from Multiple Related Tasks," Proc. 24th Ann. Int'l Conf. Machine Learning (ICML '07), pp. 489-496, June 2007.
[14] E. Gabrilovich and S. Markovitch, "Feature Generation for Text Categorization Using World Knowledge," Proc. 19th Int'l Joint Conf. Artificial Intelligence (IJCAI '05), pp. 1048-1053, July/Aug. 2005.
[15] E. Gabrilovich and S. Markovitch, "Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization," J. Machine Learning Research, vol. 8, pp. 2297-2345, 2007.
[16] E. Gabrilovich and S. Markovitch, "Overcoming the Brittleness Bottleneck Using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge," Proc. 21st Nat'l Conf. Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conf. (AAAI '06), July 2006.
[17] P. Wang, J. Hu, H.-J. Zeng, L. Chen, and Z. Chen, "Improving Text Classification by Using Encyclopedia Knowledge," Proc. Seventh IEEE Int'l Conf. Data Mining (ICDM '07), Oct. 2007.
[18] E. Gabrilovich and S. Markovitch, "Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis," Proc. 20th Int'l Joint Conf. Artificial Intelligence (IJCAI '07), pp. 1606-1611, Jan. 2007.
[19] X.H. Phan, M.L. Nguyen, and S. Horiguchi, "Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections," Proc. 17th Int'l Conf. World Wide Web (WWW '08), pp. 91-100, Apr. 2008.
[20] T. Hofmann, "Probabilistic Latent Semantic Analysis," Proc. 15th Conf. Uncertainty in Artificial Intelligence (UAI '99), pp. 289-296, July/Aug. 1999.
[21] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," Proc. 14th Ann. Conf. Neural Information Processing Systems (NIPS '01), pp. 601-608, Dec. 2001.
[22] P. Wang, C. Domeniconi, and J. Hu, "Using Wikipedia for Co-Clustering Based Cross-Domain Text Classification," Proc. Eighth IEEE Int'l Conf. Data Mining (ICDM '08), pp. 1085-1090, Dec. 2008.


[23] P. Wang, C. Domeniconi, and J. Hu, “Cross-Domain Text Classification Using Wikipedia,” IEEE Intelligent Informatics Bull., vol. 9, pp. 5-17, Nov. 2008.
[24] P. Wang and C. Domeniconi, “Building Semantic Kernels for Text Classification Using Wikipedia,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’08), pp. 713-721, Aug. 2008.
[25] S. Zelikovitz and H. Hirsh, “Improving Short-Text Classification Using Unlabeled Background Knowledge to Assess Document Similarity,” Proc. 17th Int’l Conf. Machine Learning (ICML ’00), pp. 1183-1190, 2000.
[26] S. Zelikovitz and H. Hirsh, “Using LSI for Text Classification in the Presence of Background Text,” Proc. 10th ACM Int’l Conf. Information and Knowledge Management (CIKM ’01), pp. 113-118, Nov. 2001.
[27] A. Blum and S. Chawla, “Learning from Labeled and Unlabeled Data Using Graph Mincuts,” Proc. 18th Int’l Conf. Machine Learning (ICML ’01), pp. 19-26, June/July 2001.
[28] A. Blum and T. Mitchell, “Combining Labeled and Unlabeled Data with Co-Training,” Proc. 11th Ann. Conf. Computational Learning Theory (COLT ’98), pp. 92-100, 1998.
[29] T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” Proc. 16th Int’l Conf. Machine Learning (ICML ’99), pp. 200-209, June 1999.
[30] G. Salton, A. Wong, and C.S. Yang, “A Vector Space Model for Automatic Indexing,” Comm. ACM, vol. 18, no. 11, pp. 613-620, 1975.
[31] C. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
[32] V.W. Zheng, S.J. Pan, Q. Yang, and J.J. Pan, “Transferring Multi-Device Localization Models Using Latent Multi-Task Learning,” Proc. 23rd Nat’l Conf. Artificial Intelligence (AAAI ’08), pp. 1427-1432, July 2008.
[33] V.W. Zheng, E.W. Xiang, Q. Yang, and D. Shen, “Transferring Localization Models Over Time,” Proc. 23rd Nat’l Conf. Artificial Intelligence (AAAI ’08), pp. 1421-1426, July 2008.
[34] V.W. Zheng, D.H. Hu, and Q. Yang, “Cross-Domain Activity Recognition,” Proc. 11th Int’l Conf. Ubiquitous Computing (UbiComp ’09), pp. 61-70, 2009.
[35] T.L. Griffiths and M. Steyvers, “Finding Scientific Topics,” Proc. Nat’l Academy of Sciences USA, vol. 101, suppl. 1, pp. 5228-5235, http://dx.doi.org/10.1073/pnas.0307752101, Apr. 2004.
[36] K. Lang, “Newsweeder: Learning to Filter Netnews,” Proc. 12th Int’l Machine Learning Conf. (ICML ’95), pp. 331-339, 1995.
[37] S.M. Beitzel, E.C. Jensen, O. Frieder, D.D. Lewis, A. Chowdhury, and A. Kolcz, “Improving Automatic Query Classification via Semi-Supervised Learning,” Proc. Fifth IEEE Int’l Conf. Data Mining (ICDM ’05), pp. 42-49, Nov. 2005.
[38] D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, “Building Bridges for Web Query Classification,” Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’06), pp. 131-138, Aug. 2006.
[39] A.Z. Broder, M. Fontoura, E. Gabrilovich, A. Joshi, V. Josifovski, and T. Zhang, “Robust Classification of Rare Queries Using Web Knowledge,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’07), pp. 231-238, July 2007.
[40] Q. Yang, Y. Chen, G.-R. Xue, W. Dai, and Y. Yu, “Heterogeneous Transfer Learning with Real-World Applications,” Proc. 47th Ann. Meeting of the Assoc. for Computational Linguistics (ACL ’09), Aug. 2009.

Evan Wei Xiang received the BE degree in software engineering from Nanjing University in 2006 and is working toward the PhD degree in the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. His research interests include large-scale data mining, transfer learning, and their applications in Social Web mining. More details can be found at http://ihome.ust.hk/~wxiang.

Bin Cao received the BS and MS degrees in mathematics from Xi’an Jiaotong University and Peking University, in 2004 and 2007, respectively, and is currently working toward the PhD degree in the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. His research interests include data mining, machine learning, and information retrieval. More details can be found at http://ihome.ust.hk/~caobin.

Derek Hao Hu received the BS degree in computer science from Nanjing University in 2007 and is working toward the PhD degree in the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. His research interests include probabilistic graphical models and their applications in sensor-based activity recognition and Web mining. More details can be found at http://www.cse.ust.hk/~derekhh.

Qiang Yang received the bachelor’s degree in astrophysics from Peking University and the PhD degree in computer science from the University of Maryland, College Park. He is a faculty member in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include data mining and machine learning, AI planning, and sensor-based activity recognition. He is a fellow of the IEEE, a member of the AAAI and ACM, a former associate editor for the IEEE Transactions on Knowledge and Data Engineering, and a current associate editor for IEEE Intelligent Systems. More details can be found at http://www.cse.ust.hk/~qyang.

