ICCTA 2012, 13-15 October 2012, Alexandria, Egypt

A Hybrid Method for User Query Reformation and Classification

Said M. Fathalla
Department of Mathematics & Computer Science
Faculty of Science, Alexandria University
Alexandria, Egypt
sm_fathalla@alex-sci.edu.eg

Yasser F. Hassan
Department of Mathematics & Computer Science
Faculty of Science, Alexandria University
Alexandria, Egypt
y.fouad@alex-sci.edu.eg

Maged El-Sayed
Department of Information Systems and Computers
Faculty of Commerce, Alexandria University
Alexandria, Egypt
maged@alexu.edu.eg

Abstract. This paper presents a hybrid method for user query reformation and classification based on a fuzzy semantic approach and a KNN classifier. The overall process of the system consists of query preprocessing, fuzzy membership calculation, and query classification and reformation. Classification is performed with a KNN classifier using not only keyword-based semantics but also sentence-level semantics. A semantic model, EnrichSearch, has been developed to apply this approach. After classification, the user's query is reformulated and submitted to a search engine, which gives better results than submitting the original query. Experiments show a significant enhancement in search results over those of traditional keyword-based search engines.
Keywords: semantic web; fuzzy logic; ontology; KNN

I. INTRODUCTION
An important goal of the semantic web is to make the meaning of information explicit so that it becomes machine-accessible, enabling more effective access to knowledge contained in heterogeneous information environments such as the web [1]. Current search engines are not very effective because they rely on simple query-document matching without considering the semantics of the keywords [2]. In addition, the intended meaning of a keyword is difficult to determine when the keyword has more than one meaning depending on the context.
Fuzzy sets are sets whose boundaries are imprecise; accordingly, they rely only on the degree of membership of their elements. As the web grows, the thorny problem of data heterogeneity also increases. A review comparing fuzzy sets, crisp sets, and semantic web technologies on the basis of their extensions is given in [3].
Synonyms are different words with similar meanings. Words that are synonyms are said to be synonymous, so synonymous words have the same membership values with respect to a category. For example, the keywords "movies" and "films" are synonymous. Short forms such as "msg" and "err" are also considered synonymous with the original words "message" and "error", respectively. Polysemy gives rise to homonyms: a homonym is one of a group of words that share the same spelling and the same pronunciation but have different meanings. It is important to identify the intended meaning of each keyword; this is done by classifying the text.
978-1-4673-2824-1/12/$31.00 ©2012 IEEE
The main contribution of this work is the disambiguation of the meaning of keywords through their context (obtaining the intended meaning of keywords), because working with keywords individually loses the context, which is the only thing that can help to deal with ambiguity. The intended meaning of keywords is used in query reformation to obtain better results than submitting the original query to a search engine, and to avoid retrieving too many irrelevant results, which improves retrieval effectiveness. The task of automatically classifying natural language texts into a predefined set of semantic categories based on their content is called text categorization [4]. In this work, text categorization is not only word-based or phrase-based but also semantics-based. For example, consider the keyword "switch": a user working in hardware maintenance may formulate the problem statement "turn the power switch on", whereas a user working in networking may formulate the query "power on network switch". The same keyword has a different intended meaning in each query.
Many studies in query enrichment use ontologies to extract keyword meanings with which to enrich the query. In our research we do not rely on the ontology alone; we also use our classification algorithm to extract the intended meaning of each keyword (denoted by InSem(wi)) according to the meaning of the overall query.
This paper is organized as follows. Section II discusses related work. Section III discusses the overall architecture of the proposed methodology, including the training stage and the classification and reformation stage. Section IV discusses the ontology development process. Section V presents our experimental results. Finally, Section VI concludes and describes our vision for future work.
II. RELATED WORK
The ARCH system [5] is a personalized IR system that enhances the user query using a user profile containing long-term user interests. This limits the scope of the system to what is in the user's profile, which is not necessarily related to what the user is currently searching for. Studies such as [6] rely on question classification and exact matching between question keywords and answers without using semantics. The publicly available Microsearch system [7] enriches the presentation of search results with metadata extracted from the results, and its authors report on some of the early feedback received. Such an approach faces several problems: metadata may be lost, most users ignore feedback, and extracting and summarizing embedded metadata causes performance problems.
III. OVERALL ARCHITECTURE OF THE PROPOSED METHODOLOGY
The work is divided into two main stages. The first is the training stage, in which the system is trained to adjust the weights of the keywords that fall into more than one category. The second is the working stage, in which the user submits a query to the system for classification and reformation. Fig. 1 shows a layered view of the proposed model, in which the layers represent the generic architecture of the model as follows:
• User Interface Layer: involves the users who submit the query to the system in the form of keywords.
• Text Tokenizer Layer: splits the query statement provided by the layer above into a set of keywords and passes it to the underlying Semantic Query Layer.
• Semantic Query Layer: takes the main keywords from the layer above and formulates a semantic query using the underlying Formal Query Layer.
• Formal Query Layer: provides a formal query language, SPARQL, that can be used to retrieve semantic data from the underlying Semantic Data Layer.
• Semantic Data Layer: contains the semantic data, represented in an RDF repository.
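
As a rough illustration of how the Text Tokenizer Layer could hand keywords to the Semantic and Formal Query Layers, the following Python sketch tokenizes a query and builds the text of a SPARQL query for one keyword. It is only a sketch under stated assumptions: the namespace IRI is hypothetical, keywords are assumed to be modelled as individuals named after the keyword, and the hasCategory property is the one described in Section IV. The actual system was built with Jena in Java.

```python
import re

# Hypothetical namespace: the paper does not state the ontology's IRI.
ONTOLOGY_NS = "http://example.org/enrichsearch#"


def tokenize(query: str) -> list[str]:
    """Text Tokenizer Layer: split a query statement into lowercase keywords."""
    return re.findall(r"[a-z0-9]+", query.lower())


def build_sparql(keyword: str) -> str:
    """Semantic/Formal Query Layers: SPARQL text asking for a keyword's categories.

    Assumes (hypothetically) that each keyword is an individual in the
    ontology and is linked to its categories by the hasCategory property.
    """
    return (
        f"PREFIX es: <{ONTOLOGY_NS}>\n"
        "SELECT ?category WHERE {\n"
        f"  es:{keyword} es:hasCategory ?category .\n"
        "}"
    )


keywords = tokenize("Power on network switch")
queries = [build_sparql(k) for k in keywords]
```

The query text would then be executed against the RDF repository of the Semantic Data Layer by whatever SPARQL engine is in use.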

A. The Training Stage
Training adjusts the membership value of each keyword with respect to each category. This stage computes a fuzzy degree of similarity that ranges between 0 and 1. Consider a set of recognized keywords $K = \{k_1, k_2, \dots, k_n\}$ and a set of relevant categories $C = \{c_1, c_2, \dots, c_m\}$. The relevance of keywords K to categories C is expressed by a fuzzy relation

$$\mu_R : K \times C \to [0, 1] \qquad (1)$$

where the membership value $\mu_R(k_i, c_j)$ specifies the degree of relevance of keyword $k_i$ to category $c_j$. The membership values of this relation are determined from a set of training examples, where each training example contains a text query and its category. Equation (2) shows the possible values of membership. The larger the value, the better the match of the keyword to the category. In particular, a membership of 1 means a perfect assignment of the word to the category (the word does not belong to any other category in the ontology), a membership of 0 means that the word does not belong to the category at all, and keywords with several meanings (homonyms), which belong to different categories, have a value between 0 and 1.

$$\mu_R(k_i, c_j) = \begin{cases} 1 & \text{if } k_i \text{ belongs only to } c_j \\ 0 & \text{if } k_i \text{ does not belong to } c_j \\ v \in (0, 1) & \text{if } k_i \text{ belongs to } c_j \text{ and to other categories} \end{cases} \qquad (2)$$

Let TS be a set of training examples consisting of n queries and their classifications:

$$TS = \{\langle q_1, c(q_1)\rangle, \langle q_2, c(q_2)\rangle, \dots, \langle q_n, c(q_n)\rangle\}$$

Each query in the training set is represented by a set of term-frequency pairs:

$$q = \{\langle k_1, w_1\rangle, \langle k_2, w_2\rangle, \dots, \langle k_n, w_n\rangle\}$$

where $w_j$ is the occurrence frequency of keyword $k_j$ in the query. Given a set of training queries TS, the membership value $\mu_R(k_i, c_j)$ is the total number of occurrences of keyword $k_i$ in category $c_j$ divided by the total number of occurrences of $k_i$ in all categories:

$$\mu_R(k_i, c_j) = \frac{\sum \{\, w_i \mid w_i \in q_k \wedge q_k \in TS \wedge c(q_k) = c_j \,\}}{\sum \{\, w_i \mid w_i \in q_k \wedge q_k \in TS \,\}} \qquad (3)$$


Tables I, II and III give an illustrative example of the membership value calculation. Suppose queries q1 and q2 belong to category C1, q3 belongs to C2, and only three keywords {k1, k2, k3} appear in TS. Table I shows the structure of the training set, Table II shows the term frequency of each keyword in each category, and Table III shows the calculated membership value of each keyword for each category.

Figure 1. Layered approach for the proposed methodology

TABLE I. TRAINING SET STRUCTURE

Query | K1 | K2 | K3 | Category
q1 | 1 | 2 | 0 | C1
q2 | 2 | 1 | 0 | C1
q3 | 0 | 1 | 3 | C2

TABLE II. THE TERM-FREQUENCY TO EACH CATEGORY

Keyword | C1 | C2
K1 | 3 | 0
K2 | 3 | 1
K3 | 0 | 3


TABLE III. CALCULATED VALUE FOR EACH KEYWORD TO EACH CATEGORY

Keyword | C1 | C2
K1 | 1 | 0
K2 | 0.75 | 0.25
K3 | 0 | 1
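
The calculation illustrated in Tables I to III can be sketched in Python as follows. This is a minimal sketch of Equation (3), not the authors' implementation; the function name and the data layout (a list of term-frequency dictionaries paired with a category label) are assumptions made for the example.

```python
from collections import defaultdict


def train_memberships(training_set):
    """Compute mu_R(k, c) as in Equation (3).

    training_set: list of (term_frequencies, category) pairs, where
    term_frequencies maps each keyword to its occurrence count in the query.
    """
    per_category = defaultdict(lambda: defaultdict(int))  # keyword -> category -> count
    totals = defaultdict(int)                              # keyword -> count over all categories
    categories = set()
    for term_freqs, category in training_set:
        categories.add(category)
        for keyword, freq in term_freqs.items():
            per_category[keyword][category] += freq
            totals[keyword] += freq
    # Keywords that never occur (total of 0) are skipped to avoid dividing by zero.
    return {
        keyword: {c: per_category[keyword][c] / total for c in sorted(categories)}
        for keyword, total in totals.items()
        if total > 0
    }


# Training set from Table I.
TS = [
    ({"K1": 1, "K2": 2, "K3": 0}, "C1"),
    ({"K1": 2, "K2": 1, "K3": 0}, "C1"),
    ({"K1": 0, "K2": 1, "K3": 3}, "C2"),
]
memberships = train_memberships(TS)
```

For the training set of Table I, the resulting values agree with Table III: K2, for instance, occurs three times in C1 and once in C2, giving 3/4 = 0.75 and 1/4 = 0.25.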

The training algorithm works in three main steps:
I. Input: the training set (TS).
II. Calculate $\mu_R(k_i, c_j)$ according to (3).
III. Update the ontology with the new weights.
B. The Classification and Reformation Stage
Fig. 2 shows the overall architecture of the proposed model in the classification and reformation stage. It contains the main components of the model and shows the overall process, from accepting the user's query as input to producing the final classification of the query and the enriched query. The overall process of EnrichSearch comprises three major steps:
• Step 1 (Preprocessing): the interface model takes a query from the user as input and performs query tokenization.
• Step 2 (Semantic Annotation): an ontology-based annotation is used in the extractor model to relate each keyword appearing in the query to the ontology concept that defines it, using the SPARQL language [8] and Jena [9]. This step removes any ambiguity that may occur for homonyms.
• Step 3 (Query Classification): query classification is performed using the classification algorithm in Fig. 4.

1) Query-category fuzzy similarity measure
Let a test query be $q = \{\langle t_1, \mu_q(t_1)\rangle, \langle t_2, \mu_q(t_2)\rangle, \dots, \langle t_m, \mu_q(t_m)\rangle\}$, where $\mu_q(t_j)$ represents the membership degree with which term $t_j$ belongs to q. The fuzzy similarity between q and a category $c_i$ is calculated using (4).

$$\mathrm{sim}(q, c_i) = \frac{\sum_{t \in q} \mu_R(t, c_i) \otimes \mu_q(t)}{\sum_{t \in q} \mu_R(t, c_i) \oplus \mu_q(t)} \qquad (4)$$

where $\otimes$ and $\oplus$ are fuzzy conjunction (t-norm) and disjunction (t-conorm) operators and $\mu_q(t)$ is the membership value of a keyword in the query. The conjunction operator can be generalized to any triangular norm, also called a t-norm and denoted by t(x, y), which is a mapping from $[0, 1] \times [0, 1]$ into $[0, 1]$. In the same way, fuzzy disjunction operators can be defined as triangular conorms, also called t-conorms and denoted by s(x, y), which are mappings from $[0, 1] \times [0, 1]$ into $[0, 1]$ [10]. Commonly used conjunction and disjunction operators include the Einstein product and the Einstein sum, which are used here to calculate the t-norm and t-conorm as shown in (5) and (6).

Figure 2. Architecture of the proposed model (classification and reformation stage)


$$t(x, y) = \frac{x \cdot y}{2 - (x + y - x \cdot y)} \qquad (5)$$

$$s(x, y) = \frac{x + y}{1 + x \cdot y} \qquad (6)$$
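
The Einstein product and Einstein sum named above can be written directly in Python using their standard definitions. The function names are chosen for this example only.

```python
def einstein_product(x: float, y: float) -> float:
    """Einstein product t-norm: x*y / (2 - (x + y - x*y)), for x, y in [0, 1]."""
    # The denominator is at least 1 on [0, 1] x [0, 1], so no division by zero occurs.
    return (x * y) / (2.0 - (x + y - x * y))


def einstein_sum(x: float, y: float) -> float:
    """Einstein sum t-conorm: (x + y) / (1 + x*y), for x, y in [0, 1]."""
    return (x + y) / (1.0 + x * y)
```

Both operators satisfy the usual boundary conditions: t(x, 1) = x and s(x, 0) = x for any x in [0, 1].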

The test query to be classified is represented as a Boolean feature vector that sets $\mu_q(k)$ to one for every keyword k that occurs in q at least once and to zero if the keyword k is not found in q, as shown in (7).

$$\mu_q(k) = \begin{cases} 1 & \text{if } k \in q \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
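
The similarity measure can be sketched as follows, assuming that Equation (4) is the ratio of the summed t-norms to the summed t-conorms over the keywords of the query, with the Einstein operators and the Boolean query memberships of (7). The function names and the nested-dictionary layout of the membership values are assumptions for this example; keywords unknown to the ontology are treated as having membership 0, which is also an assumption.

```python
def einstein_product(x, y):
    """Einstein product t-norm."""
    return (x * y) / (2.0 - (x + y - x * y))


def einstein_sum(x, y):
    """Einstein sum t-conorm."""
    return (x + y) / (1.0 + x * y)


def fuzzy_similarity(query_terms, category, memberships):
    """Fuzzy similarity between a query and a category (sketch of Equation (4)).

    memberships maps keyword -> {category: mu_R(keyword, category)}.
    """
    numerator = 0.0
    denominator = 0.0
    for term in set(query_terms):
        mu_r = memberships.get(term, {}).get(category, 0.0)  # 0 if unknown (assumption)
        mu_q = 1.0  # Boolean membership: the term occurs in the query
        numerator += einstein_product(mu_r, mu_q)
        denominator += einstein_sum(mu_r, mu_q)
    return numerator / denominator if denominator else 0.0


# Membership values from Table III.
memberships = {
    "K1": {"C1": 1.0, "C2": 0.0},
    "K2": {"C1": 0.75, "C2": 0.25},
    "K3": {"C1": 0.0, "C2": 1.0},
}
sim_c1 = fuzzy_similarity(["K1", "K2"], "C1", memberships)
sim_c2 = fuzzy_similarity(["K1", "K2"], "C2", memberships)
```

Under this reading, because $\mu_q(k) = 1$ for every keyword in the query, t(x, 1) = x and s(x, 1) = 1 for the Einstein operators, so the similarity reduces to the mean membership of the query keywords in the category. For a query containing K1 and K2 this gives (1 + 0.75)/2 = 0.875 for C1 and (0 + 0.25)/2 = 0.125 for C2.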
2) Query Classification
The query classification is performed using the k-nearest-neighbor (KNN) classifier according to the query-category similarity. KNN is selected as a baseline classifier because it is easy to implement and often gives very good classification performance in many practical applications [11]. Fig. 3 illustrates how classification is performed. Given the similarity measure of the query to each category, $\mathrm{sim}(q, C_i)$, the final classification is $C_c$ according to (8), such that [12] $C_c \in \{C_1, C_2, \dots, C_M\}$.

$$d(q, C_c) = \min_{i = 1, 2, \dots, M} d\big(q, \mathrm{sim}(q, C_i)\big) \qquad (8)$$
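
One way to read (8) is as a nearest-category rule: the query is assigned to the category at minimum distance. The sketch below assumes the distance is defined as 1 minus the similarity, which the paper does not state explicitly, and breaks ties alphabetically so that the result is deterministic. It is not a full KNN implementation over individual training queries.

```python
def classify(similarities):
    """Return the category at minimum distance from the query.

    similarities maps category -> sim(q, category). The distance
    d = 1 - sim is an assumption made for this sketch.
    """
    distances = {category: 1.0 - sim for category, sim in similarities.items()}
    # sorted() fixes the iteration order, so ties go to the alphabetically first category.
    return min(sorted(distances), key=distances.get)


predicted = classify({"C1": 0.875, "C2": 0.125})
```

With this choice of distance, selecting the minimum distance is the same as selecting the category with the highest similarity.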

Figure 3. Query classification using KNN

Figure 4. Classification algorithm

3) Query reformation
The intended semantic of each keyword ki is extracted from the ontology according to the final classification of the query, and the enriched query is formulated according to (9). For example, consider the keyword "switch" in the two query statements "how to switch the HDD Rack on" and "I have a problem in the network switch connected to my computer". This keyword has two meanings (the switch that turns the power on or off, and the switch device that connects computers in a network), so the problem is which meaning the machine should use in the enriched query. The machine takes the first meaning when the user submits the first query and the second meaning when the user submits the second query; in each case this is the intended meaning according to the classification of the query statement. Finally, the enriched query is submitted to a keyword-based search engine, which gives better results than submitting the original query.

$$\forall k_i,\quad \text{EnrichedQuery} = \text{classification output} + k_i \ \text{OR}\ \mathit{InSem}(k_i) \qquad (9)$$

IV. ONTOLOGY DEVELOPMENT
The semantic web uses the Resource Description Framework (RDF) data model to describe data models, including concepts and the relationships between them. The main purpose of developing an ontology is to share a common understanding of the structure of information among people or software agents, so that the information becomes readable and understandable by computers. We have developed a well-defined ontology as a way to specify keyword semantics and the relationships between keywords. In constructing the ontology, we follow the steps of the ontology-building process suggested in [1]. For the implementation and modelling of the domain ontology we chose Protégé [13] as the development tool. Fig. 5 shows a mapping of some RDF concepts to the ontology. Some property types are discussed in the following subsections.

Figure 5. Mapping RDF concepts to ontology


A. Inverse Property
The property hasCategory has an inverse property, categoryOf. That is, if switch hasCategory Network, then because of the inverse property it can be inferred that Network is categoryOf switch.


B. Transitive Properties
Fig. 6 shows an example of the transitive property
hasSynonym. If the individual Network has a synonym that is
Net, and Net has a synonym that is Mesh, then we can infer
that Network has a synonym that is Mesh. Inverse property is
indicated by the dashed line in Fig. 6.

Figure 6. An example of a transitive property hasSynonym

C. Symmetric Property
Fig. 7 shows an example of a symmetric property. If the
individual Fuzzy is related to the individual Uncertain via the
hasSynonym property, then, it could be inferred that Uncertain
must also be related to Fuzzy via the hasSynonym property. In
other words, if Fuzzy is the synonym of Uncertain, then
Uncertain must be the synonym of Fuzzy.

Figure 7. An example of a symmetric property

D. Reasoning
One of the main services offered by a reasoner is to test
whether or not one class is a subclass of another class. By
performing such tests on the classes in an ontology it is
possible for a reasoner to compute the inferred ontology class
hierarchy. Another standard service that is offered by
reasoners is consistency checking. Based on the description
(conditions) of a class the reasoner can check whether or not it
is possible for the class to have any instances. Protégé 4
allows different OWL reasoners to be plugged in; the reasoner
shipped with Protégé is called FaCT++ [14]. The ontology can
be sent to the reasoner to automatically compute the
classification hierarchy, and also to check the logical
consistency of the ontology.
The W3C has proposed the Resource Description Framework (RDF) [15] for exposing the meaning of a document to the web community of people, machines, and intelligent agents. Rules provide a natural and widely accepted mechanism for automated reasoning, with mature and available theory and technology. This has been identified as a design issue for the semantic web, as clearly stated by Tim Berners-Lee et al. [16]. The mapping rules that we consider are the following:
• hasSynonym(x, y) ∧ hasSynonym(y, z) → hasSynonym(x, z)
• hasSynonym(x, y) → hasSynonym(y, x)
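
The effect of the two hasSynonym rules can be illustrated with a small fixed-point computation in Python. This is only a sketch of the inference that a reasoner would perform on the ontology; the function name is hypothetical, and pairs relating an individual to itself are deliberately left out, which is a design choice of this example rather than something stated in the paper.

```python
def infer_synonyms(asserted):
    """Close a set of (x, y) hasSynonym pairs under the symmetric and transitive rules."""
    inferred = set(asserted)
    changed = True
    while changed:
        changed = False
        # Symmetric rule: hasSynonym(x, y) -> hasSynonym(y, x)
        new_pairs = {(y, x) for x, y in inferred}
        # Transitive rule: hasSynonym(x, y) and hasSynonym(y, z) -> hasSynonym(x, z)
        new_pairs |= {
            (x, z)
            for x, y in inferred
            for y2, z in inferred
            if y == y2 and x != z  # skip self-pairs (design choice of this sketch)
        }
        if not new_pairs <= inferred:
            inferred |= new_pairs
            changed = True
    return inferred


# Asserted facts from the examples in Section IV.B and IV.C.
network_synonyms = infer_synonyms({("Network", "Net"), ("Net", "Mesh")})
fuzzy_synonyms = infer_synonyms({("Fuzzy", "Uncertain")})
```

From the two asserted facts about Network, Net and Mesh, the closure contains hasSynonym(Network, Mesh) by transitivity and the reverse of every pair by symmetry.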

E. Online Validation Tool
There are many online validation tools that check and visualize RDF documents. One of the most capable is the RDF validation service provided by W3C [17]. Fig. 8 shows a graph representation of a resource produced by the W3C online validation service.
V. EXPERIMENTAL RESULTS
The solution was built using Jena, a Java framework for building semantic web applications. Briefly, Jena provides a programmatic API for RDF and its derivatives. It can query RDF-based data models, either in memory or persisted in a data store, using the SPARQL RDF query language [18]. Since all the semantic data that we created are described in RDF, they can be queried with SPARQL. Currently, we integrate a snapshot with a textbox in which users input their queries [19].
Rather than using the semantics of individual keywords in the query statement, which breaks the relationship between keywords, the proposed method uses the relationship between a keyword and the keyword that follows it (if any): it searches the ontology for the two words as one concept and, if the concept is found, uses the semantics of the pair instead of the semantics of each word individually.
In this section we describe the data sets used for evaluation, the experimental procedure, and the results. Queries in the study were designed to test various search features, including single-word search and phrase search. We used different collections of query sets to evaluate the effectiveness of the fuzzy similarity approach in the query classification process. The first collection is the AOL query set and the second is the DMO query set. These data sets cover different numbers of categories. We divided the data set into a training set and a testing set: two thirds of the queries were used for training and the remaining third for testing. For evaluation, we use only the top 3 categories, which account for about 84% of the queries. Samples from the data set are shown in Table IV.
TABLE IV. SAMPLE FROM DATA SET

Search query | Classification
Dell network switch | NETWORK
Dell database server | DATABASE
network router | NETWORK
DHCP | NETWORK
IP | NETWORK
switch table | NETWORK
query table | DATABASE
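
The two-word concept lookup described at the start of this section can be sketched as a greedy left-to-right scan. The sketch assumes the ontology lookup is available as a simple set of known two-word concepts; the set below is hypothetical and is only populated with phrases taken from Table IV.

```python
def match_concepts(tokens, two_word_concepts):
    """Prefer a two-word concept over two single keywords, scanning left to right."""
    matched = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in two_word_concepts:
            matched.append(pair)       # use the pair as one concept
            i += 2
        else:
            matched.append(tokens[i])  # fall back to the single keyword
            i += 1
    return matched


# Hypothetical ontology content, using phrases from Table IV.
CONCEPTS = {"network switch", "switch table", "query table", "database server"}

units = match_concepts(["dell", "network", "switch"], CONCEPTS)
```

With such a lookup, "switch table" and "query table" are each treated as one concept, which is how the shared keyword "table" can lead to the different classifications shown in Table IV.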

The performance reported in this section is measured by the accuracy of predicting the class of test queries. The classification accuracy is defined as the percentage of test queries that are correctly classified. For a query in the test set that was assigned to several classes, the classification is considered correct if it is the same as the given classification. The classification accuracy of our model is 89.2%.
We compared the results returned by a keyword-based search engine (we chose Google) for a given query with the results for the enriched query produced by our solution. The enriched query excludes many irrelevant results, as shown in Table V. The comparison between the original query and the enriched query in this regard is shown in Fig. 9.
The experimental results show that submitting the enriched query produced by our solution improves the results retrieved by keyword-based search engines compared with submitting the original query.

TABLE V. A COMPARISON BETWEEN THE RESULTS OF GOOGLE FOR A USER QUERY AND THE CORRESPONDING ENRICHED QUERY (JUNE 20, 2012)

No. | Original Query | Enriched Query | Google Results for Original Query | Google Results for Enriched Query | Difference in Results (Pages)
1 | What is the network switch? | Network: what is the network group of computers connected together switch a device that connect computers in a network | About 291,000,000 results | About 2,310,000 results | 288,690,000
2 | DHCP | Network :DHCP Dynamic Host Configuration Protocol | 42,100,000 results | 2,600,000 results | 39,500,000
3 | truncate table | Database: truncate SQL command table storage unit | 1,160,000 results | 616,000 results | 544,000
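
The "Difference in Results" column of Table V can be checked with a few lines of Python, which also express each difference as a share of the original result count. The percentage is an addition made for this example and is not reported in the paper; the counts themselves are the figures listed in Table V.

```python
# (original query, results for original query, results for enriched query) from Table V.
TABLE_V = [
    ("What is the network switch?", 291_000_000, 2_310_000),
    ("DHCP", 42_100_000, 2_600_000),
    ("truncate table", 1_160_000, 616_000),
]


def reduction(original, enriched):
    """Return the absolute difference and the percentage reduction in result count."""
    difference = original - enriched
    return difference, 100.0 * difference / original


for query, original, enriched in TABLE_V:
    difference, percent = reduction(original, enriched)
    print(f"{query}: {difference:,} fewer results ({percent:.1f}% reduction)")
```

The computed differences match the table (288,690,000, 39,500,000 and 544,000 pages), corresponding to reductions of roughly 99.2%, 93.8% and 46.9% of the original result counts.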

Figure 8. Graph representation of a resource produced by W3C online validation service.

Figure 9. Comparison between precision of results using original query and enriched query.


VI. CONCLUSION
EnrichSearch was designed as a plug-in for any search engine. Its implementation therefore focuses on modifying the query string itself, which is easier than modifying the target search engine directly. The proposed solution provides a new method for classifying query statements using the semantics of the query keywords. Classification uses not only keyword-based semantics but also sentence-level semantics, which means that the system can use the relationships between keywords (if any) to obtain the intended meaning of a keyword. Another benefit of this approach is that the algorithm continuously improves itself by adjusting the weights of keywords, which increases classification accuracy over time. Ambiguity arising from homonyms is reduced by obtaining the intended meaning of keywords.
Our proposed solution facilitates online searching by combining existing keyword-based search engines (such as YAHOO! and Google), which are produced by many companies and research labs, with the benefits of the semantic web and fuzzy logic to obtain better results than submitting the original query. This avoids many results that are irrelevant to the user's intended meaning and saves the time needed to find the required information in the huge result set returned by keyword-based search engines for the original query. Our future work is to apply ontology learning to enrich the ontology and to use multiple ontologies covering different domains rather than a single domain.

REFERENCES
[1] G. Antoniou and F. Van Harmelen, A Semantic Web Primer, Cambridge, MA: MIT Press, 2008.
[2] M. Daoud, L. Tamine-Lechani and M. Boughanem, "Using a concept-based user context for search personalization," in Proc. of the 2008 International Conference of Data Mining and Knowledge Engineering, 2008.
[3] U. Kamaluddeen, J. Jafreezal and L. Shahir, "Comparability between fuzzy sets and crisp sets: a semantic web approach," 2010.
[4] M. Lan, C. Lim Tan, J. Su and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), pp. 721-735, 2009.
[5] A. Sieg, B. Mobasher, R. Burke, G. Prabu and S. Lytinen, "Representing user information context with ontologies," in uahci05, 2005.
[6] S. Verberne, L. Boves, N. Oostdijk and P. Coppen, "What is not in the bag of words for why-QA?," ACL, pp. 719-727, 2010.
[7] P. Mika, "Microsearch: An Interface for Semantic Search," in Semantic Search, International Workshop located at the 5th European Semantic Web Conference (ESWC 2008), volume 334 of CEUR Workshop Proceedings, pp. 79-88, 2008.
[8] E. Prud'hommeaux and A. Seaborne, "SPARQL Query Language for RDF," http://www.w3.org/TR/rdf-sparql-query/, 2011.
[9] Jena, "A Semantic Web Framework for Java," http://jena.sourceforge.net/, 2012.
[10] E. Massad, N. Regina Siqueira Ortega, L. Carvalho de Barros and C. José Struchiner, Fuzzy Logic in Action: Applications in Epidemiology and Beyond, Studies in Fuzziness and Soft Computing, 2008.
[11] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, pp. 1-47, 2002.
[12] B. Sierra, E. Lazkano, J.M. Martínez-Otzeta and A. Astigarraga, "Combining Bayesian Networks, k Nearest Neighbours algorithm and Attribute Selection for Gene Expression Data Analysis," 2005.
[13] PROTÉGÉ, "The Protégé Ontology Editor," http://protege.stanford.edu/, 2011.
[14] D. Tsarkov and I. Horrocks, "FaCT++ description logic reasoner: System description," in Proc. of the Int. Joint Conf. on Automated Reasoning (IJCAR 2006), vol. 4130, pp. 292-297, Springer, 2006.
[15] Resource Description Framework (RDF) web site: http://www.w3.org/RDF, 2012.
[16] T. Berners-Lee, "The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities," Scientific American, 2001.
[17] W3C online validation service: http://www.w3.org/RDF/Validator, 2011.
[18] V. Mendis and P. Foster, "Rdf user profiles - bringing semantic web capabilities to next generation networks and services," Proceedings ICIN, 2007.
[19] G. Wu, M. Yang, K. Wu, G. Qi and Y. Qu, "Falconer: Once SIOC Meets Semantic Search Engine," Proceedings of the 19th International Conference on World Wide Web, ACM, New York, NY, USA, 2010.
