
An Intelligent Database Application for the Semantic Web

Amr F. El-Helw Hussien H. Aly


amrelhelw@elitemail.org alyh@computer.org
Department of Computer Science & Automatic Control, Faculty of Engineering
Alexandria University, EGYPT

Abstract

The Semantic Web is one of the important research fields that has recently come to light. Its major concern is to convert the World Wide Web from a huge repository of unrelated text into useful, linked pieces of information. Linking the information is based not only on text similarity, but mainly on the meanings and real-world relations between items. In this research, we use the techniques of the Semantic Web to create formal definition rules for a kind of unstructured data, and to query this data for information that might not be explicitly stated and that would otherwise be very difficult to extract. For this purpose, we develop an integrated system that can extract data from both unstructured and structured documents, and then answer users' queries using the extracted data together with inference rules that help to deduce more information from this data. As an example of this data, we take the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships.

Keywords: Semantic Web, Intelligent Databases, Deductive Databases, Information Retrieval.

1. Introduction

1.1. Overview

The World-Wide Web (WWW) is a huge repository of information. It contains documents and multimedia resources concerning almost every imaginable subject, in human-usable format. Documents on the Web have cross-references known as links, which represent some sort of relationship between the documents. The majority of these documents are written in the Hyper-Text Markup Language (HTML), a text stream with embedded special tags. Most of the tags are concerned with the organization and presentation of the document.

Due to the huge volume of available data, it is becoming increasingly difficult to locate useful information using the current technology of search engines. Also, users often want to use the Web to do more than just locate a document; they want to perform some task, or perform some inference on its relationships with other documents. Completing these tasks often involves visiting a series of pages, integrating their content and reasoning about them in some way. This is far beyond the capabilities of directories and search engines.

The main problem is that the Web was not initially designed to be processed by machines. Documents on the Web do not provide any information that helps a machine determine what the text means.

The Semantic Web is not a technology by itself. In fact, it is a group of inter-related technologies that may be used (individually or together) to produce the desired outcome. Among these technologies and concepts are Ontologies, Data Mining, Deductive Databases, Artificial Intelligence, Man-Machine Interfaces and others.

The Semantic Web may allow users to organize and browse the Web in ways more suitable to the problems they have at hand. It could be used to impose a conceptual filter on a set of web pages, and to display their relationships based on such a filter. It may also allow visualization of complex content. With HTML, such interfaces are virtually impossible, since it is difficult to extract meaning from the text. The major concern of the Semantic Web is to convert the World Wide Web from a huge repository of unrelated text into useful, linked pieces of information, where the linking is based not only on text similarity, but mainly on the meanings and real-world relations between items.

1.2. Objective

Currently, there is no universal application of the Semantic Web, and it seems that applications will be domain dependent, as stated in [12]: "the Semantic Web currently lacks a business case. Even if one were to emerge, there is no guarantee that the Semantic Web will become universal".

In this research, we aim to use an Intelligent Database approach to develop a real-life application that applies the techniques and structures of the Semantic Web and explores their potential use. We consider the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships. While this may seem to be a classic problem in the deductive database area of research, the problem is not the deduction process but how to collect the data from scattered and unstructured Web pages into the extensional part of the deductive database.
In order to achieve this goal, we have to translate the existing documents on the Web into a proper model that represents the semantics of and relationships between these documents, design a suitable inference engine to navigate through the documents, and provide an interface query language to drive this engine.

2. System Architecture

The system developed throughout this research can be described as an integrated Semantic Web system. By the word "integrated", we mean that the system is comprised of more than one component, each with a specific task. These components are: the Data Transformer, the Data Collector, the Inference Rules Editor, and the Query Tool, in addition to the Intelligent Database System itself, which can be considered the core of the system. Figure 1 represents a schematic of these components and their relationships.

[Figure 1 (schematic): the User Interface sends queries to the Query Tool and receives query results; the Domain Expert adds rules through the Inference Rules Editor; the Data Transformer converts unstructured HTML data into intermediate XML documents; the Data Collector loads XML documents into the Intelligent Database.]

Fig. 1 – System Architecture

2.1. Data Transformer

This component is responsible for transforming the data on the Web from its unstructured HTML format into a semi-structured format (e.g. XML) that can be easily processed by the other system components. The transformation process depends on the application domain to assign the target meaning to the document content. Obviously, if the data already exists in a structured format, it can be passed directly to the next component in the system.

2.2. Data Collector

The data collector is responsible for collecting the data from the semi-structured XML documents and storing it in the database, to form the extensional database of the deductive system.

2.3. Inference Rules Editor

Using this component, the domain expert can always add new inference rules to the knowledge base of the system. Of course, newly added rules only take effect from the moment of their addition.

2.4. Query Tool

This component receives queries built by the user interface and converts them into a logic program that uses the inference rules (intensional database) together with the stored data (extensional database) to determine the result of the query, and passes this result back to the user.
3. Design & Implementation

In this research, we use the techniques and structures of the Semantic Web to create formal definition rules and collect the relevant facts for a kind of unstructured data that is scattered across Web pages. A user can then query this data for information that might not be explicitly stated, and that would otherwise be very difficult to extract.

The first step needed to accomplish this goal is to convert the unstructured data found on the Web into some structured or semi-structured format, so that we can later manipulate this data. We chose XML as an example of a standard semi-structured format that can be used for this purpose. Once the unstructured data is converted into XML, it is easy to deal with later using any programming language or technique. This step needs some knowledge representation to encode the expertise of the domain of application.

The second step is to parse the resulting XML document(s) and export their data into a structured database that serves as the extensional part (EDB) of the deductive database system. This step is not a problem, since XML parsers are widely available. We also had to enumerate a variety of possible inference rules about human relations and store these rules in the intensional database (IDB). These rules are used to drive the deduction process.

Once the user enters a query, it is translated into some kind of logic program (using Prolog, for example) and answered from the facts and rules stored earlier in the database.

3.1. Structuring the Data

As shown in the system architecture, the first step in getting the data is to convert the unstructured text data found on the Web into semi-structured XML documents.
In fact, the technique used to do this can be applied to any other type of document, after making the necessary changes.

The main difficulty here was to find some pattern in the documents that can be followed to determine the context of each token (word or phrase). This pattern is domain dependent, i.e. it differs from one area to another. Sometimes, this pattern is not easy to discover. In our chosen domain of application, the ads are usually semi-structured. To find this pattern, we had to group synonyms (words and expressions with the same meaning) together, and replace them with a unified semantic element. We can summarize the pattern identification algorithm as shown in Fig. 2.
Given:
  A document that can be viewed as a set of phrases P. A phrase is any word
  or expression that has a meaning relative to the context of the
  application domain.
Output:
  A tokenized document that contains only the main semantic elements.
Algorithm:
  1. Let Pi ⊂ P be all the phrases that belong to the same token type Ti;
     1 ≤ i ≤ n
  2. For i = 1 to n:
     2.a) Let Si be a semantic element that represents the token type Ti
     2.b) ∀ p ∈ Pi, replace p with Si in the document
  3. The resulting document only contains the semantic elements Si;
     1 ≤ i ≤ n

Fig. 2 – Pattern Identification Algorithm
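As an illustration, the following minimal Python sketch implements the replacement step of Fig. 2. The phrase table and token-type names here are hypothetical stand-ins for the domain-dependent tables described above; a real table for this application would hold the Arabic phrases.

    # Minimal sketch of the pattern identification algorithm of Fig. 2.
    # The phrase-to-token-type table is a hypothetical example; in the
    # real system it is built by a domain expert for Arabic death-ads.
    PHRASE_TABLE = {
        "we announce the death of": "DIE",
        "wife of": "WIFE_OF",
        "mother of": "MOTHER_OF",
        "Eng.": "JOB_TITLE",
    }

    def tokenize(document: str) -> str:
        # Replace longer phrases first so multi-word expressions
        # are not split by a shorter overlapping phrase.
        for phrase in sorted(PHRASE_TABLE, key=len, reverse=True):
            document = document.replace(phrase, PHRASE_TABLE[phrase])
        return document

    print(tokenize("we announce the death of Mrs. Amina wife of Eng. Hamed"))
    # -> DIE Mrs. Amina WIFE_OF JOB_TITLE Hamed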
Of course, this algorithm has to be applied to a wide variety of documents that belong to the required knowledge domain. The resulting pattern is revised and refined with every added document, until we get the most general pattern possible, one that covers all (or at least the majority) of the cases.

Note that some phrases can actually belong to more than one token type. This can sometimes be determined from the context, but it can also lead to unexpected results. For this reason, the output of this phase is not always 100% correct.

Once we replace each phrase with its token type, we can deduce the pattern of tokens in the document. Then, we parse the selected document and process it according to this pattern. This way, we can correctly extract most (though not necessarily all) of the data in the document.

To convert the extracted data to an XML document, we can use the Document Object Model (DOM) of XML documents. The DOM deals with an XML document as a tree, with each element in the document as a node in the tree, thus preserving the hierarchical structure of XML documents. Many programming languages support the XML DOM and provide means to deal with it.

Note that structuring the data is only necessary because, currently, most Web data is not available in XML format. However, many data providers now tend to provide their data as XML documents as well as HTML pages (as shown in the system architecture above). So, in the near future, it is expected that most (if not all) data will be available in XML format, and no structuring will be needed.
3.2. Grouping the Data

Given an XML document that contains the data, we scan the document and insert the data into the database repository after converting it to a suitable form. Again, at this stage we parse the XML document using a DOM (Document Object Model) parser, in order to easily extract the data and store it in the database.

The question is: when should this step be carried out? It should be invoked for every new XML document that is added to the application domain. So, as we initialize the system, it is invoked for all the documents that we initially have (or those that have been created from unstructured Web documents). Later on, it should be invoked whenever a new document is submitted to the system, in order to consolidate the data in this document with the knowledge base.
3.3. Knowledge Representation: Inference Rules

This part represents the "knowledge base" of the system. It should contain all the rules that define the relationships between the various entities involved in our application domain.

An inference rule is a statement of the form:

    Result ← Fact1, Fact2, …, Factn

The facts on the right side of the rule are called the antecedents, and the result on the left side is called the consequent. This means that the consequent can be inferred from the antecedents. In other words, if the antecedents are all correct, then the consequent is also correct and can be used as an antecedent in other inference rules.

Note that both the antecedents and the consequent are predicates that can take arguments. For example, consider the following inference rule:

    father(x, y) ← male(x), child(y, x)

In this rule, we have the predicates father, male and child, and we have two arguments (or parameters) x and y. The rule means that if x is a male and y is a child of x, then x is the father of y.

These inference rules are needed in order to deduce hidden relationships and to extract implicit information from the stored data. The system gives the ability to add more inference rules later, and they take effect from the moment of their addition to the system.
3.4. Creating and Answering Queries

This step involves translating the user's query into some kind of logic program (using Prolog, for example) that can answer the query from the facts and rules stored earlier. To accomplish this step, we use a Prolog engine that we can call and pass our query to. For this purpose we use the XSB engine [13]. It can access the stored data, process the query, and return all possible results of this query, so that we can present these results to the user in an appropriate way.
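The paper does not detail how the query is handed to XSB. One plausible sketch, assuming an xsb executable on the PATH and a rules-and-facts file named family.P (both assumptions of ours), pipes a goal into the interpreter:

    # Hedged sketch: posing one goal to the XSB interpreter via stdin.
    # Assumes "xsb" is installed on PATH and family.P holds facts/rules.
    import subprocess

    goal = "['family'], father(X, 'Aisha'), write(X), nl, fail."
    result = subprocess.run(["xsb"], input=goal + "\n",
                            capture_output=True, text=True, timeout=30)
    print(result.stdout)  # bindings for X, one per line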
4. The Social Relationship Miner

As an example of a real-life application that can make use of the concepts and techniques discussed earlier, we take the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships that are otherwise difficult to extract.
4.1. The Data Transformer

As mentioned earlier, the first step in getting the data is to convert the unstructured text data found in the ads into semi-structured XML documents. To do this, we had to identify the various semantic elements that comprise the type of documents relevant to our application. For example, the phrases (توفي، توفى، انتقل إلى رحمة الله) all have the same meaning of (die). So whenever we encounter any of these phrases, we determine that we have the DIE element. Other token types include family relations, job descriptions, etc.

Naturally, the tokens and semantic elements differ from one language to another. In our system, we work with documents written in the Arabic language. For the sake of explanation, we translate some of the terms relevant to our topic.
After replacing each phrase with its representative semantic element and discarding insignificant phrases, one can deduce the pattern of tokens in the document.

The state diagram is defined in the application as a set of nodes (states) and an action to take at each state for each encountered token type. This gives the system more flexibility, since it is possible to add more states and define the type of action to be taken in these states for different kinds of tokens. In fact, a domain expert can define a whole new state diagram with a completely new set of nodes and actions. This way, the application can work on different kinds of documents. A table-driven sketch of such a diagram is shown below.
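The concrete encoding of the state diagram is not given in the paper; the following Python table is a hedged sketch of one way to drive the parser from (state, token type) pairs, with invented state names and actions:

    # Hedged sketch of a table-driven state diagram for death-ad parsing.
    # States, token types and actions are illustrative, not the paper's
    # actual tables.
    TRANSITIONS = {
        ("START", "DIE"): ("EXPECT_DECEASED", "begin_ad"),
        ("EXPECT_DECEASED", "NAME"): ("AFTER_DECEASED", "record_deceased"),
        ("AFTER_DECEASED", "WIFE_OF"): ("EXPECT_HUSBAND", "open_wife_relation"),
        ("EXPECT_HUSBAND", "NAME"): ("AFTER_DECEASED", "record_husband"),
    }

    def run(tokens):
        state = "START"
        for token_type, value in tokens:
            state, action = TRANSITIONS[(state, token_type)]
            print(f"{action}({value})")  # a real system would build the DOM here

    run([("DIE", ""), ("NAME", "Amina"),
         ("WIFE_OF", ""), ("NAME", "Hamed El-Borgy")])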
Next, we parse the selected document and process it according to our state diagram. This way, we can correctly extract most (though not necessarily all) of the data in the document, and convert it into XML format. At this point, there has to be some interaction between the system and an expert in the application domain, to review the generated XML document and possibly refine and correct any incorrect data that might have been generated due to a linguistic problem (for example, a word that has two different meanings, and thus might be considered as either of two different token types).

The resulting XML file consists of two main sections. The first section, <persons>, contains the data of every person mentioned in the document. The second section, <relations>, contains the relations between the persons in the first section.
As an example of transforming an unstructured document into XML format, consider the following death-ad:

انتقلت إلى رحمة الله السيدة أمينة حرم المهندس حامد البرجي و والدة المهندسة عائشة حرم المستشار محمود فكري و الدكتور مهندس محمد زوج وفاء بالتليفزيون و المهندس احمد و زينب حرم الدكتور محمد موسى و رشيدة حرم المحاسب فاروق و صفاء حرم المرحوم عمر و السيد صفوت زوج منال و السيد محمود زوج شيرين

Fig. 3 – Sample death-ad

An approximate English translation of this document (only for the sake of explanation) is shown in Fig. 4 below.

We announce the death of Mrs. Amina, the wife of Eng. Hamed El-Borgy, the mother of: Eng. Aisha (wife of Judge Mahmoud Fekry), Dr. Eng. Mohamed (husband of Wafaa, who works in the TV), Eng. Ahmed, Zeinab (wife of Dr. Mohamed Moussa), Rashida (wife of Accountant Farouk) and Safaa (wife of the deceased Omar), Mr. Safwat (husband of Manal), and Mr. Mahmoud (husband of Sherine).

Fig. 4 – English Translation of Figure 3

Also, as we said earlier, this step can convert any type of document, not just death-ads. All we need is to determine the different token types and the state diagram that represents the token pattern of the document, and the rest is straightforward.

Figure 5 shows a portion of the XML document that results from structuring the document in the above example.

4.2. The Data Collector

The data collector is the component responsible for collecting the data from the XML documents that have been created by the Data Transformer (or that are provided directly by data providers) and storing this data in the deductive database system.

Given an XML document that contains the data, the data collector starts by scanning the <persons> section. For every person in this section, the system checks whether or not this person already exists in the database, and inserts the person if necessary. The comparison is done based on name, gender, and job; i.e. if a person with the same data exists in the database, it is assumed to be the same person. We have no other way to differentiate between people who might accidentally have similar data.

Next, the system parses the <relations> section of the XML document, and inserts these relations into the database (again, if they do not already exist), with the respective IDs of the persons involved in these relations.
<ad>
  <persons>
    <person ID="1" gender="female" alive="no"
            title="Mrs." name="Amina"/>
    <person ID="2" gender="male" alive="yes"
            title="Eng." name="Hamed El-Borgy"/>
    <person ID="3" gender="female" alive="yes"
            title="Eng." name="Aisha"/>
    <person ID="4" gender="male" alive="yes"
            name="Mahmoud Fekry" job="Judge"/>
    ...
  </persons>
  <relations>
    <relation type="wife" subjectID="1" objectID="2"/>
    <relation type="mother" subjectID="1" objectID="3"/>
    <relation type="mother" subjectID="1" objectID="5"/>
    <relation type="mother" subjectID="1" objectID="7"/>
    ...
  </relations>
</ad>

Fig. 5 – Portion of the Generated XML file
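A hedged sketch of the collector's logic over the Fig. 5 format might look as follows, using Python's sqlite3 and minidom. The duplicate check on (name, gender, job) follows the description above; the schema, table and file names are our assumptions.

    # Hedged sketch of the Data Collector: load Fig. 5-style XML into SQLite.
    import sqlite3
    from xml.dom.minidom import parse

    db = sqlite3.connect("relations.db")
    db.execute("CREATE TABLE IF NOT EXISTS person(id INTEGER PRIMARY KEY,"
               " name TEXT, gender TEXT, job TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS relation(type TEXT,"
               " subject_id INTEGER, object_id INTEGER)")

    doc = parse("ad.xml")
    id_map = {}  # XML person ID -> database row id
    for p in doc.getElementsByTagName("person"):
        key = (p.getAttribute("name"), p.getAttribute("gender"),
               p.getAttribute("job"))
        row = db.execute("SELECT id FROM person WHERE name=? AND gender=?"
                         " AND job=?", key).fetchone()
        if row is None:  # person not seen before: insert a new row
            row = (db.execute("INSERT INTO person(name, gender, job)"
                              " VALUES (?,?,?)", key).lastrowid,)
        id_map[p.getAttribute("ID")] = row[0]

    # For brevity this sketch omits the "insert only if not already
    # present" check that the text prescribes for relations.
    for r in doc.getElementsByTagName("relation"):
        db.execute("INSERT INTO relation VALUES (?,?,?)",
                   (r.getAttribute("type"), id_map[r.getAttribute("subjectID")],
                    id_map[r.getAttribute("objectID")]))
    db.commit()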
4.3. The Inference Rules

Here, we included as many inference rules as possible. These rules define the human relationships and the inter-relations between these relationships. As pointed out in the system architecture, these inference rules are considered the knowledge base of the system. Domain experts can add rules at any time.

Examples of these relationships include:

    father(x, y) ← child(y, x), male(x).
    uncle(x, y) ← father(z, y), brother(x, z).
    ancestor(x, y) ← ancestor(x, z), parent(z, y).
    sibling(x, y) ← child(x, z), child(y, z), x ≠ y.
    uncle(x, y) ← brother(x, z), parent(z, y).

The above relationships represent a small portion of what we can call "family relationships". However, the system does not only support family relationships. There are also other kinds of relationships, such as:

    superior(x, y) ← superior(x, z), superior(z, y).
    knows(x, y) ← relative(x, y).
    knows(x, y) ← coworker(x, y).

These are only a sample of the inference rules included in the system. Although they might look simple, some relations actually have a lot of cases to consider. For example, if we look at the nephew relation, it can be inferred from any of the following:

    nephew(x, y) ← son(x, z), sibling(z, y).
    nephew(x, y) ← uncle(y, x), male(x).
    nephew(x, y) ← aunt(y, x), male(x).
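For reference, in XSB's concrete syntax the arrow becomes ':-', variables are capitalized, and 'x ≠ y' is written with the not-unifiable operator. A sketch of generating such a rules file (the file name and the selection of rules are illustrative) is:

    # Sketch: emit a few family rules in XSB/Prolog concrete syntax.
    FAMILY_RULES = """\
    father(X, Y) :- child(Y, X), male(X).
    uncle(X, Y) :- father(Z, Y), brother(X, Z).
    sibling(X, Y) :- child(X, Z), child(Y, Z), X \\= Y.
    """

    with open("family.P", "w") as f:  # .P is XSB's conventional suffix
        f.write(FAMILY_RULES)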
These inference rules are needed in order to deduce hidden relationships and to extract implicit information from the stored data. They represent the intensional database of the deductive database system. The system gives the ability to add more inference rules later, and they take effect from the moment of their addition to the system.

4.4. Issuing Queries

It is clear from the name of the application, "Social Relationship Miner", that its main concern is to extract relationships between people. A user who issues queries to the system is mainly interested in one of the following:

• Given two persons, the user might want to know whether a certain relation holds. For example, given the above document, a possible query might be: Is Sherine the wife of Mahmoud? The answer to this kind of query is simply yes or no.

• Given one person, the user might want to know all persons who satisfy a certain relation with the given person, e.g. List all children of Amina. The answer to this kind of query is a list of one or more person names.

• Given two persons, the user might want to determine the relation(s) between them (if any), e.g. What is the relation between Safwat and Manal? The answer to this query is a list of one or more relations ordered by significance (closeness of relation).

To issue a query, the user selects the query type (any of the three categories above) and supplies the parameters (relation types or person names). The system constructs a logic query and passes it to the XSB engine, which executes the query and returns its results back to the user.
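The constructed goals themselves are not shown in the paper; a minimal sketch of the mapping from the three query categories to Prolog goal strings, with assumed predicate spellings, could be:

    # Hedged sketch: map the three query categories to Prolog goal strings.
    def yes_no(relation, a, b):            # category 1: does the relation hold?
        return f"{relation}('{a}', '{b}')."

    def all_satisfying(relation, person):  # category 2: who relates to person?
        return f"{relation}(X, '{person}'), write(X), nl, fail."

    # rel/3 is an assumed helper that names the relation holding between
    # two people, e.g. rel(wife, X, Y) :- wife(X, Y).
    def relations_between(a, b):           # category 3: which relations hold?
        return f"rel(R, '{a}', '{b}'), write(R), nl, fail."

    print(yes_no("wife", "Sherine", "Mahmoud"))  # wife('Sherine', 'Mahmoud').
    print(all_satisfying("child", "Amina"))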
5. Conclusion

In this research, we used an Intelligent Database approach to develop an application that applies the techniques and structures of the Semantic Web and makes use of its benefits. The system is tested through a prototype example of a real-life application using death-advertising documents published on the Internet by Arabic newspapers.

We implemented a system that integrates a number of Semantic Web technologies. The inputs to this system are Internet documents in a specific application domain. These documents can be either unstructured HTML and text documents, or semi-structured XML documents. In the case of unstructured documents, domain experts can define parsing rules for the documents in the form of state diagrams. The system then parses the documents, extracts data items and relations from them, and converts them into semi-structured XML documents. The parsing is carried out according to the user-defined state diagram, which increases system flexibility. In the case of Internet documents that are already available in XML format, the system can transform these documents from one format to another. This is accomplished using user-defined XSLT stylesheets.

Next, once all the data is available in XML format with a defined structure for a given application domain, the system reads the data from the formed XML files and stores it in the database repository. This repository is structured in such a way as to enable intelligent processing and querying of the data and/or relationships.

The Inference Rules Editor allows the domain experts to define inference rules (the intensional database) that define the relationships between data items in the required application domain. These rules are also stored in the database.

We also provide a user interface with which a user can query the system. The user's queries are translated into a logic program that is executed by a logic engine in the intelligent database system. It uses the stored data and inference rules to produce the result of the query, and returns it back to the user.

As a practical example for this system, we introduced a real-life application. We considered the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships.
6. References

[1] J. Heflin, "Towards the Semantic Web: Knowledge Representation in a Dynamic, Distributed Environment", Ph.D. Thesis, University of Maryland, College Park, 2001.
[2] E. Bertino, B. Catania and G. P. Zarri, "Intelligent Database Systems", Addison-Wesley, 2001.
[3] A. Deutsch, M. Fernandez, D. Florescu, A. Levy and D. Suciu, "XML-QL: A Query Language for XML", Proc. 8th Int'l WWW Conference, 1999.
[4] T. T. Chinenyanga and N. Kushmerick, "Expressive Retrieval from XML Documents", Proc. 24th Annual Int'l ACM SIGIR Conf. on Research & Development in Information Retrieval, 2001.
[5] S. Decker, D. Fensel, F. van Harmelen, I. Horrocks, S. Melnik, M. Klein and J. Broekstra, "Knowledge Representation on the Web", Proc. of the 2000 Int'l Workshop on Description Logics, 2000.
[6] S. Decker, F. van Harmelen, J. Broekstra, M. Erdmann, D. Fensel, I. Horrocks, M. Klein and S. Melnik, "The Semantic Web – on the Respective Roles of XML and RDF", IEEE Internet Computing, September/October 2000.
[7] World Wide Web Consortium HTML Specification, http://www.w3.org/TR/REC-html40
[8] World Wide Web Consortium XML Specification, http://www.w3.org/TR/REC-xml
[9] World Wide Web Consortium RDF Specification, http://www.w3.org/TR/REC-rdf-syntax
[10] B. N. Grosof, I. Horrocks, R. Volz and S. Decker, "Description Logic Programs: Combining Logic Programs with Description Logic", Proc. of the 12th International Conference on World Wide Web, May 2003.
[11] G. F. Luger and W. A. Stubblefield, "Artificial Intelligence: Structures and Strategies for Complex Problem Solving", 3rd edition, Addison-Wesley, 1998.
[12] S. Bowness, "Information Highways", April 2004, Vol. 11, No. 3, p. 16.
[13] XSB Engine, http://xsb.sourceforge.net
[14] J. Heflin, J. Hendler and S. Luke, "SHOE: A Knowledge Representation Language for Internet Applications", Technical Report CS-TR-4078 (UMIACS TR-99-71), Dept. of Computer Science, University of Maryland at College Park, 1999.
[15] D. Fensel, S. Decker, M. Erdmann and R. Studer, "Ontobroker: The Very High Idea", Proc. of the 11th International FLAIRS Conference (FLAIRS-98), Sanibel Island, Florida, May 1998.
[16] T. R. Gruber, "A Translation Approach to Portable Ontology Specifications", Knowledge Acquisition, 5(2):199-220, 1993.
[17] D. Fensel, I. Horrocks, F. van Harmelen, D. McGuinness and P. F. Patel-Schneider, "OIL: Ontology Infrastructure to Enable the Semantic Web", IEEE Intelligent Systems, 16(2), 2001.
[18] A. Farquhar, R. Fikes and J. Rice, "The Ontolingua Server: A Tool for Collaborative Ontology Construction", Knowledge Systems Laboratory, 1996.
[19] H. P. Luhn, "The Automatic Creation of Literature Abstracts", IBM Journal of Research and Development, 2, pp. 159-165, 1958.
[20] M. Sharp, "Text Mining", Rutgers University, Communications, Information and Library Science, Seminar in Information Studies, Dec 11, 2001.
