An Intelligent Database Application for the Semantic Web
Amr F. El-Helw                          Hussien H. Aly
amrelhelw@elitemail.org                 alyh@computer.org
Department of Computer Science & Automatic Control, Faculty of Engineering
Alexandria University, EGYPT
Abstract

The Semantic Web is one of the important research fields that has come to light recently. Its major concern is to convert the World Wide Web from a huge repository of unrelated text into useful, linked pieces of information, where the linking is based not only on text similarity but mainly on the meanings and real-world relations between items. In this research, we use the techniques of the Semantic Web to create formal definition rules for a kind of unstructured data, and to query this data for information that might not be explicitly stated and would otherwise be very difficult to extract. For this purpose, we develop an integrated system that can extract data from both unstructured and structured documents, and then answer users' queries using the extracted data together with inference rules that help to deduce more information from this data. As an example of this data, we take the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships.

Keywords: Semantic Web, Intelligent Databases, Deductive Databases, Information Retrieval.

1. Introduction

1.1. Overview

The World-Wide Web (WWW) is a huge repository of information. It contains documents and multimedia resources concerning almost every imaginable subject, in human-usable format. Documents on the Web have cross-references known as links, which represent some sort of relationship between the documents. The majority of these documents are written in the Hyper-Text Markup Language (HTML), a text stream with embedded special tags, most of which are concerned with the organization and presentation of the document.

Due to the huge volume of available data, it is becoming increasingly difficult to locate useful information using the current technology of search engines. Also, users often want to use the Web to do more than just locate a document; they want to perform some task, or perform some inference on its relationships with other documents. Completing these tasks often involves visiting a series of pages, integrating their content and reasoning about them in some way. This is far beyond the capabilities of directories and search engines.

The main problem is that the Web was not initially designed to be processed by machines. Documents on the Web do not provide any information that helps a machine determine what the text means.

The Semantic Web is not a technology by itself. In fact, it is a group of inter-related technologies that may be used (individually or together) to produce the desired outcome. Among these technologies and concepts are Ontologies, Data Mining, Deductive Databases, Artificial Intelligence, Man-Machine Interfaces, and others.

The Semantic Web may allow users to organize and browse the Web in ways more suitable to the problems they have at hand. It could be used to impose a conceptual filter on a set of web pages and to display their relationships based on such a filter. It may also allow visualization of complex content. With HTML, such interfaces are virtually impossible, since it is difficult to extract meaning from the text. The major concern of the Semantic Web is to convert the World Wide Web from a huge repository of unrelated text into useful, linked pieces of information, linked not only by text similarity but mainly by the meanings and real-world relations between items.

1.2. Objective

Currently, there is no universal application of the Semantic Web, and it seems that applications will be domain dependent, as stated in [12]: "the Semantic Web currently lacks a business case. Even if one were to emerge, there is no guarantee that the Semantic Web will become universal".

In this research, we aim to use an Intelligent Database approach to develop a real-life application that applies the techniques and structures of the Semantic Web and explores their potential use. We consider the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships. While this may seem a classic problem in the deductive database area of research, the problem is not the deduction process but how to collect the data from the scattered, unstructured Web pages into the extensional part of the deductive database.

In order to achieve this goal, we have to translate the existing documents on the Web into a proper model that represents the semantics and relationships between these documents, design a suitable inference engine to navigate through the documents, and provide an interface query language to drive this engine.

2. System Architecture

The system developed throughout this research can be described as an integrated Semantic Web system. By the word "integrated", we mean that the system is comprised of more than one component, each with a specific task. These components are: the Data Transformer, the Data Collector, the Inference Rules Editor, and the Query Tool, in addition to the Intelligent Database System itself, which can be considered the core of the system. Figure 1 represents a schematic of these components and their relationships.

[Figure 1 (schematic): the Data Transformer converts unstructured HTML data into intermediate XML documents; the Data Collector loads XML documents into the Intelligent Database; the domain expert maintains rules through the Inference Rules Editor; users pose queries and receive results through the Query Tool in the user interface.]

Fig. 1 – System Architecture

2.1. Data Transformer

This component is responsible for transforming the data on the Web from its unstructured HTML format into a semi-structured format (e.g. XML) that can be easily processed by the other system components. The transformation process depends on the application domain, in order to assign the target meaning to the document content. Obviously, if the data already exists in a structured format, it can be passed directly to the next component in the system.

2.2. Data Collector

The data collector is responsible for collecting the data from the semi-structured XML documents and storing it in the database, to form the extensional database of the deductive system.

2.3. Inference Rules Editor

Using this component, the domain expert can add new inference rules to the knowledge base of the system at any time. Of course, newly added rules only take effect from the moment of their addition.

2.4. Query Tool

This component receives queries built through the user interface and converts them into a logic program that uses the inference rules (intensional database) together with the stored data (extensional database) to determine the result of the query, and passes this result back to the user.

3. Design & Implementation

In this research, we use the techniques and structures of the Semantic Web to create formal definition rules and to collect the relevant facts for a kind of unstructured data scattered in Web pages. A user can then query this data for information that might not be explicitly stated, and that would otherwise be very difficult to extract.

The first step needed to accomplish this goal is to convert the unstructured data found on the Web into some structured or semi-structured format, so that we can later manipulate this data. We chose XML as an example of a standard semi-structured format that can be used for this purpose. Once the unstructured data is converted into XML, it is easy to deal with it using any programming language or technique. This step needs some knowledge representation to encode the expertise of the domain of application.

The second step is to parse the resulting XML document(s) and export their data into a structured database to be used as the extensional part (EDB) of the deductive database system. This step is not a problem, since XML parsers are widely available. We also had to enumerate a variety of possible inference rules about human relations and store these rules in the intensional database (IDB). These rules are used to drive the deduction process.

Once the user enters a query, it is translated into some kind of logic program (using Prolog, for example) and answered from the facts and rules stored earlier in the database.
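The first of these steps rests on grouping words and expressions with the same meaning and replacing them with a unified semantic element (detailed in Section 3.1). A minimal sketch of that idea follows; it is illustrative only, and the phrase lists are hypothetical English stand-ins for the Arabic phrases the actual system handles:

```python
# Illustrative sketch of the synonym-grouping step (Section 3.1).
# The phrase lists below are hypothetical English stand-ins for the
# Arabic phrases handled by the actual system.

SEMANTIC_ELEMENTS = {
    "DIE":    ["we announce the death of", "passed away", "died"],
    "WIFE":   ["wife of"],
    "MOTHER": ["mother of"],
}

def tokenize(text: str) -> str:
    """Replace every known phrase with its unified semantic element."""
    out = text.lower()
    # Try longer phrases first, so longer matches win over shorter overlaps.
    pairs = sorted(
        ((phrase, element)
         for element, phrases in SEMANTIC_ELEMENTS.items()
         for phrase in phrases),
        key=lambda pe: -len(pe[0]),
    )
    for phrase, element in pairs:
        out = out.replace(phrase, element)
    return out

print(tokenize("We announce the death of Mrs. Amina, wife of Eng. Hamed"))
# -> DIE mrs. amina, WIFE eng. hamed
```

In the real system this replacement is governed by the pattern-identification algorithm of Fig. 2, applied over many documents and refined as new documents arrive.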
3.1. Structuring the Data

As shown in the system architecture, the first step in getting the data is to convert the unstructured text data found on the Web into semi-structured XML documents. In fact, the technique used to do this can be applied to any other type of document, after making the necessary changes.

The main difficulty here was to find some pattern in the documents that can be followed to determine the context of each token (word or phrase). This pattern is domain dependent, i.e. it differs from one area to another, and sometimes it is not easy to discover. In our chosen domain of application, the ads are usually semi-structured. To find this pattern, we had to group synonyms (words and expressions with the same meaning) together, and replace them with a unified semantic element. The algorithm for identifying the pattern is summarized in Fig. 2.

Given:    A document that can be viewed as a set of phrases P. A phrase is any word or
          expression that has a meaning relative to the context of the application domain.
Output:   A tokenized document that contains only the main semantic elements.
Algorithm:
  1. Let Pi ⊂ P be all the phrases that belong to the same token type Ti; 1 ≤ i ≤ n.
  2. For i = 1 to n:
     2.a) Let Si be a semantic element that represents the token type Ti.
     2.b) ∀ p ∈ Pi, replace p with Si in the document.
  3. The resulting document only contains the semantic elements Si; 1 ≤ i ≤ n.

Fig. 2 – Pattern Identification Algorithm

Of course, this algorithm has to be applied to a wide variety of documents that belong to the required knowledge domain. The resulting pattern is revised and refined with every added document, until we get the most general possible pattern, one that covers all (or at least the majority) of the cases.

Note that some phrases can actually belong to more than one token type. This can sometimes be determined from the context, but it can also lead to unexpected results. For this reason, the output of this phase is not always 100% correct.

Once we replace each phrase with its token type, we can deduce the pattern of tokens in the document. Then we parse the selected document and process it according to this pattern. This way, we can correctly extract most (though not necessarily all) of the data in the document.

To convert the extracted data into an XML document, we use the Document Object Model (DOM). The DOM deals with XML as a tree, with each element in the document as a node in the tree, thus preserving the hierarchical structure of XML documents. Many programming languages support the XML DOM and provide means to deal with it. At this point, there has to be some interaction between the system and an expert in the application domain, to review the generated XML document and possibly refine and correct any incorrect data that might have been generated due to a linguistic problem (for example, a word that has two different meanings, and thus might be assigned either of two different token types).

Note that structuring the data is only necessary because, currently, most Web data is not available in XML format. However, many data providers now tend to provide their data as XML documents as well as HTML pages (as shown in the system architecture above). So, in the near future, it is expected that most (if not all) data will be available in XML format, and no structuring will be needed.

3.2. Grouping the Data

Given an XML document that contains the data, we scan the document and insert the data into the database repository after converting it to the suitable form. Again, in this stage we parse the XML document using a DOM (Document Object Model) parser, in order to easily extract the data and store it in the database.

When should this step be carried out? We should invoke it for every new XML document that is added to the application domain. So, as we initialize the system, it is invoked for all the documents that we initially have (or those that have been created from unstructured web documents). Later on, it is invoked whenever a new document is submitted to the system, in order to consolidate the data in this document with the knowledge base.

3.3. Knowledge Representation: Inference Rules

This part represents the "knowledge base" of the system. It should contain all the rules that define the relationships between the various entities involved in our application domain. An inference rule is a statement of the form:

    Result ← Fact1, Fact2, …, Factn

The facts on the right side of the rule are called the antecedent(s), and the result on the left side is called the consequent. This means that the consequent can be inferred from the antecedent(s). In other words, if the antecedents are all correct then the consequent is also correct, and it can in turn be used as an antecedent in other inference rules.

Note that both the antecedents and the consequent are predicates that can take arguments. For example, consider the following inference rule:

    father(x, y) ← male(x), child(y, x)

In this rule, we have the predicates father, male and child, and two arguments (or parameters) x and y. The rule means that if x is a male and y is the child of x, then x is the father of y.

These inference rules are needed in order to deduce hidden relationships and to extract implicit information from the stored data. The system gives the ability to add more inference rules later, and they take effect from the moment of their addition to the system.
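To make the rule semantics concrete, here is a toy forward-chaining evaluation of the father rule above. This is only an illustration with made-up names; the actual system evaluates such rules in a Prolog-style deductive engine rather than in hand-written code:

```python
# Toy forward-chaining evaluation of: father(x, y) <- male(x), child(y, x).
# Illustration only; names are made-up examples, and the real system
# evaluates its rules in a Prolog-style engine.

# Facts are (predicate, args) tuples.
facts = {
    ("male", ("hamed",)),
    ("child", ("aisha", "hamed")),
    ("child", ("aisha", "amina")),
}

def derive_father(facts):
    """Apply father(x, y) <- male(x), child(y, x) to the current facts."""
    males = {args[0] for pred, args in facts if pred == "male"}
    derived = set()
    for pred, args in facts:
        if pred == "child":
            y, x = args
            if x in males:
                derived.add(("father", (x, y)))
    return derived

# Iterate to a fixpoint: keep applying the rule until nothing new appears.
while True:
    new = derive_father(facts) - facts
    if not new:
        break
    facts |= new

print(("father", ("hamed", "aisha")) in facts)  # True
print(("father", ("amina", "aisha")) in facts)  # False: amina is not male
```

The fixpoint loop matters once rules feed each other (e.g. the recursive ancestor rule in Section 4.3), since a consequent derived in one pass can serve as an antecedent in the next.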
3.4. Creating and Answering Queries

This step involves translating the user's query into some kind of logic program (using Prolog, for example) that can answer the query from the facts and rules stored earlier. To accomplish this, we use a Prolog engine that we can call and pass our query to. For this purpose we use the XSB engine [13]. It can access the stored data, process the query, and return all possible results, so that we can present them to the user in an appropriate way.

4. The Social Relationship Miner

As an example of a real-life application that can make use of the concepts and techniques discussed earlier, we take the social relationships found in death-ads in newspapers. These ads contain both implicit and explicit information about various relationships among people. These relationships, if arranged and organized in a proper way, can be used to extract and infer other hidden, not explicitly mentioned relationships that would otherwise be difficult to extract.

The XML file resulting from transforming an ad consists of two main sections. The first section, <persons>, contains the data of every person mentioned in the document. The second section, <relations>, contains the relations between the persons in the first section.

As an example of transforming an unstructured document into XML format, consider the following death-ad:

[Arabic text of the ad]

Fig. 3 – Sample death-ad

An approximate English translation of this document (given only for the sake of explanation) is shown in Fig. 4.
We announce the death of Mrs. Amina, the wife of Eng. Hamed El-Borgy and the mother of: Eng. Aisha (wife of Judge Mahmoud Fekry), Dr. Eng. Mohamed (husband of Wafaa, who works in TV), Eng. Ahmed, Zeinab (wife of Dr. Mohamed Moussa), Rashida (wife of Accountant Farouk), Safaa (wife of the deceased Omar), Mr. Safwat (husband of Manal), and Mr. Mahmoud (husband of Sherine).

Fig. 4 – English Translation of Figure 3

4.1. The Data Transformer

As mentioned earlier, the first step in getting the data is to convert the unstructured text of the ads into semi-structured XML documents. To do this, we had to identify the various semantic elements that comprise the particular type of document relevant to our application. For example, the phrases (توفي، توفى، انتقل إلى رحمة الله) all carry the same meaning of (die); whenever we encounter any of these phrases, we determine that we have the DIE element. Other token types include family relations, job descriptions, etc.

Naturally, the tokens and semantic elements differ from one language to another. In our system, we work with documents written in the Arabic language; for the sake of explanation, we translate some of the terms relevant to our topic.

After replacing each phrase with its representative semantic element and discarding insignificant phrases, one can deduce the pattern of tokens in the document. The state diagram that represents this pattern is defined in the application as a set of nodes (states) and an action to take, at each state, for each encountered token type. This gives the system more flexibility, since it is possible to add more states and to define the action to be taken in these states for different kinds of tokens. In fact, a domain expert can define a whole new state diagram, with a completely new set of nodes and actions. This way, the application can work on different kinds of documents.

Next, we parse the selected document and process it according to our state diagram. This way, we can correctly extract most (though not necessarily all) of the data in the document.

Also, as we said earlier, this step can convert any type of document, not just death-ads. All we need is to determine the different token types and the state diagram that represents the token pattern of the document; the rest is straightforward. Figure 5 shows a portion of the XML document that results from structuring the document in the above example.

4.2. The Data Collector

The data collector is the component responsible for collecting the data from the XML documents that have been created by the Data Transformer (or that are provided directly by data providers) and storing this data in the deductive database system.

Given an XML document that contains the data, the data collector starts by scanning the <persons> section. For every person in this section, the system checks whether or not this person already exists in the database, and inserts the person if necessary. The comparison is based on the name, gender, and job; i.e. if a person with the same data exists in the database, it is assumed to be the same person. We have no other way to differentiate between people who might accidentally have similar data.

Next, the system parses the <relations> section of the XML document and inserts these relations into the database (again, only if they do not already exist), with the respective IDs of the persons involved in these relations.

<ad>
  <persons>
    <person ID="1" gender="female" alive="no" title="Mrs." name="Amina"/>
    <person ID="2" gender="male" alive="yes" title="Eng." name="Hamed El-Borgy"/>
    <person ID="3" gender="female" alive="yes" title="Eng." name="Aisha"/>
    <person ID="4" gender="male" alive="yes" name="Mahmoud Fekry" job="Judge"/>
    ...
  </persons>
  <relations>
    <relation type="wife" subjectID="1" objectID="2"/>
    <relation type="mother" subjectID="1" objectID="3"/>
    <relation type="mother" subjectID="1" objectID="5"/>
    <relation type="mother" subjectID="1" objectID="7"/>
    ...
  </relations>
</ad>

Fig. 5 – Portion of the Generated XML file
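The collection procedure of Section 4.2 can be sketched as follows for documents shaped like Fig. 5, using Python's standard DOM parser and an in-memory SQLite store. The relational schema is an assumption made for illustration, since the paper does not specify the repository's actual layout:

```python
# Sketch of the Data Collector for documents shaped like Fig. 5 (Section 4.2).
# The relational schema here is assumed for illustration; the paper does not
# specify the repository's actual layout.
import sqlite3
from xml.dom.minidom import parseString

XML = """<ad>
  <persons>
    <person ID="1" gender="female" alive="no" title="Mrs." name="Amina"/>
    <person ID="2" gender="male" alive="yes" title="Eng." name="Hamed El-Borgy"/>
  </persons>
  <relations>
    <relation type="wife" subjectID="1" objectID="2"/>
  </relations>
</ad>"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (id INTEGER PRIMARY KEY,"
           " name TEXT, gender TEXT, job TEXT)")
db.execute("CREATE TABLE relation (type TEXT, subject INTEGER, object INTEGER,"
           " UNIQUE (type, subject, object))")

doc = parseString(XML)
for p in doc.getElementsByTagName("person"):
    # getAttribute() returns "" for a missing attribute (e.g. no job given).
    name, gender, job = (p.getAttribute(a) for a in ("name", "gender", "job"))
    # Dedup rule from Section 4.2: same name, gender and job => same person.
    # A real collector would also remap document-local IDs to repository IDs.
    exists = db.execute("SELECT 1 FROM person WHERE name = ? AND gender = ?"
                        " AND job = ?", (name, gender, job)).fetchone()
    if exists is None:
        db.execute("INSERT INTO person VALUES (?, ?, ?, ?)",
                   (int(p.getAttribute("ID")), name, gender, job))

for r in doc.getElementsByTagName("relation"):
    # UNIQUE constraint + OR IGNORE: skip relations that are already stored.
    db.execute("INSERT OR IGNORE INTO relation VALUES (?, ?, ?)",
               (r.getAttribute("type"),
                int(r.getAttribute("subjectID")),
                int(r.getAttribute("objectID"))))

print(db.execute("SELECT COUNT(*) FROM person").fetchone()[0])  # 2
```

The UNIQUE constraint together with INSERT OR IGNORE mirrors the "insert only if not already present" behavior described above for relations.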
4.3. The Inference Rules

Here, we included as many inference rules as possible. These rules define the human relationships and the inter-relations between them. As pointed out in the system architecture, these inference rules are considered the knowledge base of the system. Domain experts can add rules at any time. Examples of these relationships include:

    father (x, y) ← child (y, x), male (x).
    uncle (x, y) ← father (z, y), brother (x, z).
    ancestor (x, y) ← ancestor (x, z), parent (z, y).
    sibling (x, y) ← child (x, z), child (y, z), x ≠ y.
    uncle (x, y) ← brother (x, z), parent (z, y).

The above relationships represent a small portion of what we can call "family relationships". However, the system does not only support family relationships; there are also other kinds of relationships, such as:

    superior (x, y) ← superior (x, z), superior (z, y).
    knows (x, y) ← relative (x, y).
    knows (x, y) ← coworker (x, y).

These are only a sample of the inference rules included in the system. Although they might look simple, some relations actually have many cases to consider. For example, the nephew relation can be inferred from any of the following:

    nephew (x, y) ← son (x, z), sibling (z, y).
    nephew (x, y) ← uncle (y, x), male (x).
    nephew (x, y) ← aunt (y, x), male (x).

These inference rules are needed in order to deduce hidden relationships and to extract implicit information from the stored data. They represent the intensional database of the deductive database system. The system gives the ability to add more inference rules later, and they take effect from the moment of their addition to the system.

4.4. Issuing Queries

It is clear from the name of the application, "Social Relationship Miner", that its main concern is to extract relationships between people. A user who issues queries to the system is mainly interested in one of the following:

• Given two persons, the user might want to know whether a certain relation holds. For example, given the above document, a possible query might be: Is Sherine the wife of Mahmoud? The answer to this kind of query is simply yes or no.

• Given one person, the user might want to know all persons who satisfy a certain relation with the given person, e.g. List all children of Amina. The answer to this kind of query is a list of one or more person names.

• Given two persons, the user might want to determine the relation(s) between them (if any), e.g. What is the relation between Safwat and Manal? The answer to this query is a list of one or more relations, ordered by significance (closeness of relation).

To issue a query, the user selects the query type (any of the three categories above) and supplies the parameters (relation types or person names). The system constructs a logic query and passes it to the XSB engine, which executes the query and returns its results back to the user.

5. Conclusion

In this research, we used an Intelligent Database approach to develop an application that applies the techniques and structures of the Semantic Web and makes use of its benefits. The system was tested through a prototype of a real-life application, using death advertising documents published on the Internet in Arabic newspapers.

We implemented a system that integrates a number of Semantic Web technologies. The inputs to this system are Internet documents in a specific application domain. These documents can be either unstructured HTML and text documents, or semi-structured XML documents. In the case of unstructured documents, domain experts can define parsing rules for the documents, in the form of state diagrams. The system then parses the documents, extracts data items and relations from them, and converts them into semi-structured XML documents.
The parsing is carried out nephew (x, y) Å aunt (y, x), male (x). according to the user-defined state diagram, which increases system flexibility. In case of Internet [9] World Wide Web Consortium RDF Specification, documents that are already available in XML format, http://www.w3.org/TR/REC-rdf-syntax the system can transform these documents from one [10] B. N. Grosof, I. Horrocks, R. Volz and S. Decker, format to another. This is also accomplished using user- "Description Logic Programs: Combining Logic defined XSLT stylesheet. Programs with Description Logic", Proc. of the 12th Next, once all the data is available in XML format of International Conference on World Wide Web), a defined structure for a given application domain, the May 2003. system reads the data from the formed XML files, and [11] G. F. Luger and W. A. Stubblefield, "Artificial stores them into the database repository. This repository Intelligence: Structures and Strategies for Complex is structured in such a way to enable intelligent Problem Solving", 3rd edition, Addison-Wesley, processing and querying of the data and/or 1998. relationships. [12] S. Bowness, "Information Highways", April 2004, Vol. 11, No. 3, P. 16. The Inference Rules Editor allows the domain [13] XSB Engine, http://xsb.sourceforge.net experts to define inference rules (Intensional Database) [14] J. Heflin, J. Hendler, and S. Luke, "SHOE: A that define the relationships between data items in the Knowledge Representation Language for Internet required application domain. These rules are also stored Applications", Technical Report CS-TR-4078 into the database. (UMIACS TR-99-71), Dept. of Computer Science, We also provide a user interface with which a user University of Maryland at College Park. 1999. can query the system. The users' queries are translated [15] D. Fensel, S. Decker, M. Erdmann, and R. Studer: into a logic program that is executed by a logic engine "Ontobroker: The Very High Idea". Proc. 
of the in the intelligent database system. It uses the stored data 11th International Flairs Conference (FLAIRS-98), and inference rules to produce the result of this query, Sanibal Island, Florida, May 1998. and return it back to the user. [16] T. R. Gruber. "A Translation Approach to Portable As a practical example for this system, we Ontology Specifications". Knowledge Acquisition, introduced a real-life application. We considered the 5(2):199-220, 1993. social relationships found in death-ads in newspapers. [17] D. Fensel, I. Horrocks, F. van Harmelen, D. These ads contain both implicit and explicit information McGuinness, and P. F. Patel-Schneider, "OIL: about various relationships among people. These Ontology Infrastructure to Enable the Semantic relationships, if arranged and organized in a proper Web", IEEE Intelligent System, 16(2), 2001. way, can be used to extract and infer other hidden, not [18] A. Farquhar, R. Fikes & J. Rice, "The Ontolingua explicitly mentioned relationships. Server: A Tool for Collaborative Ontology Construction", Knowledge Systems Laboratory, 1996. 6. References [19] H. P. Luhn, "The automatic creation of literature [1] J. Heflin, "Towards the Semantic Web: Knowledge abstracts", IBM Journal of Research and Representation in a Dynamic, Distributed Development, 2, pp. 159-165, 1958. Environment", Ph.D. Thesis, University of [20] M. Sharp, "Text mining", Rutgers University, Maryland, College Park, 2001. Communications, Information and library Science, [2] E. Bertino, B. Catania and G. P. Zarri, "Intelligent Seminar in Information studies, Dec 11, 2001. Database Systems", Addison-Wesley, 2001 [3] A. Deutsch, M. Fernandez, D. Florescu, A. Levy and D. Suciu, "XML-QL: A Query Language for XML", Proc. 8th Int'l WWW Conference, 1999. [4] T. T. Chinenyanga and N. Kushmerick, "Expressive Retrieval from XML Documents", Proc. 24th Annual Int'l ACM SIGIR conf. on Research & Development in Information Retrieval, 2001. [5] S. Decker, D. Fensel, F. 
van Harmelen, I. Horrocks, S. Melnik, M. Klein and J. Broekstra, "Knowledge Representation on the Web", Proc. of the 2000 Int'l Workshop on Description Logics, 2000. [6] S. Decker, F. van Harmelen, J. Broekstra, M. Erdmann, D. Fensel, I. Horrocks, M. Klein, and S. Melnik, "The Semantic Web – on the Respective Roles of XML and RDF", IEEE Internet Computing, September/October 2000. [7] World Wide Web Consortium HTML Specification, http://www.w3.org/TR/REC-html40 [8] World Wide Web Consortium XML Specification, http://www.w3.org/TR/REC-xml