
The Use of Knowledge Networks in the Biomedical Sciences and Parallel Processing Solutions in Bioinformatics and in Human Genome Research

Masters Thesis
Summer and Fall 2011

Advisors: Prof. Christian Bach, Prof. Hassan Bajwa

Student Name: Tamas Ragoncsa
Student ID: 870393
Email: tragoncs@bridgeport.edu
Section: TCMG-597A


Table of Contents

1 Executive Summary
2 Introduction
3 Background
    3.1 Problems and issues
    3.2 The Intellectual Bandwidth Model
    3.3 Organizational structure to improve collaboration
    3.4 How to extend the Intellectual Bandwidth: case study from the healthcare industry
4 The Knet environment
    4.1 Knet design concepts
    4.2 Knowledge Network structures
    4.3 Knowledge cube analysis
    4.4 Conclusion
5 Technical Issues
    5.1 Technical Issues 1: Ontologies
    5.2 Technical Issues 2: Text mining
6 Case study: HuGE tool
    6.1 Goals of the project
    6.2 HDFS
    6.3 HBase
    6.4 MapReduce
    6.5 LCS
    6.6 Zinc-Finger-Nuclease (ZFN) binding sites on the human genome
    6.7 HuGE tool
    6.8 Benefits of Knet concepts in Human Genome exploration
7 Searching for possible improvements and the choice of parallel processing
    7.1 Parallel processing in bioinformatics: goals and plan
8 Parallel Processing as a computational challenge
    8.1 Hadoop development environment on a single PC
        8.1.1 Prerequisites
        8.1.2 Single Node setup
        8.1.3 Pseudo distributed environment
        8.1.4 Testing the HDFS
        8.1.5 Further information about your Hadoop system
    8.2 Mapreduce in a development environment
        8.2.1 Prerequisites
        8.2.2 WordCount example
        8.2.3 Development and build of MapReduce applications
        8.2.4 Compilation and build of the WordCount application
        8.2.5 Execute and monitor the Wordcount example
    8.3 Column oriented database considerations
        8.3.1 HBase installation challenges
        8.3.2 Out of the box pre built bundles (CDH3)
    8.4 Amazon AWS
        8.4.1 Amazon EC2
            8.4.1.1 Creating an AWS account
            8.4.1.2 Creating an EC2 instance
            8.4.1.3 Accessing an EC2 instance
            8.4.1.4 Accessing a web server running on an EC2 instance
        8.4.2 Amazon EBS and S3
            8.4.2.1 Creation of an EBS
            8.4.2.2 Attaching an EBS to an EC2 instance
            8.4.2.3 Making an EBS available for an instance
            8.4.2.4 Creating a bucket and uploading data to S3
            8.4.2.5 Accessing S3 from an EC2 instance
        8.4.3 Amazon Elastic Mapreduce
            8.4.3.1 Starting a sample Mapreduce application provided by AWS
            8.4.3.2 Monitoring the sample Mapreduce application
            8.4.3.3 Run the WordCount example using Elastic Mapreduce
9 Subsequence search in human genome
    9.1 Existing solutions
        9.1.1 LCS
        9.1.2 KMP (Knuth-Morris Pratt discrete algorithm)
        9.1.3 RMAP
    9.2 Creating a solution
        9.2.1 Requirements for a working prototype
        9.2.2 Subsequence search algorithm
    9.3 Implementation
    9.4 Test and measurement of the solution
        9.4.1 Test 1: searching Chromosome 1
        9.4.2 Test 2: searching the whole human genome
    9.5 Benefits of the solution
    9.6 Possible extensions of the solution
10 Conclusion about parallel programming and human genome research
11 Conclusion about knowledge networks in bioinformatics
12 Appendix
    12.1 Sample Java source code of the KMP algorithm
    12.2 Genome splitter with overlap - implementation in Perl
    12.3 Mapreduce implementation of simple subsequence search in Java
    12.4 Improved Genome splitter handling multiple chromosomes - implementation in Perl
13 List of figures and tables
14 References


1 Executive Summary

The importance of knowledge management is growing in almost every industry. This is especially true of biomedical science, one of the most rapidly expanding research areas. Biomedical science generates such a huge amount of data and such a high volume of research results that it is impossible to understand, manage and make sense of them using traditional methods (such as each researcher simply searching for and reading articles of interest). Fortunately, there are more and more technological advances in the field of information management and retrieval (ontologies, text mining, the semantic web) that can help us overcome these problems. In addition, the processing power needed is more available than ever because of enabling technologies like cloud computing.

However, neither the advanced methods nor the processing power can help without collaboration among researchers. If people do not have the intention and willingness to cooperate, or are not motivated because they cannot see why cooperation is reasonable for them, even the most promising technological advances are bound to fail. Consequently, to gain better control over biomedical research and knowledge management we need both the appropriate technologies and collaboration among people. This is what knowledge networks are about.

The second part of this document deals with the problems and challenges of human genome research. Genome research is a very good example of a field that can gain significant benefits from more effective collaboration among researchers. Its importance in the health care industry is unquestionable, and the problems researchers face demand revolutionary new concepts and principles. Human genome research is tightly coupled with the mathematical problem of subsequence search in very long sequences. The related problems result in highly computation-intensive tasks, so distributed systems, parallel processing and cloud computing seem to be required when dealing with them. In fact, these technologies can also be applied in the field of knowledge networks, where researchers likewise face significant computational challenges.


2 Introduction

The first part of this paper presents a short description of the intellectual bandwidth (IB) model, which is an adequate tool to model and solve problems related to biomedical research and development. Using a business case, it shows possible ways of improving the intellectual bandwidth of an organization. The key message of the IB model is the relation between technological advancement and the level of collaboration, and how these two factors together ensure organizational efficiency (a large intellectual bandwidth). This paper also presents an approach to handling the challenges in question: the Knet approach. The Knet concepts presented here address problems related to biomedical research; however, they are applicable to the research methods of any other field as well. The advantage of Knet is that it focuses neither on technology-related problems alone nor on collaboration among researchers alone, but gives a sound solution to both. In the last part of this paper we present a case study, a bioinformatics research project. The case study is intended to demonstrate the potential of the Knet concepts: how they can enable more effective research through better use of technology and a higher level of collaboration.

3 Background

3.1 Problems and issues

Most of the problems related to biomedical research are known, and numerous research papers define these problems and propose solutions to one or more of them. To compile a necessarily incomplete but representative list of these issues, we have to look at papers on information management and retrieval in the field of biomedical science. The following table presents such a list, which gives us a good starting point for understanding these problems.
Table 1 - Problems related to information management and retrieval in biomedical science

| Problem | Source |
| --- | --- |
| Difficulties of data management | (Brusic & Ranganathan, 2008) |
| Difficulties of data analysis | (Brusic & Ranganathan, 2008) |
| Difficulties of data modeling | (Brusic & Ranganathan, 2008) |
| Huge volume of data (considered huge for both human and machine) | (Brusic & Ranganathan, 2008) |
| Isolation of the various communities involved. For example, different research groups building ontologies are isolated from each other, and researchers building ontologies are isolated from their intended group of users. | (Rubin, Shah, & Noy, 2008) |
| The intention of data integration leads to the creation of ontologies. But the various ontologies need to be integrated too. | (B. Smith et al., 2007) |
| Improved evaluation methodologies are needed for the improvement of text mining. | (Baumgartner, Cohen, & Hunter, 2008) |
| Bioinformatics tools for exploratory analysis are capable of finding relationships of interest but not deep causal insights; consequently they are not enough to formulate hypotheses. | (Mirel, 2009) |
| Bioinformatics relies heavily on the Web. But the Web is geared towards human interaction rather than automated processing. (A Semantic Web approach can help.) | (Chandrasekaran & Piramanayagam, 2010) |
| Information organization and exchange in web-based biomedical communities is usually unstructured. Specialized software to create such communities at low cost has been largely lacking. | (Das et al., 2009) |
| Question of the provenance of information. Not all the data available is reliable. | (Zhao, Miles, Klyne, & Shotton, 2009) |
| Web-based communities should be coupled with the semantic web. | (Neumann & Prusak, 2007) |
| Extracting data from various data repositories (each with a unique schema) is a tedious process. Consolidating such diverse data using data warehouses is also very complex and has its limitations. | (Manning, Aggarwal, Gao, & Tucker-Kellogg, 2009) |
| Integration tools and user interfaces have to be able to hide the differences (location, access protocols, different schemas) from the user; furthermore, they have to filter out extraneous information and highlight key relationships. | (Quan, 2007) |
| Importance of recording the provenance of an experiment: what was done, where, how and why. | (Robert Stevens, Zhao, & Goble, 2007) |
| New knowledge is produced at a continuously increasing speed, and the list of papers, databases and other knowledge sources that a researcher in the life sciences needs to cope with is actually turning into a problem rather than an asset. | (Antezana, Kuiper, & Mironov, 2009) |
| Lack of search engine technology that searches and cross-links all the different data types in the life sciences. | (Lütjohann et al., 2011) |
| Relevance ranking: it is not the number of query results that matters, but their relevance. | (Lange et al., 2010) |
| No standardized approach for parsing the output that the numerous bioinformatics tools generate. | (Groth et al., 2010) |
| Metadata standards need stability, but science moves constantly. Maintaining both the metadata standards and all the code that is required to make them useful is a non-trivial problem. | (Fogh et al., 2010) |

The table above lists a wide selection of problems and issues. All of them are related to technology, to the collaboration of people working on related tasks, or to both. To model and demonstrate these two important aspects of the issues in question, the next section gives a brief summary of the intellectual bandwidth model.

3.2 The Intellectual Bandwidth Model

Why do people form groups and organizations? Why do they make cooperative efforts? Because they believe they can create mutual prosperity. Each individual in the group or organization can benefit from the collaboration, and so does the group or organization as a whole. Technology works as a catalyst in this process. As Nunamaker writes, "All technology will become collaborative technology" (J. F. Nunamaker, Briggs, & de Vreede, 2000). The purpose of technology is to create value that would not be possible or economical to create without it. To maximize the value created, technology must utilize the potential in collaboration. That is why technologies are bound to move toward collaboration. When building knowledge networks, our goal is to maximize collaboration through the application of enabling technologies and, eventually, to maximize the value created.

Figure 1 - Relationship of collaboration and technology

The Intellectual Bandwidth (IB) Model proposed by Nunamaker et al. (J. F. Nunamaker et al., 2000) is a tool to describe and explain the relationship above. The model defines the potential intellectual bandwidth (PIB) as the product of the level of information assimilation and the level of collaboration in the organization. The former expresses how sophisticated the available technology is, in other words how much value is added by the systems used for (any kind of) information retrieval. The latter measures how effective the collaborative work is and how much it can boost efficiency.

Technology can increase the added value in various ways. It can simply reduce the time needed to find the right information: for example, it is much faster to write and run a database query than to read through a number of printed documents. Besides reducing the time needed for information retrieval, technology can also make the search more targeted and the results more accurate. This is possible if the data is equipped with semantic metadata, based on which the user can find the relevant information instead of simply looking for a word or phrase in the entire search universe. In the absence of semantic metadata there are various techniques to extract the essence of the information, such as text mining, the use of ontologies, natural language processing (NLP) and named entity recognition (NER) (Cohen & Hersh, 2005; Rubin et al., 2008; B. Smith et al., 2007; Spasic, Ananiadou, McNaught, & Kumar, 2005).

In terms of collaboration, many complex problems simply cannot be covered by one individual. Collaboration is therefore not only an opportunity to perform some tasks more efficiently; in many cases it is a necessity. Collective effort can increase productivity because there are simply more people trying to solve the problem. If the collaboration is well coordinated, the productivity of the group increases further: there is less overlap among the tasks of group members, critical tasks are allocated to the persons most capable of performing them, and results are communicated well within the entire group. Concerted collaboration is the optimal and therefore most productive level of collaboration. Each member's efforts have a positive effect on everyone else's work, which results in a high level of consonance in the overall cooperation (just like in a symphony orchestra). However, the opposite is also true: at this high level of collaboration, any discrepancy in anyone's work can have a negative impact on everyone else's effort as well.
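The product relationship described above can be illustrated with a minimal sketch. The text only states that PIB is the product of the information assimilation level and the collaboration level; the integer scale used here (1 to 3 for collaboration, a positive integer for assimilation) and all names in the code are assumptions made for illustration, not part of the model as published.

```python
from enum import IntEnum


class Collaboration(IntEnum):
    """Collaboration levels named in the text; the 1-3 scale is an assumption."""
    COLLECTIVE = 1
    COORDINATED = 2
    CONCERTED = 3


def potential_intellectual_bandwidth(assimilation: int, collaboration: Collaboration) -> int:
    """PIB as the product of the two levels (ordinal scores, not measured units)."""
    if assimilation < 1:
        raise ValueError("assimilation level must be a positive integer")
    return assimilation * int(collaboration)


# Same technology (assimilation level 2), different levels of collaboration.
low = potential_intellectual_bandwidth(2, Collaboration.COLLECTIVE)
high = potential_intellectual_bandwidth(2, Collaboration.CONCERTED)
print(low, high)
```

Because the two factors are multiplied, improving only one of them leaves much of the potential unused, which is the point the model makes about combining technology with collaboration.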


Figure 2 - The Intellectual Bandwidth Model - source: (J. F. Nunamaker et al., 2000)

When discussing knowledge networks and knowledge management we can apply the rules of IB described above without any difficulty. However, if we focus on the knowledge creation process, we can classify the technology support at a given organization by how capable the IT system is of supporting knowledge creation (see Figure 3 - Increasing the Intellectual Bandwidth). From this point of view the levels are the following:

1. Observation: the lowest level, where knowledge is based on ad hoc observed events and has no procedure behind it.
2. Data collection: observations are performed more systematically, but the usability of the collected data is very limited because it is stored without an appropriate, searchable structure.
3. Data organization: solves this problem and increases the value of the collected data by making it more organized and easier to search.
4. Data combination: the IT system provides functions to calculate various combinations of the data, thereby increasing its meaning and value.
5. Information generation: the combinations generated on the previous level are stored and become part of the data structure, which extends the meaning of the stored data.
6. Gist extraction: the levels described so far enable the data to grow. After a while the organization realizes that finding the information needed is more and more challenging, even if search and query functions are available. Gist extraction means being able to extract the essence of the data, so that irrelevant results are easier to filter out when searching.
7. Sense making: once the system can extract the essence of the data, it gives us a new opportunity to analyze the data. On this level the system is able to find relations and correlations in the data and can suggest possible conclusions.
8. Knowledge creation: only one additional step is needed to make these conclusions part of the system, which means that new knowledge has been created.
9. Knowledge utilization: the system supports finding ways to make use of the knowledge created in the previous step, the circumstances under which it is applicable, the likelihood of success or failure, and so on.
10. Knowledge implementation: the system not only finds the ways in which the knowledge can be applied but also supports its application. Other systems, devices and components (if applicable in the given field) can be integrated into the system so that they can immediately use any new knowledge produced by the knowledge creation process.
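The ten levels described above form an ordered scale, which can be captured in a small sketch. The enum names follow the level names in the text; the helper `next_level` is a hypothetical illustration of how an organization might identify its next improvement target and is not part of the IB model itself.

```python
from enum import IntEnum
from typing import Optional


class KnowledgeCreationLevel(IntEnum):
    """The ten levels of IT support for knowledge creation, in ascending order."""
    OBSERVATION = 1
    DATA_COLLECTION = 2
    DATA_ORGANIZATION = 3
    DATA_COMBINATION = 4
    INFORMATION_GENERATION = 5
    GIST_EXTRACTION = 6
    SENSE_MAKING = 7
    KNOWLEDGE_CREATION = 8
    KNOWLEDGE_UTILIZATION = 9
    KNOWLEDGE_IMPLEMENTATION = 10


def next_level(current: KnowledgeCreationLevel) -> Optional[KnowledgeCreationLevel]:
    """Return the next level to aim for, or None if the top level is reached."""
    if current is KnowledgeCreationLevel.KNOWLEDGE_IMPLEMENTATION:
        return None
    return KnowledgeCreationLevel(current + 1)


# Hypothetical organization whose IT system stores data in a searchable structure.
current = KnowledgeCreationLevel.DATA_ORGANIZATION
target = next_level(current)
print(current.name, "->", target.name)
```

Using an ordered enum keeps comparisons such as `current < KnowledgeCreationLevel.SENSE_MAKING` meaningful, which matches the idea that each level builds on the previous one.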

Figure 3 - Increasing the Intellectual Bandwidth - source: (J. F. Nunamaker et al., 2000)

So we want to increase the intellectual bandwidth of our organization. What does this mean in reality? How can we start? If we have decided to increase the IB of our company, we have to go through the following steps.

1. Analyzing the current situation
First we have to understand the current situation. Answering the following questions can help during this process.
- What is the business we are in?
- Do we want to be in this business?
- Do we have a better idea? What would we change?
- If we want to stay in this business we have to measure it. What are the key performance indicators?

2. Where do we want to go? (What is the potential bandwidth?)
Once we have a clear picture of the current situation we have to define our goals.
- What would our ideal business process(es) look like?
- What would be the ideal values of our key performance indicators?
- How would we change the position of the organization in the market we are in?

3. How do we get there? (What is our action plan?)
At this stage we have a clear definition of both the starting point and our goals. We therefore have to lay out a plan of action showing the steps through which we can reach the optimal situation defined. This will include steps to increase information assimilation (y axis) in combination with steps to increase collaboration (x axis). As a result we will have a combination of IT and management tasks: development, integration, education etc. in IT, and potential changes in the organization structure, employee training, changes in the communication processes etc. in management.

3.3 Organizational structure to improve collaboration

Organizations form to create value that cannot be created by individuals (J. Nunamaker, Briggs, & De Vreede, 2001). The question is how the organization can maximize its value creation.


According to Bach et al. (Bach, Zhang, & Belardo, 2001) the structure of an organization has a significant impact on its collaboration capabilities. They therefore recommend an organizational structure called the Collaborative Networks Organization, a hybrid derived from two organization structures: the centralized Starburst and the decentralized Spider's Web.

Figure 4 - The Starburst organization structure - source: Bach et al 2001

Figure 5 - The Spider's Web organization structure - source: Bach et al 2001

In a Starburst organization structure there is one central node (a company, or a group or department of a company) that acts as the main knowledge store and as mediator among the other nodes. The other nodes (other companies in tight cooperation with the core company, or business units of the same organization) generate knowledge and feed it back to the core, so the core can distribute it among the other nodes as necessary. The benefit of this structure is that it is easy to extend with further nodes or, if needed, with further levels without losing the main structure of the whole network.

The Spider's Web organization is unstructured. There are many unorganized connections among nodes, which can be established and broken down on demand as a particular project or problem requires. The benefit of this structure is precisely its spontaneous character: there are no constraints or formalisms in the structure that would make a connection (and so the collaboration between two nodes) impossible, or that would make the decision process slow and overwhelming. The drawback is the lack of centralized control.
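The difference between the two structures can be made concrete by modelling each as an undirected graph. This is only an illustrative sketch: the node names, the helper functions and the use of hop count as a proxy for how easily knowledge travels between two nodes are assumptions, not something taken from Bach et al.

```python
from collections import deque


def starburst(core, others):
    """Starburst: every node is linked only to the central core node."""
    graph = {core: set(others)}
    for node in others:
        graph[node] = {core}
    return graph


def spiders_web(links):
    """Spider's Web: nodes are linked ad hoc; `links` is a list of node pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hops(graph, start, goal):
    """Breadth-first search: links on the shortest path, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


labs = ["lab_a", "lab_b", "lab_c", "lab_d"]
star = starburst("core", labs)
web = spiders_web([("lab_a", "lab_b"), ("lab_c", "lab_d")])

star_hops = hops(star, "lab_a", "lab_d")  # always routed through the core
web_hops = hops(web, "lab_a", "lab_d")    # these two ad hoc pairs never connected
print(star_hops, web_hops)
```

In the starburst any two nodes can reach each other through the core, while in the spider's web direct links are shorter where they exist but nothing guarantees that a path exists at all.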


Figure 6 - The Collaborative Network Organization Structure - source: Bach et al 2001

Bach et al. (Bach et al., 2001) recommend a hybrid structure, a combination of the two structures mentioned above, which is ideal for healthcare research collaborations. This hybrid structure is a highly centralized spider's web. The key to success in the collaborative network organization is the Information and Organization Mediator business unit. This node has a dual role. First, it helps to distribute information in the network, so essential information will not be lost. Even if a node does not know the location of the node that created the information it is looking for, the information can easily be found through the Info. & Org. Mediator node. Second, the Mediator ensures network transparency, which means recognition and protection of ownership, so intellectual property rights are enforced within the network. This structure combines the advantages of the two previous structures. The flexibility of the spider's web structure makes it possible to establish spontaneous connections and cooperation as a given project requires, while the Starburst type of centralized information hub supports healthy information distribution in the whole system.

3.4 How to extend the Intellectual Bandwidth: case study from the healthcare industry

This case study was presented by Bach et al. at HICSS'03 (Bach, Salvatore, & Jing, 2003). The case is about finding virus mutations efficiently and quickly enough to make it possible to treat the patient with the most appropriate drugs, those against which the virus has no resistance yet. This requires, on the one hand, access to the most recent knowledge about known virus mutations and their drug resistance and, on the other hand, methods and tools to identify new mutations, which makes it possible to start research on new drugs to defeat them. A treatment is considered successful if the drug is able to suppress the reproduction of the virus protein, which causes the viral load (virus particles/ml blood) to fall below the detection level. Using HIV diagnosis as an example, the case study demonstrates ways to extend IB and how significant the value created by this extension is.

On the lowest level of IB the physician draws blood from the patient and sends it to the lab, where it is analyzed. Using mainly manual steps, an expert extracts the virus mutation and compares it to known mutations. The lab then sends the results back to the physician, who decides which therapy is most likely to be effective. This process is slow and has many manual steps, which are always more error prone. Besides that, if the lab is not connected to other labs performing similar analyses, the labs cannot benefit from each other's results.

On the first level of process improvement we optimize the work of the labs. They perform most of the steps of their analysis in an automated way, so the results are more accurate and reliable and the turnaround time is shorter, which means the physician receives a response sooner. This level of collaboration is shown in the figures below.

Figure 7 - IB extension step 1 - model - source: Bach et al 2003

Figure 8 - IB extension step 1 - process - source: Bach et al 2003

The next level of improvement is building a network of labs, so that mutation and drug resistance information can be stored in a centralized database instead of local databases. Of course, the system must preserve each lab's ownership of the data it uploads. The key to reaching this level of cooperation is trust. If the participants (the labs) can see the benefits of the system and can trust it, they will actively participate. The IB can be increased even more if the central system can not only record but also evaluate HIV sequences. If the mutation is a known one and a treatment already exists, the lab has nothing else to do but search the system, and it will receive the description of the therapy.
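The behaviour of such a central system can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described by Bach et al.: the class and method names are hypothetical, mutations are represented as plain placeholder strings, and ownership is simply attributed to the first lab that uploads a given sequence.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MutationRecord:
    sequence: str                   # hypothetical: mutation encoded as a plain string
    owner_lab: str                  # ownership stays with the uploading lab
    therapy: Optional[str] = None   # None if no treatment is known yet


class CentralRegistry:
    """Shared store of mutation records for a network of labs."""

    def __init__(self) -> None:
        self._records: Dict[str, MutationRecord] = {}

    def upload(self, lab: str, sequence: str, therapy: Optional[str] = None) -> MutationRecord:
        """Store a record; the first uploader keeps ownership (an assumption)."""
        if sequence not in self._records:
            self._records[sequence] = MutationRecord(sequence, lab, therapy)
        return self._records[sequence]

    def evaluate(self, sequence: str) -> Optional[str]:
        """Return the known therapy, or None if the mutation is new or untreated."""
        record = self._records.get(sequence)
        return record.therapy if record else None


registry = CentralRegistry()
registry.upload("lab_a", "SEQ-001", therapy="therapy_x")   # placeholder values
known = registry.evaluate("SEQ-001")      # another lab searches a known mutation
unknown = registry.evaluate("SEQ-999")    # a mutation nobody has uploaded yet
owner = registry.upload("lab_b", "SEQ-001").owner_lab
print(known, unknown, owner)
```

Keeping the owner on every record is what allows the system to recognise each lab's contribution, which the text identifies as a precondition for trust.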

Figure 9 - IB extension step 2 - model - source: Bach et al 2003

Figure 10 - IB extension step 2 - process - source: Bach et al 2003


In the next step we extend the network with a decision support system for physicians. Since the system has all the information about mutations and their drug resistance it can send this information directly to physicians. So physicians can receive a list of known effective drugs (if available) and/or drug combinations against the given mutation.

Figure 11 - IB extension step 3 - model - source: Bach et al 2003

Figure 12 - IB extension step 3 process - source: Bach et al 2003

On the next level we extend the network with another layer of decision support. This new sense-making component also analyzes all the available patient records. It can capture all the results known so far of treating the given mutation with different drug combinations. Based on these results it can calculate which drugs or drug combinations have the highest success rate. It can compile a list of these suggestions with the related statistics and send it to the physician. When the physician follows up on the treatment, he or she sends feedback to the system to improve the accuracy of its statistics. Using this functionality it is also possible to adjust an ongoing treatment if, based on multiple cases, a mutation is likely to be resistant to the drug or drug combination applied in the treatment.
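The success-rate calculation described above can be illustrated with a minimal Python sketch. It is not taken from Bach et al.; the record layout, the function name rank_combinations and the mutation and drug names are hypothetical placeholders chosen only for illustration.

```python
from collections import defaultdict


def rank_combinations(records, mutation):
    """Return (combination, success_rate, cases) tuples for one mutation, best first."""
    totals = defaultdict(lambda: [0, 0])  # combination -> [successes, cases]
    for rec_mutation, combination, success in records:
        if rec_mutation != mutation:
            continue
        key = tuple(sorted(combination))  # the order of the drugs does not matter
        totals[key][1] += 1
        if success:
            totals[key][0] += 1
    ranked = [(combo, s / n, n) for combo, (s, n) in totals.items()]
    # Highest success rate first, then the combination backed by more cases.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


# Hypothetical feedback records: (mutation, drugs used, treatment succeeded?)
records = [
    ("M1", ("drugA", "drugB"), True),
    ("M1", ("drugB", "drugA"), True),
    ("M1", ("drugA",), False),
    ("M1", ("drugA",), True),
    ("M2", ("drugC",), True),
]

for combo, rate, cases in rank_combinations(records, "M1"):
    print(combo, rate, cases)
```

Each new feedback record from a physician simply extends the list, so the statistics sharpen as more cases are reported.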

Figure 13 - IB extension step 4 - model - source: Bach et al 2003

Figure 14 - IB extension step 4 - process - source: Bach et al 2003


We already have a high level of automation for locating known mutations and the drug combinations most likely to be effective against them. However, it is also possible to automate the analysis of new mutations, for example with a microarray (gene-chip) or with other diagnostic tests. With this method the volume of data can be increased even more, so the scope of decision support and the accuracy of the automated sense making will increase as well.

Figure 15 - IB extension step 5 model - source: Bach et al 2003

Figure 16 - IB extension step 5 process - source: Bach et al 2003

If we compare the last stage to the first one, we can see how wide a selection of functionalities is available in the new IT system compared to the old one, how it enables a higher level of collaboration, and how collaboration feeds back positively on the effectiveness of the overall organization (i.e. every participant in the process). This significant difference is reflected in the IB model chart as well, which shows that the IB of the first stage is a tiny fraction of the last stage's IB (Figure 15 - IB extension step 5 model).

4 The Knet environment

The way technology and collaboration are combined in research at present is inefficient. Research processes and their supposedly equivalent IT processes are dissonant, therefore only a tiny part of the opportunities provided by technology is utilized. The life cycle of knowledge in the real world (research) and the life cycle of its model in IT systems are entirely different, and there is no easy way for researchers to collaborate efficiently either. So we need a system that adapts much more closely to the real-world knowledge creation (e.g. research) process and enables a high level of collaboration. In this section of the paper we present the Knet approach, which is a collection of concepts for increasing the intellectual bandwidth of research processes through a revolutionary, transformational change. The Knet approach discussed here is not limited to the field of bioinformatics and genomic research. However, this field gave Knet's inventors the main motivation to lay down these concepts (Bajwa, Bach, Kongar, & Chu, 2009).

Challenges that need to be addressed in bioinformatics and genomic research:
- How can researchers handle the enormous amount of data that is either unstructured or semi-structured?
- Single data sources are unable to provide satisfactory answers to researchers' questions.
- The lack of protocols and standards makes data integration hard.
- The variation of names complicates data integration and search methods even more. (The same entity is named differently by different researchers.)
- In general the lack of enabling technology is not the only reason why researchers don't collaborate; they are also not motivated to share their findings because they intend to protect their intellectual property.
- Unstructured or semi-structured data is hard to search and keep integrated. But using relational databases is not efficient either, since existing relations are incomplete and future relations are not known.

Based on the previous sections and the short introduction above we can define the mission statement of the Knet project.

Mission statement: enable researchers and managers to build mutual prosperity by providing the essential IT platform and supporting the growth of mutual trust across the entire research network.

We can define the system requirements needed to fulfill the mission statement:
- Trustworthiness: the system needs to be trustworthy and credible.
- Freedom: the users of the system must not feel limited by the system. The purpose is to enable new, more effective ways of research work and not to build restrictions.
- Information security: the system must fulfill the general information security requirements: confidentiality, integrity, availability, authenticity and non-repudiation.
- Transparency and openness: the system must fulfill these requirements to increase trust and enable collaboration.
- Ownership protection: this is an especially important requirement. Without rock solid ownership protection researchers would lose interest in collaborating.
- Accountability: the system must enforce accountability within its own borders (e.g. tracing transactions) and it should support accountability even beyond its borders with concepts and policies, especially in relation to intellectual property rights and copyrights.
- Privacy: how various data (user information, copyrighted materials) are collected, stored and shared.

4.1 Knet design concepts

Applying the Knet approach, researchers create small, compact micro databases (micro DBs) with very specific content instead of trying to build big monolithic relational databases. Links between micro DBs can be added dynamically. Loosening the restrictions of traditional relational databases and eliminating the constraint of defining relations in advance helps the process of knowledge creation. Knowledge can be produced, micro DBs can be created and links can be established in parallel and in an iterative manner.
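To make the idea of dynamically linked micro DBs more concrete, here is a minimal Python sketch. It is only an illustration of the concept and not part of any Knet implementation; the class name MicroDB, its methods and all sample names are assumptions made for this example.

```python
class MicroDB:
    """A tiny stand-in for a micro DB: a few records plus outgoing links."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner      # ownership stays with the creating researcher or lab
        self.records = {}
        self.links = []         # (relation label, target MicroDB), added at any time

    def put(self, key, value):
        self.records[key] = value

    def link(self, relation, target):
        # No schema change is needed: a link is simply appended when it is discovered.
        self.links.append((relation, target))

    def related(self, relation):
        return [target.name for label, target in self.links if label == relation]


# Hypothetical micro DBs created independently by two labs.
mutations = MicroDB("mutations", owner="lab_1")
mutations.put("mut_001", {"gene": "geneX"})

resistance = MicroDB("drug_resistance", owner="lab_2")
resistance.put("drugA", {"resistant_mutations": ["mut_001"]})

# The relation is established later, once the connection becomes known.
mutations.link("resistance_data", resistance)
print(mutations.related("resistance_data"))
```

The point of the sketch is that neither micro DB had to anticipate the link at creation time, in contrast to a relational schema where relations are defined in advance.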

Figure 17 - RDBMS concept

Figure 18 - Knet concept

Various database templates can be available for on-line scholars to establish a Micro DB quickly when needed, or they can create their own from scratch. We also call Micro DBs KCubes or Knowledge Cubes. This is the basic building block of the Knowledge Network (Knet). According to the Knet concept humans are an integral part of the system. Data, be it organized, unstructured or semi-structured, has no meaning on its own. It becomes meaningful when it is put into context and is comprehended by people. So authors have to become part of the system. They have to continuously maintain their data and its relations. The Knet platform has two main parts: a front-end knowledge engineering system and a back-end knowledge engineering system. The purpose of the first one is targeted knowledge breakdown and data ETL (Extract, Transform and Load). The function of the second part is data mining, knowledge interpretation and knowledge combination.


Figure 19 - Detail View Part 1

Figure 20 - Detail View Part 2

4.2 Knowledge Network structures

As we described above, KCubes or Micro DBs are the basic building blocks of Knet. Here we show the further building structures of a Knowledge Network.


Figure 21 - Building knowledge networks

A node (KNode) represents data or information, most likely encapsulated in a Micro DB. We can establish links or relationships (KR) between nodes, so we can build simple and more complex structures, like a line (KL), plan (KP) or net (KN). It makes sense to differentiate by the sort of knowledge built into a net (or plan). When we visualize the network, this can be done by color coding the different sorts of nets (or plans), as in the figure above. To represent knowledge in a wider sense in the network we need various sorts of information. We may need only some or all of the following types when building a network.
1. The Objective Information defines the knowledge in question. This may also scope the topic.
2. A Resource library (e.g. materials, data parameters) can be built, referred to and used in other nets.
3. Methods (e.g. formulas) can be applied to our resources.
4. Models (e.g. interaction and steps) can orchestrate the content of all the former nets into one integral whole.
5. Skill sets (e.g. learning objects) provide a view of the network from a learning or educational standpoint.

4.3 Knowledge cube analysis

When working with a knowledge net, researchers can find the knowledge cubes of interest to them or can even build their own, derived from several other cubes of interest. Once such a cube is available, the researcher may want to analyze it in various ways (similar to OLAP cubes), like slice and dice, drill down and up, roll up, pivot etc.
- Slice: fix one dimension of an N-dimensional hypercube to a given value. This reduces the number of dimensions by one.
- Dice: fix one dimension of an N-dimensional hypercube to a given interval. This gives us a part of the original cube. The value set of the given dimension is limited accordingly.
- Drill down or up: while doing a slice operation the given value is broken down to a more detailed level (we eliminate one dimension but add a new one). This operation is called drill down. The opposite, joining the more detailed values into one value again and putting back the other, non-detailed dimension, is called drill up.
- Roll up: compute all the data for one or more dimensions (by eliminating all the others).
- Pivot: rotate the N-dimensional cube along one axis to give another perspective on the data.

4.4 Conclusion

Applying the concepts suggested above makes the IT system and processes (machines, databases, links, database records etc.) conform more closely to the real-life system and processes (researchers, inventions, research work and the relations among them). As a consequence technology becomes an enabling technology that encourages and makes possible a higher level of collaboration, and so finally boosts the intellectual bandwidth of the entire research community.

5 Technical Issues

5.1 Technical Issues1: Ontologies

Ontologies are among the most important tools of knowledge representation and reasoning in science in general, and so in biomedical science as well; it is therefore essential to know about them and to deal with them when building biomedical knowledge networks. This section is not intended to give an exhaustive description of ontologies; it serves as a brief summary of the most important terms and questions related to ontologies, mainly based on the paper Ontology-based knowledge representation for bioinformatics (R. Stevens, Goble, & Bechhofer, 2000).

What is an ontology?

An ontology is mainly a vocabulary of terms and the specification of their meanings in a given knowledge domain. Its goal is to create a vocabulary and semantic structure for exchanging information about the domain. The main components of an ontology are concepts, relations, instances and axioms. A concept defines a class of entities. Relations define the interactions between concepts (is a kind of, part of etc.). Instances are the entities classified by a concept. Axioms constrain values for classes or instances (for example restrictions on some values of given parameters of a given concept, limitations on the use of given relationships etc.).

Quality of ontologies and their fields of application

When building an ontology there's an emphasis on modularity and reusability. The measure of how well dependencies are separated is known as ontological commitment. Other measures of quality are clarity, consistency, completeness and conciseness. An ontology may be used as:
- a community reference (knowledge base with neutral authoring),
- a specification (when building a database schema or a vocabulary),
- a tool for common access to information (the use of the vocabulary in the ontology makes information exchange possible),
- a base for search queries over databases (instead of free text search, the user can be certain the semantics of his or her search term are correct),
- a base for understanding annotations and technical literature (support for NLP, natural language processing, systems).

Example ontologies

Examples of ontologies in the field of bioinformatics and molecular biology are the RiboWeb ontology ("The RiboWeb Project," 2001), the EcoCyc ontology ("EcoCyc,"), the Schulze-Kremer ontology for molecular biology (Schulze-Kremer, 1998), the Gene Ontology (GO) ("Gene Ontology," 1999) and the TAMBIS Ontology (TaO) ("TAMBIS," 1998). The examples show how conceptualizations of the same domain may differ without any of them being incorrect. For example TaO has a concept AccessionNumber, which doesn't exist in regular molecular biology but serves the purpose of bioinformatics well in this ontology.

Building an ontology

There are no standard methods for building an ontology. One way to build an ontology, inspired by the software engineering V-process model, has the following steps:
1. Identify purpose and scope (What do we want to use the ontology for?)
2. Knowledge acquisition (From where and how can we get the knowledge needed?)
3. Conceptualization (Identification of key concepts, their relationships and attributes.)
4. Integrating (How can we reuse existing ontologies to build our own?)
5. Encoding (Making the knowledge accessible using the concepts and rules of the new ontology.)
6. Documentation (Both formal and informal, for machines and humans respectively.)
7. Evaluation (Determining the quality of the new ontology.)

The transformation of the conceptualization into a given knowledge representation language (the encoding phase) is key to making use of the ontology in different applications. The major considerations with languages are their expressivity (how wide a range of concepts can be expressed and how complex the language becomes to process), rigor (consistency without contradictions) and semantics (something expressed in the language unambiguously means what is in the concepts).

Ontology languages

Languages fall into three categories: vocabularies, object based knowledge representation languages and languages based on predicates expressed in logic. Vocabularies are hand-crafted ontologies in a simple tree-like inheritance structure. Object based knowledge representation languages show similarities with the Object Oriented (OO) methodology in software development. We call this a frame based system. Frames are analogous to OO classes. Frames have slots, which are analogous to OO attributes. Languages based on predicates expressed in logic, also called DL (description logic), use other concepts and relations to define a new concept. This makes the whole structure more expressive and descriptive. One general problem with languages is that as they become more and more expressive, the computational complexity of reasoning increases.

Open issues and challenges

The most important open issues with ontologies in the field of bioinformatics are:
1. Knowledge based reasoning (How to find the most appropriate ontologies? How to ensure the processing power needed for reasoning?)
2. Reuse vs. specific (Reusing other ontologies helps to speed up building ontologies. But what should happen if the base ontology changes? How should the change be reflected in the successors?)
3. Methodologies for constructing ontologies (Ontology creation is costly. If the ontology is long lived in an application and/or widely reused then the investment was worthwhile.)

5.2 Technical Issues2: Text mining

The usual way researchers publish their findings is journal articles. Because of the huge volume of these publications it is not possible for individuals to keep track of them and find the important connections that would enable new findings and knowledge creation.
Automation is needed to help researchers with this problem. From an information management and retrieval point of view the lack of structure in journal articles is a significant challenge, since it makes it difficult to process these documents and find the relevant content in them. Cohen and Hersh listed the most important fields and technologies in biomedical text mining in their article (Cohen & Hersh, 2005). Here we present a quick summary of their work. According to Cohen and Hersh the goal of biomedical text mining is to shift the burden of information overload from the researcher to the computer. There are various ways and techniques to reach this goal. The most significant fields are named entity recognition, text classification, synonym and abbreviation extraction, relationship extraction and hypothesis generation.

Named entity recognition

Named entity recognition (NER) is intended to find instances of a specific thing in a text document, whether it is referred to by a widely used, well-known term or by a less-known synonym. Named entity recognition gives structure to the otherwise unstructured document. This makes it possible to get an idea of the main concepts of the document and makes it easier to find the relations among its key terms or entities. NER has multiple challenges. A global, universal dictionary of terms in biomedical science does not exist. Another problem is that even if the definition of a word is available, in many cases the word may have multiple meanings depending on the context. The converse is also true: many entities have a long list of names by which different researchers refer to them. It is very important to be able to measure whether an NER approach is effective or not. In the literature two numbers are used for this measurement: precision and recall. Precision is the number of correct predictions divided by the total number of predictions; it tells us, if the NER system recognizes an entity, how likely it is that the recognition was correct and the entity was not misidentified. Recall is the number of correct predictions divided by the total number of named entities in the text; it tells us, if there is an entity in the text, how likely it is that the system will find and recognize it correctly. These two numbers can be combined into one number that represents the overall effectiveness of an NER system: the F-score. The F-score is the harmonic mean of precision and recall (2PR/(P+R)). Because of the huge progress in genome research many NER systems focus on recognizing gene and protein names in free text.

Text classification

Text classification systems intend to determine whether a document has certain characteristics. Typically the characteristic is defined by providing a positive and a negative training set, that is, some documents that fulfill the criteria and some other documents that do not. Then the system finds documents that are similar to the positive and different from the negative training set.
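Precision, recall and the F-score, defined above for NER, apply in the same way to the output of a text classifier (items predicted as positive versus items that are truly positive). The following minimal Python sketch computes the three measures; the function name and the sample entity mentions are hypothetical and serve only as an illustration.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, F-score) for a set of predictions against the truth."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall: 2PR/(P+R)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: (position in text, entity name) pairs.
predicted = {(0, "geneA"), (5, "geneB"), (9, "proteinX"), (12, "cell")}
actual = {(0, "geneA"), (5, "geneB"), (9, "proteinX"), (20, "proteinY"), (25, "geneC")}

p, r, f = precision_recall_f(predicted, actual)
print(p, r, f)
```

In this example three of the four predictions are correct (precision 0.75) and three of the five true entities are found (recall 0.6), which gives an F-score of about 0.67.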
Text classification functions can be extremely helpful for database curators when reviewing a large number of documents with the intent of classifying them. This is especially true of biomedical and genome research articles, which have recently been published in huge volumes, so there is a strong need to have them in a more organized format.

Synonym and abbreviation extraction

The intensive research in the field of bioinformatics creates new terms on a continuous basis. Most of these terms have several synonyms and abbreviations. This adds to the complexity of reusing research results as the basis of new research studies. A system that can extract synonyms and abbreviations automatically and maintain such a dictionary could help the processing of free text documents a lot (whether it is done by a human or a machine). The automated extraction of synonyms and abbreviations of a term defined in the same document is already possible with the currently available techniques. However, extracting and identifying them in documents in which they are not defined, only reused based on a cited or uncited external source, is a more difficult problem. Gene and protein name extraction is also a challenging task because of the size and complexity of the topic, and most of the automatically generated synonym and abbreviation dictionaries are not accurate enough for practical use.

Relationship extraction

When extracting relationships the goal is to find the connections between entities of a given type. There are various approaches to achieve this goal. Manually generated, template-based methods have patterns (usually regular expressions) which are run on the document under investigation. Relationships are extracted when the pattern matches a piece of the document. Another way is automated pattern generation. The system is given some sample relations, based on which it generates patterns and looks for similar relationships in other documents.
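As an illustration of the manually generated, template-based approach described above, the following Python sketch runs one hand-written regular expression over a piece of text. The pattern, the verb list and the sample sentence are assumptions made for this example and are far simpler than the templates a real system would need.

```python
import re

# Hypothetical hand-written template: "<entity> <verb> <entity>"
PATTERN = re.compile(
    r"\b([A-Za-z0-9-]+) (binds to|activates|inhibits) ([A-Za-z0-9-]+)\b"
)


def extract_relations(text):
    """Return (subject, relation, object) triples wherever the template matches."""
    return [(m.group(1), m.group(2), m.group(3)) for m in PATTERN.finditer(text)]


text = "In our assay geneA activates geneB. Later proteinX binds to proteinY."
relations = extract_relations(text)
print(relations)
```

A template like this is precise on sentences that follow its shape but misses every other phrasing, which is one motivation for the automated, statistical and NLP based approaches.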
A third way is utilizing statistical methods. In this case the system looks for terms that are mentioned together more often than would be expected by chance if they were unrelated. The fourth way is the NLP based approach. This method decomposes the sentences of the document and identifies both the entities and the relationships between them. Of course there is an emphasis on finding relationships among genes and proteins. This is beneficial both for organizing the already available information and for finding new relations and drawing new conclusions, practically generating new knowledge.

Hypothesis generation

Attempting to uncover relationships that are not present in the text is called hypothesis generation. This field is built on a simple formula: if A influences B and B influences C then A may influence C. Of course if we generate this type of hypothesis from a large number of relationships we may end up with a huge volume of potential new relations. Although this can be a good starting point for new research, the evaluation and organization of these relationships is a challenging issue.

Open issues and challenges

The potential of text mining systems in biomedical science is huge but underutilized. Text mining systems of the future must not only fulfill technical requirements like precision and recall, but must also answer the real needs of their users and must be widely accessible, easy to use and flexible. The key to meeting these challenges is collaboration. If biomedical researchers, text mining professionals and journal publishers coordinate their efforts, then the text mining systems of the future will be not only a nice addition for biomedical scholars, but also a must-have piece of their toolbox.
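The rule used in hypothesis generation earlier in this section (if A influences B and B influences C then A may influence C) can be sketched in a few lines of Python. This is a toy illustration under the assumption that relationships are stored as simple (source, target) pairs; the function name and the sample data are hypothetical.

```python
def generate_hypotheses(relations):
    """Propose (a, c) pairs where a -> b and b -> c are known but a -> c is not."""
    known = set(relations)
    targets = {}
    for a, b in known:
        targets.setdefault(a, set()).add(b)
    hypotheses = set()
    for a, b in known:
        for c in targets.get(b, ()):
            if c != a and (a, c) not in known:
                hypotheses.add((a, c))
    return sorted(hypotheses)


# Hypothetical extracted relationships: "X influences Y"
relations = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
print(generate_hypotheses(relations))
```

Even this small input produces two candidate relations, which hints at how quickly the number of hypotheses grows with the number of known relationships.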

6 Case study - HuGE tool

In this case study we summarize the development and use of the Human Genome Exploration (HuGE) tool (Bach, Bajwa, & Erodula, 2011), a multipurpose tool for solving computationally intensive bioinformatics problems. HuGE is not a knowledge network and its development did not happen in a Knet either. Nevertheless it is the best candidate for a case study related to knowledge networks, for a couple of reasons. First, it is intended to serve goals similar to those of knowledge networks (being widely accessible, handling huge volumes of data, creating knowledge and sharing it). Second, it works in a research area within bioinformatics that can gain practical benefits from Knet concepts. And third, we can demonstrate through the HuGE tool in general why it is worthwhile for bioinformatics R&D projects to be carried out in a Knet environment.

6.1 Goals of the project

The project was started to solve a biomedical problem, specifically the question of the binding sites of some ZFNs (Zinc-Finger-Nuclease) on the human genome. To accomplish this, project members implemented an algorithm that compares subsequences in proteins and the human genome. This was the LCS (Longest Common Subsequence) algorithm. Of course the tool is not restricted to solving only the original problem. Similar problems related to protein bindings can easily be analyzed using the tool. For processing large data sets a single-machine approach is usually not sufficient; parallel and distributed processing is needed to complete the calculations within a tolerable time limit. This project used tools, frameworks and concepts for distributed processing such as a distributed file system (HDFS), a distributed column-oriented database (HBase) and the MapReduce programming model. In this case study we give a quick overview of all these building blocks as well.

6.2 HDFS

The HuGE project selected the Apache Hadoop ("Apache Hadoop Project," 2008) framework as the base on which to build its distributed processing. Hadoop implements a distributed file system called HDFS (Hadoop Distributed File System) for its clients. HDFS is constructed from multiple physical nodes but appears to the outside world as one transparent logical filesystem. HDFS has many benefits, listed in the following table.
Table 2 - Benefits of HDFS

| Benefit | Description |
|---|---|
| Highly fault-tolerant | It is possible to deploy on a large amount of low-cost hardware. |
| High throughput access to data | Suitable for applications that have large data sets. |
| Recovery from hardware failures | Detection of faults and quick, automatic recovery from them. |
| Streaming Data Access | HDFS is not for general purpose applications. It is designed more for batch processing rather than interactive use by users and for high throughput of data access rather than low latency of data access. |
| Large Data Sets | A typical file in HDFS is gigabytes to terabytes in size. HDFS provides a high aggregate data bandwidth and can be scaled to hundreds of nodes in a single cluster and can handle tens of millions of files in a single instance. |
| Simple Coherency Model | HDFS employs a write-once-read-many access model for files. This simplification helps the system to be optimized for its strengths. |
| Moving Computation is Cheaper than Moving Data | HDFS minimizes network congestion and increases the overall throughput by performing computation on data physically as close as possible to the data itself. |
| Portability Across Heterogeneous Hardware and Software Platforms | HDFS is implemented in Java, so it can run on any platform that supports Java. |

An HDFS instance has a special node, called the name node, which works as a register that indexes all the file blocks stored on the other nodes, called data nodes. The name node keeps track of the availability of its data nodes using heartbeats and of the file blocks on them using block reports, and it initiates replication operations if needed to keep the desired level of redundancy. Data nodes can be dynamically commissioned or decommissioned without affecting clients.
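The bookkeeping described above can be pictured with a toy Python sketch: given the latest block reports and the set of data nodes that still send heartbeats, it finds the blocks that have fewer live replicas than desired. This is a simplified illustration and not the Hadoop implementation or its API; the function name, the node and block names and the data layout are assumptions, and the default of three replicas mirrors the usual HDFS default replication factor.

```python
def under_replicated(block_reports, live_nodes, replication=3):
    """Map each under-replicated block to the number of replicas it is missing."""
    counts = {}
    for node, blocks in block_reports.items():
        for block in blocks:
            counts.setdefault(block, 0)
            if node in live_nodes:  # replicas on silent nodes do not count
                counts[block] += 1
    return {b: replication - n for b, n in counts.items() if n < replication}


# Hypothetical block reports: data node -> blocks it stores.
block_reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1"},
    "dn4": {"blk_2"},
}
live_nodes = {"dn1", "dn2", "dn3"}  # dn4 has stopped sending heartbeats

print(under_replicated(block_reports, live_nodes))
```

In this example the loss of one data node leaves one block with only two live replicas, so a name node following this logic would schedule one additional copy of it.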

Figure 22 - The HDFS architecture - source: http://hadoop.apache.org/

The system is designed in such a way that user data never flows through the NameNode. The NameNode is dedicated to handling the metadata of the filesystem. HDFS supports a traditional hierarchical file organization and the usual operations (create, delete, copy or move a file, create a directory, delete a directory etc.). It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. In HDFS it's possible to create rack-aware replica placement policies. Typically it is faster to replicate blocks within the same rack than between nodes located in different racks. Besides, the chance of an entire rack becoming unavailable is much lower than the chance of a node failure. So in most cases it is not worthwhile to keep all the replicas in different racks. Rack-aware placement improves data reliability, availability and network bandwidth utilization. When a client reads data, HDFS tries to select the closest replica. For example, if the client is located in a rack in which a replica of the requested block is available, then that replica will be provided. The characteristics of HDFS make it an ideal choice as the base of data-intensive biomedical computations.

6.3 HBase

HBase is the Hadoop Database. Its purpose is to provide random read/write access to big sets of data stored in a Hadoop system. HBase is able to store very large tables with billions of rows and millions of columns. HBase is modeled after Google Bigtable (Chang et al., 2008), so it is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) -> string

In other words we have a far more permissive data structure than a strict relational database structure. We don't have to define the complete schema at the time of table creation. In the future we can add not only rows but also an arbitrary number of columns. The role of the timestamp is to version cell values. Since this functionality is built into the HBase system, there's no need for the user (i.e. the application developer) to implement versioning separately. Values are sorted by timestamp in decreasing order to make it simple to return the current value (the one with the most recent timestamp). Columns are organized into column families. Columns must be created under a column family. A table is supposed to have a relatively small number of column families (in the hundreds at most) but may have a much higher number of columns (possibly in the millions). Column families are defined when creating the table. Storage is specified at the column family level. All column family members are stored together on the distributed filesystem. One important consequence of the column-oriented storage format is the sparse character of the data structure, which means it doesn't store empty cells at all. If a new column is added to a table (to one of its column families) this won't change the existing rows at all; it only makes it possible to assign a value to the new column. Every table has a primary key that can be defined with any kind of data type. Rows are indexed by the primary key, which provides access to row records. The architecture of an HBase system contains a master server, region servers and clients.
The master server is responsible for monitoring all the region servers and it is the interface for all metadata changes. A region server is responsible for serving and managing its regions. A region is a range of primary key values together with the related row records from one or more column families. So if a client is working with a specific region of a table, this concept minimizes the number of region servers the client has to deal with. Clients can communicate with region servers directly without sending information through the master server. The master server can serve the lookup requests coming from clients; however, these lookups can be cached, so the load on the master server is relatively low. In the following example the content of web pages, their URL and their displayed anchor text are stored in an HBase table. The primary key of the table is the URL (Row Key). The example contains only one row key value, so we see only one record. Two column families are defined: contents and anchor. The timestamp column contains the time when the given column value was added. In the contents column family there is only one column, called html. The value of this column for this row was first added at time t3, then it was modified twice, at times t5 and t6. The other column family, anchor, has two columns: my.look.ca and cnnsi.com. The former was assigned a value at time t8 (CNN.com), the latter at time t9. So the most current values of the record with Row Key = com.cnn.www are the following:

contents:html = "<html> some content3"
anchor:cnnsi.com = CNN
anchor:my.look.ca = CNN.com

Table 3 - HBase example

Row Key     | Time Stamp | ColumnFamily contents                 | ColumnFamily anchor
com.cnn.www | t9         |                                       | anchor:cnnsi.com = "CNN"
com.cnn.www | t8         |                                       | anchor:my.look.ca = "CNN.com"
com.cnn.www | t6         | contents:html = "<html> some content3" |
com.cnn.www | t5         | contents:html = "<html> some content2" |
com.cnn.www | t3         | contents:html = "<html>some content1"  |
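The data model described above (a sparse, sorted map from row key, column key and timestamp to a value, with the newest version first) can be imitated with nested sorted maps. The following is only a minimal in-memory sketch under that assumption: the class name VersionedCellMapSketch and its methods are hypothetical, it does not use the HBase client API, and the numeric timestamps 3, 5, 6, 8 and 9 merely stand in for t3 to t9 of Table 3.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative in-memory model only; it does not use or mimic the HBase client API. */
final class VersionedCellMapSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores one cell version; cells that are never written occupy no space (sparse). */
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns all stored versions of a cell, newest first (empty if the cell was never written). */
    NavigableMap<Long, String> versions(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return Collections.emptyNavigableMap();
        }
        return Collections.unmodifiableNavigableMap(columns.get(column));
    }

    /** Returns the value with the most recent timestamp, or null if the cell is empty. */
    String getLatest(String row, String column) {
        NavigableMap<Long, String> cellVersions = versions(row, column);
        return cellVersions.isEmpty() ? null : cellVersions.firstEntry().getValue();
    }

    /** Loads the five cell versions of Table 3 and prints the current value of contents:html. */
    public static void main(String[] args) {
        VersionedCellMapSketch table = new VersionedCellMapSketch();
        table.put("com.cnn.www", "contents:html", 3L, "<html>some content1");
        table.put("com.cnn.www", "contents:html", 5L, "<html> some content2");
        table.put("com.cnn.www", "contents:html", 6L, "<html> some content3");
        table.put("com.cnn.www", "anchor:my.look.ca", 8L, "CNN.com");
        table.put("com.cnn.www", "anchor:cnnsi.com", 9L, "CNN");
        System.out.println(table.getLatest("com.cnn.www", "contents:html"));
    }
}
```

Because the innermost map is ordered by decreasing timestamp, reading the current value is a lookup of the first entry, which mirrors the reason given above for sorting versions newest first.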

6.4 MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets. (Dean & Ghemawat, 2004) So far we have described techniques and methods for storing big files in a distributed manner (HDFS) and for creating big, column oriented databases (HBase). Now it is time to introduce a programming model that helps us perform large, processing intensive computations. MapReduce is a Google invention. Its purpose is to provide a framework for software developers to build parallel processing applications without worrying about fault tolerance, distribution, load balancing and parallelization, since all of these services are provided by the framework. Hadoop implements the MapReduce framework, so these services are also available to Hadoop users. When working with the MapReduce framework the developer has to build only a map and a reduce function.

map (k1,v1) -> list(k2,v2)
reduce (k2,list(v2)) -> list(v2)

The input of the map function is a key and a value, and its output is a list of keys and values. The input of the reduce function is a key and a list of values, and its output is a list of values (this can be a list of length one). The MapReduce framework moves the map function's output to the reduce function: it collects all the outputs of the map function that share the same key and delivers them to a single reduce invocation. The concept is easier to understand through an example taken from the Google article (Dean & Ghemawat, 2004):

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

In this simplified example our goal is to count the words in a document. We pass the document to the map function as a document identifier (key) and document content (value). (Actually the framework does that; the programmer only has to initialize the data.) Then the reduce function is called once for each word found, with the list of values emitted for that word (in this case always the number 1), so the reduce function can sum the numbers. One of the benefits of the framework is that the computation can be done in parallel. Every map invocation (more complicated problems have a high number of map calls, not only one) and every reduce invocation may run in parallel, most likely on different physical processors.
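The flow described above (map emits pairs, the framework groups them by key, reduce sums each group) can be imitated in a single Java process. This is a minimal sketch, not Hadoop code: the class WordCountSketch and its method names are hypothetical, and the run method only stands in for the grouping work that the real framework performs across many machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map, grouping and reduce phases; no Hadoop classes are used. */
final class WordCountSketch {

    /** map(k1, v1) -> list(k2, v2): emits the pair (word, 1) for every word in the document. */
    static List<Map.Entry<String, Integer>> map(String documentName, String contents) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : contents.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return emitted;
    }

    /** reduce(k2, list(v2)) -> list(v2): sums the counts collected for one word. */
    static int reduce(String word, List<Integer> counts) {
        int result = 0;
        for (int count : counts) {
            result += count;
        }
        return result;
    }

    /** Plays the role of the framework: runs map per document, groups by key, runs reduce per key. */
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> document : documents.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(document.getKey(), document.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }
}
```

In this sketch every call to map and every call to reduce depends only on its own arguments, which is the property that allows a real framework to run them on different processors.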


Figure 23 - MapReduce parallel processing source: Dean & Ghemawat, 2004

The application performs the initialization with a dedicated master node. It sets up the map and reduce functions, the location of the input data, the number of splits (how many inputs should be created, and therefore how many map tasks the workers should perform) and the number of output files required (and therefore how many reduce tasks should be performed). The master node schedules the tasks by assigning them to idle workers. Since MapReduce is a proven programming model for processing intensive, highly parallel calculations on large data sets, it is well suited to the biomedical and human genome research environment.

6.5 LCS

The longest common subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences. A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. For example, [N1, N3, N4] is a subsequence of [N1, N2, N3, N4, N5]. For an arbitrary number of input sequences this problem is NP-hard. However, for a constant number of sequences the calculation can be performed in polynomial time. The optimal solution to the problem was first documented by Nakatsu et al (Nakatsu, Kambayashi, & Yajima, 1982). The problem can be defined in a recursive manner:


Figure 24 - LCS definition source: Nakatsu et al, 1982

According to the definition, the problem can be decomposed into smaller subproblems. If we know LCS(Xi-1,Yj-1), LCS(Xi,Yj-1) and LCS(Xi-1,Yj), then calculating LCS(Xi,Yj) is easy. Dynamic programming gives us a sound way to handle this kind of problem (Eddy, 2004). The following figure demonstrates how to solve the LCS problem using dynamic programming.

Figure 25 - Solving the LCS problem by using dynamic programming - source: Eddy, 2004

In this example scores are assigned to the different cases (match +5, mismatch -2, insertion/deletion -6), so the longest common subsequence is the one with the highest score. The algorithm has three main steps:

1. Initialization: filling in the first row and column of the matrix is trivial.
2. Fill: calculating the remaining values of the matrix is simple using the recursive LCS definition above.
3. Traceback: starting from the bottom right corner we build the longest common subsequence. Either we save the path while calculating the numbers (see the little arrows in the figure) and build the subsequence (or subsequences, if more than one is optimal) by following the arrows, or we reverse the formula and trace back the subsequence based on the calculated numbers.
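The three steps above can be sketched for two sequences as follows. This is a minimal illustration, not the HuGE implementation: the class LcsSketch is hypothetical, and it assumes the classic unit scoring (a match extends the subsequence by one, everything else scores zero) instead of the +5 / -2 / -6 scores of Figure 25. The traceback uses the second option described in step 3, reading the path back from the calculated numbers.

```java
/** Classic two-sequence LCS by dynamic programming with unit scores; illustrative only. */
final class LcsSketch {

    /** Steps 1 and 2: table[i][j] is the LCS length of the first i chars of x and first j of y. */
    static int[][] fill(String x, String y) {
        // Java zero-initializes the array, so row 0 and column 0 are already set (step 1).
        int[][] table = new int[x.length() + 1][y.length() + 1];
        for (int i = 1; i <= x.length(); i++) {
            for (int j = 1; j <= y.length(); j++) {
                if (x.charAt(i - 1) == y.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table;
    }

    /** Step 3: walks back from the bottom right corner and rebuilds one optimal subsequence. */
    static String lcs(String x, String y) {
        int[][] table = fill(x, y);
        StringBuilder reversed = new StringBuilder();
        int i = x.length();
        int j = y.length();
        while (i > 0 && j > 0) {
            if (x.charAt(i - 1) == y.charAt(j - 1)) {
                reversed.append(x.charAt(i - 1)); // matching element belongs to the subsequence
                i--;
                j--;
            } else if (table[i - 1][j] >= table[i][j - 1]) {
                i--; // the value came from the cell above
            } else {
                j--; // the value came from the cell to the left
            }
        }
        return reversed.reverse().toString();
    }
}
```

When several subsequences are optimal, this traceback returns only one of them; collecting all of them would require following every tie instead of preferring the cell above.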

If sequence X has M elements and sequence Y has N elements, then filling the matrix needs N x M operations and the traceback at most N + M steps. For long sequences the calculation time can be significant, so caching and other optimizations may be beneficial.

6.6 Zinc-Finger-Nuclease (ZFN) binding sites on the human genome

To test the HuGE tool, researchers used a real life biomedical problem: identifying Zinc-Finger-Nuclease (ZFN) binding sites on the human genome. ZFNs are engineered DNA binding proteins designed to modify the genome. ZFNs are produced from the Sp1 protein, which has a stable three dimensional protein structure and well documented DNA binding mechanisms (Dhanasekaran, Negi, & Sugiura, 2005). However, ZFNs cause cytotoxicity (cell death), which is a significant drawback for their use in any clinical testing (Ramirez et al., 2008). ZFNs are designed to bind to a target binding site, but in reality they may also bind to an unknown number of off target binding sites, which causes cytotoxicity. ZFNs usually consist of three single fingers, each of which binds three DNA nucleotides. The structure looks similar to the figure below.
[Figure: Sp1-Binding Domain (three fingers)]

Figure 26 - The SP1 three finger structure source: Bach et al., 2009

Based on experiments performed using electrophoretic mobility shift assays, there are three triplets with the highest binding affinity (for the second finger of SP1 and two of its mutants):

Best binding site of Sp1 = CGG
Best binding site of CB1 = CCC
Best binding site of MR14 = GAG

Using these results it is possible to assemble, in a modular way, a three finger DNA binding protein whose best binding site is CGGCCCGAG. The three finger protein consists of the second finger of SP1, CB1 and MR14. Though the exact behavior of the binding and its relation to cytotoxicity are not known, calculating all the possible binding sites in the human genome gives us an idea of how accurately a binding site can be targeted; in other words, how many off target binding sites there are in the human genome that could be a cause of cytotoxicity.
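Counting the possible binding sites of a 9 base pair string comes down to counting its occurrences in a long sequence. The following is only a naive single-threaded sketch of that idea, assuming exact matching on one strand; the class BindingSiteCountSketch is hypothetical and it is not how the HuGE tool (which uses LCS on Hadoop) performs the search.

```java
/** Naive exact-match counter on a single strand; illustrative only. */
final class BindingSiteCountSketch {

    /** Counts all exact occurrences of site in sequence, including overlapping ones. */
    static int countOccurrences(String sequence, String site) {
        if (site.isEmpty()) {
            throw new IllegalArgumentException("site must not be empty");
        }
        int count = 0;
        int from = sequence.indexOf(site);
        while (from >= 0) {
            count++;
            from = sequence.indexOf(site, from + 1); // advance by one so overlaps are counted
        }
        return count;
    }

    /** Counts the assembled best binding site in a short, made-up test sequence. */
    public static void main(String[] args) {
        String madeUpSequence = "TTCGGCCCGAGAACGGCCCGAG"; // not real genome data
        System.out.println(countOccurrences(madeUpSequence, "CGGCCCGAG"));
    }
}
```

A chromosome could be split into chunks that are counted independently, provided neighbouring chunks overlap by the site length minus one so that sites crossing a chunk boundary are not missed; that splitting is what makes the problem a natural fit for parallel processing.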

6.7 HuGE tool

Project members built the HuGE tool, which implements the LCS algorithm and runs on a Hadoop based environment. Using LCS it was possible to answer the scientific question of ZFN binding sites on the human genome. See the table below.

Chromosome | GGGGCGGGG (Sp1 consensus) | GGGCGGGGG (Sp1 Best) | GGGCCCGGG (CB1 Best) | GGGGAGGGG (MR14 Best) | CGGCCCCAG (MA1 Best)
chr1       | 1667  | 1281  | 487  | 4461  | 392
chr2       | 1279  | 1077  | 360  | 3685  | 319
chr3       | 959   | 795   | 250  | 2857  | 219
chr4       | 703   | 561   | 199  | 2430  | 186
chr5       | 798   | 683   | 213  | 2620  | 198
chr6       | 918   | 783   | 216  | 2710  | 232
chr7       | 967   | 771   | 293  | 2586  | 290
chr8       | 722   | 575   | 192  | 2283  | 193
chr9       | 830   | 669   | 259  | 2173  | 226
chr10      | 822   | 634   | 260  | 2422  | 212
chr11      | 1078  | 758   | 311  | 2757  | 238
chr12      | 882   | 735   | 242  | 2421  | 207
chr13      | 422   | 337   | 100  | 1186  | 98
chr14      | 669   | 547   | 162  | 1609  | 141
chr15      | 553   | 429   | 175  | 1483  | 149
chr16      | 898   | 618   | 295  | 1779  | 255
chr17      | 1077  | 756   | 317  | 2203  | 286
chr18      | 352   | 294   | 103  | 1049  | 77
chr19      | 1555  | 887   | 445  | 2145  | 280
chr20      | 596   | 504   | 227  | 1418  | 149
chr21      | 244   | 182   | 91   | 595   | 79
chr22      | 567   | 420   | 196  | 1158  | 145
chrX       | 757   | 624   | 161  | 2819  | 147
chrY       | 30    | 21    | 4    | 119   | 7
Total      | 19345 | 14941 | 5558 | 50968 | 4725

Figure 27 - Number of actual binding sites listed for each chromosome source: Bach, Bajwa, & Erodula, 2011

As a conclusion, project members stated that there is a significant number of off target binding sites for 9 base pair DNA strings. This supports the hypothesis of uncontrolled genome modifications caused by off target bindings, which limits the use of ZFNs for clinical applications.

6.8 Benefits of Knet concepts in Human Genome exploration

Although the HuGE project was not developed in a Knet environment, this or any similar project would gain significant benefits from Knet. For example:

1. Knowledge of the human genome and binding behavior. As we saw in the HuGE case, the binding of proteins can be extremely complex, and analyzing it requires considering a wide range of factors. There is extensive ongoing research in this field all around the world, but the potential of collaboration is hardly utilized. This type of knowledge, organized into knowledge cubes and maintained by a community of researchers, could boost development and open new opportunities.

2. Knowledge related to computation and software tools. As demonstrated in the HuGE case, the IT systems used to fulfill the computational demands of human genome research can be complex and require a wide range of skills (for example distributed systems, parallel processing, algorithm theory etc.). Having this knowledge in a more organized and better maintained format than simple scientific journal articles could help research projects significantly. Utilizing the flexibility of a knowledge network, knowledge of IT systems built to solve human genome related computational problems can be linked directly to knowledge cubes that deal with those problems from a biomedical rather than a computer science point of view. This would bring closer not only the two types of knowledge but also the two research communities, enabling conversation and more efficient collaboration.

3. Availability of software and computation capacity. The Knet opportunities don't stop at sharing knowledge. Making ready to use tools publicly available by linking them directly to the knowledge network increases efficiency further. Depending on various factors, the shared resource can be a software component (probably with source code) that can be downloaded, public services (most likely web services) that can be invoked by other tools, or web applications for researchers.

4. Availability of data and various test results. Once knowledge and IT systems are available in the Knet environment, it is a straightforward next step to integrate the research data generated by the tools into the knowledge network. In spite of the possibly huge computing capacity available in a distributed, cloud based infrastructure, very complex computations may require a significant amount of time. So once a computation has been completed by one researcher, its results should be reusable by others.

7 Searching for possible improvements and the choice of parallel processing

All the sections above contained a research review of the current state of knowledge networks in the biomedical sciences. From this point on we look for improvements and possible contributions to this knowledge set and to the existing results. The field of knowledge networks is huge. There are many directly and indirectly related fields of study, such as:

- Collaboration networks (see 3.3 Organizational structure to improve collaboration).
- Measurement of the intellectual bandwidth and the capacity for value creation (see 3.2 The Intellectual Bandwidth Model).
- Building distributed IT systems with the purpose of extending intellectual bandwidth in a given field of knowledge (see 3.4 How to extend the Intellectual Bandwidth: case study from the healthcare industry).
- Building knowledge networks to connect researchers and information into a dynamic, living and flexible network by utilizing current technological improvements and cloud computing (see 4 The Knet environment).
- Building ontologies, which can serve as a base for more effective knowledge sharing in a given field (see 5.1 Technical Issues1: Ontologies).
- Building text mining methods, which enable better access to (more effective ways to search) the already available literature of a given field (see 5.2 Technical Issues2: Text mining).

In addition we can build specialized software tools that help researchers run simulations and analyses of events or experiments that would be too difficult, too expensive or impossible to perform without such tools. These software components can enable:

1. Faster, more effective ways of research
2. More research opportunities for a wider group of researchers
3. Easier sharing of research results in the research community.

Regarding such software tools, one of the most interesting and popular topics recently is human genome research in the fields of biomedical science and bioinformatics. One example of a tool that enables researchers to analyze the human genome more efficiently is the HuGE tool, described earlier as a case study (see section 6, Case study HuGE tool). The rest of this document deals with the improvement opportunities of the HuGE tool and with the subsequence search problem in the human genome. Since this problem is processing power intensive, we need tools and techniques from computer science to deal with it. We can define two problems. One is the biomedical problem: how to find all the occurrences of an arbitrary sequence in the human genome? To solve it we have to cross the border between biology and computer science, which leads us to a computer science problem: how to find an arbitrary subsequence (substring) in a very long sequence (string) in a timely manner? Looking at the second problem, the key to an effective solution appears to be parallel processing (as we saw in the HuGE case). So we need a solution that can execute this computation in parallel and that is highly scalable in case our biomedical research becomes more complex over time (for example, analyzing the genomes of more species, or looking for a large number of subsequences at the same time).

7.1 Parallel processing in bioinformatics - goals and plan

Based on the above considerations we want to conduct a research project. Its purpose is to find and evaluate the existing and available methods and best practices of computational processing that serve genome research projects. It is also intended to propose a solution for the computational processing needs of this type of project.
There is a tremendous need across the scientific world, in various fields, for frequently conducting large-scale computations on large data sets using distributed resources (Andrade, Andersen, Berglund, & Odeberg, 2007; Hirschman, Park, Tsujii, Wong, & Wu, 2002; Leo, Santoni, & Zanetti, 2009). Consequently, services that let end users conveniently launch their scientific applications without worrying about the technical details of the computational infrastructure (like complex programming, deployment and management) are in demand. We can refine our goal by listing tools and concepts that help in reaching the goal and at the same time give a frame to the entire project:

- It is beneficial if the details of a distributed network, like communication between nodes, looking up nodes and handling node failures, are hidden by a software layer (platform) that supports distributed computing, like Apache Hadoop. ("Apache Hadoop Project," 2008)
- It is beneficial if the details of parallel processing, like scheduling jobs, assigning jobs to worker threads, monitoring jobs and aggregating results, are handled by a framework that supports this kind of operation, like the MapReduce concept. (Dean & Ghemawat, 2004)
- To store huge data sets a database management system is needed. Since operations happen in a distributed environment, the database management system must be able to run in such an environment. When using Apache Hadoop, Apache HBase fulfills this requirement. ("Apache HBASE," 2011)
- Because of the complexity and work-in-progress status of the components involved (neither Hadoop nor HBase has reached version 1.0 yet), it may be beneficial to use a bundle of the needed components if one is available. Cloudera provides such a software bundle, called Cloudera's Distribution Including Apache Hadoop (CDH). ("Clouderas Distribution including Apache Hadoop (CDH)," 2011)
- Once the infrastructure for parallel processing, with all the necessary tools, is in place, we can start focusing on the bioinformatics problem, namely finding subsequences in the human genome. There are various algorithms already available that we can use in such operations, like the Longest Common Subsequence (LCS) algorithm (Nakatsu et al., 1982), the Knuth-Morris-Pratt (KMP) algorithm (Regnier, Kreczmar, & Mirkowska, 1989), position-specific scoring matrices (PSSM) (David T, 1999), or RMAP short read mapping (Andrew D. Smith et al., 2009).

In the next chapter (8 Parallel Processing as a computational challenge) we focus solely on the computer science aspects of the problem and try to build the infrastructure that can provide the processing power needed. The succeeding chapter (9 Subsequence search in human genome) deals with the biomedical problem and the algorithm needed for it.

8 Parallel Processing as a computational challenge

As discussed above, there are publicly available tools and concepts that can help us build an infrastructure capable of massively parallel processing. Some of these tools were already explained in the HuGE case study (section 6, Case study HuGE tool) earlier in this document. However, this part of the document intends to give not only the theoretical background and the available features of these tools, but also practical, hands-on guides that help researchers build on the results of this research. It is not the purpose of this research to invent new tools or techniques for parallel processing (to reinvent the wheel). The tools are available, and many of them are accessible free of charge or for very low fees. The challenge is to find the right tools and to put the various pieces together.

8.1 Hadoop development environment on a single PC

To build a highly scalable parallel processing infrastructure we need multiple processors and most likely multiple physical machines. Managing communication between nodes and handling possible failures becomes extremely challenging as the number of nodes in the network increases. Fortunately the open source Apache Hadoop project ("Apache Hadoop Project," 2008) can handle these problems:

1. It allows distributed processing of large datasets on a high number of nodes.
2. It can run on a single machine and can be scaled up to thousands of machines.
3. It is highly failure tolerant. Hadoop assumes that the collaborating nodes can and will fail occasionally, which is why it detects and handles node failures.

For more information on Hadoop and HDFS see section 6.2 HDFS.

8.1.1 Prerequisites

The best way to start with Hadoop is to build a small development environment on a desktop machine. According to the Hadoop documentation, Hadoop is not well tested on Windows platforms, so using it on Windows is not recommended. In theory we can build a development environment on Windows; however, it is beneficial to have a development environment as similar to the production environment as possible (so that, for example, the same shell scripts can be run in both environments for installation or initialization purposes).
For this reason the author of this document does not recommend building a Windows based development environment (and did not test any such installation). It is preferable to use one of the Linux distributions instead. The development environment used in this research project was a single laptop PC with an Intel Duo T5550 CPU, 4GB memory and Ubuntu Linux v10.04.3 (hereafter referred to as the development environment).

Note: how to get your Linux

If you have a PC with a Windows operating system on it, you can still use Linux without removing your current operating system. You have basically two options:

1. You can install Linux next to your Windows operating system. In this case you can use a so called boot loader to pick which operating system to boot at startup. You can find guides on the Internet on how to do this (for example for Ubuntu: https://help.ubuntu.com/community/WindowsDualBoot ).
2. Your other option is to create a virtual machine on your PC and install a Linux operating system on it. There are many virtualization solutions; one of the most well-known is VMWare. The VMWare Server product is free to download (http://www.vmware.com/products/server/overview.html ). In this case you might need more memory in your PC, since you are running both operating systems at the same time.

There are many free Linux distributions to choose from, like Ubuntu (http://www.ubuntu.com/ ), CentOS (http://www.centos.org/ ) and Open Suse (http://www.opensuse.org/en/ ), just to name a few.

8.1.2 Single Node setup

Once you have your operating system up and running, you can download a Hadoop distribution and run some simple tests with it. The easiest way to familiarize yourself with the Hadoop environment is to perform a single node setup. (http://hadoop.apache.org/common/docs/current/single_node_setup.html ) Once you have completed a single node setup you can use Hadoop in standalone mode or in pseudo distributed mode. In standalone mode you are running Hadoop as a single Java process. This can be beneficial for some simple debugging purposes. However, the standalone mode doesn't simulate the real, distributed production environment very well. So from now on we will use Hadoop in pseudo distributed mode in the development environment.

8.1.3 Pseudo distributed environment

You should go through the configuration steps explained in the guide referred to above. When your configuration files and SSH are set up, you should be able to format an HDFS file system and start your Hadoop system.

$ bin/hadoop namenode -format
$ bin/start-all.sh

By default the HDFS file system is placed under your /tmp folder, so (depending on your Linux configuration) you will most likely lose its content if you restart your machine.
However, if you want to keep it permanently, it is possible to configure it so. (To do that you need to add the dfs.name.dir and dfs.data.dir parameters to your <HADOOP_HOME>/conf/hdfs-site.xml configuration file. See the related Hadoop documentation for details: http://hadoop.apache.org/common/docs/current/cluster_setup.html ) You can stop your Hadoop system using the stop-all.sh shell script:

$ bin/stop-all.sh

When your Hadoop system is running, you can browse the file hierarchy on your HDFS using the NameNode web interface at http://localhost:50070/ . This web page also provides miscellaneous information about your file system (like total size, free space etc.).

Note: make sure your web server is running and accessible

To have access to the NameNode web page make sure your web server is running. This can depend on the Linux distribution and the particular web server you are using. Assuming Ubuntu Linux and Apache HTTP Server, you can check, start and stop your server like this:

/etc/init.d/apache2 status
/etc/init.d/apache2 start
/etc/init.d/apache2 stop

If you are accessing your web server from outside (for example you are running your Hadoop test environment on a virtual machine on your Windows PC), make sure the web server can be reached from outside. You can test this by running a telnet command:

telnet <your virtual hostname, or IP> 80

If you can establish a connection this way, your web server is accessible. If not, you may need to open port 80. This again depends on the Linux distribution and setup. Assuming you are using either Ubuntu or Red Hat Linux (which we will use in later examples), you should configure iptables. First you have to edit its configuration file:

vi /etc/sysconfig/iptables

If you are not familiar with editing files on Linux using vi, you can find more information on the Internet. (For example: http://acms.ucsd.edu/info/vi_tutorial.shtml ) You should add an additional rule:

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT

Finally you need to restart iptables:

/etc/init.d/iptables restart

8.1.4 Testing the HDFS

To familiarize yourself with HDFS operations you should try to create and remove directories and copy files between your local and your distributed file system. You can see the list of possible commands in the HDFS help by typing

<HADOOP_HOME>/bin/hadoop dfs -help

8.1.5 Further information about your Hadoop system

You may want to see the actual processes running on your pseudo distributed system. You can list them with the following command:

ps aux | grep hadoop

You may also want to see the log files of your Hadoop system. They are located under

<HADOOP_HOME>/logs

In case of a problem, for example with starting up or stopping your Hadoop system, this should be the first place to look.

8.2 MapReduce in a development environment

As we stated earlier, MapReduce is a programming model and an associated implementation for processing and generating large data sets. (Dean & Ghemawat, 2004) Why do we need a programming model, an additional abstraction layer, to implement applications that utilize parallel processing? The answer is that it greatly simplifies the implementation. Massively parallel processing raises very complex issues which, without a framework, would have to be handled by the programmer of the application. Such issues are fault tolerance, distribution, load balancing, task scheduling and so on. The MapReduce framework takes this burden off the shoulders of the programmer, so the implementation of parallel processing applications becomes simpler and more straightforward. For more information on MapReduce see section 6.4 MapReduce.

8.2.1 Prerequisites

To try and test MapReduce you need to have a pseudo distributed environment running. (See 8.1 Hadoop development environment on a single PC.)

8.2.2 WordCount example

The Hadoop documentation contains a simple MapReduce example. The best way to understand the concepts is to try this example code. This example is also called the Hello World application of parallel processing, which means it is the most basic piece of code you can build and run to test whether all the pieces in your environment work as they are supposed to.
A step by step guide for the WordCount example is available here: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0 You can also download the source code from this URL: http://www.infosci.cornell.edu/hadoop/wordcount.html


The WordCount example is implemented in Java. Although MapReduce applications written in other languages can also be run on a Hadoop environment, in this document we focus on Java implementations only.

8.2.3 Development and build of MapReduce applications

You don't have to use an IDE to develop MapReduce applications, but it is possible to do so. The IDE used for this example and for the other implementations provided later in this document was Eclipse (http://www.eclipse.org/ ). Eclipse is available on both Linux and Windows. Apache Ant was used to conduct builds. (http://ant.apache.org/ ) In smaller applications (like this WordCount example) the build process is very simple. However, for a larger, more complex application a repeatable, one click build process is essential, and this is exactly the benefit we gain from Apache Ant. If you use a development environment to compile and build your MapReduce application, make sure you use the same Hadoop version in your IDE as on the target system on which you want to deploy your application. This means the JAR files you need to compile your code must come from the same Hadoop version (they can even be exactly the same files if you are working on a pseudo distributed environment and using the same machine for development). This way you can avoid potential compatibility problems later.

8.2.4 Compilation and build of the WordCount application

If you decide to use Eclipse as an IDE and Apache Ant as a build tool, you'll need to do the following. (Note: in this and the following examples Eclipse SDK 3.7.0 is used.)

1. Start your Eclipse environment.
2. Create a Java project (File / New / Java Project). Let's call it Wordcount_app.
3. Copy the WordCount.java file under the src folder in your project.
4. Add the Hadoop libraries to your project. Right click on the project name, then choose Properties. Select Java Build Path. Click on the Add Library button. Select User Library. Click on User Libraries. Click on the New button. Name your new library, for example Hadoop Library. Select Hadoop Library from the list and click on Add JARs. You can select all the jar files located in HADOOP_HOME, including hadoop-<Hadoop version>-core.jar. Now you can close the window; your user library is set. In the other window select the checkbox next to Hadoop Library and click on Finish. Close the Properties window. Now the compile time errors should disappear from WordCount.java.
5. To build a jar file you can simply use the export function of Eclipse. However, if you want to implement a scripted build process, Apache Ant is the suggested tool. Create a build.xml file in your Eclipse project with the following content:


<project name="WordCount" default="build" basedir=".">
    <property name="src" location="src"/>
    <property name="build" location="build"/>

    <!-- Hadoop JAR files needed at compile time -->
    <path id="hadoop.lib.ref">
        <fileset dir="/opt/hadoop/lib" includes="*.jar"/>
        <fileset dir="/opt/hadoop" includes="*.jar"/>
    </path>

    <target name="compile">
        <!-- javac does not create its destination directory, so create it first -->
        <mkdir dir="${build}/classes"/>
        <javac srcdir="${src}" destdir="${build}/classes" classpathref="hadoop.lib.ref"/>
    </target>

    <target name="build" depends="compile">
        <jar destfile="${build}/wordcount.jar" basedir="${build}/classes"/>
    </target>
</project>

You need to change the path /opt/hadoop to your Hadoop home folder. You can open an Ant view (from Window / Show View) and drag and drop your build.xml into this view. Then double click on the build script in the view to execute the build. Your wordcount.jar will be placed into the build folder.

8.2.5 Execute and monitor the Wordcount example

We assume that you have started your Hadoop distributed environment (8.1.3 Pseudo distributed environment), that the physical root of your distributed file system is /tmp/hadoop-root (you can check this on the NameNode administrative web interface, http://localhost:50070 ), that your current directory is your Hadoop home directory, and that you have copied your wordcount.jar to ./wordcount_sample. The steps of executing the wordcount example are then the following:

1. Creation of the directory structure on HDFS
./bin/hadoop dfs -mkdir /tmp/hadoop-root/wordcount
./bin/hadoop dfs -mkdir /tmp/hadoop-root/wordcount/input

2. Creation of the input files


echo "Hello World Bye World" > file01
echo "Hello Hadoop Goodbye Hadoop" > file02
./bin/hadoop dfs -moveFromLocal file01 /tmp/hadoop-root/wordcount/input
./bin/hadoop dfs -moveFromLocal file02 /tmp/hadoop-root/wordcount/input


Figure 28 - Hadoop HDFS sample input files

3. Running Mapreduce
./bin/hadoop jar ./wordcount_sample/wordcount.jar WordCount /tmp/hadoop-root/wordcount/input /tmp/hadoop-root/wordcount/output

4. Monitoring the execution and getting the results.
a. You should see output on the standard out similar to this:
11/11/13 13:35:35 INFO mapred.FileInputFormat: Total input paths to process : 2
11/11/13 13:35:36 INFO mapred.JobClient: Running job: job_201111131320_0002
11/11/13 13:35:37 INFO mapred.JobClient:  map 0% reduce 0%
11/11/13 13:35:46 INFO mapred.JobClient:  map 66% reduce 0%
11/11/13 13:35:49 INFO mapred.JobClient:  map 100% reduce 0%
11/11/13 13:35:58 INFO mapred.JobClient:  map 100% reduce 100%
11/11/13 13:36:00 INFO mapred.JobClient: Job complete: job_201111131320_0002
11/11/13 13:36:00 INFO mapred.JobClient: Counters: 18
11/11/13 13:36:00 INFO mapred.JobClient:   Job Counters
11/11/13 13:36:00 INFO mapred.JobClient:     Launched reduce tasks=1
11/11/13 13:36:00 INFO mapred.JobClient:     Launched map tasks=3
11/11/13 13:36:00 INFO mapred.JobClient:     Data-local map tasks=3
11/11/13 13:36:00 INFO mapred.JobClient:   FileSystemCounters
11/11/13 13:36:00 INFO mapred.JobClient:     FILE_BYTES_READ=79
11/11/13 13:36:00 INFO mapred.JobClient:     HDFS_BYTES_READ=54
11/11/13 13:36:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=266
11/11/13 13:36:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=41
11/11/13 13:36:00 INFO mapred.JobClient:   Map-Reduce Framework
11/11/13 13:36:00 INFO mapred.JobClient:     Reduce input groups=5
11/11/13 13:36:00 INFO mapred.JobClient:     Combine output records=6
11/11/13 13:36:00 INFO mapred.JobClient:     Map input records=2
11/11/13 13:36:00 INFO mapred.JobClient:     Reduce shuffle bytes=91
11/11/13 13:36:00 INFO mapred.JobClient:     Reduce output records=5
11/11/13 13:36:00 INFO mapred.JobClient:     Spilled Records=12
11/11/13 13:36:00 INFO mapred.JobClient:     Map output bytes=82
11/11/13 13:36:00 INFO mapred.JobClient:     Map input bytes=50
11/11/13 13:36:00 INFO mapred.JobClient:     Combine input records=8
11/11/13 13:36:00 INFO mapred.JobClient:     Map output records=8
11/11/13 13:36:00 INFO mapred.JobClient:     Reduce input records=6

b. You can also monitor the execution using the Job Tracker web interface (http://localhost:50030 )

Figure 29 - Hadoop Job Tracker tracking the Wordcount example

c. You can access the output using the file browser of the Name Node interface (http://localhost:50075 )


Figure 30 - Hadoop output of Wordcount example
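
To make the counters in the job output easier to interpret, the data flow of the WordCount example can be imitated in plain Java. This is only an illustrative sketch: it does not use the Hadoop API, the class name WordCountSketch and its methods are hypothetical, and the combiner is assumed to run once per input file.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of the WordCount data flow; not the Hadoop API. */
public class WordCountSketch {

    /** Map step: emit one (word, 1) pair per whitespace-separated token. */
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Combine/reduce step: sum the values per word; TreeMap keeps the words sorted. */
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Contents of file01 and file02 from step 2 above.
        String[] inputFiles = {"Hello World Bye World", "Hello Hadoop Goodbye Hadoop"};
        int mapOutputRecords = 0;
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        for (String content : inputFiles) {
            List<Map.Entry<String, Integer>> mapped = map(content);
            mapOutputRecords += mapped.size();
            // Assumed combiner behaviour: pre-aggregate each input file separately.
            combined.addAll(reduce(mapped).entrySet());
        }
        Map<String, Integer> result = reduce(combined);
        System.out.println("Map output records=" + mapOutputRecords);
        System.out.println("Combine output records=" + combined.size());
        System.out.println("Reduce output records=" + result.size());
        for (Map.Entry<String, Integer> entry : result.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the two sample files the sketch reports 8 map output records, 6 combine output records and 5 reduce output records, which agrees with the corresponding counters in the job output above.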

8.3 Column oriented database considerations
Looking for opportunities to utilize column oriented databases in human genome research projects was among the original goals of this research. There are several possible applications of a column oriented database in human genome research, one of which is storing temporary results for later reuse, which can decrease computation time. This statement needs a little more explanation. Let's assume that we are executing an algorithm that searches for subsequences in large chromosomes. At a very high level the algorithm works like this:
1. Is the data set too big to process quickly enough?
a. If yes, then split it into smaller pieces and go back to step 1 for each piece.
b. If no, then execute the computation, then continue with any other parts not computed yet.
If a column oriented database is available during this computation, we can optimize the above algorithm in the following way:
1. Was this data set computed before?
a. If yes, then get the results of the computation from the database (and continue with any other parts not computed yet).
b. If no, go to step 2.
2. Is the data set too big to process quickly enough?
a. If yes, then split it into smaller pieces and go back to step 1 for each piece.
b. If no, then execute the computation, store the result in the database, then continue with any other parts not computed yet.
Assuming that the processing of big data chunks requires a lot of computation (which is true when searching genome data) and assuming that there is a database with good enough accessibility and capacity to store huge amounts of data (which is true of column oriented databases), the second algorithm can significantly outperform the first one whenever the same data sets are processed more than once.
A column oriented database was ultimately not used in this research because of the given time constraints. However, the results presented in this document can be extended as described in the high level procedures above. The efforts and conclusions of this research regarding the use of column oriented databases are explained in this part of the document.

8.3.1 HBase installation challenges
HBase and column oriented database concepts were explained earlier in this document. (See 6.3 HBase on page 26.) HBase is part of the Hadoop project, so it utilizes other Hadoop components like HDFS. However, according to the HBase documentation (http://hbase.apache.org/book/hadoop.html), at the time of writing this document it is not possible to run HBase reliably with any currently available official Hadoop release:
HBase will lose data unless it is running on an HDFS that has a durable sync. Hadoop 0.20.2 and Hadoop 0.20.203.0 DO NOT have this attribute. Currently only the branch-0.20-append branch has this working sync. No official releases have been made from the branch-0.20-append branch up to now so you will have to build your own Hadoop from the tip of this branch.
The steps of such a build process are known and were (unofficially) documented: http://www.michaelnoll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/
This research doesn't try to cover the topic of the Hadoop build process.
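
Returning to the cached procedure outlined at the beginning of section 8.3, its control flow can be sketched in a few lines of Java. This is a minimal sketch under stated assumptions, not the implementation used in this research: an in-memory HashMap stands in for the column oriented database, the class and method names are hypothetical, and the "computation" is simply counting the occurrences of a pattern whose start index falls inside a chunk (so that splitting a chunk never loses a match on the boundary).

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the cached divide-and-conquer search; all names are hypothetical. */
public class CachedSubsequenceSearch {
    // Assumption: a HashMap stands in for the column oriented database.
    private final Map<String, Integer> resultStore = new HashMap<>();
    private final String genome;
    private final int maxChunk;
    private int computations = 0;

    public CachedSubsequenceSearch(String genome, int maxChunk) {
        if (maxChunk < 1) {
            throw new IllegalArgumentException("maxChunk must be at least 1");
        }
        this.genome = genome;
        this.maxChunk = maxChunk;
    }

    /** Counts the occurrences of pattern whose start index lies in [from, to). */
    public int count(String pattern, int from, int to) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        String key = pattern + ":" + from + ":" + to;
        Integer cached = resultStore.get(key);
        if (cached != null) {
            return cached;                      // step 1a: computed before
        }
        int result;
        if (to - from > maxChunk) {             // step 2a: too big, split
            int mid = from + (to - from) / 2;
            result = count(pattern, from, mid) + count(pattern, mid, to);
        } else {                                // step 2b: compute directly
            computations++;
            result = 0;
            for (int i = from; i < to; i++) {
                if (genome.startsWith(pattern, i)) {
                    result++;
                }
            }
        }
        resultStore.put(key, result);           // store for later reuse
        return result;
    }

    /** Number of chunks that were actually scanned (not served from the store). */
    public int getComputations() {
        return computations;
    }

    public static void main(String[] args) {
        CachedSubsequenceSearch search = new CachedSubsequenceSearch("ACGTACGTACGA", 4);
        System.out.println(search.count("ACG", 0, 12));
        System.out.println(search.count("ACG", 0, 12)); // answered from the store
        System.out.println("chunks scanned: " + search.getComputations());
    }
}
```

The second call with the same arguments does not scan any chunk again, which is the saving the procedure relies on. A real key would have to identify the data set itself (for example chromosome, position range and pattern) rather than positions in a single in-memory string.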
The Hadoop build process can, however, be an interesting topic for other researchers.

8.3.2 Out of the box pre built bundles (CDH3)
It is possible to download a third party solution that bundles both Hadoop and HBase. Cloudera is such a third party, and it provides a bundle called Cloudera's Distribution Including Apache Hadoop (CDH3). CDH3 is available in various formats (https://ccp.cloudera.com/display/SUPPORT/Downloads ). There is an installer called SCM Express Edition, which is able to install the bundle on your system. There are also various virtual machine images available that have CDH3 preinstalled.

Note that the SCM Express Edition installer is supported only on the following operating systems:
- 64 bit Red Hat Enterprise Linux 5 or higher
- 64 bit CentOS 5 or higher
- 64 bit SUSE Linux Enterprise Server 11 service pack 1 or later

You need a physical 64 bit processor even if you are building a virtual machine with one of the supported operating systems above. The SCM Express Edition was not tested in this research, so no more details are covered in this document. Virtual machine images of CDH3 are available for KVM, VMWare and VirtualBox. In this research the VMWare image was downloaded and started.
Note: At the time of writing this document, running VMWare Server on Linux has some issues:
- To use the VMWare console you need a plugin for your browser. Since Internet Explorer is not available on Linux, you have to use Firefox; however, the plugin works only in Firefox 3.5 and not in newer versions, and Firefox 3.5 is no longer officially available. (Some problems with handling mouse events and refreshing the console screen were also experienced in this setting.)
- The VMWare Server itself also needed a patch to run on Ubuntu Linux 10 (http://radu.cotescu.com/how-to-install-vmware-server-2-0-x-on-ubuntu-9-10-karmickoala/)
If you are running VMWare on a Windows machine you don't need to worry about the issues above. In this research the VMWare image was started on a Windows platform (Vista Home Edition) with a 64 bit AMD processor and 2 GB of memory. The virtual machine was able to boot and all the Hadoop services started correctly. However, running the simple WordCount example described earlier caused an out of memory error. No further tests were made with the virtual machine image.
Attention! If you plan to experiment with one of the CDH3 virtual machine images, make sure that your CPU is 64 bit. You won't be able to boot the virtual machine on a 32 bit processor.

8.4 Amazon AWS
Acquiring hardware, installing and configuring the required software, and in general building a distributed system can be challenging and time consuming. (Not to mention that it may cost a lot of money.) Genome search applications by their nature require huge processing power and can utilize distributed environments that provide highly parallel computation. Moreover, this high computational power is necessary only for a relatively short time (while a particular calculation is executing). These requirements suggest that a cloud based solution can be a viable option.
The most widely known cloud computing resource provider is Amazon. Amazon Web Services (http://aws.amazon.com/ ) covers numerous cloud based solutions. To execute parallel processing solutions we need access to processing units (virtual machines), which can be the nodes in our distributed environment, and a shared storage space where the inputs and outputs of the computations can be stored. Both are available in Amazon AWS. Amazon Elastic Compute Cloud (EC2) provides access to an arbitrary number of processing units in a very flexible way. Amazon Simple Storage Service (S3) provides a storage service for EC2 computing units. These two services alone would make it much easier to set up a distributed environment that fulfills our processing needs. To make things even easier, Amazon provides another service called Amazon Elastic Mapreduce, which provides not only the storage and the processing units but the whole environment pre-installed with Hadoop (HDFS and Mapreduce). This lets us focus only on the problem we are seeking a solution for (searching for subsequences in the human genome). In the following subsections the three Amazon services mentioned above are introduced: EC2, S3 and Amazon Elastic Mapreduce.

8.4.1 Amazon EC2
Amazon Elastic Compute Cloud is a true virtual computing environment. Users can create their own virtual machines, clusters or distributed systems on demand.
Basically the experience is the same as having a powerful virtual environment of your own, except that you don't need to worry about the maintenance of the hardware and the virtualization layer; you need to deal only with your virtual machines. Of course at the same time you lose the feeling of full control: you don't own the hardware, you don't have physical access to it, and the same is true of the virtualization layer. It is similar to a lease. Several practical questions come to mind regarding a solution like EC2.
How is it possible to create a virtual machine for myself? The EC2 service provided by Amazon is accessible through web services, so it is possible to write a client in any language that manages EC2 instances (virtual machines) for you. Amazon has its own implementation of such a client application, called the AWS Management Console. This is a single web application that lets you manage all your AWS services, including your EC2 instances, in one place. So to answer the question above: most likely you will open the AWS Management Console, pick one of the instance templates provided by Amazon (usually a clean installation of one of the operating systems they provide: Amazon Linux, Red Hat Enterprise Linux, SUSE Linux Enterprise Server and various Windows servers), select a virtual hardware configuration and the number of instances needed, and you are ready to go. Amazon will create the requested instances for you and boot them up.
How is it possible to access the EC2 instances? Each instance has a public DNS name. Using this name you can simply SSH into your EC2 instance. You can download the private key Amazon generated for you and log in as root.
Can an EC2 instance have its own IP address? It's possible to request elastic IP addresses. These IP addresses are assigned to your Amazon account and you can assign them to any of your EC2 instances. This way you can handle your instance as if it had a static IP.
Is it more cost efficient to run applications using EC2 than to build my own distributed system? It depends. Amazon AWS probably doesn't give a cost advantage in all possible cases. However, for computation intensive calculations on big datasets, like performing searches on the human genome, it is a very cost effective solution. Even if we build a relatively large distributed system, for example 50 nodes, all of them standard large EC2 instances (each with 7.5 GB of memory and two virtual cores with two EC2 compute units each, where one compute unit is equivalent to a 1.0-1.2 GHz 2007 Opteron processor), and the whole processing takes one hour, it would cost us $17 at the time of this writing (50 instances x 1 hour x $0.34/hour for a standard large EC2 instance). This makes these solutions attractive even for research projects with a tight budget.
More information about the EC2 service is available on the Amazon Web Services web site:
- Features and pricing: http://aws.amazon.com/ec2/
- Documentation: http://aws.amazon.com/documentation/ec2/
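
The arithmetic behind the $17 estimate can be written out as a small sketch. The class and method names are illustrative only, and the rate of $0.34 per instance-hour is the figure quoted above for the time of writing; current prices will differ.

```java
import java.math.BigDecimal;

/** Back-of-the-envelope EC2 cost estimate; names are hypothetical. */
public class Ec2CostEstimate {

    /** Cost = number of instances x billed hours x hourly rate per instance. */
    public static BigDecimal estimate(int instances, int hours, BigDecimal hourlyRate) {
        return hourlyRate.multiply(BigDecimal.valueOf((long) instances * hours));
    }

    public static void main(String[] args) {
        // 50 standard large instances running for one hour at $0.34 per hour.
        BigDecimal cost = estimate(50, 1, new BigDecimal("0.34"));
        System.out.println("Estimated cost: $" + cost.toPlainString());
    }
}
```

BigDecimal is used instead of double so that the monetary amount is not affected by binary floating point rounding.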

8.4.1.1 Creating an AWS account
1. To start using EC2 or any other service you first need an Amazon account for AWS. Go to http://aws.amazon.com/ and click on Create an AWS account at the top of the page. If you already have an Amazon account you can use it, or you can create a new one.


Figure 31 - AWS account registration

2. You also need to register a credit card with your account. Amazon will charge this card based on your usage of the service.


Figure 32 - AWS account registration 2

3. Amazon will also verify your identity with an automated phone call to the phone number you gave at registration.

8.4.1.2 Creating an EC2 instance
1. Now that you have an account you can sign in. Click on the Sign in to the AWS Management Console link.
2. The AWS Management Console is a web application that lets you manage all your Amazon Web Services from one place. At the top of the screen the various services are listed as tabs. We want to work with the EC2 service, so click on EC2.
3. Using the console you can stop and start EC2 instances; make a snapshot of one or more instances, save it, then restore it; attach virtual storage to your instances, which remains available after you terminate your instance; and configure network security for your instances. In this example we want to create an instance. Click on Launch Instance.


Figure 33 - The EC2 service

4. Amazon has many pre-created images (AMI) to start with, so you can simply choose one. In the examples in this document we will use Red Hat Enterprise Linux 6.1 64 bit, so let's choose that one.


Figure 34 - EC2 instance creation 1

5. On the next page you can specify the instance details. For this example we create one Micro instance. You may also specify the availability zone or request a spot instance. You can read more about these options in the documentation. (See the link above.)


Figure 35 - EC2 instance creation 2

6. On the next page we can use the default values. One important option on this page is the Shutdown Behavior, which specifies what should happen to your instance when you shut it down: it can either stop or terminate. When it is stopped, you can start it again; you don't need to create a new instance. But if you terminate your instance all its resources are released, so anything you did on your virtual machine is lost (except the data you saved to a virtual storage). Of course you won't be charged for an instance that you terminated; stopped instances, however, can still incur charges (for example for the EBS storage that backs them). (If your intention is to create many instances for only a short period of time to do a processing intensive calculation, as we will do in later examples, it can be very inconvenient to realize days later that you only stopped your instances instead of terminating them.)


Figure 36 - EC2 instance creation 3

7. You can also add arbitrary tags to your instances. This can be useful when you have a large number of instances for various purposes, where without tags it could be difficult to find the one you want to work on.
8. In the next step you can create a public/private key pair. This will be needed to log in to your instance from your desktop machine. Once created, you can download the key pair to your desktop. Note: Be careful how you store your key pair. For security reasons Amazon does NOT save the private part of your key pair, so you will have the only copy of it. This means that if you lose it (for example your local hard drive crashes) you won't be able to access the instances that use this key pair. Your only option will be to terminate those instances (potentially losing valuable data).


Figure 37 - EC2 instance creation 4

9. You can also configure your firewall settings. All the instances are behind a firewall. We will need to change the default settings later, but for now you can choose the default configuration from the list.


Figure 38 - EC2 instance creation 5

10. On the last page you can review all your settings and launch the instance.


Figure 39 - EC2 instance creation 6

11. Now you can see your new instance under Instances, where you can also monitor its status. At first the status should be pending, then it changes to running. Once it is running you can access it through SSH.


Figure 40 - List EC2 instances

8.4.1.3 Accessing an EC2 instance
You can access your EC2 instance from either a Linux or a Windows environment. Access from a Linux machine is described below. You can read more about the Windows options in the AWS documentation (http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/ ).
1. We will access the instance using SSH, so first we need to make sure that port 22 is open on Amazon's firewall, behind which our instance is located. First let's check which security group the instance is assigned to. Sign in to the Management Console, select the EC2 tab, list your instances (Instances link on the left) and select the instance you just created. Under the description you can find which security groups are assigned to your instance. In this case the default security group is assigned.


Figure 41 - EC2 instance - finding the security group

2. Select Security Groups under NETWORK & SECURITY. Click on the default security group and select the Inbound tab. Check if the port 22 is open for the outside world (source: 0.0.0.0/0). On the screenshot below, it is not open, so we need to open it.


Figure 42 - Network and Security - checking ports

3. Type 22 into the Port range field. Leave everything else as default (new rule: Custom TCP rule, Source: 0.0.0.0/0). Click on Add Rule, then Apply Rule Changes. Now all the instances associated with the default security group have their port 22 accessible.


Figure 43 - Network and Security - open a port

4. Now it is possible to access the instance. Go to the folder where you saved the key pair at the time of the instance creation. Restrict the file permissions of the key pair file, or the SSH program won't accept it as a valid key pair.
chmod 400 MyKeypair.pem
5. Now go back to your instance properties in the Management Console and look up its public DNS name (ec2-107-22-40-16.compute-1.amazonaws.com in this example).


Figure 44 - EC2 instance - finding public DNS name

6. Now you can issue the SSH command
ssh -i <key pair file> root@<public DNS>
In this example the command looks like this:
ssh -i MyKeypair.pem root@ec2-107-22-40-16.compute-1.amazonaws.com
7. Now you are logged in as root.

8.4.1.4 Accessing a web server running on an EC2 instance
One typical scenario is starting a web server on our EC2 instance and accessing it from outside. (This has a practical purpose when working with Hadoop Mapreduce, since the monitoring application runs on a web server.)
1. Go to the security groups and open port 80. (See details in the previous procedure.)
2. Log into your EC2 instance. Check if your web server is running.
service httpd status
3. If it is not running, start it.
service httpd start
4. Open port 80 on the virtual machine's software firewall. To do that you have to edit the iptables configuration, for example like this:
vim /etc/sysconfig/iptables
5. You need to add a line to enable port 80:
-A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
6. Then you need to restart iptables for the new settings to take effect.
service iptables restart
7. Now you should be able to access the Apache Web Server running on your virtual instance from your local machine by typing the URL into your browser:
http://<public_dns>
For example http://ec2-107-22-40-16.compute-1.amazonaws.com


Figure 45 - EC2 instance - accessing the web server
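
Whether the firewall changes described in 8.4.1.3 and 8.4.1.4 took effect can also be checked from the local machine with a few lines of Java. The following is a minimal sketch: the class name PortCheck and the two second timeout are assumptions, and the default host name is the example instance used above, which has to be replaced with the public DNS name of your own instance.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether a TCP port accepts connections; names are hypothetical. */
public class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    public static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, filtered by a firewall, or unknown host.
            return false;
        }
    }

    public static void main(String[] args) {
        // Example instance name from this document; pass your own as the first argument.
        String host = args.length > 0 ? args[0] : "ec2-107-22-40-16.compute-1.amazonaws.com";
        for (int port : new int[] {22, 80}) {
            System.out.println("port " + port + " open: " + isPortOpen(host, port, 2000));
        }
    }
}
```

A result of false for port 80 while port 22 is reachable usually points to either the security group or the iptables rule from steps 1 and 4 to 6 above.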

8.4.2 Amazon EBS and S3
Now we know how to start and use Amazon EC2 instances whenever we need them. But how can we store data? Without leaving the AWS infrastructure there are three possible ways to store data:
- Using the inner storage of the EC2 instance
- Elastic Block Store (EBS)
- Simple Storage Service (S3)

To use the inner storage of an EC2 instance you don't need to do anything. You simply save data to the file system that is already available when you start up your instance. The drawback is that your data will be lost once you terminate your instance. If you need more permanent storage you need to consider using EBS or S3.
When deciding between EBS and S3 you need to consider a couple of factors. If you need storage that, besides being permanent, behaves like a regular hard disk in your instance (more precisely, like a virtual hard disk), then EBS is the right solution for you. You can create an EBS of arbitrary size and attach it to any of your instances. (There are certain restrictions, which will be explained later.) Once you have attached your EBS to your instance, it behaves like an additional hard disk from the instance's point of view, so you can format it as you would a regular disk and mount it. If you want, you can detach your EBS and attach it to another instance. If you shut down and terminate your instance, your EBS is detached automatically and your data won't be lost. However, you can't attach an EBS to multiple instances at the same time, and there's no other way to access the data on it except through an instance to which it is attached.
S3 takes a different approach. It is less bound to the EC2 instances themselves. It has its own public APIs for accessing data (a native Amazon API implemented in many technologies like Java, .NET, PHP and Ruby, and standard web services). Its benefit from an EC2 instance's point of view is that it is not limited to being accessed by only one instance; what's more, it can be accessed from any non-Amazon system as well. However, accessing it is less transparent for an EC2 instance: the instance needs to run an application that handles the access to S3.
In practice, and according to Amazon's recommendations, it is best to store only temporary data on instance storage. The regular operational data should be stored on EBS, and S3 is a good option for backup copies and for administrative operations on EBS (like migrating or resizing them). However, the right combination of storage solutions varies with the purpose of the system under consideration.
In this part of the document we go through a couple of examples of how to use EBS and S3. We demonstrate how to create an EBS, attach it and make it available to an EC2 instance. We also explain how to access S3 from an EC2 instance. You can read more about EBS in the EC2 documentation (http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/ ) and about S3 in the S3 Developer Guide (http://docs.amazonwebservices.com/AmazonS3/latest/dev/ ).

8.4.2.1 Creation of an EBS
1. Sign in to the AWS Management Console and choose the EC2 tab. (We assume that you have an EC2 instance running, to which we will attach the EBS.)
2. Check in which zone your instance is running, because you need to create your EBS in the same availability zone. (In this example it is us-east-1d.)


Figure 46 - EC2 instance - finding the zone

3. Select Volumes under ELASTIC BLOCK STORE and click on Create Volume.
4. Set the size of your EBS (in this example 1 GB is used) and set the availability zone to the same zone as your instance. This new EBS won't be created from a snapshot; it will be a new, empty EBS. Click on Yes, Create.


Figure 47 - EBS creation

5. You can see that your new EBS is now created and available.


Figure 48 - Listing EBS volumes

8.4.2.2 Attaching an EBS to an EC2 instance
Now that your EBS is available you can attach it to your EC2 instance.
1. Select the EBS you just created and click on Attach Volume.
2. You need to choose your instance and enter a device. At the time of writing this document, picking a device worked somewhat unpredictably. In many cases the system returned an error message indicating that the named device was already in use. Usually, trying the ranges of devices suggested in the dialog (from /dev/sdf through /dev/sdp and from /dev/xvdf through /dev/xvdp) sooner or later results in a successful attach.


Figure 49 - Attaching EBS to EC2 instance

3. Check the status in the list of volumes to see when the EBS becomes attached. (At first the status will be attaching.)

8.4.2.3 Making an EBS available for an instance
Attaching an EBS to your instance is similar to attaching an additional disk to a physical machine, so you need to make this disk available for use.
1. First you need to find the device under /dev . This is supposed to be the same device name that you entered when attaching your EBS. However, at the time of writing this document some discrepancies were experienced, and the new device name didn't match the device name given at the time of the attachment. You can list all of your devices:
ls /dev
And you can check which device is mounted as the root directory (/):
mount
Based on this information you can guess which device is the new disk.
2. Next you need to format your new disk. (Assuming that the new device name is /dev/xvdl)
mkfs -t ext3 /dev/xvdl
3. Now you can mount the new device.
mkdir /mnt/data
mount /dev/xvdl /mnt/data
Now you can access your EBS like a regular (virtual) hard disk of your instance. In a similar way you can unmount and detach an EBS from an instance and attach it to another instance, or you can delete it completely.

8.4.2.4 Creating a bucket and uploading data to S3
The easiest way to use S3 is through the AWS Management Console.
1. Sign in to the AWS Management Console and switch to the S3 tab.
2. To use S3 you first need to create a bucket with a globally unique identifier. Click on the Create Bucket button.
3. One way to make an identifier unique is to combine it with the URL of your organization, for example edu.bridgeport.tragoncs.


Figure 50 - Creating S3 bucket

4. Now you can create a new folder in your new bucket using the Create Folder button (for example mytest).
5. Go into your new folder. Create a text file on your desktop machine (for example testfile.txt) with arbitrary content (for example Hello World). Upload the file into your new folder.

8.4.2.5 Accessing S3 from an EC2 instance
As mentioned above, S3 services can be accessed using various APIs. You can use standard web service calls or the native API implemented in multiple programming languages. Here only one method is briefly described for demonstration purposes: the native API using Java. (You can read more about other options in the S3 Developer Guide http://docs.amazonwebservices.com/AmazonS3/latest/dev/ .)
To use the Java API for accessing S3 services you first need to download the required libraries (jar files). Download the AWS SDK for Java from the AWS web site http://aws.amazon.com/sdkforjava/ . You will also need a third party library, the Apache HTTPCore library, which you can download from http://hc.apache.org/downloads.cgi . Your classpath should contain the following jars (your version numbers may differ from the ones in this example):
- aws-java-sdk-1.2.10.jar
- commons-codec-1.4.jar
- commons-logging-1.1.1.jar
- httpclient-4.1.2.jar
- httpcore-4.1.2.jar

The following example lists the contents of your bucket and opens and reads your test file.
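import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3AccessExample {
    public static void main(String[] args) throws IOException {
        // Credentials and client creation
        String myAccessKeyID = "<Your Access Key ID>";
        String mySecretKey = "<Your Secret Access Key>";
        String bucketName = "<Your bucket name>";
        AWSCredentials myCredentials = new BasicAWSCredentials(myAccessKeyID, mySecretKey);
        AmazonS3 s3client = new AmazonS3Client(myCredentials);

        // Listing the contents of your bucket
        ObjectListing objectListing = s3client.listObjects(
                new ListObjectsRequest().withBucketName(bucketName));
        for (S3ObjectSummary summary : objectListing.getObjectSummaries()) {
            System.out.println(summary.getKey());
        }

        // Reading the content of a test file
        S3Object object = s3client.getObject(
                new GetObjectRequest(bucketName, "<your test file>"));
        InputStream objectData = object.getObjectContent();

        // Process the objectData stream: print the first line of the file
        BufferedReader reader = new BufferedReader(new InputStreamReader(objectData));
        System.out.println(reader.readLine());
        objectData.close();
    }
}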

To run this Java program you will need your AWS access credentials. You can find them on the AWS website.
1. Go to http://aws.amazon.com/
2. Under Account click on Manage Your Account.
3. On the left click on Security Credentials.
4. Sign in if asked.
5. You can find your Access Key ID and Secret Access Key on this page.
You will also need the name of your bucket. (In this example it is edu.bridgeport.tragoncs.) Finally you need to enter the path and name of your test file under your bucket. (In this case mytest/testfile.txt.) Then you can run the program either on your local desktop machine or on your EC2 instance; it will access your bucket, list its contents and print the first line of your test file.

8.4.3 Amazon Elastic Mapreduce
After a general overview and a couple of examples of the storage options in an EC2 environment, we can move on to our main goal, namely how to execute Mapreduce applications in an EC2 environment. Using the AWS tools introduced so far it would be possible to build a multiple node Hadoop system and set up Mapreduce tasks on it. To make things even easier, Amazon provides this as a ready to use service called Amazon Elastic Mapreduce. Using this service, building and running Mapreduce applications becomes quick and easy. The main steps of building and running a Mapreduce application using Elastic Mapreduce are the following:
1. Implement your map and reduce functions and test them in a development environment.
2. Create the input files for your application.
3. Upload both your application and input files to S3.
4. Create an Elastic Mapreduce Job Flow. Specify your input files, your application and the parameters of the distributed system (number of nodes, capacity of each node).
5. Run and monitor your job flow.
6. Retrieve the results from S3.
What does this additional help from Amazon really mean? Let's compare how different it would be to build a real distributed Hadoop system with and without Amazon AWS. Without AWS (and assuming that we won't use any other virtualization solution either) building a Hadoop infrastructure and running Mapreduce would look like this:
1. Acquisition of servers and configuration (BIOS setup, setting up disks and RAID etc.)
2. Operating system installation on each server and operating system level configuration. (Setting up networking, making sure each node can access the others.)
3. Hadoop installation and configuration on each server and starting the distributed Hadoop system.
4. Build, install and run the Mapreduce application on the system.
Amazon EC2 services take care of steps 1 and 2. (This is the advantage of virtualization. We can gain this benefit by using any kind of virtualized environment.) Amazon Elastic Mapreduce covers step 3 and implicitly covers steps 1 and 2 through EC2. So we can focus only on the last step.

Amazon provides example job flows to give you an idea about the service before building your own code. In this part of the document we will first demonstrate how to build a Job Flow using one of these examples. Then we will try to make our own code run in this environment. For this purpose we will use the WordCount example, which we already tested in our local development environment. (See 8.2.2 WordCount example on page 40.) You can read more about this service on http://aws.amazon.com/elasticmapreduce/ , or you can access the complete developer documentation from http://aws.amazon.com/documentation/elasticmapreduce/ .

8.4.3.1 Starting a sample Mapreduce application provided by AWS

1. Sign in to the AWS Management Console and switch to the Elastic Mapreduce tab.
2. Click on Create New Job Flow.
3. Now you can see the Define Job Flow screen. Choose the Run a sample application option and choose CloudBurst (Custom JAR). We have two reasons for choosing this particular sample application. The first one is that this is a custom JAR Java application, which is the same format in which we ran our sample Mapreduce application in a previous section of this document. (See 8.2.2 WordCount example on page 40.) The second one is that this is a genome search example, which is exactly the topic we are interested in.


Figure 51 - Elastic Mapreduce - creating a job flow 1

4. On the next screen you can specify your JAR location and the parameters passed to the JVM when running the JAR. Since this is a sample application, the parameters are already given. There is a bucket defined in S3 with the name elasticmapreduce, which contains all the files needed (the JAR itself and the input files). The only thing you need to do is type in the name of your own bucket, so that the output files are saved into your bucket and you can take a look at the results once the execution is completed.


Figure 52 - Elastic Mapreduce - creating a job flow 2

5. On the Configure EC2 instances page you can specify the number and type of instances (nodes) you want to build a Hadoop cluster from. You can specify instances in the Master, Core and Task Instance groups. You must have a master instance group with one and only one instance in your Hadoop cluster. This instance has the following responsibilities:
   a. It distributes the Mapreduce executable (in this case the JAR file) and subsets of the input data among the other nodes.
   b. It tracks the status of the processing based on the metadata the other nodes send back about the status of their processing.
   c. It also monitors the health of the other instances (whether they are still available, or whether the execution of an assigned task takes too long and times out).

Core and Task instances are the worker nodes. They perform the actual computation and save the output to S3. The main difference between Core and Task instances is that Core instances participate in the HDFS, while Task instances don't. This means in practice that the presence of Task instances doesn't have any effect on the Hadoop file system (which stores the temporary data produced as the output of the map tasks and serves as the input for the reduce tasks). So you need to have at least one Core instance (if you don't start any, your Master instance will also be your Core instance). If you need extra capacity while your Mapreduce job flow is running, you can add both Core and Task instances. However, you can remove only Task instances without damaging the HDFS. (Task instances run only TaskTracker Hadoop daemons, while Core instances run both TaskTracker and DataNode Hadoop daemons.) In this example we can simply use the default values (a Small Master instance plus two Small Core instances).

Figure 53 - Elastic Mapreduce - creating a job flow 3

6. Under Advanced Options you can set an Amazon EC2 key pair to make it possible to connect to your servers through SSH. You can pick the key pair you generated for the previous examples. (Attention! You need to have the private key saved on your hard drive. Without it you won't be able to log in.) For this example we also enable logging. You should create a folder in your S3 bucket to store your log files. It is possible to enable additional debugging, which we won't use in this example. You should change the Keep Alive option to Yes. With this option you will be able to examine your nodes and their status even after the execution of the job flow is over. (Otherwise the EC2 instances would be terminated automatically, so you wouldn't be able to log in once the job flow is completed.)

Figure 54 - Elastic Mapreduce - creating a job flow 4

7. On the Bootstrap Actions page it is possible to set up additional configuration options for the instances in the cluster. We don't use any additional bootstrap action in this example.
8. On the Review page you can see all your settings. Once you click on Create Job Flow, AWS will start your EC2 instances and start the execution of the Mapreduce application.


Figure 55 - Elastic Mapreduce - creating a job flow 5

8.4.3.2 Monitoring the sample Mapreduce application

1. You have just started your job flow. You can see your new job flow in the starting state in the job flow list.


Figure 56 - Elastic Mapreduce - listing job flows

2. You can also see the EC2 instances needed for the cluster being initialized in the list of EC2 instances (on the EC2 tab).


Figure 57 - Elastic Mapreduce - listing related EC2 instances

3. You can monitor your cluster through your master instance. So first you need to determine which EC2 instance is your master instance. Go to the instances list on the EC2 tab. Switch to the Tags tab on the lower part of the screen. Your master instance has a tag with the key aws:elasticmapreduce:instance-group-role and the value MASTER. You can find this instance's public DNS name on the Description tab as explained in the EC2 examples.


Figure 58 - Elastic Mapreduce - identifying the master node

4. You can access this instance using SSH with the user hadoop. (See the previous EC2 examples, 8.4.1.3 Accessing an EC2 instance on page 59.)
5. You can also access the MapReduce job tracker web interface and the HDFS name node web interface. Both of these services run on your master instance. However, to access them you need to establish an SSH tunnel between your desktop and your master instance, and you need to start a proxy service on your desktop to send your requests through the tunnel. The Amazon documentation guides you through this process using SSH for the tunnel creation and FoxyProxy for the proxy creation (http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingtheHadoopUserInterface.html ). Using this option you can observe the status of your Mapreduce processing. The summary of your cluster:


Figure 59 - Elastic Mapreduce - Job Tracker Interface 1

And the status of the particular jobs:

Figure 60 - Elastic Mapreduce - Job Tracker Interface 2

You can check the status and browse the HDFS from the NameNode interface:


Figure 61 - Elastic Mapreduce - the Name Node interface

6. Once all of your Mapreduce tasks are done (the status of every task is 100%), you can find the output in the S3 bucket you defined, and you can also check the Mapreduce log files there.

8.4.3.3 Run the WordCount example using Elastic Mapreduce

We have already tried the WordCount example from the Hadoop documentation and ran it in a local development environment. This time we'll do the same thing using Elastic Mapreduce.
1. Sign in to the AWS Management Console and go to the S3 tab.
2. Create a folder in your bucket called wordcount and open it.
3. Create a subfolder called input and open it.
4. Find the two sample input files used in the WordCount example.
   file01: Hello World Bye World
   file02: Hello Hadoop Goodbye Hadoop
5. Upload file01 and file02 into the input folder you just created.


Figure 62 - Elastic Mapreduce - Wordcount example - input files

6. Now go to the wordcount folder and upload the file wordcount.jar used in the WordCount example into it.
7. Create another subfolder in the wordcount folder called output.
8. Go to the Elastic Mapreduce tab and click on Create New Job Flow. Choose the Run your own application option and select Custom JAR.


Figure 63 - Elastic Mapreduce - Wordcount example - creating a job flow 1

9. On the Specify Parameters page you should type in your JAR location and the arguments.
   Your JAR location is:
   <the name of your bucket>/wordcount/wordcount.jar
   And the arguments:
   WordCount s3n://<the name of your bucket>/wordcount/input s3n://<the name of your bucket>/wordcount/output


Figure 64 - Elastic Mapreduce - Wordcount example - creating a job flow 2

10. This computation can easily be done by a single node. However, for the sake of the test let's use one core node as well. (When setting 0 core nodes, the master instance takes the role of the core node as well, similar to the pseudo distributed environment we built as a development environment. See 8.1.3 Pseudo distributed environment on page 38.)


Figure 65 - Elastic Mapreduce - Wordcount example - creating a job flow 3

11. On the Advanced Options page you can choose your key pair, enable debugging and send the log into an arbitrary folder in your bucket. You may also select the Keep Alive option to give you a chance to log in to the nodes and check their status.
12. There's no need for Bootstrap actions.
13. On the Review page click on Create Job Flow.
14. You can see your instances being started under the EC2 tab. Once they get into the Running status, the Mapreduce execution will be started.
15. Once the execution is done, you can find the output in your output folder. (In this case the number of occurrences of each word in the given text files.)

9 Subsequence search in human genome

The previous chapter focused on the technological part of our problem, namely how to execute parallel processing in an efficient manner. Now that we have some tools to accomplish that, we can go into the details of the original problem.

In an earlier chapter we defined the problem (see 7 Searching for possible improvements and the choice of parallel processing on page 34). The problem from a biology point of view: How to find all the occurrences of an arbitrary sequence in the human genome? This problem can be transformed into a mathematical problem: How to find an arbitrary subsequence (substring) in a very long sequence (string) in a timely manner? This chapter's purpose is to clarify this problem definition, review some existing methods and recommend a solution with which it is possible to build a complete system that can utilize the technological recommendations from the previous chapter and can be a starting point for further research as well. Since the problem definition speaks about subsequences, now is the time to clarify what a subsequence is.

Subsequence

A subsequence of {a} is a sequence {b} defined by $b_k = a_{n_k}$, where $n_1 < n_2 < \dots$ is an increasing sequence of indices. (Weisstein, 2000)

For example if {a} = {A, B, C, D, E, F} and {b} = {A, B, D} then {b} is a subsequence of {a} (n1 = 1, n2 = 2, n3 = 4). If we examine this definition, we can see that we need to find a pair for each element of the subsequence ({b}) in the other sequence ({a}). We must keep the order of the elements, but we may ignore an arbitrary number of elements of the other sequence ({a}) while looking for the matching pairs.

To handle some problems we may need a more permissive definition of subsequences. The previous definition requires a matching pair for each and every element of the subsequence ({b}). The more permissive definition doesn't require that. (The author of this document didn't find a widely known and accepted term for this type of subsequence, but we may call it a partial subsequence.) For example: {a} = {A, B, C, D, E, F} and {b} = {A, Q, D}. In other words the first definition allows only match and ignore, while the second definition allows match, ignore and mismatch as well.
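The two definitions above can be illustrated with a short Java sketch. It is only an illustration under stated assumptions: the class and method names are hypothetical, and the partial variant assumes that a mismatched element of {b} is simply skipped and that the number of allowed mismatches is given as a parameter (the text above does not fix such a bound).

```java
// Illustrative sketch; names are hypothetical and not part of the thesis code.
final class SubsequenceSketch {

    // Strict definition: every element of b must be matched, in order, in a.
    static boolean isSubsequence(String a, String b) {
        int j = 0; // index of the next element of b that still needs a pair
        for (int i = 0; i < a.length() && j < b.length(); i++) {
            if (a.charAt(i) == b.charAt(j)) {
                j++; // match; otherwise the element of a is ignored
            }
        }
        return j == b.length();
    }

    // Length of the longest common subsequence, classic dynamic programming.
    static int lcsLength(String a, String b) {
        int[][] t = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                t[i][j] = (a.charAt(i - 1) == b.charAt(j - 1))
                        ? t[i - 1][j - 1] + 1
                        : Math.max(t[i - 1][j], t[i][j - 1]);
            }
        }
        return t[a.length()][b.length()];
    }

    // Permissive ("partial") definition, ASSUMING at most maxMismatches
    // elements of b may stay without a pair in a.
    static boolean isPartialSubsequence(String a, String b, int maxMismatches) {
        return b.length() - lcsLength(a, b) <= maxMismatches;
    }

    public static void main(String[] args) {
        System.out.println(isSubsequence("ABCDEF", "ABD"));
        System.out.println(isSubsequence("ABCDEF", "AQD"));
        System.out.println(isPartialSubsequence("ABCDEF", "AQD", 1));
    }
}
```

With the example sequences from the text, {A, B, D} passes the strict check, while {A, Q, D} passes only the permissive one with one allowed mismatch.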

Now that we have clarified the problem, the question is how we can solve it efficiently (in a timely manner and using a reasonable amount of processing power).

9.1 Existing solutions

There are numerous algorithms available that handle the subsequence search problem. This part of the document gives an overview of these solutions. Since there are many solutions (some better known than others) we don't try to provide a complete list. The goal is rather to give the reader an idea of how this problem can be handled from different points of view.

9.1.1 LCS

The Longest Common Subsequence (LCS) algorithm was already explained in 6.5 LCS on page 30, so we won't go into details again. The part we are interested in is the running time, more precisely how many operations we need to execute to calculate the LCS. We also did this calculation earlier: if sequence X has M and sequence Y has N elements, then the calculation needs N x M operations and the traceback max(N,M) operations. However, now we have tools available for parallel processing, so we can parallelize the execution. Depending on how much of the matrix is already filled in, there are calculations that are independent of each other and so can be executed in parallel. This is demonstrated in the figure below. The numbers covered by the same blue line can be calculated in parallel. With this improvement, and assuming enough processors, the N x M sequential operations can be reduced to N + M - 1 parallel steps (one per blue line), which is of the same order as max(N,M). (In addition we need max(N,M) operations for the traceback, which can't be executed in parallel.)

Figure 66 - Parallelization of the LCS problem (figure created by reusing an image from Eddy, 2004)
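The parallelization shown in the figure can be sketched in Java as follows. This is a minimal sketch and not the implementation used elsewhere in this document: it computes only the LCS length (no traceback), it uses Java parallel streams instead of Mapreduce, and the class and method names are hypothetical. Each pass of the outer loop corresponds to one blue line (anti-diagonal) of the figure. All cells on that line depend only on the two previous lines, so they can be filled in parallel.

```java
import java.util.stream.IntStream;

// Illustrative sketch; names are hypothetical and not part of the thesis code.
final class LcsWavefrontSketch {

    static int lcsLength(String x, String y) {
        final int n = x.length();
        final int m = y.length();
        final int[][] t = new int[n + 1][m + 1]; // row 0 and column 0 stay 0

        // d = i + j identifies one anti-diagonal; there are n + m - 1 of them.
        for (int d = 2; d <= n + m; d++) {
            final int diag = d;
            int lo = Math.max(1, d - m); // smallest valid row on this diagonal
            int hi = Math.min(n, d - 1); // largest valid row on this diagonal
            // Cells of one diagonal are independent, so fill them in parallel.
            IntStream.rangeClosed(lo, hi).parallel().forEach(i -> {
                int j = diag - i;
                t[i][j] = (x.charAt(i - 1) == y.charAt(j - 1))
                        ? t[i - 1][j - 1] + 1
                        : Math.max(t[i - 1][j], t[i][j - 1]);
            });
        }
        return t[n][m];
    }

    public static void main(String[] args) {
        System.out.println(lcsLength("ABCBDAB", "BDCABA"));
    }
}
```

For sequences as short as the ones in the usage example the thread coordination costs more than it saves. The point of the sketch is only the dependency structure that makes the parallel schedule valid.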


9.1.2 KMP (Knuth-Morris-Pratt algorithm)

The KMP algorithm is not specifically bound to genome research problems. It is a pure subsequence search algorithm. As such it doesn't consider mismatched or ignored characters in the comparison. The algorithm itself has nothing to do with parallel processing either (however, it is possible to combine it with other techniques to make the processing parallel). In spite of this it is worth learning more about this technique, since we can gain useful insight into the optimization of the subsequence search by studying it. There are many explanations of the KMP algorithm available online, one of which is Professor David Eppstein's lecture notes from the University of California (Eppstein, 1996). This summary is based on Professor Eppstein's notes.

Problem definition: There are two strings given: text and pattern. We are looking for all the occurrences of pattern in text.

We will use Java code for demonstration purposes in this part of the document. According to the problem definition we have the following variables in the sample code:
    String text = "banananobano";
    String pattern = "nano";
    int text_l = text.length();
    int pattern_l = pattern.length();

The brute force solution iterates through both strings with two nested for loops like this:
    for (int n = 0; n < text_l; n++) {
        int q;
        // advance q while the pattern keeps matching the text at position n
        for (q = 0; (q < pattern_l) && (n + q < text_l)
                && (text.charAt(n + q) == pattern.charAt(q)); q++) {}
        if (q == pattern_l) { System.out.println("Match at " + n); }
    }

This computation is very inefficient. Whenever we have a partial match of the pattern (the first couple of characters of the pattern match the text) there's no need to continue the computation from the very next character, since we have already examined as many characters as the length of our partial match. For example in the third iteration of the outer loop the program will find a partial match:

    banananobano
      nano

The text's third, fourth and fifth characters match the pattern's first 3 characters, but the text's sixth character does not match the pattern's fourth one. However, based on this result we don't need to test the pattern beginning with the fourth character of the text, since we already know it can't be a match. We can simply jump to the fifth character, since we know it is an n character, which is the first character of the pattern, and consequently it is a potential match that needs to be tested. In other words we can simply ignore those characters which we have tested already and which are not a prefix of the pattern. Our sample code will look like this:
    for (int n = 0; n < text_l;) {
        int q;
        for (q = 0; (q < pattern_l) && (n + q < text_l)
                && (text.charAt(n + q) == pattern.charAt(q)); q++) {}
        if (q == pattern_l) { System.out.println("Match at " + n); }
        // skip the examined characters, except a suffix that may start a match
        n += Math.max(1, q - overlap(q));
    }

To make this code work we need to implement the overlap method. This method answers the following questions:
1. There's a given prefix of the pattern (of length q in this case). Is any proper suffix of this prefix also a prefix of the pattern?
2. If the answer to the previous question is yes, how long is the longest such suffix?

What does this mean for the example pattern nano? The possible values of q are: 0 (first character didn't match), 1 (second character didn't match), 2 (third character didn't match), 3 (fourth character didn't match), 4 (the pattern was found in the text). So the value of q after the inner for loop shows the number of matching characters. For example q=3 in the third iteration of the outer loop

    banananobano
      nano

since three matching characters were found. The overlap method will examine the nan prefix. The method finds that the nan prefix has a one character long suffix, n, which is also a prefix of the original pattern. So the algorithm shouldn't ignore all the examined characters, since there is another potential match among them. For the purpose of further demonstration here are the possible inputs and outputs of the overlap function in case of the nano pattern:

    0 -> 0 (There were no matching characters. Overlap can't be calculated.)
    1 -> 0 (There was only one matching character. We can't use the overlap concept.)
    2 -> 0 (Prefix: na - there's no suffix that would be a prefix of the whole pattern)
    3 -> 1 (Prefix: nan - the one character long suffix is also a prefix of the pattern)
    4 -> 0 (This is the whole pattern: nano - there's no suffix that would be a prefix of the whole pattern)

Here's a possible implementation of the overlap function:


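    private int overlap(int x) {
        if ((x == 0) || (x == 1)) { return 0; }
        int value = 0; // overlap length of the prefix examined so far
        for (int n = 1; n < x; n++) {
            // on a mismatch fall back to the next shorter overlap, not to 0
            // (resetting to 0 misses overlaps for patterns such as ABAA)
            while ((value > 0) && (pattern.charAt(n) != pattern.charAt(value))) {
                value = overlap(value);
            }
            if (pattern.charAt(n) == pattern.charAt(value)) { value++; }
        }
        return value;
    }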

We can notice that in this case our overlap function may calculate the same thing multiple times. Since its output is a function of the pattern only, it makes sense to cache its results, because
1. We don't want to calculate the same thing twice.
2. To calculate the overlap for the prefix of length n we first need to calculate the overlap for the prefix of length (n-1). So we can reuse our previous results in later calculations.

To accomplish this, the only thing we need to do is store the results in an array and keep track of how far we have populated it. We will need the following data structure:
    private int[] overlap = new int[pattern.length()];
    private int overlapCalculated = -1; // highest array index filled so far

The new version of the overlap method will look like this:
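    private int overlap(int x) {
        if (x == 0) { return 0; }
        if (x - 1 > overlapCalculated) {
            for (int n = overlapCalculated + 1; n < x; n++) {
                if (n == 0) {
                    overlap[n] = 0;
                } else {
                    int value = overlap[n - 1];
                    // on a mismatch fall back to the next shorter overlap
                    while ((value > 0) && (pattern.charAt(n) != pattern.charAt(value))) {
                        value = overlap[value - 1];
                    }
                    if (pattern.charAt(n) == pattern.charAt(value)) { value++; }
                    overlap[n] = value;
                }
            }
            // the loop filled indices up to x-1 (not x)
            overlapCalculated = x - 1;
        }
        return overlap[x - 1];
    }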

With this solution we don't do any unnecessary iteration in the outer loop. However, we can notice that we still have some wasted iterations in the inner loop. Let's say we found a 10 character long partial match of a 20 character long pattern. The overlap method calculated that the 10 character long partial match has a suffix of 5 characters which is also a prefix of the pattern. This pattern may look like this:

    ABCDEABCDEABCDEABCDE

The overlap found by the overlap method is the second ABCDE group (characters 6 to 10). According to our algorithm we can skip 4 iterations in the outer loop and start to compare the pattern to the text from the beginning of the overlap. However, we have already completed the comparison for the entire overlap part, so there's no need to do it again. It is more efficient to start the comparison after the overlap, since that is the first character which is in question. The improved algorithm will look like this:
    int o = 0; // number of leading pattern characters already known to match
    for (int n = 0; n < text_l;) {
        int q;
        for (q = o; (q < pattern_l) && (n + q < text_l)
                && (text.charAt(n + q) == pattern.charAt(q)); q++) {}
        if (q == pattern_l) { System.out.println("Match at " + n); }
        o = overlap(q);
        n += Math.max(1, q - o);
    }
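The fragments of this section can be assembled into one self-contained sketch. This is only an illustration and not the source code from the Appendix: the class and method names are hypothetical, the overlap table is computed up front instead of lazily, and the matches are collected in a list instead of being printed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; names are hypothetical and not the Appendix code.
final class KmpSketch {

    // overlap[i] = length of the longest proper suffix of pattern[0..i]
    // that is also a prefix of the pattern.
    static int[] overlapTable(String pattern) {
        int[] overlap = new int[pattern.length()];
        for (int n = 1; n < pattern.length(); n++) {
            int k = overlap[n - 1];
            // on a mismatch fall back to the next shorter overlap
            while (k > 0 && pattern.charAt(n) != pattern.charAt(k)) {
                k = overlap[k - 1];
            }
            if (pattern.charAt(n) == pattern.charAt(k)) {
                k++;
            }
            overlap[n] = k;
        }
        return overlap;
    }

    // Returns the 0-based start positions of all occurrences of pattern.
    static List<Integer> search(String text, String pattern) {
        List<Integer> matches = new ArrayList<>();
        if (pattern.isEmpty()) {
            return matches; // an empty pattern is not searched in this sketch
        }
        int[] overlap = overlapTable(pattern);
        int o = 0; // leading pattern characters already known to match
        for (int n = 0; n < text.length();) {
            int q = o;
            while (q < pattern.length() && n + q < text.length()
                    && text.charAt(n + q) == pattern.charAt(q)) {
                q++;
            }
            if (q == pattern.length()) {
                matches.add(n);
            }
            o = (q == 0) ? 0 : overlap[q - 1];
            n += Math.max(1, q - o);
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(search("banananobano", "nano"));
    }
}
```

For the example strings of this section the only occurrence of nano in banananobano starts at position 4 (counting from 0).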

The algorithm we arrived at is exactly the KMP algorithm. Looking at the code, it seems clear that the improved algorithm needs less computation and consequently requires less time. However, what does this really mean? Let the text be N characters long and the pattern M characters long. If we examine the brute force algorithm, we can see that in a worst case scenario it may require NxM comparisons. (For example when both the text and the pattern contain nothing but A characters.) If we now examine the KMP algorithm, we can see that the algorithm compares each character of the text only once. (More precisely it compares the non-matching characters at the end of a partial match of the pattern twice, but this is still less than 2xN operations.) We have one more potentially computation intensive part of this algorithm, namely the overlap calculation. However, thanks to the caching mechanism it doesn't require more than M comparisons. (So the whole calculation will require less than 2xN + M operations.) More formally this means:

    brute force solution: running time O(NxM), quadratic time
    KMP: running time O(N+M), linear time

So the improvement is very significant. (See the complete source code of the example in Appendix 12.1 Sample Java source code of the KMP algorithm on page 109.)

9.1.3 RMAP

The RMAP algorithm was specifically developed to help genome research scientists map patterns to (human or other) genomes. RMAP is not only an algorithm but a working software library as well. It has a home page (Andrew D Smith, 2009), and it was also published in numerous articles (A. Smith, Xuan, & Zhang, 2008; Andrew D. Smith et al., 2009). According to A. D. Smith et al (2009) the RMAP algorithm was designed to be capable of (1) mapping reads with length exceeding 50 bases, (2) allowing the number of mismatches to be controlled (not being restricted to a small fixed number) (3) completing mapping tasks under reasonable time constraints on widely available computing hardware.

The main idea of how RMAP handles mismatches is the following. If we are looking for matches (maps in the RMAP terminology) of a pattern (read in the RMAP terminology) with a maximum of k mismatches, then we can split the pattern into k+1 pieces, called seeds. If we now look for exact matches of the seeds in the whole text (the genome in this case), then we know that the matches of the pattern with a maximum of k mismatching characters must surround the positions identified by the seeds. (Since we have at most k mismatches and k+1 seeds, at least one seed must be an exact match.)

A further improvement the algorithm applies is how it handles the seed list. Since there may be many patterns and even more seeds, there may be many duplicates among the seeds. Of course it would make no sense to search for them separately. Instead RMAP builds a hash table of the seeds and indexes it by the patterns and the position of the seed in them. The step in which RMAP finds the matches of the patterns based on the seed positions takes advantage of bitwise operations. A series of operations produces a bit vector that indicates the locations of mismatches. According to A. D. Smith et al (2009) RMAP is sufficiently fast that several million reads can be mapped to a mammalian genome in one day on a computer with a single processor. Another benefit of the RMAP algorithm is that it lends itself to parallel processing. We will demonstrate such utilization (though not with RMAP) in the next sections of this document.

Another important point about RMAP is that the CloudBurst application is built using the RMAP libraries. At the time of writing this document CloudBurst is one of the sample applications provided by Amazon to test the Elastic Mapreduce service. So it is a very good starting point (or a next point after reading this document) of study to learn more about the topics of parallel processing, Mapreduce, subsequence search and genome research. CloudBurst is an open-source project and is available on the SourceForge website (Schatz, 2009).

9.2 Creating a solution

As we can see from the previous chapters, there are many research results, algorithms and ready to use (even free and open-source) products to accomplish various subsequence search goals in the field of genome research. However, this document will also provide a solution and an approach to building it. The emphasis is on the approach. The solution being built here is intended to be neither the solution with the most features, nor a solution with a brand new revolutionary algorithm. Instead its purpose is to give the reader an idea of
1. how to solve a problem using the Mapreduce paradigm,
2. how to solve subsequence search problems for human genome research purposes.

9.2.1 Requirements for a working prototype

To get closer to our solution let's define the requirements we want to fulfill with it. Let's call the solution a working prototype, referring to the cycles of the agile software development method: the solution will be a functional, usable piece of software, but by detailing and unfolding the requirements it is possible (and strongly encouraged) to build new, more feature rich versions of it. The genome research tool to be built needs to
- Have a practical use in solving current genome research problems (similar to the HuGE tool, see the case study of the HuGE tool in chapter 6, on page 24).
- Be easy to scale up (handling more users, more genomes, more patterns).
- Be simple enough to be completed within the frame of writing this thesis document.

Based on the requirements above the author of this document made the following decisions to fulfill all of the needs:
- A pseudo distributed Hadoop installation on a single PC will serve as the development environment for this solution.
- Amazon Elastic Mapreduce will serve as the production environment for this solution.
- The original plans of this thesis included various caching mechanisms based on column oriented databases. To keep the timeline and build a working solution, this item has been removed from the list of requirements.
- One of the solution's main purposes is to demonstrate a scalable, parallel way of processing. From this point of view handling partial matches (mismatched or ignored characters) is not that relevant. So these features have been moved from the list of requirements to the list of potential future improvements.

9.2.2 Subsequence search algorithm

Now is the time to build an algorithm that fulfills the requirements. The RMAP algorithm gives good ideas on how to decompose the subsequence search problem into smaller pieces. This also serves as an opportunity for parallelization. It seems to be a good idea to split the genome into numerous smaller pieces and perform the subsequence search in those smaller pieces only. One obvious problem is what should happen with matches crossing the splitting borders. We can easily work around this problem by building some redundancy into the pieces. If we are searching for a pattern of length k, let's add the last (k-1) characters of every piece to the beginning of the next one. With this method we ensure that no match is missed and no match is counted twice either. Once we have completed this splitting we can run Mapreduce on the pieces with a very simple subsequence search logic in it (since the individual pieces can be very small, the exact subsequence search method is less significant). After the informal definition above let's define the plan in a more formal manner.

Interface definition

The application being built is a Mapreduce application, so its input is the input of the map function and its output is the output of the reduce function:
INPUT: <pattern to search for>;<genome fragment offset>;<genome fragment>
OUTPUT: <pattern to search for>;<its position in the genome>

The Mapreduce functions

The map function: It simply looks for all the matches of the pattern in the fragment (linear substring search) and emits the pattern-position pairs.
The reduce function: It simply returns its input.

Initialization

By initialization we mean the creation of the input files to run the Mapreduce on, which means the splitting of the genome. The algorithm for the initialization:
Input: genome (G); pattern to find (P); size of the pieces to be split into (x); number of pieces per input file (n)

Output: input files for Mapreduce

The algorithm for splitting: The splitter iterates through each character in G and fills the characters into a temporary buffer (let's call it buf). Let's call the position of the cursor in the genome pos. Whenever (pos mod x) == 0 we have reached the end of a piece, so buf can be flushed into the current output file. Before flushing the buffer its (k-1) character long suffix (where k is the length of P) needs to be saved and copied into the new buffer. Whenever (pos mod (x*n)) == 0 we have reached the end of a file, so this input file needs to be closed and a new one needs to be opened.

Demonstration of the solution through an example (x = 7)

G: ABCDEFGHIJABCDEFGHIJ
P: EFG

Fragments with positions:
ABCDEFG 0
HIJABCD 7
EFGHIJ 14

Fragments with overlaps (this is the input for Mapreduce):
ABCDEFG 0
FGHIJABCD 5
CDEFGHIJ 12

Expected result (the output from Mapreduce):
EFG 4
EFG 14
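The demonstration above can be reproduced with a short Java sketch. It is only an illustration under stated assumptions: the thesis splitter was written in Perl and the search runs inside Mapreduce, while this sketch does both steps in one plain Java class, keeps everything in memory and leaves out the grouping of pieces into input files (parameter n). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; names are hypothetical and not the Appendix code.
final class OverlapSplitSketch {

    // One input record: <genome fragment offset>;<genome fragment>
    static final class Fragment {
        final int offset;
        final String text;

        Fragment(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    // Cuts the genome into pieces of pieceSize characters and prepends the
    // last (patternLength - 1) characters of the preceding text to each piece.
    static List<Fragment> split(String genome, int pieceSize, int patternLength) {
        List<Fragment> fragments = new ArrayList<>();
        int overlap = patternLength - 1;
        for (int start = 0; start < genome.length(); start += pieceSize) {
            int from = Math.max(0, start - overlap);
            int to = Math.min(genome.length(), start + pieceSize);
            fragments.add(new Fragment(from, genome.substring(from, to)));
        }
        return fragments;
    }

    // Stand-in for the map step: a linear substring search in every fragment,
    // reporting positions relative to the whole genome.
    static List<Integer> search(List<Fragment> fragments, String pattern) {
        List<Integer> positions = new ArrayList<>();
        for (Fragment f : fragments) {
            int at = f.text.indexOf(pattern);
            while (at >= 0) {
                positions.add(f.offset + at);
                at = f.text.indexOf(pattern, at + 1);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String pattern = "EFG";
        List<Fragment> fragments = split("ABCDEFGHIJABCDEFGHIJ", 7, pattern.length());
        for (Fragment f : fragments) {
            System.out.println(f.text + " " + f.offset);
        }
        System.out.println(search(fragments, pattern));
    }
}
```

Because every fragment starts (k-1) characters before its piece, a match that crosses a border lies completely inside exactly one fragment, which is why the positions 4 and 14 are each reported once.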

This solution fulfills all the requirements we specified above. We kept the algorithm extremely simple, so it is easy to implement and understand. On the other hand it is very easy to scale. By adding more Amazon EC2 instances to the job flow in which the Mapreduce is running and/or by cutting the genome into more pieces, the solution can be scaled almost without limit.

9.3 Implementation

The source code of the algorithms above is available in the Appendix.
- The Mapreduce was implemented in Java.
- The splitter was implemented in Perl.

9.4 Test and measurement of the solution There were two tests conducted to test the solution. The human genome data which was used as an input to test the solution was downloaded from the University of California Santa Cruzs web site, from the UCSC Genome Bioinformatics page (UCSC, 2009). The first test was conducted using only the Chromosome 1 (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz ), while the second one was executed on the whole human genome. In both of the tests we are looking for a, basically arbitrary sequence. Our choice was the GGGGCGGGG sequence, which is a mutation of the SP1 protein, however our goal is only to demonstrate a real subsequence search on the genome, so we could use a different sequence as well. 9.4.1 Test1 searching Chromosome 1 These are the steps of conducting the first test 1. The input data was downloaded to and decompressed on local PC. 2. Using the splitter tool ( 2.2 Genome splitter with overlap - implementation in Perl on 1 page 110) sliced sequences were created to serve as an input for the Mapreduce. (The results of this step were 100 files file000..file099- approx.. 2.5 MB each.) 3. Data structure was created on Amazon S3. Input data and Mapreduce JAR were uploaded to the S3 bucket. The following parameters were used: a. Name of the bucket: edu.bridgeport.tragoncs b. Location of the JAR: genome_search/genomesearch.jar c. Location of the input: genome_search/input d. Location of the output: genome_search/20110930_test1/output. (Note: only the folder genome_search/20110930_test1 needs to be created. The folder output will be created by Mapredcue.) e. Location of Mapreduce logs genome_search/20110930_test1/log. (Note: only the folder genome_search/20110930_test1 needs to be created. The folder log will be created by Mapredcue.) 4. Elastic Mapreduce Job Flow was created with the following parameters: Page 100

   a. Job Flow name: Genome Search Chr 1
   b. Type: Custom JAR
   c. JAR location: edu.bridgeport.tragoncs/genome_search/genomesearch.jar
   d. GenomeSearch s3n://edu.bridgeport.tragoncs/genome_search/input s3n://edu.bridgeport.tragoncs/genome_search/20110930_test1/output
   e. Master instance group instance type: Small
   f. Core instance group instance type: Small; instance count: 10
   g. Task instance group instance count: 0
   h. Amazon EC2 keypair: hadooptest2
   i. Enable debugging: Yes; Amazon S3 Log Path: s3n://edu.bridgeport.tragoncs/genome_search/20110930_test1/log
   j. Enable Hadoop Debugging: No
   k. Keep Alive: Yes
   l. No Bootstrap Actions

Results:

According to the log file produced the execution of the subsequence search took 2 minutes and 13.807 seconds.
2011-09-30T23:02:05.204Z INFO Fetching jar file. 2011-09-30T23:04:19.011Z INFO Step succeeded
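The execution time quoted above is simply the difference between the two log timestamps. The following is a minimal sketch of that arithmetic, not part of the thesis code; the class name StepDuration and the method millisBetween are hypothetical names chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not part of the thesis code): computes the elapsed
// time between two ISO-8601 UTC timestamps taken from the job flow log.
class StepDuration {

    // Returns the number of milliseconds between the two timestamps.
    static long millisBetween(String start, String end) {
        return Duration.between(Instant.parse(start), Instant.parse(end)).toMillis();
    }

    public static void main(String[] args) {
        long ms = millisBetween("2011-09-30T23:02:05.204Z", "2011-09-30T23:04:19.011Z");
        long minutes = ms / 60000;
        double seconds = (ms % 60000) / 1000.0;
        System.out.println(minutes + " min " + seconds + " sec");
    }
}
```

For the two log lines above the difference is 133807 milliseconds, which corresponds to the 2 minutes and 13.807 seconds reported.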

This figure does not include the initialization of the environment (creation of the EC2 instances and booting up the virtual machines), which can take another couple of minutes (usually not more than 2-3 minutes). The execution produced 17 output files (part-00000 to part-00016), only one of which was not empty (in our case part-00007). The number of lines in this file gives the number of matches of the subsequence in Chromosome 1, which is 1346. Each line of the file gives the position of one match, like this:

    GGGGCGGGG 112532601
    GGGGCGGGG 112939019
    GGGGCGGGG 113050887
    GGGGCGGGG 113246219
    GGGGCGGGG 113345203
    GGGGCGGGG 113616601
    GGGGCGGGG 113915027

9.4.2 Test 2 - searching the whole human genome
To search the whole human genome we have to make some changes in the Mapreduce application and also in the splitter tool. In this case it is not enough to simply maintain offsets within the file (the chromosome), since we now have multiple files (multiple chromosomes). The format of the input file therefore changes as follows:

Old format:
<pattern to search for>;<genome fragment offset>;<genome fragment>

New format:
<pattern to search for>;<file name>;<genome fragment offset>;<genome fragment>

The splitter tool has to be changed to handle multiple files and the new file format. You can find the source code of the changed splitter in the Appendix. (See 12.4 Improved Genome splitter handling multiple chromosomes - implementation in Perl on page 114.) The Mapreduce application also has to handle the new format. This requires only a minor change in the map function. We parse the file name as well, and instead of returning (pattern, offset) pairs we return ((pattern and file name), offset) pairs. Here is the new map function. Compared with the version in 12.3, the changes are the parsing of fileName, the upper-casing of the fragment before searching, and the composite key built from the pattern and the file name.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line,";");
        String pattern=itr.nextToken();
        String fileName=itr.nextToken();
        String fragment_offset_str=itr.nextToken();
        long fragment_offset=Long.parseLong(fragment_offset_str);
        String fragment=itr.nextToken();
        ArrayList<Integer> positions = new ArrayList<Integer>();
        int offset_in_fragment=0;
        int maxpos=fragment.length()-1;
        while ((offset_in_fragment!=-1) && (offset_in_fragment<maxpos))
        {
            offset_in_fragment=fragment.toUpperCase().indexOf(pattern, offset_in_fragment);
            if (offset_in_fragment!=-1)
            {
                positions.add(offset_in_fragment);
                offset_in_fragment++;
            }
        }
        pattern_toprint.set(pattern+" "+fileName);
        for (Integer pos : positions)
        {
            pos_toprint.set(pos+fragment_offset);
            output.collect(pattern_toprint, pos_toprint);
        }
    }
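The logic of the map function can be tried out without a Hadoop cluster. The following is a minimal sketch under stated assumptions: the class FragmentSearch and its method search are hypothetical names, the Hadoop types are replaced by plain strings, and the search loop is a simplified equivalent rather than a copy of the loop above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of the map step (no Hadoop dependency).
class FragmentSearch {

    // Parses one input line in the new format
    //   <pattern>;<file name>;<fragment offset>;<fragment>
    // and returns one "<pattern> <file name> <position>" string per match.
    static List<String> search(String line) {
        String[] fields = line.split(";", 4);
        String pattern = fields[0];
        String fileName = fields[1];
        long fragmentOffset = Long.parseLong(fields[2]);
        // Upper-case the fragment so lower-case bases also match.
        String fragment = fields[3].toUpperCase();

        List<String> hits = new ArrayList<>();
        int pos = fragment.indexOf(pattern);
        while (pos != -1) {
            // Position in the chromosome = fragment offset + position in the fragment.
            hits.add(pattern + " " + fileName + " " + (fragmentOffset + pos));
            // Continue one base further so overlapping matches are found too.
            pos = fragment.indexOf(pattern, pos + 1);
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String hit : search("GGGGCGGGG;chr1.fa;1000;acgtGGGGCGGGGcgggg")) {
            System.out.println(hit);
        }
    }
}
```

In this made-up input line the fragment contains two overlapping occurrences of the pattern, at positions 4 and 9 of the fragment, so the sketch reports positions 1004 and 1009 for chr1.fa.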

After these changes the second test can be conducted.
1. The input data was downloaded to and decompressed on a local PC. (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz)
2. Using the new version of the splitter tool, sliced sequences were created. (The result of this step was 1329 files, file000..file1328, approx. 2.5 MB each.)
3. The data structure was created on Amazon S3. The input data and the Mapreduce JAR (the new version) were uploaded to the S3 bucket. The following parameters were used:
   a. Name of the bucket: edu.bridgeport.tragoncs
   b. Location of the JAR: genome_search/genomesearch2.jar
   c. Location of the input: genome_search/input2
   d. Location of the output: genome_search/20111112_test2/output. (Note: only the folder genome_search/20111112_test2 needs to be created. The folder output will be created by Mapreduce.)
   e. Location of Mapreduce logs: genome_search/20111112_test2/log. (Note: only the folder genome_search/20111112_test2 needs to be created. The folder log will be created by Mapreduce.)
4. An Elastic Mapreduce Job Flow was created with the following parameters:
   a. Job Flow name: Genome Search whole genome
   b. Type: Custom JAR
   c. JAR location: edu.bridgeport.tragoncs/genome_search/genomesearch2.jar
   d. GenomeSearch s3n://edu.bridgeport.tragoncs/genome_search/input s3n://edu.bridgeport.tragoncs/genome_search/20111112_test2/output
   e. Master instance group instance type: Small
   f. Core instance group instance type: Small; instance count: 10
   g. Task instance group instance count: 0
   h. Amazon EC2 keypair: hadooptest2
   i. Enable debugging: Yes; Amazon S3 Log Path: s3n://edu.bridgeport.tragoncs/genome_search/20111112_test2/log
   j. Enable Hadoop Debugging: No
   k. Keep Alive: Yes
   l. No Bootstrap Actions

Results:
According to the log file produced, the execution of the subsequence search took 8 minutes and 42.100 seconds.

2011-11-13T00:24:04.730Z INFO Fetching jar file. 2011-11-13T00:32:46.830Z INFO Step succeeded

This figure does not include the initialization of the environment (creation of the EC2 instances and booting up the virtual machines), which can take another couple of minutes (usually not more than 2-3 minutes). The following screenshot shows the results summary as displayed on the Hadoop Map/Reduce administrative interface.

Figure 67 - Genome research test - results in Job Tracker

The execution produced 35 output files (part-00000 to part-00034). To understand these results we need some further processing. We should count the occurrences of each file name (chromosome) in each output file and sum them. We can do this with a simple Perl program:

    my %count=();
    while (<STDIN>)
    {
        # the second whitespace-separated field is the file name (chromosome)
        $_ =~ m/\S+\s+(\S+)\s/;
        if (not defined($count{$1})) {$count{$1}=0;}
        $count{$1}+=1;
    }
    foreach (sort keys(%count))
    {
        print "$_ $count{$_}\n";
    }

This Perl program expects its input from the standard input so we can join the content of the files using simple Linux commands:
cat ./output/* | perl counter.pl

The results look like this:


    chr1.fa   1346
    chr10.fa   649
    chr11.fa   890
    chr12.fa   670
    chr13.fa   272
    chr14.fa   585
    chr15.fa   462
    chr16.fa   788
    chr17.fa   942
    chr18.fa   261
    chr19.fa  1315
    chr2.fa    972
    chr20.fa   446
    chr21.fa   175
    chr22.fa   490
    chr3.fa    658
    chr4.fa    478
    chr5.fa    607
    chr6.fa    669
    chr7.fa    791
    chr8.fa    543
    chr9.fa    723
    chrX.fa    512
    chrY.fa     59
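The same per-chromosome count can also be produced in Java, the language of the Mapreduce application. The following is a minimal sketch, assuming each output line has the form "<pattern> <file name> <position>" separated by whitespace; the class HitCounter and the method countByFile are hypothetical names, not part of the thesis code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical Java counterpart of the Perl counter: counts matches per chromosome file.
class HitCounter {

    // Counts how often each file name (second whitespace-separated field) occurs.
    // A TreeMap keeps the keys in the same lexical order as Perl's "sort keys".
    static Map<String, Integer> countByFile(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 3) {
                continue; // skip empty or malformed lines
            }
            counts.merge(tokens[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Reads the concatenated output files from standard input.
        List<String> lines = new BufferedReader(new InputStreamReader(System.in))
                .lines().collect(Collectors.toList());
        countByFile(lines).forEach((file, n) -> System.out.println(file + " " + n));
    }
}
```

Because the keys are sorted as strings, chr10.fa comes directly after chr1.fa, which is the same ordering seen in the results listed above.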

Note: To analyze the results, the output has to be copied from the S3 bucket to a Linux machine. We can use either a desktop machine or another EC2 instance. The output files in the S3 bucket can be accessed through the AWS Management Console, or can be downloaded using wget. In the second case the files must have the right permission settings.

9.5 Benefits of the solution
There are many benefits of the solution presented here:
- The algorithm used is simple, so it is easy to understand and can be a good entry point for anyone who wants to study the field of subsequence search and human genome research. At the same time it is functional and fulfills its purpose (efficiently finding exact matches of patterns in long sequences).

- The solution is easy to scale. Splitting parameters can be easily configured to create different input (the number of files, the size of each file and the size of each fragment can all be changed). We can run the job flow using as many EC2 instances as needed.
- The solution is not bound to Amazon EC2; it is very simple to deploy onto any kind of Hadoop platform.

The solution created during the writing of this thesis document is neither a complete product nor a brand new invention. Its main purpose is to demonstrate:
- How to deal with subsequence search (human genome research) problems
- How to deal with, and keep at a minimum, the administration of a virtualized environment that is powerful and scalable enough to serve as a base for parallel computing
- How to deal with Hadoop Mapreduce, again keeping the technological involvement at the necessary minimum so that we can focus on the problem in scope
- How to deal with, modify or build subsequence search algorithms for use in human genome research
- And finally, how to put these pieces together and build applications or rapid prototypes in a timely manner in response to the current business needs of the health care industry

9.6 Possible extensions of the solution
This solution can be a good starting point for further study and research in this field. To name a few directions:
- How does the combination of different parameters affect the overall performance of the system? (These parameters can be the input structure, the length of the subsequence being searched for, the number and processing power of the virtual machines in the system, etc.)
- The solution can be extended to support searching for multiple subsequences at the same time, which could increase its usability considerably.
- The solution can be extended to tolerate errors in the comparison (partial or ambiguous matches). We can use the ideas presented in the RMAP algorithm. (9.1.3 RMAP on page 96 provides an overview of the RMAP algorithm.)
- It is possible to improve the linear search algorithm in the map function, since there was no intention to make it optimal in this solution. It can be improved by applying the KMP algorithm as the primitive operation implemented in the map function (9.1.2 KMP (Knuth-Morris-Pratt discrete algorithm) on page 92).
- We tried to keep the technical overhead at a minimum in this solution. However, it is possible to use Amazon's EC2 service for virtual machine creation but to build the Hadoop layer manually. This could give the researcher a deeper insight into the Hadoop/Mapreduce platform. If we need stronger control over virtualization, it is also possible to build our own virtualized environment (using VMWare or any other virtualization solution) so we gain finer control over the virtualization and instance management as well.

10 Conclusion about parallel processing and human genome research
As the review of the related research shows, there are multiple ways to conduct subsequence search for human genome research purposes. There are also various methods and techniques for building solid parallel computing solutions. In terms of parallel computing, one of today's most popular paradigms is Mapreduce, so the Mapreduce method was chosen for study in this thesis document as an approach to meeting parallel processing demands. A ready-to-use, publicly available implementation of a Mapreduce library is included in the Hadoop project. This project was also included in the scope of this study. There are multiple ways to build a Hadoop-Mapreduce environment:
1. You can use physical nodes and install the environment on them. This gives you the finest control over each component of the system, but it comes with high cost and a long and complex installation procedure.
2. You can build a virtualized environment and build Hadoop Mapreduce on it. Your costs will go down. Although you will spend on virtualization, you will need less equipment to reach the same processing power. Centralized virtual instance management may also reduce the installation time. In exchange you lose some control, because you assign virtual devices to tasks instead of physical ones.
3. You can use a service provider, rent virtual machines and build your Hadoop Mapreduce environment on them. This solution will drastically reduce your costs and installation complexity, since you pay only for the actual usage of the resources, and they are installed and maintained by your provider. In exchange for this convenience you give up a large amount of control. If your service provider has downtime, you cannot do anything but wait for the service to come back. Even with this solution you still need to spend significant time on installing and configuring your Hadoop-Mapreduce platform.
4. You can outsource both the virtualization and the Hadoop-Mapreduce installation to a service provider. This is a very reasonable choice compared to the previous option. If there is no special need related to the Hadoop-Mapreduce environment, you can save a lot of time by letting your virtualization service provider take care of this as well.

In this study Amazon was tested as a virtualization service provider and option 4 was chosen from the list above. The choice proved to be reasonable: the development cycle of the solution was sped up significantly, and a full-fledged working prototype was developed in days instead of weeks or months. Subsequence search for human genome research purposes is an excellent candidate to benefit from virtualization service providers, since this kind of research typically demands huge processing power on an occasional basis. (A huge computation can be followed by months of analysis of the results.)

Subsequence search as a mathematical problem has various published solutions. In this thesis document only a handful of them were examined: the Longest Common Subsequence (LCS) algorithm, the Knuth-Morris-Pratt (KMP) algorithm and the RMAP algorithm. Even this short list gives an idea of how many different ways this problem can be approached. LCS gives us a method to handle practically any kind and any number of mismatches or ignored bases while searching (or even comparing) genomes. It is possible to parallelize LCS, although it was not designed for that. KMP is a plain subsequence search algorithm for finding exact matches in a long sequence. The algorithm itself is sequential, but it can easily be used as part of a larger parallelized solution. RMAP is built for genome research purposes and it also utilizes parallel processing opportunities. This algorithm was the closest solution to the problems defined in this thesis document. The solution built in the frame of this thesis is, to some extent, a simplified version of an RMAP solution. The algorithm built in this study is capable of conducting subsequence search on very large sequences. It handles only exact matches. It is simple and extremely scalable. As our results indicate, the solution can be utilized in real genome research projects. The two tests conducted with this solution were the following.
                         Test 1                     Test 2
    Sequence             Chromosome 1               The whole human genome
    Subsequence          one, 9 bases long          one, 9 bases long
    (pattern)
    Number of EC2 nodes  11 (1 master instance      11 (1 master instance
                         and 10 core instances)     and 10 core instances)
    Type of EC2 nodes    Small                      Small
    Execution time       2 min 13.807 sec           8 min 42.100 sec

Table 4 - Genome research tests - summary

Based on the results, our conclusion is that solutions are available for conducting subsequence search procedures in a parallelized environment, and human genome researchers can make good use of them. It is also possible to build such solutions for special needs in a timely and cost-efficient manner because of the maturity of the related technologies. The solution built in the frame of this thesis is one such example.

11 Conclusion about knowledge networks in bioinformatics
The broader scope of this research was knowledge networks in bioinformatics. The main question under investigation was how we can improve the efficiency of biomedical research by increasing the level of cooperation among individual researchers. There are published theories about methods and procedures by which organizations and networks of individuals can be built or changed to provide more room for collaboration. One such method is the extension of the intellectual bandwidth of an organization. The Intellectual Bandwidth Model demonstrates how intellectual capability processes and the level of collaboration relate to each other and how they influence the organization's intellectual bandwidth, which is tightly coupled to the effectiveness (profitability, growth, etc.) of an organization. Another concept is the Collaborative Networks concept, which gives practical guidance on how to build the organizational structure to support collaboration as much as possible. A detailed analysis of the Knet concept was also provided in this document. Knet concepts contain recommendations and principles on how to build knowledge networks that make knowledge as widely available and as easy to reuse as possible. We also covered other fields related to knowledge and information management, such as ontologies and text mining. Based on this research review and the experience gained from analyzing the HuGE tool and from building a subsequence search solution, it is clear that today's technological advancements (cloud computing, for example) give many opportunities to improve collaboration among researchers. It is also clear that there are obstacles that may or may not be overcome using technology. Such issues are: lack of trust in information security, fear of losing freedom in research, lack of trust in ownership protection, fear of losing privacy and, in general, lack of trust in the collaborating party. Although technological advancements provide solutions to some of these problems, they have wider sociological and psychological roots and causes, so developing and recommending more effective solutions will most likely require further research in these related fields as well.

12 Appendix
12.1 Sample Java source code of the KMP algorithm
    public class KnuthMorrisPratt
    {
        private final String text;
        private final String pattern;
        // overlap[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of pattern[0..i]
        private final int[] overlap;

        public KnuthMorrisPratt(String text, String pattern)
        {
            this.text = text;
            this.pattern = pattern;
            overlap = new int[pattern.length()];
            // build the overlap (failure) table once, falling back to shorter
            // prefixes on a mismatch
            int k=0;
            for (int i=1;i<pattern.length();i++)
            {
                while ((k>0) && (pattern.charAt(i)!=pattern.charAt(k))) {k=overlap[k-1];}
                if (pattern.charAt(i)==pattern.charAt(k)) {k++;}
                overlap[i]=k;
            }
        }

        public void calculate()
        {
            int text_l=text.length();
            int pattern_l=pattern.length();
            int o=0;
            for (int n=0;n<text_l;)
            {
                int q;
                // the first o characters are already known to match
                for (q=o; (q<pattern_l)&&(n+q<text_l)&&
                          (text.charAt(n+q)==pattern.charAt(q)); q++){}
                if (q==pattern_l) {System.out.println("Match at "+n);}
                o=overlap(q);
                n += Math.max(1, q-o);
            }
        }

        // overlap of the first x matched characters of the pattern
        private int overlap(int x)
        {
            if (x==0) {return 0;}
            return overlap[x-1];
        }

        public static void main(String[] args)
        {
            new KnuthMorrisPratt("banananobano","nano").calculate();
        }
    }

12.2 Genome splitter with overlap - implementation in Perl


    use strict;

    my $filename="./chr1.fa";
    my $outdir="./splitted";
    my $filesize = -s "$filename";
    print "Splitting chromosome file: $filename\n";

    my $FRAGMENT_SIZE=10000;
    my $FRAGMENT_IN_FILE=250;
    my $PATTERN_SIZE=9;
    my $PATTERN="GGGGCGGGG";

    open(F,"<$filename");
    my $cnt_line=0;
    my $cnt_char=1;
    my $cnt_file=0;
    my $current_file=0;
    my $current_fragment=0;
    createFile($current_file);
    my $line_to_print=$PATTERN.";".($cnt_char-1).";";
    while (<F>)
    {
        $cnt_line++;
        if ($cnt_line eq 1) {next;}    # skip the FASTA header line
        my $row=$_;
        chomp($row);
        for (my $n=0;$n<length($row);$n++)
        {
            my $char = substr($row, $n, 1);
            my $file=int($cnt_char / ($FRAGMENT_SIZE*$FRAGMENT_IN_FILE));
            my $fragment=int($cnt_char / $FRAGMENT_SIZE);
            $line_to_print.=$char;
            if ($fragment ne $current_fragment)
            {
                $current_fragment=$fragment;
                print F1 $line_to_print."\n";
                # the next fragment starts with the last PATTERN_SIZE-1 bases of this one
                my $overlap=substr($line_to_print,length($line_to_print)-$PATTERN_SIZE+1);
                $line_to_print=$PATTERN.";".($cnt_char-$PATTERN_SIZE+1).";".$overlap;
                if ($fragment % 100 eq 0)
                {
                    my $status = sprintf("%.3f", ($cnt_char*100/$filesize));
                    print "file: $file - fragment: $fragment ($status %)\n";
                }
            }
            if ($file ne $current_file)
            {
                closeFile();
                $current_file=$file;
                createFile($current_file);
            }
            $cnt_char++;
        }
    }
    print F1 $line_to_print."\n";
    closeFile();
    close(F);

    sub createFile
    {
        my ($current_file)=@_;
        print "CREATE: ".$current_file."\n";
        my $suffix=$current_file;
        while (length($suffix)<3) {$suffix="0".$suffix;}
        open(F1,">$outdir/file".$suffix);
    }

    sub closeFile
    {
        print "CLOSE: ".$current_file."\n";
        close(F1);
    }

You can find all the parameters needed at the top of the file. You can set them according to your needs.
    my $filename="./chr1.fa";
    my $outdir="./splitted";
    #...
    my $FRAGMENT_SIZE=10000;
    my $FRAGMENT_IN_FILE=250;
    my $PATTERN_SIZE=9;
    my $PATTERN="GGGGCGGGG";
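With the values above, one output file holds 10000 x 250 = 2,500,000 bases, which is consistent with the approx. 2.5 MB file size reported for the tests. The idea behind the overlap can be sketched as follows. This is a simplified illustration and not a port of the Perl splitter: the class OverlapSplitter and its methods are hypothetical names, the line format is reduced to "<offset>;<fragment>", and the whole sequence is assumed to fit in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting with overlap (not a port of the Perl tool).
class OverlapSplitter {

    // Cuts the sequence every fragmentSize bases. Every fragment after the first
    // also carries the patternSize-1 bases before the cut, so a match that
    // crosses a cut is still fully contained in exactly one fragment.
    static List<String> split(String sequence, int fragmentSize, int patternSize) {
        List<String> lines = new ArrayList<>();
        int overlap = patternSize - 1;
        for (int start = 0; start < sequence.length(); start += fragmentSize) {
            int from = Math.max(0, start - overlap);
            int to = Math.min(sequence.length(), start + fragmentSize);
            lines.add(from + ";" + sequence.substring(from, to));
        }
        return lines;
    }

    // Searches every fragment and converts the hits back to positions
    // in the original sequence.
    static List<Integer> searchFragments(List<String> lines, String pattern) {
        List<Integer> positions = new ArrayList<>();
        for (String line : lines) {
            int sep = line.indexOf(';');
            int offset = Integer.parseInt(line.substring(0, sep));
            String fragment = line.substring(sep + 1);
            for (int pos = fragment.indexOf(pattern); pos != -1;
                     pos = fragment.indexOf(pattern, pos + 1)) {
                positions.add(offset + pos);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> lines = split("AAGGGGCGGGGTTGGGGCGGGGA", 5, 9);
        System.out.println(searchFragments(lines, "GGGGCGGGG"));
    }
}
```

In this toy input the fragment size (5) is smaller than the pattern (9), so both occurrences cross a cut, yet each is reported exactly once because only the fragment containing the last base of a match contains the whole match.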

12.3 Mapreduce implementation of simple subsequence search in Java


    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class GenomeSearch extends Configured implements Tool
    {
        public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable>
        {
            private final static LongWritable pos_toprint = new LongWritable(1);
            private Text pattern_toprint = new Text();

            // input line: <pattern to search for>;<genome fragment offset>;<genome fragment>
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, LongWritable> output,
                            Reporter reporter) throws IOException
            {
                String line = value.toString();
                StringTokenizer itr = new StringTokenizer(line,";");
                String pattern=itr.nextToken();
                String fragment_offset_str=itr.nextToken();
                long fragment_offset=Long.parseLong(fragment_offset_str);
                String fragment=itr.nextToken();
                ArrayList<Integer> positions = new ArrayList<Integer>();
                int offset_in_fragment=0;
                int maxpos=fragment.length()-1;
                while ((offset_in_fragment!=-1) && (offset_in_fragment<maxpos))
                {
                    offset_in_fragment=fragment.indexOf(pattern, offset_in_fragment);
                    if (offset_in_fragment!=-1)
                    {
                        positions.add(offset_in_fragment);
                        offset_in_fragment++;
                    }
                }
                pattern_toprint.set(pattern);
                for (Integer pos : positions)
                {
                    // position in the chromosome = fragment offset + position in the fragment
                    pos_toprint.set(pos+fragment_offset);
                    output.collect(pattern_toprint, pos_toprint);
                }
            }
        }

        public static class Reduce extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable>
        {
            // passes every (pattern, position) pair through unchanged
            public void reduce(Text key, Iterator<LongWritable> values,
                               OutputCollector<Text, LongWritable> output,
                               Reporter reporter) throws IOException
            {
                while (values.hasNext())
                {
                    output.collect(key, new LongWritable(values.next().get()));
                }
            }
        }

        static int printUsage()
        {
            System.out.println("genomesearch [-m <maps>] [-r <reduces>] <input> <output>");
            ToolRunner.printGenericCommandUsage(System.out);
            return -1;
        }

        public int run(String[] args) throws Exception
        {
            JobConf conf = new JobConf(getConf(), GenomeSearch.class);
            conf.setJobName("genomesearch");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setMapperClass(MapClass.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            List<String> other_args = new ArrayList<String>();
            for(int i=0; i < args.length; ++i)
            {
                try
                {
                    if ("-m".equals(args[i]))
                    {
                        conf.setNumMapTasks(Integer.parseInt(args[++i]));
                    }
                    else if ("-r".equals(args[i]))
                    {
                        conf.setNumReduceTasks(Integer.parseInt(args[++i]));
                    }
                    else
                    {
                        other_args.add(args[i]);
                    }
                }
                catch (NumberFormatException except)
                {
                    System.out.println("ERROR: Integer expected instead of " + args[i]);
                    return printUsage();
                }
                catch (ArrayIndexOutOfBoundsException except)
                {
                    System.out.println("ERROR: Required parameter missing from " + args[i-1]);
                    return printUsage();
                }
            }
            if (other_args.size() != 2)
            {
                System.out.println("ERROR: Wrong number of parameters: " +
                                   other_args.size() + " instead of 2.");
                return printUsage();
            }
            FileInputFormat.setInputPaths(conf, other_args.get(0));
            FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));
            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception
        {
            int res = ToolRunner.run(new Configuration(), new GenomeSearch(), args);
            System.exit(res);
        }
    }

This is the implementation of the Mapreduce functions. It needs to be compiled and packed into a JAR file so that it can be executed by Hadoop Mapreduce. (See the example in 8.2.4 Compilation and build of the WordCount application on page 41.)

12.4 Improved Genome splitter handling multiple chromosomes - implementation in Perl
    use strict;

    my $file_index_offset=0;
    my $inputdir="./full_genome/";
    my $outdir="./splitted2";

    opendir(DIR, $inputdir);
    my @files= readdir(DIR);
    my @files1=();
    foreach(@files)
    {
        if (($_ eq ".") || ($_ eq "..")) {next;}
        splitFile($_);
    }
    exit;

    sub splitFile
    {
        my ($filen)=@_;
        my $filesize = -s $inputdir.$filen;
        print "Splitting chromosome file: $filen\n";

        my $FRAGMENT_SIZE=10000;
        my $FRAGMENT_IN_FILE=250;
        my $PATTERN_SIZE=9;
        my $PATTERN="GGGGCGGGG";

        open(F,"<".$inputdir.$filen);
        my $cnt_line=0;
        my $cnt_char=1;
        # output files are numbered continuously across all chromosomes
        my $current_file_index=$file_index_offset;
        my $current_fragment_index=0;
        createFile($current_file_index);
        my $line_to_print=$PATTERN.";".$filen.";".($cnt_char-1).";";
        while (<F>)
        {
            $cnt_line++;
            if ($cnt_line eq 1) {next;}    # skip the FASTA header line
            my $row=$_;
            chomp($row);
            for (my $n=0;$n<length($row);$n++)
            {
                my $char = substr($row, $n, 1);
                my $file_index=int($cnt_char / ($FRAGMENT_SIZE*$FRAGMENT_IN_FILE))+$file_index_offset;
                my $fragment_index=int($cnt_char / $FRAGMENT_SIZE);
                $line_to_print.=$char;
                if ($fragment_index ne $current_fragment_index)
                {
                    $current_fragment_index=$fragment_index;
                    print F1 $line_to_print."\n";
                    # the next fragment starts with the last PATTERN_SIZE-1 bases of this one
                    my $overlap=substr($line_to_print,length($line_to_print)-$PATTERN_SIZE+1);
                    $line_to_print=$PATTERN.";".$filen.";".($cnt_char-$PATTERN_SIZE+1).";".$overlap;
                    if ($fragment_index % 100 eq 0)
                    {
                        my $status = sprintf("%.3f", ($cnt_char*100/$filesize));
                        print "file: $file_index - fragment: $fragment_index ($status %)\n";
                    }
                }
                if ($file_index ne $current_file_index)
                {
                    closeFile($current_file_index);
                    $current_file_index=$file_index;
                    #if ($current_file_index eq 10) {exit;}
                    createFile($current_file_index);
                }
                $cnt_char++;
            }
        }
        print F1 $line_to_print."\n";
        closeFile($current_file_index);
        close(F);
        $file_index_offset=$current_file_index+1;
    }

    sub createFile
    {
        my ($current_file_index)=@_;
        print "CREATE: ".$current_file_index."\n";
        my $suffix=$current_file_index;
        while (length($suffix)<3) {$suffix="0".$suffix;}
        open(F1,">$outdir/file".$suffix);
    }

    sub closeFile
    {
        my ($current_file_index)=@_;
        print "CLOSE: ".$current_file_index."\n";
        close(F1);
    }

13 List of figures and tables Figures Figure 1 - Relationship of collaboration and technology................................................................ 8 Figure 2 - The Intellectual Bandwidth Model source: (J. F. Nunamaker et al., 2000) ................ 9 Figure 3 - Increasing the Intellectual Bandwidth source: (J. F. Nunamaker et al., 2000) ......... 10 Figure 4 - The Starburst organization structure source: Bach et al 2001................................... 11 Figure 5 - The Spider's Web organization structure - source: Bach et al 2001 ............................ 11 Figure 6 - The Collaborative Network Organization Structure - source: Bach et al 2001 ........... 12 Figure 7 - IB extension step 1 model - source: Bach et al 2003 ................................................ 13 Figure 8 - IB extension step 1 process - source: Bach et al 2003 .............................................. 13 Figure 9 - IB extension step 2 - model - source: Bach et al 2003 ................................................. 13 Figure 10 - IB extension step 2 process - source: Bach et al 2003 ............................................ 13 Figure 11 - IB extension step 3 - model - source: Bach et al 2003 ............................................... 14 Figure 12 - IB extension step 3 process - source: Bach et al 2003 ............................................ 14 Figure 13 - IB extension step 4 - model - source: Bach et al 2003 ............................................... 14 Figure 14 - IB extension step 4 - process - source: Bach et al 2003 ............................................. 14 Figure 15 - IB extension step 5 model - source: Bach et al 2003 .............................................. 15 Figure 16 - IB extension step 5 process - source: Bach et al 2003 ............................................ 15 Figure 17 - RDBMS concept ........................................................................................................ 
17 Figure 18 - Knet concept............................................................................................................... 17 Figure 19 - Detail Vew Part 1 ....................................................................................................... 18 Figure 20 - Detail Vew Part 2 ....................................................................................................... 18 Figure 21 - Building knowledge networks.................................................................................... 19 Figure 22 - The HDFS architecture source: http://hadoop.apache.org/ ..................................... 26 Figure 23 - MapReduce parallel processing source: Dean & Ghemawat, 2004 ........................ 30 Figure 24 - LCS definition source: Nakatsu et al, 1982 ............................................................ 31 Figure 25 - Solving the LCS problem by using dynamic programming - source: Eddy, 2004 .... 31 Figure 26 - The SP1 three finger structure source: Bach et al., 2009 ........................................ 32 Figure 27 - Number of actual binding sites listed for each chromosome source: Bach, Bajwa, & Erodula, 2011 ................................................................................................................................ 33 Figure 28 - Hadoop HDFS sample input files ........................................................................ 43 Figure 29 - Hadoop Job Tracker tracking the Wordcount example........................................ 44 Figure 30 - Hadoop output of Wordcount example ................................................................... 45 Page 116

Figure 31 - AWS account registration ........ 50
Figure 32 - AWS account registration 2 ........ 51
Figure 33 - The EC2 service ........ 52
Figure 34 - EC2 instance creation 1 ........ 53
Figure 35 - EC2 instance creation 2 ........ 54
Figure 36 - EC2 instance creation 3 ........ 55
Figure 37 - EC2 instance creation 4 ........ 56
Figure 38 - EC2 instance creation 5 ........ 57
Figure 39 - EC2 instance creation 6 ........ 58
Figure 40 - List EC2 instances ........ 59
Figure 41 - EC2 instance - finding the security group ........ 60
Figure 42 - Network and Security - checking ports ........ 61
Figure 43 - Network and Security - open a port ........ 62
Figure 44 - EC2 instance - finding public DNS name ........ 63
Figure 45 - EC2 instance - accessing the web server ........ 65
Figure 46 - EC2 instance - finding the zone ........ 67
Figure 47 - EBS creation ........ 68
Figure 48 - Listing EBS volumes ........ 69
Figure 49 - Attaching EBS to EC2 instance ........ 70
Figure 50 - Creating S3 bucket ........ 72
Figure 51 - Elastic Mapreduce - creating a job flow 1 ........ 76
Figure 52 - Elastic Mapreduce - creating a job flow 2 ........ 77
Figure 53 - Elastic Mapreduce - creating a job flow 3 ........ 78
Figure 54 - Elastic Mapreduce - creating a job flow 4 ........ 79
Figure 55 - Elastic Mapreduce - creating a job flow 5 ........ 80
Figure 56 - Elastic Mapreduce - listing job flows ........ 81
Figure 57 - Elastic Mapreduce - listing related EC2 instances ........ 82
Figure 58 - Elastic Mapreduce - identifying the master node ........ 83
Figure 59 - Elastic Mapreduce - Job Tracker Interface 1 ........ 84
Figure 60 - Elastic Mapreduce - Job Tracker Interface 2 ........ 84
Figure 61 - Elastic Mapreduce - the Name Node interface ........ 85
Figure 62 - Elastic Mapreduce - Wordcount example - input files ........ 86
Figure 63 - Elastic Mapreduce - Wordcount example - creating a job flow 1 ........ 87
Figure 64 - Elastic Mapreduce - Wordcount example - creating a job flow 2 ........ 88
Figure 65 - Elastic Mapreduce - Wordcount example - creating a job flow 3 ........ 89
Figure 66 - Parallelization of the LCS problem (figure created by reusing an image from Eddy, 2004) ........ 91
Figure 67 - Genome research test - results in Job Tracker ........ 104


Tables
Table 1 - Problems related to information management and retrieval in biomedical science ........ 6
Table 2 - Benefits of HDFS ........ 25
Table 3 - HBase example ........ 28
Table 4 - Genome research tests - summary ........ 108

14 References
Andrade, J., Andersen, M., Berglund, L., & Odeberg, J. (2007). Applications of Grid Computing in Genetics and Proteomics. LNCS, 4699, 791-798.
Antezana, E., Kuiper, M., & Mironov, V. (2009). Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics, 10(4), 392-407.
Apache Hadoop Project. (2008). Retrieved from http://hadoop.apache.org/
Apache HBASE. (2011). Retrieved from http://hbase.apache.org/
Bach, C., Bajwa, H., & Erodula, K. (2011). Use of Multi Threaded Asynchronous DNA Sequence Pattern Searching Tool to Identifying Zinc-Finger-Nuclease Binding Sites on the Human Genome.
Bach, C., Patra, P., Bajwa, H., Pallis, J., Sherman, W., Cotlet, M., et al. (2009). Functional Nanobiology of Zinc Finger Protein Sp1. Bridgeport: University of Bridgeport.
Bach, C., Salvatore, B., & Jing, Z. (2003). Increase of Potential Intellectual Bandwidth in a Scientific Community through Implementation of an End-User Information System. Paper presented at the Hawaii International Conference on System Sciences (HICSS03), Hawaii.
Bach, C., Zhang, J., & Belardo, S. (2001). Collaborative Networks. Paper presented at AMCIS. Retrieved from http://aisel.aisnet.org/amcis2001/133
Bajwa, H., Bach, C., Kongar, E., & Chu, Y.-L. (2009). Utilization of Cloud Computing in science-based Knet environments. Paper presented at the IBM University Days, The Third International Conference on the Virtual Computing Initiative (ICVCI3), The IBM Employee and Activity Center (EAFC), Research Triangle Park, North Carolina.
Baumgartner, W., Cohen, K. B., & Hunter, L. (2008). An open-source framework for large-scale, flexible evaluation of biomedical text mining systems. Journal of Biomedical Discovery and Collaboration, 3(1), 1.
Brusic, V., & Ranganathan, S. (2008). Critical technologies for bioinformatics. Briefings in Bioinformatics, 9(4), 261-262.
Chandrasekaran, N., & Piramanayagam, S. (2010). Application of semantic web technology to bioinformatics. International Journal of Bioinformatics Research, 2(1), 67-71.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (2008). Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst., 26(2), 1-26.
Cloudera's Distribution including Apache Hadoop (CDH). (2011). Retrieved from http://www.cloudera.com/hadoop/
Cohen, A. M., & Hersh, W. R. (2005). A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6(1), 57-71.
Das, S., Girard, L., Green, T., Weitzman, L., Lewis-Bowen, A., & Clark, T. (2009). Building biomedical web communities using a semantically aware content management system. Briefings in Bioinformatics, 10(2), 129-138.


David T, J. (1999). Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2), 195-202.
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. Paper presented at the 6th Symp. on Operating Systems Design & Implementation.
Dhanasekaran, M., Negi, S., & Sugiura, Y. (2005). Designer Zinc Finger Proteins: Tools for Creating Artificial DNA-Binding Functional Proteins. Accounts of Chemical Research, 39(1), 45-52.
EcoCyc. Retrieved 7/28, 2011, from http://ecocyc.org/
Eddy, S. R. (2004). What is dynamic programming? Nature Biotechnology, 22(7).
Eppstein, D. (1996). ICS 161: Design and Analysis of Algorithms. Lecture notes for February 27, 1996. Retrieved 11/5, 2011, from http://www.ics.uci.edu/~eppstein/161/960227.html
Fogh, R. H., Boucher, W., Ionides, J. M., Vranken, W. F., Stevens, T. J., & Laue, E. D. (2010). MEMOPS: Data modelling and automatic code generation. Journal of Integrative Bioinformatics, 7(3).
Gene Ontology. (1999, 2011). Retrieved 7/28, 2011, from http://www.geneontology.org/
Groth, D., Hartmann, S., Friemel, M., Hill, N., Müller, S., Poustka, A. J., et al. (2010). Data integration using scanners with SQL output - The Bioscanners project at sourceforge. Journal of Integrative Bioinformatics, 7(3).
Hirschman, L., Park, J. C., Tsujii, J., Wong, L., & Wu, C. H. (2002). Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12), 1553-1561.
Lange, M., Spies, K., Bargsten, J., Haberhauer, G., Klapperstück, M., Leps, M., et al. (2010). The LAILAPS Search Engine: Relevance Ranking in Life Science Databases. Journal of Integrative Bioinformatics, 7(2).
Leo, S., Santoni, F., & Zanetti, G. (2009). Biodoop: Bioinformatics on Hadoop. Paper presented at the International Conference on Parallel Processing Workshops.
Lütjohann, D. S., Shah, A. H., Christen, M. P., Richter, F., Knese, K., & Liebel, U. (2011). Sciencenet: towards a global search and share engine for all scientific knowledge. Bioinformatics, 27(12), 1734-1735.
Manning, M., Aggarwal, A., Gao, K., & Tucker-Kellogg, G. (2009). Scaling the walls of discovery: using semantic metadata for integrative problem solving. Briefings in Bioinformatics, 10(2), 164-176.
Mirel, B. (2009). Supporting cognition in systems biology analysis: findings on users' processes and design implications. Journal of Biomedical Discovery and Collaboration, 4(1), 2.
Nakatsu, N., Kambayashi, Y., & Yajima, S. (1982). A longest common subsequence algorithm suitable for similar text strings. Acta Informatica, 18(2), 171-179.
Neumann, E., & Prusak, L. (2007). Knowledge networks in the age of the Semantic Web. Briefings in Bioinformatics, 8(3), 141-149.
Nunamaker, J., Briggs, R. O., & De Vreede, G. (2001). From information technology to value creation technology. In G. W. Dickson & G. Desanctis (Eds.), Information technology and the future enterprise: New models for managers. Upper Saddle River, NJ: Prentice Hall.
Nunamaker, J. F., Briggs, R. O., & de Vreede, G.-J. (2000). Value Creation Technology: Changing the Focus to the Group. In M. Cox (Ed.), Information technology and the future enterprise: New Models for Managers. Upper Saddle River, NJ: Prentice Hall.
Quan, D. (2007). Improving life sciences information retrieval using semantic web technology. Briefings in Bioinformatics, 8(3), 172-182.
Ramirez, C. L., Foley, J. E., Wright, D. A., Muller-Lerch, F., Rahman, S. H., Cornu, T. I., et al. (2008). Unexpected failure rates for modular assembly of engineered zinc fingers. Nat Meth, 5(5), 374-375.
Regnier, M., Kreczmar, A., & Mirkowska, G. (1989). Knuth-Morris-Pratt algorithm: An analysis. In Mathematical Foundations of Computer Science 1989 (Vol. 379, pp. 431-444). Springer Berlin / Heidelberg.


The RiboWeb Project. (2001, 03/27/01). Retrieved 7/28, 2011, from http://helixweb.stanford.edu/riboweb.html
Rubin, D. L., Shah, N. H., & Noy, N. F. (2008). Biomedical ontologies: a functional perspective. Briefings in Bioinformatics, 9(1), 75-90.
Schatz, M. (2009). CloudBurst: Highly Sensitive Short Read Mapping with MapReduce. Retrieved 11/6, 2011, from http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=CloudBurst
Schulze-Kremer, S. (1998). Ontologies For Molecular Biology. Paper presented at the Third Pacific Symposium on Biocomputing.
Smith, A., Xuan, Z., & Zhang, M. (2008). Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 9(1), 128.
Smith, A. D. (2009). The RMAP software for short-read mapping. Retrieved 11/6, 2011, from http://rulai.cshl.edu/rmap/
Smith, A. D., Chung, W.-Y., Hodges, E., Kendall, J., Hannon, G., Hicks, J., et al. (2009). Updates to the RMAP short-read mapping software. Bioinformatics, 25(21), 2841-2842.
Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., et al. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotech, 25(11), 1251-1255.
Spasic, I., Ananiadou, S., McNaught, J., & Kumar, A. (2005). Text mining and ontologies in biomedicine: Making sense of raw text. Briefings in Bioinformatics, 6(3), 239-251.
Stevens, R., Goble, C. A., & Bechhofer, S. (2000). Ontology-based knowledge representation for bioinformatics. Briefings in Bioinformatics, 1(4), 398-414.
Stevens, R., Zhao, J., & Goble, C. (2007). Using provenance to manage knowledge of In Silico experiments. Briefings in Bioinformatics, 8(3), 183-194.
TAMBIS. (1998). Retrieved 7/28, 2011, from http://www.cs.man.ac.uk/~stevensr/tambis/
UCSC. (2009). UCSC Genome Bioinformatics. Retrieved 11/13, 2011, from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/
Weisstein, E. W. (2000). Subsequence. MathWorld--A Wolfram Web Resource. Retrieved 11/3, 2011, from http://mathworld.wolfram.com/Subsequence.html
Zhao, J., Miles, A., Klyne, G., & Shotton, D. (2009). Linked data and provenance in biological data webs. Briefings in Bioinformatics, 10(2), 139-152.
