
2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE)

Processing Next Generation Sequencing Data in Map-Reduce Framework using Hadoop-BAM in a Computer Cluster

Rifki Sadikin, Andria Arisal
Research Center for Informatics
Indonesian Institute of Sciences
Bandung, Jawa Barat 40135, Indonesia
{rifki.sadikin,andria.arisal}@lipi.go.id

Rofithah Omar, Nur Hidayah Mazni
Faculty of Science and Information Technology
Universiti Teknologi PETRONAS
32610 Seri Iskandar, Perak, Malaysia
{rofithah_22427, nurhidayah_22399}@utp.edu.my

Abstract— Next-Generation Sequencing in bioinformatics produces a massive volume of data. Big data technologies are needed to reduce the computation time of data processing. In this paper, we implement the Hadoop Map-Reduce framework for processing Next-Generation Sequencing data using the Hadoop-BAM library. Our implementation processes a Binary Alignment Map (BAM) file, which contains a reference sequence and many aligned/not-aligned reads, by splitting the BAM file into Hadoop data blocks. To process the BAM file in a computer cluster, we implement a mapper and a reducer for the Hadoop Map-Reduce framework. The mapper processes the BAM file to produce key-value pairs, while the reducer summarizes the key-value pairs into a meaningful output. Here, the mapper and reducer are created to summarize the number of bases in a BAM file. We conduct the experiment on a LIPI Hadoop cluster consisting of 96 CPU cores. The results of our experiments show that our Map-Reduce implementation gains speed-up compared to serial Next-Generation Sequencing processing with Picard tools.

Keywords— map-reduce; Next-Generation Sequencing; bioinformatics

I. INTRODUCTION
With the emergence of sequencing technologies in bioinformatics, solving enormous big data problems, especially in processing sequencing reads, has become a pressing issue [1]. The introduction of these tools has made it possible to analyze gigantic amounts of biological data more accurately and quickly [2]. Moreover, the advancement of modern technology makes Next-Generation Sequencing (NGS) more systematic; it has already surpassed earlier boundaries, so that all the data can be processed and analyzed in parallel using scalable techniques [3]. Apache Hadoop is a notable foundation of this trend, providing parallelized processing of data sets to the bioinformatics community [3].

In this paper, we describe the design and implementation of Hadoop Map-Reduce for Next-Generation Sequencing (NGS) data, applied to summarizing the number of bases in a BAM file. The Hadoop Map-Reduce implementation is crucial for processing Binary Alignment Map (BAM) files in parallel [4]. The mapper and reducer stages parallelize the work by splitting a big BAM file into Hadoop data blocks [1], and the Picard tool is used to process the BAM files [5].

Broadly speaking, Hadoop-BAM is a library, written in the Java programming language, for manipulating Next-Generation Sequencing file formats on various platforms through the Hadoop Map-Reduce framework [4]. As a library for the Hadoop framework, Hadoop-BAM caters to the interrelated matters of BAM splitting through suitable Application Program Interfaces (APIs) [4]. Therefore, we use tools from Hadoop-BAM to process Next-Generation Sequencing (NGS) data in a Hadoop cluster.

The rest of this paper is arranged as follows. Section 2 describes Next-Generation Sequencing, including its definition, its data formats and the processing of NGS data with big data technology. Section 3 presents the Map-Reduce implementation for summarizing Next-Generation Sequencing (NGS) data, encompassing the Map and Reduce algorithms. Section 4 presents the LIPI Hadoop cluster and the results of the comparison between Hadoop-BAM and Picard, discusses the current results, and concludes the paper.

II. NEXT-GENERATION SEQUENCING
A. Definition
Next-Generation Sequencing (NGS) technologies are known as providers of various types of applications that generate millions of read fragments in a single run by parallelizing the sequencing process [6, 7, 8, 9, 10]. NGS technologies were adopted and publicized internationally by most genomic researchers in 2005 [9, 11] as a new platform for the DNA sequencing process.

NGS technologies are capable of accurately emitting large amounts of sequencing data; they have precipitously reduced the cost of research, decreased the elapsed runtime of read alignment, and produce higher-quality output than previous sequencers [6, 8, 9, 12]. Over time, NGS technologies have been further enhanced, and the sequencers have been adapted to particular processes [12].

978-1-5386-0658-2/17/$31.00 ©2017 IEEE



B. NGS Data Format
To deal with sophisticated data and unforeseen genomic sequencing [7], NGS relies on several applications and tools for precise reading of DNA sequences, such as human genome data, to support related research [6]. NGS technologies therefore use different data formats for specific alignment tools [13], such as the SAM, BAM, FASTQ and VCF formats [14].

The Sequence Alignment Map (SAM) file format is used in NGS for short-read alignments, often larger than 128 MB, and supports storing, splitting, indexing and processing with different tools [13]. The BAM format, or Binary Alignment Map, is the binary representation of SAM information [13]. BAM data is compressed into BGZF blocks and, for parallel processing, relies on the Hadoop Map-Reduce framework [4, 14]. FASTQ is the file format for paired-end reads, where two alignment reads are stored contiguously to form a single file [14, 15].

FASTQ files are handled by two separate tools, the Joiner and the Splitter. The FASTQ Joiner attaches the reads of two files into one, while the Splitter does the opposite, processing a FASTQ file and splitting it into two [15]. The Variant Call Format (VCF) was created to support reference genome merging, comparison, base quality scores and other related operations; VCF files are likewise compressed using BGZF blocks [14, 16].

C. Processing NGS Data with Big Data Technology
NGS technologies require enormous and flexible data storage that provides efficient ways to deal with genomic sequencing analysis [11]. Previous studies showed that implementing big data technologies with cloud computing, as a new approach in bioinformatics research, solved the data storage problem [5, 17].

At present, big data technologies are widely represented by Apache Hadoop Map-Reduce and the Hadoop Distributed File System (HDFS) [17]. The growth of sequencing data led researchers to implement the Hadoop-BAM Java library, which depends on Hadoop Map-Reduce and Picard and operates on sequencing data in parallel [4].

As a result, Hadoop programs do not have to deal with the details of BGZF-compressed files, alignment read boundaries, boundary detection, or decoding the binary data. Hadoop-BAM depends on the Picard API to make huge amounts of data available in an adapted form [4].

NGS sequencing with big data in the Halvade framework also carries out sequence analysis using the Hadoop Map-Reduce framework for high-performance data processing in cloud computing [5]. Halvade introduced a new approach to NGS data sequencing in which read alignment and variant calling run in parallel, using different tools in each phase.

To cope with massive data streams, Halvade runs multithreaded instances of existing tools, which makes it easy to run on multiple nodes and to replace a tool with a new version without changing the framework [5].

III. MAP-REDUCE IMPLEMENTATION FOR SUMMARIZING NGS DATA
Map-Reduce is a programming model for processing huge data sets in parallel: the programmer writes the processing logic directly against the framework, which takes care of distributing the data [1, 5]. The framework breaks a job down into two major phases, the Map phase and the Reduce phase [1, 5, 18]. The Map phase consists of input, splitting, mapping, shuffling and sorting, while the Reduce phase reduces the set of tuples (key-value pairs) and produces the output [1, 18].

Fig. 1. Example of a Mapper algorithm

Using the Hadoop-BAM tools, starting from the Map phase, the set of genome data is taken as input and converted into another set of data [1]. In the splitting part, the Hadoop mapper breaks the data set down into data blocks before storing them on disk [19]. Each data block holds 128 MB of storage [13], so one data set spans many data blocks. The Hadoop framework standardizes this block size in order to speed up data processing [1, 5].

In the mapping to key-value pairs, the mapper starts producing intermediate keys based on the data in each block [19]. As illustrated in Fig. 1, every word that appears in a data block is counted one by one.

Then, in the shuffling and sorting part, all reads that align to the identical intermediate key are grouped together [1, 5, 19]. The keys created by the mapper are sorted first, and the values are handed over to the Reduce phase.
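The Map, shuffle/sort and Reduce phases described in this section can be illustrated with a small, self-contained sketch. The class and method names below are illustrative only (they are not part of Hadoop or Hadoop-BAM), and the sketch runs in a single process: it counts nitrogenous bases in toy "reads" the same way the paper's mapper and reducer summarize bases in a BAM file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Minimal single-process sketch of the Map-Reduce flow for counting
 * nitrogenous bases (A, C, G, T). Names are illustrative, not Hadoop APIs.
 */
public class BaseCountSketch {

    /** Map phase: emit one (base, 1) pair per character of a read. */
    static List<Map.Entry<Character, Integer>> map(String read) {
        List<Map.Entry<Character, Integer>> pairs = new ArrayList<>();
        for (char base : read.toCharArray()) {
            pairs.add(Map.entry(base, 1));
        }
        return pairs;
    }

    /** Shuffle/sort phase: group values by intermediate key, keys sorted. */
    static Map<Character, List<Integer>> shuffle(List<Map.Entry<Character, Integer>> pairs) {
        Map<Character, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Character, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the grouped values for each key. */
    static Map<Character, Integer> reduce(Map<Character, List<Integer>> grouped) {
        Map<Character, Integer> totals = new TreeMap<>();
        grouped.forEach((k, v) -> totals.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    public static void main(String[] args) {
        // Two toy "reads" standing in for records from two BAM data blocks.
        List<Map.Entry<Character, Integer>> pairs = new ArrayList<>(map("ACGTAC"));
        pairs.addAll(map("GGTA"));
        System.out.println(reduce(shuffle(pairs))); // {A=3, C=2, G=3, T=2}
    }
}
```

In the real cluster, each call to `map` would run on a different data block, and the shuffle would move pairs between nodes; the per-key summation in the reducer is unchanged.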


Fig. 2. Example of a Reducer algorithm

For the second stage, the Reduce phase, the Hadoop reducer receives all the key-value pairs from the shuffling step as input and filters them by examining the difference from the previous key, in order to decrease the time spent in the Reduce phase. Next, it calculates the total for each intermediate key [1, 19]. After that, the resulting totals are composed and gathered into one output folder [19].

IV. RESULT AND DISCUSSION
A. LIPI Hadoop Clusters
The Indonesian Institute of Sciences (LIPI) has provided us with the Hadoop cluster of the Research Center for Informatics, recognized as P2I LIPI, for carrying out the experiment. We use the LIPI Hadoop clusters at the Bandung site, with the specifications below:

a) Basic nodes: 34 basic nodes, each containing 2 processors with 4 cores per processor. The P2I LIPI cluster uses processors from the dual Intel Xeon E5-2609 product family at 2.4 GHz, with 8 GB of DDR3-1600 RAM and 500 GB of SATA hard disk space per node. The nodes run the Linux (CentOS) operating system and use a dual-gigabit interconnection network.

b) GPU nodes: 4 Graphic Processing Unit (GPU) nodes, each containing 2 processors with 4 cores per processor. Each GPU node uses dual Intel Xeon E5-2609 processors at 2.4 GHz, with 8 GB of DDR3-1600 RAM and 500 GB of SATA hard disk space. The GPU nodes also provide a dual-gigabit interconnection and an NVIDIA Tesla M2075 GPGPU, and run the Linux (CentOS) operating system.

c) Master nodes: 2 nodes, with 2 processors per node and 8 cores per processor. They use processors from the dual Intel Xeon E5-2650 product family at 2.0 GHz, with 128 GB of DDR3-1600 RAM and 24 TB of raw SATA disk in RAID 5. The nodes are interconnected by a dual 10 Gb interconnection and run the Linux (CentOS) operating system.

Fig. 3. Example of the LIPI Clusters (NameNode/ResourceManager: TAUKE02; DataNode: KLEREK14; Secondary NameNode: TAUKE02)

d) TAUKE02: The total capacity of TAUKE02 is 3.6 TB, of which the Distributed File System (DFS) uses about 174.59 GB, with 466.82 GB of non-DFS usage. The TAUKE02 block pools consist of 174.59 GB, a capacity that contains the block files in the namespace and encompasses 12 name nodes.

e) KLEREK14: It contains 307.18 GB of storage and has used 14.88 GB for DFS and 43.14 GB for non-DFS data. KLEREK14 consists of 254 blocks and used 4.85% (14.88 GB) to test the alignment reads of the Hadoop-BAM file.

B. Results and Discussion
In our experiments on NGS data processing in the Map-Reduce framework using Hadoop-BAM, we run reads of BAM files and compare the performance of Hadoop-BAM with Picard HTSJDK. The two tools were tested on the elapsed time, the number of splits and the number of read alignments, using Picard htsjdk-2.3.0 and Hadoop-BAM 7.8.1 with Hadoop 2.7.3.

We investigate the performance of the tools on different genome data sizes: 1.0 GB, 2.1 GB, 3.1 GB, 4.0 GB, 5.1 GB, 6.0 GB, 7.0 GB, 8.1 GB, 9.1 GB and 10.0 GB BAM files. The genome data consists of the four types of nitrogenous bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) [20]. For that reason, we use the Hadoop-BAM tools to read the alignments of nitrogenous bases.

To evaluate the counting of reads in BAM files with Picard, we first set up a specific mechanism to count the alignment reads of the BAM files. With the required customization, Picard manages the data serially, accessing one BAM file entry at a time. This differentiates it from Hadoop-BAM, which processes the collection in parallel, with all entries handled simultaneously. The details of each data size, the time elapsed and the result for each data set are presented in Table I.
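Section III noted that Hadoop stores data in 128 MB blocks, and Table I (below) reports the number of splits produced by Hadoop-BAM for each file size. As a rough cross-check, the block count for a given file size can be estimated by ceiling division. This is a naive sketch under the 128 MB assumption only; the split counts actually reported by Hadoop-BAM differ slightly, since the library adjusts split boundaries to whole BAM records.

```java
/**
 * Naive estimate of how many 128 MB Hadoop blocks a file of a given size
 * occupies. Illustrative only: Hadoop-BAM's real split counts (Table I)
 * deviate slightly because splits are aligned to BAM record boundaries.
 */
public class SplitEstimate {

    static final long BLOCK_BYTES = 128L * 1024 * 1024; // 128 MB HDFS block

    /** Ceiling division: number of blocks needed to hold fileBytes. */
    static long estimateSplits(long fileBytes) {
        return (fileBytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(estimateSplits(oneGiB));      // 8 blocks for 1 GiB
        System.out.println(estimateSplits(10 * oneGiB)); // 80 blocks for 10 GiB
    }
}
```

The estimates (8 and 80) are close to the 9 and 81 splits measured in Table I, which is consistent with Hadoop-BAM adding a small amount of boundary adjustment on top of plain block division.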


TABLE I. ELAPSED TIMES FOR SUMMARIZING BAM FILES WITH PICARD AND HADOOP-BAM, AND THE RESULTING SPEED-UP (PICARD TIME DIVIDED BY HADOOP-BAM TIME)

Data size | Picard (s) | Hadoop-BAM (s) | Splits | Speed up (Picard / Hadoop-BAM)
1.0 GB    |  37.09     |  4.64          |  9     |  7.99
2.1 GB    | 120.37     | 14.98          | 17     |  8.04
3.1 GB    | 135.28     | 16.01          | 25     |  8.45
4.0 GB    | 157.07     | 16.48          | 32     |  9.53
5.1 GB    | 225.00     | 23.38          | 41     |  9.62
6.0 GB    | 237.71     | 24.28          | 48     |  9.79
7.0 GB    | 289.77     | 29.38          | 57     |  9.86
8.1 GB    | 336.00     | 33.69          | 65     |  9.97
9.1 GB    | 390.52     | 37.63          | 74     | 10.38
10.0 GB   | 429.46     | 37.11          | 81     | 11.57

Fig. 4. Speed up gain: Map-Reduce vs. serial on summarizing a BAM file

As shown in Fig. 4, the Map-Reduce version of summarizing a BAM file gains speed-up significantly. The speed-up in computation time is supported by the number of splits made by Hadoop-BAM. In the Hadoop file system, a large file is divided into a number of Hadoop blocks (128 MB each); the Hadoop-BAM library splits a large BAM file into these Hadoop blocks decisively.

Our Map-Reduce implementation processes these BAM files in parallel across the available data nodes. For that reason, the different BAM data sizes bring large changes in the runtime gap between serial processing with Picard and Hadoop-BAM. The speed-up difference between 1.0 GB and 2.1 GB is not obvious, only 0.05; however, the speed-up changes noticeably when comparing 2.1 GB with 3.1 GB, and 3.1 GB with 4.0 GB, together with the number of Hadoop-BAM splits. Consequently, the graph shows that the speed-up increases as the BAM data size is enlarged.

V. CONCLUSION AND FURTHER STUDIES
In this work, we show that the Map-Reduce version of processing Next-Generation Sequencing data speeds up the computation. This means that Hadoop Map-Reduce running on a computer cluster can reduce the computation time for processing NGS data, which usually has a very large size. However, our study is limited to summarizing reads in a large BAM file, a task that is naturally easy to parallelize. Further studies are needed to show how effective this approach is for other types of computation in NGS data processing.

ACKNOWLEDGMENT
The Hadoop cluster was fully provided by P2I LIPI, Research Center for Informatics, Indonesian Institute of Sciences. The financial funding for the student internship program was fully supported by the Yayasan Universiti Teknologi PETRONAS (YUTP) Scholarship.

REFERENCES
[1] M. Niemenmaa, "Analysing sequencing data in Hadoop: The road to interactivity via SQL," Master's thesis, Aalto University, 2013. Retrieved 28 Sep. 2017 from https://aaltodoc.aalto.fi/bitstream/handle/123456789/11886/master_niemenmaa_matti_2013.pdf
[2] U. Sehar, N. Ahmad, and M. A. Mehmood, "Use of bioinformatics tools in different spheres of life sciences," Data Mining in Genomics & Proteomics, 2014. https://www.omicsonline.org/open-access/use-of-bioinformatics-tools-in-different-spheres-of-life-sciences-2153-0602-5-158.pdf
[3] A. O'Driscoll, J. Daugelaite, and R. D. Sleator, "'Big data', Hadoop and cloud computing in genomics," 2013. http://www.sciencedirect.com/science/article/pii/S1532046413001007
[4] M. Niemenmaa, A. Kallio, A. Schumacher, P. Klemelä, E. Korpelainen, and K. Heljanko, "Hadoop-BAM: directly manipulating next generation sequencing data in the cloud," Bioinformatics, 2012. https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/bts054
[5] D. Decap, J. Reumers, C. Herzeel, P. Costanza, and J. Fostier, "Halvade: scalable sequence analysis with MapReduce," 2015. https://www.ncbi.nlm.nih.gov/pubmed/25819078
[6] R. K. Patel and M. Jain, "NGS QC Toolkit: a toolkit for quality control of next generation sequencing data," PLoS ONE 7(2): e30619, 2012.
[7] S. Behjati and P. S. Tarpey, "What is next generation sequencing?" Arch Dis Child Pract Educ 98:236-238, 2013.
[8] H. P. J. Buermans and J. T. den Dunnen, "Next generation sequencing technology: advances and applications," Biochimica et Biophysica Acta 1842:1932-1941, 2014.
[9] M. Kchouk, J. F. Gibrat, and M. Elloumi, "Generations of sequencing technologies: from first to next generation," Biol Med (Aligarh) 9:395, 2017. doi:10.4172/0974-8369.1000395
[10] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nat. Biotechnol. 26:1135-1145, 2008.
[11] O. Morozova and M. A. Marra, "Applications of next-generation sequencing technologies in functional genomics," Genomics 92:255-264, 2008.
[12] R. Tripathi, P. Sharma, P. Chakraborty, and P. K. Varadwaj, "Next-generation sequencing revolution through big data analytics," Frontiers in Life Science 9(2):119-149, 2016. doi:10.1080/21553769.2016.1178180
[13] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics 25:2078-2079, 2009. doi:10.1093/bioinformatics/btp352
[14] D. Decap, J. Reumers, C. Herzeel, P. Costanza, and J. Fostier, "Halvade-RNA: parallel variant calling from transcriptomic data using MapReduce,"


PLoS ONE 12(3): e0174575, 2017. https://doi.org/10.1371/journal.pone.0174575
[15] D. Blankenberg, A. Gordon, G. V. Kuster, N. Coraor, J. Taylor, A. Nekrutenko, et al., "Manipulation of FASTQ data with Galaxy," Bioinformatics 26(14):1783-1785, 2010.
[16] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, et al., "The variant call format and VCFtools," Bioinformatics 27(15):2156-2158, 2011.
[17] P. Singh, "Big genomic data in bioinformatics cloud," Appli Microbio Open Access 2:113, 2016. doi:10.4172/2471-9315.1000113
[18] M. Maharjan, "Genome analysis with MapReduce," 2011. http://www.tcs.hut.fi/Studies/T-79.5001/reports/2011-Maharjan.pdf
[19] E. A. Mohammed, B. H. Far, and C. Naugler, "Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends," 2014. https://biodatamining.biomedcentral.com/track/pdf/10.1186/1756-0381-7-22?site=biodatamining.biomedcentral.com
[20] E. D. Francesco, G. D. Santo, L. Palopoli, and S. E. Rombo, "A summary of genomic databases: overview and discussion," 2009. http://math.unipa.it/rombo/files/publications/chapter09c_draft.pdf

