Abstract— Next-Generation Sequencing in bioinformatics produces a massive volume of data. Big data technologies are needed to reduce the computation time of processing it. In this paper, we implement the Hadoop Map-Reduce framework for processing Next-Generation Sequencing data using the Hadoop-BAM library. Our implementation processes a Binary Alignment Map (BAM) file, which contains a reference sequence and many aligned or unaligned reads, by splitting the BAM file into Hadoop data blocks. To process the BAM file on a computer cluster, we implement a mapper and a reducer for the Hadoop Map-Reduce framework. The mapper processes the BAM file to produce key-value pairs, while the reducer summarizes the key-value pairs into a meaningful output. Here, the mapper and reducer are created to count the number of bases in a BAM file. We conduct the experiment on a LIPI Hadoop cluster consisting of 96 CPU cores. The results of our experiments show that our Map-Reduce implementation gains a speed-up compared to serial Next-Generation Sequencing processing with the Picard tools.

Keywords— map-reduce; Next-Generation Sequencing; bioinformatics;

I. INTRODUCTION

With the emergence of sequencing technologies in bioinformatics, solving the enormous big data problems they create, especially in processing sequencing reads, has become a pressing issue [1]. The introduction of these tools has made it possible to analyze gigantic amounts of biological data more accurately and quickly [2]. Moreover, modern technology has made Next-Generation Sequencing (NGS) more systematic and has pushed past earlier boundaries, so that all the data can be processed and analyzed in parallel using scalable techniques [3]. Apache Hadoop is a notable foundation for this and is widely used to parallelize data sets within the bioinformatics community [3].

In this paper, we describe the design and implementation of Hadoop Map-Reduce for Next-Generation Sequencing (NGS) data, counting the number of bases in a BAM file. The Hadoop Map-Reduce implementation is crucial for executing Binary Alignment Map (BAM) files in parallel [4]. The mapper and reducer stages parallelize the work by splitting a big BAM file into Hadoop data blocks [1], and the Picard tool is used to process the BAM files [5].

Broadly, Hadoop-BAM is a library, written in the Java programming language, for manipulating Next-Generation Sequencing file formats on various platforms through the Hadoop Map-Reduce framework [4]. As a library for the Hadoop framework, Hadoop-BAM handles the interrelated concerns of BAM splitting through suitable Application Program Interfaces (APIs) [4]. We therefore use tools from Hadoop-BAM to process Next-Generation Sequencing (NGS) data on a Hadoop cluster.

The rest of this paper is arranged as follows: Section 2 describes Next-Generation Sequencing, including its definition, data format, and how NGS data is processed with big data technology. Section 3 presents the Map-Reduce implementation for summarizing Next-Generation Sequencing (NGS) data, covering the Map and Reduce algorithms. Section 4 presents the LIPI Hadoop cluster and the experimental comparison between Hadoop-BAM and Picard, together with a discussion of the results and the conclusion of this paper.

II. NEXT-GENERATION SEQUENCING

A. Definition

Next-Generation Sequencing (NGS) technologies are known as a family of applications that generate millions of read fragments in a single run by parallelizing the sequencing process [6, 7, 8, 9, 10]. NGS technologies were adopted and publicized internationally by most genomic researchers in 2005 [9, 11] as a new platform for the DNA sequencing process.

NGS technologies are capable of accurately emitting large volumes of sequencing data; they have precipitously reduced the cost of research, decreased the elapsed runtime of read alignment, and produced higher-quality output than previous sequencers [6, 8, 9, 12]. Over time, NGS technologies have been further enhanced and the sequencers adapted to particular processes [12].
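The base-counting Map-Reduce job described in this paper can be sketched in miniature. The following Python snippet is illustrative only (the actual implementation is written in Java against the Hadoop-BAM API); the toy reads stand in for the records of one BAM split:

```python
from collections import defaultdict

def mapper(read):
    """Emit one (base, 1) key-value pair per base in a read."""
    for base in read:
        yield (base, 1)

def shuffle(pairs):
    """Group intermediate values by key, as Hadoop does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum all counts emitted for one base type."""
    return key, sum(values)

# Toy input; in the real job each mapper receives one Hadoop block of the BAM file.
reads = ["ACGT", "AACG", "TTGA"]
pairs = (kv for read in reads for kv in mapper(read))
totals = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(totals)  # number of A, C, G and T bases across all reads
```

In the real job, Hadoop performs the shuffle step itself between the map and reduce phases; only the mapper and reducer are supplied by the programmer.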
2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE)
For the second stage, the reduce phase, the Hadoop Reducer receives all the keys and values from the shuffle as input and filters that input by checking for a change from the previous key, in order to decrease the time spent in the reducer phase. Next, it calculates the total for each intermediate key [1, 19]. After that, the per-key totals are composed and gathered into one output folder [19].

IV. RESULT AND DISCUSSION

A. LIPI Hadoop Clusters

The Indonesian Institute of Sciences (LIPI) has provided us with the Hadoop cluster of the Research Center for Informatics, known as P2I LIPI, for carrying out the experiment. We use the LIPI Hadoop clusters at the Bandung site, with the following specifications:

(Figure: LIPI clusters — NameNode/ResourceManager: TAUKE02; Secondary NameNode: TAUKE02; DataNode: KLEREK14.)

a) Basic nodes: 34 basic nodes, each containing 2 processors with 4 cores per processor. The P2I LIPI cluster uses dual Intel Xeon E5-2609 processors at 2.4 GHz, with 8 GB of DDR3-1600 RAM and 500 GB of SATA hard disk space per node. The nodes run the Linux (CentOS) operating system and use a dual-gigabit interconnection network.

b) GPU nodes: 4 Graphic Processing Unit (GPU) nodes, each containing 2 processors with 4 cores per processor. Each GPU node uses dual Intel Xeon E5-2609 processors at 2.4 GHz, with 8 GB of DDR3-1600 RAM and 500 GB of SATA hard disk space. The GPU nodes also provide a dual-gigabit interconnection and an NVIDIA Tesla M2075 GPGPU, and run the Linux (CentOS) operating system.

c) Master nodes: 2 nodes, each with 2 processors and 8 cores per processor. They use dual Intel Xeon E5-2650 processors at 2.0 GHz, with 128 GB of DDR3-1600 RAM and 24 TB of raw SATA hard disk in RAID 5. The nodes are interconnected by a dual 10-gigabit interconnection and run the Linux (CentOS) operating system.

d) TAUKE02: The total capacity of TAUKE02 is 3.6 terabytes, of which the Distributed File System (DFS) uses about 174.59 GB, with 466.82 GB used for non-DFS storage. The block pools of TAUKE02 hold 174.59 GB, containing the block files in the namespace, and encompass 12 name nodes.

e) KLEREK14: It has 307.18 GB of storage, of which 14.88 GB is used for DFS and 43.14 GB for non-DFS storage. KLEREK14 holds 254 blocks and used 4.85% of its capacity (14.88 GB) to test the alignment reads of the Hadoop-BAM file.

B. Results and Discussion

For the experiments on NGS data processing in the Map-Reduce framework using Hadoop-BAM, we run the reads of the BAM files and compare the performance of Hadoop-BAM with the Picard HTSJDK. The two tools are compared on elapsed time, number of splits, and number of read alignments, using Picard htsjdk-2.3.0 and Hadoop-BAM 7.8.1 with Hadoop 2.7.3.

We investigate the performance of the tools over different genome data sizes: 1.0 GB, 2.1 GB, 3.1 GB, 4.0 GB, 5.1 GB, 6.0 GB, 7.0 GB, 8.1 GB, 9.1 GB, and 10.0 GB of BAM files. The genome data consists of four types of nitrogenous bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T) [20]. We therefore use the Hadoop-BAM tools to read the alignments of nitrogenous bases.

To evaluate read counting on BAM files through Picard, we first configured the specific mechanism for counting the alignment reads of the BAM files. Once configured, Picard processes the data serially within the garbage-collected Java runtime, accessing one BAM file entry at a time. This contrasts with Hadoop-BAM, which processes the collection in parallel, running over all entries simultaneously. The data sizes, elapsed times, and results are presented in Table 1.
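Two quantities in Table 1 can be cross-checked with a few lines of arithmetic. The number of BAM splits is close to what Hadoop would produce under its common default HDFS block size of 128 MB (an assumption on our part; the paper does not state the configured block size), and the speed-up column is simply the ratio of the two elapsed times:

```python
import math

HDFS_BLOCK_MB = 128  # assumed default block size; not stated in the paper

def estimated_splits(size_gb):
    """Approximate number of input splits for a BAM file of the given size."""
    return math.ceil(size_gb * 1024 / HDFS_BLOCK_MB)

def speedup(picard_seconds, hadoop_bam_seconds):
    """Speed-up of parallel Hadoop-BAM over serial Picard."""
    return picard_seconds / hadoop_bam_seconds

# 2.1 GB -> 17 splits and 4.0 GB -> 32 splits, matching Table 1;
# the other sizes land within one split of the reported values.
# Speed-up for the 1.0 GB file: 37.09 s / 4.64 s, about 8x.
```

The close agreement between the estimated and reported split counts suggests the files were divided into roughly block-sized pieces, with the small discrepancies plausibly due to the BAM header and compressed-block boundaries.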
TABLE I. ELAPSED TIMES OF PICARD AND HADOOP-BAM, THE NUMBER OF BAM SPLITS, AND THE SPEED-UP (PICARD TIME DIVIDED BY HADOOP-BAM TIME).

Data size   Picard (s)   Hadoop-BAM (s)   BAM splits   Speed-up (Picard/Hadoop-BAM)
1.0 GB      37.09        4.64             9            7.99
2.1 GB      120.37       14.98            17           8.04
3.1 GB      135.28       16.01            25           8.45
4.0 GB      157.07       16.48            32           9.53
5.1 GB      225          23.38            41           9.62
6.0 GB      237.71       24.28            48           9.79
7.0 GB      289.77       29.38            57           9.86
8.1 GB      336          33.69            65           9.97
9.1 GB      390.52       37.63            74           10.38
10 GB       429.46       37.11            81           11.57

V. CONCLUSION AND FURTHER STUDIES

In this work, we show that the Map-Reduce version of Next-Generation Sequencing data processing speeds up the computation. This means that Hadoop Map-Reduce running on a computer cluster can reduce the computation time for processing NGS data, which usually comes in very large sizes. However, our study is limited to summarizing read data in a large BAM file, a task that is naturally easy to parallelize. Further studies are needed to show how effective the approach is for other types of computation in NGS data processing.

ACKNOWLEDGMENT

The Hadoop cluster was fully provided by P2I LIPI, Research Center for Informatics, Indonesian Institute of Sciences. The financial funding for the student internship program was fully supported by the Yayasan Universiti Teknologi PETRONAS (YUTP) Scholarship.
REFERENCES