
Testing Big Data

Camelia Rad

DVSERO TC 2019
Agenda
● Big Data – a short introduction
● What is Hadoop and how does it work?
● What is Map Reduce and how does it work?
● Big Data Testing – what it is and the testing strategy
● How to test Hadoop Applications
● Big Data Testing vs. Traditional Database Testing
● Challenges in Big Data Testing
What is Data?

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
What is Big Data?

Big Data is also data, but of enormous size.

Big Data is a term used to describe a collection of data that is huge in size and yet keeps growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
Classification of Big Data

Structured Data: any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.

Data stored in a relational database management system is one example of 'structured' data.
Classification of Big Data

Unstructured Data: any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Classification of Big Data

Semi-Structured Data: semi-structured data can contain both forms of data. It appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS.

An example of semi-structured data is data represented in an XML file.


Characteristics of Big Data

Volume – the name Big Data itself refers to an enormous size. The size of data plays a very crucial role in determining its value, and whether a particular data set can actually be considered Big Data depends on its volume.

Variety – variety refers to heterogeneous sources and the nature of the data, both structured and unstructured.

Velocity – the term refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data.

Variability – this refers to the inconsistency the data can show at times, which hampers the process of handling and managing the data effectively.
Challenges of Big Data

There are two main challenges associated with big data:


1. How do we store and manage such a huge volume of data efficiently?
2. How do we process and extract valuable information from this huge volume of data within a given timeframe?

These are the two main challenges that led to the development of the Hadoop framework.
What is Hadoop?

Apache Hadoop is an open-source software framework used to develop data processing applications that are executed in a distributed computing environment.

Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and are mainly useful for achieving greater computational power at low cost.

Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system called the Hadoop Distributed File System (HDFS). The processing model is based on the 'data locality' concept, wherein computational logic is sent to the cluster nodes (servers) that contain the data. This computational logic is simply a compiled version of a program written in a high-level language such as Java, and it processes data stored in HDFS.
Features of Hadoop

• Suitable for Big Data analysis: as Big Data tends to be distributed and unstructured in nature, Hadoop clusters are best suited for its analysis. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This is called the data locality concept, and it helps increase the efficiency of Hadoop-based applications.

• Scalability: Hadoop clusters can easily be scaled to any extent by adding cluster nodes, which allows for the growth of Big Data. Scaling does not require modifications to application logic.

• Fault tolerance: the Hadoop ecosystem replicates the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed using the data stored on another cluster node.
How Does Hadoop Work?

Apache Hadoop consists of two sub-projects:

Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous amounts of data in parallel on large clusters of computation nodes.

HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce
applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on
compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.

Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.
What is HDFS?

HDFS is a distributed file system for storing very large data files, running on clusters of commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. Hadoop comes bundled with HDFS (the Hadoop Distributed File System).

When data exceeds the capacity of storage on a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage-specific operations across a network of machines is called a distributed file system; HDFS is one such system.
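To make this concrete, below is a minimal sketch of how an application might talk to HDFS from Python over WebHDFS. It assumes the third-party 'hdfs' package (installed with pip install hdfs) and a hypothetical NameNode address, user, and paths; adjust all of these for a real cluster.

from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint and user; change these for your cluster.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Upload a local file into the distributed file system.
client.upload("/data/raw/events.csv", "events.csv")

# List the directory and read part of the file back, as if it were local.
print(client.list("/data/raw"))
with client.read("/data/raw/events.csv") as reader:
    print(reader.read(200))  # first 200 bytes

The client hides the interaction described on the next slide: metadata requests go to the NameNode, while the actual bytes are streamed to and from the DataNodes that hold the blocks.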
HDFS Architecture

An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.

NameNode: the NameNode can be considered the master of the system. It maintains the file system tree and the metadata for all the files and directories present in the system. Two files, the 'namespace image' and the 'edit log', are used to store metadata information. The NameNode knows which DataNodes contain the data blocks for a given file; however, it does not store block locations persistently. This information is reconstructed from the DataNodes every time the system starts.

DataNodes: DataNodes are slaves which reside on each machine in the cluster and provide the actual storage. They are responsible for serving read and write requests from clients.

Read/write operations in HDFS operate at the block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. The default block size is 64 MB.
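As a small illustration of block-level storage, the sketch below (plain Python, with the replication factor as an assumed value) computes how many 64 MB blocks a file occupies and roughly how much raw cluster storage its replicas consume.

import math

BLOCK_SIZE_MB = 64   # the default block size mentioned above
REPLICATION = 3      # typical HDFS replication factor (an assumption here)

def hdfs_layout(file_size_mb):
    """How many blocks a file is split into, and the raw space its replicas use."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block only occupies the bytes it actually holds, so raw
    # storage is roughly the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

print(hdfs_layout(1000))   # a 1 GB file -> (16, 3000)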
What is Map Reduce?

MapReduce is a programming model suitable for processing huge amounts of data.

Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster.
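As an illustration, here is the classic word-count job written as a pair of Python scripts in the Hadoop Streaming style, where the mapper and reducer read lines from standard input and emit tab-separated key-value pairs on standard output. The file names are illustrative.

#!/usr/bin/env python3
# mapper.py -- emits one (word, 1) pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sums the counts for each word; Hadoop delivers the mapper
# output sorted by key, so equal words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The pair can be smoke-tested locally with "cat input.txt | python3 mapper.py | sort | python3 reducer.py" before being submitted to the cluster through the Hadoop Streaming jar.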
How Does Map Reduce Work?

The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.

Consider that you have some input data for your MapReduce program. (The 'Map reduce example' slides that followed showed the input data and the step-by-step walkthrough as figures.)
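Since those figures are not reproduced here, the sketch below simulates the four phases in plain Python on a small, made-up input of three lines, so the intermediate result of each phase can be inspected.

from itertools import groupby

# Made-up input, standing in for the figure in the original slides.
input_data = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# 1. Splitting: the input is divided into independent splits (here, one per line).
splits = input_data

# 2. Mapping: each split is turned into (key, value) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# 3. Shuffling: pairs are sorted and grouped by key across all mappers.
shuffled = {key: [v for _, v in group]
            for key, group in groupby(sorted(mapped), key=lambda kv: kv[0])}

# 4. Reducing: the values for each key are combined into the final result.
reduced = {key: sum(values) for key, values in shuffled.items()}

print(reduced)   # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}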
What is Big Data Testing?

Big Data testing is defined as the testing of Big Data applications.

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. Testing these datasets involves various tools, techniques, and frameworks.
Big Data Testing Strategy

Testing a Big Data application is more about verifying its data processing than testing the individual features of the software product. When it comes to Big Data testing, performance and functional testing are key.

Along with this, data quality is also an important factor in Hadoop testing. Before testing the application, it is necessary to check the quality of the data; this should be considered part of database testing. It involves checking various characteristics like conformity, accuracy, duplication, consistency, validity, etc.
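As a hedged illustration of such data quality checks, the sketch below runs three of them (duplication, validity, conformity) over a handful of records; the field names and rules are hypothetical.

import re

# Hypothetical records pulled from a source system.
records = [
    {"id": "1", "email": "ana@example.com", "amount": "120.50"},
    {"id": "2", "email": "not-an-email",    "amount": "80.00"},
    {"id": "1", "email": "ana@example.com", "amount": "120.50"},  # duplicate
]

def quality_report(rows):
    """Very small data-quality report: duplication, validity, conformity."""
    seen, duplicates = set(), 0
    invalid_email, non_numeric_amount = 0, 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", row["email"]):
            invalid_email += 1
        if not re.fullmatch(r"\d+(\.\d+)?", row["amount"]):
            non_numeric_amount += 1
    return {"duplicates": duplicates,
            "invalid_email": invalid_email,
            "non_numeric_amount": non_numeric_amount}

print(quality_report(records))  # {'duplicates': 1, 'invalid_email': 1, 'non_numeric_amount': 0}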
How to test Hadoop Applications

Step 1: Data Staging Validation


The first step of Big Data testing, also referred to as the pre-Hadoop stage, involves process validation.

● Data from various sources such as RDBMS, weblogs, social media, etc. should be validated to make sure that the correct data is pulled into the system
● Source data should be compared with the data pushed into the Hadoop system to make sure they match
● Verify that the right data is extracted and loaded into the correct HDFS location

Tools like Talend and Datameer can be used for data staging validation.
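The same checks can also be scripted. The sketch below is a minimal, hypothetical version of the count-and-compare idea: it computes a record count and an order-independent checksum for the source extract and for the copy landed in HDFS (both read here as local CSV files to keep the example self-contained).

import csv
import hashlib

def row_signature(path):
    """Record count plus an order-independent checksum over all rows."""
    count, digest = 0, 0
    with open(path, newline="") as f:
        for row in csv.reader(f):
            count += 1
            # XOR of per-row hashes is insensitive to row order.
            digest ^= int(hashlib.md5(",".join(row).encode()).hexdigest(), 16)
    return count, digest

# Hypothetical file names: the source extract and the file pulled back from HDFS.
source = row_signature("source_extract.csv")
landed = row_signature("hdfs_landed_copy.csv")

assert source[0] == landed[0], "record counts differ between source and HDFS"
assert source[1] == landed[1], "row contents differ between source and HDFS"
print("staging validation passed:", source[0], "records match")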
How to test Hadoop Applications

Step 2: "MapReduce" Validation


The second step is the validation of "MapReduce". In this stage, the tester verifies the business logic on every node and then validates it after running against multiple nodes, ensuring that:

● the MapReduce process works correctly
● data aggregation or segregation rules are implemented on the data
● key-value pairs are generated
● the data is validated after the MapReduce process
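Because the business logic lives in the map and reduce functions, much of this validation can start as ordinary unit tests run before the job ever reaches a cluster. The sketch below uses stand-in word-count functions; in a real project these would be imported from the mapper and reducer modules under test.

import unittest

# Stand-ins for the job's business logic; replace with imports of the real code.
def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return (key, sum(values))

class MapReduceValidation(unittest.TestCase):
    def test_mapper_emits_key_value_pairs(self):
        self.assertEqual(map_fn("Deer Bear Deer"),
                         [("Deer", 1), ("Bear", 1), ("Deer", 1)])

    def test_reducer_aggregates_values_per_key(self):
        self.assertEqual(reduce_fn("Deer", [1, 1, 1]), ("Deer", 3))

if __name__ == "__main__":
    unittest.main()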
How to test Hadoop Applications
Step 3: Output Validation Phase
The final or third stage of Big Data testing is the output validation process. The output data files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on the requirement.

Activities in the third stage include:

● checking that the transformation rules are correctly applied
● checking the data integrity and the successful data load into the target system
● checking that there is no data corruption by comparing the target data with the HDFS file system data
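A minimal sketch of the last two checks: it compares the row count and a simple aggregate between the HDFS output file and the table loaded into the target system. The file name, table name, and column are hypothetical, and a local SQLite database stands in for the EDW so the example stays runnable.

import csv
import sqlite3

def hdfs_output_summary(path):
    """Row count and total of the 'amount' column in the exported HDFS file."""
    count, total = 0, 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            count += 1
            total += float(row["amount"])
    return count, round(total, 2)

def warehouse_summary(db_path):
    """The same figures read back from the (hypothetical) target table."""
    with sqlite3.connect(db_path) as conn:
        count, total = conn.execute(
            "SELECT COUNT(*), ROUND(SUM(amount), 2) FROM sales_fact").fetchone()
    return count, total

assert hdfs_output_summary("hdfs_output.csv") == warehouse_summary("edw.db"), \
    "target system does not match the HDFS output"
print("output validation passed")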
How to test Hadoop Applications

Architecture Testing
Hadoop processes very large volumes of data and is highly resource intensive. Hence, architectural testing is crucial to ensure the success of your Big Data project. A poorly or improperly designed system may lead to performance degradation, and the system could fail to meet the requirements. At a minimum, Performance and Failover test services should be carried out in a Hadoop environment.

Performance testing includes testing of job completion time, memory utilization, data throughput, and similar system metrics, while the goal of the Failover test service is to verify that data processing continues seamlessly in case of failure of data nodes.
How to test Hadoop Applications
Performance Testing

● Data ingestion and throughput: in this stage, the tester verifies how fast the system can consume data from various data sources. Testing involves identifying the number of messages that the queue can process in a given time frame. It also includes how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database (a simple timing harness for this is sketched after this list).
● Data processing: this involves verifying the speed with which the queries or MapReduce jobs are executed. It also includes testing the data processing in isolation when the underlying data store is populated with the data sets, for example running MapReduce jobs on the underlying HDFS.
● Sub-component performance: these systems are made up of multiple components, and it is essential to test each of these components in isolation, for example how quickly messages are indexed and consumed, MapReduce jobs, query performance, search, etc.
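A minimal sketch of the ingestion measurement mentioned in the first bullet: it times how many records per second a stand-in insert function can push into the underlying store. The insert_record stub is hypothetical and would be replaced by a real MongoDB, Cassandra, or queue client call.

import time

def insert_record(record):
    """Stand-in for a real datastore or queue client call."""
    pass  # replace with the actual client call when measuring a real system

def measure_ingestion_rate(records, batch_label="test batch"):
    """Insert all records and report throughput in records per second."""
    start = time.perf_counter()
    for record in records:
        insert_record(record)
    elapsed = time.perf_counter() - start
    rate = len(records) / elapsed if elapsed > 0 else float("inf")
    print(f"{batch_label}: {len(records)} records in {elapsed:.2f}s "
          f"({rate:,.0f} records/s)")
    return rate

measure_ingestion_rate([{"id": i} for i in range(100_000)])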
How to test Hadoop Applications
Performance Testing Approach
Performance testing for a Big Data application involves testing huge volumes of structured and unstructured data, and it requires a specific testing approach to handle such massive data.
Big Data Testing vs. Traditional Database Testing

(The side-by-side comparison table was presented as a figure on the original slides.)
Challenges in Big Data Testing

● Automation – automation testing for Big Data requires someone with technical expertise. Also, automated tools are not equipped to handle unexpected problems that arise during testing.
● Virtualization – it is one of the integral phases of testing. Virtual machine latency creates timing problems in real-time Big Data testing, and managing images in Big Data is also a hassle.
● Large datasets
  ○ Need to verify more data, and to do it faster
  ○ Need to automate the testing effort
  ○ Need to be able to test across different platforms
Challenges in Big Data Testing

Performance testing challenges

● Diverse set of technologies: each sub-component belongs to a different technology and requires testing in isolation.
● Unavailability of specific tools: no single tool can perform the end-to-end testing.
● Test scripting: a high degree of scripting is needed to design test scenarios and test cases.
● Test environment: a special test environment is needed due to the large data size.
● Monitoring solution: limited solutions exist that can monitor the entire environment.
● Diagnostic solution: a custom solution needs to be developed to drill down into the performance bottleneck areas.
Thank you!
