
Deployment of Distributed Cluster and Problem Testing

Presented by: Syed Shabi-ul-hasnain Nazir
Supervised by: Sir Tahir Roshmi


Distributed Computing:
Distributed computing is a field of computer science that studies distributed systems. A distributed system consists of multiple computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.

In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers.


Distributed Computing Paradigms:
A paradigm is a pattern, example, or model. There are several distributed computing paradigms:

The Message Passing Paradigm
The Client-Server Paradigm
The Peer-to-Peer System Architecture
The Message System Paradigm
Remote Procedure Call
The Mobile Agent Paradigm
The Groupware Paradigm


Hadoop:
Hadoop is a software framework that enables distributed manipulation of large amounts of data, and it does this in a way that is reliable and efficient. Hadoop is reliable because it assumes that computing elements and storage will fail, so it maintains several copies of the working data to ensure that processing can be redistributed around failed nodes. Hadoop is efficient because it works on the principle of parallelization, allowing data to be processed in parallel to increase processing speed.
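As a minimal sketch of how this replication can be controlled (not part of the original presentation), the snippet below uses the HDFS Java API to request three copies of a file's blocks; the file path and replication factor are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path; keep three copies of its blocks so processing
    // can continue even if a node holding one copy fails.
    Path file = new Path("/user/hadoop/input/data.txt");
    boolean changed = fs.setReplication(file, (short) 3);
    System.out.println("Replication updated: " + changed);
  }
}
```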


Hadoop is made up of a number of elements:

Hadoop Distributed File System (HDFS)
MapReduce


HDFS:
The HDFS architecture is built from a collection of special nodes. There are two types of node in HDFS:

1. Name Node: there is only one, and it provides metadata services within HDFS.
2. Data Node: serves storage blocks for HDFS.

Files are stored in HDFS in the form of blocks, and the default size of one block is 64 MB.
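As a minimal sketch (not from the presentation) of how this block layout can be inspected, the snippet below uses the HDFS Java API to ask the Name Node which Data Nodes hold each block of a file; the file path is a hypothetical example.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; with 64 MB blocks, a 1 GB file is split into ~16 blocks.
    Path file = new Path("/user/hadoop/input/data.txt");
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size: " + status.getBlockSize() + " bytes");

    // The Name Node answers this metadata query; the hosts are Data Nodes.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("Offset " + b.getOffset() + " on hosts "
          + Arrays.toString(b.getHosts()));
    }
  }
}
```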


MapReduce:
MapReduce is a software framework for the parallel processing of large data sets across a distributed cluster of processors or stand-alone computers. It consists of two operations: Map and Reduce.

The Map function takes a set of data and transforms it into a list of key/value pairs, one per element of the input domain.
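As an illustrative sketch (not taken from the presentation), a classic word-count Mapper written against the Hadoop 0.20 mapreduce API: it turns every word of an input line into a (word, 1) key/value pair.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text.
// Output: one (word, 1) pair per word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}
```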


The Reduce function takes the list that resulted from the Map function and reduces the list of key/value pairs based on their key (a single key/value pair results for each key).
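Continuing the same hypothetical word-count example, the Reducer below sums all values that share a key, emitting a single (word, count) pair per key, and a minimal driver wires the two classes into a job. Class names and HDFS paths are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Receives (word, [1, 1, ...]) and emits one (word, count) pair per key.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }

  // Minimal driver: submits the job to the cluster with illustrative paths.
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCountReducer.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/wordcount-in"));
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/wordcount-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```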


Single PC:

PC Name: Dell GX620 (Pentium 4)
No of PCs: 1
RAM: 2 GB
Hard disk: 30 GB
Processor: 3.4 GHz
Operating System: Ubuntu 11.04
Software: eclipse-SDK-3.6.1


Cluster PCs:

Name node
PC Name: Core 2 Duo
No of PCs: 1
RAM: 2 GB
Hard disk: 200 GB
Processor: 3.0 GHz
Operating System: Ubuntu 11.04
Hadoop Version: 0.20.0X

Data nodes
PC Name: Dell Pentium 4
No of PCs: 8
RAM: 1 GB
Hard disk: 30 GB
Processor: 3.0 GHz
Operating System: Ubuntu 11.04
Hadoop Version: 0.20.0X


Sorting Using a Single PC:

We sort data sets of different sizes on a single PC and note the time consumed by the sorting process.

Sorting 1 GB data: when we sort 1 GB of data on a single PC, it takes 22 minutes and 15 seconds.
Sorting 10 GB data: when we sort 10 GB of data on a single PC, it takes 4 hours, 6 minutes and 23 seconds.


Sorting Using the 8-PC Cluster:

We sort the same data sets on the 8-PC cluster and note the time consumed by the sorting process.

Sorting 1 GB data: when we sort 1 GB of data using TeraSort on the multi-node cluster, it takes 3 minutes and 33 seconds to complete the job.

Sorting 10 GB data: when we sort 10 GB of data using TeraSort on the multi-node cluster, it takes 9 minutes and 54 seconds to complete the job.
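For reference, a hedged sketch of how such a TeraSort run can be launched from Java. It assumes the TeraGen and TeraSort classes shipped with the Hadoop 0.20.x examples jar (the benchmark is also commonly started from the command line via the examples jar), and the row count and HDFS paths are illustrative, not the exact ones used in this test.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class ClusterSortTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // TeraGen writes 100-byte rows, so 10,000,000 rows is roughly 1 GB of input.
    ToolRunner.run(conf, new TeraGen(),
        new String[] {"10000000", "/user/hadoop/terasort-in"});

    // Time the distributed sort itself.
    long start = System.currentTimeMillis();
    ToolRunner.run(conf, new TeraSort(),
        new String[] {"/user/hadoop/terasort-in", "/user/hadoop/terasort-out"});
    System.out.println("Sort finished in "
        + (System.currentTimeMillis() - start) / 1000 + " seconds");
  }
}
```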


Comparison of Sorting Times:

Data size    Sorting on single PC      Sorting on cluster
1 GB         22 min 15 sec             3 min 33 sec
10 GB        4 h 6 min 23 sec          9 min 54 sec


[Figure: Comparison Graph for Sorting 1 GB Data (single PC vs. 8-node cluster, sorting time in minutes)]


[Figure: Comparison Graph for Sorting 10 GB Data (single PC vs. 8-node cluster, sorting time in minutes)]


Conclusion:
Sorting data on the cluster takes much less time than sorting the same data on a single PC, and as the data size grows, the cluster's advantage over the single system becomes even larger.


THANKS

