Presented by: Syed Shabi-ul-hasnain Nazir Supervised by: Sir Tahir Roshmi
Distributed Computing:
Distributed computing is a field of computer science that studies distributed systems. A distributed system consists of multiple computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.
In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers.
There are several distributed computing paradigms:
- The Message Passing Paradigm
- The Client-Server Paradigm
- The Peer-to-Peer System Architecture
- The Message System Paradigm
- Remote Procedure Call (RPC)
- The Mobile Agent Paradigm
- The Groupware Paradigm
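Of these paradigms, the client-server model is the simplest to illustrate. Below is a minimal Python sketch; the host, port, and the upper-casing "protocol" are illustrative assumptions, not part of the original presentation:

```python
# A minimal sketch of the client-server paradigm over TCP sockets.
# Host, port, and the upper-casing "protocol" are illustrative assumptions.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 50007  # hypothetical local endpoint

def serve_once():
    """Server: accept a single client and reply with its request upper-cased."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            conn.sendall(conn.recv(1024).upper())

def request(message: str) -> str:
    """Client: send a request and return the server's reply."""
    for _ in range(50):  # retry until the server thread is listening
        try:
            cli = socket.create_connection((HOST, PORT), timeout=2)
            break
        except OSError:
            time.sleep(0.05)
    with cli:
        cli.sendall(message.encode())
        return cli.recv(1024).decode()

server = threading.Thread(target=serve_once)
server.start()
reply = request("hello cluster")
server.join()
print(reply)  # HELLO CLUSTER
```

The same request/reply shape underlies RPC as well; RPC simply hides the socket plumbing behind an ordinary-looking function call.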
Hadoop is a software framework that enables distributed manipulation of large amounts of data, and it does this in a way that is reliable and efficient. Hadoop is reliable because it assumes that computing elements and storage will fail, and therefore maintains several copies of the working data so that processing can be redistributed around failed nodes. Hadoop is efficient because it works on the principle of parallelization, processing data in parallel to increase processing speed.
There are two types of node in HDFS:
1. Name Node - there is only one; it provides metadata services within HDFS.
2. Data Node - serves storage blocks for HDFS.
Files are stored in HDFS in the form of blocks, and the size of one block is 64 MB.
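Given that block size, the number of blocks a file occupies follows directly. A small Python sketch of the arithmetic (the file sizes are illustrative):

```python
import math

BLOCK_SIZE_MB = 64  # HDFS block size used in this setup

def num_blocks(file_size_mb: float) -> int:
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(1024))  # a 1 GB file -> 16 blocks
print(num_blocks(100))   # a 100 MB file -> 2 blocks (64 MB + 36 MB)
```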
MapReduce is a software framework for the parallel processing of large data sets across a distributed cluster of processors or stand-alone computers. It consists of two operations.
The Map function takes a set of data and transforms it into a list of key/value pairs, one per element of the input domain.
The Reduce function takes the list that resulted from the Map function and reduces the list of key/value pairs based on their key (a single key/value pair results for each key).
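The two operations above can be sketched outside Hadoop. A minimal word-count simulation in Python (the sample input lines are illustrative; a real MapReduce job would run these functions in parallel across the cluster):

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit one (word, 1) key/value pair per word in the input element."""
    return [(word, 1) for word in line.split()]

def reduce_fn(pairs):
    """Reduce: sum the values for each key, yielding one result per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data big cluster", "big data"]          # illustrative input
mapped = [pair for line in lines for pair in map_fn(line)]
counts = reduce_fn(mapped)
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```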
Single PC
PC Name: Dell GX 620 (Pentium 4)
No. of PCs: 1
RAM: 2 GB
Hard disk: 30 GB
Processor: 3.4 GHz
Operating System: Ubuntu 11.04
Software: eclipse-SDK-3.6.1

Cluster PCs

Data node
PC Name: Core 2 Duo
No. of PCs: 1
RAM: 2 GB
Hard disk: 200 GB
Processor: 3.0 GHz
Operating System: Ubuntu 11.04
Hadoop Version: 0.20.0X

Name node
PC Name | No. of PCs | RAM | Hard disk | Processor | Operating System | Hadoop Version
Sorting 1 GB of data: when we sort 1 GB of data on a single PC, it takes 22 min 15 s.
Sorting 10 GB of data: when we sort 10 GB of data on a single PC, it takes 4 h 6 min 23 s.
Sorting 1 GB Data
When we sort 1 GB of data using TeraSort on the multi-node cluster, it takes 3 min 33 s to complete the job.
Sorting 10 GB Data
When we sort 10 GB of data using TeraSort on the multi-node cluster, it takes 9 min 54 s to complete the job.
[Chart: sorting time for 1 GB and 10 GB of data, single PC vs. 8-node cluster]
Sorting data on the cluster takes far less time than sorting it on a single PC, and the cluster's advantage grows with data size: the speedup over the single system is larger at 10 GB than at 1 GB.
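The measured times support this conclusion. A quick Python check of the speedup arithmetic, using the timings reported above:

```python
def seconds(h=0, m=0, s=0):
    """Convert an h:m:s timing into seconds."""
    return h * 3600 + m * 60 + s

# (single PC, 8-node cluster) sort times from the measurements above
runs = {
    "1 GB":  (seconds(m=22, s=15), seconds(m=3, s=33)),
    "10 GB": (seconds(h=4, m=6, s=23), seconds(m=9, s=54)),
}

for size, (single, cluster) in runs.items():
    print(f"{size}: speedup = {single / cluster:.1f}x")
# 1 GB: speedup = 6.3x
# 10 GB: speedup = 24.9x
```

So the cluster is roughly 6x faster at 1 GB and roughly 25x faster at 10 GB, which is why the gap widens as the data grows.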
THANKS