INFRASTRUCTURE
Infrastructure for Big Data
• Infrastructure is the cornerstone of Big Data architecture.
• Possessing the right tools for storing, processing and analyzing your data is crucial in any
Big Data project.
• Below, we closely examine several infrastructural approaches: what they are, how they
work, and what each approach is best used for.
HADOOP
• Hadoop is essentially an open-source framework for processing, storing and analyzing
data.
• The fundamental principle behind Hadoop is to split data into many parts and to
process and analyze those parts concurrently.
• MapReduce- The “Map” job distributes work to different nodes, and the “Reduce” job gathers
the partial results and resolves them into a single output.
• YARN- Responsible for cluster management and scheduling user applications.
• Spark- Used on top of HDFS; runs applications up to 100 times faster than
MapReduce.
• Allows data to be loaded in memory and queried repeatedly, making it particularly apt for
machine learning algorithms.
• The main advantages of Hadoop are its cost and time-effectiveness.
• Cost because as it’s open source, it’s free and available for anyone to use, and can run
off cheap commodity hardware.
• Time because it processes multiple ‘parts’ of the data set concurrently, making it a
comparatively fast tool for in-depth analysis.
• However, open source has its drawbacks: the Apache Software Foundation is
constantly updating and developing the Hadoop ecosystem, so components evolve
quickly and require ongoing maintenance expertise.
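The Map/Reduce idea above can be sketched in a few lines of Python. This is an illustrative single-machine word count, not Hadoop's actual API; in a real cluster, each chunk would live on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """'Map' job: emit (word, 1) pairs for one part of the data set."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(mapped_pairs):
    """'Reduce' job: gather intermediate pairs and resolve each key to one value."""
    counts = defaultdict(int)
    for word, n in mapped_pairs:
        counts[word] += n
    return dict(counts)

# Each chunk stands in for a data part stored on a separate node.
chunks = ["big data big ideas", "data moves fast", "big clusters"]

mapped = []
for chunk in chunks:           # Map runs independently on each part
    mapped.extend(map_phase(chunk))

result = reduce_phase(mapped)  # Reduce combines the partial results
print(result["big"])   # 3
print(result["data"])  # 2
```

The same pattern scales out because each Map call touches only its own chunk, so the chunks can be processed on different machines before the Reduce step merges them.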
NOSQL
• NoSQL, which stands for Not Only SQL, is a term used to cover a range of different
database technologies.
• NoSQL is better suited for “operational” tasks: interactive workloads based on selective
criteria where data can be processed in near real-time.
• Since they serve different purposes, Hadoop and NoSQL products are sometimes marketed
concurrently.
• Some NoSQL databases, such as HBase, were primarily designed to work on top of
Hadoop.
• Some big names in NoSQL field include Apache Cassandra, MongoDB, and Oracle NoSQL.
• NoSQL also places less focus on atomicity and consistency than on performance and scalability.
• Premium packages of NoSQL databases (such as DataStax for Cassandra) work to address
these issues.
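The contrast with a fixed relational schema can be illustrated with a toy in-memory "document store" sketch. The collection, field names, and `find` helper below are all hypothetical, but they show the two traits discussed above: documents need not share a schema, and lookups are selective, operational-style queries.

```python
# Toy "document store": records are free-form dicts rather than rows
# with a fixed schema, mimicking the document model of many NoSQL databases.
users = {
    "u1": {"name": "Ana", "city": "Cluj", "tags": ["admin"]},
    "u2": {"name": "Bob", "city": "Iasi"},          # no "tags" field: schemas may differ
    "u3": {"name": "Eve", "city": "Cluj", "age": 30},
}

def find(collection, **criteria):
    """Selective lookup: return every document matching all given criteria."""
    return [doc for doc in collection.values()
            if all(doc.get(k) == v for k, v in criteria.items())]

cluj_users = find(users, city="Cluj")
print([d["name"] for d in cluj_users])  # ['Ana', 'Eve']
```

A real NoSQL database adds indexing, replication, and distribution on top of this model, which is where the performance and scalability trade-offs come in.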
MASSIVELY PARALLEL PROCESSING (MPP)
• As the name might suggest, MPP technologies process massive amounts of data in
parallel.
• Hundreds (or potentially even thousands) of processors, each with its own operating
system and memory, work on different parts of the same program.
• MPP usually runs on expensive data warehouse appliances, whereas Hadoop is most
often run on cheap commodity hardware.
• MPP platforms use SQL, whereas Hadoop uses Java by default.
• MPP has crossovers with the other technologies; Teradata, an MPP technology,
has an ongoing partnership with Hortonworks.
• Several MPP vendors have been acquired by technology behemoths; Netezza, for
instance, is owned by IBM, Vertica by HP, and Greenplum by EMC.
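The shared-nothing pattern behind MPP can be sketched as follows: each "processor" owns its own partition of the data, computes a partial result locally, and a coordinator merges the partials. This is a conceptual sketch only; real MPP appliances use separate machines with their own memory, while here Python threads stand in for them.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 1001))          # values 1..1000, the full "table"
n_workers = 4
# Distribute rows round-robin so each worker owns its own partition.
partitions = [data[i::n_workers] for i in range(n_workers)]

def partial_sum(partition):
    """The work each 'node' does locally, on its own slice only."""
    return sum(partition)

# Workers run in parallel, never touching each other's partitions.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)                # coordinator merges the partial results
print(total)  # 500500
```

An MPP database compiles a SQL aggregate like `SELECT SUM(x) FROM t` into exactly this shape: local partial aggregates per node, then a final merge.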
CLOUD
• Cloud computing refers to a broad set of products that are sold as a service and delivered
over a network.
• With other infrastructural approaches, setting up your Big Data architecture means
buying hardware and software for each person involved in processing and analyzing
your data.
• Data hosted by a third party can raise questions about security; many organizations
choose to host their confidential information in-house, and use the cloud for less private
data.
• A lot of big names in IT offer cloud computing solutions; Google has a whole host of cloud computing
products, including BigQuery, specifically designed for the processing and management of Big Data.
• Amazon Web Services also has a wide range of offerings, including EMR for Hadoop and RDS for MySQL.
• There are also vendors such as Infochimps and Mortar specifically dedicated to offering cloud
computing solutions.