
BIG DATA

INFRASTRUCTURE
Infrastructure for Big Data
• Infrastructure is the cornerstone of Big Data architecture.

• Possessing the right tools for storing, processing and analyzing your data is crucial in any
Big Data project.

• This section closely examines the main infrastructural approaches: what they are, how
they work and what each approach is best used for.
HADOOP
• Hadoop is essentially an open-source framework for processing, storing and analyzing
data.

• The fundamental principle behind Hadoop is to distribute the data into many parts and
to process and analyze those parts concurrently.

• HDFS (Hadoop Distributed File System): the default storage layer.

• MapReduce: the “Map” step distributes the work across different nodes, and the “Reduce”
step gathers the results and resolves them into a single value.
• YARN: responsible for cluster management and scheduling user applications.

• Spark: used on top of HDFS, and can run up to 100 times faster than MapReduce in
some applications.

• Spark allows data to be loaded in memory and queried repeatedly, making it particularly
apt for machine learning algorithms.
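The Map and Reduce steps described above can be sketched in plain Python. This is only a single-machine illustration of the idea, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase) and the sample text are assumptions made up for this example, and the "nodes" are simply items in a list.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs.
    In a real cluster each chunk would be handled by a different node."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the values emitted by all mappers under their key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: resolve each group of values into a single value."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical data set, already split into two parts.
chunks = ["big data needs big tools", "data tools for big data"]

mapped = [map_phase(chunk) for chunk in chunks]  # would run concurrently on a cluster
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Each chunk is mapped independently of the others, which is what allows the work to be spread over many machines.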
• The main advantages of Hadoop are its cost and time-effectiveness.

• Cost, because as open-source software it is free and available for anyone to use, and
can run on cheap commodity hardware.

• Time, because it processes multiple ‘parts’ of the data set concurrently, making it a
comparatively fast tool for in-depth analysis.

• However, open source has its drawbacks: the Apache Software Foundation is constantly
updating and developing the Hadoop ecosystem, so the tools you rely on keep changing.
NOSQL
• NoSQL, which stands for Not Only SQL, is a term used to cover a range of different
database technologies.

• NoSQL is better suited to “operational” tasks: interactive workloads based on selective
criteria, where data can be processed in near real-time.

• Since they serve different purposes, Hadoop and NoSQL products are sometimes marketed
concurrently.

• Some NoSQL databases, such as HBase, were primarily designed to work on top of
Hadoop.
• Some big names in the NoSQL field include Apache Cassandra, MongoDB, and Oracle NoSQL.

• NoSQL also places less focus on atomicity and consistency than on performance and
scalability.

• Premium packages of NoSQL databases (such as Datastax for Cassandra) work to address
these issues.
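The trade-off between consistency and performance mentioned above can be made concrete with a toy model. The sketch below is not how any particular NoSQL product works: ToyStore, its two replicas and the key "cart:42" are hypothetical names invented for this illustration. A write is acknowledged as soon as one replica has it, and the second replica only catches up later.

```python
class ToyStore:
    """Toy two-replica store: fast writes, delayed replication."""

    def __init__(self):
        self.primary = {}    # replica that accepts writes
        self.secondary = {}  # replica that serves some of the reads
        self.pending = []    # writes not yet copied to the secondary

    def write(self, key, value):
        # Acknowledge immediately after the primary has the value (fast),
        # and leave the copy to the secondary for later.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_secondary(self, key):
        return self.secondary.get(key)

    def replicate(self):
        # Background step that brings the secondary up to date.
        for key, value in self.pending:
            self.secondary[key] = value
        self.pending.clear()

store = ToyStore()
store.write("cart:42", ["book"])

stale = store.read_secondary("cart:42")  # secondary has not caught up yet
store.replicate()
fresh = store.read_secondary("cart:42")  # now the write is visible

print(stale, fresh)
```

A reader hitting the secondary between the write and the replication step sees old data, which is the kind of consistency gap the bullets above refer to.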
MASSIVELY PARALLEL PROCESSING (MPP)

• As the name might suggest, MPP technologies process massive amounts of data in
parallel.

• Hundreds (or potentially even thousands) of processors, each with its own operating
system and memory, work on different parts of the same programme.

• MPP usually runs on expensive data warehouse appliances, whereas Hadoop is most
often run on cheap commodity hardware.
• MPP systems are queried with SQL, whereas Hadoop uses Java by default.

• MPP has crossovers with the other technologies: Teradata, an MPP technology,
has an ongoing partnership with Hortonworks.

• Many MPP vendors have been acquired by technology behemoths: Netezza, for
instance, is owned by IBM, Vertica is owned by HP and Greenplum is owned by
EMC.
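The idea of many processors, each with its own memory, running the same SQL over different parts of the data can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the interface of any MPP product: each in-memory SQLite database stands in for one node, the sales table and its rows are made up, and the nodes are queried one after another here, whereas a real MPP system would run them at the same time.

```python
import sqlite3

def make_node(rows):
    """Create one 'node': a separate in-memory database holding its own rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# The same query is sent to every node.
PARTIAL_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region"

def sum_by_region(nodes):
    """Coordinator: merge the partial sums returned by each node."""
    totals = {}
    for node in nodes:
        for region, partial in node.execute(PARTIAL_SQL):
            totals[region] = totals.get(region, 0) + partial
    return totals

# Hypothetical data, distributed round-robin over three nodes.
rows = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 8)]
nodes = [make_node(rows[i::3]) for i in range(3)]

totals = sum_by_region(nodes)
print(totals)
```

No node ever sees the whole table; only the small partial results travel to the coordinator.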
CLOUD
• Cloud computing refers to a broad set of products that are sold as a service and delivered
over a network.

• In other infrastructural approaches, when setting up your Big Data architecture you need
to buy hardware and software for each person involved with the processing and analyzing
of your data.

• In cloud computing, your analysts only require access to one application: a web-based
service where all of the necessary resources and programmes are hosted.
• In cloud computing, up-front costs are minimal as you typically only pay for what
you use, and scale out from there. Amazon Redshift, for instance, allows you to
get started for as little as 25 cents an hour.

• Having data hosted by a third party can raise questions about security; many choose
to host their confidential information in-house, and use the cloud for less private
data.
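The pay-for-what-you-use pricing mentioned above is easy to work through. The sketch below uses the 25 cents an hour starting figure quoted in the text; the usage patterns (8 hours on 22 working days, or always on for 30 days) are assumptions chosen purely for illustration, and real cloud bills depend on many other factors.

```python
RATE_CENTS_PER_HOUR = 25  # starting rate quoted in the text

def monthly_cost_cents(hours_per_day, days):
    """Cost of running one instance for the given hours, in whole cents."""
    return RATE_CENTS_PER_HOUR * hours_per_day * days

def as_dollars(cents):
    """Format a whole number of cents as a dollar amount."""
    return f"${cents // 100}.{cents % 100:02d}"

# Hypothetical usage patterns.
office_hours = monthly_cost_cents(hours_per_day=8, days=22)
always_on = monthly_cost_cents(hours_per_day=24, days=30)

print(as_dollars(office_hours), as_dollars(always_on))
```

Working in whole cents avoids rounding surprises from floating-point arithmetic.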
• A lot of big names in IT offer cloud computing solutions. Google has a whole host of cloud
computing products, including BigQuery, specifically designed for the processing and
management of Big Data.

• Amazon Web Services also has a wide range, including EMR for Hadoop, RDS for MySQL and
DynamoDB for NoSQL.

• There are also vendors such as Infochimps and Mortar specifically dedicated to offering
cloud computing solutions.
