INFRASTRUCTURE
Infrastructure for Big Data
• Infrastructure is the cornerstone of Big Data architecture.
• Possessing the right tools for storing, processing and analyzing your data is crucial in any
Big Data project.
• Below, we closely examine several infrastructural approaches: what they are, how they
work, and what each approach is best used for.
HADOOP
• Hadoop is essentially an open-source framework for processing, storing and analyzing
data.
• The fundamental principle behind Hadoop is to split data into many parts and to
process and analyze those parts concurrently.
• MapReduce- The “Map” job distributes work to different nodes, and the “Reduce” job gathers
the partial results and resolves them into a single output.
• YARN- Responsible for cluster management and scheduling user applications.
• Spark- Used on top of HDFS; runs applications up to 100 times faster than
MapReduce.
• Allows data to be loaded in memory and queried repeatedly, making it particularly apt for
machine learning algorithms.
• The main advantages of Hadoop are its cost and time-effectiveness.
• Cost because as it’s open source, it’s free and available for anyone to use, and can run
off cheap commodity hardware.
• Time because it processes multiple ‘parts’ of the data set concurrently, making it a
comparatively fast tool for in-depth analysis.
• However, open source has its drawbacks: the Apache Software Foundation is
constantly updating and developing the Hadoop ecosystem, so components evolve
quickly and require ongoing maintenance expertise.
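The Map/Reduce idea above can be sketched in a few lines of Python. This is an illustrative single-machine word count, not Hadoop's actual API; in a real cluster, each chunk would live on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """'Map' job: emit (word, 1) pairs for one part of the data set."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(mapped_pairs):
    """'Reduce' job: gather intermediate pairs and resolve each key to one value."""
    counts = defaultdict(int)
    for word, n in mapped_pairs:
        counts[word] += n
    return dict(counts)

# Each chunk stands in for a data part stored on a separate node.
chunks = ["big data big ideas", "data moves fast", "big clusters"]

mapped = []
for chunk in chunks:           # Map runs independently on each part
    mapped.extend(map_phase(chunk))

result = reduce_phase(mapped)  # Reduce combines the partial results
print(result["big"])   # 3
print(result["data"])  # 2
```

The same pattern scales out because each Map call touches only its own chunk, so the chunks can be processed on different machines before the Reduce step merges them.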
NOSQL
• NoSQL, which stands for Not Only SQL, is a term used to cover a range of different
database technologies.
• NoSQL is better suited for “operational” tasks: interactive workloads based on selective
criteria where data can be processed in near real-time.
• Since they serve different purposes, Hadoop and NoSQL products are sometimes marketed
concurrently.
• Some NoSQL databases, such as HBase, were primarily designed to work on top of
Hadoop.
• Some big names in NoSQL field include Apache Cassandra, MongoDB, and Oracle NoSQL.
• NoSQL also places less focus on atomicity and consistency than on performance and scalability.
• Premium packages of NoSQL databases (such as DataStax for Cassandra) work to address
these issues.
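The contrast with a fixed relational schema can be illustrated with a toy in-memory "document store" sketch. The collection, field names, and `find` helper below are all hypothetical, but they show the two traits discussed above: documents need not share a schema, and lookups are selective, operational-style queries.

```python
# Toy "document store": records are free-form dicts rather than rows
# with a fixed schema, mimicking the document model of many NoSQL databases.
users = {
    "u1": {"name": "Ana", "city": "Cluj", "tags": ["admin"]},
    "u2": {"name": "Bob", "city": "Iasi"},          # no "tags" field: schemas may differ
    "u3": {"name": "Eve", "city": "Cluj", "age": 30},
}

def find(collection, **criteria):
    """Selective lookup: return every document matching all given criteria."""
    return [doc for doc in collection.values()
            if all(doc.get(k) == v for k, v in criteria.items())]

cluj_users = find(users, city="Cluj")
print([d["name"] for d in cluj_users])  # ['Ana', 'Eve']
```

A real NoSQL database adds indexing, replication, and distribution on top of this model, which is where the performance and scalability trade-offs come in.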
MASSIVELY PARALLEL PROCESSING (MPP)
• As the name might suggest, MPP technologies process massive amounts of data in
parallel.
• Hundreds (or potentially even thousands) of processors, each with its own operating
system and memory, work on different parts of the same program.
• MPP usually runs on expensive data warehouse appliances, whereas Hadoop is most
often run on cheap commodity hardware.
• MPP platforms use SQL, whereas Hadoop uses Java by default.
• MPP has crossovers with the other technologies; Teradata, an MPP technology,
has an ongoing partnership with Hortonworks.
• Several MPP vendors have been acquired by technology behemoths; Netezza, for
instance, is owned by IBM, Vertica by HP, and Greenplum by EMC.
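The shared-nothing pattern behind MPP can be sketched as follows: each "processor" owns its own partition of the data, computes a partial result locally, and a coordinator merges the partials. This is a conceptual sketch only; real MPP appliances use separate machines with their own memory, while here Python threads stand in for them.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 1001))          # values 1..1000, the full "table"
n_workers = 4
# Distribute rows round-robin so each worker owns its own partition.
partitions = [data[i::n_workers] for i in range(n_workers)]

def partial_sum(partition):
    """The work each 'node' does locally, on its own slice only."""
    return sum(partition)

# Workers run in parallel, never touching each other's partitions.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)                # coordinator merges the partial results
print(total)  # 500500
```

An MPP database compiles a SQL aggregate like `SELECT SUM(x) FROM t` into exactly this shape: local partial aggregates per node, then a final merge.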
CLOUD
• Cloud computing refers to a broad set of products that are sold as a service and delivered
over a network.
• With other infrastructural approaches, setting up your Big Data architecture means
buying hardware and software for each person involved in processing and analyzing
your data.
• Data hosted by a third party can raise questions about security; many organizations
choose to host their confidential information in-house, and use the cloud for less private
data.
• A lot of big names in IT offer cloud computing solutions; Google has a whole host of cloud computing
products, including BigQuery, specifically designed for the processing and management of Big Data.
• Amazon Web Services also has a wide range of offerings, including EMR for Hadoop and RDS for MySQL.
• There are also vendors such as Infochimps and Mortar specifically dedicated to offering cloud
computing solutions.