
High Performance Analytics using Big Data through Hadoop

1. INTRODUCTION

In today's fast-paced business environment, obtaining results quickly is a key consideration for big data analytics. Big data analytics has attracted a great deal of interest in recent years due to its immense applications on the World Wide Web. The huge datasets involved can be structured or unstructured. In the past, the challenge was to create and obtain data; now, and even more so in the future, the challenge is what to do with all the available data. Cloud-based big data analytics is a system for dealing with big data that performs linear regression and similar predictive analyses with ease, and it proves very helpful for engineering, research, business, airlines, and other domains where complicated statistical analysis must be performed.
In enterprises, the problem is that analysts have to rely on desktop applications such as Excel, Minitab, and SPSS, where vertical scaling of memory is always limited, so distributed platforms come into the picture to deal with big data. There are commercial big data analytics tools such as IBM BigInsights and Amazon Web Services. To store huge volumes of data we use Amazon Web Services, which offers excellent services for handling big data; in particular, we use Amazon S3 to store the data.

How are Hadoop and big data related?

Hadoop is commonly used to process big data workloads because it is massively scalable. To
increase the processing power of your Hadoop cluster, add more servers with the required
CPU and memory resources to meet your needs.

Hadoop provides a high level of durability and availability while still being able to process
computational analytical workloads in parallel. The combination of availability, durability,
and scalability of processing makes Hadoop a natural fit for big data workloads. You can use
Amazon EMR to create and configure a cluster of Amazon EC2 instances running Hadoop
within minutes, and begin deriving value from your data.
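
As a hedged illustration of how quickly such a cluster can be created programmatically, here is a minimal sketch using the AWS SDK for Python (boto3). The instance types, counts, release label, and bucket name are placeholder assumptions, not values from this report.

```python
# Hypothetical sketch: launching a small Hadoop cluster on Amazon EMR with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hadoop-analytics-cluster",
    ReleaseLabel="emr-5.36.0",                 # EMR release bundling Hadoop, YARN, HDFS
    Applications=[{"Name": "Hadoop"}],
    LogUri="s3://my-log-bucket/emr-logs/",     # placeholder bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                    # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,   # keep alive so steps can be added later
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```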


2. EXISTING SYSTEM

Several weaknesses in the Hadoop platform have been identified as its adoption rate has
increased. These weaknesses have stalled Hadoop projects or prevented Hadoop adoption in
many cases. SAS has specifically sought to address these weaknesses.
The shortage of skilled MapReduce coders in the current marketplace is well known. SAS
addresses this problem with graphical drag-and-drop interfaces that allow the definition of
data preparation and analytics workflows. These graphical workflows can be designed by
non-programmers and can use the MapReduce framework to profile, prepare, transform, and
cleanse data in parallel across the cluster.
MapReduce is very batch oriented and, in many ways, not appropriate for iterative, multistep analytics algorithms. In particular, its strict paradigm of performing a shuffle and a write to disk between each step in a process causes multiple intermediate files to be created, which is highly inefficient. By pulling the Hadoop data into an in-memory format, SAS In-Memory Statistics and SAS Visual Analytics, for example, provide algorithms that can apply multiple steps without touching disk. This vastly increases the productivity of data scientists and business analysts.
One of the difficulties associated with the Hadoop data lake architecture is gaining an initial understanding of the content, combinations, and potential correlations of the many types of data stored there.


3. PROPOSED SYSTEM

We intend to overcome all these obstacles and build a user-friendly SaaS platform.
It is a cloud-based web application which stores data in Amazon S3. Because the system supports a dynamic, optimized cluster size for the desired completion time, the user does not need to calculate or estimate the number of nodes. The system uses Amazon EMR and the MapReduce paradigm, with the open-source R scripting language, to perform analysis of big data within the desired time.
Amazon EMR is a service of Amazon Web Services on which one can create a cluster. An Amazon EMR cluster is a Hadoop cluster that runs MapReduce programs, but one must install the dependency packages and create the MapReduce program that is fed to the cluster. Although this means that users no longer have to purchase hardware, rack it, network it, and pay to run it, inexperienced users are not aware of the optimum cluster size that will best suit their needs.
This project, offered as a service platform, does not require expertise in computer technology, making it useful even for those with only a basic aptitude for computation.
Amazon S3 cloud storage is used here, permitting users to upload data. Users can perform file security and management operations. Data security, availability, and storage are achieved by using Amazon S3. Because Amazon S3 and Amazon EMR are connected within the same region, faster data transfer can be achieved, eliminating the problem of network latency.
While using this project, the Hadoop cluster is out of the user's concern, because the system takes it upon itself to create an optimized, dynamic set of Hadoop cluster nodes depending on the user's data size, desired completion time, and other properties.
Using the MapReduce paradigm to run across multiple cores and distributed nodes makes analytic performance high. MapReduce splits the data and then combines it, reducing duplicates along the way; here we use <key,value> pairs, as illustrated in the sketch below. The data is replicated across the master node and the core nodes: the master node assigns tasks to the core nodes, and the core nodes process those tasks accordingly.
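
As a concrete illustration of this <key,value> flow, here is a minimal Hadoop-streaming sketch in Python that computes the sufficient statistics for simple linear regression in the map step and solves for the coefficients in the reduce step. The system described in this report drives MapReduce from R; this Python version, and its two-column (x,y) CSV input layout, are assumptions made purely for illustration.

```python
# mapper.py -- emits one <key, value> pair of partial sums per mapper.
import sys

n = sx = sy = sxy = sxx = 0.0
for line in sys.stdin:
    try:
        x, y = map(float, line.strip().split(","))
    except ValueError:
        continue                      # skip headers or malformed rows
    n += 1; sx += x; sy += y; sxy += x * y; sxx += x * x

if n:
    # the shared key "stats" routes every partial sum to the same reducer
    print("stats\t%f,%f,%f,%f,%f" % (n, sx, sy, sxy, sxx))
```

```python
# reducer.py -- sums the partial statistics and solves the normal equations.
import sys

n = sx = sy = sxy = sxx = 0.0
for line in sys.stdin:
    _, value = line.strip().split("\t")
    pn, px, py, pxy, pxx = map(float, value.split(","))
    n += pn; sx += px; sy += py; sxy += pxy; sxx += pxx

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print("slope=%f\tintercept=%f" % (slope, intercept))
```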


4. ARCHITECTURE

Hadoop commonly refers to the actual Apache Hadoop project, which includes MapReduce
(execution framework), YARN (resource manager), and HDFS (distributed storage). You can
also install Apache Tez, a next-generation framework which can be used instead of Hadoop
MapReduce as an execution engine.

Amazon EMR also includes EMRFS, a connector allowing Hadoop to use Amazon S3 as a
storage layer.


However, there are also other applications and frameworks in the Hadoop ecosystem,
including tools that enable low-latency queries, GUIs for interactive querying, a variety of
interfaces like SQL, and distributed NoSQL databases. The Hadoop ecosystem includes
many open source tools designed to build additional functionality on Hadoop core
components, and you can use Amazon EMR to easily install and configure tools such as
Hive, Pig, Hue, Ganglia, Oozie, and HBase on your cluster. You can also run other
frameworks, like Apache Spark for in-memory processing, or Presto for interactive SQL, in
addition to Hadoop on Amazon EMR.

Amazon EMR programmatically installs and configures applications in the Hadoop project,
including Hadoop MapReduce, YARN, and HDFS, across the nodes in your cluster.
However, starting with Amazon EMR release 5.x, Hive and Pig use Apache Tez instead of
Hadoop MapReduce as an execution engine.



5. SYSTEM COMMUNICATION

The system communication pattern comprises the following interactions:


Input:
Data in .csv format is the input to the system.
Output:
The output of the system is the result of the linear regression, in text format, which is
stored in Amazon S3.
Application Server and Database Server:
The database server stores user details, data details, and job history. The application
server is responsible for authenticating users entering the system through the web
application; i.e., security is enforced jointly by the application server and the database server.
Application Server and Amazon S3:
The application server communicates with Amazon S3 using the AWS SDK for Java,
helping users store their data, browse it, and perform data management operations such as
upload, download, view, rename, and delete from a web browser.
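
The report performs these operations through the AWS SDK for Java; the equivalent calls sketched in Python with boto3 look like the following. Bucket and object names are placeholders, and note that S3 has no native rename, so rename is implemented as copy-then-delete.

```python
import boto3

s3 = boto3.client("s3")

# upload
s3.upload_file("dataset.csv", "my-data-bucket", "uploads/dataset.csv")

# list (browse)
for obj in s3.list_objects_v2(Bucket="my-data-bucket",
                              Prefix="uploads/").get("Contents", []):
    print(obj["Key"], obj["Size"])

# download (view)
s3.download_file("my-data-bucket", "uploads/dataset.csv", "local-copy.csv")

# rename = copy + delete
s3.copy_object(Bucket="my-data-bucket",
               CopySource={"Bucket": "my-data-bucket", "Key": "uploads/dataset.csv"},
               Key="uploads/renamed.csv")
s3.delete_object(Bucket="my-data-bucket", Key="uploads/dataset.csv")
```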
Application Server and Cluster Optimization:
The application server has a cluster-optimization module that initiates a cluster with
an optimal number of nodes based on the volume of the input data and the expected time to
complete the analysis.
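
The report does not give the module's formula, so the following is a purely hypothetical sizing heuristic of the kind such a module might use; the throughput constant and node cap are illustrative assumptions.

```python
import math

def estimate_core_nodes(data_size_gb, desired_minutes,
                        gb_per_node_per_minute=0.5, max_nodes=20):
    """Estimate how many core nodes are needed to process `data_size_gb`
    within `desired_minutes`, given an assumed per-node throughput."""
    needed = data_size_gb / (gb_per_node_per_minute * desired_minutes)
    return max(1, min(max_nodes, math.ceil(needed)))

# e.g. 50 GB in 30 minutes at 0.5 GB/node/minute -> ceil(50/15) = 4 core nodes
print(estimate_core_nodes(50, 30))
```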
Application Server and EMR Cluster:
The application server initiates a cluster with the optimal number of nodes and passes
it the location of the file in Amazon S3 on which linear regression needs to be performed,
according to the completion time desired by the user.
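
A hedged sketch of how the application server might submit such a job with boto3; the cluster ID, script locations, and S3 paths are placeholders, and the mapper/reducer scripts are the regression sketch shown earlier.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",              # placeholder cluster ID
    Steps=[{
        "Name": "linear-regression",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",      # runs hadoop-streaming on EMR
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-code-bucket/mapper.py,s3://my-code-bucket/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input",  "s3://my-data-bucket/uploads/dataset.csv",
                "-output", "s3://my-data-bucket/results/",
            ],
        },
    }],
)
```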
EMR Cluster and Amazon S3:
The EMR cluster consists of one master node and several core nodes. Upon initiation,
the cluster performs a bootstrap action in which all the required packages and their
dependencies are installed on the cluster, thereby preparing it to perform the MapReduce
operation for linear regression.
Once ready, the system can regress big data, beginning with EMR copying the script
that contains the code to perform MapReduce on the given dataset. When set to begin, the
EMR cluster starts fetching data from Amazon S3; the data is received by the Hadoop
Distributed File System of the EMR cluster, and the generation of map and reduce tasks for
distribution begins.
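
A sketch of how the bootstrap action described above could be declared when the cluster is created; the script path is a placeholder, and the script itself (for example, one installing R and the required packages) would live in Amazon S3.

```python
# Passed to run_job_flow as: emr.run_job_flow(..., BootstrapActions=bootstrap_actions)
bootstrap_actions = [{
    "Name": "install-analysis-dependencies",
    "ScriptBootstrapAction": {
        "Path": "s3://my-code-bucket/bootstrap/install_dependencies.sh",  # placeholder
        "Args": [],
    },
}]
```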


6. SOFTWARE COMPONENTS

Apache Hadoop:
Apache Hadoop consists of the Hadoop Common libraries and utilities; HDFS (Hadoop
Distributed File System), which stores data across multiple nodes; YARN (Yet Another
Resource Negotiator), which provides resource management for processes running on Hadoop;
and MapReduce, which has two steps, map and reduce. In the map step, the master node takes
the input, partitions it into sub-problems, and distributes them to worker nodes; the reduce step
combines the partial results. The word-count sketch below illustrates these two steps.
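
To make the two steps concrete, here is the classic word-count example written for Hadoop streaming in Python; it is an illustration, not code from this report.

```python
# mapper.py -- emit a <word, 1> pair for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
# reducer.py -- streaming input arrives sorted by key, so all counts for a
# word are adjacent and can be summed in one pass.
import sys

current, count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```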

Software components that run on top of or alongside Hadoop include:
Ambari – a web interface for managing, configuring, and testing Hadoop services and components.
Hive – a data warehousing and SQL-like query facility that presents data in the form of tables.
Oozie – a Hadoop job scheduler.
HBase – a distributed database that runs on top of Hadoop; HBase tables can serve as input and output for MapReduce jobs.
Pig – a platform for data extraction, transformation, and loading.
Spark – a cluster computing framework with in-memory analytics.
Sqoop – a tool that moves data between Hadoop and relational databases.
ZooKeeper – an application that coordinates distributed processes.

R Language:

RHadoop's free libraries enable users to leverage the Hadoop data-computing environment
to manage their data. With terabytes of data at hand, every business is trying to figure
out the best way to understand information about its customers and itself. But simply
using Excel pivot tables to analyse such quantities of information is absurd, so many
companies use the commercially available SAS tool for business intelligence. But SAS is
no match for the open-source language that pioneering data scientists use in academia, known
simply as R. The R programming language leans more frequently toward the cutting edge
of data science, giving businesses the latest data analysis tools. Given its massive scalability
and lower costs, Hadoop is ideally suited for common ETL workloads such as collecting,
sorting, joining, and aggregating big datasets for easier consumption by downstream systems.

MAPREDUCE:
It is a programming framework for efficiently processing very large amounts of data
stored in HDFS. While several programming frameworks for Hadoop exist, few are tuned
to the needs of data analysts, who typically work in the R environment. That is why the
development team at Revolution Analytics created the RHadoop project, to give R programmers
powerful open-source tools for analysing data stored in Hadoop.


Amazon EMR processes big data across a Hadoop cluster of virtual servers on Amazon
Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The "Elastic" in
EMR's name refers to its dynamic resizing ability, which allows it to ramp up or reduce
resource use depending on the demand at any given time.


7. ADVANTAGES OF HADOOP ON AMAZON EMR

Increased speed and agility

You can initialize a new Hadoop cluster dynamically and quickly, or add servers to
your existing Amazon EMR cluster, significantly reducing the time it takes to make resources
available to your users and data scientists. Using Hadoop on the AWS platform can
dramatically increase your organizational agility by lowering the cost and time it takes to
allocate resources for experimentation and development.

Reduced administrative complexity

Hadoop configuration, networking, server installation, security configuration, and
ongoing administrative maintenance can be a complicated and challenging activity. As a
managed service, Amazon EMR addresses your Hadoop infrastructure requirements so you
can focus on your core business.

Integration with other cloud services

You can easily integrate your Hadoop environment with other services such
as Amazon S3, Amazon Kinesis, Amazon Redshift, and Amazon DynamoDB to enable data
movement, workflows, and analytics across the many diverse services on the AWS platform.
Additionally, you can use the AWS Glue Data Catalog as a managed metadata repository for
Apache Hive and Apache Spark.

Pay for clusters only when you need them

Many Hadoop jobs are spiky in nature. For instance, an ETL job can run hourly,
daily, or monthly, while modelling jobs for financial firms or genetic sequencing may occur
only a few times a year. Using Hadoop on Amazon EMR allows you to spin up these
workload clusters easily, save the results, and shut down your Hadoop resources when
they’re no longer needed, to avoid unnecessary infrastructure costs.
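
One hedged sketch of this pattern with boto3: setting KeepJobFlowAliveWhenNoSteps to False makes EMR terminate the cluster on its own once the submitted steps finish, so you pay only while the job runs. The names and step contents are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="transient-etl-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # shut down after the last step
    },
    Steps=[{                                    # placeholder ETL step
        "Name": "hourly-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-input", "s3://my-data-bucket/in/",
                     "-output", "s3://my-data-bucket/out/",
                     "-mapper", "cat", "-reducer", "cat"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```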

Improved availability and disaster recovery

By using Hadoop on Amazon EMR, you have the flexibility to launch your clusters in
any number of Availability Zones in any AWS region. A potential problem or threat in one
region or zone can be easily circumvented by launching a cluster in another zone in minutes.


Flexible capacity

Capacity planning prior to deploying a Hadoop environment can often result in
expensive idle resources or resource limitations. With Amazon EMR, you can create clusters
with the required capacity within minutes and use Auto Scaling to dynamically scale out and
scale in nodes.
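
A hedged boto3 sketch of such an automatic scaling policy: the cluster and instance-group IDs are placeholders, and the thresholds are illustrative. This rule adds one core node whenever available YARN memory stays below 15%, within a 2-to-10 node band.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",               # placeholder
    InstanceGroupId="ig-XXXXXXXXXXXXX",        # placeholder core instance group
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
        "Rules": [{
            "Name": "scale-out-on-low-memory",
            "Action": {"SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": 1,        # add one node per trigger
                "CoolDown": 300,
            }},
            "Trigger": {"CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "EvaluationPeriods": 1,
                "MetricName": "YARNMemoryAvailablePercentage",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Threshold": 15.0,
                "Statistic": "AVERAGE",
                "Unit": "PERCENT",
            }},
        }],
    },
)
```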


8. INTEGRATION OF TECHNOLOGIES

If you have noticed, technologies like IoT, machine learning, artificial intelligence,
and more are making their way into our everyday lives. Behind all of these is big data,
sitting strong in an authoritative position. There are devices talking to each other over a
connected network, sharing and generating data as you use them, and there are algorithms
learning patterns and processing information from the generated data. A simple example of
the Internet of Things is a smart television that is connected to your home network and
generates data on your viewing patterns, interests, and more.

With social apps installed, it also takes into consideration your personal tastes and
preferences and cumulatively works on personas like yours to deliver better online content
and streaming options. You would be amazed to know that the blockbuster series House of
Cards was the result of big data analytics!

Together, these technologies are designed to offer the best of convenience and support to
consumers and industries globally. Amazon's warehouses are largely automated, and there are
tech companies that have replaced manpower with simple code for monotonous jobs. Yet as
much as redundancy is eliminated by big data and analytics, newer opportunities are equally
arising on the other side.

The Volume of Data Generated will Continue to Increase

One of the most reassuring things about the future of big data and Hadoop is that the
amount of data generated every day will only continue to grow. As of now, we generate
approximately 2.3 trillion gigabytes of data every day, and this will only grow in the future.
Smartwatches, smart televisions, and smart wearable technology on the market further collect
data from consumers, leaving scope only for the generation of ever more data. More on this
in the next section.


9. FUTURE SCOPE

This industry has been evolving since the day of its inception, touching industries and
companies for the better. As it continues to impact companies, the future of big data, in terms
of its market share and patronage around the globe, is expected to increase manifold by the
year 2020.

DARK DATA:

One of the first big data trends we foresee for 2018 is the emergence of dark data. It is
a known fact that the integration of digital data and its exploitation through analytics has
been fetching humongous rewards for brands and businesses around the world. To complement
this technology further, dark data will make its way in and mark the future of big data in
the coming years or even months.

In simple terms, dark data refers to data from non-digital sources, and to digital data whose
value has been underestimated by analytics experts. These data sets are usually untapped,
unstructured, and untagged, and are also referred to as dusty data. The outlook for big data
predicts that such data sets will come into the limelight this year and further revolutionize
the technology.

PRIVACY:

While privacy continues to be one of the major shortcomings of this technology, it is
also promising to note that an antidote will soon hit the market. Today, we hardly have any
idea of how the data we generate is used and shared among companies; as far as the future of
big data and Hadoop is concerned, this opacity is expected to decline. Major companies
around the world will wake up to this emerging challenge and will seek action on the legal
front. Newer policies will be made, and laws will be amended for data consumption and
analytics, paving the way for a safer ecosystem in which consumers can generate data.


10. CONCLUSION

Big data has taken the world by storm. It is said that the next decade will be
dominated by big data, wherein all companies will use the data available to them to learn
about their ecosystems and to remedy their shortcomings. All major universities and
companies have started investing in building tools that help them understand and create
useful insights from the data they have access to. One such tool that helps in analysing and
processing big data is Hadoop. By training the linear regression model, we develop an
algorithm that analyses the data and helps organizations make decisions in their respective
fields. Dynamic scalability improves overall processing performance, enabling efficient
decision-making for an individual or an organization.

The world is changing the way it operates, and big data is playing an important role in
that change. Hadoop is a framework that makes an engineer's life easier while working on
large sets of data. There are improvements on all fronts, and the future is exciting.
