1. INTRODUCTION
Hadoop is commonly used to process big data workloads because it is massively scalable. To
increase the processing power of your Hadoop cluster, add more servers with the required
CPU and memory resources to meet your needs.
Hadoop provides a high level of durability and availability while still being able to process
computational analytical workloads in parallel. The combination of availability, durability,
and scalability of processing makes Hadoop a natural fit for big data workloads. You can use
Amazon EMR to create and configure a cluster of Amazon EC2 instances running Hadoop
within minutes, and begin deriving value from your data.
Dept.cse.ACEM Page 1
High Performance Analytics using Big Data through Hadoop
2. EXISTING SYSTEM
Several weaknesses in the Hadoop platform have been identified as its adoption rate has
increased. These weaknesses have stalled Hadoop projects or prevented Hadoop adoption in
many cases. SAS has specifically sought to address these weaknesses.
The shortage of skilled MapReduce coders in the current marketplace is well known. SAS
addresses this problem with graphical drag-and-drop interfaces that allow the definition of
data preparation and analytics workflows. These graphical workflows can be designed by
non-programmers and can use the MapReduce framework to profile, prepare, transform, and
cleanse data in parallel across the cluster.
MapReduce is very batch oriented, and in many ways, not appropriate for iterative, multistep
analytics algorithms. In particular, its strict paradigm of doing a shuffle and write to disk
between each step in a process would cause multiple intermediate files to be created. This is
highly inefficient. By pulling the Hadoop data into an in-memory format, In-Memory
Statistics and SAS Visual Analytics, for example, provide algorithms that can apply to
multiple steps without touching disk. This vastly increases the productivity of data scientists
and business analysts.
One of the difficulties associated with the Hadoop data lake architecture is gaining an initial
understanding of the content, combinations, and potential correlations of the many types of
data stored there.
3. PROPOSED SYSTEM
We intend to overcome these obstacles by building a user-friendly SaaS platform.
It is a cloud-based web application that stores data in Amazon S3. Because the system sizes the
cluster dynamically and optimally for the desired completion time, the user does not need to
calculate or estimate the number of nodes. The system uses Amazon EMR and the MapReduce
paradigm, with the open-source R scripting language, to analyze big data within the desired time.
Amazon EMR is an Amazon Web Services offering in which one can create a managed cluster.
An Amazon EMR cluster is a Hadoop cluster that runs MapReduce programs, but the user must
install dependency packages and write a MapReduce program to be fed to the cluster.
Although this means that users no longer have to purchase hardware, rack it, network it, and
pay to run it, inexperienced users are often not aware of the optimal cluster size that will best
suit their needs.
As a service platform, this project does not require expertise in computer technology,
making it useful even for those with only a basic aptitude for computation.
Amazon S3 cloud storage is used here, permitting users to upload data and to manage and
secure their files. Data security, availability, and durable storage are achieved by using
Amazon S3. Because Amazon S3 and Amazon EMR are connected within the same region, faster
data transfer can be achieved, reducing the problem of network latency.
While using this project, the Hadoop cluster is no concern of the user, because the system
takes it upon itself to create an optimized, dynamically sized Hadoop cluster depending on the
user's data size, desired completion time, and other properties.
Running the MapReduce paradigm across multiple cores and distributed nodes yields high
analytic performance. MapReduce splits the input data, processes it as <key,value> pairs, and
combines the intermediate results in the reduce step, eliminating duplicates along the way.
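As an illustrative sketch only (not the project's actual code), the map/shuffle/reduce flow over <key,value> pairs can be shown with a word count in plain Python:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a <key, value> pair (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key (here, sum the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big analytics", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1}
```

On a real cluster, the map and reduce phases run in parallel across nodes and the shuffle moves data over the network; the single-machine version above only shows the dataflow.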
The data is replicated across the master node and the core nodes; the master node assigns
tasks to the core nodes, which take those tasks and process them accordingly.
4. ARCHITECTURE
Hadoop commonly refers to the actual Apache Hadoop project, which includes MapReduce
(execution framework), YARN (resource manager), and HDFS (distributed storage). You can
also install Apache Tez, a next-generation framework which can be used instead of Hadoop
MapReduce as an execution engine.
Amazon EMR also includes EMRFS, a connector allowing Hadoop to use Amazon S3 as a
storage layer.
However, there are also other applications and frameworks in the Hadoop ecosystem,
including tools that enable low-latency queries, GUIs for interactive querying, a variety of
interfaces like SQL, and distributed NoSQL databases. The Hadoop ecosystem includes
many open source tools designed to build additional functionality on Hadoop core
components, and you can use Amazon EMR to easily install and configure tools such as
Hive, Pig, Hue, Ganglia, Oozie, and HBase on your cluster. You can also run other
frameworks, like Apache Spark for in-memory processing, or Presto for interactive SQL, in
addition to Hadoop on Amazon EMR.
Amazon EMR programmatically installs and configures applications in the Hadoop project,
including Hadoop MapReduce, YARN, and HDFS, across the nodes in your cluster.
However, starting with Amazon EMR release 5.x, Hive and Pig use Apache Tez instead of
Hadoop MapReduce as an execution engine.
5. SYSTEM COMMUNICATION
6. SOFTWARE COMPONENTS
Apache Hadoop:
The core of Apache Hadoop consists of the Hadoop common libraries and utilities; HDFS (the
Hadoop Distributed File System), which stores data across multiple nodes; YARN (Yet Another
Resource Negotiator), which provides resource management for processes running on Hadoop; and
MapReduce, which has two steps. In the Map step, the master node takes the input, partitions
it into sub-problems, and distributes them to worker nodes; in the Reduce step, the partial
results are combined.
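The partition-distribute-combine pattern described above can be imitated on a single machine, with a pool of workers standing in for cluster nodes. This is a hedged local sketch of the idea only, not the Hadoop API:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Worker task: each "node" handles one sub-problem.
    return n * n

# Partition the input and distribute the pieces to a pool of workers
# (threads here play the role of cluster nodes).
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(square, range(8)))

# Combine step: merge the partial results into one answer.
total = sum(partial_results)
print(total)  # 140
```

`pool.map` returns results in input order, which mirrors how the framework, not the programmer, handles distribution and collection of the sub-problems.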
Software components that run on top of or alongside Hadoop include the following. Ambari – A
web interface for managing, configuring, and testing Hadoop services and components. Hive – A
data warehousing layer with an SQL-like query language that presents data in the form of
tables. Oozie – A Hadoop job scheduler. HBase – A distributed database that runs on top of
Hadoop; HBase tables can serve as input and output for MapReduce jobs. Pig – A platform where
data extraction and loading are done. Spark – A cluster computing framework with in-memory
analytics. Sqoop – A tool that moves data between Hadoop and relational databases.
ZooKeeper – An application that coordinates distributed processes.
R Language:
RHadoop's free libraries enable users to leverage the Hadoop computing environment
to manage their data. With terabytes of data at hand, every business is trying to figure
out the best way to understand information about its customers and itself. But simply
using Excel pivot tables to analyze such quantities of information is impractical, so many
companies use the commercially available tool SAS for business intelligence. SAS, however, is
no match for the open-source language that pioneering data scientists use in academia,
known simply as R. The R programming language sits closer to the cutting edge of data
science, giving businesses the latest data analysis tools. Given its massive scalability
and lower costs, Hadoop is ideally suited for common ETL workloads such as collecting,
sorting, joining, and aggregating big datasets for easier consumption by downstream systems.
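As a small illustration of the ETL steps named above (a hedged sketch with made-up records, not production code), the following joins two data sets on a key and aggregates the result:

```python
from collections import defaultdict

# Hypothetical extracted records: orders plus a customer lookup table.
orders = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 20.0},
    {"customer_id": 1, "amount": 50.0},
]
customers = {1: "Alice", 2: "Bob"}

# Transform: join each order with its customer, then aggregate totals per name.
totals = defaultdict(float)
for order in orders:
    name = customers[order["customer_id"]]
    totals[name] += order["amount"]

# Load: sort the aggregated output for easy downstream consumption.
report = sorted(totals.items())
print(report)  # [('Alice', 80.0), ('Bob', 20.0)]
```

On Hadoop, the same join and aggregation would be expressed as MapReduce jobs (or Hive/Pig queries) running in parallel over much larger inputs.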
MAPREDUCE:
MapReduce is a programming framework for efficiently processing very large amounts of data
stored in HDFS. But while several programming frameworks for Hadoop exist, few are tuned
to the needs of data analysts, who typically work in the R environment. That is why the
development team at Revolution Analytics created the RHadoop project, to give R programmers
powerful open-source tools to analyze data stored in Hadoop.
You can initialize a new Hadoop cluster dynamically and quickly, or add servers to
your existing Amazon EMR cluster, significantly reducing the time it takes to make resources
available to your users and data scientists. Using Hadoop on the AWS platform can
dramatically increase your organizational agility by lowering the cost and time it takes to
allocate resources for experimentation and development.
You can easily integrate your Hadoop environment with other services such
as Amazon S3, Amazon Kinesis, Amazon Redshift, and Amazon DynamoDB to enable data
movement, workflows, and analytics across the many diverse services on the AWS platform.
Additionally, you can use the AWS Glue Data Catalog as a managed metadata repository for
Apache Hive and Apache Spark.
Many Hadoop jobs are spiky in nature. For instance, an ETL job can run hourly,
daily, or monthly, while modelling jobs for financial firms or genetic sequencing may occur
only a few times a year. Using Hadoop on Amazon EMR allows you to spin up these
workload clusters easily, save the results, and shut down your Hadoop resources when
they’re no longer needed, to avoid unnecessary infrastructure costs.
By using Hadoop on Amazon EMR, you have the flexibility to launch your clusters in
any number of Availability Zones in any AWS region. A potential problem or threat in one
region or zone can be easily circumvented by launching a cluster in another zone in minutes.
Flexible capacity
8. INTEGRATION OF TECHNOLOGIES
As you may have noticed, technologies like IoT, machine learning, artificial intelligence,
and more are making their way into our everyday lives. Behind all of these sits Big Data
in an authoritative position. Devices talk to each other over a connected network, sharing
and generating data as you use them, and algorithms learn patterns and process information
from the generated data. A simple example of the Internet of Things is a smart television
that is connected to your home network and generates data on your viewing patterns,
interests, and more.
With social apps installed, it also takes your personal tastes and preferences into
consideration, working cumulatively on personas like yours to deliver better online content
and streaming options. You might be amazed to know that the massive blockbuster House of
Cards was the result of Big Data analytics!
Together, these technologies are designed to offer the best of convenience and support to
consumers and industries globally. Amazon's warehouses are largely automated, and some tech
companies have replaced manpower with simple code for monotonous jobs. Yet even as Big Data
and analytics eliminate redundant work, new opportunities are arising on the other side
as well.
One of the most reassuring things about the future of Big Data and Hadoop is that the
amount of data generated every day will only continue to grow. As of now, we generate
approximately 2.3 trillion gigabytes of data every day, and this will only grow in the
future. Smartwatches, smart televisions, and smart wearables on the market collect still
more data from consumers, leaving scope only for ever more massive data generation.
More on this below.
9. FUTURE SCOPE
This industry has been evolving since the day of its inception and touching industries and
companies for the better. As it continues to impact companies, the future of big data
regarding its market share and patronage around the globe is only expected to increase
manifold by the year 2020.
DARK DATA:
One of the first big data trends we foresee for 2018 is the emergence of dark data. It is
a known fact that the integration of digital data and its implementation through analytics
has been fetching humongous rewards for brands and businesses around the world. To
complement this technology further, dark data will make its way in and mark the future of
big data in the coming years or even months.
In simple terms, dark data refers to data from non-digital sources, along with digital data
whose value has been underestimated by analytics experts. These data sets are usually
untapped, unstructured, and untagged, and are also referred to as dusty data. The future
scope of Big Data predicts that such data sets will come into the limelight this year and
revolutionize the technology further.
PRIVACY:
10. CONCLUSION
Big Data has taken the world by storm. It is said that the next decade will be dominated by
Big Data, with companies using the data available to them to learn about their ecosystems
and improve on their shortcomings. Major universities and companies have started investing
in tools that would help them understand and create useful insights from the data they have
access to. One such tool that helps in analysing and processing Big Data is Hadoop. By
training a linear regression model, we develop an algorithm that analyzes the data and
helps organizations make decisions in their respective fields. Dynamic scalability improves
overall performance, enabling efficient decision making for individuals and organizations.
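The report does not include its model code; as a minimal hedged sketch, a simple least-squares linear regression over one feature can be trained in plain Python:

```python
def fit_linear_regression(xs, ys):
    # Closed-form least squares for y = a + b*x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b*mean_x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

# Toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_linear_regression(xs, ys)
print(intercept, slope)  # 1.0 2.0
```

At Hadoop scale, the sums in the closed-form solution are exactly the kind of aggregation MapReduce parallelizes well: each mapper sums its partition and the reducer combines the partial sums.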
The world is changing the way it operates, and Big Data is playing an important role in
that change. Hadoop is a framework that makes an engineer's life easier when working on
large sets of data. There are improvements on all fronts. The future is exciting.