
White Paper

Machine Learning in the Enterprise Hadoop

1. Abstract

Over the past few years, organizations have been storing huge data sets and using this Big Data as a competitive advantage. The challenge arises when a human has to intervene in every scenario of processing and analyzing Big Data; the ideal case is when machines can collaborate with humans and help them in decision-making. This white paper examines the role of machine learning in the most popular Big Data platform, Hadoop.

This is one of many white papers we plan to publish to understand the role of machine learning with Hadoop in enhancing business processes.

2. Key Words

Hadoop, Big Data, Machine Learning

3. Introduction

Organizations today generate huge amounts of data, both unstructured and structured, stored in silos of databases and archive solutions. Managing and analyzing this data with legacy systems is challenging and sometimes impractical. Companies today focus on generating business benefits from their huge data sets, and Hadoop is unparalleled as a high-efficiency Big Data platform for storing, processing, and analyzing data in a cost-effective model.

The challenge lies in acquiring a knowledge base from experts to understand the processing of data sets and the techniques for interpreting them, and in creating value that empowers business decisions. We can either build separate processes around that knowledge base, or code the learning technique into the machine itself, i.e., machine learning.

4. Apache Hadoop & Big Data

Apache Hadoop is an open source distributed software platform, written in Java, for storing and processing data across multiple servers. Hadoop implements a computational paradigm named Map/Reduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.

Below are the Hadoop physical architecture, with its master and slave nodes, and the Hadoop logical architecture, with Map/Reduce. [Figures omitted.]
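The Map/Reduce paradigm described earlier can be sketched in plain Python. This is a toy, single-process simulation of the idea, not the Hadoop API: each mapper emits key-value pairs, the framework groups them by key, and reducers aggregate each group.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework would
    # before handing each group to a reducer (possibly on another node).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the values for each key -- here, a sum.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on hadoop", "hadoop stores big data"]
word_counts = reduce_phase(shuffle(map_phase(docs)))
print(word_counts)  # {'big': 2, 'data': 2, 'on': 1, 'hadoop': 2, 'stores': 1}
```

In real Hadoop, the map and reduce fragments run in parallel across the cluster's nodes, and a failed fragment is simply re-executed elsewhere, which is what makes the model both scalable and fault-tolerant.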

2014 Alethe Labs All rights reserved. Alethe Labs and the Alethe Labs logo are trademarks or registered trademarks of Alethe Labs. All other
trademarks are the property of their respective companies. Information is subject to change without notice.


In addition to Map/Reduce, the following components are useful for developing machine learning in Big Data systems powered by Hadoop.

5. Machine Learning

Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. It gives computers the power to learn from data without being explicitly programmed: computers enabled with machine learning improve their performance by learning from training data and previous outcomes.

Collaboration between man and machine improves machine learning outputs faster and gives more accurate results. Human intuition is another key ingredient in making the machine give better results every time a new data set is used or a new query is asked.
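The idea of a program improving from training data rather than from new code can be illustrated with a minimal sketch: a toy 1-nearest-neighbour classifier in plain Python, chosen purely for illustration and not tied to any Hadoop component.

```python
def nearest_neighbor(train, point):
    # Predict the label of `point` by copying the label of the
    # closest training example (squared Euclidean distance).
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: dist(ex[0], point))
    return label

# Labelled training data: (features, label) pairs.
train = [((1.0, 1.0), "small"), ((9.0, 9.0), "large")]
print(nearest_neighbor(train, (2.0, 2.0)))   # small

# Feeding in more training data refines the decision: a new example
# near (4, 4) changes the prediction for that region -- the program's
# behaviour improved from data, without its code changing.
train.append(((4.0, 4.0), "medium"))
print(nearest_neighbor(train, (3.5, 3.5)))   # medium
```

The same principle, at cluster scale and with far more sophisticated algorithms, is what the Hadoop-based tooling discussed below provides.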

6. Apache Mahout

Apache Mahout is a library of machine learning algorithms implemented on top of Apache Hadoop using the Map/Reduce paradigm. Once Big Data is available on the Hadoop Distributed File System (HDFS), Mahout provides the tools needed to automatically extract meaningful information from those Big Data sets. Currently, Mahout has three main use cases: recommendation mining, clustering, and classification.

a. Recommendation mining: mines user behavior and recommends items users may like.
b. Clustering: takes items and puts them into naturally occurring groups, e.g. text documents or audio files.
c. Classification: learns from the existing categorization and assigns unlabeled items to the best-matching group.

7. Machine Learning & Big Data

Basic analytical tools for business intelligence work in a limited domain of factors, e.g. sum, count, and average, and their capability is also limited to that of the user. Such traditional techniques are not well suited for Big Data, because the data is too large for a human to test every hypothesis, and the analysis should not be limited to traditional calculations. Machine learning, on the other hand, is not limited by human capacity: the more data is fed into it, the better the insights.

8. Conclusion

Apache Hadoop provides a highly scalable and cost-effective model for Big Data. Using machine learning libraries such as Mahout makes Hadoop consistently powerful, delivering better results as the data grows.

9. References

http://hadoop.apache.org
https://mahout.apache.org
http://en.wikipedia.org/wiki/Machine_learning

