
Apache Mahout

The Elephant Driver


Presenters:
Antonio Loureiro Severien
Emmanouil Dimogerontakis
Muhammad Anis uddin Nasir
What is Apache Mahout?
Machine learning and data mining framework for
classification, clustering and recommendation

The goal of the free Apache Mahout machine learning library
is to build scalable machine learning tools for
analysing big data in a distributed manner
Machine Learning
"Machine Learning is programming computers to optimize a
performance criterion using example data or past
experience" - Alpaydin, 2004

Machine learning is concerned with the design and
development of algorithms that allow machines to make
decisions or even evolve behaviors based on collections of
empirical data.
Data Mining
Data mining, also called knowledge discovery in
databases (KDD), is the process of discovering interesting
and useful patterns and relationships in large volumes of
data.
Combines tools from:
statistics
artificial intelligence (such as neural networks and
machine learning)
with database management to analyze large data sets.
-Britannica Online Encyclopedia
Why Machine Learning and Data Mining?

Data, Data, DATA!!!

Tasks too Hard to Program

Customizing software
Available Machine Learning Tools

WEKA
R
KEEL
Others...

Not enough?
Apache Mahout vs others?
Many open source Machine Learning
libraries either:
Lack Community
Lack Documentation and Examples
Lack the Apache License
(business opportunity)
Are research-oriented
(not fit for production yet)
Lack Scalability
Mahout = Elephant Driver?
Why do we need scalability?
Big Data
Applications
Recommendation features
Clustering of information
Classification

Examples: movie recommendations, stock
analysis, fraud detection, ad-sense
recommendation, etc.

How do we do this?
Supported Algorithms
Classification
Clustering
Recommender / Collaborative Filtering
Evolutionary Algorithms
Pattern Mining
Regression
Dimension reduction
Similarity Vectors
Classification
(learn to assign categories to documents)

Fully functional
Logistic Regression (SGD)
Bayesian

Integrated into Mahout development
Random Forests (integrated)
Online Passive Aggressive (integrated)
Boosting (awaiting patch commit)

Open to be worked on
Hidden Markov Models (HMM) - Training is done in Map-Reduce
Support Vector Machines (SVM) (open)
Perceptron and Winnow (open)
Neural Network (open)
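
To make "Logistic Regression (SGD)" concrete, here is a minimal plain-Python sketch of logistic regression trained by stochastic gradient descent. It is an illustration of the technique only and does not use or reproduce Mahout's implementation; the function names, the toy data, and the hyperparameters (epochs, learning_rate, seed) are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(examples, n_features, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights (bias stored last) by SGD on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * (n_features + 1)
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for features, label in data:
            x = list(features) + [1.0]  # constant 1.0 acts as the bias input
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = label - p  # gradient of the log-likelihood w.r.t. the score
            for i, xi in enumerate(x):
                weights[i] += learning_rate * error * xi
    return weights


def predict(weights, features):
    x = list(features) + [1.0]
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical toy data: label 1 when the first feature dominates.
TOY = [((2.0, 0.5), 1), ((1.5, 0.2), 1), ((3.0, 1.0), 1),
       ((0.2, 1.5), 0), ((0.5, 2.5), 0), ((0.1, 1.0), 0)]

weights = train_logistic_sgd(TOY, n_features=2)
print(predict(weights, (2.5, 0.3)), predict(weights, (0.3, 2.0)))
```

Each update touches a single example, which is what makes the method attractive for data too large to process in one batch.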
Clustering
(group items that are topically related)

Fully functional
Expectation Maximization (EM)
Hierarchical Clustering

Integrated into Mahout development
Canopy Clustering
K-Means Clustering
Fuzzy K-Means
Mean Shift Clustering
Dirichlet Process Clustering
Latent Dirichlet Allocation
Spectral Clustering
Minhash Clustering
Top Down Clustering
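
As an illustration of what "K-Means Clustering" in the list above does, the following is a small single-machine sketch in plain Python. It is not Mahout's distributed implementation; the function name, the toy points, and the choice to pass the initial centroids in explicitly (for determinism) are assumptions of this example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of 2-D points.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, assignments = kmeans(POINTS, initial_centroids=[POINTS[0], POINTS[3]])
print(assignments)
print(centroids)
```

The assignment step is independent per point and the update step is a per-cluster average, which is why the algorithm maps naturally onto a map and a reduce phase.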
Recommenders /
Collaborative Filtering
(find items a user might like /
find items that appear together)

Integrated into Mahout development
Non-distributed recommenders ("Taste") (integrated)
Distributed Item-Based Collaborative Filtering (integrated)
Collaborative Filtering using a parallel matrix factorization (integrated)
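
The idea behind item-based collaborative filtering can be sketched in a few lines of plain Python. This is not the "Taste" API or Mahout's distributed job; the ratings, item names, and function names are hypothetical, and the scoring rule (a similarity-weighted sum of the user's own ratings, using cosine similarity between item rating vectors) is one simple choice among several.

```python
import math
from collections import defaultdict

# Hypothetical user -> {item: rating} preferences.
RATINGS = {
    "alice": {"matrix": 5.0, "inception": 5.0, "interstellar": 4.0},
    "bob": {"matrix": 4.0, "inception": 5.0},
    "carol": {"titanic": 5.0, "notebook": 4.0},
    "dave": {"titanic": 4.0, "notebook": 5.0, "matrix": 2.0},
}


def item_vectors(ratings):
    """Invert the data to item -> {user: rating}."""
    vectors = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            vectors[item][user] = value
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank unseen items by similarity-weighted sum of the user's ratings."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        score = sum(cosine(vectors[candidate], vectors[item]) * value
                    for item, value in seen.items())
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend(RATINGS, "bob"))
```

Because item-to-item similarities depend only on co-occurring ratings, they can be precomputed offline, which is the part a distributed implementation parallelises.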
Who is using it?
Opportunities
Developers
Researchers
Small Business
Large Business
Consultancy...
on Mahout
on specific data analysis
Open data
etc...
Apache Mahout
Business?

Ideas?

Suggestions?

Questions?
Where to start?
Wikipedia Bayes Example
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html

What does it do?

Classifies a Wikipedia data dump by country.
Objective: predict which country an unseen article
should be categorized under.
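
The kind of classifier this example relies on can be sketched as a tiny multinomial naive Bayes model in plain Python. This is not the code from the Mahout example; the training sentences, labels, and function names are made up for illustration, and whitespace tokenisation with add-one (Laplace) smoothing is an assumption.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count documents per class and words per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labelled_docs:
        class_doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log posterior for the text."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-ins for country-labelled articles.
TRAINING = [
    ("France", "paris is the capital of france on the seine"),
    ("France", "the louvre museum in paris"),
    ("Japan", "tokyo is the capital of japan"),
    ("Japan", "mount fuji is near tokyo"),
]

model = train_naive_bayes(TRAINING)
print(classify(model, "a museum near the seine"))
print(classify(model, "fuji near tokyo"))
```

Training reduces to counting words per class, so the counts can be accumulated in parallel over chunks of the corpus and then summed.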
References
General
http://www.slideshare.net/sdec2011/sdec2011-mahout-the-what-the-how-and-the-why
http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop
http://www.slideshare.net/aneeshabakharia/lca2011-mahout
Hands-on
http://www.slideshare.net/OReillyOSCON/hands-on-mahout
Who is using it?
https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
Apache Mahout
http://mahout.apache.org/
Quickstart
https://cwiki.apache.org/MAHOUT/quickstart.html
