
Apache Mahout: Scalable machine learning and data mining

Stochastic Singular Value Decomposition


The Stochastic SVD method in Mahout produces reduced-rank Singular Value Decomposition output in its strict mathematical definition: $A \approx U \Sigma V^\top$.
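Spelled out with dimensions (standard reduced-rank SVD notation; this expansion is not quoted from the Mahout docs), the rank-$k$ decomposition reads:

\[
A_{m \times n} \approx U_{m \times k}\, \Sigma_{k \times k}\, V_{n \times k}^{\top},
\qquad U^{\top} U = I_k,\quad V^{\top} V = I_k,\quad
\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_k),\ \sigma_1 \ge \dots \ge \sigma_k \ge 0.
\]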

The benefits over other methods are:


reduced flops required compared to Krylov subspace methods
in the map-reduce world, a fixed number of MR iterations is required regardless of the rank requested
the precision/speed balance can be tweaked with options

A is a Distributed Row Matrix whose rows may be identified by any Writable (such as a document path). As such, it works directly on the output of seq2sparse.
As of 0.7 trunk, SSVD includes a PCA and dimensionality-reduction workflow (EXPERIMENTAL! Feedback on performance and other PCA-related issues/blogs is greatly appreciated).

Map-Reduce characteristics:
SSVD uses at most 3 sequential MR steps (map-only + map-reduce + 2 optional parallel map-reduce jobs) to produce a reduced-rank approximation of the U, V and S matrices. Additionally, two more map-reduce steps are added for each power iteration step, if requested.

Potential drawbacks:
potentially less precise (but adding even one power iteration seems to fix that quite a bit).


Documentation
Overview and Usage
Note: please use 0.6 or later! For the PCA workflow, please use 0.7 or later.


Publications
Nathan Halko's dissertation, "Randomized methods for computing low-rank approximations of matrices", contains a comprehensive definition of the parallelization strategy taken in the Mahout SSVD implementation, as well as precision/scalability benchmarks, especially w.r.t. the Mahout Lanczos implementation on a typical corpus data set.
The Halko, Martinsson, Tropp paper discusses the family of random-projection-based algorithms and contains theoretical error estimates.

R simulation

Non-parallel SSVD simulation in R with power iterations and PCA options. Note that this implementation is not the most optimal sequential solver; it is intended for demonstration purposes only.

To simulate a meaningful input for it, try this R code:

tests.R
n <- 1000    # columns of the simulated input
m <- 2000    # rows of the simulated input
k <- 10      # decomposition rank
qi <- 1      # number of power iterations

# simulated input: an exactly rank-k matrix with known singular values k, k-1, ..., 1
svalsim <- diag(k:1)
usim <- qr.Q(qr(matrix(rnorm(m * k, mean = 3), nrow = m, ncol = k)))
vsim <- qr.Q(qr(matrix(rnorm(n * k, mean = 5), nrow = n, ncol = k)))
x <- usim %*% svalsim %*% t(vsim)

Then compare ssvd.svd(x) and stock svd(x) performance for the same rank k and notice the difference in running time. Also play with power iterations (qIter) and compare the accuracies of standard svd and SSVD.
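As a concrete starting point, the comparison might look like the sketch below; the call shape of ssvd.svd() (rank and qIter arguments) is only an assumption based on the options mentioned above, so adjust it to the actual signature in the R simulation script.

# Hedged comparison sketch; ssvd.svd() comes from the R simulation above,
# and its exact arguments/return value are assumptions here.
system.time(s.stock <- svd(x, nu = k, nv = k))      # stock R SVD
system.time(s.ssvd <- ssvd.svd(x, k, qIter = qi))   # SSVD simulation (assumed call shape)
round(s.stock$d[1:k], 3)                            # should be close to k, k-1, ..., 1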
Note: the numerical stability of the R algorithms may differ from that of Mahout's distributed version. We haven't studied the accuracy of the R simulation; for a study of the accuracy of Mahout's version, please refer to Nathan's dissertation referenced above.
Modified SSVD Algorithm

Given an $m \times n$ matrix $A$, a target rank $k \in \mathbb{N}_1$, an oversampling parameter $p \in \mathbb{N}_1$, and the number of additional power iterations $q \in \mathbb{N}_0$, this procedure computes an $m \times (k+p)$ SVD $A \approx U \Sigma V^\top$:

1. Create a seed for a random $n \times (k+p)$ matrix $\Omega$. The seed defines matrix $\Omega$ using Gaussian unit vectors per one of the suggestions in [Halko, Martinsson, Tropp].

2. $Y = A \Omega$, $Y \in \mathbb{R}^{m \times (k+p)}$.

3. Column-orthonormalize $Y \rightarrow Q$ by computing the thin decomposition $Y = QR$. Also, $Q \in \mathbb{R}^{m \times (k+p)}$, $R \in \mathbb{R}^{(k+p) \times (k+p)}$; I denote this as $Q = \mathrm{qr}(Y).Q$.

4. $B_0 = Q^\top A$: $B \in \mathbb{R}^{(k+p) \times n}$.

5. If $q > 0$, repeat for $i = 1..q$: $B_i^\top = A^\top \, \mathrm{qr}(A B_{i-1}^\top).Q$ (power iterations step).

6. Compute the eigensolution of the small Hermitian matrix $B_q B_q^\top = \hat{U} \Lambda \hat{U}^\top$, $B_q B_q^\top \in \mathbb{R}^{(k+p) \times (k+p)}$.

7. Singular values: $\Sigma = \Lambda^{0.5}$, or, in other words, $s_i = \sqrt{\lambda_i}$.

8. If needed, compute $U = Q \hat{U}$.

9. If needed, compute $V = B_q^\top \hat{U} \Sigma^{-1}$. Another way is $V = A^\top U \Sigma^{-1}$.
Copyright 2014 The Apache Software Foundation. Licensed under the Apache License, Version 2.0.
Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
