Mixture Models and Random Number Generation

Austin Kinion
1.1 Mixture Models
#Create multiple component densities to use.
norm1 = function(n) rnorm(n,3,1)
norm2 = function(n) rnorm(n,4,1)
norm3 = function(n) rnorm(n,7,2)
norm4 = function(n) rnorm(n,10,4)
norms = list(norm1, norm2, norm3, norm4)
problty = c(.2, .3, .1, .4)

The first thing I did for this problem was create a set of functions, each of which
draws from one Normal component density. I then compiled these functions into a
list. The last variable, problty, holds the four mixing probabilities, which are used
later to choose the component for each draw.
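Written out, the density implied by the list and the probabilities above is the following mixture. This is only a restatement of the code in formula form; the symbols \pi_k, \mu_k and \sigma_k are notation introduced here, and it relies on the fact that the third argument of rnorm() is a standard deviation, not a variance.

```latex
f(x) = \sum_{k=1}^{4} \pi_k \, \phi(x;\, \mu_k, \sigma_k),
\qquad
\pi = (0.2,\ 0.3,\ 0.1,\ 0.4),
\qquad
(\mu_k, \sigma_k) \in \{(3,1),\ (4,1),\ (7,2),\ (10,4)\}

E[X] = \sum_{k=1}^{4} \pi_k \mu_k = 0.6 + 1.2 + 0.7 + 4.0 = 6.5
```

Here \phi(x; \mu, \sigma) denotes the Normal density with mean \mu and standard deviation \sigma.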
#Function taking sample from our density
#Help from Duncan Temple Lang was given in office hours to write this function.
samp_dens_1 <- function(n, problty, components){
  compon = numeric(n)
  answer = numeric(n)
  for (i in 1:n){
    #Pick one component index according to the mixing probabilities
    y = sample(c(1:length(components)), 1 , replace = TRUE, prob = problty)
    compon[i] = y
    #Draw a single value from the chosen component
    U = components[[y]](n=1)
    answer[i] = U
  }
  return(answer)
}
samp_dens_1(n = 9, problty = problty, components = norms)
## [1] 5.068311 12.329605 6.125661 4.085809 5.187219 3.653186 15.947611
## [8] 4.741648 2.727713

The function returns n sample values from a mixture of Normals. It accepts three
parameters: the number of sample values (n), the probabilities for sampling from
each component (problty), and a list of functions that draw from the component
densities (components).
#Second function taking sample from our density
samp_dens_2<- function(n, problty, components){
  #Draw all n component labels at once
  compon = sample(c(1:length(components)), n , replace = TRUE, prob = problty)
  #Count how many draws each component needs
  tab = table(compon)
  store = list()
  for (i in names(tab)) {
    #One call per component, generating all of its draws together
    store[[i]] = (components[[as.numeric(i)]](n = tab[i]))
  }
  return(unlist(store))
}

samp_dens_2(n=9, problty = problty, components = norms)


##        21        22        23        41        42        43        44
##  3.019743  3.129023  5.148207  8.336102  8.154173 15.705501 12.932853
##        45        46
## 12.843981  8.749921

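The function returns n sample values from a mixture of Normals. It accepts three
parameters: the number of sample values (n), the probabilities for sampling from
each component (problty), and a list of functions that draw from the component
densities (components). The names of the output also show which distribution each
number was generated from: the leading digit is the component index and the
remaining digits give the position within that component's draws. Note that the
values are returned grouped by component rather than in random order.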
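A possible third variant is sketched below, under the assumption that every component is Normal and can therefore be described by a mean and a standard deviation instead of a function. The name samp_dens_vec and the arguments mu and sigma are hypothetical and not part of the original solution. The sketch draws all component labels at once and lets rnorm() take one mean and one sd per draw, so no loop over components is needed and the draws stay in random order.

```r
# Hypothetical vectorized sampler; assumes all components are Normal.
samp_dens_vec <- function(n, problty, mu, sigma) {
  # Draw a component label for each of the n observations.
  k <- sample(seq_along(problty), n, replace = TRUE, prob = problty)
  # rnorm() accepts vectors of means and sds, one entry per draw.
  rnorm(n, mean = mu[k], sd = sigma[k])
}

set.seed(1)
x <- samp_dens_vec(10000, problty = c(.2, .3, .1, .4),
                   mu = c(3, 4, 7, 10), sigma = c(1, 1, 2, 4))

# The mixture mean is the probability-weighted mean of the component means.
theory_mean <- sum(c(.2, .3, .1, .4) * c(3, 4, 7, 10))
c(sample = mean(x), theory = theory_mean)
```

The sample mean should land close to the theoretical value, which gives a quick sanity check that the labels and parameters are matched up correctly.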
#Compare distributions
answers_1 = samp_dens_1(n = 10000, problty = problty, components = norms)
answers_2 = samp_dens_2(n = 10000, problty = problty, components = norms)

#Show quantile plot for the two distributions and make sure it is a straight line
qqplot(answers_1, answers_2, xlab = "First Function", ylab = "Second Function",
       main = "Testing Distribution Between Both Functions")


The plot above checks that the two functions produce similar results. I used a Q-Q
plot to compare the output from the two functions. The quantiles of the two samples
fall close to a straight line when plotted against each other, so the distributions
produced by the two functions are very similar.
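A Q-Q plot is a visual check, so a numeric complement may be useful. The sketch below applies a two-sample Kolmogorov-Smirnov test with ks.test() from base R. To keep it runnable on its own it uses a compact stand-in sampler, draw_mix, which is a hypothetical helper for the same four-component mixture; with the functions above one would pass answers_1 and answers_2 instead.

```r
# Stand-in sampler for the same four-component Normal mixture (hypothetical helper).
draw_mix <- function(n) {
  k <- sample(1:4, n, replace = TRUE, prob = c(.2, .3, .1, .4))
  rnorm(n, mean = c(3, 4, 7, 10)[k], sd = c(1, 1, 2, 4)[k])
}

set.seed(42)
a <- draw_mix(10000)
b <- draw_mix(10000)

# Two-sample Kolmogorov-Smirnov test: D is the largest vertical gap
# between the two empirical distribution functions.
ks <- ks.test(a, b)
ks$statistic
ks$p.value
```

A small D and a large p-value are consistent with both samples coming from the same distribution; they do not prove it.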

samp_dens_1_run<- function(n, problty=c(.2,.3,.1,.4), components = norms){
  compon = numeric(n)
  answer = numeric(n)
  for (i in 1:n){
    y = sample(c(1:length(components)), 1 , replace = TRUE, prob = problty)
    compon[i] = y
    U = components[[y]](n=1)
    answer[i] = U
  }
  return(answer)
}
samp_dens_1_run(10, problty = c(.2,.3,.1,.4), components = norms)
## [1] 2.842619 8.351622 3.842018 12.676749 12.135348 10.280814 7.434478
## [8] 3.098023 4.934412 3.443973

The function above is function 1 with default values for problty and components, so
that it can be run repeatedly at different sample sizes by passing only n.
samp_dens_2_run <- function(n, prob = c(.2,.3,.1,.4), components = norms){
  #Use the prob argument here, not the global variable problty
  compon = sample(c(1:length(components)), n , replace = TRUE, prob = prob)
  tab = table(compon)
  storage = list()
  for (i in names(tab)) {
    storage[[i]] = components[[as.numeric(i)]](n = tab[i])
  }
  return(storage)
}

samp_dens_2_run(100, prob = c(.2,.3,.1,.4), components = norms)


## $`1`
## [1] 1.789947 4.453420 4.364661 4.625647 4.058852 3.393313 2.786988
## [8] 1.714030 5.967371 4.684864 2.269156 1.926005 2.406693 3.604401
## [15] 2.430342 3.311085 3.018073 1.879801 4.457779 2.975392 3.650793
## [22] 4.971389 2.963034 3.106816
##
## $`2`
## [1] 4.809709 2.478025 5.240262 4.002886 4.648964 3.976636 4.222864
## [8] 3.240145 4.744460 1.829389 4.389381 3.411834 3.946715 2.531828
## [15] 3.064393 4.147441 4.818504 5.142197 5.017655 3.482268 3.039907
## [22] 4.411146 3.255804 3.678140 4.258327 5.141063 3.265941
##
## $`3`
## [1] 9.881238 5.998197 5.432925 4.857151 9.221326 4.018011 8.949734 6.553921
##
## $`4`
## [1] 14.452658 4.478555 2.508859 13.464259 6.777307 13.905548 18.477334
## [8] 7.838756 12.674486 8.550978 12.558988 13.412830 14.207680 6.613511
## [15] 4.166077 5.547381 7.695239 9.123781 12.808309 8.550782 9.322842
## [22] 8.652894 9.681145 12.171426 5.438197 12.193603 9.124177 2.836161
## [29] 11.171202 15.108634 9.407813 4.046711 16.366509 15.279492 7.147838
## [36] 8.888420 15.536749 14.273155 12.396337 10.559881 7.626454

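The function above is function 2 with default values for the probabilities and the
components, so that it can be run repeatedly at different sample sizes by passing
only n. It returns the draws as a list with one element per component.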

sampsize = c(1, 5, 25, 100, 1000, 10000)

#Elapsed time (third element of system.time) for one run at each sample size
time_1 = sapply(sampsize, function(x){system.time(samp_dens_1_run(n=x))})[3,]
time_2 = sapply(sampsize, function(x){system.time(samp_dens_2_run(n=x))})[3,]

#Repeat the timings 60 times; rows are replicates, columns are sample sizes
time_elapsed_1 = t(replicate(60, sapply(sampsize,
  function(x){system.time(samp_dens_1_run(n=x))})[3,]))
time_elapsed_2 = t(replicate(60, sapply(sampsize,
  function(x){system.time(samp_dens_2_run(n=x))})[3,]))

#system.time() reports seconds, so the axis is labelled in seconds.
#The misspelled argument yim = c(0, 15) had no effect and is dropped.
boxplot(time_elapsed_1, main = "Time elapsed for function 1 sample sizes",
        xlab = "Sample Size", ylab = "Time (in seconds)", xaxt = 'n')
axis(side = 1, at = 1:length(sampsize), labels = sampsize)

boxplot(time_elapsed_2, main = "Time elapsed for function 2 sample sizes",
        xlab = "Sample Size", ylab = "Time (in seconds)", xaxt = 'n')
axis(side = 1, at = 1:length(sampsize), labels = sampsize)


In comparing the two boxplots, it is clear that the second function runs much more
quickly than the first, especially for larger sample sizes. Also, the larger the
sample size, the longer the function takes to complete. After experimenting with
different parameters in the component densities, I found no obvious effect on the
time elapsed. For larger numbers of components in the mixture (tried for k=4,5,6),
the time elapsed increases.
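Single system.time() readings are noisy, so it can help to condense the replicated timings into medians before comparing the two approaches. The sketch below does this for two toy stand-ins, loop_draw and batch_draw; both are hypothetical helpers written for this illustration, not the functions timed above, and the sample sizes and the five replicates are chosen only to keep the run short.

```r
# Toy stand-ins for the two approaches (hypothetical; same four-component mixture).
mu <- c(3, 4, 7, 10); sigma <- c(1, 1, 2, 4); w <- c(.2, .3, .1, .4)

# One draw per loop iteration, as in the first function.
loop_draw <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    k <- sample(4, 1, prob = w)
    out[i] <- rnorm(1, mu[k], sigma[k])
  }
  out
}

# All labels and all values drawn in two vectorized calls.
batch_draw <- function(n) {
  k <- sample(4, n, replace = TRUE, prob = w)
  rnorm(n, mu[k], sigma[k])
}

sizes <- c(10, 100, 1000)
# Elapsed seconds for one call of f at sample size n.
elapsed <- function(f, n) system.time(f(n))[["elapsed"]]

# Rows are replicates, columns are sample sizes.
t_loop  <- t(replicate(5, sapply(sizes, function(n) elapsed(loop_draw, n))))
t_batch <- t(replicate(5, sapply(sizes, function(n) elapsed(batch_draw, n))))

# Median elapsed time per sample size for each approach.
rbind(loop = apply(t_loop, 2, median), batch = apply(t_batch, 2, median))
```

The actual numbers depend on the machine, so only the relative ordering and the growth with sample size are meaningful.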
2.1 Random Number Generation
a. Triangular Distribution
#Compute the density of the triangular distribution

dtriang <- function(x,a,b,c){
  store = numeric(length(x))
  for (i in 1:length(x)){
    if (a < x[i] & x[i] < c)
      store[i] = 2*(x[i]-a)/((b-a)*(c-a))
    if (c <= x[i] & x[i] < b)
      store[i] = 2*(b-x[i])/((b-a)*(b-c))
    if (x[i] < a | x[i] > b)
      store[i] = 0
  }
  return (store)
}

#call dtriang() with the specified values
dtriang(x=c(3,4,8,2),a=1,b=6,c=3)
## [1] 0.4000000 0.2666667 0.0000000 0.2000000

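The function computes the density of the triangular distribution at one or more
values of x, for values of a, b and c specified by the caller. The output agrees with
values computed by hand: for example, at the mode x = 3 the density is
2/(b-a) = 0.4, and at x = 8, which lies outside [1, 6], it is 0.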
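One further check, sketched here, is that the density integrates to 1. The helper dtri_vec is a hypothetical vectorized rewrite of the same formulas, used so that the block runs on its own with integrate() from base R. The integral is split at the mode so that each piece is a straight line.

```r
# Vectorized triangular density on [a, b] with mode c (hypothetical helper).
dtri_vec <- function(x, a, b, c) {
  ifelse(x < a | x > b, 0,
         ifelse(x < c,
                2 * (x - a) / ((b - a) * (c - a)),
                2 * (b - x) / ((b - a) * (b - c))))
}

# Fix the parameters used in the example call: a = 1, b = 6, c = 3.
f <- function(x) dtri_vec(x, 1, 6, 3)

# Integrate each linear piece separately so the kink at the mode is an endpoint.
left  <- integrate(f, 1, 3)$value
right <- integrate(f, 3, 6)$value
c(left = left, right = right, total = left + right)
```

The left piece should equal (c-a)/(b-a), which is the probability mass below the mode.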
#Make function ptriang that takes one or more values of the RV and computes the triangular CDF.

ptriang <- function(x, a, b, c){
  ptri = numeric(length(x))
  for (i in 1:length(x)) {
    if (x[i] < a)
      ptri[i] = 0
    else if (x[i] >= a & x[i] < c)
      ptri[i] = ((x[i]-a)^2)/((b-a)*(c-a))
    else if (x[i] >= c & x[i] <= b)
      ptri[i] = 1-(((x[i]-b)^2)/((b-a)*(b-c)))
    else if (x[i] > b)
      ptri[i] = 1
  }
  return(ptri)
}

ptriang(x = c(.5, 2, 4, 9), a = 1, b = 6, c = 3)
## [1] 0.0000000 0.1000000 0.7333333 1.0000000

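The function computes, for a triangular distribution, the probability that the random
variable is less than or equal to each value of x. The output behaves as a CDF
should: it is 0 below a, 1 above b, and increasing in between, and the values at
x = 2 and x = 4 match the formulas (1/10 and 11/15).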
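For reference, the density and CDF implemented by dtriang and ptriang can be written as follows, with a the lower limit, b the upper limit and c the mode. This is a transcription of the formulas in the code above, not an independent derivation.

```latex
f(x) =
\begin{cases}
\dfrac{2(x-a)}{(b-a)(c-a)}, & a \le x < c \\[2ex]
\dfrac{2(b-x)}{(b-a)(b-c)}, & c \le x \le b \\[2ex]
0, & \text{otherwise}
\end{cases}
\qquad
F(x) =
\begin{cases}
0, & x < a \\[1ex]
\dfrac{(x-a)^2}{(b-a)(c-a)}, & a \le x < c \\[2ex]
1 - \dfrac{(b-x)^2}{(b-a)(b-c)}, & c \le x \le b \\[2ex]
1, & x > b
\end{cases}
```

With a = 1, b = 6, c = 3 this gives F(2) = 1/10 and F(4) = 1 - 4/15 = 11/15, which matches the output above.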
#Sample from the distribution using the inverse CDF of the triangular distribution.

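rtriang <- function(n, a, b, c) {
  unif = runif(n)
  #Named samp so that it does not mask the base function sample()
  samp = numeric(n)
  for (i in 1:n){
    #Below F(c) = (c-a)/(b-a): invert the rising part of the CDF.
    #Otherwise invert the falling part; else also covers unif[i] == (c-a)/(b-a).
    if (unif[i] < ((c-a)/(b-a))) {
      samp[i] = a + sqrt((b-a)*(c-a)*unif[i])
    } else {
      samp[i] = b - sqrt((b-a)*(b-c)*(1-unif[i]))
    }
  }
  return (samp)
}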

rtriang(n=10, a=1, b=6, c=3)
## [1] 3.705334 3.824399 4.339524 4.124677 2.729264 3.995893 3.182478
## [8] 3.331499 4.019541 4.153900
hist(rtriang(n=100000, a=1, b=6, c=3), xlab="Triangular distribution",
main="Histogram of rtriang function")


From the histogram above, we can see that the function generates a triangular
density. The samples were drawn by applying the inverse of the triangular CDF
(the one computed by ptriang) to uniform random numbers.
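The histogram can be backed up with two numeric checks: the sample mean should be close to (a+b+c)/3, and the share of draws below the mode should be close to F(c) = (c-a)/(b-a). The sketch below uses rtri_vec, a hypothetical vectorized version of the same inverse-CDF idea, so that it runs without the function above.

```r
# Vectorized inverse-CDF sampler for the triangular distribution (hypothetical helper).
rtri_vec <- function(n, a, b, c) {
  u <- runif(n)
  f_c <- (c - a) / (b - a)   # F(c): probability mass below the mode
  ifelse(u < f_c,
         a + sqrt(u * (b - a) * (c - a)),        # invert the rising part
         b - sqrt((1 - u) * (b - a) * (b - c)))  # invert the falling part
}

set.seed(7)
x <- rtri_vec(100000, a = 1, b = 6, c = 3)

# Compare simulated and theoretical summaries.
c(sample_mean = mean(x), theory_mean = (1 + 6 + 3) / 3,
  share_below_mode = mean(x < 3), theory_share = (3 - 1) / (6 - 1))
```

Both pairs should agree to about two decimal places at this sample size.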
2.2 b. Acceptance/Rejection Sampling
#Taking a look at the function
source(url("http://eeyore.ucdavis.edu/stat141/homework/nodeDensity.R"))

nd=nodeDensity


plt=persp(outer(0:100,0:100.,nd), phi=30, theta=30, main ="Target Density", xlab="X")


plt
## [,1] [,2] [,3] [,4]
## [1,] 1.732051e+00 -0.5000000 0.8660254 -0.8660254
## [2,] 1.000000e+00 0.8660254 -1.5000000 1.5000000
## [3,] -1.537228e-17 0.4348286 0.2510484 -0.2510484
## [4,] -1.366025e+00 -1.0490381 -2.9150635 3.9150635
contour(outer(0:100,0:100.,nd))


The two plots above clearly show the target density.
#Finding max of function
max(outer(0:100,0:100.,nd))
## [1] 3.983295

Found the max to be ~3.98 (evaluated on the integer grid 0:100 by 0:100).


#c will scale the function g(x,y) to majorize the function.
c=(max(outer(0:100,0:100.,nd)))/(1/100^2)
c
## [1] 39832.95

c scales the proposal density g(x,y) = 1/100^2 so that c*g(x,y) reaches the max of
our original density and therefore lies on or above it at every grid point.
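In symbols, with f standing for nodeDensity and g for the uniform proposal density, the scaling works as follows. The formulas assume that the maximum found on the integer grid is close to the true maximum of f over the square; that assumption is not verified here.

```latex
g(x,y) = \frac{1}{100^2} \quad \text{on } [0,100]^2,
\qquad
c = \frac{\max f}{g} = 3.983295 \times 100^2 = 39832.95,
\qquad
c\, g(x,y) \ge f(x,y)

P(\text{accept})
= \iint g(x,y)\, \frac{f(x,y)}{c\, g(x,y)}\, dx\, dy
= \frac{1}{c} \iint f(x,y)\, dx\, dy
```

The acceptance probability is therefore the volume under f divided by the volume of the box of height max f over the square.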


#g(x,y)= 1/100^2 - Help from Nick Ulle to determine this and write the function.
dprop = function(x, y){

#bounds for function
if(x >= 0 & x <= 100 & y >= 0 & y <= 100)

return(1/100^2)

else
return(0)
}

The proposal density is uniform on the square [0,100] x [0,100]; scaled by c, it gives
the flat plane that lies above the target function.
rprop = function(n) runif(n, 0, 100)
library(MASS)

# Make it a loop and wrap it in a function.
rnormtrunc = function(n) {
  samp = matrix(nrow = n, ncol = 2)
  accepted = 0
  i = 0
  while(accepted < n) {
    # Step 1: Sample x,y from the uniform proposal.
    x = rprop(1)
    y = rprop(1)
    # Step 2: Sample z uniformly below the envelope c * g(x,y).
    z = runif(1, 0, c * dprop(x,y))
    # Step 3: Accept/reject. Accept if z falls under the target density.
    if (z < nodeDensity(x,y)) {
      accepted = accepted + 1
      samp[accepted, 1] = x
      samp[accepted, 2] = y
    }
    i = i + 1
  }
  # Print the acceptance rate (accepted draws / total proposals).
  print(paste(accepted/i))
  return(samp)
}

The loop above does just what it says in the comments.


#Declare sample size in rnormtrunc()
res=rnormtrunc(10000)
## [1] "0.339305103148751"
graph= kde2d(res[,1], res[,2])

#plot the sampled function to show it is similar to target.


persp(graph)


We can see that, after plotting the original function and finding its max, I was
able to create a function that samples from the original distribution and mimics
its shape. The two plots are similar, which means a success! The efficiency is
around 34%, meaning that about 34% of the proposed points are accepted, i.e. fall
under the surface of the original density.
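The link between the efficiency and the shape of the target can be illustrated with a self-contained sketch. Because nodeDensity has to be downloaded, the block uses a hypothetical stand-in target, f_target, a Gaussian bump on the same square; it is not the density from this problem, so the acceptance rate it produces differs from the 34% above. The proposals are drawn in one vectorized batch rather than in a while loop.

```r
# Hypothetical unnormalized target on [0, 100]^2 (NOT nodeDensity): a Gaussian bump.
f_target <- function(x, y) exp(-((x - 50)^2 + (y - 50)^2) / (2 * 15^2))
f_max <- 1                       # known maximum of f_target, attained at (50, 50)

set.seed(123)
m <- 50000                       # number of proposals
x <- runif(m, 0, 100)            # uniform proposal over the square
y <- runif(m, 0, 100)
z <- runif(m, 0, f_max)          # uniform height under the flat envelope
keep <- z < f_target(x, y)       # accept points that fall under the target surface
samp <- cbind(x = x[keep], y = y[keep])

# Observed acceptance rate vs. (volume under f) / (volume of the envelope box),
# with the volume under f approximated by a midpoint sum on a unit grid.
g <- seq(0.5, 99.5, by = 1)
vol_f <- sum(outer(g, g, f_target))
c(observed = mean(keep), expected = vol_f / (100 * 100 * f_max))
```

The two numbers should be close, which shows that the efficiency is a property of how much of the envelope box the target fills, not of the sampler's code.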
